Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unwanted row number column appears in csv / excel while scheduling in email. #22981

Open
rohitpawar2811 opened this issue Feb 3, 2023 · 11 comments
Assignees
Labels
alert-reports Namespace | Anything related to the Alert & Reports feature #bug Bug report

Comments

@rohitpawar2811
Copy link

rohitpawar2811 commented Feb 3, 2023

Hello

I am scheduling a chart(in table form) in email as csv form , I am getting an extra column which has a count from 0 to n-row .
Which I should not want.(And I think no one wants in CSV because we get numbering in csv opening platforms)

Like :

Screenshot from 2023-02-03 19-10-58

Findings :

  • When I directly download csv from chart I didn't getting that extra one column
  • And one scenario : In the case of Email it avoids the extra column add in csv only if the chart is not in table form.
    Exactly I would say when the chart is in the form of bar/any other visualization except table then it will not add an extra column in csv while sending in email.

Using : superset-2.0.0-dev

My-Question :

  • It is intentionally did ?
  • Or any thing I am forgetting to set by which i can avoid this extra-column(with numbering)
  • Or this thing is resolved in the latest releases?

@michael-s-molina @codyml

@rohitpawar2811 rohitpawar2811 added the #bug Bug report label Feb 3, 2023
@rusackas rusackas added the alert-reports Namespace | Anything related to the Alert & Reports feature label Feb 3, 2023
@rohitpawar2811
Copy link
Author

rohitpawar2811 commented Feb 6, 2023

def get_data(self, df: pd.DataFrame) -> Union[str, List[Dict[str, Any]]]:
        if self._query_context.result_format == ChartDataResultFormat.CSV:
            include_index = not isinstance(df.index, pd.RangeIndex)
            columns = list(df.columns)
            verbose_map = self._qc_datasource.data.get("verbose_map", {})
            if verbose_map:
                df.columns = [verbose_map.get(column, column) for column in columns]
            result = csv.df_to_escaped_csv(
                df, index=include_index, **config["CSV_EXPORT"]
            )
            return result or ""

        return df.to_dict(orient="records")

I have tracked the issue but I am here stuck that both in the case of
csv -direct-download
csv -email
both execute this function and genrate csv which has no index-colmn because Include_index is always false here
but some-how in case of csv-email from any else method index column is created again

But got nothing , any suggestion realted to this bug

@rohitpawar2811
Copy link
Author

rohitpawar2811 commented Feb 7, 2023

@eschutho
This bug is created because of PostProcessing logic while downloading csv througth email-scheduling.

Can we stop PostProcessing from setting or from any other thing without modifing inside the code.?
@rusackas
I have to ask that is it really bug,Can I make pull request for removing index column in email-csv also
By manipulating postProcessing logic.

#Post-process the data so it matches the data presented in the chart. # This is needed for sending reports based on text charts that do the # post-processing of data, eg, the pivot table. if result_type == ChartDataResultType.POST_PROCESSED: result = apply_post_process(result, form_data,datasource)

@rohitpawar2811
Copy link
Author

rohitpawar2811 commented Feb 7, 2023

Proposed Changes : It will facilitate a user user to chose that it wants index in csv or not while emailing through
CSV_INDEX={"index":False} //In superset_config.py
Changes did In superset/charts/post_processing.py
1.

from superset import (
    app
)

config = app.config

2.
(From this to ) processed_df.to_csv(buf)-> processed_df.to_csv(buf, **config["CSV_INDEX"])

def apply_post_process(
    result: Dict[Any, Any],
    form_data: Optional[Dict[str, Any]] = None,
    datasource: Optional["BaseDatasource"] = None,
) -> Dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(query["data"])
        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(StringIO(query["data"]))
        else:
            raise Exception(f"Result format {query['result_format']} not supported")

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)

        # Flatten hierarchical columns/index since they are represented as
        # `Tuple[str]`. Otherwise encoding to JSON later will fail because
        # maps cannot have tuples as their keys in JSON.
        processed_df.columns = [
            " ".join(str(name) for name in column).strip()
            if isinstance(column, tuple)
            else column
            for column in processed_df.columns
        ]
        processed_df.index = [
            " ".join(str(name) for name in index).strip()
            if isinstance(index, tuple)
            else index
            for index in processed_df.index
        ]

        if query["result_format"] == ChartDataResultFormat.JSON:
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
           **#Changed Line**
            processed_df.to_csv(buf, **config["CSV_INDEX"])
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result 

@eschutho
Copy link
Member

eschutho commented Feb 9, 2023

Thank you @Rohit-pawar902 for the issue and also the fix! We can implement what you suggested here, but you're also welcome to put up your own PR if you like.

rohitpawar2811 added a commit to rohitpawar2811/supersetMaster that referenced this issue Feb 22, 2023
…l which is caused by post_processing.py and now it is Fixed (\apache#22981)
@rohitpawar2811
Copy link
Author

PR : #23155

@Antonio-RiveroMartnez
Copy link
Member

Antonio-RiveroMartnez commented Feb 22, 2023

Hey @Rohit-pawar902 ! Thanks for addressing the issue 👏 . I do have a couple of thoughts regarding this issue and how to address it though:

  1. There's a bug reported already for the extra column in the CSV: Not possible to email a raw csv via "Alerts and Reports" #21568 , so, not sure if we want to treat this as an opt in/out config? I mean, if it's treated as a bug, you shouldn't be able to opt out from the fix.
  2. If we decided to go with the opt in/out, should we instead use a Feature Flag instead? Since there's no extra configs in it just a True/False. Or do we see more configs being added here?

@rohitpawar2811
Copy link
Author

rohitpawar2811 commented Feb 24, 2023

@Antonio-RiveroMartnez
Yep I am convinced with your 1 St statement that it's bug and we don't have to give opt out thing.

Actually previously I am not sure that it's really a bug or it is intentionally written like that , keeping this in mind I gave the opt in/out , so that someone wants the raw CSV or with index they can do that

Now I would make changes keeping you issue in mind and treating as bug.

@rohitpawar2811
Copy link
Author

@Antonio-RiveroMartnez
Can you help me that is I have to create new PR ? for changes(in which option in/out not there directly we would make index= false ) ,Or I can change the current PR

@Antonio-RiveroMartnez
Copy link
Member

Antonio-RiveroMartnez commented Feb 24, 2023

Hey @Rohit-pawar902 ! Yes, you can make your changes in the same PR, there's no need for a new one, make sure tho, you update whatever feels necessary in the PR Template to fit your new scope, i.e, title, description, etc.

@rusackas
Copy link
Member

@rohitpawar2811 / @Antonio-RiveroMartnez This one seems to have gotten away from it. Is there any work left to be done on the linked PR that would close this issue?

@rusackas
Copy link
Member

rusackas commented Jul 9, 2024

It looks like the PR by @SkinnyPigeon is close to what we want here, but might need a volunteer to move the choice for including/excluding row indices from a config variable to a report modal config option.

@sfirke sfirke changed the title Redundant column problem comes in csv while scheduling in email. Unwanted row number column appears in csv / excel while scheduling in email. Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alert-reports Namespace | Anything related to the Alert & Reports feature #bug Bug report
Projects
None yet
Development

No branches or pull requests

4 participants