
fix: set columns numeric datatypes when exporting to excel #27229

Open

wants to merge 6 commits into base: master

Conversation

squalou

@squalou squalou commented Feb 23, 2024

SUMMARY

When exporting to Excel, datatypes are currently not specified, so decimal numbers end up stored as strings, which may cause issues when opening the file.

Depending on the locale, the issue may not even be visible, since Excel manages to convert things (that's the case in the us/en locale). With another locale (say fr), numbers output with "." as the decimal separator and stored as strings won't be usable as numbers in Excel.

The idea here is to reuse the existing "datatype guessing" method and use it to convert the DataFrame column types when required before exporting.
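A minimal sketch of the idea (the helper name coerce_numeric_columns is hypothetical; GenericDataType is Superset's generic column type enum from superset.utils.core):

# A minimal sketch, not the PR's exact code: coerce columns flagged as
# NUMERIC so Excel stores them as numbers instead of strings.
import pandas as pd

from superset.utils.core import GenericDataType


def coerce_numeric_columns(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
    return df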

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

not applicable

TESTING INSTRUCTIONS

  • Create a Table chart with numerical (integer and decimal) values in it. Ideally add some strings and dates.
  • Export to Excel.
  • You can then unzip the .xlsx file and open sheet1.xml in a text editor to check the export.
  • Numbers appear directly in the XML cell.
  • Strings are not visible directly; instead, an <s> markup is used, containing an id referencing the string.

Opening the sheet1.xml file in a text editor is the safest way to check the issue, because of the automatic conversions software like Excel, LibreOffice Calc, or others perform when opening files, which may hide the issue. One way to dump the raw XML is sketched below.
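(For convenience, a minimal sketch for dumping the raw sheet XML; the file name export.xlsx is an example. An .xlsx file is a zip archive with worksheets stored under xl/worksheets/.)

# Dump the raw sheet XML of an exported file without a spreadsheet app.
import zipfile

with zipfile.ZipFile("export.xlsx") as xlsx:
    print(xlsx.read("xl/worksheets/sheet1.xml").decode("utf-8"))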

ADDITIONAL INFORMATION

  • Has associated issue: Fixes Datatype not correctly exported to xlsx - with patch suggestion #23375
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley added the review:checkpoint Last PR reviewed during the daily review standup label Feb 23, 2024
@rusackas
Member

Approving CI run 🤞


codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.71%. Comparing base (ce0b70c) to head (b641e13).
Report is 30 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27229      +/-   ##
==========================================
+ Coverage   67.34%   69.71%   +2.37%     
==========================================
  Files        1909     1909              
  Lines       74622    74633      +11     
  Branches     8324     8324              
==========================================
+ Hits        50257    52034    +1777     
+ Misses      22312    20546    -1766     
  Partials     2053     2053              
Flag Coverage Δ
hive 53.76% <81.25%> (?)
mysql 78.00% <81.25%> (+0.02%) ⬆️
postgres 78.10% <81.25%> (+<0.01%) ⬆️
presto 53.71% <81.25%> (?)
python 83.15% <100.00%> (+4.91%) ⬆️
sqlite 77.62% <81.25%> (+<0.01%) ⬆️
unit 56.55% <81.25%> (?)

Flags with carried forward coverage won't be shown.


@rusackas rusackas changed the title #23375 set columns numeric datatypes when exporting to excel Fix(23375): set columns numeric datatypes when exporting to excel Feb 23, 2024
@rusackas
Member

Looks like you need to run the pre-commit hook for formatting/linting.

@michael-s-molina michael-s-molina changed the title Fix(23375): set columns numeric datatypes when exporting to excel fix: set columns numeric datatypes when exporting to excel Feb 23, 2024
@michael-s-molina
Member

Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?

@squalou
Author

squalou commented Feb 24, 2024

> Looks like you need to run the pre-commit hook for formatting/linting.

Yeah, sorry about that. It should be fixed now; my IDE for this project is not set to auto-format, I totally forgot.

> Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?

I'd love to! Really. But I admit I'm a bit lost in the Superset architecture and test framework, and I'm not a pro Python dev to begin with, which does not help.

I've tried to guess where "export csv" might be tested so I could extend from there, but did not find anything. Got myself lost in viz_tests.py and didn't go any further.

Any help / pointer on where to begin would be appreciated.

@michael-s-molina
Member

> Any help / pointer on where to begin would be appreciated.

You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.

@rusackas rusackas removed the review:checkpoint Last PR reviewed during the daily review standup label Feb 26, 2024
@squalou
Author

squalou commented Feb 27, 2024

> Any help / pointer on where to begin would be appreciated.

> You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.

Thank you! Will have a look ASAP (on vacation break this week :D).

@squalou
Author

squalou commented Mar 5, 2024

[EDIT] Simply writing things down here helped me realize I should break the tested function in two, and only test the changes actually introduced.

I submitted "something".

Now... these tests do not verify that the Excel export itself really works (it does, see further down), and that's still a shame, maybe?

I'm leaving the text below to document the "excel export" testability issues I encountered.

(original post below the line)


Hi,

I managed to run the tests locally and I'm trying to add one... but I have (at least) two issues.

You'll see from these that... I'm full of good will but a desperate Python newbie :)

First: BaseDatasource

Following the test example, I instantiate this (imports added here for context; module paths assumed from the tracebacks below):

from superset.common.chart_data import ChartDataResultFormat, ChartDataResultType
from superset.common.query_context import QueryContext
from superset.common.query_context_processor import QueryContextProcessor
from superset.connectors.sqla.models import BaseDatasource

query_context_processor = QueryContextProcessor(
    QueryContext(
        datasource=BaseDatasource(),
        queries=[],
        result_type=ChartDataResultType.COLUMNS,
        form_data={},
        slice_=None,
        result_format=ChartDataResultFormat.XLSX,
        cache_values={},
    )
)

But then in query_context_processor.py, in the get_data() method, there's a call

verbose_map = self._qc_datasource.data.get("verbose_map", {})

that ends up calling uid, and finally... raises a lovely NotImplementedError. Seeing the code (unless there's an object overriding the type property somewhere), I have no idea how to avoid this.

    @property
    def type(self) -> str:
        raise NotImplementedError()

    @property
    def uid(self) -> str:
        """Unique id across datasource types"""
        return f"{self.id}__{self.type}"
superset/common/query_context_processor.py:569: in get_data
   verbose_map = self._qc_datasource.data.get("verbose_map", {})
superset/connectors/sqla/models.py:376: in data
   "uid": self.uid,
superset/connectors/sqla/models.py:266: in uid
   return f"{self.id}__{self.type}"
   
self = <superset.connectors.sqla.models.BaseDatasource object at 0x74ce68123fa0>

   @property
   def type(self) -> str:
>       raise NotImplementedError()
E       NotImplementedError

So, in order to get a bit further, I locally put a try/except around the verbose_map call, which may eventually be a solution? (An alternative is sketched below.)
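(A sketch of one alternative, under the assumption that subclassing is acceptable in a test: override the abstract property so uid, and thus data/verbose_map, no longer raises.)

# Hypothetical test double; not from the PR. Gives BaseDatasource a
# concrete `type` so the uid property stops raising NotImplementedError.
from superset.connectors.sqla.models import BaseDatasource


class FakeDatasource(BaseDatasource):
    @property
    def type(self) -> str:
        return "table"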

Which leads to my even bigger issue ...

Second: how to check

The method I must test returns this:

                result = excel.df_to_excel(ndf, **config["EXCEL_EXPORT"])
            return result or ""

which is not very testable.

A log output gave me, as expected, binary content that looks like this, so, yeah, the test runs:

 2024-03-05 18:46:01,425:INFO:root:b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00?\x00a]I:O\x01\x00\x00\x8f\x04\x00\x00\x13\x00\x00\x00[Content_Types].xml\xad\x94\xcbn\xc20\x10E\xf7\xfd\x8a\xc8\xdb*1tQU\x15\x81E\x1f\xcb\x16\xa9\xf4\x03\\{B,\x

Is there a way to "mock" or "override" the excel.df_to_excel call so that the test can get comparable data?

Or else, is it advisable to somehow refactor the method and extract a more testable one before the call to the Excel conversion? (One mocking idea is sketched below.)
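(One mocking idea, sketched with unittest.mock; it assumes get_data() calls excel.df_to_excel via the module imported in query_context_processor, as the traceback above suggests, and leaves the actual invocation elided.)

# A pytest-style sketch: patch df_to_excel and assert on the DataFrame
# it receives, rather than on the raw xlsx bytes.
from unittest import mock

from superset.common import query_context_processor


def test_xlsx_export_receives_converted_df() -> None:
    with mock.patch.object(
        query_context_processor.excel, "df_to_excel", return_value=b""
    ) as stub:
        ...  # build the query context and call get_data(df) here
    (exported_df,), _ = stub.call_args
    # assert on exported_df.dtypes instead of binary output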

Thanks for any advice!

Regards

@pull-request-size pull-request-size bot added size/L and removed size/S labels Mar 5, 2024
@squalou squalou force-pushed the master branch 2 times, most recently from 0aa2191 to 2cf4443 on March 6, 2024 07:26
@squalou
Author

squalou commented Mar 6, 2024

(Damn, import sort was not set; fixed and re-pushed.)

@@ -571,11 +574,22 @@ def get_data(self, df: pd.DataFrame) -> str | list[dict[str, Any]]:
                 df, index=include_index, **config["CSV_EXPORT"]
             )
         elif self._query_context.result_format == ChartDataResultFormat.XLSX:
-            result = excel.df_to_excel(df, **config["EXCEL_EXPORT"])
+            ndf = self.copy_df_for_excel_export(df, coltypes)
Member

@squalou Should this change actually be implemented in superset.utils.excel.df_to_excel to ensure consistent behavior? You could move your tests to tests/unit_tests/utils/excel_tests. We already have one there called test_timezone_conversion, which is a similar type of conversion.

Author

@squalou squalou Mar 6, 2024

Makes a lot of sense, sure.

What would be better:

  • add "coltypes" as an optional parameter to the excel.df_to_excel(...) method, or
  • move the method as-is into the superset.utils.excel package?

The first option seems better, but I'm not at ease changing method signatures there without advice; I really have no global overview of what may call what, and of the consequences.

EDIT: from a quick check, the method is called in one place and one test, so... let's go.

Member

I checked that coltypes are extracted from the DataFrame here.

    output = io.BytesIO()

    # timezones are not supported
    for column in df.select_dtypes(include=["datetimetz"]).columns:
        df[column] = df[column].astype(str)

    ndf = copy_df_with_datatype_conversion(df, coltypes)
Member

@squalou Can the column types be extracted from the DataFrame, like we are doing for timezones:

for column in df.select_dtypes(include=["datetimetz"]).columns:
    df[column] = df[column].astype(str)

or are they not matching what's defined in coltypes?

Author

Thing is: you could have data with leading zeros, for instance, defined as strings in the original datasource, and you want them to stay strings in the end, or else Excel would omit the leading zeros, maybe round things, whatever.

I don't know much about the DataFrame object, and the whole issue I'm trying to solve is to get the datatype as it was defined for the column to begin with.

If you know of a way to get this data from the DataFrame itself, it sure would be MUCH more elegant than forwarding the column_types all the way down to the Excel export.

To be honest, I discovered pandas objects only to put together this patch... so if you think this metadata can be stored/handled differently, I'll be glad to learn.

I also did not want to take any risk of changing the original DataFrame object, as I have no idea of the global Superset architecture.

Author

@squalou squalou Mar 7, 2024

I almost forgot the most important part: I have absolutely no idea how to know which dtypes are "stored" in the DataFrame at the beginning :) so I used the coltypes that seemed to be there for a reason :)
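(For reference, a plain-pandas illustration of why the DataFrame alone isn't enough: each column does carry a dtype, but text arriving from the database is just generic "object", so a numeric-as-string column is indistinguishable from a genuine string column. That is the information coltypes adds.)

import pandas as pd

df = pd.DataFrame({"as_text": ["1.5", "2.5"], "as_number": [1.5, 2.5]})
print(df.dtypes)  # as_text: object, as_number: float64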



def test_timezone_conversion() -> None:
    """
    Test that columns with timezones are converted to a string.
    """
    df = pd.DataFrame({"dt": [datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)]})
-    contents = df_to_excel(df)
+    contents = df_to_excel(df, [])
Member

Suggested change:

-    contents = df_to_excel(df, [])
+    contents = df_to_excel(df, [GenericDataType.TEMPORAL])

    return output.getvalue()


def convert_df_with_datatypes(
Member

I believe we could make it more explicit that the DataFrame can be modified and avoid the copy overhead. To do that, we need to separate column conversion from exporting. Something like:

# (imports added for context; GenericDataType lives in superset.utils.core)
import io
from typing import Any

import pandas as pd

from superset.utils.core import GenericDataType


def df_to_excel(df: pd.DataFrame, **kwargs: Any) -> bytes:
    output = io.BytesIO()

    # pylint: disable=abstract-class-instantiated
    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        df.to_excel(writer, **kwargs)

    return output.getvalue()


def apply_column_types(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
        elif pd.api.types.is_datetime64tz_dtype(df[column]):
            # timezones are not supported
            df[column] = df[column].astype(str)
        else:
            df[column] = df[column].astype(column_type)
    return df

Author

@squalou squalou Mar 7, 2024

I'm probably over-conservative, but I have one remaining concern:

  • else: df[column] = df[column].astype(column_type) ... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather leave data that doesn't need converting untouched, but here again you probably know better.

Author

Addendum: a quick test of the "else" conversion fails with this error (DataFrame.astype expects a numpy/pandas dtype, and GenericDataType is a Superset enum, so it can't be passed directly):

TypeError: Cannot interpret '<GenericDataType.STRING: 1>' as a data type

Member

> pd.to_numeric may raise an exception; per the documentation, the "ignore errors" param is deprecated and it's recommended to catch exceptions.

Good point. Here we have two options:

  • Catch and convert to a default. We also need to alert the user that some values were modified when exporting.
  • Don't catch, and show a good message clearly indicating the column/row that is preventing the export.

> else: df[column] = df[column].astype(column_type) ... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather leave data that doesn't need converting untouched, but here again you probably know better.

I'm not sure what you mean by untouched data, but I'm assuming that when exporting, users want the data to comply with what is defined as the column type. That was your original problem for numeric types, so I'm assuming that other non-numeric columns that don't match their defined types should also be converted.

Let me ask for @betodealmeida's and @eschutho's input here, as they have more context on this feature and might help us.

Author

As an end user, I would much prefer an xlsx file with data I have to manually reformat over having nothing at all.

> Catch and convert to a default. We also need to alert the user that some values were modified when exporting.

Probably the same as the "dates with timezone" fix: df[column].astype(str).

Alerting the user is faaaaaaaar beyond anything I would be able to do, and again, as an end user, I think I wouldn't care. Nothing is said about the timezone conversion either, and I bet nobody cares.

> Don't catch, and show a good message clearly indicating the column/row that is preventing the export.

As said, as an end user I would rather have a not-perfectly-formatted file with all the data in it than no data at all.

  • About "untouched fields":

I don't really remember all the details (because it dates back a bit and it's not my passion, tbh :) ), but at the time I manually opened the produced xlsx files, meaning: unzipped them to see the XML-ish data inside and tried to understand the differences between converted and unconverted data. From there, I opened them in LibreOffice or MS Excel to see the behaviors and tried to find the best approach.

Numeric values are stored "in the cells", while others have references in the cell and values mapped somewhere else. Or maybe it's the other way round; I can't even remember which is which. Anyway, there's a clear separation between numeric and "other". Then, inside the "other" category, of course, there may be further differences.

From my understanding, the best thing one can do so that any office software deals with these files is to make sure numerics are numerics; anything else doesn't matter that much, and autodetection of strings works.

That's probably why df[column] = df[column].astype(str) in the date conversion has no real bad effect on usability. (A sketch of the catch-and-fall-back option is below.)
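(A minimal sketch of the "catch and fall back" option discussed above; the function name is assumed, and the fallback mirrors the existing timezone handling by converting to str rather than failing the export.)

import pandas as pd

from superset.utils.core import GenericDataType


def apply_column_types_safely(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            try:
                df[column] = pd.to_numeric(df[column])
            except (ValueError, TypeError):
                # leave the column as strings rather than failing the export
                df[column] = df[column].astype(str)
    return df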

That's all I know, which is not much for such a long text :)

@squalou
Author

squalou commented Mar 7, 2024

Damn, I still have a minor pylint issue ("exception as e" is a bad habit; moreover, e is not used :) ).

I have an issue, though, with an error in a file I did not change:

superset/commands/base.py:61:0: I0021: Useless suppression of 'too-few-public-methods' (useless-suppression)

That's weird.

I rebased just in case.

@pull-request-size pull-request-size bot added size/M and removed size/L labels Mar 7, 2024
@squalou
Author

squalou commented Mar 8, 2024

Damn! My local "black" version doesn't behave exactly the same as the one on CI! Will be fixed on the next run.

@squalou
Author

squalou commented Mar 11, 2024

CI run fail:
Python-Integration / test-mysql (3.9) (pull_request) failing after 6m

Hm, no idea how this could be related to the PR; could it be failing on master before the changes?

=================================== FAILURES ===================================
__________________ TestPostChartDataApi.test_chart_data_async __________________

self = <tests.integration_tests.charts.data.api_tests.TestPostChartDataApi testMethod=test_chart_data_async>

    @with_feature_flags(GLOBAL_ASYNC_QUERIES=True)
    @pytest.mark.usefixtures("load_birth_names_dashboard_with_slices")
    def test_chart_data_async(self):
        self.logout()
        app._got_first_request = False
        async_query_manager_factory.init_app(app)
        self.login("admin")
        rv = self.post_assert_metric(CHART_DATA_URI, self.query_context_payload, "data")
>       self.assertEqual(rv.status_code, 202)
E       AssertionError: 200 != 202

tests/integration_tests/charts/data/api_tests.py:707: AssertionError

@rusackas
Member

> Hm, no idea how this could be related to the PR; could it be failing on master before the changes?

Agreed, seems unrelated... I'll rerun it in hopes that it's just a flaky test.

@squalou
Author

squalou commented Mar 15, 2024

I've been away a bit; I'm back. Is there anything I should do?
Maybe try a rebase on master and shoot again?

Right now I'll wait for further instructions, rather than adding mess to the situation :)

@rusackas
Member

Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.

@squalou
Author

squalou commented Mar 19, 2024

> Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.

Damn, from what I remember the previous crash was not this one. (Do we have a history of previous runs somewhere, to be sure?)

SqlLab query tabs
  
  We detected that the Chromium Renderer process just crashed.

@rusackas
Member

> We detected that the Chromium Renderer process just crashed.

It seems that GitHub Actions runners are random machines from different clusters... these things are hard to nail down, but certain machines don't match and have trouble. It makes the tests look flaky, which is really annoying. I'm re-running that 1/3 of the E2E tests now 🤞

@squalou
Author

squalou commented Apr 2, 2024

Should I try a rebase, or is it useless?


Successfully merging this pull request may close these issues.

Datatype not correctly exported to xlsx - with patch suggestion
4 participants