
fix: set columns numeric datatypes when exporting to excel #27229

Open

wants to merge 6 commits into base: master

Conversation

squalou

@squalou squalou commented Feb 23, 2024

SUMMARY

When exporting to Excel, datatypes are currently not specified, so decimal numbers end up stored as strings, which may cause issues when opening the file.

Depending on the locale, the issue may not even be visible, since Excel manages to convert things (that's the case in the us/en locale). With another locale (say fr), numbers output with "." as the decimal separator and stored as strings won't be usable as numbers in Excel.

The idea here is to reuse the existing "datatype guessing" method and use it to convert the DataFrame column types when required before exporting.
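A minimal sketch of the idea (the helper name coerce_numeric_columns is hypothetical; GenericDataType is Superset's generic column type enum from superset.utils.core):

# A minimal sketch, not the PR's exact code: coerce columns flagged as
# NUMERIC so Excel stores them as numbers instead of strings.
import pandas as pd

from superset.utils.core import GenericDataType


def coerce_numeric_columns(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
    return df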

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

not applicable

TESTING INSTRUCTIONS

  • Create a Table chart with numerical (integer and decimal) values in it. Ideally add some strings and dates.
  • Export to Excel.
  • You can then unzip the .xlsx file and open sheet1.xml in a text editor to check the export.
  • Numbers appear directly in the XML cell.
  • Strings are not visible directly; instead, an <s> markup is used, containing an id referencing the string.

Opening the sheet1.xml file in a text editor is the safest way to check the issue, because of the automatic conversions software like Excel, LibreOffice Calc, or others perform when opening files, which may hide the issue. One way to dump the raw XML is sketched below.
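(For convenience, a minimal sketch for dumping the raw sheet XML; the file name export.xlsx is an example. An .xlsx file is a zip archive with worksheets stored under xl/worksheets/.)

# Dump the raw sheet XML of an exported file without a spreadsheet app.
import zipfile

with zipfile.ZipFile("export.xlsx") as xlsx:
    print(xlsx.read("xl/worksheets/sheet1.xml").decode("utf-8"))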

ADDITIONAL INFORMATION

  • Has associated issue: Fixes Datatype not correctly exported to xlsx - with patch suggestion #23375
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley added the review:checkpoint Last PR reviewed during the daily review standup label Feb 23, 2024
@rusackas
Member

Approving CI run 🤞


codecov bot commented Feb 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.71%. Comparing base (ce0b70c) to head (b641e13).
Report is 30 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #27229      +/-   ##
==========================================
+ Coverage   67.34%   69.71%   +2.37%     
==========================================
  Files        1909     1909              
  Lines       74622    74633      +11     
  Branches     8324     8324              
==========================================
+ Hits        50257    52034    +1777     
+ Misses      22312    20546    -1766     
  Partials     2053     2053              
Flag Coverage Δ
hive 53.76% <81.25%> (?)
mysql 78.00% <81.25%> (+0.02%) ⬆️
postgres 78.10% <81.25%> (+<0.01%) ⬆️
presto 53.71% <81.25%> (?)
python 83.15% <100.00%> (+4.91%) ⬆️
sqlite 77.62% <81.25%> (+<0.01%) ⬆️
unit 56.55% <81.25%> (?)

Flags with carried forward coverage won't be shown.


@rusackas rusackas changed the title #23375 set columns numeric datatypes when exporting to excel Fix(23375): set columns numeric datatypes when exporting to excel Feb 23, 2024
@rusackas
Member

Looks like you need to run the pre-commit hook for formatting/linting.

@michael-s-molina michael-s-molina changed the title Fix(23375): set columns numeric datatypes when exporting to excel fix: set columns numeric datatypes when exporting to excel Feb 23, 2024
@michael-s-molina
Member

Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?

@squalou
Author

squalou commented Feb 24, 2024

> Looks like you need to run the pre-commit hook for formatting/linting.

Yeah, sorry about that. It should be fixed now; my IDE for this project is not set to auto-format, I totally forgot.

> Thank you for the fix @squalou. Could you add a test for it to avoid future regressions?

I'd love to! Really. But I admit I'm a bit lost in the Superset architecture and test framework, and I'm not a pro Python dev to begin with, which does not help.

I've tried to guess where "export csv" might be tested so I could extend from there, but did not find anything. Got myself lost in viz_tests.py and didn't go any further.

Any help / pointer on where to begin would be appreciated.

@michael-s-molina
Member

> Any help / pointer on where to begin would be appreciated.

You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.

@rusackas rusackas removed the review:checkpoint Last PR reviewed during the daily review standup label Feb 26, 2024
@squalou
Author

squalou commented Feb 27, 2024

> Any help / pointer on where to begin would be appreciated.

> You can create a test in tests/unit_tests/common, similar to test_get_aggregated_join_column.py.

Thank you! Will have a look ASAP (on vacation break this week :D).

@squalou
Author

squalou commented Mar 5, 2024

[EDIT] Simply writing things down here helped me realize I should break the tested function in two, and only test the changes actually introduced.

I submitted "something".

Now... these tests do not verify that the Excel export itself really works (it does, see further down), and that's still a shame, maybe?

I'm leaving the text below to document the "excel export" testability issues I encountered.

(original post below the line)


Hi,

I managed to run the tests locally and I'm trying to add one... but I have (at least) two issues.

You'll see from these that... I'm full of good will but a desperate Python newbie :)

First: BaseDatasource

Following the test example, I instantiate this (imports added here for context; module paths assumed from the tracebacks below):

from superset.common.chart_data import ChartDataResultFormat, ChartDataResultType
from superset.common.query_context import QueryContext
from superset.common.query_context_processor import QueryContextProcessor
from superset.connectors.sqla.models import BaseDatasource

query_context_processor = QueryContextProcessor(
    QueryContext(
        datasource=BaseDatasource(),
        queries=[],
        result_type=ChartDataResultType.COLUMNS,
        form_data={},
        slice_=None,
        result_format=ChartDataResultFormat.XLSX,
        cache_values={},
    )
)

But then in query_context_processor.py, in the get_data() method, there's a call

verbose_map = self._qc_datasource.data.get("verbose_map", {})

that ends up calling uid, and finally... raises a lovely NotImplementedError. Seeing the code (unless there's an object overriding the type property somewhere), I have no idea how to avoid this.

    @property
    def type(self) -> str:
        raise NotImplementedError()

    @property
    def uid(self) -> str:
        """Unique id across datasource types"""
        return f"{self.id}__{self.type}"
superset/common/query_context_processor.py:569: in get_data
   verbose_map = self._qc_datasource.data.get("verbose_map", {})
superset/connectors/sqla/models.py:376: in data
   "uid": self.uid,
superset/connectors/sqla/models.py:266: in uid
   return f"{self.id}__{self.type}"
   
self = <superset.connectors.sqla.models.BaseDatasource object at 0x74ce68123fa0>

   @property
   def type(self) -> str:
>       raise NotImplementedError()
E       NotImplementedError

So, in order to get a bit further, I locally put a try/except around the verbose_map call, which may eventually be a solution? (An alternative is sketched below.)
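(A sketch of one alternative, under the assumption that subclassing is acceptable in a test: override the abstract property so uid, and thus data/verbose_map, no longer raises.)

# Hypothetical test double; not from the PR. Gives BaseDatasource a
# concrete `type` so the uid property stops raising NotImplementedError.
from superset.connectors.sqla.models import BaseDatasource


class FakeDatasource(BaseDatasource):
    @property
    def type(self) -> str:
        return "table"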

Which leads to my even bigger issue ...

Second: how to check

The method I must test returns this:

                result = excel.df_to_excel(ndf, **config["EXCEL_EXPORT"])
            return result or ""

which is not very testable.

A log output gave me, as expected, binary content that looks like this, so, yeah, the test runs:

 2024-03-05 18:46:01,425:INFO:root:b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x00?\x00a]I:O\x01\x00\x00\x8f\x04\x00\x00\x13\x00\x00\x00[Content_Types].xml\xad\x94\xcbn\xc20\x10E\xf7\xfd\x8a\xc8\xdb*1tQU\x15\x81E\x1f\xcb\x16\xa9\xf4\x03\\{B,\x

Is there a way to "mock" or "override" the excel.df_to_excel call so that the test can get comparable data?

Or else, is it advisable to somehow refactor the method and extract a more testable one before the call to the Excel conversion? (One mocking idea is sketched below.)
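(One mocking idea, sketched with unittest.mock; it assumes get_data() calls excel.df_to_excel via the module imported in query_context_processor, as the traceback above suggests, and leaves the actual invocation elided.)

# A pytest-style sketch: patch df_to_excel and assert on the DataFrame
# it receives, rather than on the raw xlsx bytes.
from unittest import mock

from superset.common import query_context_processor


def test_xlsx_export_receives_converted_df() -> None:
    with mock.patch.object(
        query_context_processor.excel, "df_to_excel", return_value=b""
    ) as stub:
        ...  # build the query context and call get_data(df) here
    (exported_df,), _ = stub.call_args
    # assert on exported_df.dtypes instead of binary output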

Thanks for any advice!

Regards

@pull-request-size pull-request-size bot added size/L and removed size/S labels Mar 5, 2024
@squalou squalou force-pushed the master branch 2 times, most recently from 0aa2191 to 2cf4443 on March 6, 2024 07:26
@squalou
Author

squalou commented Mar 6, 2024

(Damn, import sort was not set; fixed and re-pushed.)

@@ -571,11 +574,22 @@ def get_data(self, df: pd.DataFrame) -> str | list[dict[str, Any]]:
                 df, index=include_index, **config["CSV_EXPORT"]
             )
         elif self._query_context.result_format == ChartDataResultFormat.XLSX:
-            result = excel.df_to_excel(df, **config["EXCEL_EXPORT"])
+            ndf = self.copy_df_for_excel_export(df, coltypes)
Member

@squalou Should this change actually be implemented in superset.utils.excel.df_to_excel to ensure consistent behavior? You could move your tests to tests/unit_tests/utils/excel_tests. We already have one there called test_timezone_conversion, which is a similar type of conversion.

Author

@squalou squalou Mar 6, 2024

Makes a lot of sense, sure.

What would be better:

  • add "coltypes" as an optional parameter to the excel.df_to_excel(...) method, or
  • move the method as-is into the superset.utils.excel package?

The first option seems better, but I'm not at ease changing method signatures there without advice; I really have no global overview of what may call what, and of the consequences.

EDIT: from a quick check, the method is called in one place and one test, so... let's go.

Member

I checked that coltypes are extracted from the DataFrame here.

    output = io.BytesIO()

    # timezones are not supported
    for column in df.select_dtypes(include=["datetimetz"]).columns:
        df[column] = df[column].astype(str)

    ndf = copy_df_with_datatype_conversion(df, coltypes)
Member

@squalou Can the column types be extracted from the DataFrame, like we are doing for timezones:

for column in df.select_dtypes(include=["datetimetz"]).columns:
    df[column] = df[column].astype(str)

or are they not matching what's defined in coltypes?

Author

Thing is: you could have data with leading zeros, for instance, defined as strings in the original datasource, and you want them to stay strings in the end, or else Excel would omit the leading zeros, maybe round things, whatever.

I don't know much about the DataFrame object, and the whole issue I'm trying to solve is to get the datatype as it was defined for the column to begin with.

If you know of a way to get this data from the DataFrame itself, it sure would be MUCH more elegant than forwarding the column_types all the way down to the Excel export.

To be honest, I discovered pandas objects only to put together this patch... so if you think this metadata can be stored/handled differently, I'll be glad to learn.

I also did not want to take any risk of changing the original DataFrame object, as I have no idea of the global Superset architecture.

Author

@squalou squalou Mar 7, 2024

I almost forgot the most important part: I have absolutely no idea how to know which dtypes are "stored" in the DataFrame at the beginning :) so I used the coltypes that seemed to be there for a reason :)
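(For reference, a plain-pandas illustration of why the DataFrame alone isn't enough: each column does carry a dtype, but text arriving from the database is just generic "object", so a numeric-as-string column is indistinguishable from a genuine string column. That is the information coltypes adds.)

import pandas as pd

df = pd.DataFrame({"as_text": ["1.5", "2.5"], "as_number": [1.5, 2.5]})
print(df.dtypes)  # as_text: object, as_number: float64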



def test_timezone_conversion() -> None:
    """
    Test that columns with timezones are converted to a string.
    """
    df = pd.DataFrame({"dt": [datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)]})
-    contents = df_to_excel(df)
+    contents = df_to_excel(df, [])
Member

Suggested change:

-    contents = df_to_excel(df, [])
+    contents = df_to_excel(df, [GenericDataType.TEMPORAL])

    return output.getvalue()


def convert_df_with_datatypes(
Member

I believe we could make it more explicit that the DataFrame can be modified and avoid the copy overhead. To do that, we need to separate column conversion from exporting. Something like:

# (imports added for context; GenericDataType lives in superset.utils.core)
import io
from typing import Any

import pandas as pd

from superset.utils.core import GenericDataType


def df_to_excel(df: pd.DataFrame, **kwargs: Any) -> bytes:
    output = io.BytesIO()

    # pylint: disable=abstract-class-instantiated
    with pd.ExcelWriter(output, engine="xlsxwriter") as writer:
        df.to_excel(writer, **kwargs)

    return output.getvalue()


def apply_column_types(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            df[column] = pd.to_numeric(df[column])
        elif pd.api.types.is_datetime64tz_dtype(df[column]):
            # timezones are not supported
            df[column] = df[column].astype(str)
        else:
            df[column] = df[column].astype(column_type)
    return df

Author

@squalou squalou Mar 7, 2024

I'm probably over-conservative, but I have one remaining concern:

  • else: df[column] = df[column].astype(column_type) ... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather leave data that doesn't need converting untouched, but here again you probably know better.

Author

Addendum: a quick test of the "else" conversion fails with this error (DataFrame.astype expects a numpy/pandas dtype, and GenericDataType is a Superset enum, so it can't be passed directly):

TypeError: Cannot interpret '<GenericDataType.STRING: 1>' as a data type

Member

> pd.to_numeric may raise an exception; per the documentation, the "ignore errors" param is deprecated and it's recommended to catch exceptions.

Good point. Here we have two options:

  • Catch and convert to a default. We also need to alert the user that some values were modified when exporting.
  • Don't catch, and show a good message clearly indicating the column/row that is preventing the export.

> else: df[column] = df[column].astype(column_type) ... again, I'm a bit scared of the effect on the Excel export side, because I don't know how it works. I'd rather leave data that doesn't need converting untouched, but here again you probably know better.

I'm not sure what you mean by untouched data, but I'm assuming that when exporting, users want the data to comply with what is defined as the column type. That was your original problem for numeric types, so I'm assuming that other non-numeric columns that don't match their defined types should also be converted.

Let me ask for @betodealmeida's and @eschutho's input here, as they have more context on this feature and might help us.

Author

As an end user, I would much prefer an xlsx file with data I have to manually reformat over having nothing at all.

> Catch and convert to a default. We also need to alert the user that some values were modified when exporting.

Probably the same as the "dates with timezone" fix: df[column].astype(str).

Alerting the user is faaaaaaaar beyond anything I would be able to do, and again, as an end user, I think I wouldn't care. Nothing is said about the timezone conversion either, and I bet nobody cares.

> Don't catch, and show a good message clearly indicating the column/row that is preventing the export.

As said, as an end user I would rather have a not-perfectly-formatted file with all the data in it than no data at all.

  • About "untouched fields":

I don't really remember all the details (because it dates back a bit and it's not my passion, tbh :) ), but at the time I manually opened the produced xlsx files, meaning: unzipped them to see the XML-ish data inside and tried to understand the differences between converted and unconverted data. From there, I opened them in LibreOffice or MS Excel to see the behaviors and tried to find the best approach.

Numeric values are stored "in the cells", while others have references in the cell and values mapped somewhere else. Or maybe it's the other way round; I can't even remember which is which. Anyway, there's a clear separation between numeric and "other". Then, inside the "other" category, of course, there may be further differences.

From my understanding, the best thing one can do so that any office software deals with these files is to make sure numerics are numerics; anything else doesn't matter that much, and autodetection of strings works.

That's probably why df[column] = df[column].astype(str) in the date conversion has no real bad effect on usability. (A sketch of the catch-and-fall-back option is below.)
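(A minimal sketch of the "catch and fall back" option discussed above; the function name is assumed, and the fallback mirrors the existing timezone handling by converting to str rather than failing the export.)

import pandas as pd

from superset.utils.core import GenericDataType


def apply_column_types_safely(
    df: pd.DataFrame, column_types: list[GenericDataType]
) -> pd.DataFrame:
    for column, column_type in zip(df.columns, column_types):
        if column_type == GenericDataType.NUMERIC:
            try:
                df[column] = pd.to_numeric(df[column])
            except (ValueError, TypeError):
                # leave the column as strings rather than failing the export
                df[column] = df[column].astype(str)
    return df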

That's all I know, which is not much for such a long text :)

@squalou
Author

squalou commented Mar 7, 2024

Damn, I still have a minor pylint issue ("exception as e" is a bad habit; moreover, e is not used :) ).

I have an issue, though, with an error in a file I did not change:

superset/commands/base.py:61:0: I0021: Useless suppression of 'too-few-public-methods' (useless-suppression)

That's weird.

I rebased just in case.

@pull-request-size pull-request-size bot added size/M and removed size/L labels Mar 7, 2024
@squalou
Author

squalou commented Mar 8, 2024

Damn! My local "black" version doesn't behave exactly the same as the one on CI! Will be fixed on the next run.

@squalou
Author

squalou commented Mar 11, 2024

CI run fail:
Python-Integration / test-mysql (3.9) (pull_request) failing after 6m

Hm, no idea how this could be related to the PR; could it be failing on master before the changes?

=================================== FAILURES ===================================
__________________ TestPostChartDataApi.test_chart_data_async __________________

self = <tests.integration_tests.charts.data.api_tests.TestPostChartDataApi testMethod=test_chart_data_async>

    @with_feature_flags(GLOBAL_ASYNC_QUERIES=True)
    @pytest.mark.usefixtures("load_birth_names_dashboard_with_slices")
    def test_chart_data_async(self):
        self.logout()
        app._got_first_request = False
        async_query_manager_factory.init_app(app)
        self.login("admin")
        rv = self.post_assert_metric(CHART_DATA_URI, self.query_context_payload, "data")
>       self.assertEqual(rv.status_code, 202)
E       AssertionError: 200 != 202

tests/integration_tests/charts/data/api_tests.py:707: AssertionError

@rusackas
Member

> Hm, no idea how this could be related to the PR; could it be failing on master before the changes?

Agreed, seems unrelated... I'll rerun it in hopes that it's just a flaky test.

@squalou
Author

squalou commented Mar 15, 2024

I've been away a bit; I'm back. Is there anything I should do?
Maybe try a rebase on master and shoot again?

Right now I'll wait for further instructions, rather than adding mess to the situation :)

@rusackas
Member

Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.

@squalou
Author

squalou commented Mar 19, 2024

> Hah... I'm rerunning the failing test now in hopes that it's just flaky. Fingers crossed.

Damn, from what I remember the previous crash was not this one. (Do we have a history of previous runs somewhere, to be sure?)

SqlLab query tabs
  
  We detected that the Chromium Renderer process just crashed.

@rusackas
Member

> We detected that the Chromium Renderer process just crashed.

It seems that GitHub Actions runners are random machines from different clusters... these things are hard to nail down, but certain machines don't match and have trouble. It makes the tests look flaky, which is really annoying. I'm re-running that 1/3 of the E2E tests now 🤞

@squalou
Author

squalou commented Apr 2, 2024

Should I try a rebase, or is it useless?


Successfully merging this pull request may close these issues.

Datatype not correctly exported to xlsx - with patch suggestion
4 participants