Support ERROR cell type when using inferSchema=true #343

derianpt · 2021-02-10T06:19:13Z

Fixes #208

When the inferSchema option is enabled and the Excel file being read contains ERROR-type cells, reading fails with a runtime error:
scala.MatchError: ERROR (of class shadeio.poi.ss.usermodel.CellType)

This PR adds an option to treat ERROR cells as string cells and output their error codes instead (e.g. #N/A, #NULL!).
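To illustrate the failure mode described above, here is a minimal sketch in plain Scala. The `CellKind` type and both functions are hypothetical stand-ins, not POI's `CellType` or the actual spark-excel code; the point is only that a `match` with no case for the error kind throws `scala.MatchError` at runtime, and adding a case removes the failure.

```scala
// Hypothetical stand-in for a cell-type enum; not the real POI class.
object MatchErrorSketch {
  sealed trait CellKind
  case object Numeric extends CellKind
  case object Text extends CellKind
  case object Error extends CellKind

  // No case for Error: matching an Error cell throws scala.MatchError.
  // @unchecked only silences the compiler's exhaustivity warning.
  def inferBefore(kind: CellKind): String = (kind: @unchecked) match {
    case Numeric => "DoubleType"
    case Text    => "StringType"
  }

  // One way to cover the missing case: treat error cells as strings.
  def inferAfter(kind: CellKind): String = kind match {
    case Numeric => "DoubleType"
    case Text    => "StringType"
    case Error   => "StringType"
  }
}
```

Calling `MatchErrorSketch.inferBefore(MatchErrorSketch.Error)` fails with a `MatchError`, while `inferAfter` returns a type name for every kind.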

derianpt · 2021-02-10T06:28:34Z

@nightscape I'm afraid it's not possible to set individual CellType.ERROR cells to null, for the following reasons:

If we set the SQL type of that cell to NullType, the InferSchema.apply function will still finalise the type of the entire column as StringType after merging.
Other cells for the same column might have valid values (non ERROR), so the final inferred schema would likely be something other than NullType, depending on the data (e.g. StringType, DoubleType, BooleanType)
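The two reasons above can be sketched with a simplified model of per-column type merging. This is an assumption about the general shape of such inference, not the actual `InferSchema` code: the type names and the `merge` and `inferColumn` functions are hypothetical.

```scala
// Simplified, hypothetical model of column type inference by merging cell types.
object MergeSketch {
  sealed trait SqlType
  case object NullT extends SqlType
  case object DoubleT extends SqlType
  case object BooleanT extends SqlType
  case object StringT extends SqlType

  // NullT carries no information, so the other side wins;
  // two different concrete types widen to StringT.
  def merge(a: SqlType, b: SqlType): SqlType = (a, b) match {
    case (NullT, other)   => other
    case (other, NullT)   => other
    case (x, y) if x == y => x
    case _                => StringT
  }

  // A column that is still NullT after merging is finalised as StringT.
  def inferColumn(cells: Seq[SqlType]): SqlType =
    cells.foldLeft(NullT: SqlType)(merge) match {
      case NullT => StringT
      case t     => t
    }
}
```

Under this model, marking an ERROR cell as `NullT` never shows up in the final schema: a column of only such cells ends up as `StringT`, and a column with other valid cells takes their type.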

Please see the commented-out test case for more info. I'd appreciate any ideas on how to output individual ERROR cells as null.

If it's not possible, we can just go with the default behaviour of processing ERROR cells as strings, which will at least prevent the runtime error and add support for them.

nightscape · 2021-02-12T23:51:04Z

@derianpt overall great job!
I didn't find the time to check the tests yet, will hopefully get to it this weekend or early next week.

derianpt · 2021-02-17T03:04:02Z

> @derianpt overall great job!
> I didn't find the time to check the tests yet, will hopefully get to it this weekend or early next week.

Hi @nightscape, have you had the time to check out the tests yet? 🙇

nightscape · 2021-02-17T17:14:31Z

I just had a look. I think the problem is that the second column in your example xlsx file contains only errors.
If you put in another row with e.g. a double value the schema should be inferred correctly.

derianpt · 2021-02-18T13:12:08Z

Thanks for the tip @nightscape. Turns out I was missing the code change that updates how cell values should be extracted for ERROR cell types. The tests are now passing.

However, there is a side effect when we treat errors as strings: the entire column will then be interpreted as string type. Understandable and acceptable behaviour imo. This is reflected in the first test case.

I've also put a note of this in the README.

nightscape · 2021-02-18T15:18:49Z

If I understand correctly, the type of the column changes to String only if there is an error in that column, and then only if it is within the first excerptSize rows which are used for schema inference.
If an error happens somewhere after excerptSize rows it would probably fail at runtime because the column was inferred to be e.g. Double type from the error-free first rows and then the error string can't be converted into a Double.
I've read up a little on how error handling is done in other Spark datasources, and their approach is to have different error handling modes (PERMISSIVE, FAILFAST, DROPMALFORMED). In PERMISSIVE mode, bad records are stored in a _corrupt_record column. I'm wondering if we should apply the same strategy here?

Something else that is worth knowing is that there is the possibility to add metadata to a column. It would be possible to use this to specify a sensible fallback value for one specific column (e.g. false, 0, "", ...).
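The excerptSize concern raised above can be sketched in plain Scala. This is a toy model under stated assumptions, not spark-excel's reader: cells are plain strings, and `inferFromExcerpt` and `readColumn` are hypothetical helpers that infer a type from the first `excerptSize` cells and then convert the whole column.

```scala
// Toy model: infer a column type from an excerpt, then convert every cell.
object ExcerptSketch {
  // "double" if every sampled cell parses as a Double, otherwise "string".
  def inferFromExcerpt(cells: Seq[String], excerptSize: Int): String =
    if (cells.take(excerptSize).forall(c => scala.util.Try(c.toDouble).isSuccess)) "double"
    else "string"

  // Converts all cells under the inferred type. A non-numeric cell in a
  // column inferred as "double" throws NumberFormatException.
  def readColumn(cells: Seq[String], excerptSize: Int): Seq[Any] =
    inferFromExcerpt(cells, excerptSize) match {
      case "double" => cells.map(_.toDouble)
      case _        => cells
    }
}
```

With `Seq("1.0", "2.0", "#N/A")` and an excerpt size of 2, the column is inferred as double and reading fails on the third cell; with an excerpt size of 3, the error string is seen during inference and the column falls back to string.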

derianpt · 2021-02-19T07:24:36Z

> If I understand correctly, the type of the column changes to String only if there is an error in that column, and then only if it is within the first excerptSize rows which are used for schema inference.
> If an error happens somewhere after excerptSize rows it would probably fail at runtime because the column was inferred to be e.g. Double type from the error-free first rows and then the error string can't be converted into a Double.

I have modified the test file & confirmed this behaviour. Looks like blanket-converting to String will not work; we need to conform to the inferred column type. However, setting the cell to null still works.

I suggest this strategy:
By default, we convert ERROR cells to null, but we have a boolean option setErrorCellsToFallbackValues:

.option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.

For the sensible fallback values, we can refer to the range of values defined for each data type in the Spark docs.

This is similar to PERMISSIVE mode but less complex to implement + we won't need metadata.
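A minimal sketch of the proposed strategy, in plain Scala. The type names and `errorCellValue` are hypothetical, and the specific zero value chosen for each type is an assumption for illustration; only the option name and its default come from the proposal above.

```scala
// Hypothetical model of the proposed handling of ERROR cells.
object FallbackSketch {
  sealed trait SqlType
  case object DoubleT extends SqlType
  case object LongT extends SqlType
  case object BooleanT extends SqlType
  case object StringT extends SqlType

  // useFallback mirrors setErrorCellsToFallbackValues: when false (the default),
  // an ERROR cell becomes null; when true, it becomes the assumed zero value
  // of the column's inferred type.
  def errorCellValue(columnType: SqlType, useFallback: Boolean): Any =
    if (!useFallback) null
    else columnType match {
      case DoubleT  => 0.0
      case LongT    => 0L
      case BooleanT => false
      case StringT  => ""
    }
}
```

Either way the value conforms to the inferred column type, so an ERROR cell that appears after the inference excerpt no longer causes a conversion failure.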

EDIT: Please see the latest code changes

nightscape · 2021-02-19T13:09:51Z

Looks good! I'll merge once it's ready from your side 👍

derianpt · 2021-02-20T08:15:21Z

Yup PR is ready to be merged from my side

nightscape · 2021-02-20T13:33:45Z

Hi @derianpt, great job!
Would you be interested in becoming a project contributor?

derianpt · 2021-02-20T16:44:15Z

Sure @nightscape

derianpt · 2021-02-21T05:52:05Z

@nightscape Could I trouble you to release a new version with this fix? I would like to use it in my app 🙇

nightscape · 2021-02-21T19:49:44Z

I just fixed the SNAPSHOT release mechanism, so every commit to main gets a release.
The newest one is 0.13.6+11-1eaf2896-SNAPSHOT which contains your changes.

I would also like to add you as contributor to the project, then you could create a 0.13.7 release.
Do you have 2FA enabled on your account?

derianpt · 2021-02-22T02:58:46Z

Yes I have 2FA enabled.

Derian Tungka added 6 commits February 9, 2021 17:23

add README

d7693b7

Before tests

1534b8f

fix typo in ERROR example

813a883

Add tests

496d65b

scalastyle

e996e88

comment out failing test

1e22ee1

Fix tests & improve example file

0237731

derianpt force-pushed the feature/process-error-cells branch from 8e497af to 0237731 Compare February 18, 2021 11:57

add side effect warning to readme

edac613

derianpt marked this pull request as ready for review February 18, 2021 13:12

Derian Tungka added 2 commits February 19, 2021 17:31

all ok except fixing timestamp zone creation

ed839e0

Create timestamp for test data in local time zone

7e52c8d

nightscape merged commit 72aba15 into nightscape:main Feb 20, 2021

derianpt deleted the feature/process-error-cells branch February 21, 2021 05:51
