Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ERROR cell type when using inferSchema=true #343

Merged
merged 10 commits into from
Feb 20, 2021

Conversation

derianpt
Copy link
Collaborator

Fixes #208

When using inferSchema option, and the excel file we are reading has ERROR type cells, we will have a runtime error:
scala.MatchError: ERROR (of class shadeio.poi.ss.usermodel.CellType)

This PR adds support to have the option of treating ERROR cells as string cells and outputting the error messages instead (e.g. #N/A, #NULL!)

@derianpt
Copy link
Collaborator Author

derianpt commented Feb 10, 2021

@nightscape I'm afraid it's not possible to individually set CellType.ERROR cells as null due to the following reasons:

  1. If we set the SQL type of that cell to NullType, the InferSchema.apply function will still finalise the type of the entire column as StringType after merging.
  2. Other cells for the same column might have valid values (non ERROR), so the final inferred schema would likely be something other than NullType, depending on the data (e.g. StringType, DoubleType, BooleanType)

Please see the commented out test case for more info. Appreciate if you have any idea on how to output individual ERROR cells as null.

If it's not possible, we can just go with the default behaviour of processing ERROR cells as strings, which will at least prevent the runtime error & adds support for them.

@nightscape
Copy link
Owner

@derianpt overall great job!
I didn't find the time to check the tests yet, will hopefully get to it this weekend or early next week.

@derianpt
Copy link
Collaborator Author

@derianpt overall great job!
I didn't find the time to check the tests yet, will hopefully get to it this weekend or early next week.

Hi @nightscape, have you had the time to check out the tests yet? 🙇

@nightscape
Copy link
Owner

I just had a look. I think the problem is that the second column in your example xlsx file contains only errors.
If you put in another row with e.g. a double value the schema should be inferred correctly.

@derianpt
Copy link
Collaborator Author

Thanks for the tip @nightscape. Turns out I was missing the code change to update how cell values should being extracted for ERROR cell types. The tests are now passing.

However, there is a side effect when we treat errors as strings: the entire column will then be interpreted as string type. Understandable and acceptable behaviour imo. This is reflected in the first test case.

I've also put a note of this in the README.

@derianpt derianpt marked this pull request as ready for review February 18, 2021 13:12
@nightscape
Copy link
Owner

nightscape commented Feb 18, 2021

If I understand correctly, the type of the column changes to String only if there is an error in that column, and then only if it is within the first excerptSize rows which are used for schema inference.
If an error happens somewhere after excerptSize rows it would probably fail at runtime because the column was inferred to be e.g. Double type from the error-free first rows and then the error string can't be converted into a Double.
I've read up a little on how error handling is done in other Spark datasources, and their approach is to have different error handling modes (Permissive, Fail Fast, Drop Malformed). In Permissive mode, bad records are stored in a __corrupt_record column. I'm wondering if we should apply the same strategy here?

Something else that is worth knowing is that there is the possibility to add metadata to a column. It would be possible to use this to specify a sensible fallback value for one specific column (e.g. false, 0, "", ...).

@derianpt
Copy link
Collaborator Author

derianpt commented Feb 19, 2021

If I understand correctly, the type of the column changes to String only if there is an error in that column, and then only if it is within the first excerptSize rows which are used for schema inference.
If an error happens somewhere after excerptSize rows it would probably fail at runtime because the column was inferred to be e.g. Double type from the error-free first rows and then the error string can't be converted into a Double.

I have modified the test file & confirmed this behaviour. Looks like blanket converting to String will not work, we need to conform to the inferred column type. However, setting as null still works.

I suggest this strategy;
By default, we convert ERROR cells to null, but we have a boolean option setErrorCellsToFallbackValues:

.option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.

For the sensible fallback values, we can refer to the range of values defined in spark docs

This is similar to PERMISSIVE mode but less complex to implement + we won't need metadata.

EDIT: Please see the latest code changes

@nightscape
Copy link
Owner

Looks good! I'll merge once it's ready from your side 👍

@derianpt
Copy link
Collaborator Author

Yup PR is ready to be merged from my side

@nightscape nightscape merged commit 72aba15 into nightscape:main Feb 20, 2021
@nightscape
Copy link
Owner

Hi @derianpt, great job!
Would you be interested in become a project contributor?

@derianpt
Copy link
Collaborator Author

Sure @nightscape

@derianpt derianpt deleted the feature/process-error-cells branch February 21, 2021 05:51
@derianpt
Copy link
Collaborator Author

@nightscape Could I trouble you to release a new version with this fix? I would like to use it in my app 🙇

@nightscape
Copy link
Owner

I just fixed the SNAPSHOT release mechanism, so every commit to main gets a release.
The newest one is 0.13.6+11-1eaf2896-SNAPSHOT which contains your changes.

I would also like to add you as contributor to the project, then you could create a 0.13.7 release.
Do you have 2FA enabled on your account?

@derianpt
Copy link
Collaborator Author

Yes I have 2FA enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scala Match Error
2 participants