@rashtao (Collaborator) commented Dec 7, 2021

Allow setting the bad record handling policy via a config parameter, i.e.:

spark.read
    .option("mode", "PERMISSIVE|DROPMALFORMED|FAILFAST")

Review JacksonParser behavior and the cases in which it throws BadRecordException. Review the official documentation of DataFrameReader#json (https://spark.apache.org/docs/3.1.2/api/java/org/apache/spark/sql/DataFrameReader.html#json-java.lang.String...-), in particular:

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.

    PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field 
      configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt 
      records, a user can set a string type field named columnNameOfCorruptRecord in a 
      user-defined schema. If a schema does not have the field, it drops corrupt records during 
      parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in 
      an output schema.
    DROPMALFORMED : ignores the whole corrupted records.
    FAILFAST : throws an exception when it meets corrupted records.

columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): 
  allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides 
  spark.sql.columnNameOfCorruptRecord.
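
The three modes quoted above can be illustrated with a small plain-Python model. This is only a sketch of the documented semantics, not the JacksonParser logic or the code in this PR: `parse_records` and its parameters are hypothetical names, a record counts as corrupt here only when it is not a valid JSON object (type mismatches against a schema are not modeled), and the corrupt-record column is assumed to always be part of the output, using Spark's default name `_corrupt_record`.

```python
import json

MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def parse_records(lines, fields, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Toy model of the three bad-record modes (hypothetical helper, not Spark code)."""
    mode = mode.upper()
    if mode not in MODES:
        # Choice made by this sketch only: reject unknown mode names.
        raise ValueError(f"unknown mode: {mode}")

    rows = []
    for raw in lines:
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("record is not a JSON object")
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            if mode == "FAILFAST":
                # Abort on the first corrupt record.
                raise ValueError(f"malformed record: {raw!r}") from err
            if mode == "DROPMALFORMED":
                # Skip the whole corrupt record.
                continue
            # PERMISSIVE: null out the data fields, keep the raw string.
            row = {name: None for name in fields}
            row[corrupt_column] = raw
            rows.append(row)
            continue

        # Well-formed record: copy the requested fields, no corrupt text.
        row = {name: parsed.get(name) for name in fields}
        row[corrupt_column] = None
        rows.append(row)
    return rows


lines = [
    '{"name": "a", "age": 1}',
    '{"name": "b", "age": ',  # truncated, therefore malformed
    '{"name": "c", "age": 3}',
]
permissive = parse_records(lines, ["name", "age"], "PERMISSIVE")
dropped = parse_records(lines, ["name", "age"], "DROPMALFORMED")
```

With this input, `permissive` keeps all three rows and stores the truncated line in `_corrupt_record` of the second one, `dropped` keeps only the two well-formed rows, and calling the helper with `"FAILFAST"` raises on the second line.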

@sonarqubecloud (bot) commented Dec 8, 2021

SonarCloud Quality Gate failed.

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
2 Code Smells (rating A)

0.0% Coverage
5.6% Duplication

@rashtao merged commit 33dadd1 into devel Dec 8, 2021
@rashtao deleted the feature/bad_records branch July 19, 2022 07:54