<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Apache Spark: Handle Corrupt/Bad Records

Handling corrupt records is often one of the most expensive parts of writing an ingestion job, and the larger the pipeline, the harder it becomes to deal with bad records along the way. Ingestion pipelines therefore need a systematic way to handle them. Corrupt data includes:
* Missing information
* Incomplete information
* Schema mismatch
* Differing formats or data types

Since ingestion pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that data engineers must both expect and systematically handle corrupt records.

Before getting to the main topic, let's look at the stages of an ingestion pipeline and where the handling of corrupt records fits in.

![](https://www.quantiaconsulting.com/img/ETL-Process-3.png)

As the diagram shows, it is good practice to handle corrupt/bad records just before loading the final result. The question, then, is how to handle them. The rest of this notebook answers it.

## Let's prepare the environment

In [None]:
%load_ext autotime

import os
from pyspark.sql import SparkSession
import boto3
import io

baseUri = "/home/jovyan/materials/local-data/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)

spark

## First let's recall the "all-good" case

We will work through a complete example of handling bad records in JSON data. First, though, let's recall the "all-good" case.

Let’s say this is the **good** JSON data:

```
{"a": 1, "b":2, "c":3}
{"a": 10, "b":20, "c":30}
{"a": 100, "b":30, "c":300}
```

In [None]:
goodDf = (spark.read.json(baseUri+"good.json"))

In [None]:
goodDf.show()

## Handle Corrupt/bad records

Let’s say this is the **bad** JSON data:

```
{"a": 1, "b":2, "c":3}
{"a": 10, "b":20, "c":30}
{"a": 100, "b, "c":300}
```

In the above JSON data, `{"a": 100, "b, "c":300}` is the bad record: the key `"b` is missing its closing quote and its value.
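To reproduce the two sample files and confirm which line is malformed, a minimal plain-Python sketch follows. The helper names (`write_json_lines`, `malformed_line_numbers`) and the temporary directory standing in for `baseUri` are assumptions made for illustration, and `json.loads` is used here only as a simple validity check; it is not the parser Spark uses.

```python
import json
import os
import tempfile

GOOD_LINES = [
    '{"a": 1, "b":2, "c":3}',
    '{"a": 10, "b":20, "c":30}',
    '{"a": 100, "b":30, "c":300}',
]
BAD_LINES = [
    '{"a": 1, "b":2, "c":3}',
    '{"a": 10, "b":20, "c":30}',
    '{"a": 100, "b, "c":300}',
]


def write_json_lines(directory, name, lines):
    """Write one JSON document per line and return the file path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return path


def malformed_line_numbers(path):
    """Return the 1-based numbers of the lines that are not valid JSON."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
    return bad


# Hypothetical stand-in for baseUri; point this at your own data directory.
sample_dir = tempfile.mkdtemp()
good_path = write_json_lines(sample_dir, "good.json", GOOD_LINES)
bad_path = write_json_lines(sample_dir, "bad.json", BAD_LINES)

print(malformed_line_numbers(good_path))  # → []
print(malformed_line_numbers(bad_path))   # → [3]
```

Only the third line of `bad.json` fails to parse, which is the record the rest of this notebook deals with.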

The question now is how to handle this record.

There are three ways to handle this type of data:
* A) To include this data in a separate column
* B) To ignore all bad records
* C) To throw an exception when a corrupt record is encountered

Let's look at each of these three ways in detail.
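Before turning to Spark itself, the three behaviours can be emulated in a few lines of plain Python. This is only a sketch of the semantics, not how Spark implements them: the function `read_json_lines` is hypothetical, the mode names are borrowed from the Spark options used in the sections that follow, and a missing key plays the role of a null column.

```python
import json

BAD_LINES = [
    '{"a": 1, "b":2, "c":3}',
    '{"a": 10, "b":20, "c":30}',
    '{"a": 100, "b, "c":300}',
]

MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def read_json_lines(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Parse JSON lines, treating malformed ones according to `mode`."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as error:
            if mode == "PERMISSIVE":
                # A) keep the row and store the raw text in a separate column
                rows.append({corrupt_column: line})
            elif mode == "DROPMALFORMED":
                # B) silently skip the bad record
                continue
            else:
                # C) FAILFAST: stop at the first bad record
                raise ValueError(f"Malformed record: {line}") from error
    return rows


permissive = read_json_lines(BAD_LINES, mode="PERMISSIVE")
dropped = read_json_lines(BAD_LINES, mode="DROPMALFORMED")
print(len(permissive), len(dropped))  # → 3 2

try:
    read_json_lines(BAD_LINES, mode="FAILFAST")
except ValueError as error:
    print(error)
```

The trade-off is the same one Spark offers: the first mode loses nothing but forces later stages to deal with the extra column, the second loses data silently, and the third stops the load entirely.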

### A) To include this data in a separate column

If the use case requires keeping bad records, set the `mode` option to `PERMISSIVE`. Spark then stores each bad record in a separate column, whose name is set by the `columnNameOfCorruptRecord` option.

Example:

In [None]:
corruptDf = (spark.read
             .option("mode", "PERMISSIVE")
             .option("columnNameOfCorruptRecord", "_corrupt_record")
             .json(baseUri+"bad.json"))

Let's display `corruptDf`:

In [None]:
corruptDf.show()

**NOTE**: with an inferred schema, as here, the `_corrupt_record` column only appears if there is at least one corrupt record.

In [None]:
(spark.read
 .option("mode", "PERMISSIVE")
 .option("columnNameOfCorruptRecord", "_corrupt_record")
 .json(baseUri+"good.json")
 .show())

How many corrupt records are there?

Directly counting them gives an error ...

In [None]:
from pyspark.sql.functions import col

# Fails: the query references only the internal corrupt record column
corruptDf.filter(col("_corrupt_record").isNotNull()).count()

**Note the error**: since Spark 2.3, queries on raw JSON/CSV files are disallowed when the referenced columns include only the internal corrupt record column.

To query the corrupt record column, first cache (or save) the parsed results and then run the query against them.

In [None]:
badRows = corruptDf.filter(col("_corrupt_record").isNotNull())
badRows.cache()

In [None]:
badRows.count()

### B) To ignore all bad records 

If the use case requires storing only the correct records and discarding the bad ones entirely, use the `DROPMALFORMED` mode.

Example:

In [None]:
cleanDf = (spark.read
             .option("mode", "DROPMALFORMED")
             .json(baseUri+"bad.json"))

Let's display `cleanDf`:

In [None]:
cleanDf.show()

As a result, only the correct records are kept and the bad records are dropped.


### C) To throw an exception when a corrupt record is encountered

In this use case, any bad record causes an exception to be thrown. The mode for this is `FAILFAST`, and it is best practice to wrap the read in a `try`/`except` block.

Example:

In [None]:
try:
    anotherCorruptDf = (spark.read
        .option("mode", "FAILFAST")
        .json(baseUri+"bad.json")
    )
except Exception as e:
    print(e)

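The read therefore throws an error and no data is loaded. Because the exception is raised before the assignment completes, `anotherCorruptDf` is never defined, and referencing it raises a `NameError`: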

In [None]:
anotherCorruptDf

### Acknowledgements 

This notebook is partially based on [Apache Spark: Handle Corrupt/Bad Records](https://blog.knoldus.com/apache-spark-handle-corrupt-bad-records/) by Divyansh Jain, published on April 5, 2020.

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) Quantia Consulting, srl. All rights reserved.