# Corrupt Record Handling

Apache Spark&trade; and Azure Databricks&reg; provide ways to handle corrupt records.

## Working with Corrupt Data

ETL pipelines need robust ways to handle corrupt data, because the opportunities for corruption multiply as data volume and application complexity grow. Corrupt data includes:

* Missing information
* Incomplete information
* Schema mismatch
* Differing formats or data types
* User errors when writing data producers

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that **data engineers must both expect and systematically handle corrupt records.**

In the roadmap for ETL, this is the **Handle Corrupt Records** step:
<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-3.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

Run the following cell to mount the data:

In [4]:
%run "./Includes/Classroom-Setup"

Run the following cell, which contains a corrupt record, `{"a": 1, "b, "c":10}`:

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is not the preferred way to make a DataFrame.  This code allows us to mimic a corrupt record you might see in production.

In [6]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(sc.parallelize(data))
)
display(corruptDF)  

_corrupt_record,a,b,c
,1.0,2.0,3.0
,1.0,2.0,3.0
"{""a"": 1, ""b, ""c"":10}",,,


In the previous results, Spark kept the raw text of the corrupt record in its own column, `_corrupt_record`, and parsed the other records as expected. This is the default behavior for corrupt records, so the `mode` and `columnNameOfCorruptRecord` options were not strictly necessary.

There are three modes for handling corrupt records, set through the `mode` option and [defined in Spark's `ParseMode`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L34):

| `ParseMode` | Behavior |
|-------------|----------|
| `PERMISSIVE` | Keeps each corrupt record in a separate column, named `_corrupt_record` by default (this is the default mode) |
| `DROPMALFORMED` | Drops all corrupt records |
| `FAILFAST` | Throws an exception as soon as it encounters a corrupt record |
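
To make the difference between the three modes concrete, here is a minimal plain-Python sketch that mimics them using the standard `json` module. It is an illustration only and not how Spark implements parsing; the function `parse_lines` and its parameters are hypothetical names introduced for this example.

```python
import json

def parse_lines(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Mimic the three parse modes in plain Python (hypothetical helper, not Spark's code)."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            if mode == "FAILFAST":
                # Stop at the first malformed record
                raise ValueError(f"Malformed record: {line}") from err
            if mode == "PERMISSIVE":
                # Keep the raw text of the malformed record in its own column
                rows.append({corrupt_column: line})
            # DROPMALFORMED: silently skip the malformed record
            continue
        rows.append(record)
    return rows

# The same three records used in the cells above
data = '{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}'.split("|")

permissive = parse_lines(data, "PERMISSIVE")    # 3 rows, the last holds the raw corrupt text
dropped = parse_lines(data, "DROPMALFORMED")    # 2 rows, the corrupt record is gone
```

Calling `parse_lines(data, "FAILFAST")` raises a `ValueError` on the third record, which mirrors the stop-on-first-error behavior described in the table.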

The following cell acts on the same data but drops corrupt records:

In [8]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("mode", "DROPMALFORMED")
  .json(sc.parallelize(data))
)
display(corruptDF)

a,b,c
1,2,3
1,2,3


The following cell throws an error once a corrupt record is found, rather than ignoring or saving the corrupt records:

In [10]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("mode", "FAILFAST")
  .json(sc.parallelize(data))
)
display(corruptDF)

### Recommended Pattern: `badRecordsPath`

Databricks Runtime has [a built-in feature](https://docs.azuredatabricks.net/spark/latest/spark-sql/handling-bad-records.html#handling-bad-records-and-files) that saves corrupt records to a given location. To use it, set the `badRecordsPath` option.

This is a preferred design pattern since it persists the corrupt records for later analysis even after the cluster shuts down.

In [12]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .json(sc.parallelize(data))
)
display(corruptDF)

a,b,c
1,2,3
1,2,3


See the results in `/tmp/badRecordsPath`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Recall that `/tmp` is a directory backed by S3 or Azure Blob Storage and is available to all clusters.

In [14]:
display(spark.read.json("/tmp/badRecordsPath/*/*/*"))

reason,record
"com.fasterxml.jackson.core.JsonParseException: Unexpected character ('c' (code 99)): was expecting a colon to separate field name and value  at [Source: {""a"": 1, ""b, ""c"":10}; line: 1, column: 16]","{""a"": 1, ""b, ""c"":10}"
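
Once bad records accumulate, it can help to group them by failure type before reading individual entries. The following plain-Python sketch assumes the saved entries have already been collected into a list of dictionaries with the `reason` and `record` fields shown above; the helper `exception_class` is a hypothetical name, and the approach of taking the text before the first colon is an assumption about how `reason` strings are shaped, based only on the single example in the output.

```python
from collections import Counter

# One entry copied from the output above; a real run would load many of these
bad_records = [
    {
        "reason": (
            "com.fasterxml.jackson.core.JsonParseException: "
            "Unexpected character ('c' (code 99)): was expecting a colon "
            "to separate field name and value "
            ' at [Source: {"a": 1, "b, "c":10}; line: 1, column: 16]'
        ),
        "record": '{"a": 1, "b, "c":10}',
    },
]

def exception_class(reason):
    """Return the text before the first colon (assumed to be the exception class)."""
    return reason.split(":", 1)[0].strip()

# Count how many bad records each exception class produced
reason_counts = Counter(exception_class(entry["reason"]) for entry in bad_records)
print(reason_counts.most_common())
```

A similar grouping could presumably be done in Spark directly on the DataFrame read from `badRecordsPath`; the plain-Python version is used here only to keep the sketch self-contained.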


## Exercise 1: Working with Corrupt Records

### Step 1: Diagnose the Problem

Import the data used in the last lesson, which is located at `/mnt/training/UbiqLog4UCI/14_F/log*`. Capture the corrupt records in a new column, `SMSCorrupt`.

Save only the columns `SMS` and `SMSCorrupt` to the new DataFrame `SMSCorruptDF`.

In [17]:
# TODO
from pyspark.sql.functions import col

path = "/mnt/training/UbiqLog4UCI/14_F/log*"

# Read permissively, routing unparseable lines to the `SMSCorrupt` column
SMSCorruptDF = (spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "SMSCorrupt")
    .json(path)
    .select("SMS", "SMSCorrupt")
    .filter(col("SMSCorrupt").isNotNull())  # keep only the corrupt rows
)

display(SMSCorruptDF)

SMS,SMSCorrupt
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:14"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:20"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:30"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:26"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:59"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:59"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:26"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:30"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"


In [18]:
# TEST - Run this cell to test your solution
cols = set(SMSCorruptDF.columns)
SMSCount = SMSCorruptDF.cache().count()

dbTest("ET1-P-06-01-01", {'SMS', 'SMSCorrupt'}, cols)
dbTest("ET1-P-06-01-02", 8, SMSCount)

print("Tests passed!")

Examine the corrupt records to determine what is wrong with them.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Take a look at the name in metadata.

The entry `{"name": "mr Khojasteh"flash""}` is invalid JSON because the unescaped double quote before `flash` is interpreted as the end of the string value. It should use single quotes around `flash` (or escaped double quotes) and read `{"name": "mr Khojasteh'flash'"}` instead.

The optimal solution is to fix the initial producer of the data to correct the problem at its source.  In the meantime, you could write ad hoc logic to turn this into a readable field.
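
As one possible shape for such ad hoc logic, the sketch below repairs the specific `"flash""` pattern seen in the corrupt records and then parses the line with the standard `json` module. It is a stopgap for this one known defect, not a general JSON repair routine; the helper names `repair_flash_quotes` and `parse_with_repair` are hypothetical.

```python
import json

# One corrupt record, copied from the output of Step 1
raw = ('{"SMS":{"Address":"+98912503####","type":"1",'
       '"date":"12-24-2013 11:52:14","body":"ANONYMIZED","Type":"1", '
       '"metadata": {"name": "mr Khojasteh"flash""}}}')

def repair_flash_quotes(line):
    """Swap the unescaped double quotes around `flash` for single quotes."""
    return line.replace('"flash""', "'flash'\"")

def parse_with_repair(line):
    """Parse a line as JSON, retrying once with the targeted repair applied."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Only attempt the repair when the line fails to parse as-is
        return json.loads(repair_flash_quotes(line))

repaired = parse_with_repair(raw)
print(repaired["SMS"]["metadata"]["name"])  # prints: mr Khojasteh'flash'
```

Applying the repair only after a parse failure leaves well-formed records untouched, and a line that is still invalid after the repair raises an error rather than being silently altered.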

### Step 2: Use `badRecordsPath`

Use the `badRecordsPath` option to save corrupt records to the directory `/tmp/corruptSMS`.

In [22]:
# TODO
SMSCorruptDF2 = (spark.read
    .option("badRecordsPath", "/tmp/corruptSMS")  # corrupt records are saved here
    .json(path)
)

display(SMSCorruptDF2)

Application,Bluetooth,Call,Location,SMS,WiFi
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:56:44 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -85, Thursday, January 9, 2014 11:56:44 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -86, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -86, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:44 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -85, Thursday, January 9, 2014 11:57:44 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:58:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -87, Thursday, January 9, 2014 11:58:19 PM Iran Standard Time)"


In [23]:
# TEST - Run this cell to test your solution
SMSCorruptDF2.count()
corruptCount = spark.read.json("/tmp/corruptSMS/*/*/*").count()

dbTest("ET1-P-06-02-01", True, corruptCount >= 8)

print("Tests passed!")

## Review
**Question:** By default, how are corrupt records dealt with using `spark.read.json()`?  
**Answer:** They appear in a column called `_corrupt_record`.

**Question:** How can a query persist corrupt records in a separate destination?  
**Answer:** The Databricks feature `badRecordsPath` allows a query to save corrupt records to a given location so the pipeline engineer can investigate corruption issues.

## Next Steps

Start the next lesson, [Loading Data and Productionalizing]($./07-Loading-Data-and-Productionalizing).

## Additional Topics & Resources

**Q:** Where can I get more information on dealing with corrupt records?  
**A:** Check out the Spark Summit talk <a href="https://databricks.com/session/exceptions-are-the-norm-dealing-with-bad-actors-in-etl" target="_blank">Exceptions are the Norm: Dealing with Bad Actors in ETL</a>.