In [4]:
%run "./Includes/Classroom-Setup"

### Idempotent Failure Recovery

Jobs can fail for any number of reasons.  The majority of job failures are caused by input/output (I/O) problems, but other causes include schema evolution, data corruption, and hardware failures.  Recovery from job failure should be guided by the principle of *idempotence*: the property whereby an operation can be applied multiple times without changing the result beyond the first application.

More formally, a function `f` is idempotent if applying it to `x` once gives the same result as applying it two or more times:

&nbsp;&nbsp;&nbsp;&nbsp;`f(x) = f(f(x)) = f(f(f(x))) = ...`
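As a small illustration of this definition, the sketch below checks `f(f(x)) == f(x)` for two toy functions. The function names and inputs are hypothetical and are not part of the lesson's dataset.

```python
# Hypothetical toy functions used only to illustrate the definition.
def normalize(s):
    """Idempotent: stripping and lower-casing a second time changes nothing."""
    return s.strip().lower()

def append_suffix(s):
    """Not idempotent: every application adds another suffix."""
    return s + "_v2"

def is_idempotent_on(f, x):
    """Check f(f(x)) == f(x) for a single input x."""
    return f(f(x)) == f(x)

print(is_idempotent_on(normalize, "  EDGAR "))   # True
print(is_idempotent_on(append_suffix, "batch"))  # False
```

Note that this checks a single input only; it can show that a function is not idempotent, but it does not prove idempotence in general.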

In ETL job recovery, we need to be able to run a job multiple times and get our data into our target database without duplicates.  This can be accomplished in a few ways:<br><br>

* A **left antijoin** of new data on data already in a database returns only the rows that have not yet been inserted
* Overwriting all data is a resource-intensive way to ensure that all data was written
* The transactionality of databases enables all-or-nothing writes, where failure of any part of the job results in no committed data
* Leveraging primary keys in a database lets a job write only the rows whose primary key is not already present, or upsert the rows whose key already exists
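The last two strategies can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumptions, not the lesson's method: the table name `production`, the `size` column, and the sample rows are hypothetical, and the composite key simply reuses the `ip`, `date`, and `time` columns.

```python
import sqlite3

# In-memory database with a composite primary key (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE production ("
    "ip TEXT, date TEXT, time TEXT, size INTEGER, "
    "PRIMARY KEY (ip, date, time))"
)

# Hypothetical sample batch.
batch = [
    ("10.0.0.1", "2017-03-29", "00:00:00", 100),
    ("10.0.0.2", "2017-03-29", "00:00:01", 200),
]

def load(conn, rows):
    # `with conn` wraps the inserts in one transaction:
    # commit on success, rollback if an exception is raised.
    with conn:
        # INSERT OR IGNORE skips rows whose primary key already exists.
        conn.executemany(
            "INSERT OR IGNORE INTO production VALUES (?, ?, ?, ?)", rows
        )

load(conn, batch)
load(conn, batch)  # rerunning the same load adds no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM production").fetchone()[0]
print(row_count)  # 2
```

Because the second `load` call is a no-op, the job can be rerun after a failure without first checking what was already written.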

### One Idempotent Strategy: Left Antijoin

In traditional ETL, a job recovery strategy for the case where only partial data was written to the database would look something like the following:

```
begin transaction;
  delete from production_table where batch_period = failed_batch_period;
  insert into production_table select * from staging_table;
  drop table staging_table;  
end transaction;
```

This won't work in a Spark environment because its data structures are immutable, so rows cannot be deleted in place.  One alternative strategy among the several listed in the cell above relies on a left antijoin, which returns all rows in the left table that have no match in the right table.

Run the following cell to create a mock production table and a mock staging table. The cell creates the staging table from a Parquet file that contains log records, then creates a production table that holds only about 20 percent of the records from staging.

In [9]:
from pyspark.sql.functions import col 

staging_table = (spark.read.parquet("/mnt/training/EDGAR-Log-20170329/enhanced/EDGAR-Log-20170329-sample.parquet/")
  .dropDuplicates(['ip', 'date', 'time']))

production_table = staging_table.sample(.2, seed=123)

Run the following cell to see that `production_table` has only about 20% of the data from `staging_table`. The fraction is approximate because `sample` draws rows randomly.

In [11]:
production_table.count() / staging_table.count()

Join the two tables using a left antijoin.

In [13]:
failedDF = staging_table.join(production_table, on=["ip", "date", "time"], how="left_anti")

Union `production_table` with the results from the left antijoin.

Append operations are generally not idempotent as they can result in duplicate records.  Streaming operations that maintain state and append to an always up-to-date Parquet or Databricks Delta table are idempotent.

In [15]:
fullDF = production_table.union(failedDF)

The two tables now contain the same number of records.

In [17]:
staging_table.count() == fullDF.count()
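To see why this recovery is idempotent, the sketch below reproduces the antijoin-then-union logic on plain Python lists rather than Spark DataFrames, then runs the recovery a second time. The helper names `left_anti` and `recover` and the sample rows are hypothetical; only the key columns `ip`, `date`, and `time` come from the cells above.

```python
def left_anti(left, right, keys):
    """Return rows of `left` whose key does not appear in `right`."""
    right_keys = {tuple(r[k] for k in keys) for r in right}
    return [r for r in left if tuple(r[k] for k in keys) not in right_keys]

def recover(staging, production, keys):
    """Append to production only the staging rows it is missing."""
    return production + left_anti(staging, production, keys)

keys = ["ip", "date", "time"]

# Hypothetical sample data: 10 staging rows, 2 already in production.
staging = [
    {"ip": f"10.0.0.{i}", "date": "2017-03-29", "time": "00:00:00"}
    for i in range(10)
]
production = staging[:2]

once = recover(staging, production, keys)
twice = recover(staging, once, keys)  # rerun the recovery on its own output

print(len(once), len(twice))  # 10 10
```

The second run finds nothing left to insert, so repeating the recovery after a partial failure does not introduce duplicates.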

### Monitoring Jobs for Failure

Monitoring for job failure often entails a server or cluster that tracks job performance.  One common monitoring table to build using this server or cluster is as follows:<br><br>

1. `batchID`: A unique ID for each batch
2. `runID`: The ID that matches the API call to execute a job
3. `time`: Time of the query
4. `status`: Status of the job

On job failure, jobs can be retried multiple times before fully failing.  Initial attempts can use spot instances to reduce costs, falling back to on-demand resources as needed.
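The monitoring table and retry behavior described above might be sketched as follows. This is only an illustration under assumptions: `run_with_retries`, `flaky_job`, and the in-memory `monitoring_table` list are hypothetical, the `runID` is a random UUID standing in for the ID a job-execution API would return, and the fallback from spot to on-demand instances is not modeled.

```python
import time
import uuid

# Hypothetical in-memory stand-in for the monitoring table.
monitoring_table = []

def run_with_retries(job, batch_id, max_attempts=3, wait_seconds=0.0):
    """Run `job`, logging one monitoring row per attempt."""
    for attempt in range(1, max_attempts + 1):
        run_id = str(uuid.uuid4())  # stand-in for an API-provided run ID
        try:
            job()
            status = "SUCCESS"
        except Exception:
            status = "FAILED"
        monitoring_table.append(
            {"batchID": batch_id, "runID": run_id,
             "time": time.time(), "status": status}
        )
        if status == "SUCCESS":
            return True
        time.sleep(wait_seconds)  # pause before the next attempt
    return False

# Hypothetical job that fails twice with a simulated I/O error, then succeeds.
attempts = {"n": 0}

def flaky_job():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("simulated I/O failure")

succeeded = run_with_retries(flaky_job, batch_id="batch-001")
print(succeeded, [row["status"] for row in monitoring_table])
# True ['FAILED', 'FAILED', 'SUCCESS']
```

Retrying is only safe when the job itself is idempotent; otherwise each failed attempt could leave partial or duplicate data behind.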

## Review
**Question:** What is idempotence?  
**Answer:** For ETL jobs, idempotence is the ability to run the same job multiple times without getting duplicate data.  This is the primary axiom for ensuring that ETL workloads do not have any unexpected behavior.

**Question:** How can I accomplish idempotence in Spark jobs?  
**Answer:** There are a number of strategies for accomplishing this.  Doing an antijoin of your full data on already loaded data is one method.  This can take the form of an incremental update script that runs in the case of job failure.  By counting the records at the beginning and end of a job, you can detect whether any unexpected behavior demands the use of this incremental script.

**Question:** How can I detect job failure?  
**Answer:** This depends largely on the pipeline you're creating.  One common best practice is to have a monitoring job that periodically checks jobs for failure.  This can be tied to email or other alerting mechanisms.

In [21]:
%run "./Includes/Classroom-Cleanup"