In [4]:
%run "./Includes/Classroom-Setup"

### The Spark Approach

Spark offers a compute engine and connectors to virtually any data source. By leveraging easily scaled infrastructure and accessing data where it lives, Spark addresses the core needs of a big data application.

These principles comprise the Spark approach to ETL, providing a unified and scalable approach to big data pipelines: <br><br>

1. Databricks and Spark offer a **unified platform** 
 - Spark on Databricks combines ETL, stream processing, machine learning, and collaborative notebooks.
 - Data scientists, analysts, and engineers can write Spark code in Python, Scala, SQL, and R.
2. Spark's unified platform is **scalable to petabytes of data and clusters of thousands of nodes**.  
 - The same code written on smaller data sets scales to large workloads, often with only small changes.
3. Spark on Databricks decouples data storage from the compute and query engine.  
 - Spark's query engine **connects to any number of data sources** such as S3, Azure Blob Storage, Redshift, and Kafka.  
 - This **minimizes costs**; a dedicated cluster does not need to be maintained and the compute cluster is **easily updated to the latest version** of Spark.

### A Basic ETL Job

In this lesson you use web log files from the <a href="https://www.sec.gov/dera/data/edgar-log-file-data-set.html" target="_blank">US Securities and Exchange Commission website</a> to perform a basic ETL job on one day of server activity. You will extract the fields of interest and load them into persistent storage.

The Databricks File System (DBFS) is an HDFS-like interface to bulk data stores like Amazon's S3 and Azure's Blob storage service.

Pass the path `...csv` into `spark.read.csv` to access data stored in DBFS. Use the `header` option to specify that the first line of the file is the header.

In [10]:
path = "/mnt/training/*.csv"

logDF = (spark
  .read
  .option("header", True)
  .csv(path)
  .sample(withReplacement=False, fraction=0.3, seed=3) # using a sample to reduce data size
)

display(logDF)

ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser
101.71.41.ihh,2017-03-29,00:00:00,0.0,1437491.0,0001245105-17-000052,xslF345X03/primary_doc.xml,301.0,687.0,0.0,0.0,0.0,10.0,0.0,
104.196.240.dda,2017-03-29,00:00:00,0.0,1270985.0,0001188112-04-001037,.txt,200.0,7619.0,0.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-07-006108,-index.htm,200.0,2727.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-08-001993,-index.htm,200.0,2710.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0001104659-09-046963,-index.htm,200.0,2715.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002243,-index.htm,200.0,2786.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002251,-index.htm,200.0,2784.0,1.0,0.0,0.0,10.0,0.0,
108.240.248.gha,2017-03-29,00:00:00,0.0,1540159.0,0001217160-12-000029,f332scottlease.htm,200.0,49578.0,0.0,0.0,0.0,10.0,0.0,
108.59.8.fef,2017-03-29,00:00:00,0.0,732834.0,0001209191-15-017349,xslF345X03/doc4.xml,301.0,673.0,0.0,0.0,0.0,10.0,0.0,
108.91.91.hbc,2017-03-29,00:00:00,0.0,1629769.0,0001209191-17-023204,.txt,301.0,675.0,0.0,0.0,0.0,10.0,0.0,


Next, review the server-side errors, which have error codes in the 500s.

In [12]:
from pyspark.sql.functions import col

serverErrorDF = (logDF
  .filter((col("code") >= 500) & (col("code") < 600))
  .select("date", "time", "extention", "code")
)

display(serverErrorDF)

date,time,extention,code
2017-03-29,00:00:12,.txt,503.0
2017-03-29,00:00:16,-index.htm,503.0
2017-03-29,00:00:24,-index.htm,503.0
2017-03-29,00:00:44,-index.htm,503.0
2017-03-29,00:01:01,-index.htm,503.0
2017-03-29,00:01:01,-index.htm,503.0
2017-03-29,00:01:02,-index.htm,503.0
2017-03-29,00:01:03,-index.htm,503.0
2017-03-29,00:01:03,-index.htm,503.0
2017-03-29,00:01:04,-index.htm,503.0


### Data Validation

One aspect of ETL jobs is to validate that the data is what you expect.  This includes:<br><br>
* Approximately the expected number of records
* The expected fields are present
* No unexpected missing values
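
The three checks above can be sketched as small helpers. This is a minimal sketch rather than part of the course files: the names `EXPECTED_COLUMNS`, `missing_columns`, `count_in_range`, and `null_counts` are hypothetical, and the 20% tolerance is an arbitrary assumption.

```python
# Hypothetical helpers; these names are illustrative, not part of the course files.
EXPECTED_COLUMNS = ["date", "time", "extention", "code"]  # columns selected into serverErrorDF


def missing_columns(actual_columns, expected_columns):
    """Return the expected column names that are absent from actual_columns."""
    actual = set(actual_columns)
    return [name for name in expected_columns if name not in actual]


def count_in_range(n_records, expected, tolerance=0.2):
    """Return True when n_records is within +/- tolerance (a fraction) of expected."""
    return abs(n_records - expected) <= tolerance * expected


def null_counts(df, columns):
    """Count null values per column of a Spark DataFrame; returns {column: n_nulls}."""
    # Imported here so the pure-Python helpers above work without a Spark installation.
    from pyspark.sql.functions import col, count, when

    # when() without otherwise() yields null for non-matching rows, and count() skips nulls.
    exprs = [count(when(col(name).isNull(), 1)).alias(name) for name in columns]
    return df.select(exprs).first().asDict()
```

In the notebook, `missing_columns(serverErrorDF.columns, EXPECTED_COLUMNS)` would be expected to return an empty list, and `null_counts(serverErrorDF, EXPECTED_COLUMNS)` would show whether any field has unexpected missing values.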

Take a look at the server-side errors by hour to confirm the data meets your expectations. Visualize it by selecting the bar graph icon once the table is displayed. <br><br>

In [16]:
from pyspark.sql.functions import from_utc_timestamp, hour, col

countsDF = (serverErrorDF
  .select(hour(from_utc_timestamp(col("time"), "GMT")).alias("hour"))
  .groupBy("hour")
  .count()
  .orderBy("hour")
)

display(countsDF)

hour,count
0,2030
1,1638
2,1123
3,1093
4,1118
5,1168
6,1089
7,1054
8,1055
9,1022


The distribution of errors by hour meets expectations.  There is an uptick in errors around midnight, possibly due to server maintenance at that time.

### Saving Back to DBFS

A common and highly effective design pattern in the Databricks and Spark ecosystem involves loading structured data back to DBFS as a Parquet file. Learn more about [Parquet, a scalable and optimized data storage format, here](http://parquet.apache.org/).

Save the parsed DataFrame back to DBFS as parquet using the `.write` method.

In [19]:
targetPath = workingDir + "....parquet"

(serverErrorDF
  .write
  .mode("overwrite") # overwrites a file if it already exists
  .parquet(targetPath)
)
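
One way to confirm a write like the one above is to read the data back and compare it with the DataFrame that was written. The helper below is a hedged sketch: `round_trip_matches` is a hypothetical name, and it only compares column names and row counts, not values.

```python
def round_trip_matches(spark, df, path):
    """Read the Parquet data at path back and compare it with df.

    Compares column names and row counts only; it does not compare values.
    """
    written = spark.read.parquet(path)
    return written.columns == df.columns and written.count() == df.count()
```

In the notebook this would be called as `round_trip_matches(spark, serverErrorDF, targetPath)`. Note that `df.count()` re-runs the extract and transform steps, so on large inputs the check is not free.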

### Our ETL Pipeline

Here's what the ETL pipeline you just built looks like.  In the rest of this course you will work with more complex versions of this general pattern.

| Code | Stage |
|------|-------|
| `logDF = (spark`                                                                          | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.read`                                                           | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.option("header", True)`                                         | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.csv(<source>)`                                                  | Extract |
| `)`                                                                                       | Extract |
| `serverErrorDF = (logDF`                                                                  | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.filter((col("code") >= 500) & (col("code") < 600))`             | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.select("date", "time", "extention", "code")`                    | Transform |
| `)`                                                                                       | Transform |
| `(serverErrorDF.write`                                                                 | Load |
| &nbsp;&nbsp;&nbsp;&nbsp;`.parquet(<destination>))`                                      | Load |

This is a distributed job, so it can easily scale to fit the demands of your data set.
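
The three stages in the table can also be wrapped in a single function so the same job runs against different sources and destinations. This is a sketch under assumptions: `run_server_error_etl` is a hypothetical name, the sampling step is omitted as in the table, and the filter is written as a SQL expression string (which `DataFrame.filter` accepts) instead of `col` expressions.

```python
def run_server_error_etl(spark, source, destination):
    """Extract web logs from CSV, keep 5xx responses, and load them as Parquet."""
    # Extract: read the CSV files, treating the first line as the header.
    log_df = spark.read.option("header", True).csv(source)

    # Transform: keep server-side errors and the fields of interest.
    # A SQL expression string is used so that no column helper import is needed.
    server_error_df = (log_df
      .filter("code >= 500 AND code < 600")
      .select("date", "time", "extention", "code")
    )

    # Load: write the result as Parquet, replacing any earlier output.
    server_error_df.write.mode("overwrite").parquet(destination)
    return server_error_df
```

Passing the session in as an argument keeps the function free of notebook globals, which makes it easier to reuse outside this lesson.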

## Exercise 1: Perform an ETL Job

Write a basic ETL script that captures the 20 most active website users and loads the results to DBFS.

### Step 1: Create a DataFrame of Aggregate Statistics

Create a DataFrame `ipCountDF` that uses `logDF` to create a count of each time a given IP address appears in the logs, with the counts sorted in descending order.  The result should have two columns: `ip` and `count`.
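
To make the target shape concrete, here is a plain-Python illustration of the same aggregation on a tiny input. It is only a conceptual sketch, not Spark code: `sample_ips` holds the `ip` values of the ten sample rows displayed earlier, and `collections.Counter` stands in for the group, count, and sort steps.

```python
from collections import Counter

# IP values copied from the ten sample rows of logDF displayed earlier.
sample_ips = [
    "101.71.41.ihh", "104.196.240.dda",
    "107.23.85.jfd", "107.23.85.jfd", "107.23.85.jfd", "107.23.85.jfd", "107.23.85.jfd",
    "108.240.248.gha", "108.59.8.fef", "108.91.91.hbc",
]

# Counter tallies occurrences per key (the grouping and counting step);
# most_common() returns (ip, count) pairs sorted by count, descending (the ordering step).
ip_counts = Counter(sample_ips).most_common()

print(ip_counts[0])  # ('107.23.85.jfd', 5)
```

The Spark version produces the same two-column result, but the counting is distributed across the cluster instead of running in a single Python process.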

In [24]:
display(logDF)

ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser
101.71.41.ihh,2017-03-29,00:00:00,0.0,1437491.0,0001245105-17-000052,xslF345X03/primary_doc.xml,301.0,687.0,0.0,0.0,0.0,10.0,0.0,
104.196.240.dda,2017-03-29,00:00:00,0.0,1270985.0,0001188112-04-001037,.txt,200.0,7619.0,0.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-07-006108,-index.htm,200.0,2727.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-08-001993,-index.htm,200.0,2710.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0001104659-09-046963,-index.htm,200.0,2715.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002243,-index.htm,200.0,2786.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002251,-index.htm,200.0,2784.0,1.0,0.0,0.0,10.0,0.0,
108.240.248.gha,2017-03-29,00:00:00,0.0,1540159.0,0001217160-12-000029,f332scottlease.htm,200.0,49578.0,0.0,0.0,0.0,10.0,0.0,
108.59.8.fef,2017-03-29,00:00:00,0.0,732834.0,0001209191-15-017349,xslF345X03/doc4.xml,301.0,673.0,0.0,0.0,0.0,10.0,0.0,
108.91.91.hbc,2017-03-29,00:00:00,0.0,1629769.0,0001209191-17-023204,.txt,301.0,675.0,0.0,0.0,0.0,10.0,0.0,


In [25]:
# TODO
from pyspark.sql.functions import desc

ipCountDF = (logDF
  .select("ip")
  .groupBy("ip")
  .count()
  .orderBy(desc("count"))
)

display(ipCountDF)

ip,count
213.152.28.bhe,518548
158.132.91.haf,497361
117.91.6.caf,239912
132.195.122.djf,197267
117.91.2.aha,152731
173.52.208.ehd,146767
108.91.91.hbc,143232
117.91.7.hgh,133447
97.100.78.cjb,130156
217.174.255.dgd,123039


In [26]:
# TEST - Run this cell to test your solution
ip1, count1 = ipCountDF.first()
cols = set(ipCountDF.columns)

dbTest("ET1-P-02-01-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-01-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-01-03", True, 'count' in cols)
dbTest("ET1-P-02-01-04", True, 'ip' in cols)

print("Tests passed!")

### Step 2: Save the Results

Use your temporary folder to save the results back to DBFS as `workingDir + "/ipCount.parquet"`

In [28]:
# TODO
writePath = workingDir + "/ipCount.parquet"

(ipCountDF
  .write
  .mode("overwrite") # overwrites a file if it already exists
  .parquet(writePath)
)

In [29]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import desc

writePath = workingDir + "/ipCount.parquet"

ipCountDF2 = (spark
  .read
  .parquet(writePath)
  .orderBy(desc("count"))
)
ip1, count1 = ipCountDF2.first()
cols = ipCountDF2.columns

dbTest("ET1-P-02-02-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-02-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-02-03", True, "count" in cols)
dbTest("ET1-P-02-02-04", True, "ip" in cols)

print("Tests passed!")

Check that the load worked by listing the files in **`writePath`**.

Spark writes Parquet output as a directory containing a number of part files.

If successful, you see a `_SUCCESS` file as well as the data split across a number of parts.

In [31]:
display(dbutils.fs.ls(writePath))

path,name,size
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/_SUCCESS,_SUCCESS,0
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/_committed_1839735248138967219,_committed_1839735248138967219,3458
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/_started_1839735248138967219,_started_1839735248138967219,0
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00000-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1135-1-c000.snappy.parquet,part-00000-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1135-1-c000.snappy.parquet,5372
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00001-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1136-1-c000.snappy.parquet,part-00001-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1136-1-c000.snappy.parquet,4963
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00002-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1137-1-c000.snappy.parquet,part-00002-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1137-1-c000.snappy.parquet,5037
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00003-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1138-1-c000.snappy.parquet,part-00003-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1138-1-c000.snappy.parquet,4991
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00004-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1139-1-c000.snappy.parquet,part-00004-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1139-1-c000.snappy.parquet,5108
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00005-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1140-1-c000.snappy.parquet,part-00005-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1140-1-c000.snappy.parquet,4873
dbfs:/user/jose.manuel.bustos.munoz@everis.com/etl_part_1/etl1_02_etl_process_overview_psp/ipCount.parquet/part-00006-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1141-1-c000.snappy.parquet,part-00006-tid-1839735248138967219-7db3e94c-f626-4ba5-9b8f-8f473d5e27f7-1141-1-c000.snappy.parquet,4337
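
The listing above can also be checked programmatically instead of by eye. The helper below is a minimal sketch: `summarize_parquet_listing` is a hypothetical name, and it assumes each entry exposes `name` and `size` attributes, as the columns of the listing suggest.

```python
def summarize_parquet_listing(entries):
    """Summarize a directory listing of a Parquet output folder.

    Each entry is assumed to expose `name` and `size` attributes.
    """
    # Data files are the entries named part-*.parquet; markers such as
    # _SUCCESS, _started_* and _committed_* are bookkeeping files.
    part_files = [e for e in entries
                  if e.name.startswith("part-") and e.name.endswith(".parquet")]
    return {
        "has_success_marker": any(e.name == "_SUCCESS" for e in entries),
        "part_file_count": len(part_files),
        "part_bytes": sum(e.size for e in part_files),
    }
```

Called as `summarize_parquet_listing(dbutils.fs.ls(writePath))`, this would report a `_SUCCESS` marker and seven part files for the listing shown above.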


## Review
**Question:** What does ETL stand for and what are the stages of the process?  
**Answer:** ETL stands for `extract-transform-load`:
1. *Extract* refers to ingesting data.  Spark easily connects to data in a number of different sources.
2. *Transform* refers to applying structure, parsing fields, cleaning data, and/or computing statistics.
3. *Load* refers to loading data to its final destination, usually a database or data warehouse.

**Question:** How does the Spark approach to ETL deal with devops issues such as updating a software version?  
**Answer:** By decoupling storage and compute, updating your Spark version is as easy as spinning up a new cluster.  Your existing code still connects to S3, Azure Blob Storage, or other storage.  This also avoids the challenge of keeping a cluster always running, as is common with Hadoop clusters.

**Question:** How does the Spark approach to data applications differ from other solutions?  
**Answer:** Spark offers a unified solution to use cases that would otherwise need individual tools. For instance, Spark combines machine learning, ETL, stream processing, and a number of other solutions all with one technology.

In [34]:
%run "./Includes/Classroom-Cleanup"