<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 2 - Exploring T&As in the Spark UI

## Getting Started

Let's start by creating the `SparkSession` and a few useful variables.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
from pyspark.sql.types import *
from pyspark.sql.functions import *

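# Root location of the training datasets on S3
baseUri = "s3a://quantia-master/training/"

# Must be set before the session (and its JVM) starts, so the S3A connector jars are fetched
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")  # run locally with one worker thread per logical core
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)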

spark

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Instructions
0. Run the cell below.
0. Answer each of the questions.
  * All the answers can be found in the **Spark UI**.
  * Some parts of the **Spark UI** may not have been reviewed with you yet - that's OK.
  * The goal is to get familiar with diagnosing applications.
0. Submit your answers for review.

**NOTE:** There is no real rhyme or reason to this code. It simply includes a couple of actions and a handful of narrow and wide transformations.

**WARNING:** Run the following cell only once. Running it multiple times will change some of the answers and make validation a little harder.
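
Since every answer lives in the **Spark UI**, it can help to confirm where the UI is being served before you start. The helper below is only a sketch and is not part of the original lab: `spark_ui_url` is a hypothetical name. It relies on the PySpark attribute `sparkContext.uiWebUrl`, which reports the address the driver actually bound to and is `None` when the UI is disabled.

```python
def spark_ui_url(session):
    """Return the Spark UI address of a SparkSession-like object.

    Raises RuntimeError when the session reports no UI address,
    which is what happens when the UI has been disabled.
    """
    url = session.sparkContext.uiWebUrl
    if url is None:
        raise RuntimeError("This session does not expose a Spark UI")
    return url


# In the notebook (assumes the `spark` session created in Getting Started):
# print(spark_ui_url(spark))
```

Reading the address from the session rather than assuming a fixed port matters because Spark picks the next free port when the default one is already taken.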

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Data Ingestion

In [None]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

initialDF = (spark
  .read
  .parquet(baseUri + "wikipedia_pagecount_clean.parquet")
  .cache()
)
initialDF.foreach(lambda x: None)  # action that visits every row, which materializes the cache

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Round #1 Questions
1. How many jobs were triggered?
1. Open the Spark UI and select the **Jobs** tab.
  1. What action triggered the first job?
  1. What action triggered the second job?
1. Open the details for the second job. How many MB of data were read in? Hint: Look at the **Input** column.
1. Open the details for the first stage of the second job. How many records were read in? Hint: Look at the **Input Size / Records** column.
1. Open the **SQL** tab.
  1. How many stages are there for the job? Check the **DAG Visualization**.
1. For the second job, the second stage, how many records (total) 
  1. How many stages are there?
  1. Open the **DAG Visualization**. Why do you suppose the first stage is grey?
1. Open the **Event Timeline** for the second stage of the second job.
  * Make sure to turn on all metrics under **Show Additional Metrics**.
  * Note that there were 200 tasks executed.
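
The job and stage counts asked for above can also be cross-checked outside the browser, because the driver that serves the Spark UI also exposes Spark's monitoring REST API under `/api/v1`. The sketch below is an optional aid, not part of the original lab. `fetch_jobs` and `summarize_jobs` are hypothetical helper names, and the code assumes the job entries carry the `jobId`, `name`, `stageIds` and `numTasks` fields.

```python
import json
from urllib.request import urlopen


def fetch_jobs(ui_url, app_id):
    """Fetch the job list of one application from the monitoring REST API."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/jobs") as resp:
        return json.load(resp)


def summarize_jobs(jobs):
    """Return one (job_id, name, stage_count, task_count) tuple per job, oldest first."""
    rows = [
        (job["jobId"], job["name"], len(job["stageIds"]), job["numTasks"])
        for job in jobs
    ]
    return sorted(rows)  # tuples sort by job_id first


# In the notebook (assumes the `spark` session created in Getting Started):
# sc = spark.sparkContext
# for row in summarize_jobs(fetch_jobs(sc.uiWebUrl, sc.applicationId)):
#     print(row)
```

The `name` field holds the call site of the action that triggered the job, so it lines up with the description shown in the **Jobs** tab.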

Let's move on...

In [None]:
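someDF = (initialDF
  # First letter of the article title, upper-cased (substr positions are 1-based)
  .withColumn("first", upper(col("article").substr(1, 1)))
  # Keep only articles whose title starts with a letter A-Z
  .where(col("first").isin("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"))
  # sum() without arguments sums every numeric column per (project, first) group
  .groupBy("project", "first").sum()
  .drop("sum(bytes_served)")
  .orderBy("first", "project")
  .select(col("first"), col("project"), col("sum(requests)").alias("total"))
  .filter(col("total") > 10000)
)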

total = someDF.count()

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Round #2 Questions
1. <div style="text-decoration:line-through">How many jobs were triggered?</div>
1. <div style="text-decoration:line-through">Open the Spark UI and select the **Jobs** tab.</div>

  1. <div style="text-decoration:line-through">What action triggered the first job?</div>
  1. <div style="text-decoration:line-through">What action triggered the second job?</div>
1. <div style="text-decoration:line-through">Open the details for the second job. How many MB of data were read in? Hint: Look at the **Input** column.</div>
1. <div style="text-decoration:line-through">Open the details for the first stage of the second job. How many records were read in? Hint: Look at the **Input Size / Records** column.</div>
1. Open the **SQL** tab.
  1. How many stages are there for the job? Check the **DAG Visualization**.
1. For the second job, the second stage, how many records (total) 
  1. How many stages are there?
  1. Open the **DAG Visualization**. Why do you suppose the first stage is grey?
1. Open the **Event Timeline** for the second stage of the second job.
  * Make sure to turn on all metrics under **Show Additional Metrics**.
  * Note that there were 200 tasks executed.
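
The 200 tasks noted above most likely reflect `spark.sql.shuffle.partitions`, the Spark SQL setting that controls how many partitions (and therefore tasks) a stage reading shuffled data gets; its default value is 200. The snippet below is a small sketch for checking the value in effect. `shuffle_partition_count` is a hypothetical helper, and it assumes partitions are not being coalesced at run time by adaptive query execution.

```python
def shuffle_partition_count(session):
    """Read spark.sql.shuffle.partitions from a SparkSession-like object as an int."""
    # The runtime config returns values as strings, so convert explicitly.
    return int(session.conf.get("spark.sql.shuffle.partitions"))


# In the notebook (assumes the `spark` session created in Getting Started):
# print(shuffle_partition_count(spark))
```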

In [None]:
someDF.take(total)

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) (Optional) Round #3 Questions
1. Collectively, `someDF.count()` produced 2 jobs and 6 stages.  
However, `someDF.take(total)` produced only 1 job and 2 stages.  
  1. Why did it only produce 1 job?
  1. Why did the last job only produce 2 stages?
1. Look at the **Storage** tab. How many partitions were cached?
1. How many MB of data is being used by our cache?
1. How many total MB of data is available for caching?
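
The cache figures shown in the **Storage** tab are also available from Spark's monitoring REST API, under `/api/v1/applications/[app-id]/storage/rdd`. The sketch below is an optional cross-check rather than part of the lab. `fetch_storage` and `summarize_storage` are hypothetical helper names, and the code assumes each entry carries the `name`, `numPartitions`, `numCachedPartitions`, `memoryUsed` and `diskUsed` fields, with sizes reported in bytes.

```python
import json
from urllib.request import urlopen

BYTES_PER_MB = 1024 * 1024


def fetch_storage(ui_url, app_id):
    """Fetch the list of cached RDDs/DataFrames from the monitoring REST API."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/storage/rdd") as resp:
        return json.load(resp)


def summarize_storage(rdd_infos):
    """Return (name, 'cached/total' partitions, memory MB, disk MB) per cached entry."""
    return [
        (
            info["name"],
            f'{info["numCachedPartitions"]}/{info["numPartitions"]}',
            round(info["memoryUsed"] / BYTES_PER_MB, 1),
            round(info["diskUsed"] / BYTES_PER_MB, 1),
        )
        for info in rdd_infos
    ]


# In the notebook (assumes the `spark` session created in Getting Started):
# sc = spark.sparkContext
# print(summarize_storage(fetch_storage(sc.uiWebUrl, sc.applicationId)))
```

This covers the partitions and cache-size questions only; the total memory available for caching is reported per executor in the **Executors** tab.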

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.