# Lab 3.2 : PySpark CORE APIs

In [None]:
try:
    spark
except NameError:
    # initialize Spark Session
    import os
    import sys
    top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
    if top_dir not in sys.path:
        sys.path.append(top_dir)

    from init_spark import init_spark
    spark = init_spark()

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])
spark

## Step 1 - Load sample file

In [None]:
f = spark.read.text("../data/text/twinkle/sample.txt")

## for cloud accounts use this
#f = spark.read.text("s3://elephantscale-public/data/text/twinkle/sample.txt")
#f = spark.read.text("https://raw.githubusercontent.com/elephantscale/datasets/master/text/twinkle/sample.txt")

print (f)

After executing the above...  

**=> Go to the Spark UI (port 4040+)**  
**=> Inspect the 'Jobs' and 'Stages' sections in the UI.**  
**=> Has a job been executed for the load? Can you explain the behavior?**  


In [None]:
print(f.count())
f.show()

**==> Now do you see jobs executed in the Spark UI?**

## Step 2 - Filter

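Note that we refer to the `value` column. `spark.read.text` returns a DataFrame with a single string column named `value`, one row per line of the file.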

In [None]:
# both f.value and f["value"] will work
# the latter form is recommended

filtered = f.filter(f.value.contains("twinkle"))
filtered = f.filter(f["value"].contains("twinkle"))

**==> Are filters executed yet?**

In [None]:
print(filtered.count())
filtered.show()

**==> How about now?**
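
The behavior seen in Steps 1 and 2 is lazy evaluation: a transformation such as `filter` only records what should be done, and the work runs when an action such as `count()` or `show()` is called. The sketch below is a plain-Python analogy, not Spark code. The `LazyLines` class and its `log` attribute are made up for illustration.

```python
class LazyLines:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, lines, predicates=None, log=None):
        self._lines = lines
        self._predicates = predicates or []
        # shared log so we can see when work actually happens
        self.log = log if log is not None else []

    def filter(self, predicate):
        # transformation: returns a new object and touches no data
        return LazyLines(self._lines, self._predicates + [predicate], self.log)

    def count(self):
        # action: only now are the recorded predicates applied
        self.log.append("job executed")
        return sum(1 for line in self._lines
                   if all(p(line) for p in self._predicates))


lines = LazyLines(["twinkle twinkle little star", "how I wonder what you are"])
filtered_lines = lines.filter(lambda line: "twinkle" in line)
print(lines.log)               # still empty: filter() did no work
print(filtered_lines.count())  # the action triggers the work
print(lines.log)               # now contains one entry
```

In the Spark UI the same pattern shows up as no new job after `filter`, and a new job after `count()` or `show()`.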

## Step 3 - See the DAG visualizations

<img src="../assets/images/3.1c.png" style="border: 5px solid grey; max-width:100%;"/>


## Step 4 - Generate Large Dataset

In [None]:
# this works on a native Spark install

! echo "creating data..."

! [ ! -r ../data/text/twinkle/500M.data ] && cd  ../data/text/twinkle &&   ./create-data-files.sh

! echo "DONE"

! ls -lSrh ../data/text/twinkle
# sorted by size (smallest --> largest)

## Step 5 - Load larger dataset

In [None]:
f = spark.read.text("../data/text/twinkle/100M.data")

## for cloud accounts use this
#f = spark.read.text("s3://elephantscale-public/data/text/twinkle/100M.data")


print (f.count())
f.show()

### Discussion

- Inspect the 'count' job
- How many tasks are there?  Can you explain the number of tasks?
- How many stages are there?  Can you explain?
- Drill into the count stage.  
   - How many tasks are running?  Can you explain?
   - Look at the data size for each task
   - How long does each task run?

## Step 6 - Inspect Job Details
How many tasks were allocated to the latest job?  Can you figure out why?  
Hint : get number of partitions

In [None]:
f.rdd.getNumPartitions()
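
As a rough sketch of where that number comes from when reading a single splittable file: Spark's file sources size their input splits from `spark.sql.files.maxPartitionBytes` (default 128 MB), `spark.sql.files.openCostInBytes` (default 4 MB), and the default parallelism. The helper below approximates that calculation. The function name, the assumed file size of about 100 MB, and the example core counts are assumptions for illustration, and the real count can differ by Spark version and configuration, so treat `f.rdd.getNumPartitions()` as the source of truth.

```python
import math

MB = 1024 * 1024


def estimate_input_partitions(file_bytes, default_parallelism,
                              max_partition_bytes=128 * MB,
                              open_cost_bytes=4 * MB):
    """Approximate the number of input partitions for ONE splittable file."""
    # Spark adds a fixed "open cost" per file when sizing splits
    total_bytes = file_bytes + open_cost_bytes
    bytes_per_core = total_bytes // default_parallelism
    # split size is capped by maxPartitionBytes, floored by openCostInBytes
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_bytes / max_split_bytes)


# assumed: a ~100 MB file read on a machine with 8 cores
print(estimate_input_partitions(100 * MB, default_parallelism=8))
# assumed: a 1 GB file read on a machine with 4 cores
print(estimate_input_partitions(1024 * MB, default_parallelism=4))
```

Each partition becomes one task in the stage that reads the file, which is why the task count in the Spark UI matches the partition count.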

## Step 7 - Save filtered data

In [None]:
f = spark.read.text("../data/text/twinkle/100M.data")

## for cloud accounts use this
#f = spark.read.text("s3://elephantscale-public/data/text/twinkle/100M.data")

print ("f.count : ", f.count() )

filtered = f.filter(f["value"].contains("twinkle"))

print ("filtered.count : ", filtered.count())

filtered.write.mode('overwrite').text("out2")
print ('done')

In [None]:
## TODO : inspect the output dir
# What do you see?

! ls -lh out2

### Discussion
- Explain the number of files
- Explain the `_SUCCESS` file
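
To support the discussion, the listing can also be summarized from Python. Spark typically writes one `part-*` file per partition it saved, and `_SUCCESS` is an empty marker file written when the job commits successfully. The helper below is a minimal sketch using only the standard library; the function name `summarize_output_dir` is made up for illustration.

```python
from pathlib import Path


def summarize_output_dir(path):
    """Count part files and check for the _SUCCESS marker in an output dir."""
    out = Path(path)
    # hidden checksum files (.part-*.crc) do not match this pattern
    part_files = sorted(p.name for p in out.glob("part-*"))
    return {
        "part_files": len(part_files),
        "has_success_marker": (out / "_SUCCESS").exists(),
    }


# only runs if the output directory from Step 7 exists
if Path("out2").is_dir():
    print(summarize_output_dir("out2"))
```

Comparing `part_files` with `filtered.rdd.getNumPartitions()` is one way to check the relationship between partitions and output files.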

## Step 8 - Bonus  -  Merging multiple partitions into ONE
When we saved the data in the section above, multiple files were created in the output directory. Can you create just one output file?

Hint : see the API for `coalesce` or `repartition`

In [None]:
## TODO : how many partitions do we want?
## HINT : start with 1
num_partitions = ???
x = filtered.repartition(num_partitions)
x.write.mode('overwrite').text("out3")

# you can also do it all in one line
# filtered.repartition(1).write.mode('overwrite').text("out3")

In [None]:
## inspect the output dir
## how many files do you see?
! ls -lh out3

## Class Discussion

Instructor: please do a detailed walkthrough of the Spark UI.  Explain the following:
- Jobs UI
- Jobs --> Stages drill down
- Understanding **dataflow**  (explain input/output/shuffle datasizes)
- **Task level analysis**  (time taken, data read, written, shuffled)
- Expand the **event timeline** at Stage UI and explain what is going on

## Questions for the Class

Please explain how you would debug the following issues:

### Long running job

A Spark job used to run in 20 minutes.  Now it takes over an hour.  
- What could be the causes?
- How can you debug and isolate the cause?


### Job running out of memory

After a recent update, a job started crashing with 'Out of Memory'.  You are tasked with finding out what is going on.  How will you approach this?

### Isolating a 'slow' machine

You suspect your program is running slowly due to a hardware issue on one machine in a cluster of 10 machines.  
How do you isolate the machine that is causing the slowdown?
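
One possible starting point, offered as a sketch rather than the expected answer: the task table on a stage page in the Spark UI lists a host and a duration for every task, so tasks can be grouped by host and the per-host medians compared. The function name, the `factor` threshold, the record layout (`host`, `duration_ms`), and the sample values below are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def slow_hosts(tasks, factor=2.0):
    """Return hosts whose median task time exceeds factor x the typical host."""
    by_host = defaultdict(list)
    for task in tasks:
        by_host[task["host"]].append(task["duration_ms"])
    host_medians = {host: median(times) for host, times in by_host.items()}
    # the median across hosts is robust to a single bad machine
    typical = median(host_medians.values())
    return sorted(host for host, m in host_medians.items()
                  if m > factor * typical)


# hypothetical task records, copied by hand from a stage's task table
sample_tasks = [
    {"host": "node-1", "duration_ms": 900},
    {"host": "node-1", "duration_ms": 1100},
    {"host": "node-2", "duration_ms": 1000},
    {"host": "node-2", "duration_ms": 950},
    {"host": "node-3", "duration_ms": 4200},
    {"host": "node-3", "duration_ms": 3900},
]
print(slow_hosts(sample_tasks))
```

A host flagged this way is only a suspect: data skew can also make tasks on one host slow, so compare the input size per task before blaming the hardware.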