---
## Lab Number : 0

### Title : *Introduction to Spark* 

### Goal : Spark basics. Getting familiar with Datasets / RDDs / Transformations and Actions  

### In this Lab we will cover:


1. Spark Datasets and RDDs
2. Dataset transformations and actions
3. Lambda functions
4. More on Dataset actions
5. More on Dataset transformations
6. Lazy Evaluation

### References:

1. Spark API reference : https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

### Dataset reference:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

---

#### Virtual Environments

Python virtual environments let you create an isolated working environment with its own package installations, tailored to each analysis scenario.
For example : if you need to use different Python versions, you can create a separate environment for each version.
How (with conda, from a terminal) : 
% conda create -n <name>
% conda activate <name>    (on older conda versions: source activate <name>)

In [84]:
# Alternative with the standard venv module.
# Run these in a terminal, not as notebook magics:
# python3 -m venv course-env
# source course-env/bin/activate
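
An isolated environment can also be created from Python itself with the standard-library `venv` module. The following is a minimal sketch, not part of the original lab: the directory name `course-env` mirrors the one used above, and the environment is built inside a temporary folder purely so that the example leaves nothing behind.

```python
import os
import tempfile
import venv

# Build the environment inside a throw-away directory (illustrative choice).
with tempfile.TemporaryDirectory() as base_dir:
    env_dir = os.path.join(base_dir, "course-env")

    # with_pip=False keeps creation fast; pass with_pip=True to bootstrap pip.
    venv.create(env_dir, with_pip=False)

    # Every virtual environment has a pyvenv.cfg file at its root.
    has_config = os.path.isfile(os.path.join(env_dir, "pyvenv.cfg"))

print(has_config)
```

For real work you would create the environment in a permanent location and activate it from a terminal before starting the notebook.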


#### Getting a SparkSession object

In [85]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("Lab0") \
        .getOrCreate()

In [87]:
# what is it?
spark

#### Note:  Inspect the Spark UI link (above) to get familiar with the Spark architecture

#### 1.  Spark Datasets and RDDs

In [88]:
import os
datasets_path= os.environ.get('HOME')+'/spark-course/data/'

In [89]:
# Dataset reference : https://grouplens.org/datasets/movielens/
movies = datasets_path+'movies/ml-latest-small/movies.csv'
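# Create an RDD called lines.
# The SparkContext is taken from the session in case `sc` is not predefined.
sc = spark.sparkContext
lines = sc.textFile(movies)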

In [90]:
# What is lines ?
lines

/home/asier/spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[71] at textFile at NativeMethodAccessorImpl.java:0

In [91]:
# how many lines do we have? perform an action
lines.count()

9126

In [92]:
# check contents of first 5 lines
lines.take(5)

['movieId,title,genres',
 '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
 '2,Jumanji (1995),Adventure|Children|Fantasy',
 '3,Grumpier Old Men (1995),Comedy|Romance',
 '4,Waiting to Exhale (1995),Comedy|Drama|Romance']

#### 2. RDDs transformations and actions

In [93]:
# build a dataset of [movieId, title, genres] lists (all values are strings).
# each entry (row) corresponds to : movieId , title , genres

# Explained : 
# 1. Use a transformation (map) to split each line wherever it finds a comma.
# 2. Use a transformation (map) to keep the first three fields of each row in a python list
# Note : a plain split(",") also cuts titles that themselves contain a comma.
tuples = lines.map(lambda line: line.split(",")) \
             .map(lambda row: [row[0],row[1],row[2]])

In [94]:
tuples

PythonRDD[74] at RDD at PythonRDD.scala:48
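
Note that `tuples` prints as an RDD description and not as data: transformations such as `map` are only recorded, and nothing is computed until an action is called. The following plain-Python sketch is an analogy rather than Spark code. It relies on Python's built-in `map`, which is also lazy; the `seen` list and the two sample lines are there only for illustration.

```python
sample_lines = [
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy",
    "2,Jumanji (1995),Adventure|Children|Fantasy",
]

seen = []  # records every line that actually gets processed


def split_line(line):
    seen.append(line)
    return line.split(",")


# "Transformation": builds a lazy iterator, split_line has not run yet.
lazy_rows = map(split_line, sample_lines)
calls_before = len(seen)

# "Action": consuming the iterator forces the work to happen.
rows = list(lazy_rows)
calls_after = len(seen)

print(calls_before, calls_after)
```

In Spark the role of `list(...)` is played by actions such as `count()` or `take(5)`.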

In [95]:
tuples.take(5)

[['movieId', 'title', 'genres'],
 ['1', 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy'],
 ['2', 'Jumanji (1995)', 'Adventure|Children|Fantasy'],
 ['3', 'Grumpier Old Men (1995)', 'Comedy|Romance'],
 ['4', 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance']]
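
One caveat of `line.split(",")`: a title that itself contains a comma is quoted in the CSV file and would be cut in two. A possible alternative, sketched here, is to parse each line with Python's `csv` module. The helper name `parse_line` and the sample line are illustrative assumptions; in the lab the function could be passed to `lines.map(parse_line)`.

```python
import csv


def parse_line(line):
    """Parse one CSV line, honouring double-quoted fields."""
    return next(csv.reader([line]))


# Illustrative line with a quoted title that contains a comma.
sample = '11,"American President, The (1995)",Comedy|Drama|Romance'

naive = sample.split(",")    # the quoted title is split in two
parsed = parse_line(sample)  # the quoted title stays in one field

print(len(naive), naive[1])
print(len(parsed), parsed[1])
```

With the naive split the genres end up in position 3 instead of 2, so a filter on `r[2]` would silently test part of the title.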

In [78]:
# Count how many movies are cataloged under each genre in this dataset
# We perform a transformation (filter) followed by an action, count()
comedy=tuples.filter(lambda r : 'Comedy' in r[2]).count()
drama=tuples.filter(lambda r : 'Drama' in r[2]).count()
romance=tuples.filter(lambda r : 'Romance' in r[2]).count()
child=tuples.filter(lambda r : 'Children' in r[2]).count()
scifi=tuples.filter(lambda r : 'Sci' in r[2]).count()

In [79]:
# count the lines once and reuse the result (the header line is included in the total)
total = lines.count()
print('comedies :-) %d %.1f ' %(comedy,comedy/total))
print('dramas   :-( %d %.1f ' %(drama ,drama/total))
print('scifi    X-J %d %.1f ' %(scifi ,scifi/total))

comedies :-) 2611 0.3 
dramas   :-( 3264 0.4 
scifi    X-J 656 0.1 
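
The `in` test above looks for a substring of the genres field. A slightly stricter variant splits the field on `|` first, and a single pass can tally every genre at once. The sketch below uses a plain Python list in place of the RDD, with the four rows shown earlier as sample data; the variable names are illustrative. The same lambda could be given to `tuples.filter(...)`.

```python
from collections import Counter

# Sample rows in the same [movieId, title, genres] layout as `tuples`.
rows = [
    ["1", "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"],
    ["2", "Jumanji (1995)", "Adventure|Children|Fantasy"],
    ["3", "Grumpier Old Men (1995)", "Comedy|Romance"],
    ["4", "Waiting to Exhale (1995)", "Comedy|Drama|Romance"],
]

# Filter with a lambda that matches whole genre names only.
comedies = list(filter(lambda r: "Comedy" in r[2].split("|"), rows))
comedy_count = len(comedies)

# Tally all genres in one pass instead of one filter per genre.
genre_counts = Counter(genre for r in rows for genre in r[2].split("|"))

print(comedy_count)
print(genre_counts["Comedy"], genre_counts["Drama"])
```

On an RDD, the one-pass tally would typically be written with `flatMap` followed by `countByValue()` or `reduceByKey`.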


In [80]:
# Print the RDD lineage (the DAG of dependencies)
tuples.toDebugString()

b'(2) PythonRDD[36] at RDD at PythonRDD.scala:48 []\n |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[27] at textFile at NativeMethodAccessorImpl.java:0 []\n |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv HadoopRDD[26] at textFile at NativeMethodAccessorImpl.java:0 []'
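
`toDebugString()` returns a bytes object here, so the notebook shows it on a single line with `\n` escapes. Decoding it makes the lineage readable, one RDD per line, with the most recent RDD first. The sketch below copies the value shown above into a literal so that it runs without Spark; in the lab the equivalent would be `print(tuples.toDebugString().decode("utf-8"))`.

```python
# Output of tuples.toDebugString(), copied from the cell above.
debug = (
    b'(2) PythonRDD[36] at RDD at PythonRDD.scala:48 []\n'
    b' |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv '
    b'MapPartitionsRDD[27] at textFile at NativeMethodAccessorImpl.java:0 []\n'
    b' |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv '
    b'HadoopRDD[26] at textFile at NativeMethodAccessorImpl.java:0 []'
)

# Decode the bytes so that the newlines are rendered when printing.
lineage = debug.decode("utf-8")
print(lineage)

# One entry per RDD in the chain, newest first.
steps = lineage.splitlines()
```

The leading `(2)` is the number of partitions of the RDD.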


In [5]:
bank_data=datasets_path+'bank.csv'
# Use it to load some data
df= spark \
    .read \
    .option("header","true") \
    .csv(bank_data)

In [6]:
# What is df ?
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [7]:
# ok, but this is not very ... telling; we want to see some of the data as well
df.take(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"management";"married";"tertiary";"no";1476;"y

In [8]:
# You can see how a Spark DataFrame is actually a Dataset[Row] abstraction
# Let's analyze some data
# First let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [10]:
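# but something is odd here: below the 'root' node there is a single flat leaf
# that holds the whole line as one string, even the values that are certainly numeric.
# The file is separated by ';' rather than ',', so the reader did not split the columns,
# and without a schema every column is read as a string anyway.
# so .. let's provide the schema ourselves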
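
To see why only one column appeared, it helps to parse the header line by hand. The sketch below uses Python's `csv` module on the header and on the first data row shown in the output above; the variable names are illustrative. With the default comma separator the whole line stays in one field, while `;` yields the 17 expected columns.

```python
import csv

# Header and first data row of bank.csv, copied from the output above.
header = ('"age";"job";"marital";"education";"default";"balance";"housing";'
          '"loan";"contact";"day";"month";"duration";"campaign";"pdays";'
          '"previous";"poutcome";"y"')
first_row = ('30;"unemployed";"married";"primary";"no";1787;"no";"no";'
             '"cellular";19;"oct";79;1;-1;0;"unknown";"no"')

# Default separator (comma): nothing to split on, so a single field results.
comma_fields = next(csv.reader([header]))

# Semicolon separator: one field per column.
names = next(csv.reader([header], delimiter=";"))
values = next(csv.reader([first_row], delimiter=";"))

print(len(comma_fields), len(names))
print(dict(zip(names, values))["balance"])
```

Note that the parsed values are still strings at this point, which is what a schema is meant to fix.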

#### Manually specifying data schema

In [25]:
# we can specify the schema ourselves
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql import Row
# Note: the last column of the file ("y") is not listed in this schema
fields = [
          StructField("age", DoubleType(), True),
          StructField("job", StringType(), True),
          StructField("marital", StringType(), True),
          StructField("education", StringType(), True),
          StructField("default", StringType(), True),
          StructField("balance", DoubleType(), True),
          StructField("housing", StringType(), True),
          StructField("loan", StringType(), True),
          StructField("contact", StringType(), True),
          StructField("day", StringType(), True),
          StructField("month", StringType(), True),
          StructField("duration", IntegerType(), True),
          StructField("campaign", IntegerType(), True),
          StructField("pdays", IntegerType(), True),
          StructField("previous", IntegerType(), True),
          StructField("poutcome", StringType(), True)]

custom_schema=StructType(fields)

In [26]:
# the file is ';'-separated, so tell the reader which separator to use
mdf= spark \
    .read \
    .option("header","true") \
    .option("sep",";") \
    .schema(custom_schema) \
    .csv(bank_data)

In [27]:
mdf.printSchema()

root
 |-- age: double (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: double (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)



#### Inferring the schema ?