---
# Lab Number : 0

## Title : *Introduction to Spark* 

## Goal : Spark basics. Getting famililar with Datasets / RDDs / Transformations and Actions  

## In this Lab we will cover the:


1. Spark Datasets and RDDs 
2. Datasets Transformations and actions
3. Lambda functions
4. More on Dataset actions
5. More on Dataset transformations
6. Lazy Evaluation 

## References:

1. Spark API reference : https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

## Dataset reference:

https://grouplens.org/datasets/movielens/

---

### SparkSession and SparkContext objects

In [75]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession \
        .builder \
        .appName("Lab0") \
        .getOrCreate()

sc = spark.sparkContext

In [76]:
# what is it?
spark

In [77]:
# Is SparkContext also available?
sc

#### Note:  
Inspect the Spar UI Link (open link in a new tab) to get familiar with the Spark User Interface

###  1. Spark RDDs and Datasets Creation

In [62]:
import os
#
# If you need to read some environment var  e.g. HOME : 
# os.environ.get('HOME')
#
datasets_path='/spark-course/data/'
movies = datasets_path+'movies/ml-latest-small/movies.csv'

#### Create an RDD by using the SparkContext object

In [63]:
rdd = sc.textFile(movies)

In [64]:
rdd

/spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[45] at textFile at NativeMethodAccessorImpl.java:0

#### Create a  Dataframe (Dataset[Row]) by using the SparkSession object

In [65]:
df = spark.read.text(movies)

In [66]:
df

DataFrame[value: string]

### 2. RDDs transformations and actions

In [67]:
# check contents of first 5 lines using the RDD
rdd.take(5)

['movieId,title,genres',
 '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
 '2,Jumanji (1995),Adventure|Children|Fantasy',
 '3,Grumpier Old Men (1995),Comedy|Romance',
 '4,Waiting to Exhale (1995),Comedy|Drama|Romance']

In [68]:
# check contents of first 5 lines using the Dataframe

In [69]:
df.head(5)

[Row(value='movieId,title,genres'),
 Row(value='1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy'),
 Row(value='2,Jumanji (1995),Adventure|Children|Fantasy'),
 Row(value='3,Grumpier Old Men (1995),Comedy|Romance'),
 Row(value='4,Waiting to Exhale (1995),Comedy|Drama|Romance')]

In [70]:
# how many lines we have - perform an action
rdd.count()

9126

In [71]:
df.count()

9126

### 3. Lambda Functions

In [80]:
# build a dataset of (Int, String, String) tuples.
# each entry (row) correspond to : movieId , title , genres

# Explained : 
# 1. Use a transformation (map) to split the data wherever it finds a comma.
# 2. Use a transformation (map) to assign each of these results onto a python list for each row
tuples_rdd = rdd \
            .map(lambda line: line.split(",")) \
            .map(lambda row: [row[0],row[1],row[2]])

In [81]:
tuples_rdd

PythonRDD[65] at RDD at PythonRDD.scala:48

In [82]:
tuples_rdd.take(5)

[['movieId', 'title', 'genres'],
 ['1', 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy'],
 ['2', 'Jumanji (1995)', 'Adventure|Children|Fantasy'],
 ['3', 'Grumpier Old Men (1995)', 'Comedy|Romance'],
 ['4', 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance']]

In [83]:
# Now some dive into the data: (chaining transformations and action)
#
# 1. count how many Comedy cataloged movies are there in this dataset
#    we perform a transformation (filter) followed by an action count()
#
comedy=tuples_rdd.filter(lambda row : 'Comedy'   in row[2]).count()
drama=tuples_rdd.filter(lambda row : 'Drama'    in row[2]).count()
romac=tuples_rdd.filter(lambda row : 'Romance'  in row[2]).count()
child=tuples_rdd.filter(lambda row : 'Children' in row[2]).count()
scifi=tuples_rdd.filter(lambda row : 'Sci'      in row[2]).count()

In [86]:
print('comedies :-) %-d %.1f ' %(comedy,comedy/rdd.count()))
print('dramas   :-( %-d %.1f ' %(drama ,drama/rdd.count()))
print('scifi    X-J %-d %.1f ' %(scifi ,scifi/rdd.count()))

comedies :-) 2611 0.3 
dramas   :-( 3264 0.4 
scifi    X-J 656 0.1 


In [80]:
# Print the DAG
tuples.toDebugString()

b'(2) PythonRDD[36] at RDD at PythonRDD.scala:48 []\n |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[27] at textFile at NativeMethodAccessorImpl.java:0 []\n |  /home/asier/spark-course/data/movies/ml-latest-small/movies.csv HadoopRDD[26] at textFile at NativeMethodAccessorImpl.java:0 []'

In [None]:
tuples.

In [5]:
bank_data=datasets_path+'bank.csv'
# Use it to load some data
df= spark \
    .read \
    .option("header","true") \
    .csv(bank_data)

In [6]:
# What is df ?
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [7]:
# ok , but this is not very ... telling , we want to see some of the data also
df.take(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"management";"married";"tertiary";"no";1476;"y

In [8]:
# You can se how a Spark DataFrame is actually a Dataset[Row] abstraction
# Let's analyze some data
# First let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [10]:
# but there seems to be something odd here there is only the 'root' node and then a flat leaf 
# with everything recorded as strings , even stuff that is certainly numeric
# so .. let's provide ourselves the schema

#### Manually specifying data schema

In [25]:
# we can specify the schema ourselves
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql import Row
fields = [ \
          StructField("age", DoubleType(), True), \
          StructField("job", StringType(), True), \
          StructField("marital", StringType(), True), \
          StructField("education", StringType(), True), \
          StructField("default", StringType(), True), \
          StructField("balance", DoubleType(), True), \
          StructField("housing", StringType(), True), \
          StructField("loan", StringType(), True), \
          StructField("contact", StringType(), True), \
          StructField("day", StringType(), True), \
          StructField("month", StringType(), True), \
          StructField("duration", IntegerType(), True), \
          StructField("campaign", IntegerType(), True), \
          StructField("pdays", IntegerType(), True), \
          StructField("previous", IntegerType(), True), \
          StructField("poutcome", StringType(), True)]

custom_schema=StructType(fields)

In [26]:
mdf= spark \
    .read \
    .option("header","true") \
    .schema(custom_schema) \
    .csv(bank_data)

In [27]:
mdf.printSchema()

root
 |-- age: double (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: double (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: string (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)



#### Inferring the schema ?