---
# Lab Number : 0

## Title : *Introduction to Spark* 

## Goal : Spark basics. Getting familiar with Datasets / RDDs / Transformations and Actions

## In this lab we will cover:


1. Spark Datasets and RDDs 
2. Dataset transformations and actions
3. Lambda functions
4. More on Dataset actions
5. More on Dataset transformations
6. Lazy Evaluation 

## References:

1. Spark API reference : https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

## Dataset reference:

https://grouplens.org/datasets/movielens/

---

### SparkSession and SparkContext objects

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

spark = SparkSession \
        .builder \
        .appName("Lab0") \
        .getOrCreate()

sc = spark.sparkContext

In [2]:
# what is it?
spark

In [77]:
# Is SparkContext also available?
sc

#### Note:  
Inspect the Spark UI link (open it in a new tab) to get familiar with the Spark user interface.

### 1. Spark RDD and Dataset Creation

In [3]:
import os
#
# If you need to read an environment variable, e.g. HOME:
# os.environ.get('HOME')
#
datasets_path='/spark-course/data/'
movies = datasets_path+'movies/ml-latest-small/movies.csv'
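
The commented-out `os.environ.get` hint above can be turned into a configurable path. The following is a minimal sketch only: the environment variable name `SPARK_COURSE_DATA` and the helper `movies_csv_path` are hypothetical and not part of the lab setup, and the default falls back to the path used in this lab.

```python
import os

# Default taken from this lab; the override variable name is an assumption.
DEFAULT_DATASETS_PATH = '/spark-course/data/'

def movies_csv_path(env=os.environ):
    """Return the movies.csv path, honouring an optional override."""
    base = env.get('SPARK_COURSE_DATA', DEFAULT_DATASETS_PATH)
    return os.path.join(base, 'movies', 'ml-latest-small', 'movies.csv')

# With no override set, this is the same path the lab hard-codes.
print(movies_csv_path({}))
```

Using `os.path.join` avoids having to remember whether the base path ends with a slash.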

#### Create an RDD by using the SparkContext object

In [4]:
rdd = sc.textFile(movies)

In [5]:
rdd

/spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

#### Create a DataFrame (Dataset[Row]) by using the SparkSession object

In [6]:
df = spark.read.text(movies)

In [7]:
df

DataFrame[value: string]

### 2. RDD transformations and actions

In [8]:
# check contents of first 5 lines using the RDD
rdd.take(5)

['movieId,title,genres',
 '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
 '2,Jumanji (1995),Adventure|Children|Fantasy',
 '3,Grumpier Old Men (1995),Comedy|Romance',
 '4,Waiting to Exhale (1995),Comedy|Drama|Romance']

In [9]:
# check contents of first 5 lines using the DataFrame

In [10]:
df.head(5)

[Row(value='movieId,title,genres'),
 Row(value='1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy'),
 Row(value='2,Jumanji (1995),Adventure|Children|Fantasy'),
 Row(value='3,Grumpier Old Men (1995),Comedy|Romance'),
 Row(value='4,Waiting to Exhale (1995),Comedy|Drama|Romance')]

In [11]:
# how many lines do we have? count() is an action
rdd.count()

9126

In [12]:
df.count()

9126

### 3. Transformations and Lambda Functions

In [13]:
# Build an RDD of [movieId, title, genres] lists (all three fields stay strings).
# Each entry (row) corresponds to: movieId, title, genres
#
# Explained:
# 1. Use a transformation (map) to split each line wherever it finds a comma.
# 2. Use a second transformation (map) to keep the first three fields as a Python list for each row.
# Note: a plain split(",") also splits titles that themselves contain a comma.
tuples_rdd = rdd \
            .map(lambda line: line.split(",")) \
            .map(lambda row: [row[0],row[1],row[2]])

In [14]:
tuples_rdd

PythonRDD[13] at RDD at PythonRDD.scala:48
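
Displaying `tuples_rdd` shows only an RDD handle, not data. Transformations such as `map` are lazy, and nothing is read until an action such as `take` or `count` runs. As a rough analogy only (plain Python iterators, not Spark), the sketch below records when work actually happens; `lines`, `calls`, and `split_line` are illustrative names.

```python
# Plain-Python analogy for lazy evaluation; this is not Spark code.
from itertools import islice

lines = [
    'movieId,title,genres',
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
    '2,Jumanji (1995),Adventure|Children|Fantasy',
]

calls = []  # records every line that actually gets processed

def split_line(line):
    calls.append(line)
    return line.split(',')

# Like a transformation: builds a recipe, processes nothing yet.
lazy_rows = map(split_line, lines)
print(len(calls))   # 0

# Like an action: pulling results forces the work, and only as much as needed.
first_two = list(islice(lazy_rows, 2))
print(len(calls))   # 2
```

Spark's scheduler is far more involved than a Python iterator, but the ordering is the same: defining the pipeline is cheap, and the cost is paid when results are requested.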

In [15]:
tuples_rdd.take(5)

[['movieId', 'title', 'genres'],
 ['1', 'Toy Story (1995)', 'Adventure|Animation|Children|Comedy|Fantasy'],
 ['2', 'Jumanji (1995)', 'Adventure|Children|Fantasy'],
 ['3', 'Grumpier Old Men (1995)', 'Comedy|Romance'],
 ['4', 'Waiting to Exhale (1995)', 'Comedy|Drama|Romance']]
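
`line.split(",")` works for the five rows shown above, but it would also split a title that itself contains a comma. A hedged sketch of a more robust parser using Python's standard `csv` module follows. `parse_line` is a hypothetical helper, and the sample line is an illustration of a quoted title rather than a row taken from the lab output.

```python
import csv

def parse_line(line):
    """Split one CSV line into fields, respecting double-quoted fields."""
    return next(csv.reader([line]))

# Illustrative line with a comma inside a quoted title.
sample = '11,"American President, The (1995)",Comedy|Drama|Romance'

print(len(sample.split(',')))   # 4: the naive split cuts the title in two
print(parse_line(sample))       # 3 fields, title kept whole
```

Under these assumptions the helper could be passed to the RDD directly, for example `rdd.map(parse_line)`, in place of the `split` lambda.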

In [16]:
# Now dive into the data by chaining transformations and actions.
#
# 1. Count how many movies in this dataset are cataloged under each genre:
#    a transformation (filter) followed by an action (count()).
#
comedy=tuples_rdd.filter(lambda row : 'Comedy'   in row[2]).count()
drama=tuples_rdd.filter(lambda row : 'Drama'    in row[2]).count()
romac=tuples_rdd.filter(lambda row : 'Romance'  in row[2]).count()
child=tuples_rdd.filter(lambda row : 'Children' in row[2]).count()
scifi=tuples_rdd.filter(lambda row : 'Sci'      in row[2]).count()

In [17]:
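# rdd.count() is an action, so run it once and reuse the result.
# Note: the total includes the CSV header line; the second column is a fraction of all lines.
total = rdd.count()
print('comedies :-) %-d %.1f ' %(comedy, comedy/total))
print('dramas   :-( %-d %.1f ' %(drama , drama/total))
print('scifi    X-J %-d %.1f ' %(scifi , scifi/total))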

comedies :-) 2611 0.3 
dramas   :-( 3264 0.4 
scifi    X-J 656 0.1 
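
The second column printed above is a fraction of all lines (header included), rounded to one decimal place. As a small sketch, the same numbers can be shown as percentages with Python's `%` format specifier. `counts` and `total` restate values already printed in this lab; `shares` is an illustrative name.

```python
# Counts and total copied from the outputs above (the total includes the header line).
counts = {'comedies': 2611, 'dramas': 3264, 'scifi': 656}
total = 9126

# '.1%' multiplies by 100 and appends a percent sign.
shares = {name: f'{n / total:.1%}' for name, n in counts.items()}

for name, share in shares.items():
    print(f'{name:<9}{counts[name]:>5} {share:>6}')
```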


In [19]:
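# Print the RDD lineage (the chain of parent RDDs behind the DAG).
# PySpark returns it as bytes, hence the b'...' prefix in the output.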
print(tuples_rdd.toDebugString())

b'(2) PythonRDD[13] at RDD at PythonRDD.scala:48 []\n |  /spark-course/data/movies/ml-latest-small/movies.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n |  /spark-course/data/movies/ml-latest-small/movies.csv HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []'
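
The `b'...'` prefix and the literal `\n` sequences appear because the debug string comes back as `bytes`. A minimal sketch of making it readable by decoding before printing follows. The `debug_bytes` literal restates the output above so that the snippet runs without a Spark session; with a live RDD the equivalent would be `print(tuples_rdd.toDebugString().decode())`.

```python
# Bytes as shown in the toDebugString() output above.
debug_bytes = (
    b'(2) PythonRDD[13] at RDD at PythonRDD.scala:48 []\n'
    b' |  /spark-course/data/movies/ml-latest-small/movies.csv '
    b'MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n'
    b' |  /spark-course/data/movies/ml-latest-small/movies.csv '
    b'HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []'
)

# Decoding turns the escaped newlines into real line breaks.
lineage = debug_bytes.decode('utf-8')
print(lineage)  # one RDD per line: the final RDD first, its parents below
```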
