---
## Lab Number : 0

### Title : *Introduction to Spark* 

### Goal : Spark basics. Getting familiar with Datasets / RDDs / Transformations and Actions

### In this lab we will cover:

1. Spark Datasets and RDDs
2. Dataset transformations and actions
3. Lambda functions
4. More on Dataset actions
5. More on Dataset transformations
6. Lazy evaluation

### References:

1. Spark API reference : https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

### Dataset reference:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

---

#### Virtual Environments

Python virtual environments let you create a controlled, self-contained working environment with its own package installations for each analysis scenario.
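As a minimal sketch of the idea, the standard-library `venv` module can create such an environment from Python itself. The helper name `create_env` and the use of a temporary directory are illustrative assumptions, not part of the lab. Note that `venv` always uses the interpreter that runs it, so it cannot select a different Python version.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="labenv-pyt3"):
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps the sketch fast and avoids bootstrapping pip
    venv.create(env_dir, with_pip=False)
    return env_dir


# A temporary directory is used here so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every environment created by venv contains a pyvenv.cfg file
    has_config = (env_dir / "pyvenv.cfg").exists()
    print(has_config)
```

Packages installed into an environment like this stay isolated from the system-wide installation.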

In [1]:
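# For example, if you need different package versions, you can create a dedicated environment for each case
# In a notebook, type '%' and press Tab to see the available magic commands; prefix a line with '!' to run a shell command
# venv uses the interpreter that runs it, so a `python=3.6` argument (conda syntax) does not apply here
# !python3 -m venv labenv-pyt3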

In [2]:
# Run in a terminal (activation does not persist across notebook shell commands):
# source labenv-pyt3/bin/activate

#### Getting a SparkSession object

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("Lab0") \
        .getOrCreate()

In [4]:
# what is it?
spark

#### Let's load some data

In [5]:
datasets_path = '/spark-course/data/banking/'
bank_data = datasets_path + 'bank.csv'
# Use it to load some data
df = spark \
    .read \
    .option("header", "true") \
    .csv(bank_data)

In [6]:
# What is df
df

DataFrame["age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string]

In [7]:
# OK, but this is not very informative; we also want to see some of the data
df.take(5)

[Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"'),
 Row("age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""='30;"management";"married";"tertiary";"no";1476;"y

In [8]:
# You can see how a Spark DataFrame is actually a Dataset[Row] abstraction
# Let's analyze some data
# First, let's check the schema
df.printSchema()

root
 |-- "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"": string (nullable = true)



In [10]:
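# Something is odd here: under the 'root' node there is a single flat leaf,
# with the whole line recorded as one string, even values that are certainly numeric.
# The file is semicolon-delimited, while Spark's CSV reader splits on commas by default,
# so we need to set the separator and, to get proper types, provide the schema ourselves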
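The single-column result can be reproduced without Spark. The sketch below uses Python's standard `csv` module on the header and first record copied from the output above. The helper name `parse` is illustrative, and the exact parsing rules of Spark's reader may differ in detail from those of `csv`.

```python
import csv
import io

# Header and first record, copied from the output shown above
sample = (
    '"age";"job";"marital";"education";"default";"balance";"housing";"loan";'
    '"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"\n'
    '30;"unemployed";"married";"primary";"no";1787;"no";"no";'
    '"cellular";19;"oct";79;1;-1;0;"unknown";"no"\n'
)


def parse(text, delimiter):
    """Parse CSV text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = parse(sample, ",")      # wrong delimiter: nothing to split on
semicolon_rows = parse(sample, ";")  # matches the file's actual delimiter

print(len(comma_rows[0]))      # number of columns seen with ','
print(len(semicolon_rows[0]))  # number of columns seen with ';'
print(semicolon_rows[1][:3])   # first three values of the first record
```

With the comma delimiter each line is read as one field, which mirrors the single string column reported by `printSchema()`.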

#### Manually specifying data schema

In [23]:
# We can specify the schema ourselves
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

# Inside brackets Python continues lines implicitly, so no backslashes are needed
# (whitespace after a trailing backslash raises a SyntaxError)
custom_schema = StructType([
    StructField("age", DoubleType(), True),
    StructField("job", StringType(), True),
    StructField("marital", StringType(), True),
    StructField("education", StringType(), True),
    StructField("default", StringType(), True),
    StructField("balance", DoubleType(), True),
    StructField("housing", StringType(), True),
    StructField("loan", StringType(), True),
    StructField("contact", StringType(), True),
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("duration", IntegerType(), True),
    StructField("campaign", IntegerType(), True),
    StructField("pdays", IntegerType(), True),
    StructField("previous", IntegerType(), True),
    StructField("poutcome", StringType(), True),
    # "y" is the last column in the file header; assumed here to be a yes/no string
    StructField("y", StringType(), True),
])

In [15]:
# The file is semicolon-delimited, so set the separator along with the schema
mdf = spark \
    .read \
    .option("header", "true") \
    .option("sep", ";") \
    .schema(custom_schema) \
    .csv(bank_data)

In [13]:
# pdf is assumed to be a pandas DataFrame created in a cell not shown here
pdf.head(5)

   "age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""
0  30;"unemployed";"married";"primary";"no";1787;...
1  33;"services";"married";"secondary";"no";4789;...
2  35;"management";"single";"tertiary";"no";1350;...
3  30;"management";"married";"tertiary";"no";1476...
4  59;"blue-collar";"married";"secondary";"no";0;...


In [56]:
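# List comprehensions
# lst is defined in a cell not shown here; the outputs below imply these values
lst = [1, 2, 3, 4, 5, 6]
cuad = [item * item for item in lst]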

In [57]:
cuad

[1, 4, 9, 16, 25, 36]
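A list comprehension is shorthand for a loop that appends to a list. The sketch below shows both forms side by side, assuming `lst` holds the values implied by the output above; the names `squares_loop` and `squares_comprehension` are illustrative.

```python
# Assumed input: the values implied by the output above
lst = [1, 2, 3, 4, 5, 6]

# Explicit loop: append the square of each item to a result list
squares_loop = []
for item in lst:
    squares_loop.append(item * item)

# The same computation written as a list comprehension
squares_comprehension = [item * item for item in lst]

# Both forms build the same list
print(squares_loop == squares_comprehension)
```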

In [58]:
# Conditionals
for item in lst:
    if item % 2 == 0:
        print(item)

2
4
6


In [60]:
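# We can also write this as a list comprehension, although using one only for
# its side effects is unidiomatic: it builds a throwaway list of None values
[print(item) for item in lst if (item % 2 == 0)]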

2
4
6


[None, None, None]
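The `[None, None, None]` result appears because `print` returns `None` and the comprehension collects those return values. A more common pattern, sketched below, is to let the comprehension build the filtered list and print afterwards; `lst` is assumed to hold the values implied by the outputs, and `evens` is an illustrative name.

```python
# Assumed input: the values implied by the outputs above
lst = [1, 2, 3, 4, 5, 6]

# Build the filtered list first: keep only the even items
evens = [item for item in lst if item % 2 == 0]

# Then print each item with a plain loop
for item in evens:
    print(item)
```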

#### ... one of the most annoying things about Python is indentation, by the way

In [61]:
# Functions

# Define a function with the def keyword
def check_even(number):
    """Print whether an integer is even or odd."""
    even = (number % 2 == 0)
    if even:
        print("number %d is even" % number)
    else:
        print("number %d is odd" % number)

In [63]:
# Use a function by making a call to it
for number in lst:
    check_even(number)

number 1 is odd
number 2 is even
number 3 is odd
number 4 is even
number 5 is odd
number 6 is even
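`check_even` prints its result, so its answer cannot be reused by other code. A variant that returns a boolean is sketched below; the name `is_even` and the derived lists are illustrative, and `lst` is assumed to hold the values implied by the outputs above.

```python
def is_even(number):
    """Return True if the integer is even, False otherwise."""
    return number % 2 == 0


# Assumed input: the values implied by the outputs above
lst = [1, 2, 3, 4, 5, 6]

# The returned boolean can drive a conditional expression...
labels = ["even" if is_even(number) else "odd" for number in lst]

# ...or act as a filter condition
even_numbers = [number for number in lst if is_even(number)]

print(labels)
print(even_numbers)
```

Keeping the test separate from the printing makes the function usable as a predicate wherever a yes/no answer is needed.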
