# Chapter 2 Learn PySpark

[Reference](https://www.amazon.com/Learn-PySpark-Python-based-Machine-Learning/dp/1484249607/ref=sr_1_2?keywords=Learn+pyspark&qid=1575896237&sr=8-2)

### Creating a SparkSession Object

In [1]:
from pyspark.sql import SparkSession

[Apache Spark reference](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

- SparkSession is the entry point to Spark SQL.
- SparkSession is required to work with the Dataset and DataFrame APIs.


In [2]:
spark=SparkSession.builder.appName('data_processing').getOrCreate()

To create an instance of SparkSession, use the class attribute `builder`. Think of `builder` as a factory that collects configuration options before the session is constructed.

`appName` sets the name of the application that will be shown in the Spark UI.

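`getOrCreate` returns the existing SparkSession if one is already running; otherwise it creates a new one from the options set on the builder. It is loosely similar to `new` in JavaScript, except that it can reuse an existing instance.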

In [3]:
import pyspark.sql.functions as F

`pyspark.sql.functions` is a collection of built-in functions.
[Link to functions.](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)

In [4]:
from pyspark.sql.types import *

`pyspark.sql.types` provides the data types used to define the schema of a dataframe, such as `StructType`.

### Create a dataframe that does not have null values

DataFrames
- Data structure in tabular form. Think of it as a SQL table.

In [5]:
# Define the structure of the schema.
schema=StructType() \
        .add('user_id', 'string') \
        .add('country', 'string') \
        .add('browser', 'string') \
        .add('OS', 'string') \
        .add('age', 'integer')

The structure of the schema is stored in the variable `schema`.

With `StructType` we define the name and data type of each column of the dataframe.

In [6]:
# Insert data to a DataFrame
df=spark.createDataFrame( \
                        [('A203','India','Chrome','WIN', 33) \
                        ,('A201','China','Safari','MacOS',35) \
                        ,('A205','UK','Mozilla','Linux',25)] \
                        ,schema=schema)

In [7]:
# Print the schema of dataframe df.
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- browser: string (nullable = true)
 |-- OS: string (nullable = true)
 |-- age: integer (nullable = true)



Notice that every column is reported as `nullable = true`. The schema definition did not say anything about nulls, and `add` defaults to `nullable=True`.

In [8]:
# Show the data of dataframe df
df.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|  India| Chrome|  WIN| 33|
|   A201|  China| Safari|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



### Create a dataframe that has null values

In [9]:
# Use the same schema defined for df.
df_na=spark.createDataFrame( \
                           [('A203',None,'Chrome','WIN',33) \
                           ,('A201','China',None,'MacOS',35) \
                           ,('A205','UK','Mozilla','Linux',25)] \
                           ,schema=schema)

In [10]:
df_na.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|   null| Chrome|  WIN| 33|
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



In [11]:
# Replace null values of df_na.
# Use fillna()
df_na.fillna('0').show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|      0| Chrome|  WIN| 33|
|   A201|  China|      0|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



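Notice that `fillna` does not modify `df_na`. DataFrames are immutable, so `fillna` returns a new dataframe in which the null values are replaced by `'0'`. Because the replacement value is a string, only string columns are filled.

Execute the cell below to see that `df_na` is not modified.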

In [12]:
df_na.show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|   null| Chrome|  WIN| 33|
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



### Replace null values in df_na with specific values

In [13]:
df_na.fillna({'country': 'USA', 'browser': 'Safari'}).show()
# The argument is a Python dict that maps column names to replacement values (the syntax is similar to a JS object).

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A203|    USA| Chrome|  WIN| 33|
|   A201|  China| Safari|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



### Drop all rows of df_na that have a null value in any column

In [14]:
df_na.na.drop().show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



### Drop all rows where the country column is null

In [15]:
df_na.na.drop(subset='country').show()

+-------+-------+-------+-----+---+
|user_id|country|browser|   OS|age|
+-------+-------+-------+-----+---+
|   A201|  China|   null|MacOS| 35|
|   A205|     UK|Mozilla|Linux| 25|
+-------+-------+-------+-----+---+



### Replace a specific value with another value

In [16]:
df_na.replace('Chrome','Google Chrome').show()

+-------+-------+-------------+-----+---+
|user_id|country|      browser|   OS|age|
+-------+-------+-------------+-----+---+
|   A203|   null|Google Chrome|  WIN| 33|
|   A201|  China|         null|MacOS| 35|
|   A205|     UK|      Mozilla|Linux| 25|
+-------+-------+-------------+-----+---+



### Drop the user_id column

In [17]:
df_na.drop('user_id').show()

+-------+-------+-----+---+
|country|browser|   OS|age|
+-------+-------+-----+---+
|   null| Chrome|  WIN| 33|
|  China|   null|MacOS| 35|
|     UK|Mozilla|Linux| 25|
+-------+-------+-----+---+



## Import a csv file and manipulate its data

In [23]:
df=spark.read.csv('customer_data.csv' \
                 ,header=True \
                 ,inferSchema=True)
# customer_data.csv is in the same location as this notebook.

In [19]:
# Verify the file was loaded by counting the number of rows.
df.count()

2000

In [20]:
# Count the number of columns.
len(df.columns)

7

In [22]:
# Print the schema of df.
df.printSchema()

root
 |-- Customer_subtype: string (nullable = true)
 |-- Number_of_houses: integer (nullable = true)
 |-- Avg_size_household: integer (nullable = true)
 |-- Avg_age: string (nullable = true)
 |-- Customer_main_type: string (nullable = true)
 |-- Avg_Salary: integer (nullable = true)
 |-- label: integer (nullable = true)



In [25]:
# Show the first 50 rows.
df.show(50)

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Lower class large...|               1|                 3|30-40 years|Family with grown...|     44905|    0|
|Mixed small town ...|               1|                 2|30-40 years|Family with grown...|     37575|    0|
|Mixed small town ...|               1|                 2|30-40 years|Family with grown...|     27915|    0|
|Modern, complete ...|               1|                 3|40-50 years|      Average Family|     19504|    0|
|  Large family farms|               1|                 4|30-40 years|             Farmers|     34943|    0|
|    Young and rising|               1|                 2|20-30 years|         Living well|     13064|    0|
|Large religious f.

In [26]:
df.summary().show()

+-------+--------------------+------------------+------------------+-----------+--------------------+-----------------+------------------+
|summary|    Customer_subtype|  Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|       Avg_Salary|             label|
+-------+--------------------+------------------+------------------+-----------+--------------------+-----------------+------------------+
|  count|                2000|              2000|              2000|       2000|                2000|             2000|              2000|
|   mean|                null|            1.1075|            2.6895|       null|                null|     1616908.0835|            0.0605|
| stddev|                null|0.3873225521186316|0.7914562220841646|       null|                null|6822647.757312146|0.2384705099001677|
|    min|Affluent senior a...|                 1|                 1|20-30 years|      Average Family|             1361|                 0|
|    25%|                nu

### Working with a subset of a dataframe

In [27]:
# Select statement
df.select(['Customer_subtype', 'Avg_Salary']).show()

+--------------------+----------+
|    Customer_subtype|Avg_Salary|
+--------------------+----------+
|Lower class large...|     44905|
|Mixed small town ...|     37575|
|Mixed small town ...|     27915|
|Modern, complete ...|     19504|
|  Large family farms|     34943|
|    Young and rising|     13064|
|Large religious f...|     29090|
|Lower class large...|      6895|
|Lower class large...|     35497|
|     Family starters|     30800|
|       Stable family|     39157|
|Modern, complete ...|     40839|
|Lower class large...|     30008|
|        Mixed rurals|     37209|
|    Young and rising|     45361|
|Lower class large...|     45650|
|Traditional families|     18982|
|Mixed apartment d...|     30093|
|Young all america...|     27097|
|Low income catholics|     23511|
+--------------------+----------+
only showing top 20 rows



### Filter data using the filter function

In [28]:
# Filter data using a single condition
df.filter(df['Avg_Salary'] > 1000000).count()

128

In [35]:
# Show the filtered data
df.filter(df['Avg_Salary'] > 1000000).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
| High status seniors|               1|                 3|40-50 years|Successful hedonists|   4670288|    0|
| High status seniors|               1|                 3|50-60 years|Successful hedonists|   9561873|    0|
| High status seniors|               1|                 2|40-50 years|Successful hedonists|  18687005|    0|
| High status seniors|               1|                 2|40-50 years|Successful hedonists|  24139960|    0|
| High status seniors|               1|                 2|50-60 years|Successful hedonists|   6718606|    0|
|High Income, expe...|               1|                 3|40-50 years|Successful hedonists|  19347139|    0|
|High Income, expe.

In [36]:
# Composite filter: chained filter calls are combined with AND.
df.filter(df['Avg_Salary'] > 500000).filter(df['Number_of_houses'] > 2).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    596723|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    944444|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    788477|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    994077|    0|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+



### Filter data using the where clause

In [39]:
df.where( \
         (df['Avg_Salary'] > 500000) \
         & \
         (df['Number_of_houses'] > 2) \
        ).show()

+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|    Customer_subtype|Number_of_houses|Avg_size_household|    Avg_age|  Customer_main_type|Avg_Salary|label|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    596723|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    944444|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    788477|    0|
|Affluent senior a...|               3|                 2|50-60 years|Successful hedonists|    994077|    0|
+--------------------+----------------+------------------+-----------+--------------------+----------+-----+

