<a id='tablecontents'></a>

# PySpark for Data Analysis on Jupyter Notebook - Foundations
<h5>2023, Andrea Paviglianiti</h5>

<hr>

## Table of Contents:

- [Dependencies](#section1)
- [Load Dataset](#section2)
- [Use PySpark in a function](#section3)
- [Data Cleansing with PySpark](#section4)
- [Data Analysis with PySpark](#section5)
- [Summary of functions used](#summary)
- [Next Steps](#section6)

<br>
<hr>

<a id='section1'></a>

## Dependencies

In [1]:
import pandas as pd
import numpy as np
import os

### Apache Spark: PySpark Dependencies

In [2]:
import findspark
findspark.init()
findspark.find()

'/Users/apavigli/opt/anaconda3/lib/python3.9/site-packages/pyspark'

In [3]:
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

# Import PySpark
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession that uses all available cores
#spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Import all functions from pyspark.sql.functions to explore and manipulate data
# Note: this wildcard import shadows Python built-ins such as sum, min, max, abs and round
from pyspark.sql.functions import *

# Show the Spark Session
spark

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/21 15:41:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/21 15:41:50 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/06/21 15:41:50 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [4]:
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns = ["language","users_count"]

#Create a DataFrame using PySpark
df0 = spark.createDataFrame(data).toDF(*columns)

#SHOW the dataframe
df0.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



                                                                                

<br>

[Back to Table of Contents](#tablecontents)

<a id='section2'></a>

## Load Dataset

In [5]:
directory = r'input/original.csv'

#Read csv file using PySpark
df = spark.read.format("csv").option("header", "true").load(directory)

#### Method Breakdown:

- `spark.read` = returns a `DataFrameReader`, the entry point for reading files
- `.format("csv")` = specifies the file type
- `.option("header", "true")` = specifies that the first row of the file contains the column names
- `.load(my_file)` = loads the file at the given path (the path can also be a directory of files)

`spark.read.format("csv").option("header", "true").load(directory)`

In [6]:
#Show only first 5 rows
df.show(5)

+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
| id|first_name| last_name|gender|         City|            JobTitle|   Salary|  Latitude|  Longitude|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
|  1|   Melinde| Shilburne|Female|    Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|
|  2|  Kimberly|Von Welden|Female|       Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|
|  3|    Alvera|  Di Boldi|Female|         null|                null|$57576.52|39.9947462|116.3397725|
|  4|   Shannon| O'Griffin|  Male|Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|
|  5|  Sherwood|   Macieja|  Male|    Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
only showing top 5 rows



<br>

[Back to Table of Contents](#tablecontents)

<a id='section3'></a>

## Use PySpark in a function

In [7]:
#Create a function that loads a csv as a spark dataframe, so the read call does not need to be repeated
def spark_df(file_dir):
    df = spark.read.format("csv").option("header","true").load(file_dir)
    return df

#What if the header is not specified?
def spark_df_wrong(file_dir):
    df = spark.read.format("csv").load(file_dir)
    return df

In [8]:
fd = spark_df(directory)
fd.show(5)

+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
| id|first_name| last_name|gender|         City|            JobTitle|   Salary|  Latitude|  Longitude|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
|  1|   Melinde| Shilburne|Female|    Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|
|  2|  Kimberly|Von Welden|Female|       Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|
|  3|    Alvera|  Di Boldi|Female|         null|                null|$57576.52|39.9947462|116.3397725|
|  4|   Shannon| O'Griffin|  Male|Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|
|  5|  Sherwood|   Macieja|  Male|    Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
only showing top 5 rows



In [9]:
fd1 = spark_df_wrong(directory)
fd1.show(5)

+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
|_c0|       _c1|       _c2|   _c3|          _c4|                 _c5|      _c6|       _c7|        _c8|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
| id|first_name| last_name|gender|         City|            JobTitle|   Salary|  Latitude|  Longitude|
|  1|   Melinde| Shilburne|Female|    Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|
|  2|  Kimberly|Von Welden|Female|       Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|
|  3|    Alvera|  Di Boldi|Female|         null|                null|$57576.52|39.9947462|116.3397725|
|  4|   Shannon| O'Griffin|  Male|Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|
+---+----------+----------+------+-------------+--------------------+---------+----------+-----------+
only showing top 5 rows



<b>When the header option isn't specified, two things occur:</b>
1. The header row is treated as the first data row;
2. Column names are generated automatically as `_c0`, `_c1`, ..., `_cN`

<br>

[Back to Table of Contents](#tablecontents)

<a id='section4'></a>

## Data Cleansing with PySpark

A key difference between `pandas` and `pyspark` in data manipulation is that <b>pyspark dataframes are immutable.</b>

This means that an operation never changes a dataframe in place: <u>it returns a new dataframe</u>, which must be assigned to a variable to be kept.

In [10]:
df.show()

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+
| id|first_name| last_name|gender|           City|            JobTitle|   Salary|  Latitude|  Longitude|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|
|  3|    Alvera|  Di Boldi|Female|           null|                null|$57576.52|39.9947462|116.3397725|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil Engineer|$30101.16|53.4266145| -6.1644997|
|  7|     Masha|    Divers|Female|         Dachun|     

### Substitute Null Values

In [11]:
#Deal with Null values in City Column
df2 = df.withColumn("clean_city", when(df.City.isNull(), "Unknown").otherwise(df.City))
df2.show()

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+
| id|first_name| last_name|gender|           City|            JobTitle|   Salary|  Latitude|  Longitude|     clean_city|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|      Nowa Ruda|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|         Bulgan|
|  3|    Alvera|  Di Boldi|Female|           null|                null|$57576.52|39.9947462|116.3397725|        Unknown|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|  Divnomorskoye|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|      Mytishchi|
|  6|     Maris|      Folk|Femal

#### Breaking down the method

1. `df.withColumn("new_column", when(...).otherwise(...))` : it works like an IF/ELSE condition. The new column takes a new value where the condition is met; otherwise it keeps the value of the original column

2. Method: `when(df.City.isNull(), "Unknown")` : specifies that WHEN the condition is met, the value of that cell in the column will be "Unknown"

3. Method: `otherwise(df.City)` : specifies that when the condition is NOT met, the value of the cell does not change.

### Delete data entries

In [12]:
#Overwrite df2 by filtering out Null values for `JobTitle` column
df2 = df2.filter(df2.JobTitle.isNotNull())
df2.show()

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+
| id|first_name| last_name|gender|           City|            JobTitle|   Salary|  Latitude|  Longitude|     clean_city|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|      Nowa Ruda|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|         Bulgan|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|  Divnomorskoye|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|      Mytishchi|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil Engineer|$30101.16|53.4266145| -6.1644997|Kinsealy-Drinan|
|  8|   Goddart|     Flear|  Mal

You will notice that the rows with id <b>3</b> and <b>7</b> were filtered out.

#### Breaking down the method:

1. `df2.filter()` : it decides what to <u>keep</u>, that is, the rows for which the condition is true
2. `df2.JobTitle.isNotNull()` : rows with a non-null `JobTitle` stay, and rows with Null values are filtered out.

### Replace Values

The column `Salary` is currently a <b>string</b> type, and its values contain a leading `$` symbol that must be removed before converting them to numbers.

In [13]:
#Overwrite df2 and replace values
df2 = df2.withColumn("clean_salary", df2.Salary.substr(2,100).cast('float'))
df2.show(5)

+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+------------+
| id|first_name| last_name|gender|           City|            JobTitle|   Salary|  Latitude|  Longitude|     clean_city|clean_salary|
+---+----------+----------+------+---------------+--------------------+---------+----------+-----------+---------------+------------+
|  1|   Melinde| Shilburne|Female|      Nowa Ruda| Assistant Professor|$57438.18|50.5774075| 16.4967184|      Nowa Ruda|    57438.18|
|  2|  Kimberly|Von Welden|Female|         Bulgan|       Programmer II|$62846.60|48.8231572|103.5218199|         Bulgan|     62846.6|
|  4|   Shannon| O'Griffin|  Male|  Divnomorskoye|Budget/Accounting...|$61489.23|44.5047212| 38.1300171|  Divnomorskoye|    61489.23|
|  5|  Sherwood|   Macieja|  Male|      Mytishchi|            VP Sales|$63863.09|      null| 37.6489954|      Mytishchi|    63863.09|
|  6|     Maris|      Folk|Female|Kinsealy-Drinan|      Civil 

#### Breaking down the method:

1. `df2.withColumn("clean_salary", df2.Salary...)` : the new column is derived from an existing one
2. `df2.Salary.substr(2,100)` : takes up to 100 characters starting at position 2 (positions start at 1), so the leading `$` is skipped
3. `.cast('float')` : the new column is converted to float
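
To make the position and length arguments concrete, here is a plain-Python analogue of the same steps. The helper `spark_like_substr` is a hypothetical name written for this illustration and only mimics `substr` for positive start positions; it is not part of PySpark.

```python
def spark_like_substr(text, start_pos, length):
    """Mimic Column.substr for positive positions: 1-based start, then `length` characters."""
    begin = start_pos - 1  # Python slices are 0-based, Spark positions are 1-based
    return text[begin:begin + length]

raw_salary = "$57438.18"

# Position 2 skips the "$"; a length of 100 is simply larger than any salary string
without_symbol = spark_like_substr(raw_salary, 2, 100)
clean_salary = float(without_symbol)

print(without_symbol)  # 57438.18
print(clean_salary)    # 57438.18
```

The length of 100 acts as a generous upper bound rather than an end position, so shorter strings are returned whole.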

### Calculate Values

In [14]:
salary_mean = df2.groupBy().avg('clean_salary')
salary_max = df2.groupBy().max('clean_salary')
salary_min = df2.groupBy().min('clean_salary')

for calculation in [salary_mean, salary_max, salary_min]:
    calculation.show()

+-----------------+
|avg(clean_salary)|
+-----------------+
|55516.32088199837|
+-----------------+

+-----------------+
|max(clean_salary)|
+-----------------+
|         99948.28|
+-----------------+

+-----------------+
|min(clean_salary)|
+-----------------+
|         10101.92|
+-----------------+



Each result is returned as a one-row dataframe, but the value can be extracted.

In [15]:
this_mean = salary_mean.take(1)[0][0]
print(this_mean)

55516.32088199837


In [16]:
#Do it with a function
def spark_take_value(mydf):
    return mydf.take(1)[0][0]

In [17]:
for calculation in [salary_mean, salary_max, salary_min]:
    print(spark_take_value(calculation))

55516.32088199837
99948.28125
10101.919921875
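
The extra digits in the maximum and minimum come from `cast('float')`, which produces a 32-bit float: the nearest 32-bit value to `99948.28` is `99948.28125`. The sketch below reproduces that rounding with the standard library only. The helper `to_float32` is a hypothetical name introduced for this illustration.

```python
import struct

def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack("f", struct.pack("f", value))[0]

# The maximum and minimum salaries as shown in the dataframe output
max_salary = to_float32(99948.28)
min_salary = to_float32(10101.92)

print(max_salary)  # 99948.28125
print(min_salary)  # 10101.919921875
```

Casting with `cast('double')` instead would keep 64-bit precision, at the cost of more memory per value.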


To use a Python value stored in a variable as a column expression, wrap it with the <b>lit</b> function (short for <i>"literal"</i>).

For example, to replace null values:

In [18]:
from pyspark.sql.functions import lit
df2 = df2.withColumn("new_salary", when(df2.clean_salary.isNull(), lit(this_mean)).otherwise(df2.clean_salary))

#Check rows where Null values have been replaced by the mean
df2.filter(col('new_salary') == lit(this_mean)).show()

+---+----------+---------+------+----+--------+------+--------+---------+----------+------------+----------+
| id|first_name|last_name|gender|City|JobTitle|Salary|Latitude|Longitude|clean_city|clean_salary|new_salary|
+---+----------+---------+------+----+--------+------+--------+---------+----------+------------+----------+
+---+----------+---------+------+----+--------+------+--------+---------+----------+------------+----------+



We did not have any Null values in Salary, so there was nothing to replace.

However, the same approach can be used to filter values above or below a target value.

In [19]:
#Check all rows with value below the mean
print(this_mean)
df2.filter(col('new_salary') < lit(this_mean)).show(10)

55516.32088199837
+---+----------+---------+------+---------------+--------------------+---------+----------+-----------+---------------+------------+----------------+
| id|first_name|last_name|gender|           City|            JobTitle|   Salary|  Latitude|  Longitude|     clean_city|clean_salary|      new_salary|
+---+----------+---------+------+---------------+--------------------+---------+----------+-----------+---------------+------------+----------------+
|  6|     Maris|     Folk|Female|Kinsealy-Drinan|      Civil Engineer|$30101.16|53.4266145| -6.1644997|Kinsealy-Drinan|    30101.16|  30101.16015625|
|  8|   Goddart|    Flear|  Male|      Trélissac|Desktop Support T...|$46116.36|45.1905186|  0.7423124|      Trélissac|    46116.36|    46116.359375|
| 11|    Kylynn|  Lockart|Female|       El Cardo|Nuclear Power Eng...|$13604.63|     -5.85|-79.8833329|       El Cardo|    13604.63|13604.6298828125|
| 13|      Kerr|   Braden|  Male|      Sułkowice|Compensation Analyst|$33432.99|49

### Calculate the Median Value

Calculating the <b>median</b> in pyspark takes a few more steps than in pandas.

In pandas: `df['my_column'].median()` does the job.

In pyspark, one option is to collect the column to the driver and use numpy:

In [20]:
#Select a specific column from pyspark dataframe
latitudes = df2.select("Latitude")

latitudes.show(5)

+----------+
|  Latitude|
+----------+
|50.5774075|
|48.8231572|
|44.5047212|
|      null|
|53.4266145|
+----------+
only showing top 5 rows



In [21]:
#Filter out Null Values
latitudes = latitudes.filter(latitudes.Latitude.isNotNull())
latitudes.show(5)

+----------+
|  Latitude|
+----------+
|50.5774075|
|48.8231572|
|44.5047212|
|53.4266145|
|45.1905186|
+----------+
only showing top 5 rows



In [22]:
#Ensure the values are float:
latitudes = latitudes.withColumn('Latitude2', latitudes.Latitude.cast('float')).select('Latitude2')
latitudes.show(5)

+---------+
|Latitude2|
+---------+
|50.577408|
| 48.82316|
|44.504723|
|53.426613|
|45.190517|
+---------+
only showing top 5 rows



#### Break down the method:
1. We create a new column that is equal to `Latitude`. We name this `Latitude2`
2. The new column is cast as `float`
3. We select only the new column using the `select` method.

In [23]:
#Calculate the median
latitude_median = np.median(latitudes.collect())
print(latitude_median)

31.93397331237793


In [24]:
# Do it with a function 
def spark_extract_median(data, column):
    from pyspark.sql.functions import col      #Any string provided in the function can be used with the `col` method
    
    #Step 1: Select
    this_data = data.select(column)
    
    #Step 2: Remove Nulls (assign the result, because filter returns a new dataframe)
    this_data = this_data.filter(col(column).isNotNull())
    
    #Step 3: Convert to Float
    this_data = this_data.withColumn(
        "new_col", col(column).cast('float')
    ).select("new_col")
    
    #Step 4: Calculate the Median
    my_median = np.median(this_data.collect())
    return my_median

In [25]:
longitudes = df2.select("Longitude")
longitude_median = spark_extract_median(longitudes, "Longitude")
print(longitude_median)

36.85906219482422


In [26]:
#Replace values:
df2 = df2.withColumn('lat', when(df2.Latitude.isNull(), lit(latitude_median)).otherwise(df2.Latitude))
df2 = df2.withColumn('lon', when(df2.Longitude.isNull(), lit(longitude_median)).otherwise(df2.Longitude))

#Show all rows where Latitude or Longitude is Null.
df2.filter((df2.Latitude.isNull()) | (df2.Longitude.isNull())).show()

+---+----------+---------+------+---------+--------+---------+--------+----------+----------+------------+--------------+-----------------+----------+
| id|first_name|last_name|gender|     City|JobTitle|   Salary|Latitude| Longitude|clean_city|clean_salary|    new_salary|              lat|       lon|
+---+----------+---------+------+---------+--------+---------+--------+----------+----------+------------+--------------+-----------------+----------+
|  5|  Sherwood|  Macieja|  Male|Mytishchi|VP Sales|$63863.09|    null|37.6489954| Mytishchi|    63863.09|63863.08984375|31.93397331237793|37.6489954|
+---+----------+---------+------+---------+--------+---------+--------+----------+----------+------------+--------------+-----------------+----------+



<br>

[Back to Table of Contents](#tablecontents)

<a id='section5'></a>

## Data Analysis with PySpark

With PySpark we can run more <i>sql-like</i> analysis, including grouping and aggregations for specific target variables.

In [27]:
#Import sql functions
import pyspark.sql.functions as sqlfunc

In [28]:
#Group by Gender, & Aggregate by Salary Mean

In [29]:
df2.show(2)

+---+----------+----------+------+---------+-------------------+---------+----------+-----------+----------+------------+-------------+----------+-----------+
| id|first_name| last_name|gender|     City|           JobTitle|   Salary|  Latitude|  Longitude|clean_city|clean_salary|   new_salary|       lat|        lon|
+---+----------+----------+------+---------+-------------------+---------+----------+-----------+----------+------------+-------------+----------+-----------+
|  1|   Melinde| Shilburne|Female|Nowa Ruda|Assistant Professor|$57438.18|50.5774075| 16.4967184| Nowa Ruda|    57438.18|57438.1796875|50.5774075| 16.4967184|
|  2|  Kimberly|Von Welden|Female|   Bulgan|      Programmer II|$62846.60|48.8231572|103.5218199|    Bulgan|     62846.6|62846.6015625|48.8231572|103.5218199|
+---+----------+----------+------+---------+-------------------+---------+----------+-----------+----------+------------+-------------+----------+-----------+
only showing top 2 rows



### Single and Multiple Aggregation

In [30]:
genders = df2.groupBy('gender').agg(sqlfunc.avg('new_salary'))
genders.show()

+------+------------------+
|gender|   avg(new_salary)|
+------+------------------+
|Female|55677.250125558036|
|  Male| 55361.09385573019|
+------+------------------+



In [31]:
#Double Aggregation, Sorted by target column
doubleagg = df2.groupBy('JobTitle','gender').agg(sqlfunc.avg('new_salary')).orderBy('JobTitle','gender')   #in orderBy you can use `ascending=False`
doubleagg.show()

+--------------------+------+------------------+
|            JobTitle|gender|   avg(new_salary)|
+--------------------+------+------------------+
| Account Coordinator|Female|     46707.4453125|
| Account Coordinator|  Male|51446.623697916664|
|   Account Executive|Female|   52020.779296875|
|   Account Executive|  Male|    65415.96796875|
|Account Represent...|Female|51116.973307291664|
|Account Represent...|Female|    41786.91015625|
|Account Represent...|  Male|    40562.69921875|
|Account Represent...|  Male|     22420.7109375|
|       Accountant II|Female|  50354.8154296875|
|      Accountant III|Female|  15589.5595703125|
|      Accountant III|  Male|39183.798177083336|
|       Accountant IV|Female|   82732.248046875|
|Accounting Assist...|Female|  58916.0830078125|
|Accounting Assist...|  Male|     59255.4296875|
|Accounting Assist...|Female|44071.866536458336|
|Accounting Assist...|  Male|   18795.439453125|
|Accounting Assist...|Female|      57337.484375|
|             Actuar

### Separate values based on condition / target variable and compare statistics

Currently, the overall average salary is calculated from all entries, regardless of gender. Ignoring gender can hide differences between groups and introduce bias if the data is later used for ML.

Therefore:
- we will separate women's and men's salaries,
- we will calculate the averages separately,
- we will compare them.


In [32]:
#Create a new dataframe, based on df2
dfg = df2.withColumn('f_salary', when(df2.gender=='Female', df2.new_salary).otherwise(lit(0)))

#Overwrite current dataframe to add new column
dfg = dfg.withColumn('m_salary', when(dfg.gender=='Male', df2.new_salary).otherwise(lit(0)))

#View results
dfg.select('id','gender','new_salary','f_salary','m_salary').show(5)

+---+------+--------------+--------------+--------------+
| id|gender|    new_salary|      f_salary|      m_salary|
+---+------+--------------+--------------+--------------+
|  1|Female| 57438.1796875| 57438.1796875|           0.0|
|  2|Female| 62846.6015625| 62846.6015625|           0.0|
|  4|  Male|61489.23046875|           0.0|61489.23046875|
|  5|  Male|63863.08984375|           0.0|63863.08984375|
|  6|Female|30101.16015625|30101.16015625|           0.0|
+---+------+--------------+--------------+--------------+
only showing top 5 rows



In [33]:
# Group data by JobTitle and aggregate by mean of salary/gender. 
# Provide also an alias to new columns to make them more understandable.
# In this case, Job Title does not repeat itself, the column will have unique entries compared to when `gender` variable was used to aggregate
genda = dfg.groupBy('JobTitle').agg(sqlfunc.avg('f_salary').alias('female_salary'), sqlfunc.avg('m_salary').alias('male_salary'))
genda.show()

+--------------------+------------------+------------------+
|            JobTitle|     female_salary|       male_salary|
+--------------------+------------------+------------------+
|Systems Administr...|   50590.474609375|  15540.9501953125|
|   Media Manager III|29586.436197916668|17381.920572916668|
|  Recruiting Manager|34848.452473958336|  26383.4951171875|
|       Geologist III|      31749.046875|    12830.75390625|
|        Geologist II|               0.0|   43293.865234375|
|Database Administ...|               0.0|     52018.4609375|
|   Financial Analyst|   23353.776953125|       39606.05625|
|  Analyst Programmer|  16406.1287109375|  21042.9634765625|
|Software Engineer II|               0.0|      74782.640625|
|       Accountant IV|   82732.248046875|               0.0|
|    Product Engineer|    41825.48359375|       20464.94375|
|Software Test Eng...|  32218.6083984375|   27122.462890625|
|Safety Technician...|               0.0|   29421.529296875|
|    Junior Executive|15

### Use case: Gender Pay Gap by Job Title

In [34]:
# Calculate the `delta` between female and male salary
genda = genda.withColumn('delta', genda.female_salary-genda.male_salary)
genda.show(5)

+--------------------+------------------+------------------+-----------------+
|            JobTitle|     female_salary|       male_salary|            delta|
+--------------------+------------------+------------------+-----------------+
|Systems Administr...|   50590.474609375|  15540.9501953125| 35049.5244140625|
|   Media Manager III|29586.436197916668|17381.920572916668|     12204.515625|
|  Recruiting Manager|34848.452473958336|  26383.4951171875|8464.957356770836|
|       Geologist III|      31749.046875|    12830.75390625|   18918.29296875|
|        Geologist II|               0.0|   43293.865234375| -43293.865234375|
+--------------------+------------------+------------------+-----------------+
only showing top 5 rows
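
One caveat when reading `female_salary` and `male_salary`: because `.otherwise(lit(0))` fills the other gender's rows with 0.0, each average is divided by all rows of the job title, not only by the rows of that gender. The arithmetic can be sketched in plain Python. The salaries below are hypothetical and not taken from the dataset.

```python
from statistics import mean

# Hypothetical salaries for a single job title (not from the dataset)
rows = [("Female", 50000.0), ("Female", 60000.0), ("Male", 40000.0)]

# Zero-filled column, as produced by `.otherwise(lit(0))`
f_salary_zero_filled = [salary if gender == "Female" else 0.0 for gender, salary in rows]
avg_with_zeros = mean(f_salary_zero_filled)

# Only the female rows, which is what an average that skips nulls would use
f_salary_only = [salary for gender, salary in rows if gender == "Female"]
avg_female_only = mean(f_salary_only)

print(avg_with_zeros)   # about 36666.67: two female salaries divided by three rows
print(avg_female_only)  # 55000.0
```

In PySpark, `when(...)` without `.otherwise(...)` leaves null in non-matching rows, and `avg` skips nulls. Dropping `.otherwise(lit(0))` would therefore give averages per gender, and job titles with no rows for one gender would show null instead of 0.0.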



<br>
<b>Question 1a:</b> What are the top 5 jobs with the highest average female salary?

In [35]:
genda.orderBy('female_salary', ascending=False).show(5)

+--------------------+---------------+-----------+---------------+
|            JobTitle|  female_salary|male_salary|          delta|
+--------------------+---------------+-----------+---------------+
|Programmer Analys...|   88029.109375|        0.0|   88029.109375|
|Computer Systems ...|  87553.7265625|        0.0|  87553.7265625|
|     Media Manager I|  83143.3984375|        0.0|  83143.3984375|
|       Accountant IV|82732.248046875|        0.0|82732.248046875|
|Systems Administr...|    77059.21875|        0.0|    77059.21875|
+--------------------+---------------+-----------+---------------+
only showing top 5 rows



<br>
<b>Question 1b:</b> And what are they for the average male salary?

In [36]:
genda.orderBy('male_salary', ascending=False).show(5)

+--------------------+-------------+-----------------+------------------+
|            JobTitle|female_salary|      male_salary|             delta|
+--------------------+-------------+-----------------+------------------+
|Database Administ...|          0.0|    94743.2578125|    -94743.2578125|
|Computer Systems ...|          0.0|     93275.421875|     -93275.421875|
| Safety Technician I|          0.0|    90918.6171875|    -90918.6171875|
|Systems Administr...|          0.0|88168.77083333333|-88168.77083333333|
|      Health Coach I|          0.0|83999.41927083333|-83999.41927083333|
+--------------------+-------------+-----------------+------------------+
only showing top 5 rows



<br>
<b>Question 2a:</b> Based on delta, where are the biggest discrepancies?

In [37]:
# Filter only jobs where both male and female have avg. salary != 0
genda.filter((genda.female_salary !=0) & (genda.male_salary !=0)).orderBy('delta', ascending=False).show(5)

+--------------------+------------------+-----------------+-----------------+
|            JobTitle|     female_salary|      male_salary|            delta|
+--------------------+------------------+-----------------+-----------------+
|Software Test Eng...|     68684.3765625|     5185.9859375|63498.39062499999|
|     Web Designer II|           69477.0| 16183.1201171875| 53293.8798828125|
|Systems Administr...|     58661.3828125| 10665.5595703125| 47995.8232421875|
|       Social Worker|   51405.037890625|  13511.651171875|   37893.38671875|
|   Director of Sales|43417.360677083336|5873.821180555556|37543.53949652778|
+--------------------+------------------+-----------------+-----------------+
only showing top 5 rows



<b>Question 2b:</b> Which jobs have the smallest gender pay gap?

In [38]:
#Calculate mean for delta
avg_delta = genda.groupBy().avg('delta').take(1)[0][0]
print(avg_delta)

-810.3332432444215


In [39]:
#Use mean and its opposite to filter a range for target value
genda2 = genda.filter((genda.delta >= lit(avg_delta)) & (genda.delta <= lit(avg_delta)*-1))
genda2.show()

+--------------------+------------------+------------------+-----------------+
|            JobTitle|     female_salary|       male_salary|            delta|
+--------------------+------------------+------------------+-----------------+
|        Engineer III|33802.518229166664|33726.610677083336|75.90755208332848|
| Biostatistician III|   24033.947265625|        24631.0625|   -597.115234375|
|Account Represent...|   20893.455078125|   20281.349609375|     612.10546875|
+--------------------+------------------+------------------+-----------------+
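
The range filter above works because `avg_delta` happens to be negative, so `avg_delta` is the lower bound and its opposite is the upper bound. A sketch of a sign-independent variant compares the absolute value of delta with a threshold instead. The helper `within_symmetric_range` is a hypothetical name, and the deltas are copied from the outputs shown earlier.

```python
def within_symmetric_range(delta, threshold):
    """True when delta lies between -|threshold| and +|threshold|."""
    return abs(delta) <= abs(threshold)

# Deltas copied from the job-title outputs above, plus the printed mean
deltas = {
    "Engineer III": 75.90755208332848,
    "Biostatistician III": -597.115234375,
    "Geologist II": -43293.865234375,
}
avg_delta = -810.3332432444215

kept = [job for job, d in deltas.items() if within_symmetric_range(d, avg_delta)]

# With a positive mean, the original form `mean <= d <= -mean` would match nothing
positive_mean = 810.3332432444215
kept_original_form = [job for job, d in deltas.items() if positive_mean <= d <= -positive_mean]

print(kept)                # ['Engineer III', 'Biostatistician III']
print(kept_original_form)  # []
```

In the notebook the same idea could be written as `genda.filter(sqlfunc.abs(genda.delta) <= threshold)`, where `threshold` is a non-negative Python number.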



### Use Case: Gender Pay Gap by City

<br>
<b>Question 3:</b> Which city pays the best?

In [40]:
dfg.select('id','City', 'gender','new_salary','f_salary','m_salary').show(5)

+---+---------------+------+--------------+--------------+--------------+
| id|           City|gender|    new_salary|      f_salary|      m_salary|
+---+---------------+------+--------------+--------------+--------------+
|  1|      Nowa Ruda|Female| 57438.1796875| 57438.1796875|           0.0|
|  2|         Bulgan|Female| 62846.6015625| 62846.6015625|           0.0|
|  4|  Divnomorskoye|  Male|61489.23046875|           0.0|61489.23046875|
|  5|      Mytishchi|  Male|63863.08984375|           0.0|63863.08984375|
|  6|Kinsealy-Drinan|Female|30101.16015625|30101.16015625|           0.0|
+---+---------------+------+--------------+--------------+--------------+
only showing top 5 rows



In [41]:
# Group by city, and average salary
cityavg = dfg.groupBy('City').agg(sqlfunc.avg('new_salary').alias('avg_salary'))
cityavg = cityavg.sort(col('avg_salary').desc())    # Sort by Average Salary in descending order
cityavg.show(5)

+-----------+-------------+
|       City|   avg_salary|
+-----------+-------------+
|  Mesopotam|  99948.28125|
| Zhongcheng| 99942.921875|
|     Caxias|99786.3984375|
|Karangtawar|99638.9921875|
|  Itabaiana|  99502.15625|
+-----------+-------------+
only showing top 5 rows



<b>Question 4a:</b> In which cities is the gender pay gap most pronounced?

In [42]:
genda3 = dfg.groupBy('City').agg(sqlfunc.avg('f_salary').alias('female_salary'), sqlfunc.avg('m_salary').alias('male_salary'))
genda3 = genda3.withColumn('delta', genda3.female_salary - genda3.male_salary)
genda3.show(5)

+-----------+----------------+--------------+----------------+
|       City|   female_salary|   male_salary|           delta|
+-----------+----------------+--------------+----------------+
|  Sułkowice|             0.0|33432.98828125| -33432.98828125|
|    Klippan|     77039.46875|           0.0|     77039.46875|
|Trollhättan|53311.6845703125|           0.0|53311.6845703125|
|  Shinaihai|    39544.640625|           0.0|    39544.640625|
|   Hongzhou|  35707.30859375|           0.0|  35707.30859375|
+-----------+----------------+--------------+----------------+
only showing top 5 rows



In [43]:
max_delta = genda3.groupBy().max('delta').take(1)[0][0]
min_delta = genda3.groupBy().min('delta').take(1)[0][0]

print(f'Range with min {min_delta} & max {max_delta}')

Range with min -99942.921875 & max 99948.28125


In [44]:
genda3a = genda3.filter((genda3.delta == lit(min_delta)) | (genda3.delta == lit(max_delta)))
genda3a.show()

+----------+-------------+------------+-------------+
|      City|female_salary| male_salary|        delta|
+----------+-------------+------------+-------------+
|Zhongcheng|          0.0|99942.921875|-99942.921875|
| Mesopotam|  99948.28125|         0.0|  99948.28125|
+----------+-------------+------------+-------------+



<b>Question 4b:</b> What are the top 5 cities with the smallest gender pay gap?

In [45]:
avg_delta2 = genda3.groupBy().avg('delta').take(1)[0][0]
print(avg_delta2)

-736.0660494562759


In [46]:
genda3b = genda3.filter((genda3.delta >= lit(avg_delta2)) & (genda3.delta <= lit(avg_delta2)*-1))
genda3b.show()

+----+-------------+-----------+-----+
|City|female_salary|male_salary|delta|
+----+-------------+-----------+-----+
+----+-------------+-----------+-----+



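<b>Note:</b> 

No city has a delta between the average delta and its opposite (about ±736). A likely reason is that many cities show 0.0 for one of the two genders, so their delta equals a full average salary, and the lowest salary in the dataset is 10101.92.

It would be interesting to look at the distribution of delta to understand why no values fall within this range.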

In [47]:
genda3c = genda3.filter((genda3.delta >= -10200) & (genda3.delta <= 10200)).orderBy('delta')
genda3c.show()

+-----------+-----------------+----------------+-----------------+
|       City|    female_salary|     male_salary|            delta|
+-----------+-----------------+----------------+-----------------+
|Springfield| 15278.0595703125|    25419.265625|-10141.2060546875|
|     Gaoshi|              0.0| 10101.919921875| -10101.919921875|
|   Yongfeng|   43173.80078125|   37051.3046875|    6122.49609375|
|  Mieścisko|17656.75244140625|11236.7802734375| 6419.97216796875|
|    Atlanta|  20872.615234375|  12964.96484375|   7907.650390625|
+-----------+-----------------+----------------+-----------------+



<b>Yongfeng</b> has the most moderate gender pay gap among these cities.

A more complete study should look at the average salary by gender, by city, and by job title.

<br>

[Back to Table of Contents](#tablecontents)

<a id='summary'></a>

## Summary of functions used

<br>
<h5>Load data from lists of tuples</h5>

<br>
<h5>Load a csv file and show the Dataframe</h5>

<br>
<h5>Handle Null Values</h5>

<br>
<h5>Filter data</h5>

<br>
<h5>Select Substring</h5>

<br>
<h5>Cast as float</h5>

<br>
<h5>Select Substring AND cast as float</h5>

<br>
<h5>Calculate statistics from numerical column</h5>

<br>
<h5>Calculate Median (using NumPy)</h5>

<br>
<h5>Import pyspark.sql.functions</h5>

<br>
<h5>Select specific columns</h5>

<br>
<h5>Group by categorical variable</h5>

<br>
<h5>Group by categorical variable and aggregate by numerical variable (e.g. average salary by gender)</h5>

<br>
<h5>Order by target value</h5>

<br>
<h5>Multiaggregation with alias for new columns</h5>

<br>
<h5>Filter using AND (&) and OR (|) conditions</h5>

<br>
<h5>Use literal value (.lit method) for calculated values</h5>

<a id='section6'></a>

<h2> </h2>
<hr>

## Next Steps

Define in detail how to use the features of the PySpark library for:

1. Bringing data into dataframes
2. Inspecting a Dataframe
3. Handling Null & Duplicate values
4. Selecting and Filtering Data
5. Applying Multiple Filters
6. Running SQL on Dataframes
7. Adding Calculated Columns
8. Group By and Aggregation
9. Writing Dataframes to files

<h6><i> End of the Notebook</i></h6>

In [48]:
print('\n    End of the Notebook :)')


    End of the Notebook :)


<br>

[Back to Table of Contents](#tablecontents)