## ETL using Spark


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li>
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Create-a-Dataframe-from-the-raw-data-and-write-to-CSV-file.">Task 1 - Create a Dataframe from the raw data and write to CSV file.
      </a>
    </li>
    <li>
      <a href="#Task-2---Read-from-a-csv-file-and-write-to-parquet-file">Task 2 - Read from a csv file and write to parquet file
      </a>
    </li>
    <li>
      <a href="#Task-3---Condense-PARQUET-to-a-single-file.">Task 3 - Condense PARQUET to a single file.
      </a>
    </li>
    <li>
      <a href="#Task-4---Read-from-a-parquet-file-and-write-to-csv-file">Task 4 - Read from a parquet file and write to csv file
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Extract">Exercise 1 - Extract
        </a>
      </li>
      <li>
        <a href="#Exercise-2---Transform">Exercise 2 - Transform
        </a>
      </li>
      <li>
        <a href="#Exercise-3---Load">Exercise 3 - Load
        </a>
      </li>
    </ol>
  </li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Create a Spark DataFrame from raw data and write it to a CSV file.
 - Read from a CSV file and write to a Parquet file.
 - Condense Parquet output to a single file.
 - Read from a Parquet file and write to a CSV file.


----


## Setup

### Installing Required Libraries


In [None]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries



In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

from pyspark.sql import SparkSession


In [None]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("ETL using Spark").getOrCreate()

## Task 1 - Create a Dataframe from the raw data and write to CSV file.


In [None]:
#create a list of tuples
#each tuple contains the student id, height and weight
data = [("student1",64,90),
        ("student2",59,100),
        ("student3",69,95),
        ("",70,110),
        ("student5",60,80),
        ("student3",69,95),
        ("student6",62,85),
        ("student7",65,80),
        ("student7",65,80)]

# some rows are intentionally duplicated

In [None]:
#create a dataframe using createDataFrame and pass the data and the column names.

df = spark.createDataFrame(data, ["student","height_inches","weight_pounds"])

In [None]:
# show the data frame

df.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student1|           64|           90|
|student2|           59|          100|
|student3|           69|           95|
|        |           70|          110|
|student5|           60|           80|
|student3|           69|           95|
|student6|           62|           85|
|student7|           65|           80|
|student7|           65|           80|
+--------+-------------+-------------+



Write to csv file

>**Note: When you use the write method to save a DataFrame as CSV, Spark creates a directory rather than a single file. Spark is designed to run in a distributed manner across multiple nodes, so it saves the output as multiple part files inside that directory. The CSV data is in those part files.**
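
Because the output is a directory, a consumer outside Spark has to look inside it for the `part-*` files. The following is a minimal sketch using only the Python standard library. The helper name `read_spark_csv_dir` is hypothetical, and the sketch assumes the data was written with `header=True`, so that each part file starts with its own header row.

```python
import csv
from pathlib import Path


def read_spark_csv_dir(output_dir):
    """Collect rows from every part file in a Spark CSV output directory.

    Hypothetical helper: assumes each part file begins with a header row.
    """
    rows = []
    # Marker files such as _SUCCESS do not match the part-*.csv pattern.
    for part in sorted(Path(output_dir).glob("part-*.csv")):
        with part.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

Once the write cell below has run, `read_spark_csv_dir("student-hw.csv")` would return one dictionary per student row, with all values as strings.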


In [None]:
df.write.mode("overwrite").csv("student-hw.csv", header=True)

In [None]:
# Without .mode("overwrite"), df.write.csv("student-hw.csv", header=True) uses the default
# save mode, which raises an error if the output path already exists.

Verify the csv file


In [None]:
# Load student dataset
df2 = spark.read.csv("student-hw.csv", header=True, inferSchema=True)

# display dataframe
df2.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student5|           60|           80|
|student3|           69|           95|
|student6|           62|           85|
|student7|           65|           80|
|student7|           65|           80|
|student1|           64|           90|
|student2|           59|          100|
|student3|           69|           95|
|    null|           70|          110|
+--------+-------------+-------------+



## Task 2 - Read from a csv file and write to parquet file


In [None]:
# Load student dataset
df = spark.read.csv("student-hw.csv", header=True, inferSchema=True)

# display dataframe
df.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student5|           60|           80|
|student3|           69|           95|
|student6|           62|           85|
|student7|           65|           80|
|student7|           65|           80|
|student1|           64|           90|
|student2|           59|          100|
|student3|           69|           95|
|    null|           70|          110|
+--------+-------------+-------------+



In [None]:
# print the number of rows in the dataframe
df.count()

9

Drop Duplicates


In [None]:
df = df.dropDuplicates()

In [None]:
df.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student6|           62|           85|
|student3|           69|           95|
|student2|           59|          100|
|student7|           65|           80|
|    null|           70|          110|
|student1|           64|           90|
|student5|           60|           80|
+--------+-------------+-------------+



In [None]:
#Notice that the duplicates are removed

In [None]:
# print the number of rows in the dataframe
df.count()

7

Drop Null values


In [None]:
df=df.dropna()

In [None]:
#Observe the rows with null values getting dropped
df.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student6|           62|           85|
|student3|           69|           95|
|student2|           59|          100|
|student7|           65|           80|
|student1|           64|           90|
|student5|           60|           80|
+--------+-------------+-------------+
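
The two cleaning steps above can be mirrored in plain Python to check the row counts (9, then 7, then 6). This is only an illustration of the semantics on a local list, not how Spark executes them; it assumes the default arguments, where `dropDuplicates()` compares whole rows and `dropna()` removes a row if any value is null.

```python
# Rows as they appear after the CSV round trip: the empty student id reads back as null (None).
rows = [
    ("student5", 60, 80), ("student3", 69, 95), ("student6", 62, 85),
    ("student7", 65, 80), ("student7", 65, 80), ("student1", 64, 90),
    ("student2", 59, 100), ("student3", 69, 95), (None, 70, 110),
]

# Counterpart of dropDuplicates(): keep one copy of each fully identical row.
deduplicated = list(dict.fromkeys(rows))

# Counterpart of dropna() with default arguments: discard rows containing any null.
cleaned = [row for row in deduplicated if all(value is not None for value in row)]

print(len(rows), len(deduplicated), len(cleaned))  # 9 7 6
```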



Save to parquet file


In [None]:
#Write the data to a Parquet file
df.write.mode("overwrite").parquet("student-hw.parquet")

In [None]:
# Without .mode("overwrite"), df.write.parquet("student-hw.parquet") uses the default
# save mode, which raises an error if the output path already exists.

In [None]:
# verify that the parquet file(s) are created

In [None]:
!!ls -l student-hw.parquet

['total 28',
 '-rw-r--r-- 1 root root 501 Jun 20 11:27 part-00000-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00003-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00010-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00054-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00132-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00172-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root 945 Jun 20 11:27 part-00186-c5804ab0-ea3a-419b-8056-98d0d34fff24-c000.snappy.parquet',
 '-rw-r--r-- 1 root root   0 Jun 20 11:27 _SUCCESS']

Notice that the output contains several .parquet files.
- To improve parallelism, Spark splits each DataFrame into multiple partitions.
- When the data is saved in Parquet format, each non-empty partition is written as a separate file.
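
To check how many data files a write produced without reading the listing by eye, a small standard-library helper is enough. The function name `count_part_files` is hypothetical; the sketch assumes Spark's usual `part-` file name prefix and skips marker files such as `_SUCCESS` and hidden checksum files.

```python
from pathlib import Path


def count_part_files(output_dir, suffix=".parquet"):
    """Count the data files Spark wrote into an output directory."""
    return sum(
        1
        for path in Path(output_dir).iterdir()
        if path.name.startswith("part-") and path.name.endswith(suffix)
    )
```

For the listing shown above, `count_part_files("student-hw.parquet")` would return 7.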


## Task 3 - Condense PARQUET to a single file.


Reduce the number of partitions in the dataframe to one.


In [None]:
df = df.repartition(1)

Save to parquet file


In [None]:
#Write the data to a Parquet file
df.write.mode("overwrite").parquet("student-hw-single.parquet")

In [None]:
# Without .mode("overwrite"), df.write.parquet("student-hw-single.parquet") uses the default
# save mode, which raises an error if the output path already exists.

In [None]:
# verify that the parquet file(s) are created

In [None]:
!ls -l student-hw-single.parquet

total 4
-rw-r--r-- 1 root root 978 Jun 20 11:27 part-00000-466a134c-e5aa-492e-9eaf-42096b8302cf-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Jun 20 11:27 _SUCCESS


In [None]:
#Notice that there is only one .parquet file

## Task 4 - Read from a parquet file and write to csv file


In [None]:
df = spark.read.parquet("student-hw-single.parquet")

In [None]:
df.show()

+--------+-------------+-------------+
| student|height_inches|weight_pounds|
+--------+-------------+-------------+
|student6|           62|           85|
|student3|           69|           95|
|student2|           59|          100|
|student7|           65|           80|
|student1|           64|           90|
|student5|           60|           80|
+--------+-------------+-------------+



Transform the data


In [None]:
#import the expr function that helps in transforming the data
from pyspark.sql.functions import expr

Convert inches to centimeters


In [None]:
# Convert inches to centimeters
# Multiply the column height_inches by 2.54 to get a new column height_centimeters
df = df.withColumn("height_centimeters", expr("height_inches * 2.54"))
df.show()

+--------+-------------+-------------+------------------+
| student|height_inches|weight_pounds|height_centimeters|
+--------+-------------+-------------+------------------+
|student6|           62|           85|            157.48|
|student3|           69|           95|            175.26|
|student2|           59|          100|            149.86|
|student7|           65|           80|            165.10|
|student1|           64|           90|            162.56|
|student5|           60|           80|            152.40|
+--------+-------------+-------------+------------------+



Convert pounds to kilograms


In [None]:
# Convert pounds to kilograms
# Multiply weight_pounds by 0.453592 to get a new column weight_kg
df = df.withColumn("weight_kg", expr("weight_pounds * 0.453592"))
df.show()

+--------+-------------+-------------+------------------+---------+
| student|height_inches|weight_pounds|height_centimeters|weight_kg|
+--------+-------------+-------------+------------------+---------+
|student6|           62|           85|            157.48|38.555320|
|student3|           69|           95|            175.26|43.091240|
|student2|           59|          100|            149.86|45.359200|
|student7|           65|           80|            165.10|36.287360|
|student1|           64|           90|            162.56|40.823280|
|student5|           60|           80|            152.40|36.287360|
+--------+-------------+-------------+------------------+---------+
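
The trailing zeros in the output (for example 165.10 and 38.555320) are consistent with exact decimal arithmetic rather than binary floating point. As a rough cross-check, and assuming that is what is happening, Python's `decimal` module reproduces the same digits. The helper names below are hypothetical and exist only for this illustration.

```python
from decimal import Decimal

CM_PER_INCH = Decimal("2.54")
KG_PER_POUND = Decimal("0.453592")


def to_centimeters(inches):
    """Convert whole inches to centimeters with exact decimal arithmetic."""
    return inches * CM_PER_INCH


def to_kilograms(pounds):
    """Convert whole pounds to kilograms with exact decimal arithmetic."""
    return pounds * KG_PER_POUND


print(to_centimeters(65))  # 165.10
print(to_kilograms(85))    # 38.555320
```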



Drop the columns


In [None]:
# drop the columns "height_inches","weight_pounds"
df = df.drop("height_inches","weight_pounds")
df.show()

+--------+------------------+---------+
| student|height_centimeters|weight_kg|
+--------+------------------+---------+
|student6|            157.48|38.555320|
|student3|            175.26|43.091240|
|student2|            149.86|45.359200|
|student7|            165.10|36.287360|
|student1|            162.56|40.823280|
|student5|            152.40|36.287360|
+--------+------------------+---------+



Rename a column


In [None]:
# rename the lengthy column name "height_centimeters" to "height_cm"
df = df.withColumnRenamed("height_centimeters","height_cm")
df.show()

+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6|   157.48|38.555320|
|student3|   175.26|43.091240|
|student2|   149.86|45.359200|
|student7|   165.10|36.287360|
|student1|   162.56|40.823280|
|student5|   152.40|36.287360|
+--------+---------+---------+



Save to csv file


In [None]:
df.write.mode("overwrite").csv("student_transformed.csv", header=True)

Verify the csv file


In [None]:
# Load student dataset
df = spark.read.csv("student_transformed.csv", header=True, inferSchema=True)
# display dataframe
df.show()

+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6|   157.48| 38.55532|
|student3|   175.26| 43.09124|
|student2|   149.86|  45.3592|
|student7|    165.1| 36.28736|
|student1|   162.56| 40.82328|
|student5|    152.4| 36.28736|
+--------+---------+---------+



Stop Spark Session


In [None]:
spark.stop()

## Exercises


Create Spark Session


In [None]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Exercises - ETL using Spark").getOrCreate()

### Exercise 1 - Extract


Load data from student_transformed.csv into a dataframe


In [None]:
# Load student dataset
df = spark.read.csv("student_transformed.csv",header=True, inferSchema=True)
# display dataframe
df.show()

+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6|   157.48| 38.55532|
|student3|   175.26| 43.09124|
|student2|   149.86|  45.3592|
|student7|    165.1| 36.28736|
|student1|   162.56| 40.82328|
|student5|    152.4| 36.28736|
+--------+---------+---------+



### Exercise 2 - Transform


Convert cm to meters


In [None]:
#import the expr function that helps in transforming the data
from pyspark.sql.functions import expr

In [None]:
# Convert centimeters to meters
df = df.withColumn("height_m", expr("height_cm / 100"))
# display dataframe
df.show()

+--------+---------+---------+------------------+
| student|height_cm|weight_kg|          height_m|
+--------+---------+---------+------------------+
|student6|   157.48| 38.55532|            1.5748|
|student3|   175.26| 43.09124|            1.7526|
|student2|   149.86|  45.3592|1.4986000000000002|
|student7|    165.1| 36.28736|             1.651|
|student1|   162.56| 40.82328|            1.6256|
|student5|    152.4| 36.28736|             1.524|
+--------+---------+---------+------------------+
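
The value 1.4986000000000002 for student2 is an ordinary binary floating-point artifact, not a data error. The sketch below shows the same effect in plain Python, assuming the `height_cm` column was read back from CSV as a double; the exact-decimal variant is included only for comparison.

```python
from decimal import Decimal

# Binary floating point: 149.86 has no exact double representation,
# so the quotient can carry a small error in its last digits.
print(149.86 / 100)

# Exact decimal arithmetic gives the expected digits.
print(Decimal("149.86") / 100)  # 1.4986
```

If the extra digits matter for presentation, rounding the column to a fixed number of decimal places is one option.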



Create a column named bmi


In [None]:
# compute bmi using the below formula
# BMI = weight/(height * height)
# weight must be in kgs
# height must be in meters
df = df.withColumn("bmi", expr("weight_kg / (height_m * height_m)"))
# display dataframe
df.show()

+--------+---------+---------+------------------+------------------+
| student|height_cm|weight_kg|          height_m|               bmi|
+--------+---------+---------+------------------+------------------+
|student6|   157.48| 38.55532|            1.5748|15.546531093062187|
|student3|   175.26| 43.09124|            1.7526|14.028892161964118|
|student2|   149.86|  45.3592|1.4986000000000002|20.197328530250278|
|student7|    165.1| 36.28736|             1.651|13.312549228648752|
|student1|   162.56| 40.82328|            1.6256|15.448293591899683|
|student5|    152.4| 36.28736|             1.524|15.623755691955827|
+--------+---------+---------+------------------+------------------+



Drop the columns height_cm, weight_kg and height_m


In [None]:
# Drop the columns height_cm, weight_kg and height_m
# (drop() silently ignores names that do not match an existing column)
df = df.drop("height_cm", "weight_kg", "height_m")
# display dataframe
df.show()

+--------+------------------+
| student|               bmi|
+--------+------------------+
|student6|15.546531093062187|
|student3|14.028892161964118|
|student2|20.197328530250278|
|student7|13.312549228648752|
|student1|15.448293591899683|
|student5|15.623755691955827|
+--------+------------------+



In [None]:
# Let us round the column bmi
# Import round under an alias so it does not shadow Python's built-in round()
from pyspark.sql.functions import col, round as spark_round
df = df.withColumn("bmi_rounded", spark_round(col("bmi")))
df.show()

+--------+------------------+-----------+
| student|               bmi|bmi_rounded|
+--------+------------------+-----------+
|student6|15.546531093062187|       16.0|
|student3|14.028892161964118|       14.0|
|student2|20.197328530250278|       20.0|
|student7|13.312549228648752|       13.0|
|student1|15.448293591899683|       15.0|
|student5|15.623755691955827|       16.0|
+--------+------------------+-----------+
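
One detail worth knowing when comparing these results with plain Python: PySpark documents its `round` function as using HALF_UP rounding, whereas Python's built-in `round` sends ties to the nearest even number. None of the BMI values here is an exact tie, so both agree on this data. The sketch below, using a hypothetical helper `round_half_up`, shows where they would differ.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places=0):
    """Round a number half up, the mode PySpark documents for its round()."""
    quantum = Decimal(10) ** -places  # e.g. places=2 -> Decimal("0.01")
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)


print(round(2.5))                          # 2 (built-in: ties go to the even neighbor)
print(round_half_up(2.5))                  # 3
print(round_half_up(15.546531093062187))   # 16
```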



### Exercise 3 - Load


Save the dataframe into a parquet file


In [None]:
#Write the data to a Parquet file, set the mode to overwrite
df.write.mode("overwrite").parquet("student-hw-bmi.parquet")

Stop Spark Session


In [None]:
spark.stop()