## Practice Project - Create a machine learning pipeline for a regression project


## Scenario


You are a data engineer at a data analytics consulting company. Your company prides itself on handling huge datasets efficiently. Data scientists in your office need to work with different algorithms and with data in different formats. While they are good at machine learning, they count on you to run ETL jobs and build ML pipelines.



## Objectives

In this four-part assignment you will:

- Part 1 ETL
  - Load a csv dataset
  - Remove duplicates if any
  - Drop rows with null values if any
  - Make transformations
  - Store the cleaned data in parquet format
- Part 2 Machine Learning Pipeline creation
  - Create a machine learning pipeline for prediction
- Part 3 Model evaluation
  - Evaluate the model using metrics
  - Print the intercept and the coefficients
- Part 4 Model Persistence
  - Save the model for future production use
  - Load and verify the stored model


## Datasets

In this lab you will use the following dataset:

 - Modified version of car mileage dataset. Original dataset available at https://archive.ics.uci.edu/ml/datasets/auto+mpg



----


## Setup


In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q



### Importing Required Libraries



In [2]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Part 1 - ETL


### Task 1 - Import required libraries


In [44]:
from pyspark.sql import SparkSession

from pyspark.ml.feature import StringIndexer, StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator

from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel


### Task 2 - Create a spark session


In [4]:
#Create SparkSession

spark = SparkSession.builder.appName("Practice Project").getOrCreate()


### Task 3 - Load the csv file into a dataframe


Download the data file


In [5]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg-raw.csv


--2024-06-21 10:49:42--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg-raw.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14354 (14K) [text/csv]
Saving to: ‘mpg-raw.csv’


2024-06-21 10:49:43 (305 MB/s) - ‘mpg-raw.csv’ saved [14354/14354]



Load the dataset into the spark dataframe


In [6]:
# Load dataset
df = spark.read.csv("mpg-raw.csv", header=True, inferSchema=True)



### Task 4 - Print top 5 rows of the dataset


In [8]:
df.show(5)


+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|46.6|        4|       86.0|        65|  2110|      17.9|  80|Japanese|
|44.6|        4|       91.0|        67|  1850|      13.8|  80|Japanese|
|44.3|        4|       90.0|        48|  2085|      21.7|  80|European|
|44.0|        4|       97.0|        52|  2130|      24.6|  82|European|
|43.4|        4|       90.0|        48|  2335|      23.7|  80|European|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows



### Task 5 - Print the number of cars in each Origin


In [9]:
df.groupBy("Origin").count().show()


+--------+-----+
|  Origin|count|
+--------+-----+
|European|   70|
|    null|    1|
|Japanese|   88|
|American|  247|
+--------+-----+



### Task 6 - Print the total number of rows in the dataset


In [10]:
# Count the rows in the raw dataset
rowcount1 = df.count()
print(rowcount1)

406


### Task 7 - Drop all the duplicate rows from the dataset


In [11]:
df = df.dropDuplicates()


### Task 8 - Print the total number of rows in the dataset


In [12]:
# Count the rows left after dropping duplicates
rowcount2 = df.count()
print(rowcount2)

392


### Task 9 - Drop all the rows that contain null values from the dataset


In [13]:
df = df.dropna()

### Task 10 - Print the total number of rows in the dataset


In [21]:
# Count the rows left after dropping duplicates and rows with nulls
rowcount3 = df.count()
print(rowcount3)

385
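
The three counts recorded so far show how many rows each cleaning step removed. A minimal arithmetic sketch, reusing the values printed above (the names `duplicates_removed` and `null_rows_removed` are introduced here only for illustration):

```python
# Row counts printed by Tasks 6, 8 and 10 above
rowcount1 = 406  # raw CSV
rowcount2 = 392  # after dropDuplicates()
rowcount3 = 385  # after dropna()

duplicates_removed = rowcount1 - rowcount2
null_rows_removed = rowcount2 - rowcount3

print("Duplicate rows removed:", duplicates_removed)
print("Rows with nulls removed:", null_rows_removed)
print("Share of rows kept:", round(rowcount3 / rowcount1, 3))
```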


### Task 11 - Rename the column "Engine Disp" to "Engine_Disp"


In [15]:
# Rename the column so its name no longer contains a space

df = df.withColumnRenamed("Engine Disp", "Engine_Disp")

df.show(5)

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|24.0|        4|      134.0|        96|  2702|      13.5|  75|Japanese|
|18.0|        6|      250.0|        88|  3139|      14.5|  71|American|
|29.0|        4|       68.0|        49|  1867|      19.5|  73|European|
|22.4|        6|      231.0|       110|  3415|      15.8|  81|American|
|20.5|        6|      231.0|       105|  3425|      16.9|  77|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows



### Task 12 - Save the dataframe in parquet format, name the file as "mpg-cleaned.parquet"


In [17]:
print("n. of partitions of df: ", df.rdd.getNumPartitions())

n. of partitions of df:  200


In [18]:
# Reduce the dataframe to 4 partitions so the parquet output is written as a few files
df = df.repartition(4)

In [19]:
df.write.mode("overwrite").parquet("mpg-cleaned.parquet")


In [20]:
print("n. of partitions of df: ", df.rdd.getNumPartitions())

n. of partitions of df:  4
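
The 200 partitions reported before repartitioning most likely come from the shuffle triggered by `dropDuplicates()`, since Spark's `spark.sql.shuffle.partitions` setting defaults to 200. Spark writes one `part-*` file per non-empty partition, so after `repartition(4)` the output directory would be expected to hold four data files. The sketch below checks this from plain Python. The helper name `count_part_files` is hypothetical, and the pattern assumes Spark's usual `part-*.parquet` file naming.

```python
import glob
import os


def count_part_files(parquet_dir):
    """Count the data files Spark wrote into a parquet output directory."""
    # Assumes data files are named part-*.parquet; _SUCCESS markers and
    # hidden .crc checksum files do not match this pattern.
    pattern = os.path.join(parquet_dir, "part-*.parquet")
    return len(glob.glob(pattern))


# Only inspect the directory if the write step above has already run
if os.path.isdir("mpg-cleaned.parquet"):
    print("data files:", count_part_files("mpg-cleaned.parquet"))
```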


#### Part 1 - Evaluation



Run the code cell below.<br>
If the code raises any errors, go back and review the code you have written.


In [22]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("Renamed column name = ", df.columns[2])

import os

print("mpg-cleaned.parquet exists :", os.path.isdir("mpg-cleaned.parquet"))

Part 1 - Evaluation
Total rows =  406
Total rows after dropping duplicate rows =  392
Total rows after dropping duplicate rows and rows with null values =  385
Renamed column name =  Engine_Disp
mpg-cleaned.parquet exists : True


## Part 2 - Machine Learning Pipeline creation


### Task 1 - Load data from "mpg-cleaned.parquet" into a dataframe


In [23]:
df = spark.read.parquet("mpg-cleaned.parquet")
rowcount4 = df.count()

In [24]:
#show top 5 rows
df.show(5)


+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|24.0|        4|      134.0|        96|  2702|      13.5|  75|Japanese|
|22.4|        6|      231.0|       110|  3415|      15.8|  81|American|
|13.0|        8|      350.0|       175|  4100|      13.0|  73|American|
|18.0|        3|       70.0|        90|  2124|      13.5|  73|Japanese|
|25.0|        6|      181.0|       110|  2945|      16.4|  82|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows



In [25]:
#print the schema of the dataframe
df.printSchema()


root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine_Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)



### Task 2 - Define the StringIndexer pipeline stage


In [26]:
# Stage - 1 Using StringIndexer convert the string column "Origin" into "OriginIndex"
indexer = StringIndexer(inputCol="Origin", outputCol="OriginIndex")
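
By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index 0.0. The sketch below mimics that ordering in plain Python. The counts are taken from the Task 5 output in Part 1 (before cleaning), and `index_labels` is a hypothetical helper, not part of PySpark. Note that `OriginIndex` is not among the columns passed to the `VectorAssembler` in the next task, so the regression in this lab does not actually use it.

```python
from collections import Counter


def index_labels(counts):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {label: float(i) for i, (label, _) in enumerate(ordered)}


# Counts from df.groupBy("Origin").count() in Part 1 (the null row is left out)
origin_counts = Counter({"American": 247, "Japanese": 88, "European": 70})
print(index_labels(origin_counts))
```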

### Task 3 - Define the VectorAssembler pipeline stage


In [27]:
# Stage 2 - assemble the input columns 'Cylinders','Engine_Disp','Horsepower','Weight','Accelerate','Year' into a single column "features"
inp_cols = ['Cylinders','Engine_Disp','Horsepower','Weight','Accelerate','Year']
assembler = VectorAssembler(inputCols=inp_cols, outputCol="features")

### Task 4 - Define the StandardScaler pipeline stage


In [28]:
# Stage 3 - scale the "features" using standard scaler and store in "scaledFeatures" column
scaler = StandardScaler(inputCol="features",outputCol="scaledFeatures")
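
With its defaults (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not center it. A plain-Python sketch of that arithmetic for a single column follows. The values in `cylinders` are made up for illustration, and `scale_column` is a hypothetical helper rather than a PySpark function.

```python
import statistics


def scale_column(values):
    """Divide every value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [v / std for v in values]


cylinders = [4, 4, 6, 8, 8, 4]  # illustrative values, not the real dataset
scaled = scale_column(cylinders)
print([round(v, 3) for v in scaled])
```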

### Task 5 - Define the Model creation pipeline stage


In [29]:
# Stage 4 - Create a LinearRegression stage to predict "MPG"

lr = LinearRegression(featuresCol="scaledFeatures", labelCol="MPG")

### Task 6 - Build the pipeline


In [31]:
# Build a pipeline using the above four stages

pipeline = Pipeline(stages=[indexer,assembler,scaler,lr])

In [35]:
pipeline.getStages()

[StringIndexer_30d527af6f18,
 VectorAssembler_30ff9147e4fd,
 StandardScaler_b80cf5a91aed,
 LinearRegression_2801daca705c]

### Task 7 - Split the data


In [32]:
# Split the data into training and testing sets with 70:30 split. Use 42 as seed
(trainingData, testingData) = df.randomSplit([0.7, 0.3], seed=42)
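
`randomSplit` assigns rows to splits by random sampling, so the 70:30 ratio holds only approximately, while the seed makes the assignment repeatable. The following is a rough pure-Python analogy. It does not reproduce Spark's random number generator or its exact row assignment, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(385))  # stand-ins for the 385 cleaned rows
train, test = random_split(rows, 0.7, seed=42)
print(len(train), len(test))  # close to, but not necessarily exactly, 70:30
```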

### Task 8 - Fit the pipeline


In [33]:
# Fit the pipeline using the training data

pipelineModel = pipeline.fit(trainingData)

#### Part 2 - Evaluation



Run the code cell below.<br>
If the code raises any errors, go back and review the code you have written.


In [34]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  385
Pipeline Stage 1 =  StringIndexer
Pipeline Stage 2 =  VectorAssembler
Pipeline Stage 3 =  StandardScaler
Label column =  MPG


## Part 3 - Model Evaluation


### Task 1 - Predict using the model


In [36]:
# Make predictions on testing data
predictions = pipelineModel.transform(testingData)

In [39]:
predictions.show(5)

+----+---------+-----------+----------+------+----------+----+--------+-----------+--------------------+--------------------+------------------+
| MPG|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|  Origin|OriginIndex|            features|      scaledFeatures|        prediction|
+----+---------+-----------+----------+------+----------+----+--------+-----------+--------------------+--------------------+------------------+
|11.0|        8|      318.0|       210|  4382|      13.5|  70|American|        0.0|[8.0,318.0,210.0,...|[4.88543879835397...| 9.759498081805237|
|12.0|        8|      429.0|       198|  4952|      11.5|  73|American|        0.0|[8.0,429.0,198.0,...|[4.88543879835397...| 7.985853771276446|
|13.0|        8|      302.0|       130|  3870|      15.0|  76|American|        0.0|[8.0,302.0,130.0,...|[4.88543879835397...|17.104458097769506|
|13.0|        8|      307.0|       130|  4098|      14.0|  72|American|        0.0|[8.0,307.0,130.0,...|[4.88543879835397...|12.45

### Task 2 - Print the MSE


In [38]:
# Evaluate the predictions with the mean squared error metric

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
print("Mean Squared Error (MSE) =", mse)


Mean Squared Error (MSE) = 12.49694027066151


### Task 3 - Print the MAE


In [40]:
# Evaluate the predictions with the mean absolute error metric

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print("Mean Absolute Error (MAE) =", mae)



Mean Absolute Error (MAE) = 2.6483267526123027


### Task 4 - Print the R-Squared (R2)


In [41]:
# Evaluate the predictions with the R-squared metric

evaluator = RegressionEvaluator(labelCol="MPG", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)


R Squared = 0.8069575453299896
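
The three metrics are simple to compute by hand, which can help when checking what `RegressionEvaluator` reports. The sketch below uses the first three label/prediction pairs from the `predictions.show(5)` output above, rounded to two decimals, so its results will not match the full test-set figures. On such a tiny slice R2 can even be strongly negative. The sketch also derives the RMSE from the MSE printed in Task 2. The function names are illustrative.

```python
import math


def mse(labels, preds):
    """Mean of the squared errors."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)


def mae(labels, preds):
    """Mean of the absolute errors."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)


def r2(labels, preds):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


labels = [11.0, 12.0, 13.0]
preds = [9.76, 7.99, 17.10]  # rounded from the first three displayed predictions

print("MSE:", round(mse(labels, preds), 3))
print("MAE:", round(mae(labels, preds), 3))
print("R2 :", round(r2(labels, preds), 3))

# RMSE is the square root of the MSE and is in the same units as MPG
print("RMSE for the full test set:", round(math.sqrt(12.49694027066151), 3))
```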


#### Part 3 - Evaluation



Run the code cell below.<br>
If the code raises any errors, go back and review the code you have written.


In [42]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipelineModel.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))


Part 3 - Evaluation
Mean Squared Error =  12.5
Mean Absolute Error =  2.65
R Squared =  0.81
Intercept =  -11.73


## Part 4 - Model persistence


### Task 1 - Save the model to the path "Practice_Project"


In [43]:
# Save the pipeline model. write().overwrite().save() creates the target
# directory itself, so a separate mkdir is not needed.
pipelineModel.write().overwrite().save("./Practice_Project")


### Task 2 - Load the model from the path "Practice_Project"


In [45]:
# Load the pipeline model
loadedPipelineModel = PipelineModel.load("./Practice_Project")


### Task 3 - Make predictions using the loaded model on the test data


In [51]:
# Use the loaded pipeline model for predictions
predictions = loadedPipelineModel.transform(testingData)


### Task 4 - Show the predictions


In [52]:
# Show the first 5 predictions made by the loaded model
predictions.select("prediction").show(5)

+------------------+
|        prediction|
+------------------+
| 9.759498081805237|
| 7.985853771276446|
|17.104458097769506|
|12.454601366798265|
| 9.889362696628698|
+------------------+
only showing top 5 rows



#### Part 4 - Evaluation



Run the code cell below.<br>
If the code raises any errors, go back and review the code you have written.


In [53]:
print("Part 4 - Evaluation")

loadedmodel = loadedPipelineModel.stages[-1]
totalstages = len(loadedPipelineModel.stages)
inputcolumns = loadedPipelineModel.stages[1].getInputCols()

print("Number of stages in the pipeline = ", totalstages)
for i,j in zip(inputcolumns, loadedmodel.coefficients):
    print(f"Coefficient for {i} is {round(j,4)}")

Part 4 - Evaluation
Number of stages in the pipeline =  4
Coefficient for Cylinders is -0.2365
Coefficient for Engine_Disp is 0.5697
Coefficient for Horsepower is 0.3963
Coefficient for Weight is -6.2209
Coefficient for Accelerate is 0.1962
Coefficient for Year is 2.6642
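
Because the model was fitted on `scaledFeatures`, each coefficient describes the change in predicted MPG per one standard deviation of the feature, not per raw unit. On this scale, Weight (-6.2209) and Year (2.6642) have the largest coefficients. The standard deviation that the scaler used can be recovered from the output shown earlier: a raw `Cylinders` value of 8 appears as roughly 4.88543879835397 in `scaledFeatures`. A minimal sketch of the conversion, using only numbers printed in this notebook (the variable names are illustrative):

```python
raw_cylinders = 8
scaled_cylinders = 4.88543879835397  # first entry of scaledFeatures (truncated in the display)
coef_scaled = -0.2365  # printed coefficient for Cylinders

# StandardScaler with withMean=false divides by the standard deviation,
# so std = raw / scaled
std_cylinders = raw_cylinders / scaled_cylinders

# Change in predicted MPG per additional cylinder, other features held fixed
coef_per_cylinder = coef_scaled / std_cylinders

print("Std of Cylinders in the training data ~", round(std_cylinders, 4))
print("MPG change per extra cylinder ~", round(coef_per_cylinder, 4))
```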


In [54]:
loadedPipelineModel.stages

[StringIndexerModel: uid=StringIndexer_30d527af6f18, handleInvalid=error,
 VectorAssembler_30ff9147e4fd,
 StandardScalerModel: uid=StandardScaler_b80cf5a91aed, numFeatures=6, withMean=false, withStd=true,
 LinearRegressionModel: uid=LinearRegression_2801daca705c, numFeatures=6]

### Task 5 - Stop Spark Session


In [55]:
spark.stop()