## *Practice Project on Machine Learning using Apache Spark*
The main goal of this project is to predict fuel efficiency (the `MPG` column) from the other columns. I will do that using SparkML to implement the machine learning algorithms.

#### What will be done to perform the task?

The project is split into four parts as follows:
1. Perform ETL activities: extract the data from a CSV file and load it into a Spark dataframe, apply transformations (removing duplicates and null values, if any, plus any other necessary changes), and finally store the cleaned data in Parquet format.

2. Create a machine learning pipeline with four stages, the last of which is the regression stage. The pipeline is the backbone of the model development process.

3. Evaluate the model using evaluation metrics appropriate to the algorithm we used.
4. The last part is to persist the model on the local machine and load it again to predict new real-world data. This helps with reusability and portability for future use.

### Part 1 (Clean and Transform data)

In this part, we will start by making the dataset ready for the model. This will be done through the following steps:
1. Load the data from the CSV file.
2. Understand the dataset (I think it's the most important step in any data-based project).
3. Identify the problems (are there missing values, duplicates, or any other defects?).
4. Clean the data using the Apache Spark dataframe API.
5. Do any necessary transformations.
6. Save the cleaned and transformed data as a Parquet file for future use.

In [1]:
#Import libraries
import os
import findspark
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.evaluation import RegressionEvaluator

Download the data file

In [57]:
%%bash
fileName="mpg-raw.csv"

if ! test -d data;then
    mkdir data
fi

if ! test -f  data/$fileName;then
    wget -P data/ https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg-raw.csv
else
    echo "File already exists"
fi

File already exists


In [58]:
#Create a Spark instance
findspark.init()
spark = SparkSession.builder.appName('Practice Project').getOrCreate()

In [59]:
#Load the data file to spark dataframe
df = spark.read.csv('data/mpg-raw.csv',header=True,inferSchema=True)
df.printSchema()

root
 |-- MPG: double (nullable = true)
 |-- Cylinders: integer (nullable = true)
 |-- Engine Disp: double (nullable = true)
 |-- Horsepower: integer (nullable = true)
 |-- Weight: integer (nullable = true)
 |-- Accelerate: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Origin: string (nullable = true)



In [None]:
#A NULL group in the output means the Origin column contains nulls
df.groupBy('Origin').count().show()

+--------+-----+
|  Origin|count|
+--------+-----+
|European|   70|
|    NULL|    1|
|Japanese|   88|
|American|  247|
+--------+-----+



In [60]:
total_row_count = df.count()
total_row_count

406

In [61]:
#Drop duplicates
df  = df.drop_duplicates()
no_duplicates_row_count = df.count()
no_duplicates_row_count

392

In [None]:
#Show the rows that contain at least one null value, with a per-row null count
expr = sum(F.when(F.col(x).isNull(),1).otherwise(0) for x in df.columns)
df.withColumn('nulls', expr).filter(F.col('nulls') > 0).orderBy('nulls',ascending=False).show()

+----+---------+-----------+----------+------+----------+----+--------+-----+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|nulls|
+----+---------+-----------+----------+------+----------+----+--------+-----+
|33.5|        4|      151.0|        90|  2556|      13.2|  79|    NULL|    1|
|NULL|        4|       97.0|        67|  2065|      17.8|  81|Japanese|    1|
|32.0|        4|       83.0|      NULL|  2003|      19.0|  74|Japanese|    1|
|32.0|        4|       NULL|        96|  2665|      13.9|  82|Japanese|    1|
|28.0|        4|      116.0|        90|  2123|      NULL|  71|European|    1|
|30.7|        6|      145.0|        76|  NULL|      19.6|  81|European|    1|
|30.0|     NULL|      135.0|        84|  2385|      12.9|  81|American|    1|
+----+---------+-----------+----------+------+----------+----+--------+-----+
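
The output above is organised by row. To see the same information per column, the nulls can be tallied directly. This is a minimal plain-Python sketch, assuming the seven rows printed above are copied by hand into tuples with `None` standing in for `NULL`; the names `rows` and `null_counts` are illustrative and not part of the notebook.

```python
# Column names as reported by df.printSchema() earlier in the notebook.
columns = ["MPG", "Cylinders", "Engine Disp", "Horsepower",
           "Weight", "Accelerate", "Year", "Origin"]

# The seven null-containing rows from the output above (None stands in for NULL).
rows = [
    (33.5, 4, 151.0, 90, 2556, 13.2, 79, None),
    (None, 4, 97.0, 67, 2065, 17.8, 81, "Japanese"),
    (32.0, 4, 83.0, None, 2003, 19.0, 74, "Japanese"),
    (32.0, 4, None, 96, 2665, 13.9, 82, "Japanese"),
    (28.0, 4, 116.0, 90, 2123, None, 71, "European"),
    (30.7, 6, 145.0, 76, None, 19.6, 81, "European"),
    (30.0, None, 135.0, 84, 2385, 12.9, 81, "American"),
]

# Count, for every column, how many of these rows hold a null in it.
null_counts = {
    name: sum(1 for row in rows if row[i] is None)
    for i, name in enumerate(columns)
}

for name, count in null_counts.items():
    print(f"{name}: {count}")
```

Each affected row is missing exactly one value, and `Year` is the only column with no nulls among them.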



In [63]:
#Drop Null Values
df = df.dropna()
no_null_row_count = df.count()
df.withColumn('nulls', expr).filter(F.col('nulls') > 0).orderBy('nulls',ascending=False).show()
#Nulls are dropped
no_null_row_count

+---+---------+-----------+----------+------+----------+----+------+-----+
|MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|Origin|nulls|
+---+---------+-----------+----------+------+----------+----+------+-----+
+---+---------+-----------+----------+------+----------+----+------+-----+



385

In [64]:
#Rename Columns
df = df.withColumnRenamed('Engine Disp','Engine_Disp')
df.columns

['MPG',
 'Cylinders',
 'Engine_Disp',
 'Horsepower',
 'Weight',
 'Accelerate',
 'Year',
 'Origin']

The data is now clean. Let's save it in Parquet format.

In [65]:
df.write.mode('overwrite').parquet('data/mpg-cleaned.parquet')

In [75]:
print(f"Row count of original data = {total_row_count}")
print(f"Row count after dropping duplicate rows = {no_duplicates_row_count}")
print(f"Row count after dropping rows containing null values = {no_null_row_count}")
print(f"mpg-cleaned.parquet exists: {os.path.isdir('data/mpg-cleaned.parquet/')}")

Row count of original data = 406
Row count after dropping duplicate rows = 392
Row count after dropping rows containing null values = 385
mpg-cleaned.parquet exists: True
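
The three row counts also tell us how much each cleaning step removed. This small sketch hard-codes the counts reported above, so it runs without a Spark session; the derived names are illustrative.

```python
# Row counts reported by the notebook at each cleaning step.
total_row_count = 406
no_duplicates_row_count = 392
no_null_row_count = 385

# Rows removed by each step.
duplicates_removed = total_row_count - no_duplicates_row_count
null_rows_removed = no_duplicates_row_count - no_null_row_count

# Share of the original rows that survive cleaning.
retained_fraction = no_null_row_count / total_row_count

print(f"Duplicate rows removed = {duplicates_removed}")
print(f"Rows with nulls removed = {null_rows_removed}")
print(f"Share of rows retained = {retained_fraction:.1%}")
```

The number of rows removed by `dropna()` matches the seven null-containing rows listed earlier.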


### Part 2 (Build Machine Learning Pipeline)

In this part, we will build the pipeline stages that implement the machine learning model. This will be done through the following steps:
1. Create `StringIndexer` stage.
2. Create `VectorAssembler` stage.
3. Create `StandardScaler` stage.
4. Define model creation stage.
5. Build the pipeline from the above stages.
6. Split the dataset into training and test sets.
7. Fit the model using training data.

In [77]:
df.show(5)

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|34.0|        4|      108.0|        70|  2245|      16.9|  82|Japanese|
|27.0|        4|      112.0|        88|  2640|      18.6|  82|American|
|22.0|        6|      250.0|       105|  3353|      14.5|  76|American|
|29.0|        4|       98.0|        83|  2219|      16.5|  74|European|
|41.5|        4|       98.0|        76|  2144|      14.7|  80|European|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows


In [78]:
#Create a string indexer of Origin attribute
indexer = StringIndexer(inputCol='Origin',outputCol='Origin_Indexer')

In [88]:
#Print each Origin category with its index
indexer.fit(df).transform(df).groupBy('Origin').agg(
    F.median(F.col('Origin_Indexer')).alias('Index')
).show()

+--------+-----+
|  Origin|Index|
+--------+-----+
|European|  2.0|
|Japanese|  1.0|
|American|  0.0|
+--------+-----+
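
The indices above follow from `StringIndexer`'s default ordering, which gives index 0 to the most frequent label, 1 to the next, and so on. The plain-Python sketch below mimics that rule. It uses the category counts from the earlier `groupBy('Origin')` output, which was taken before cleaning, and assumes cleaning did not change the ranking. It illustrates the ordering rule only and is not Spark's implementation.

```python
from collections import Counter

# Category counts from the earlier groupBy('Origin') output (NULL group omitted).
origin_counts = Counter({"European": 70, "Japanese": 88, "American": 247})

# Assign indices in order of descending frequency, as floats like Spark does.
origin_index = {
    label: float(position)
    for position, (label, _count) in enumerate(origin_counts.most_common())
}

for label, index in origin_index.items():
    print(f"{label}: {index}")
```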



In [91]:
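#Combine the necessary features into one vector column called features
#Keep every column except the first (MPG, the label) and the last (Origin);
#note that Origin_Indexer is therefore not part of the feature vector
input_cols = df.columns[1:-1]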
assembler = VectorAssembler(inputCols=input_cols, outputCol='features')
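
The slice that defines `input_cols` can be checked without Spark. This sketch applies it to the column list printed after the rename step. It shows that the label (`MPG`) and the raw `Origin` string are left out, and that the shorter `[1:-1]` form selects the same columns.

```python
# Column names as printed by df.columns after the rename step.
columns = ['MPG', 'Cylinders', 'Engine_Disp', 'Horsepower',
           'Weight', 'Accelerate', 'Year', 'Origin']

# Drop the first column (the label) and the last column (the raw string).
input_cols = columns[1:len(columns) - 1]

# The negative-index form is equivalent.
same_as_short_form = input_cols == columns[1:-1]

print(input_cols)
print(same_as_short_form)
```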

In [96]:
#Show the input columns next to the assembled features vector
assembler.transform(df).select(input_cols + ['features']).show(5,truncate=False)

+---------+-----------+----------+------+----------+----+----------------------------------+
|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|features                          |
+---------+-----------+----------+------+----------+----+----------------------------------+
|4        |108.0      |70        |2245  |16.9      |82  |[4.0,108.0,70.0,2245.0,16.9,82.0] |
|4        |112.0      |88        |2640  |18.6      |82  |[4.0,112.0,88.0,2640.0,18.6,82.0] |
|6        |250.0      |105       |3353  |14.5      |76  |[6.0,250.0,105.0,3353.0,14.5,76.0]|
|4        |98.0       |83        |2219  |16.5      |74  |[4.0,98.0,83.0,2219.0,16.5,74.0]  |
|4        |98.0       |76        |2144  |14.7      |80  |[4.0,98.0,76.0,2144.0,14.7,80.0]  |
+---------+-----------+----------+------+----------+----+----------------------------------+
only showing top 5 rows


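Now, these features should be put on a comparable scale so that features with large magnitudes (such as `Weight`) do not dominate those with small ones (such as `Cylinders`). We will scale them using `StandardScaler`.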

In [102]:
scaler = StandardScaler(inputCol='features',outputCol='scaled_features')
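
With the settings used here (the loaded-model listing in Part 4 reports `withMean=false, withStd=true`), each feature is divided by its sample standard deviation and is not centred. The sketch below shows that arithmetic in plain Python for one feature, using the five `Cylinders` values from the preview above. In the real pipeline the standard deviation comes from the whole training set, so the actual scaled values differ.

```python
import statistics

# Cylinders values from the five preview rows shown earlier.
cylinders = [4.0, 4.0, 6.0, 4.0, 4.0]

# Sample standard deviation (divides by n - 1).
std = statistics.stdev(cylinders)

# Scale without centring: divide by the standard deviation only.
scaled = [value / std for value in cylinders]

print(f"std = {std:.4f}")
print([round(value, 4) for value in scaled])
print(f"std after scaling = {statistics.stdev(scaled):.4f}")
```

After scaling, the feature has unit standard deviation, but its mean is not zero because no centring was applied.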

Next, we define the model creation stage.

In [110]:
lr = LinearRegression(featuresCol='scaled_features', labelCol='MPG')

Then, build the pipeline.

In [111]:
pipeline = Pipeline(stages=[indexer,assembler, scaler,lr])

Now we need to split our dataset into training and test sets.

In [112]:
(training_data, testing_data)  = df.randomSplit([0.7,0.3],seed=42)
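
`randomSplit` assigns rows by random sampling, so the two sets come out at roughly 70% and 30% rather than at exact sizes, and the seed makes the assignment repeatable. The plain-Python sketch below illustrates the idea. The helper `random_split` is hypothetical and is not Spark's implementation, so it will not reproduce the same rows as the cell above.

```python
import random

def random_split(rows, weights, seed):
    """Send each row to the first or second list with probability set by weights."""
    cutoff = weights[0] / sum(weights)   # normalise the weights
    rng = random.Random(seed)            # seeded generator for repeatability
    first, second = [], []
    for row in rows:
        (first if rng.random() < cutoff else second).append(row)
    return first, second

# Stand-in for the 385 cleaned rows.
rows = list(range(385))
train, test = random_split(rows, [0.7, 0.3], seed=42)

print(len(train), len(test))
```

Calling the helper again with the same seed returns the same split, which is the property `seed=42` provides in the notebook.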

Let's train the model using `training_data`

In [113]:
pipeline_model = pipeline.fit(training_data)

25/12/13 17:16:29 WARN Instrumentation: [a9d22bbe] regParam is zero, which might cause numerical instability and overfitting.


In [129]:
print("Pipeline Stages")
print('-'*25)
for i,value in enumerate([str(x).split('_')[0] for x in pipeline.getStages()]):
    print(f"Stage {i+1}: {value}")

print('-'*25)
print(f"The Label Column: {lr.getLabelCol()}")
print('-'*25)


Pipeline Stages
-------------------------
Stage 1: StringIndexer
Stage 2: VectorAssembler
Stage 3: StandardScaler
Stage 4: LinearRegression
-------------------------
The Label Column: MPG
-------------------------


### Part 3 (Evaluation)

The next step is evaluation: we predict the testing data using the model trained on the training data, and then compute the evaluation metrics for the model to assess the quality of its predictions. We do that in the following order:
1. Predict the testing_data using the trained model.

2. Generate evaluation metrics for the model to assess the quality of its predictions. Metrics used in this evaluation:
    - Mean Square Error (MSE).
    - Root Mean Square Error (RMSE).
    - Mean Absolute Error (MAE).
    - R-Squared.
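
For reference, the four metrics listed above can be written out in a few lines of plain Python. This is a sketch of the standard definitions and is not taken from Spark's source. The function name `regression_metrics` and the three-point sample data are illustrative only.

```python
import math

def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, MAE and R-squared from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels) # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Tiny made-up example, just to exercise the formulas.
metrics = regression_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
print(metrics)
```

MSE and RMSE penalise large errors more heavily than MAE does, while R-squared compares the model against always predicting the mean label.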

In [133]:
predictions = pipeline_model.transform(testing_data)
predictions.columns

['MPG',
 'Cylinders',
 'Engine_Disp',
 'Horsepower',
 'Weight',
 'Accelerate',
 'Year',
 'Origin',
 'Origin_Indexer',
 'features',
 'scaled_features',
 'prediction']

In [135]:
#Reuse one evaluator and override metricName for each metric
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='MPG')
mse = evaluator.evaluate(predictions, {evaluator.metricName: 'mse'})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: 'r2'})
rmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
mae = evaluator.evaluate(predictions, {evaluator.metricName: 'mae'})

In [137]:
print(f"MSE = {mse}")
print(f"RMSE = {rmse}")
print(f"MAE = {mae}")
print(f"R-Squared = {r2}")

MSE = 12.22674583557129
RMSE = 3.4966763984634452
MAE = 2.8457151130135854
R-Squared = 0.8018737394895717
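
A quick sanity check on these numbers: RMSE should equal the square root of MSE, and MAE can never exceed RMSE. The sketch below hard-codes the values printed above and verifies both relationships.

```python
import math

# Metric values printed above.
mse = 12.22674583557129
rmse = 3.4966763984634452
mae = 2.8457151130135854

# RMSE is defined as the square root of MSE.
rmse_matches = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9)

# MAE is always less than or equal to RMSE.
mae_below_rmse = mae <= rmse

print(rmse_matches, mae_below_rmse)
```

Read in the label's units, the RMSE means a typical prediction error of about 3.5 MPG on the test set.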


In [141]:
#Print the Intercept of the linear regression model
linearRegressionModel = pipeline_model.stages[-1]  #Because it's the last stage in the pipeline  model
print(f"Intercept of the linear regression model = {round(linearRegressionModel.intercept,2)}")

Intercept of the linear regression model = -17.37
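
The intercept is the value the model predicts when every entry of `scaled_features` is zero. The scaler does not centre the data, so this corresponds to a car with all raw features equal to zero. That is far outside the data, which is why a negative MPG here is not alarming. The sketch below shows how a linear regression prediction is assembled. The coefficient and feature values are hypothetical placeholders, because the notebook does not print the fitted coefficients.

```python
# Intercept reported above (rounded to two decimals).
intercept = -17.37

# HYPOTHETICAL coefficients and scaled feature values, for illustration only;
# the notebook does not print the fitted coefficients.
coefficients = [-0.4, 0.3, -0.6, -5.2, 0.2, 2.8]
scaled_features = [4.8, 3.4, 5.6, 5.4, 5.1, 19.0]

def predict(intercept, coefficients, features):
    """Linear regression prediction: intercept plus the weighted sum of features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

print(predict(intercept, coefficients, scaled_features))

# With all features at zero, the prediction collapses to the intercept.
print(predict(intercept, coefficients, [0.0] * 6))
```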


### Part 4 (Make the model persist)

Persisting the model is important for reusability and portability, and it saves time: you can save the trained model on your local machine and later use it on new real-world datasets by simply loading it and predicting on the new data, without retraining.

This will be done through the following steps:
1. Save the trained model to disk on the local machine.
2. Load the model again.
3. Use it to predict new data.

In [145]:
%%bash
if ! test -d saved_models; then
    mkdir  saved_models
fi

In [146]:
#Save the model
pipeline_model.write().overwrite().save('saved_models/')
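
A saved `PipelineModel` is a directory, not a single file. Spark ML writes a `metadata` subdirectory and a `stages` subdirectory under the target path. The helper below is a hypothetical convenience check, not part of the notebook or of Spark. It only confirms that those two subdirectories exist.

```python
import os

def looks_like_saved_pipeline(path):
    """Return True if path contains the 'metadata' and 'stages' subdirectories."""
    return all(
        os.path.isdir(os.path.join(path, name))
        for name in ("metadata", "stages")
    )

# Prints False if nothing has been saved at this path yet.
print(looks_like_saved_pipeline("saved_models/"))
```

This kind of check can be useful before calling `PipelineModel.load`, since loading from a path without that layout fails.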

In [147]:
#Load the model
loaded_model = PipelineModel.load('saved_models/')

In [148]:
#Use the loaded model to predict testing data
newPerdictions = loaded_model.transform(testing_data)

In [149]:
newPerdictions.show(5)

+----+---------+-----------+----------+------+----------+----+--------+--------------+--------------------+--------------------+------------------+
| MPG|Cylinders|Engine_Disp|Horsepower|Weight|Accelerate|Year|  Origin|Origin_Indexer|            features|     scaled_features|        prediction|
+----+---------+-----------+----------+------+----------+----+--------+--------------+--------------------+--------------------+------------------+
|10.0|        8|      360.0|       215|  4615|      14.0|  70|American|           0.0|[8.0,360.0,215.0,...|[4.81279869941784...| 6.960764577508346|
|11.0|        8|      429.0|       208|  4633|      11.0|  72|American|           0.0|[8.0,429.0,208.0,...|[4.81279869941784...| 8.545911819807522|
|12.0|        8|      350.0|       180|  4499|      12.5|  73|American|           0.0|[8.0,350.0,180.0,...|[4.81279869941784...|10.226709705747677|
|12.0|        8|      383.0|       180|  4955|      11.5|  71|American|           0.0|[8.0,383.0,180.0,...|[4.81

In [150]:
#Reuse one evaluator and override metricName for each metric
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='MPG')
mse = evaluator.evaluate(newPerdictions, {evaluator.metricName: 'mse'})
r2 = evaluator.evaluate(newPerdictions, {evaluator.metricName: 'r2'})
rmse = evaluator.evaluate(newPerdictions, {evaluator.metricName: 'rmse'})
mae = evaluator.evaluate(newPerdictions, {evaluator.metricName: 'mae'})


#Displaying the results
print(f"MSE = {mse}")
print(f"RMSE = {rmse}")
print(f"MAE = {mae}")
print(f"R-Squared = {r2}")

MSE = 12.22674583557129
RMSE = 3.4966763984634452
MAE = 2.8457151130135854
R-Squared = 0.8018737394895717


In [151]:
#Loaded model stages
loaded_model.stages

[StringIndexerModel: uid=StringIndexer_f760d9b7fc4b, handleInvalid=error,
 VectorAssembler_b6a02e2a0178,
 StandardScalerModel: uid=StandardScaler_f4c5091385f1, numFeatures=6, withMean=false, withStd=true,
 LinearRegressionModel: uid=LinearRegression_a291628b5dca, numFeatures=6]

In [153]:
inputColumns = loaded_model.stages[1].getInputCols() 
inputColumns

['Cylinders', 'Engine_Disp', 'Horsepower', 'Weight', 'Accelerate', 'Year']

In [154]:
spark.stop()