## Machine Learning with Apache Spark
### Final Project
**Project Description**

The main goal is to clean the dataset, build a machine learning pipeline, evaluate the model, and persist the model for future use.

**Project Phases**
1. Data cleaning
    - Remove duplicates.
    - Handle missing values.
    - Fix other defects (here, rename the target column).
    - Save the cleaned data in parquet format.
2. Build machine learning pipeline stages
    - Assemble features in one vector.
    - Map categorical attributes to indices (not needed for this dataset, where every column is numeric).
    - Scale the features so that those with large ranges do not dominate the model.
    - Build the regression model stage.
    - Build the pipeline with the mentioned stages.
3. Evaluation Phase
    - Use evaluation metrics.
4. Saving the model
    - Persist the model for future use and portability, and to avoid the time cost of retraining.
    - Save the model.
    - Load the model.

In [193]:
import findspark
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.ml.feature import VectorAssembler,  StandardScaler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

In [194]:
spark = SparkSession.builder.appName('Final Project').getOrCreate()

#### Part 1
1. Download data.
2. Load it into a Spark DataFrame.
3. Remove duplicates.
4. Remove rows that contain nulls.
5. Save data in parquet format.

Download the dataset

In [195]:
%%bash
if ! test -d data;then
    mkdir data
fi
fileName="NASA_airfoil_noise_raw.csv"
if ! test -f data/$fileName;then
    wget -P data https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv
else
    echo "File exists"
fi

File exists


In [196]:
fileName='NASA_airfoil_noise_raw.csv'
df = spark.read.csv(f'data/{fileName}',header=True, inferSchema=True)
df.printSchema()

root
 |-- Frequency: integer (nullable = true)
 |-- AngleOfAttack: double (nullable = true)
 |-- ChordLength: double (nullable = true)
 |-- FreeStreamVelocity: double (nullable = true)
 |-- SuctionSideDisplacement: double (nullable = true)
 |-- SoundLevel: double (nullable = true)



In [197]:
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows


Print all records containing null values

In [198]:
expr = sum(F.when(F.col(x).isNull(),1).otherwise(0) for x in df.columns)
df.withColumn('nulls', expr).filter(F.col('nulls') != 0).show()

+---------+-------------+-----------+------------------+-----------------------+----------+-----+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|nulls|
+---------+-------------+-----------+------------------+-----------------------+----------+-----+
|     NULL|          0.0|     0.3048|              55.5|             0.00283081|   123.236|    1|
|      630|          0.0|     0.3048|              NULL|             0.00310138|   128.629|    1|
|     2500|          1.5|     0.3048|              NULL|             0.00392107|   120.981|    1|
|      800|          3.0|       NULL|              39.6|             0.00495741|   129.552|    1|
+---------+-------------+-----------+------------------+-----------------------+----------+-----+



### How to handle null values?
There is more than one way to handle null values, depending on the requirements and the dataset itself:
1. Remove the whole record.
2. Replace the null values with the mean of that attribute over the whole dataset (if the variable is numeric).
3. Replace them with the median.
4. For categorical attributes, replace nulls with the mode.

> In this project we will remove the whole record.
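
The replacement options listed above can be illustrated with a small plain-Python sketch, independent of Spark. The `impute` helper and the sample values are hypothetical and only show the arithmetic behind each strategy; this notebook itself drops the affected rows instead.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


# Hypothetical free-stream velocities with one missing reading.
velocities = [71.3, 55.5, None, 39.6, 39.6]

mean_filled = impute(velocities, "mean")
median_filled = impute(velocities, "median")
mode_filled = impute(velocities, "mode")

print(mean_filled)
print(median_filled)
print(mode_filled)
```

Mean and median apply to numeric columns only, while the mode also works for categorical values.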

In [199]:
print(f"Total number of records of original data set = {df.count()}")

Total number of records of original data set = 1522


In [200]:
df  = df.dropna()

In [201]:
df.withColumn('nulls', expr).filter(F.col('nulls') != 0).show()
# No rows containing null values remain

+---------+-------------+-----------+------------------+-----------------------+----------+-----+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|nulls|
+---------+-------------+-----------+------------------+-----------------------+----------+-----+
+---------+-------------+-----------+------------------+-----------------------+----------+-----+



In [202]:
print(f"Number of records after removing rows that contain null values = {df.count()}")

Number of records after removing rows that contain null values = 1518


In [203]:
#Rename column
df = df.withColumnRenamed('SoundLevel','SoundLevelDecibels')
df.columns

['Frequency',
 'AngleOfAttack',
 'ChordLength',
 'FreeStreamVelocity',
 'SuctionSideDisplacement',
 'SoundLevelDecibels']

In [204]:
df  = df.dropDuplicates()
print(f"Number of records after removing duplicates = {df.count()}")

Number of records after removing duplicates = 1499


Save the cleaned DataFrame in Parquet format

In [205]:
df.write.mode('overwrite').parquet(f'data/{fileName}_cleaned.parquet')

In [206]:
del df

#### Part 2 (Building the pipeline)
1. Load data from the parquet format.
2. Define assembler stage.
3. Define scaler stage.
4. Define linear regression model stage.
5. Build the pipeline.
6. Split the dataset into training and testing sets.
7. Fit the pipeline using training data.

In [207]:
df = spark.read.parquet(f'data/{fileName}_cleaned.parquet')
df.printSchema()

root
 |-- Frequency: integer (nullable = true)
 |-- AngleOfAttack: double (nullable = true)
 |-- ChordLength: double (nullable = true)
 |-- FreeStreamVelocity: double (nullable = true)
 |-- SuctionSideDisplacement: double (nullable = true)
 |-- SoundLevelDecibels: double (nullable = true)



In [208]:
df.count()

1499

In [209]:
print(df.columns)

['Frequency', 'AngleOfAttack', 'ChordLength', 'FreeStreamVelocity', 'SuctionSideDisplacement', 'SoundLevelDecibels']


In [210]:
#Stage 1: Assembler
features_columns  = df.columns[:-1]
assembler= VectorAssembler(inputCols=features_columns, outputCol='features')

In [211]:
#Stage 2: Scaler
scaler = StandardScaler(inputCol='features',outputCol='scaled_features')
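
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not center it. The plain-Python sketch below reproduces that arithmetic on two hypothetical columns; the `scale_to_unit_std` helper and the values are illustrative only, not part of the pipeline.

```python
from statistics import stdev


def scale_to_unit_std(column):
    """Divide each value by the sample standard deviation (no centering)."""
    s = stdev(column)  # corrected (n - 1) sample standard deviation
    return [x / s for x in column]


# Hypothetical columns on very different scales.
frequency = [800, 1000, 1250, 1600, 2000]
velocity = [71.3, 55.5, 39.6, 31.7, 71.3]

scaled_frequency = scale_to_unit_std(frequency)
scaled_velocity = scale_to_unit_std(velocity)

# After scaling, both columns have a standard deviation of about 1.
print(round(stdev(scaled_frequency), 6), round(stdev(scaled_velocity), 6))
```

Because the values are not centered, a scaled column keeps its sign and relative ordering; only its spread changes.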

In [212]:
#Stage 3: LinearRegression
lr = LinearRegression(featuresCol='scaled_features',labelCol='SoundLevelDecibels')

In [213]:
#Build the pipeline
pipeline = Pipeline(stages=[assembler, scaler, lr])

In [214]:
#Split the dataset
training_data, testing_data = df.randomSplit([0.7,0.3], seed=42)

In [215]:
print(training_data.count())
print(testing_data.count())

1101
398
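
The split sizes are not exactly 70% and 30% of 1499 because `randomSplit` assigns rows randomly according to the weights rather than cutting the dataset at a fixed position. The sketch below is a simplified plain-Python illustration of that idea, not Spark's implementation; the `random_split` helper is hypothetical and its counts will differ from the ones above.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative, normalized weights

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # Pick the first split whose cumulative bound exceeds u.
        index = next((i for i, b in enumerate(bounds) if u < b), len(bounds) - 1)
        splits[index].append(row)
    return splits


train, test = random_split(range(1499), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, but the sizes still only approximate the requested proportions.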


In [216]:
#Fit the pipeline
model = pipeline.fit(training_data)

25/12/14 22:06:34 WARN Instrumentation: [d81bf7c1] regParam is zero, which might cause numerical instability and overfitting.


In [217]:
print("Part 2 - Evaluation")
print("Total rows = ", df.count())
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  1499
Pipeline Stage 1 =  VectorAssembler
Pipeline Stage 2 =  StandardScaler
Pipeline Stage 3 =  LinearRegression
Label column =  SoundLevelDecibels


#### Part 3 (Evaluation)
1. Predict the testing data using the model.
2. Show some linear regression evaluation metrics:
    - MSE
    - MAE
    - R-Squared
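
As a reminder of what these three metrics measure, here is a minimal plain-Python sketch using their standard definitions. The `regression_metrics` helper and the toy values are hypothetical and are not taken from the dataset.

```python
def regression_metrics(labels, predictions):
    """Return (MSE, MAE, R-squared) for paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares

    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2


# Toy values for illustration only.
toy_labels = [1.0, 2.0, 3.0, 4.0]
toy_predictions = [1.5, 2.0, 2.5, 4.0]

toy_mse, toy_mae, toy_r2 = regression_metrics(toy_labels, toy_predictions)
print(toy_mse, toy_mae, toy_r2)
```

MSE and MAE are expressed in squared and plain label units respectively (lower is better), while R-squared is unitless and approaches 1 as the model explains more of the label variance.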

In [218]:
predictions  = model.transform(testing_data)

In [219]:
print(predictions.columns)

['Frequency', 'AngleOfAttack', 'ChordLength', 'FreeStreamVelocity', 'SuctionSideDisplacement', 'SoundLevelDecibels', 'features', 'scaled_features', 'prediction']


In [220]:
predictions.select(['SoundLevelDecibels','prediction']).show()

+------------------+------------------+
|SoundLevelDecibels|        prediction|
+------------------+------------------+
|           128.679| 122.5972291437678|
|            133.42|127.37968204568845|
|           119.146|130.34077425074514|
|           116.074|131.11016975113546|
|           134.319|  127.126273601251|
|            125.01|127.89456373905162|
|           125.941|131.06220981224092|
|           130.588| 125.7373995384845|
|           128.354|121.53249832197926|
|           121.783|124.20059665619317|
|            122.94|125.87997778533574|
|           116.146|125.24362112904097|
|           114.044|126.06429872612996|
|           109.951|127.67830278943781|
|           125.974|121.25022147564815|
|           116.066|123.31966959832607|
|           118.595|124.20046348885941|
|           126.395|126.16068839641792|
|           130.089| 122.5337859220606|
|           131.889|123.42922049990017|
+------------------+------------------+
only showing top 20 rows


In [221]:
evaluator = RegressionEvaluator(labelCol='SoundLevelDecibels', predictionCol='prediction', metricName='mse')
mse = evaluator.evaluate(predictions)
evaluator = RegressionEvaluator(labelCol='SoundLevelDecibels', predictionCol='prediction', metricName='mae')
mae = evaluator.evaluate(predictions)
evaluator = RegressionEvaluator(labelCol='SoundLevelDecibels', predictionCol='prediction', metricName='r2')
r2 = evaluator.evaluate(predictions)

In [222]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))
lrModel = model.stages[-1]
print(f"Intercept = {round(lrModel.intercept,2)}")

Part 3 - Evaluation
Mean Squared Error =  25.0
Mean Absolute Error =  3.91
R Squared =  0.5
Intercept = 132.88


#### Part 4 (Persisting the model)
1. Save the model.
2. Load the model.
3. Make predictions on the testing data using the loaded model.
4. Show predictions.

In [223]:
model.write().overwrite().save('saved_model/')

In [224]:
loaded_model = PipelineModel.load('saved_model/')

In [225]:
new_predictions = loaded_model.transform(testing_data)

In [226]:
print(f"Total number of stages = {len(loaded_model.stages)}")
input_columns = loaded_model.stages[0].getInputCols()
linear_regression_model = loaded_model.stages[-1]
print(input_columns)
#Print the coefficients of each input variable
print('Coefficients')
for i,  j in zip(input_columns, linear_regression_model.coefficients):
    print(f"Variable: {i} with coefficient = {round(j,4)}")

Total number of stages = 3
['Frequency', 'AngleOfAttack', 'ChordLength', 'FreeStreamVelocity', 'SuctionSideDisplacement']
Coefficients
Variable: Frequency with coefficient = -3.9906
Variable: AngleOfAttack with coefficient = -2.2881
Variable: ChordLength with coefficient = -3.3269
Variable: FreeStreamVelocity with coefficient = 1.4832
Variable: SuctionSideDisplacement with coefficient = -2.0551
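
These coefficients apply to the scaled features, so each one describes the change in predicted sound level per standard deviation of its variable. Assuming the scaler only divided each feature by its standard deviation (no centering), dividing a coefficient by that standard deviation expresses it in the variable's original units, and the intercept stays the same. The sketch below checks this with made-up numbers; the helpers and all values are hypothetical.

```python
def to_raw_coefficients(scaled_coefficients, stds):
    """Convert coefficients on std-scaled features to coefficients on raw features."""
    return [c / s for c, s in zip(scaled_coefficients, stds)]


def predict(intercept, coefficients, features):
    """Evaluate a linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical two-feature model (not the fitted values above).
intercept = 130.0
scaled_coefficients = [-4.0, 1.5]
stds = [3000.0, 15.0]   # assumed per-feature standard deviations
row = [800.0, 71.3]     # one raw observation

scaled_row = [x / s for x, s in zip(row, stds)]
raw_coefficients = to_raw_coefficients(scaled_coefficients, stds)

# Both forms of the model give the same prediction.
print(predict(intercept, scaled_coefficients, scaled_row))
print(predict(intercept, raw_coefficients, row))
```

Comparing coefficients on the scaled features is still the more useful view for judging relative influence, since the raw-unit coefficients depend on each variable's measurement scale.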


In [227]:
spark.stop()