## Final Project - Build an ML Pipeline for Airfoil noise prediction


<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


Estimated time needed: **90** minutes


## Scenario


You are a data engineer at an aeronautics consulting company. Your company prides itself on designing airfoils efficiently for use in planes and sports cars. Data scientists in your office need to work with different algorithms and with data in different formats. While they are good at machine learning, they count on you to run ETL jobs and build ML pipelines. In this project you will use a modified version of the NASA Airfoil Self-Noise dataset. You will clean the dataset by dropping duplicate rows and removing rows with null values. You will then create an ML pipeline that trains a model to predict SoundLevel from all the other columns. Finally, you will evaluate the model and persist it.



## Objectives

In this four-part assignment you will:

- Part 1 Perform ETL activity
  - Load a csv dataset
  - Remove duplicates if any
  - Drop rows with null values if any
  - Make transformations
  - Store the cleaned data in parquet format
- Part 2 Create a Machine Learning Pipeline
  - Create a machine learning pipeline for prediction
- Part 3 Evaluate the Model
  - Evaluate the model using relevant metrics
- Part 4 Persist the Model 
  - Save the model for future production use
  - Load and verify the stored model


## Datasets

In this lab you will be using the following dataset:

 - The original dataset is the [NASA Airfoil Self-Noise dataset](https://archive.ics.uci.edu/dataset/291/airfoil+self+noise).
 
 - This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.


Diagram of an airfoil (for informational purposes only).


![Airfoil with flow](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_with_flow.png)


Diagram showing the angle of attack (for informational purposes only).


![Airfoil angle of attack](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_angle_of_attack.jpg)


## Before you Start


Before you attempt this project, it is highly recommended that you finish the practice project.


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as `pyspark` and `findspark` to connect to this cluster.


The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [2]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Part 1 - Perform ETL activity


### Task 1 - Import required libraries


In [3]:
#your code goes here
#!pip3 install pyspark==3.1.2
#!pip install findspark
import findspark
findspark.init()
# pandas is a popular data science package for Python. In this lab, we use pandas to load a CSV file from disk into a pandas dataframe in memory.
import pandas as pd
import matplotlib.pyplot as plt
# pyspark is the Spark API for Python. In this lab, we use pyspark to initialize the Spark context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession


### Task 2 - Create a spark session


In [None]:
# Create a SparkSession

# Creating a spark context class
sc = SparkContext()

# Creating a spark session (needed later by the ML pipeline and by spark.stop())
spark = SparkSession \
    .builder \
    .appName("Python Spark DataFrames basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

### Task 3 - Load the csv file into a dataframe


Download the data file.

NOTE: Please ensure you use the dataset below and not the original dataset mentioned above.


In [4]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv


zsh:1: command not found: wget


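The `wget` command is not available in this environment (see the error above), so the cells below download the file with `requests` and then load it into a pandas dataframe.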


In [5]:
import pandas as pd

In [7]:
import requests

def download_file(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

urls = [
    "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv"
]

for url in urls:
    filename = url.split('/')[-1]
    download_file(url, filename)


In [8]:
# Load the dataset that you have downloaded in the previous task

cars = pd.read_csv("NASA_airfoil_noise_raw.csv")

In [10]:
len(cars)

1522

In [11]:
cars_without_duplicates = cars.drop_duplicates()

# Get the number of rows in the DataFrame after dropping duplicates
num_rows_without_duplicates = len(cars_without_duplicates)

# Print the number of rows after dropping duplicates
print("Number of rows after dropping duplicates:", num_rows_without_duplicates)

Number of rows after dropping duplicates: 1503


In [12]:
# Drop duplicate rows and rows with null values
cleaned_cars = cars.drop_duplicates().dropna()

# Get the number of rows in the cleaned DataFrame
num_rows_cleaned = len(cleaned_cars)

# Print the number of rows after dropping duplicates and null values
print("Number of rows after dropping duplicates and null values:", num_rows_cleaned)

Number of rows after dropping duplicates and null values: 1499


In [13]:
# Drop duplicate rows and rows with null values
cleaned_cars = cars.drop_duplicates().dropna()

# Check the column names of the cleaned DataFrame
print(cleaned_cars.columns)

Index(['Frequency', 'AngleOfAttack', 'ChordLength', 'FreeStreamVelocity',
       'SuctionSideDisplacement', 'SoundLevel'],
      dtype='object')


### Task 4 - Print top 5 rows of the dataset


In [9]:
#your code goes here
cars.head()

| |Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
|-|-|-|-|-|-|-|
|0|800.0|0.0|0.3048|71.3|0.002663|126.201|
|1|1000.0|0.0|0.3048|71.3|0.002663|125.201|
|2|1250.0|0.0|0.3048|71.3|0.002663|125.951|
|3|1600.0|0.0|0.3048|71.3|0.002663|127.591|
|4|2000.0|0.0|0.3048|71.3|0.002663|127.461|


### Task 6 - Print the total number of rows in the dataset


In [None]:
#your code goes here
rowcount1 = cars.shape[0]
print(rowcount1)

### Task 7 - Drop all the duplicate rows from the dataset


In [None]:
# drop_duplicates(inplace=True) returns None, so reassign the result instead
cars = cars.drop_duplicates()


### Task 8 - Print the total number of rows in the dataset


In [None]:
#your code goes here

rowcount2 = cars.shape[0]
print(rowcount2)


### Task 9 - Drop all the rows that contain null values from the dataset


In [None]:
# dropna(inplace=True) returns None, so reassign the result instead
cars = cars.dropna()


### Task 10 - Print the total number of rows in the dataset


In [None]:
#your code goes here

rowcount3 = cars.shape[0]
print(rowcount3)
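
The row counts in Tasks 6 to 10 come from two filters applied in order: duplicates first, then rows with nulls. As a rough illustration in plain Python (not the pandas or Spark implementation), the sketch below runs both filters on a handful of made-up rows, with `None` standing in for a null value. The data and the helper names `drop_duplicates` and `drop_nulls` are hypothetical.

```python
# Illustrative only: made-up rows, with None standing in for a null value.
rows = [
    (800.0, 0.0, 126.2),
    (800.0, 0.0, 126.2),   # exact duplicate of the first row
    (1000.0, 0.0, None),   # contains a null
    (1250.0, 0.0, 125.9),
]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def drop_nulls(rows):
    """Keep only rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row)]

deduplicated = drop_duplicates(rows)
cleaned = drop_nulls(deduplicated)

print(len(rows), len(deduplicated), len(cleaned))
```

Recording the count after each step, as the tasks ask, makes it easy to see how many rows each filter removed.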



### Task 11 - Rename the column "SoundLevel" to "SoundLevelDecibels"


In [None]:
# your code goes here

cars.rename(columns={'SoundLevel': 'SoundLevelDecibels'}, inplace=True)
cars.head()

### Task 12 - Save the dataframe in parquet format, name the file as "NASA_airfoil_noise_cleaned.parquet"


In [None]:
!pip install fastparquet -q

In [None]:
# your code goes here
cars.to_parquet('NASA_airfoil_noise_cleaned.parquet', engine='fastparquet')


#### Part 1 - Evaluation



Run the code cell below.<br>
Use the answers here to answer the final evaluation quiz in the next section.<br>
If the code throws up any errors, go back and review the code you have written.


In [None]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", cars.shape[0])
print("New column name = ", cars.columns[-1])

import os

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("NASA_airfoil_noise_cleaned.parquet"))
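
Whether `os.path.isdir` returns `True` in the check above depends on how the parquet output was written: Spark's DataFrame writer produces a directory of part files, while pandas `to_parquet` without partitioning writes a single file. The sketch below imitates both layouts with empty placeholder files in a temporary directory to show how the checks differ. The file names are hypothetical and no real parquet data is written.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Layout 1: a single file, as a pandas-style writer would leave behind.
    single_file = os.path.join(tmp, "single.parquet")
    open(single_file, "wb").close()

    # Layout 2: a directory holding part files, as a Spark-style writer would.
    spark_style = os.path.join(tmp, "dir.parquet")
    os.mkdir(spark_style)
    open(os.path.join(spark_style, "part-00000.parquet"), "wb").close()

    file_is_dir = os.path.isdir(single_file)
    dir_is_dir = os.path.isdir(spark_style)
    both_exist = os.path.exists(single_file) and os.path.exists(spark_style)

print(file_is_dir, dir_is_dir, both_exist)
```

If the existence check prints `False` even though the file was saved, `os.path.exists` is a layout-independent way to confirm that the output is there.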

## Part 2 - Create a Machine Learning Pipeline


### Task 1 - Load data from "NASA_airfoil_noise_cleaned.parquet" into a dataframe


In [None]:
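#your code goes here

# The pipeline stages below need a Spark DataFrame, so convert the cleaned pandas dataframe.
# getOrCreate() reuses the running Spark session if there is one.
# If the parquet file was written in Part 1, spark.read.parquet("NASA_airfoil_noise_cleaned.parquet") is an alternative.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(cars.values.tolist(), schema=list(cars.columns))


In [None]:
df.show(5)

### Task 2 - Print the total number of rows in the dataset


In [None]:
#your code goes here

rowcount4 = df.count()
print(rowcount4)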



### Task 3 - Define the VectorAssembler pipeline stage


Stage 1 - Assemble the input columns into a single column "features". Use all the columns except SoundLevelDecibels as input features.


In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
#your code goes here
#assembler = #TODO


# Assuming 'df' is your PySpark DataFrame
input_columns = [col for col in df.columns if col != "SoundLevelDecibels"]

assembler = VectorAssembler(
    inputCols=input_columns,
    outputCol="features"
)

df_with_features = assembler.transform(df)


### Task 4 - Define the StandardScaler pipeline stage


Stage 2 - Scale the "features" using standard scaler and store in "scaledFeatures" column


In [None]:
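#your code goes here
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")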
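
As a rough picture of what Stage 2 computes: with its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by that feature's sample standard deviation and does not subtract the mean. The plain-Python sketch below applies the same arithmetic to one made-up column. The values and the helper name `scale_to_unit_std` are illustrative assumptions, not part of the assignment.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean centering)."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    if std == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [value / std for value in values]

# Hypothetical frequencies in Hz.
frequency = [800.0, 1000.0, 1250.0, 1600.0, 2000.0]
scaled = scale_to_unit_std(frequency)

print([round(value, 3) for value in scaled])
print(round(statistics.stdev(scaled), 6))
```

Scaling matters here because the input columns span very different ranges (frequencies in the thousands, displacements in the thousandths), and putting them on a common scale makes the fitted coefficients easier to compare.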


### Task 5 - Define the LinearRegression pipeline stage


Stage 3 - Create a LinearRegression stage to predict "SoundLevelDecibels"


In [None]:
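#your code goes here
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="scaledFeatures", labelCol="SoundLevelDecibels")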


### Task 6 - Build the pipeline


Build a pipeline using the above three stages


In [None]:
#your code goes here
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, scaler, lr])
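
Conceptually, fitting a pipeline walks the stages in order: each estimator is fitted on the data produced so far, and the resulting transformer is applied before the next stage runs. The sketch below mimics that control flow in plain Python with two toy stages. The names `AddOne`, `MeanShift`, `MeanShiftModel`, `fit_pipeline`, and `transform_pipeline` are hypothetical stand-ins and not the PySpark API.

```python
class AddOne:
    """A toy transformer: needs no fitting, it just maps the data."""
    def transform(self, data):
        return [x + 1 for x in data]

class MeanShift:
    """A toy estimator: fit() learns the mean and returns a fitted transformer."""
    def fit(self, data):
        return MeanShiftModel(sum(data) / len(data))

class MeanShiftModel:
    """The fitted counterpart of MeanShift."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

def fit_pipeline(stages, data):
    """Fit estimators in order, feeding each stage the output of the previous one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)   # estimator -> fitted transformer
        fitted.append(stage)
        data = stage.transform(data)  # pass transformed data to the next stage
    return fitted

def transform_pipeline(fitted_stages, data):
    """Apply every fitted stage in order."""
    for stage in fitted_stages:
        data = stage.transform(data)
    return data

fitted = fit_pipeline([AddOne(), MeanShift()], [1.0, 2.0, 3.0])
print(transform_pipeline(fitted, [1.0, 2.0, 3.0]))
```

The practical benefit is that the same fitted stages are reused at prediction time, so the test data is transformed with statistics learned from the training data only.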


### Task 7 - Split the data


In [None]:
# Split the data into training and testing sets with a 70:30 split.
# Set the value of seed to 42.
# This step is very important. DO NOT set the seed to any value other than 42.

#your code goes here

(trainingData, testingData) = df.randomSplit([0.7, 0.3], seed=42)
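
The fixed seed matters because the split is random. Spark's `randomSplit` treats the weights as sampling probabilities, so the resulting sizes are only approximately 70:30, while a fixed seed makes the assignment of rows repeatable. The sketch below shows the general idea with Python's `random` module. It is an illustration under that assumption, not Spark's actual algorithm, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))   # roughly 700 and 300; exact sizes depend on the seed
print(train_a == train_b)          # the same seed reproduces the same split
```

A different seed would put different rows in the test set, which changes the metrics in Part 3. That is why the assignment insists on a single seed value.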



### Task 8 - Fit the pipeline


In [None]:
# Fit the pipeline using the training data
# your code goes here

pipelineModel = pipeline.fit(trainingData)


#### Part 2 - Evaluation



Run the code cell below.<br>
Use the answers here to answer the final evaluation quiz in the next section.<br>
If the code throws up any errors, go back and review the code you have written.


In [None]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

## Part 3 - Evaluate the Model


### Task 1 - Predict using the model


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.linear_model import LinearRegression

# Optional cross-check with scikit-learn on the cleaned pandas dataframe
X = cars.drop(columns=["SoundLevelDecibels"])  # feature matrix
y = cars["SoundLevelDecibels"]                 # target variable

# Train a linear regression model
model = LinearRegression()
model.fit(X, y)

# Access the intercept value
intercept = model.intercept_
print("Intercept:", intercept)

In [None]:
# Make predictions on testing data
# your code goes here

predictions = pipelineModel.transform(testingData)


### Task 2 - Print the MSE


In [None]:
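#your code goes here
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
print(mse)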


### Task 3 - Print the MAE


In [None]:
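#your code goes here
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
print(mae)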


### Task 4 - Print the R-Squared(R2)


In [None]:
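#your code goes here
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="SoundLevelDecibels", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
print(r2)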
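
For reference, the three metrics in Tasks 2 to 4 can be computed by hand from labels and predictions. The sketch below does so in plain Python on made-up numbers. The helper `regression_metrics`, the `demo_` variable names, and the sample values are assumptions for illustration; the assignment answers should come from the pipeline's own predictions.

```python
def regression_metrics(labels, predictions):
    """Return (mse, mae, r2) for two equal-length sequences of numbers."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n           # mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_label = sum(labels) / n
    total_ss = sum((y - mean_label) ** 2 for y in labels)
    residual_ss = sum(e * e for e in errors)
    r2 = 1.0 - residual_ss / total_ss              # coefficient of determination
    return mse, mae, r2

# Hypothetical sound levels in decibels and matching predictions.
demo_labels = [120.0, 125.0, 130.0, 135.0]
demo_predictions = [121.0, 124.0, 132.0, 135.0]
demo_mse, demo_mae, demo_r2 = regression_metrics(demo_labels, demo_predictions)

print(round(demo_mse, 3), round(demo_mae, 3), round(demo_r2, 3))
```

MSE penalizes large errors more heavily than MAE because the errors are squared, while R2 reports how much of the variance in the labels the model explains.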


#### Part 3 - Evaluation



Run the code cell below.<br>
Use the answers here to answer the final evaluation quiz in the next section.<br>
If the code throws up any errors, go back and review the code you have written.


In [None]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipelineModel.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))


## Part 4 - Persist the Model


### Task 1 - Save the model to the path "Final_Project"


In [None]:
# Save the pipeline model as "Final_Project"
# your code goes here
pipelineModel.write().overwrite().save("Final_Project")


### Task 2 - Load the model from the path "Final_Project"


In [None]:
# Load the pipeline model you have created in the previous step
from pyspark.ml import PipelineModel

loadedPipelineModel = PipelineModel.load("Final_Project")


### Task 3 - Make predictions using the loaded model on the testdata


In [None]:
# Use the loaded pipeline model and make predictions using testingData
predictions = loadedPipelineModel.transform(testingData)


### Task 4 - Show the predictions


In [None]:
# Show the top 5 rows from the predictions dataframe. Display only the label column and predictions
#your code goes here
predictions.select("SoundLevelDecibels", "prediction").show(5)


#### Part 4 - Evaluation



Run the code cell below.<br>
Use the answers here to answer the final evaluation quiz in the next section.<br>
If the code throws up any errors, go back and review the code you have written.


In [None]:
print("Part 4 - Evaluation")

loadedmodel = loadedPipelineModel.stages[-1]
totalstages = len(loadedPipelineModel.stages)
inputcolumns = loadedPipelineModel.stages[0].getInputCols()

print("Number of stages in the pipeline = ", totalstages)
for i,j in zip(inputcolumns, loadedmodel.coefficients):
    print(f"Coefficient for {i} is {round(j,4)}")

### Stop Spark Session


In [None]:
spark.stop()

## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-26|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
