## Machine Learning Predictor for Airfoil Self-Noise using Spark

You are a data engineer at an aeronautics consulting company. Your company prides itself on being able to efficiently design airfoils for use in planes and sports cars. Data scientists in your office need to work with different algorithms and data in different formats. While they are good at Machine Learning, they count on you to do ETL jobs and build ML pipelines. In this project you will use a modified version of the NASA Airfoil Self-Noise dataset. You will clean this dataset by dropping the duplicate rows and removing the rows with null values. You will create an ML pipeline to build a model that predicts the SoundLevel based on all the other columns. You will evaluate the model and, towards the end, persist it.



## Part II Create a Machine Learning Pipeline


![Airfoil with flow](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_with_flow.png)


### 1- Read processed data from S3 bucket

#### 1.1  Import required libraries

In [None]:
# Install the packages first, so that the imports below can find them
!pip install pyspark==3.1.2 -q
!pip install findspark -q

In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python
import findspark
findspark.init()

In [None]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.feature import StandardScaler, VectorAssembler, StringIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import matplotlib.pyplot as plt
import seaborn as sns

#### 1.2 Initialising Spark

In [None]:
#Create a SparkSession
spark = SparkSession.builder \
    .appName("NASA_Project-01") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
    .getOrCreate()

#### 1.3 - Load data from "NASA_airfoil_noise_cleaned.parquet" into a dataframe from S3 bucket


In [None]:
#Read parquet file

df = spark.read.parquet("s3a://YOUR_BUCKET_NAME/PATH/nasa_airfoil_noise_raw.parquet")
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+------------------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevelDecibels|
+---------+-------------+-----------+------------------+-----------------------+------------------+
|      630|          0.0|     0.3048|              31.7|             0.00331266|           129.095|
|     4000|          0.0|     0.3048|              31.7|             0.00331266|           118.145|
|     4000|          1.5|     0.3048|              39.6|             0.00392107|           117.741|
|      800|          4.0|     0.3048|              71.3|             0.00497773|           131.755|
|     1250|          0.0|     0.2286|              31.7|              0.0027238|           128.805|
+---------+-------------+-----------+------------------+-----------------------+------------------+
only showing top 5 rows

###  2 - Exploratory Data Analysis


There will be no cleaning of the data, since it was already cleaned at the source. Instead, the focus will be on understanding the data and optimising its use to build a good ML model.

#### 2.1 - Data overview

The most important aspects to look for are missing values, data distribution and possible outliers. Making a preliminary data preview and providing a statistical description can offer an initial understanding of the data that will be further developed.

In [None]:
rowcount = df.count()
print(rowcount)

1499

In [None]:
df.printSchema()

In [None]:
df.describe().show()

In [None]:
df.toPandas().isnull().sum()

***There are no missing values in the data, as expected.***

In [None]:

df.show() 

### 2.2 -  Visualization

#### a - Histograms and distribution

In [None]:
# Create a histogram for each variable to see their distribution
pdf = df.toPandas()  # convert once instead of once per subplot
fig, ax = plt.subplots(2, 3, figsize=(15, 10))
ax[0][0].set_xscale("log")  # This command sets the X-axis of "Frequency" in log scale
ax[1][1].set_xscale("log")  # This command sets the X-axis of "SuctionSideDisplacement" in log scale

for i in range(2):  # rows
    for j in range(3):  # columns
        # Named `column` so that the `col` function imported from pyspark is not overwritten
        column = df.columns[j + 3 * i]
        sns.histplot(data=pdf, x=column, ax=ax[i][j])

The distribution of the target variable "SoundLevel" is slightly skewed, so it could be transformed to bring it closer to a normal distribution. Note on the x scales: frequency is conventionally plotted on a log scale, so that axis was set to log directly, while the log scale for SuctionSideDisplacement was chosen after inspecting its distribution.

#### b - Scatterplots and correlation with Sound Level

In [None]:
pdf = df.toPandas()  # convert once instead of once per subplot
fig, ax = plt.subplots(2, 3, figsize=(15, 10))
ax[0][0].set_xscale("log")  # This command sets the X-axis of "Frequency" in log scale
ax[1][1].set_xscale("log")  # This command sets the X-axis of "SuctionSideDisplacement" in log scale
for i in range(2):  # rows
    for j in range(3):  # columns
        # Named `column` so that the `col` function imported from pyspark is not overwritten
        column = df.columns[3 * i + j]
        sns.scatterplot(data=pdf, x=column, y='SoundLevel', ax=ax[i][j])

At first glance, there is no clear linear dependency between target and independent variables, which suggests studying feature transformations.

#### c - Interdependency


Creating a correlation matrix to highlight connections between variables.

In [None]:
correlation_matrix = df.toPandas().corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

The correlation matrix shows how strongly the independent variables are correlated with one another. Notably, it reveals a particularly high correlation coefficient between 'SuctionSideDisplacement' and 'AngleOfAttack', which suggests that these two features carry shared or dependent information. Given this relationship, a polynomial feature expansion, which adds interaction terms between features, may help improve the performance of the predictor.
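As a reminder of what each cell of the matrix measures, here is a minimal sketch of the Pearson coefficient, which is the default method of pandas' `corr()`. The helper name `pearson` and the sample values are assumptions made for illustration; they are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (hypothetical, not read from the dataset)
angle = [0.0, 1.5, 3.0, 4.0, 5.3]
displacement = [0.0027, 0.0039, 0.0045, 0.0050, 0.0058]

r = pearson(angle, displacement)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 or -1 means the two variables move together almost linearly, which is the situation described above for 'SuctionSideDisplacement' and 'AngleOfAttack'.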

### 3 - Feature Extraction and Transformation

Based on the above Exploratory Data Analysis, the selected transformations are:
* Log transformation
* Polynomial expansion
* Standardization

#### 3.1 - Log transformation

The distributions of 'Frequency' and 'SuctionSideDisplacement' both exhibit noticeable skewness, deviating from a symmetric, bell-shaped curve. Skewed features like these can hurt model performance, so a log transformation will be applied to both to reduce the skewness and bring them closer to a symmetric distribution.
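The effect of the log transformation on skewness can be sketched in plain Python. The frequency values below are illustrative one-third-octave band centres, an assumption made for this example rather than values read from the dataset, and `skewness` is a hypothetical helper.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by the std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative band-centre frequencies in Hz (assumed, not read from the dataset)
frequencies = [200, 250, 315, 400, 500, 630, 800, 1000, 1250, 1600, 2000,
               2500, 3150, 4000, 5000, 6300, 8000, 10000, 12500, 16000, 20000]

# Same formula as the log(Frequency + 1) used in the pipeline of this notebook
log_frequencies = [math.log(f + 1) for f in frequencies]

raw_skew = skewness(frequencies)
log_skew = skewness(log_frequencies)
print(f"skewness before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

Values that are spaced roughly geometrically, as frequency bands are, become roughly evenly spaced after taking the log, which is why the skewness drops towards zero.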

#### 3.2 - Polynomial expansion

Polynomial features let the model represent higher-order terms and interactions between variables, which the correlation matrix above hints at. The expansion enlarges the feature space, giving the predictor more ways to capture the underlying relationships, and is likely to improve its predictive performance.
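To make the size of the enlarged feature space concrete, the following sketch enumerates the monomials of a degree-2 expansion. `polynomial_terms` is a hypothetical helper written for illustration; its term ordering is not claimed to match the one Spark's `PolynomialExpansion` produces.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def polynomial_terms(features, degree):
    """Return every monomial of total degree 1..degree built from `features`.

    The constant term is left out. The ordering is for illustration only.
    """
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(features, d):
            terms.append(prod(combo))
    return terms

# Two features (x, y) = (2, 3) expand to x, y, x*x, x*y, y*y
print(polynomial_terms([2.0, 3.0], degree=2))  # → [2.0, 3.0, 4.0, 6.0, 9.0]

# Five input features, as in this notebook, give C(5 + 2, 2) - 1 = 20 terms
n_features, degree = 5, 2
expanded = polynomial_terms([1.0] * n_features, degree)
print(len(expanded), comb(n_features + degree, degree) - 1)  # → 20 20
```

The cross terms such as `x*y` are what allow a linear model to pick up interactions between two correlated inputs.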

#### 3.3 - Standardization

Given the substantial scale variations among variables, particularly accentuated within the polynomial terms, standardizing the data emerges as a pivotal step. This transformation ensures a consistent scale across all features, mitigating the disparate magnitudes and leading to improved model convergence and interpretability.
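A minimal sketch of the scaling step, under the assumption that it mirrors the default behaviour of Spark's `StandardScaler` (`withStd=True`, `withMean=False`), which divides each feature by its sample standard deviation without centring it. The helper `scale_to_unit_std` and the sample values are hypothetical.

```python
import statistics

def scale_to_unit_std(values, with_mean=False):
    """Divide by the sample standard deviation; optionally centre first."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    mean = statistics.mean(values) if with_mean else 0.0
    return [(v - mean) / std for v in values]

# Illustrative free-stream velocities and their squares (a degree-2 polynomial term)
velocity = [31.7, 39.6, 55.5, 71.3]
velocity_sq = [v ** 2 for v in velocity]

scaled_velocity = scale_to_unit_std(velocity)
scaled_velocity_sq = scale_to_unit_std(velocity_sq)

# The raw spreads differ widely; after scaling both have a standard deviation of 1
print(statistics.stdev(velocity), statistics.stdev(velocity_sq))
print(statistics.stdev(scaled_velocity), statistics.stdev(scaled_velocity_sq))
```

Passing `with_mean=True` in this sketch also centres the values, which corresponds to setting `withMean=True` on the Spark scaler.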

###  4 - Build the pipelines for each ML algorithm

In this section, two pipeline variants are defined for every algorithm: one with the data transformations (DT) applied and one without them. This dual-pipeline approach makes it possible to investigate the influence of feature engineering on the performance of the model.

#### a. Pipeline Flow with data transformations

The flow of the pipeline is defined in 5 stages:

* Stage 1.- Apply the log transformation to "Frequency" and "SuctionSideDisplacement"
* Stage 2.- Assemble the independent variables into one single column called "features"
* Stage 3.- Create the polynomial feature columns
* Stage 4.- Standardise the data with StandardScaler
* Stage 5.- Feed data into the regressor

#### b.  Pipeline Flow without data transformations

A much simpler flow this time, with only 2 stages:

* Stage 1.- Assemble the independent variables into one single column called "finalFeatures"
* Stage 2.- Feed data into the regressor

#### 4.1 - Pipeline for the Unregularised Linear Regression 

#### a - Pipeline Flow with data transformations

In [None]:
# SQLTransformer and PolynomialExpansion are not part of the initial imports
from pyspark.ml.feature import SQLTransformer, PolynomialExpansion

# Create a transformer to apply the log transform to Frequency and SuctionSideDisplacement
# (adding 1 before taking the log keeps the argument strictly positive)
log_transform_sql = SQLTransformer(
    statement="SELECT log(Frequency + 1) AS log_frequency, AngleOfAttack, ChordLength, FreeStreamVelocity, log(SuctionSideDisplacement + 1) AS log_SSD, SoundLevel FROM __THIS__"
)

# Create an assembler
assembler = VectorAssembler(inputCols=['log_frequency', 'AngleOfAttack', 'ChordLength', 'FreeStreamVelocity', 'log_SSD'], outputCol='features')

# Create a polynomial expansion transformer
px = PolynomialExpansion(degree=2, inputCol="features", outputCol="PolyFeat")

# Create a standard scaler
scaler = StandardScaler(inputCol="PolyFeat", outputCol="finalFeatures")

# Finally, create the unregularised linear regressor (regParam=0 indicates no regularisation terms)
lru = LinearRegression(featuresCol="finalFeatures", labelCol="SoundLevel", predictionCol="PredictedSoundLevel", regParam=0)

Consolidate the stages with Spark's Pipeline class.

In [None]:
pipeline = Pipeline(stages=[log_transform_sql, assembler, px, scaler, lru])

#### b. -  Without data transformations

In [None]:
# Create a new assembler with the untransformed columns
assemblerWO = VectorAssembler(inputCols=df.drop("SoundLevel").columns, outputCol='finalFeatures')

In [None]:
pipelineWO = Pipeline(stages=[assemblerWO, lru])

#### 4.2 - Pipeline for the Regularised Linear Regression

#### a - Pipeline Flow with data transformations

In [None]:
# All the other elements of the pipeline can be reused, so only a new linear regressor is needed (regParam=1 sets the weight of the regularisation term)
lrr = LinearRegression(featuresCol="finalFeatures", labelCol="SoundLevel", predictionCol="PredictedSoundLevel", regParam=1)

In [None]:
pipelineReg = Pipeline(stages=[log_transform_sql, assembler, px, scaler,lrr])

#### b. -  Without data transformations

In [None]:
pipelineRegWO = Pipeline(stages=[assemblerWO, lrr])

#### 4.3 - Pipeline for the Random Forest Regression

#### a - Pipeline Flow with data transformations

In [None]:
rf = RandomForestRegressor(featuresCol='finalFeatures', labelCol='SoundLevel', predictionCol='PredictedSoundLevel')

In [None]:
pipelineRF = Pipeline(stages=[log_transform_sql, assembler, px, scaler,rf])

#### b. -  Without data transformations

In [None]:
pipelineRFWO = Pipeline(stages=[assemblerWO, rf])

### 5 - Evaluate the different models

* Preparing for the evaluation

Before starting, it is necessary to split the dataset into training and testing parts, define a common evaluator for all the models, and define the helper functions that plot the residuals. The parameter grids for the grid search are created later, alongside each model that needs one.
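The common evaluator used in this section reports the root mean squared error (RMSE). As a reference for how to read its values, here is a minimal sketch of the metric in plain Python. The function name `rmse` and the predicted values are assumptions made for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Sound levels in dB from the data preview, paired with hypothetical predictions
actual = [129.095, 118.145, 117.741]
predicted = [127.0, 120.0, 118.0]
print(f"RMSE: {rmse(actual, predicted):.3f} dB")
```

RMSE is expressed in the same unit as the target, so here it reads as a typical prediction error in decibels, and lower is better.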

In [None]:
# Split dataset into training and testing, with 30% of the data held out for testing
(trainingData, testingData) = df.randomSplit([0.7, 0.3], seed=123)

# Initialise an evaluator (RMSE for regression)
evaluator = RegressionEvaluator(labelCol='SoundLevel', predictionCol='PredictedSoundLevel', metricName='rmse')

# Predictions graph function
from statsmodels.nonparametric.smoothers_lowess import lowess
def plotPred(pred,predWO, model='Regression'):
    '''Plots the comparison between predictions and actual values.'''

    # Compute residuals
    residDF = pred.withColumn("Residuals", col('SoundLevel')- col('PredictedSoundLevel'))
    residDFWO = predWO.withColumn("Residuals", col('SoundLevel')- col('PredictedSoundLevel'))

    # Convert Spark DFs to Pandas so Seaborn can handle them
    resid_df = residDF.toPandas()
    resid_dfWO = residDFWO.toPandas()

    # Compute lowess curve
    lowess_vals = lowess(resid_df.Residuals.to_numpy(), resid_df.SoundLevel.to_numpy(), frac=0.5, it=10)
    lowess_valsWO = lowess(resid_dfWO.Residuals.to_numpy(), resid_dfWO.SoundLevel.to_numpy(), frac=0.5, it=10)

    # Create the grid of plots
    fig, ax = plt.subplots(1,2, figsize=(15,6))
    
    # Plot the data points
    ax[0].plot([70,140], [0,0], ls='--', color='gray')
    ax[1].plot([70,140], [0,0], ls='--', color='gray')
    
    sns.scatterplot(resid_df, x="SoundLevel", y="Residuals", ax=ax[0], label='residuals')
    sns.scatterplot(resid_dfWO, x="SoundLevel", y="Residuals", ax=ax[1], label='residuals')

    ax[0].plot(lowess_vals[:, 0], lowess_vals[:, 1], color='red', label='lowess')
    ax[1].plot(lowess_valsWO[:, 0], lowess_valsWO[:, 1], color='red', label='lowess')

    

    # Set plot labels and title
    ax[0].set_title("SoundLevel prediction with DT residuals")
    ax[0].set_ylabel("Residuals")
    ax[0].set_xlabel("Real Sound Level values (dB)")
    ax[0].set_ylim((-15,15))
    ax[0].set_xlim((100,138))
    ax[0].legend(loc='upper left')    
    
    ax[1].set_title("SoundLevel prediction without DT residuals")
    ax[1].set_ylabel("Residuals")
    ax[1].set_xlabel("Real Sound Level values (dB)")
    ax[1].set_ylim((-15,15))
    ax[1].set_xlim((100,138))
    ax[1].legend(loc='upper left')
    
    fig.suptitle(f"Comparison between residuals of {model}", fontweight='bold')

    # Show the plot
    # fig.show() is left commented out: the notebook already renders the figure, so calling it would plot twice

def plotComparison(pred1, pred2, model1='model 1', model2='model 2'):
    
    # Compute residuals
    resid1 = pred1.withColumn("Residuals", -col('SoundLevel')+ col('PredictedSoundLevel'))
    resid2 = pred2.withColumn("Residuals", -col('SoundLevel')+ col('PredictedSoundLevel'))

    # Convert Spark DFs to Pandas so Seaborn can handle them
    resid_1 = resid1.toPandas()
    resid_2 = resid2.toPandas()

    # Compute lowess curve
    lowess_vals1 = lowess(resid_1.Residuals.to_numpy(), resid_1.SoundLevel.to_numpy(), frac=0.5, it=10)
    lowess_vals2 = lowess(resid_2.Residuals.to_numpy(), resid_2.SoundLevel.to_numpy(), frac=0.5, it=10)

    # Create the grid of plots
    fig, ax = plt.subplots(1,2, figsize=(15,6))
    
    # Plot the data points
    ax[0].plot([70,140], [0,0], ls='--', color='gray')
    ax[1].plot([70,140], [0,0], ls='--', color='gray')
    
    sns.scatterplot(resid_1, x="SoundLevel", y="Residuals", ax=ax[0], label='residuals')
    sns.scatterplot(resid_2, x="SoundLevel", y="Residuals", ax=ax[1], label='residuals')

    ax[0].plot(lowess_vals1[:, 0], lowess_vals1[:, 1], color='red', label='lowess')
    ax[1].plot(lowess_vals2[:, 0], lowess_vals2[:, 1], color='red', label='lowess')

    # Set plot labels and title
    ax[0].set_title(f"Prediction's residual of {model1}")
    ax[0].set_ylabel("Negative Residuals")
    ax[0].set_xlabel("Real Sound Level values (dB)")
    ax[0].set_ylim((-15,15))
    ax[0].set_xlim((100,138))
    ax[0].legend(loc='upper left')
    
    ax[1].set_title(f"Prediction's residual of {model2}")
    ax[1].set_ylabel("Negative Residuals")
    ax[1].set_xlabel("Real Sound Level values (dB)")
    ax[1].set_ylim((-15,15))
    ax[1].set_xlim((100,138))
    ax[1].legend(loc='upper left')

    fig.suptitle(f"Residuals comparison between {model1} and {model2}", fontweight='bold')

    # Show the plot
    # fig.show() is left commented out: the notebook already renders the figure, so calling it would plot twice

#### 5.1 - Unregularised Linear Regression

* #### Prediction

In [None]:
'''Pipeline with data transformations'''
# Fit the model
lruModel = pipeline.fit(trainingData)

# Make predictions
lruPredictions = lruModel.transform(testingData)

'''Pipeline without data transformations'''
# Fit the model
lruModelWO = pipelineWO.fit(trainingData)

# Make predictions
lruPredictionsWO = lruModelWO.transform(testingData)

In [None]:
plotPred(lruPredictions, lruPredictionsWO, "the unregularised linear regression")

* ####  Training set rmse

In [None]:
'''RMSE on the training set (no cross validation is used for this model)'''

# With DT
trainingRmseLRU = evaluator.evaluate(lruModel.transform(trainingData))
print(f"With DT \t ---> Training set rmse: {trainingRmseLRU}")

# Without DT
trainingRmseLRUWO = evaluator.evaluate(lruModelWO.transform(trainingData))
print(f"Without DT \t ---> Training set rmse: {trainingRmseLRUWO}")

* ####  Test set rmse

In [None]:
'''RMSE on the held-out testing set (no cross validation is used for this model)'''

# With DT
rmseLRU = evaluator.evaluate(lruPredictions)  # RMSE of the predictions made on testingData
print(f"With DT \t ---> Testing set rmse: {rmseLRU}")

# Without DT
rmseLRUWO = evaluator.evaluate(lruPredictionsWO)  # RMSE of the predictions made on testingData
print(f"Without DT\t ---> Testing set rmse: {rmseLRUWO}")

#### 5.2 - Regularised Linear Regression

* #### Prediction

In [None]:
# Creating parameter grid for hyperparameter tuning
param_grid = (ParamGridBuilder()
                            .addGrid(lrr.regParam, [0.01, 0.1, 1.0, 10.0]) # Overall regularisation strength
                            .addGrid(lrr.elasticNetParam, [0.0, 0.5, 1.0]) # L1/L2 mix: 0.0 is pure L2 (ridge), 1.0 is pure L1 (lasso)
                            .build())

'''Pipeline with DT'''

# Set up CrossValidator with the pipeline for hyperparameter tuning
crossval = CrossValidator(estimator=pipelineReg, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit the model
cv_model = crossval.fit(trainingData)

# Make predictions
lrrPredictions = cv_model.transform(testingData)

'''Pipeline without DT'''

# Set up CrossValidator with the pipeline for hyperparameter tuning
crossval = CrossValidator(estimator=pipelineRegWO, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit the model
cv_modelWO = crossval.fit(trainingData)

# Make predictions
lrrPredictionsWO = cv_modelWO.transform(testingData)


# Get the tuned model
tuned_lrr = cv_model.bestModel

In [None]:
plotPred(lrrPredictions, lrrPredictionsWO, 'the regularised linear regression')

* ####   Training set rmse

In [None]:
'''RMSE on the training set, using the best model found by cross validation'''

# With DT
rmseLRR_train = evaluator.evaluate(cv_model.transform(trainingData))
print(f"With DT \t ---> Training set rmse: {rmseLRR_train}")

# Without DT
rmseLRRWO_train = evaluator.evaluate(cv_modelWO.transform(trainingData))
print(f"Without DT \t ---> Training set rmse: {rmseLRRWO_train}")

* ####   Test set rmse

In [None]:
'''RMSE on the held-out testing set, using the best model found by cross validation'''

# With DT
rmseLRR_test = evaluator.evaluate(lrrPredictions)
print(f"With DT \t ---> Testing set rmse: {rmseLRR_test}")

# Without DT
rmseLRRWO = evaluator.evaluate(lrrPredictionsWO)
print(f"Without DT \t ---> Testing set rmse: {rmseLRRWO}")

#### 5.3 - Random Forest Regression

* ####  Prediction

In [None]:
# Define a parameter grid for hyperparameter tuning
param_grid = (ParamGridBuilder().addGrid(rf.numTrees, [10, 20, 30]).addGrid(rf.maxDepth, [5, 10, 15,20]).build())

'''Pipeline with DT'''

# Set up CrossValidator with the pipeline, evaluator, and parameter grid for the Random Forest algorithm
crossval = CrossValidator(estimator=pipelineRF, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit the model
cv_model = crossval.fit(trainingData)

# Make predictions
rfPredictions = cv_model.transform(testingData)

'''Pipeline without DT'''

# Set up CrossValidator with the pipeline, evaluator, and parameter grid for the Random Forest algorithm
crossval = CrossValidator(estimator=pipelineRFWO, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit the model
cv_modelWO = crossval.fit(trainingData)

# Make predictions
rfPredictionsWO = cv_modelWO.transform(testingData)

# Get best model
best_model = cv_model.bestModel

In [None]:
plotPred(rfPredictions, rfPredictionsWO, 'the random forest regression')

* ####   Training set rmse

In [None]:
'''RMSE on the training set, using the best model found by cross validation'''

# With DT
rmseRF_train = evaluator.evaluate(cv_model.transform(trainingData))
print(f"With DT \t ---> Training set rmse: {rmseRF_train}")

# Without DT
rmseRFWO_train = evaluator.evaluate(cv_modelWO.transform(trainingData))
print(f"Without DT \t ---> Training set rmse: {rmseRFWO_train}")

* ####   Test set rmse

In [None]:
'''RMSE on the held-out testing set, using the best model found by cross validation'''

# With DT
rmseRF = evaluator.evaluate(rfPredictions)
print(f"With DT \t ---> Testing set rmse: {rmseRF}")

# Without DT
rmseRFWO = evaluator.evaluate(rfPredictionsWO)
print(f"Without DT \t ---> Testing set rmse: {rmseRFWO}")

### 6 - Results

### Linear Regressor performance

Looking at the RMSE of the unregularised (LRU) and regularised (LRR) versions of the linear regressor, something stands out: the model without regularisation performs better. This has an interesting implication, namely that the regularisation term stops the regressor from capturing the apparently complex nature of the dataset. In other words, the LRR model appears to be underfitting the data, which can be checked by comparing the RMSE scores of both models on the training set and on the testing set.
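To see why a larger `regParam` constrains the regressor, the sketch below evaluates the elastic-net penalty in the form given in the Spark MLlib documentation, `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The helper name and the weight vector are hypothetical, chosen only for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the squared-error loss for a given weight vector."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_sq = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1_norm
                        + (1 - elastic_net_param) / 2 * l2_norm_sq)

# Hypothetical coefficients, not taken from a fitted model
weights = [2.0, -1.0, 0.5]

# Same candidate values as the parameter grid of this notebook
for reg_param in [0.01, 0.1, 1.0, 10.0]:
    for elastic_net_param in [0.0, 0.5, 1.0]:
        penalty = elastic_net_penalty(weights, reg_param, elastic_net_param)
        print(f"regParam={reg_param:<5} elasticNetParam={elastic_net_param}: penalty={penalty:.4f}")
```

For the same weights, the penalty grows linearly with `regParam`, so a large value pushes the optimiser towards small coefficients and a simpler fit, which is consistent with the underfitting discussed above.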

In [None]:
print(f"RMSE score of LRR model (training set): {rmseLRR_train} \t vs \t RMSE score of LRU model (training set): {trainingRmseLRU}")
print(f"RMSE score of LRR model (testing set):  {rmseLRR_test}  \t vs \t RMSE score of LRU model (testing set):  {rmseLRU}")

In [None]:
# Compute the difference in RMSE between the testing and training sets
diffLRR = rmseLRR_test - rmseLRR_train
diffLRU = rmseLRU - trainingRmseLRU
print(f"The difference between scores is {diffLRR} for the LRR model; whilst the difference between scores is {diffLRU} for the LRU model.")

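Underfitting shows up as a high error on both the training and the testing set, whereas a large gap between the two (testing error well above training error) is commonly associated with overfitting.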

### Random Forests performance

Contrary to the linear regressor, Random Forest does not surprise with its performance. Even before looking at the RMSE, it was foreseeable that this was going to be the best model. Inspecting the graph of the prediction residuals, a key feature of this algorithm can be seen: the red line is much flatter than in both linear regressions, which means the model captures the intricacies of the data better. This is a consequence of how the algorithm works and, in fact, the residual graphs of the linear regression also reveal why Random Forest is suited for this specific task.

In [None]:
# Comparison between the residual graphs
plotComparison(lruPredictions, rfPredictions, 'unregularised linear regression', 'random forest regression')
# Residuals are plotted with a negative sign (predicted minus actual) for easier interpretation

Airfoil behavior is known to transition between laminar and turbulent flow regimes under different conditions, and that can be seen in the left graph. The lowess (Locally Weighted Scatterplot Smoothing) curve exhibits this pattern by having two distinct regions. From approximately 100 dB to 120 dB, the predictions tend to grow. From 122 dB onwards, this tendency changes and the model starts shrinking its predictions. In fact, turbulence is often associated with increased aerodynamic noise, which could explain why the region with the greater sound level values tends to be underestimated more.

One possible improvement to the linear regressor could be dividing the model into two parts, one relating to laminar flow and low volumes, the other relating to turbulent flow and high volumes. This problem is not as noticeable in the random forest, thanks to its ability to model non-linear relationships and its variable importance measure.
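For readers unfamiliar with the lowess curve, the following is a simplified sketch of the idea: at every point, fit a straight line to the nearest neighbours using tricube weights. It omits the robustness iterations that the statsmodels `lowess` call in this notebook performs (`it=10`), so it is an illustration under that assumption rather than a replacement. The function name and the sample residuals are hypothetical.

```python
import math

def lowess_sketch(x, y, frac=0.5):
    """Locally weighted linear fit evaluated at every x (no robustness passes)."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # neighbours used for each local fit
    fitted = []
    for x0 in x:
        # Indices of the k points closest to x0
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        d_max = abs(x[nearest[-1]] - x0)
        # Tricube weights: 1 at x0, falling to 0 at the farthest neighbour
        weights = [(1 - (abs(x[i] - x0) / d_max) ** 3) ** 3 if d_max > 0 else 1.0
                   for i in nearest]
        sw = sum(weights)
        mx = sum(w * x[i] for w, i in zip(weights, nearest)) / sw
        my = sum(w * y[i] for w, i in zip(weights, nearest)) / sw
        sxx = sum(w * (x[i] - mx) ** 2 for w, i in zip(weights, nearest))
        sxy = sum(w * (x[i] - mx) * (y[i] - my) for w, i in zip(weights, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted line at x0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical sound levels (dB) and residuals, for illustration only
sound_level = [104.0, 108.0, 112.0, 116.0, 120.0, 124.0, 128.0, 132.0]
residuals = [3.1, 2.4, 1.9, 0.8, 0.2, -1.1, -2.6, -3.9]
smoothed = lowess_sketch(sound_level, residuals, frac=0.5)
print([round(v, 2) for v in smoothed])
```

A flat smoothed curve around zero indicates residuals with no systematic trend, while a sloped or bent curve, as described for the linear regression, indicates a pattern the model failed to capture.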

### Data transformation impact on model performance

Although the study of the impact of data transformations is often skipped by relying on experience with EDA, it is advisable to carry it out when the nature of the data proves to be complex. Since fluid dynamics is famously challenging and heavily nonlinear, and the dataset analysed here consists of aerodynamic tests, a detailed analysis was clearly needed. Comparing the RMSE scores of the pipelines with data transformations against those of the pipelines without them, the transformations clearly help improve model performance.

### 7. Conclusion

Selecting the best predictive model is an easy decision, given how much better the Random Forest algorithm performs than a traditional linear regression model. The decision is grounded in the understanding that airfoil self-noise generation involves intricate, non-linear interactions among various physical phenomena. The Random Forest's ability to capture non-linear patterns, consider variable importance, and handle complex decision boundaries makes it a powerful tool for this predictive task. Although there is room to improve that performance, for example by splitting the data into laminar and turbulent flow or by adding the Reynolds number as a variable, Random Forest already has an RMSE three times lower than the base linear regression model, which is enough for the purpose of this project.

In conclusion, this project provides valuable insights into the physics of airfoil self-noise generation and the predictive capabilities of advanced machine learning techniques. The Random Forest model, trained on the NASA Airfoil Self-Noise dataset, offers a robust framework for understanding and predicting the scaled sound pressure level associated with different airfoil configurations. This exploration lays the foundation for further research and applications in aeronautical engineering and noise reduction strategies.

### 8. Stop SparkSession

In [None]:
spark.stop()


## Authors


[Vianney](https://github.com/fermat01)


### Other Contributors


None

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-05-26|0.1|Vianney|Initial Version Created|


Copyright © 2024. All rights reserved.
