This problem has the following inputs:
1. Frequency, in Hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.

The only output is:
6. Scaled sound pressure level, in decibels.

In [5]:
#import the dataset
import pandas as pd
data = pd.read_table('data/airfoil_self_noise.dat',header=None, 
                     names= ['freq','angle','chord','stream-velocity',
                             'displacement-thickness','soundpressure'] )

In [6]:
data.head()

,freq,angle,chord,stream-velocity,displacement-thickness,soundpressure
0,800,0.0,0.3048,71.3,0.002663,126.201
1,1000,0.0,0.3048,71.3,0.002663,125.201
2,1250,0.0,0.3048,71.3,0.002663,125.951
3,1600,0.0,0.3048,71.3,0.002663,127.591
4,2000,0.0,0.3048,71.3,0.002663,127.461


In [7]:
#check the missing values
data.isna().mean()

freq                      0.0
angle                     0.0
chord                     0.0
stream-velocity           0.0
displacement-thickness    0.0
soundpressure             0.0
dtype: float64

In [8]:
#check descriptive stats 
data.describe()

,freq,angle,chord,stream-velocity,displacement-thickness,soundpressure
count,1503.0,1503.0,1503.0,1503.0,1503.0,1503.0
mean,2886.380572,6.782302,0.136548,50.860745,0.01114,124.835943
std,3152.573137,5.918128,0.093541,15.572784,0.01315,6.898657
min,200.0,0.0,0.0254,31.7,0.000401,103.38
25%,800.0,2.0,0.0508,39.6,0.002535,120.191
50%,1600.0,5.4,0.1016,39.6,0.004957,125.721
75%,4000.0,9.9,0.2286,71.3,0.015576,129.9955
max,20000.0,22.2,0.3048,71.3,0.058411,140.987


In [39]:
#correlation study
data.corr()

,freq,angle,chord,stream-velocity,displacement-thickness,soundpressure
freq,1.0,-0.272765,-0.003661,0.133664,-0.230107,-0.390711
angle,-0.272765,1.0,-0.504868,0.05876,0.753394,-0.156108
chord,-0.003661,-0.504868,1.0,0.003787,-0.220842,-0.236162
stream-velocity,0.133664,0.05876,0.003787,1.0,-0.003974,0.125103
displacement-thickness,-0.230107,0.753394,-0.220842,-0.003974,1.0,-0.31267
soundpressure,-0.390711,-0.156108,-0.236162,0.125103,-0.31267,1.0


In [25]:
#Shuffle the rows of the dataframe
data = data.sample(frac = 1, random_state=0)

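No encoding is needed because all variables are numeric.
Feature scaling is also not required: the linear regression used here is fit by ordinary least squares, not by gradient descent, so the differing scales of the inputs do not affect the fit.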
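The scaling remark can be checked with a small sketch. It uses synthetic data, not the airfoil dataset, and a hypothetical helper `ols_r2` built on NumPy's least-squares solver. The two features are deliberately placed on very different scales, and the coefficient of determination is compared before and after standardization.

```python
import numpy as np

def ols_r2(X, y):
    """Fit OLS with an intercept by least squares and return R^2 (hypothetical helper)."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution, no gradient descent
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)

# Synthetic features on very different scales (assumed ranges, for illustration only).
X = np.column_stack([
    rng.uniform(200, 20000, 200),
    rng.uniform(0.0004, 0.06, 200),
])
y = 130 - 0.001 * X[:, 0] - 150 * X[:, 1] + rng.normal(0, 2, 200)

# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

r2_raw = ols_r2(X, y)
r2_scaled = ols_r2(X_scaled, y)
print(r2_raw, r2_scaled)
```

The two R^2 values should agree up to floating-point error: rescaling changes the coefficients, but not the fitted values of a least-squares model with an intercept.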

In [40]:
#extract dependent and independent variables
X = data.drop('soundpressure',axis=1)
y = data.soundpressure

In [41]:
X.head()

,freq,angle,chord,stream-velocity,displacement-thickness
968,10000,0.0,0.0254,71.3,0.000401
9,6300,0.0,0.3048,71.3,0.002663
1468,2500,12.3,0.1016,31.7,0.041876
1150,400,17.4,0.0254,71.3,0.016104
880,2500,15.4,0.0508,71.3,0.026427


In [42]:
y.head()

968     130.787
9       119.541
1468    110.317
1150    117.396
880     127.625
Name: soundpressure, dtype: float64

In [29]:
#fit an OLS model with statsmodels to check the p-values of the X variables
import statsmodels.api as sm
X2 = sm.add_constant(X) 
ols = sm.OLS(y,X2)
lr = ols.fit()
print(lr.summary())

                            OLS Regression Results                            
Dep. Variable:          soundpressure   R-squared:                       0.516
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     318.8
Date:                Mon, 19 Apr 2021   Prob (F-statistic):          1.15e-232
Time:                        14:08:51   Log-Likelihood:                -4490.1
No. Observations:                1503   AIC:                             8992.
Df Residuals:                    1497   BIC:                             9024.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    132

All the p-values are below 0.05 (the significance level), so we will not drop any variable.

In [35]:
#k-fold cross validation using linear regression model

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
cross_val_score(LinearRegression(),X,y,cv=5).mean()

0.5092220740490526
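For a regressor, `cross_val_score` with `cv=5` splits the rows into five contiguous, unshuffled folds and reports the R^2 on each held-out fold. The following is a minimal sketch of that procedure in plain NumPy, using synthetic data and a hypothetical helper `kfold_r2`. It illustrates the idea and is not scikit-learn's implementation.

```python
import numpy as np

def kfold_r2(X, y, k=5):
    """Return the held-out R^2 of an OLS fit for each of k contiguous folds (hypothetical helper)."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)   # contiguous folds, no shuffling
    scores = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False          # everything outside the fold is training data

        # Fit OLS with an intercept on the training rows.
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)

        # Score on the held-out rows.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        pred = A_test @ beta
        ss_res = ((y[test_idx] - pred) ** 2).sum()
        ss_tot = ((y[test_idx] - y[test_idx].mean()) ** 2).sum()
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, 100)

scores = kfold_r2(X_demo, y_demo, k=5)
print(scores, scores.mean())
```

Because the folds are not shuffled, the result depends on row order, which is why the rows were shuffled with `data.sample(frac = 1, random_state=0)` before this step.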

In [36]:
model = LinearRegression()
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [37]:
model.intercept_

132.83380577837818

In [38]:
model.coef_

array([-1.28220711e-03, -4.21911706e-01, -3.56880012e+01,  9.98540449e-02,
       -1.47300519e+02])

Inferences:

SoundPressure = 132.834 - 0.0013 * freq - 0.4219 * angle of attack - 35.69 * chord length + 0.0999 * free-stream velocity - 147.3 * suction side displacement thickness
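As a sanity check, the fitted equation can be evaluated by hand. This sketch copies the values printed by `model.intercept_` and `model.coef_` and applies them to the first row shown by `data.head()`. The function name `predict_sound_pressure` and the dictionary layout are illustrative choices, not part of the notebook.

```python
# Coefficients copied from the model.intercept_ and model.coef_ outputs.
INTERCEPT = 132.83380577837818
COEFS = {
    "freq": -1.28220711e-03,
    "angle": -4.21911706e-01,
    "chord": -3.56880012e+01,
    "stream-velocity": 9.98540449e-02,
    "displacement-thickness": -1.47300519e+02,
}

def predict_sound_pressure(row):
    """Evaluate the fitted linear equation for one observation given as a dict."""
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)

# First row of data.head(); its measured soundpressure is 126.201 dB.
first_row = {
    "freq": 800,
    "angle": 0.0,
    "chord": 0.3048,
    "stream-velocity": 71.3,
    "displacement-thickness": 0.002663,
}

prediction = predict_sound_pressure(first_row)
print(round(prediction, 3))
```

For this row the equation gives roughly 127.7 dB, about 1.5 dB above the measured value.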

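The model is weak: the mean 5-fold R^2 is only about 0.51, so the linear fit explains roughly half of the variance in sound pressure. A non-linear model may predict the output better, since the correlation study shows only weak linear correlations between the inputs and the output (the largest in magnitude is -0.39, for freq).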
