# S09 T02: Supervised Learning - Regressions


**Description**

Let's practice and get familiar with regressions.

**Objectives**

 * Regression models
 * Regression trees
 * Random Forest
 * Neural networks
 * Other models

## Exercise 1
Create at least three different regression models to predict the flight arrival delay (ArrDelay) from DelayedFlights.csv as accurately as possible.



In [95]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

In [96]:
# Models normally perform better with data that has been cleaned and prepared beforehand
# We load the prepared sample (cleaned, normalized, standardized and with dummy variables) from Exercise S09 T01
df_sample = pd.read_csv('DelayFlightsPreparedDataSample_NO_PCA.csv')

In [97]:
df_sample.shape

(20000, 652)

Before we apply any model, we do the train-test split.

In [98]:
#The target variable is ArrDelay 
X = df_sample.drop(['ArrDelay'], axis=1).values
y = df_sample.ArrDelay.values

X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train_sample.shape, X_test_sample.shape, y_train_sample.shape, y_test_sample.shape)

(15000, 651) (5000, 651) (15000,) (5000,)


### Multiple Linear Regression

Bibliography: a good reference on simple, multiple and polynomial linear regression.
https://realpython.com/linear-regression-in-python/

In [99]:
from sklearn.linear_model import LinearRegression

In [100]:
model = LinearRegression().fit(X_train_sample, y_train_sample)


In [101]:
print('intercept:', model.intercept_)

intercept: -8069664863.470742


In [102]:
print('slope:', model.coef_)

slope: [ 3.90491264e+06 -5.47649659e-05 -2.85191969e-05 -1.54316741e-04
  1.36089837e-03 -1.31263745e-03  1.44639691e-03 -1.49735052e-03
  9.55242908e+06 -2.18794320e-03 -9.55242908e+06  2.87745870e-03
 -3.70328961e-02 -5.03075150e+07 -1.36952100e+08 -1.47353197e+07
  3.74040100e-03 -5.36767766e-05 -2.85580754e-05 -5.67906536e-05
 -1.95945984e-04 -4.96525317e-05  2.94985355e+04  2.94985394e+04
  2.94985509e+04  2.94985392e+04  2.94985408e+04  2.94985373e+04
  2.94985377e+04  2.94985420e+04  2.94985367e+04  2.94985390e+04
  2.94985353e+04  2.94985388e+04  2.94985390e+04  2.94985359e+04
  2.94985373e+04  2.94985352e+04  2.94985407e+04  2.94985406e+04
  2.94985391e+04  2.94985395e+04  9.70099557e+03  9.70098610e+03
  9.70098807e+03  9.70100438e+03  9.70096567e+03  9.70099441e+03
  9.70099613e+03 -1.55883928e+05  1.19464639e+05  9.70104128e+03
  9.70097763e+03  9.70100758e+03  1.46099735e+05  9.70098871e+03
  1.39333240e+05  9.70097914e+03  9.70099812e+03  9.70098173e+03
  9.70098875e+03  

In [103]:
# R Square/Adjusted R Square
r_sq = model.score(X_train_sample, y_train_sample)
r_sq

0.9989886485922281

The R2 is very high, so we use the Adjusted R2 to check whether this is due to overfitting.

Bibliography: https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b

"R Square is a good measure to determine how well the model fits the dependent variables. However, it does not take into consideration of overfitting problem. If your regression model has many independent variables, because the model is too complicated, it may fit very well to the training data but performs badly for testing data. That is why Adjusted R Square is introduced because it will penalize additional independent variables added to the model and adjust the metric to prevent overfitting issues."

In [104]:
#R_Square and Adjusted R Square
import statsmodels.api as sm
X_addC = sm.add_constant(X_train_sample)
result = sm.OLS(y_train_sample, X_addC).fit()
print(result.rsquared, result.rsquared_adj)

0.9989886487403703 0.9989487694010266
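The adjustment printed above can be made concrete with the standard Adjusted R-squared formula. The sketch below is only an illustration: the function name `adjusted_r2` and the example numbers are assumptions for demonstration, not values from this notebook. Note also that statsmodels bases its adjustment on the residual degrees of freedom of the fitted model, so with collinear dummy columns its `rsquared_adj` may differ from what the raw column count would give.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical example: the same R2 is penalized more as features are added.
few_features = adjusted_r2(0.90, n_samples=101, n_features=10)
many_features = adjusted_r2(0.90, n_samples=101, n_features=50)
print(few_features, many_features)
```

When the number of samples is much larger than the number of features, as in this notebook, the penalty is small and the two values stay close.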


In [105]:
# Linear Regression Predictions
y_pred = model.predict(X_test_sample)

In [106]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
from sklearn.metrics import mean_squared_error
import math
MSE_LR = mean_squared_error(y_test_sample, y_pred)
RMSE_LR = math.sqrt(mean_squared_error(y_test_sample, y_pred))
print('Mean Squared Error =', MSE_LR)
print('Root Mean Squared Error =', RMSE_LR)

Mean Squared Error = 79318360.02245145
Root Mean Squared Error = 8906.085561145897
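To make the metrics used throughout this notebook explicit, here is a minimal sketch that computes MSE, RMSE and R-squared by hand. The helper name `regression_metrics` and the toy values are assumptions for illustration. Because R-squared equals 1 minus the residual sum of squares divided by the total sum of squares, while MSE keeps the squared units of the target, MSE values are only comparable between models evaluated on the same target and the same data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Hypothetical toy values, not taken from the flights data.
mse, rmse, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(mse, rmse, r2)
```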


### Random Forest 

Bibliography: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

In [107]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

In [108]:
# Instantiate model with 100 decision trees
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)

# Train the model on training data
rf.fit(X_train_sample, y_train_sample);

In [109]:
# Use the forest's predict method on the test data
y_pred_RF = rf.predict(X_test_sample)

In [110]:
rf_r2 = rf.score(X_test_sample, y_test_sample)
rf_r2

0.9982344828216638

In [111]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_RF = mean_squared_error(y_test_sample, y_pred_RF)
RMSE_RF = math.sqrt(mean_squared_error(y_test_sample, y_pred_RF))
print('Mean Squared Error =', MSE_RF)
print('Root Mean Squared Error =', RMSE_RF)

Mean Squared Error = 0.002092231740449457
Root Mean Squared Error = 0.045740919759548526


### Neural Networks

Bibliography: https://www.pluralsight.com/guides/machine-learning-neural-networks-scikit-learn

Neural networks are sensitive to the scale of their inputs, so all predictors should be normalized (here, to values between 0 and 1).

The previous dataframe contained standardized variables with values outside the 0 to 1 range. Therefore we take a copy of the dataframe from Exercise S09T01, cleaned of NaN values but otherwise unprepared, and prepare it for the Neural Network model by creating dummy variables and normalizing all the predictors.
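The notebook normalizes by dividing each predictor by its maximum over the whole sample. A common alternative, shown here only as a sketch and not as what the notebook does, is min-max scaling with statistics learned from the training rows alone, so that nothing about the test rows leaks into the preparation step. The helper names and the toy matrices are assumptions.

```python
import numpy as np


def fit_min_max(X_train):
    """Learn per-column minimum and range from the training data only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range


def apply_min_max(X, col_min, col_range):
    """Scale X with the statistics learned by fit_min_max."""
    return (X - col_min) / col_range


# Hypothetical toy matrices standing in for the train and test predictors.
X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_toy = np.array([[20.0, 10.0]])

col_min, col_range = fit_min_max(X_train_toy)
X_train_scaled = apply_min_max(X_train_toy, col_min, col_range)
X_test_scaled = apply_min_max(X_test_toy, col_min, col_range)
print(X_train_scaled)
print(X_test_scaled)  # test values may fall outside [0, 1]
```

With this approach the training columns land exactly in the 0 to 1 range, while unseen rows can fall slightly outside it.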

In [112]:
df_sample2 = pd.read_csv ('DelayFlightsNOPreparedDataSample.csv')
df_sample2.shape

(20000, 27)

In [113]:
df2 = pd.get_dummies(df_sample2)
df2.shape

(20000, 599)

In [114]:
target_column = ['ArrDelay'] 
predictors = list(set(list(df2.columns))-set(target_column))
df2[predictors] = df2[predictors]/df2[predictors].max()
df2.describe().transpose()

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 20000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 |
| Month | 20000.0 | 0.509412 | 0.290598 | 0.083333 | 0.250000 | 0.500000 | 0.750000 | 1.0 |
| DayofMonth | 20000.0 | 0.509347 | 0.283190 | 0.032258 | 0.258065 | 0.516129 | 0.741935 | 1.0 |
| DayOfWeek | 20000.0 | 0.568507 | 0.284771 | 0.142857 | 0.285714 | 0.571429 | 0.857143 | 1.0 |
| DepTime | 20000.0 | 0.632455 | 0.188653 | 0.000417 | 0.500000 | 0.643750 | 0.792083 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Dest_YAK | 20000.0 | 0.000250 | 0.015810 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| Dest_YUM | 20000.0 | 0.000300 | 0.017318 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| CancellationCode_A | 20000.0 | 0.000100 | 0.010000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| CancellationCode_B | 20000.0 | 0.000100 | 0.010000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |


In [115]:
# The target variable is ArrDelay
X = df2[predictors].values
# ravel() flattens y to 1-D, the shape scikit-learn estimators expect for a single target
y = df2[target_column].values.ravel()

X_train_NN, X_test_NN, y_train_NN, y_test_NN = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train_NN.shape, X_test_NN.shape, y_train_NN.shape, y_test_NN.shape)

(15000, 598) (5000, 598) (15000,) (5000,)


In [116]:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500, random_state=42)

mlp.fit(X_train_NN, y_train_NN)

y_pred_NN = mlp.predict(X_test_NN)


In [117]:
# R Square
mlp_r2 = mlp.score(X_test_NN, y_test_NN)
mlp_r2

0.9984614696856783

In [118]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_NN = mean_squared_error(y_test_NN, y_pred_NN)
RMSE_NN = math.sqrt(mean_squared_error(y_test_NN, y_pred_NN))
print('Mean Squared Error =', MSE_NN)
print('Root Mean Squared Error =', RMSE_NN)

Mean Squared Error = 69.93865071642688
Root Mean Squared Error = 8.36293314073638


## Exercise 2
Compare them based on the MSE and the R2.

In [119]:
mse_r2 = [{'R-Square': r_sq, 'MSE': MSE_LR}, {'R-Square': rf_r2, 'MSE': MSE_RF}, {'R-Square': mlp_r2, 'MSE': MSE_NN}]
df_mse_r2 = pd.DataFrame(mse_r2, index=['Linear Regression (LR)','Random Forest (RF)', 'Neural Network (NN)'])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |


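All three models have an R-squared above 0.99, which indicates a very good fit. Note that the Linear Regression R-squared was computed on the training set, whereas the Random Forest and Neural Network values were computed on the test set.

Of the three models, Random Forest has the smallest test MSE, which means it is the most accurate and would be the best model for this dataset. The Neural Network MSE is probably not directly comparable, because that model predicts the ArrDelay of the unprepared dataset, where only the predictors were rescaled.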

## Exercise 3
Train them using the different parameters they accept.

### Random Forest 

We can limit the maximum depth of the trees with max_depth=5 (by default the trees grow without a depth limit). This decreases the accuracy and increases the error of the model.
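The effect of `max_depth` can be explored in a loop rather than one value at a time. The sketch below uses small synthetic data (an assumption, not the flights sample) and only 50 trees to keep it fast; the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 500 rows, 5 features, non-linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

depth_scores = {}
for depth in (2, 5, None):  # None means no depth limit
    forest = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    depth_scores[depth] = forest.score(X_te, y_te)  # R-squared on the test split

for depth, r2 in depth_scores.items():
    print(f"max_depth={depth}: test R2 = {r2:.3f}")
```

Comparing the scores side by side shows how much accuracy is traded away for shallower, faster trees.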

In [120]:
# Instantiate model with 100 decision trees, each limited to a maximum depth of 5
rf = RandomForestRegressor(n_estimators = 100, max_depth=5, random_state = 42)

# Train the model on training data
rf.fit(X_train_sample, y_train_sample);

In [121]:
# Use the forest's predict method on the test data
y_pred_RF = rf.predict(X_test_sample)

In [122]:
rf_r2 = rf.score(X_test_sample, y_test_sample)
rf_r2

0.9533856605284281

In [123]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_RF = mean_squared_error(y_test_sample, y_pred_RF)
RMSE_RF = math.sqrt(mean_squared_error(y_test_sample, y_pred_RF))
print('Mean Squared Error =', MSE_RF)
print('Root Mean Squared Error =', RMSE_RF)

Mean Squared Error = 0.055240471063792436
Root Mean Squared Error = 0.23503291485192546


In [124]:
score = pd.DataFrame([{'R-Square': rf_r2, 'MSE': MSE_RF}], index=['RF - Max Depth 5'])
df_mse_r2 = pd.concat([df_mse_r2, score])  # DataFrame.append was removed in pandas 2.0
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |


We can also increase the number of estimators, which increases the training time but improves the results only marginally.

In [125]:
# Instantiate model with 250 decision trees
rf = RandomForestRegressor(n_estimators = 250, random_state = 42)

# Train the model on training data
rf.fit(X_train_sample, y_train_sample);

In [126]:
# Use the forest's predict method on the test data
y_pred_RF = rf.predict(X_test_sample)

In [127]:
rf_r2 = rf.score(X_test_sample, y_test_sample)
rf_r2

0.9982648755302602

In [128]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_RF = mean_squared_error(y_test_sample, y_pred_RF)
RMSE_RF = math.sqrt(mean_squared_error(y_test_sample, y_pred_RF))
print('Mean Squared Error =', MSE_RF)
print('Root Mean Squared Error =', RMSE_RF)

Mean Squared Error = 0.002056214764583173
Root Mean Squared Error = 0.045345504348095776


In [129]:
score = pd.DataFrame([{'R-Square': rf_r2, 'MSE': MSE_RF}], index=['RF - 250 estimators'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |


### Neural Networks

Let's change the solver to lbfgs. The scikit-learn documentation says the following about the solver parameter:

**solver{‘lbfgs’, ‘sgd’, ‘adam’}, default=’adam’**
The solver for weight optimization.

 - ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.

 - ‘sgd’ refers to stochastic gradient descent.

 - ‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba

**Note:** The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.

In our case, lbfgs gives results similar to adam, with a somewhat lower MSE, although the optimizer reaches the iteration limit before converging.

In [130]:
mlp = MLPRegressor(hidden_layer_sizes=(8,8,8), activation='relu', solver='lbfgs', max_iter=500, random_state=42)

mlp.fit(X_train_NN, y_train_NN)

y_pred_NN = mlp.predict(X_test_NN)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html


In [131]:
# R Square
mlp_r2 = mlp.score(X_test_NN, y_test_NN)
mlp_r2

0.9989451325986837

In [132]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_NN = mean_squared_error(y_test_NN, y_pred_NN)
RMSE_NN = math.sqrt(mean_squared_error(y_test_NN, y_pred_NN))
print('Mean Squared Error =', MSE_NN)
print('Root Mean Squared Error =', RMSE_NN)

Mean Squared Error = 47.95225810375541
Root Mean Squared Error = 6.924756898531197


In [133]:
score = pd.DataFrame([{'R-Square': mlp_r2, 'MSE': MSE_NN}], index=['NN - lbfgs solver'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |


Now let's try a different learning_rate. The scikit-learn documentation says the following about this parameter:

**learning_rate{‘constant’, ‘invscaling’, ‘adaptive’}, default=’constant’**
Learning rate schedule for weight updates.
**Only used when solver=’sgd’.**

 - ‘constant’ is a constant learning rate given by ‘learning_rate_init’.

 - ‘invscaling’ gradually decreases the learning rate learning_rate_ at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t)

 - ‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5.
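The two non-constant schedules described above reduce to simple arithmetic. The sketch below is a simplified illustration: the function names are hypothetical, the `stalled` flag stands in for "two consecutive epochs failed to improve by at least tol", and the numeric values are assumptions chosen for the example.

```python
def invscaling_rate(learning_rate_init: float, t: int, power_t: float) -> float:
    """Effective learning rate at time step t (t >= 1) under 'invscaling'."""
    return learning_rate_init / pow(t, power_t)


def adaptive_rate(current_rate: float, stalled: bool) -> float:
    """Under 'adaptive', the rate is divided by 5 once training stalls."""
    return current_rate / 5 if stalled else current_rate


# Illustrative values (assumptions): initial rate 0.001 and power_t 0.5.
for step in (1, 4, 100):
    print(step, invscaling_rate(0.001, step, 0.5))
print(adaptive_rate(0.001, stalled=True))
```

With these values the invscaling rate halves by step 4 and is ten times smaller by step 100, while the adaptive rate changes only when progress stops.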

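Using the sgd solver with the adaptive learning rate does not improve the results either. In fact the R-squared drops to about zero, which means the model predicts no better than the mean of the target.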

In [134]:
mlp = MLPRegressor(hidden_layer_sizes=(8,8,8), activation='relu', solver='sgd', learning_rate='adaptive', max_iter=500, random_state=42)

mlp.fit(X_train_NN, y_train_NN)

y_pred_NN = mlp.predict(X_test_NN)


In [135]:
# R Square
mlp_r2 = mlp.score(X_test_NN, y_test_NN)
mlp_r2

-3.1638189456284493e-05

In [136]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_NN = mean_squared_error(y_test_NN, y_pred_NN)
RMSE_NN = math.sqrt(mean_squared_error(y_test_NN, y_pred_NN))
print('Mean Squared Error =', MSE_NN)
print('Root Mean Squared Error =', RMSE_NN)

Mean Squared Error = 45459.52900482318
Root Mean Squared Error = 213.21240349666147


In [137]:
score = pd.DataFrame([{'R-Square': mlp_r2, 'MSE': MSE_NN}], index=['NN - sgd solver'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |


Finally, we increase the size of the hidden layers from 8 to 10 neurons each (the number of layers stays at three) and note that the results are again very similar to the first run.

In [138]:
mlp = MLPRegressor(hidden_layer_sizes=(10,10,10), activation='relu', solver='adam', max_iter=500, random_state=42)

mlp.fit(X_train_NN, y_train_NN)

y_pred_NN = mlp.predict(X_test_NN)


In [139]:
# R Square
mlp_r2 = mlp.score(X_test_NN, y_test_NN)
mlp_r2

0.9984454762708215

In [140]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_NN = mean_squared_error(y_test_NN, y_pred_NN)
RMSE_NN = math.sqrt(mean_squared_error(y_test_NN, y_pred_NN))
print('Mean Squared Error =', MSE_NN)
print('Root Mean Squared Error =', RMSE_NN)

Mean Squared Error = 70.6656808210805
Root Mean Squared Error = 8.406288171427416


In [141]:
score = pd.DataFrame([{'R-Square': mlp_r2, 'MSE': MSE_NN}], index=['NN - more hidden layers'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |
| NN - more hidden layers | 0.998445 | 70.66568 |


## Exercise 4
Compare their performance using the train/test approach or using all the data (internal validation).

The train-test split was used in the previous exercises, so here we fit the model on the complete sample, without a train-test split.

In [142]:
X = df_sample.drop(['ArrDelay'], axis=1).values
y = df_sample.ArrDelay.values

model = LinearRegression().fit(X, y)


In [143]:
print('intercept:', model.intercept_)

intercept: 98906133.88583083


In [144]:
print('slope:', model.coef_)

slope: [-2.58264882e+04 -6.30592535e-05 -4.35995995e-05 -1.45164067e-04
  1.35578586e-03 -1.31030628e-03  1.45119794e-03 -1.49927402e-03
 -4.43312528e+06 -2.18462944e-03  4.43312528e+06  3.37145163e-03
 -3.88632260e-02  2.33468906e+07  6.35572181e+07  8.54770829e+07
  1.84082788e-03 -6.04372326e-05 -3.22833657e-05 -6.03438530e-05
 -2.28824909e-04 -5.37014566e-05 -1.92151804e+04 -1.92151764e+04
 -1.92151675e+04 -1.92151732e+04 -1.92151744e+04 -1.92151775e+04
 -1.92151777e+04 -1.92151749e+04 -1.92151739e+04 -1.92151778e+04
 -1.92151835e+04 -1.92151771e+04 -1.92151780e+04 -1.92151794e+04
 -1.92151780e+04 -1.92151792e+04 -1.92151755e+04 -1.92151742e+04
 -1.92151773e+04 -1.92151768e+04  2.92958845e+04  2.92958744e+04
  2.92958808e+04  2.92958985e+04  2.92958571e+04  2.92958775e+04
  2.92958899e+04  1.26716450e+05 -2.22224186e+04  2.92959320e+04
  2.92958700e+04  2.92959010e+04  2.92959185e+04  2.92958819e+04
  1.47306973e+05  2.92958771e+04  2.92958874e+04  2.92958746e+04
  2.92958805e+04  

In [145]:
# R Square/Adjusted R Square
r_sq = model.score(X, y)
r_sq

0.9990237284124105

The R2 is very high, so we use the Adjusted R2 to check whether this is due to overfitting.

In [146]:
#R_Square and Adjusted R Square
X_addC = sm.add_constant(X)
result = sm.OLS(y, X_addC).fit()
print(result.rsquared, result.rsquared_adj)

0.9990237284152584 0.9989942587223383


In [147]:
# Linear Regression Predictions
y_pred = model.predict(X)

In [148]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
from sklearn.metrics import mean_squared_error
import math
MSE_LR = mean_squared_error(y, y_pred)
RMSE_LR = math.sqrt(mean_squared_error(y, y_pred))
print('Mean Squared Error =', MSE_LR)
print('Root Mean Squared Error =', RMSE_LR)

Mean Squared Error = 0.0010182586758625835
Root Mean Squared Error = 0.03191016571349299


In [149]:
score = pd.DataFrame([{'R-Square': r_sq, 'MSE': MSE_LR}], index=['LR - No train-test'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |
| NN - more hidden layers | 0.998445 | 70.66568 |
| LR - No train-test | 0.999024 | 0.001018259 |


As expected, the R-squared and the MSE improve, but we no longer have any data for testing the model and we may be overfitting it, which can produce unexpected results with new values. In particular, predictions on new data may not be as accurate as these metrics suggest.
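The gap between in-sample and held-out error can be illustrated with a small experiment. This is a sketch on synthetic data (an assumption, not the flights sample) using plain least squares from NumPy; the helper names are hypothetical. With many features relative to the number of training rows, the error measured on the rows used for fitting is optimistic.

```python
import numpy as np

# Synthetic data (assumption): 100 rows, 40 features, only the first one matters.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 40))
y_all = 2.0 * X_all[:, 0] + rng.normal(size=100)


def add_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef


def mse(X, y, coef):
    """Mean squared error of the linear predictions."""
    return float(np.mean((y - add_intercept(X) @ coef) ** 2))


# Train on the first 60 rows, keep the last 40 unseen.
coef = fit_ols(X_all[:60], y_all[:60])
in_sample_mse = mse(X_all[:60], y_all[:60], coef)
held_out_mse = mse(X_all[60:], y_all[60:], coef)
print("in-sample MSE:", in_sample_mse)
print("held-out MSE:", held_out_mse)
```

The in-sample figure alone would suggest a better model than the held-out rows confirm, which is the risk of evaluating on all the data.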

## Exercise 5
Carry out some feature engineering process to improve the prediction.

In Exercise S09T01 we already cleaned the dataframe and standardized and normalized some variables. Here we apply PCA to reduce the large number of features before applying Linear Regression again. The goal is to reduce the MSE, which would mean a better linear regression model.

In [150]:
df_PCA = df_sample.copy()
df_PCA.shape

(20000, 652)

In [151]:
# Train-test-split  for the PCA
# The target variable is ArrDelay 
X = df_PCA.drop(['ArrDelay'], axis=1).values
y = df_PCA.ArrDelay.values

X_train_PCA, X_test_PCA, y_train_PCA, y_test_PCA = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train_PCA.shape, X_test_PCA.shape, y_train_PCA.shape, y_test_PCA.shape)

(15000, 651) (5000, 651) (15000,) (5000,)


In [152]:
from sklearn.decomposition import PCA

In [153]:
# Normally 95% would already be a good value for the explained variance, but in this case we can push it to 99%
pca = PCA(.99)
pca.fit(X_train_PCA)
X_train_PCA = pca.transform(X_train_PCA)
X_test_PCA = pca.transform(X_test_PCA)
print('We have {} features which have been reduced using PCA to {} maintaining the 99% of the variance'.format(X.shape[1], pca.n_components_))

We have 651 features which have been reduced using PCA to 4 maintaining the 99% of the variance
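Passing a fraction to `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below reproduces that selection rule with NumPy on synthetic data (an assumption); the function name is hypothetical. It also hints at why so few components can survive: when a handful of columns have a much larger spread than the rest, they dominate the total variance.

```python
import numpy as np


def n_components_for_variance(X, threshold):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds `threshold` (0 < threshold < 1)."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values ** 2  # proportional to per-component variance
    ratio_cumsum = np.cumsum(explained_variance / explained_variance.sum())
    return int(np.searchsorted(ratio_cumsum, threshold, side="right") + 1)


# Synthetic data (assumption): three independent columns with very different spreads.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])

k_95 = n_components_for_variance(X_toy, 0.95)
k_999 = n_components_for_variance(X_toy, 0.999)
print(k_95, k_999)
```

Because PCA is driven by variance, the scale of each column before the transformation strongly influences how many components are retained.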


### Multiple Linear Regression


In [154]:
model_pca = LinearRegression().fit(X_train_PCA, y_train_PCA)

In [155]:
print('intercept:', model_pca.intercept_)

intercept: -0.007677603746895104


In [156]:
print('slope:', model_pca.coef_)

slope: [-0.00016775  0.00265949 -0.00200046  0.00085615]


In [157]:
#R_Square and Adjusted R Square
X_addC = sm.add_constant(X_train_PCA)
result = sm.OLS(y_train_PCA, X_addC).fit()
print(result.rsquared, result.rsquared_adj)

0.9986414192296028 0.9986410568205943


In [158]:
# Linear Regression Predictions
y_pred = model_pca.predict(X_test_PCA)

In [159]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_LR_PCA = mean_squared_error(y_test_PCA, y_pred)
RMSE_LR_PCA = math.sqrt(mean_squared_error(y_test_PCA, y_pred))
print('Mean Squared Error =', MSE_LR_PCA)
print('Root Mean Squared Error =', RMSE_LR_PCA)

Mean Squared Error = 0.0013485301179322418
Root Mean Squared Error = 0.03672233813269849


In [160]:
score = pd.DataFrame([{'R-Square': result.rsquared, 'MSE': MSE_LR_PCA}], index=['LR - PCA Feature'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |
| NN - more hidden layers | 0.998445 | 70.66568 |
| LR - No train-test | 0.999024 | 0.001018259 |
| LR - PCA Feature | 0.998641 | 0.00134853 |


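Conclusion: compared with the Linear Regression on the cleaned and prepared dataset without PCA, the R-squared is very similar, but the test MSE improves considerably when PCA is used (0.00135 versus 79318360). A likely reason is that PCA removes the collinearity among the dummy variables, which without PCA produced very large coefficients.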

## Exercise 6
Do not use the DepDelay variable when making predictions.

In [161]:
#The target variable is ArrDelay 
X = df_sample.drop(['ArrDelay', 'DepDelay'], axis=1).values
y = df_sample.ArrDelay.values

X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X, y, test_size=0.25, random_state=1)
print(X_train_sample.shape, X_test_sample.shape, y_train_sample.shape, y_test_sample.shape)

(15000, 650) (5000, 650) (15000,) (5000,)


### Multiple Linear Regression


In [162]:
model = LinearRegression().fit(X_train_sample, y_train_sample)

print('intercept:', model.intercept_)
print('slope:', model.coef_)

intercept: -15821662.75944578
slope: [ 7.76955099e+03 -5.27815112e-05 -2.80665815e-05 -1.55402994e-04
  1.36068150e-03 -1.31252885e-03  1.44665734e-03 -1.49742543e-03
  2.35292988e+04 -2.18761946e-03 -2.35292965e+04 -1.42912719e-02
 -1.23916171e+05 -3.37336876e+05 -4.91220158e+05  6.22484009e-03
 -9.12661562e-07  2.42586284e-05 -6.64888194e-06 -1.48594064e-04
  2.88557203e-06 -7.32502019e+02 -7.32498521e+02 -7.32486808e+02
 -7.32498484e+02 -7.32497055e+02 -7.32500422e+02 -7.32500094e+02
 -7.32495621e+02 -7.32501204e+02 -7.32498947e+02 -7.32502491e+02
 -7.32499043e+02 -7.32498812e+02 -7.32501724e+02 -7.32500487e+02
 -7.32502594e+02 -7.32497027e+02 -7.32497284e+02 -7.32498610e+02
 -7.32498220e+02  8.16425696e+01  8.16325431e+01  8.16344418e+01
  8.16510755e+01  8.16128055e+01  8.16408882e+01  8.16427242e+01
  5.87078985e+02  3.44684189e+02  8.16868664e+01  8.16247629e+01
  8.16546681e+01 -3.39266167e+02  8.16359743e+01 -8.59438219e+01
  8.16261005e+01  8.16440572e+01  8.16283858e+01  8.1

In [163]:
#R_Square and Adjusted R Square
X_addC = sm.add_constant(X_train_sample)
result = sm.OLS(y_train_sample, X_addC).fit()
print(result.rsquared, result.rsquared_adj)


# Linear Regression Predictions
y_pred = model.predict(X_test_sample)



0.9989884567630553 0.998948642712845


In [164]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)

MSE_LR6 = mean_squared_error(y_test_sample, y_pred)
RMSE_LR6 = math.sqrt(mean_squared_error(y_test_sample, y_pred))
print('Mean Squared Error =', MSE_LR6)
print('Root Mean Squared Error =', RMSE_LR6)

Mean Squared Error = 209.24602351083973
Root Mean Squared Error = 14.465338693263968


In [165]:
score = pd.DataFrame([{'R-Square': result.rsquared, 'MSE': MSE_LR6}], index=['LR - without DepDelay'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |
| NN - more hidden layers | 0.998445 | 70.66568 |
| LR - No train-test | 0.999024 | 0.001018259 |
| LR - PCA Feature | 0.998641 | 0.00134853 |


Using Multiple Linear Regression, the test MSE is lower (the model improves) when DepDelay is not considered: 209.25 versus 79318360 with DepDelay.

### Random Forest


In [166]:
# Instantiate model with 100 decision trees
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)

# Train the model on training data
rf.fit(X_train_sample, y_train_sample);

In [167]:
# Use the forest's predict method on the test data
y_pred_RF = rf.predict(X_test_sample)

In [168]:
r_sq = rf.score(X_test_sample, y_test_sample)
r_sq

0.9936551307545963

In [169]:
# Mean Square Error(MSE)/Root Mean Square Error(RMSE)
MSE_RF = mean_squared_error(y_test_sample, y_pred_RF)
RMSE_RF = math.sqrt(mean_squared_error(y_test_sample, y_pred_RF))
print('Mean Squared Error =', MSE_RF)
print('Root Mean Squared Error =', RMSE_RF)

Mean Squared Error = 0.007519007454090806
Root Mean Squared Error = 0.08671221052476293


Using Random Forest, the model still gets good results, although slightly worse than when DepDelay was in the dataframe, as shown by the higher RMSE (0.087 versus 0.046).

In [170]:
score = pd.DataFrame([{'R-Square': r_sq, 'MSE': MSE_RF}], index=['RF - without DepDelay'])
df_mse_r2 = pd.concat([df_mse_r2, score])
df_mse_r2

| Model | R-Square | MSE |
|---|---|---|
| Linear Regression (LR) | 0.998989 | 79318360.0 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| NN - sgd solver | -3.2e-05 | 45459.53 |
| NN - more hidden layers | 0.998445 | 70.66568 |
| LR - No train-test | 0.999024 | 0.001018259 |
| LR - PCA Feature | 0.998641 | 0.00134853 |


In [171]:
df_mse_r2.sort_values(by="MSE")

| Model | R-Square | MSE |
|---|---|---|
| LR - No train-test | 0.999024 | 0.001018259 |
| LR - PCA Feature | 0.998641 | 0.00134853 |
| RF - 250 estimators | 0.998265 | 0.002056215 |
| Random Forest (RF) | 0.998234 | 0.002092232 |
| RF - without DepDelay | 0.993655 | 0.007519007 |
| RF - Max Depth 5 | 0.953386 | 0.05524047 |
| NN - lbfgs solver | 0.998945 | 47.95226 |
| Neural Network (NN) | 0.998461 | 69.93865 |
| NN - more hidden layers | 0.998445 | 70.66568 |
| LR - without DepDelay | 0.998988 | 209.246 |


**Conclusion**

The Random Forest model quite consistently provides the best results, with small differences in the MSE depending on how the data has been prepared and on the model parameters used.

Linear Regression provides a good result after the data has been reduced using PCA.

Linear Regression using all the data provides the lowest MSE, but this is not recommended, as the model may be overfitted and there is no data left to test it. If there is enough data, a train-test split should always be considered.