# Assignment 5

# Part 1

Split the data into a 70-30 split for training and testing data.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

In [2]:
#reading the data 
df = pd.read_csv('homework/radar_parameters.csv')

#creating different dataframes for design matrix and target variable.
x_data = df.iloc[:,1:7]
y_data = df.iloc[:,7]

In [4]:
#importing the train_test_split to split the data. 
from sklearn.model_selection import train_test_split

#splitting 70% of the data into training and 30% into testing data
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.3)
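
Because `train_test_split` shuffles the rows, the metrics reported later will change from run to run unless the split is seeded. The following is a minimal sketch of a repeatable 70-30 split. It assumes a small synthetic array in place of the radar data and an arbitrary seed of 42; neither comes from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the design matrix and target (hypothetical data).
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Fixing random_state makes the 70-30 shuffle repeatable across runs.
split_a = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.3, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(len(x_tr), len(x_te))                    # sizes of the two partitions
print(np.array_equal(split_a[1], split_b[1]))  # same test rows both times
```

Passing the same `random_state` in the notebook would make the reported RMSE and R^2 values reproducible, at the cost of tying the results to one particular shuffle.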

# Part 2

Using the split created in (1), train a multiple linear regression model using the training dataset, and validate it using the testing dataset. Compare the R^2 and root mean square errors of the model on the training and testing sets to a baseline prediction of rain rate using the formula Z = 200R^1.6.

In [31]:
#imports to calculate RMSE and R^2
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from math import sqrt

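#calculating the baseline prediction of rain rate using the given formula Z = 200R^1.6.
#Zh is in dBZ, so convert it to linear reflectivity first: Z = 10^(dBZ/10)
df_z = 10**(x_data['Zh (dBZ)']/10)
#inverting Z = 200R^1.6 gives R = (Z/200)^(1/1.6)
df_r = (df_z/200)**(1/1.6)

#calculating baseline RMSE and R^2 (over the full dataset, not only the test split).
baseline_rms = sqrt(mean_squared_error(y_data, df_r))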
baseline_r2 = r2_score(y_data, df_r)
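
The baseline conversion can be checked on a single value. The sketch below wraps the two steps (dBZ to linear reflectivity, then inverting Z = 200R^1.6) in a helper. The function name `rain_rate_from_dbz` and its parameters `a` and `b` are illustrative and not part of the original notebook.

```python
import numpy as np

def rain_rate_from_dbz(dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b for R, given reflectivity in dBZ."""
    z_linear = 10.0 ** (np.asarray(dbz, dtype=float) / 10.0)  # dBZ -> linear Z
    return (z_linear / a) ** (1.0 / b)

# A reflectivity of 10*log10(200) dBZ corresponds to Z = 200, which gives R = 1.
dbz_for_unit_rain = 10.0 * np.log10(200.0)
print(rain_rate_from_dbz([dbz_for_unit_rain, 40.0]))
```

At 40 dBZ the linear reflectivity is 10000, so the helper returns (10000/200)^(1/1.6) = 50^0.625.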

In [7]:
#importing LinearRegression model to fit the training data.

from sklearn.linear_model import LinearRegression
model_linear = LinearRegression(fit_intercept=True)

model_linear.fit(x_train, y_train)

In [8]:
#predicting y for the test data
y_test_predicted = model_linear.predict(x_test)

#calculating RMSE and R^2 for linear model. 
rmse_linear = sqrt(mean_squared_error(y_test, y_test_predicted))
r2_linear = r2_score(y_test, y_test_predicted)

In [35]:
print('RMSE of baseline:', baseline_rms)
print('R^2 of baseline:', baseline_r2)

print('RMSE of linear regression:', rmse_linear)
print('R^2 of linear regression:', r2_linear)

RMSE of baseline: 7.157590840042378
R^2 of baseline: 0.3023229070437503
RMSE of linear regression: 0.9420616042289477
R^2 of linear regression: 0.988487617518314


As we can see from above, the baseline prediction has an R^2 of only 0.3023, which means that it explains only 30.23% of the variation in the data, which is poor. The linear model has a much higher R^2 (0.9885) and a much lower RMSE (0.9421 versus 7.1576) than the baseline. Note that the baseline metrics are computed over the full dataset while the linear model is scored on the test split, so the comparison is not strictly like for like. Even so, the gap is large enough to conclude that the linear model does a better job of predicting the rain rate for the given data.
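
For reference, both metrics can be written out by hand. This is a minimal NumPy sketch of the standard definitions that `mean_squared_error` and `r2_score` implement; the helper names and the three-point toy data are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Toy data: one prediction is off by 1.
y_toy = [1.0, 2.0, 3.0]
y_toy_pred = [1.0, 2.0, 4.0]
print(rmse(y_toy, y_toy_pred), r_squared(y_toy, y_toy_pred))
```

RMSE is in the units of the target, so it reads directly as a typical error in rain rate. R^2 is unitless and can even be negative when a model does worse than predicting the mean.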

# Part 3

Repeat 1 with polynomial regression, using a grid search over polynomial orders 0-21 and cross-validation with 7 folds. For the best polynomial model in terms of R^2, does it outperform the baseline and the linear regression model in terms of R^2 and root mean square error?

In [11]:
#importing for polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

#making a pipeline for polynomial regression
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [12]:
#importing GridSearchCV
from sklearn.model_selection import GridSearchCV

#defining the parameter grid; np.arange(7) covers degrees 0 through 6
param_grid = {'polynomialfeatures__degree': np.arange(7)}

#grid search with 7-fold cross-validation, finding the best model in terms of R^2
poly_grid = GridSearchCV(PolynomialRegression(), param_grid, scoring='r2', cv=7)

The grid search in the above block was performed only for orders 0-6 (np.arange(7) stops before 7), since the search over orders 0-21 was taking too long to compute.
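
The cost of the higher orders comes from the number of polynomial terms, which for n input features and degree d is the binomial coefficient C(n + d, d) when the bias term is included. The sketch below assumes 6 input features, as implied by `df.iloc[:,1:7]`; the helper name is illustrative.

```python
from math import comb

n_features = 6  # assumed from x_data = df.iloc[:, 1:7]

def n_polynomial_terms(n_features, degree):
    """Number of monomials of total degree <= degree, including the bias term."""
    return comb(n_features + degree, degree)

# Compare the selected degree, the largest degree searched, and the requested maximum.
for degree in (2, 6, 21):
    print(degree, n_polynomial_terms(n_features, degree))
```

Degree 2 yields 28 columns and degree 6 yields 924, while degree 21 would yield 296010 columns per sample, each refit seven times under 7-fold cross-validation.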

In [13]:
#fitting the training data to the grid search 
poly_grid.fit(x_train, y_train)

In [14]:
poly_grid.best_params_

{'polynomialfeatures__degree': 2}

In [33]:
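#predicting the test data with the best model found in the grid search.
#best_estimator_ is already refit on the full training set (refit=True by default),
#so the extra fit call below is redundant but harmless.
poly_model = poly_grid.best_estimator_

y_test_poly_pred = poly_model.fit(x_train, y_train).predict(x_test)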

In [17]:
#calculating RMSE and R^2 for the best polynomial model.
rmse_poly = sqrt(mean_squared_error(y_test, y_test_poly_pred))
r2_poly = r2_score(y_test, y_test_poly_pred)

In [36]:
print('RMSE of polynomial regression:', rmse_poly)
print('R^2 of polynomial regression:', r2_poly)

print('RMSE of linear regression:', rmse_linear)
print('R^2 of linear regression:', r2_linear)

print('RMSE of baseline:', baseline_rms)
print('R^2 of baseline:', baseline_r2)

RMSE of polynomial regression: 0.21769848653700608
R^2 of polynomial regression: 0.9993852232673677
RMSE of linear regression: 0.9420616042289477
R^2 of linear regression: 0.988487617518314
RMSE of baseline: 7.157590840042378
R^2 of baseline: 0.3023229070437503


As we can see from above, the polynomial model also outperforms the baseline by a large margin. While the linear and polynomial models have close R^2 values (0.988 and 0.999), the polynomial model has a much lower RMSE (0.2177 versus 0.9421) and a slightly higher R^2. Hence, we can conclude that the polynomial model outperforms both the baseline and the linear model.

# Part 4

Repeat 1 with a Random Forest Regressor, and perform a grid_search on the following parameters:

In [19]:
#importing and defining a Random Forest Regression model
#(the initial 50 trees are only a placeholder; n_estimators is overridden by the search below)
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(50)

In [20]:
#defining the parameter grid using the values given in the question
random_param_grid = {'bootstrap': [True, False],  
                        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],  
                        'max_features': ['auto', 'sqrt'],  
                        'min_samples_leaf': [1, 2, 4],  
                        'min_samples_split': [2, 5, 10],  
                        'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

In [21]:
#defining the randomized search (it samples parameter combinations instead of trying all of them)
from sklearn.model_selection import RandomizedSearchCV

forest_grid = RandomizedSearchCV(estimator=forest, param_distributions=random_param_grid, scoring='r2')
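
To see why a randomized search is used here rather than an exhaustive one, the size of the parameter grid can be counted directly. The sketch below repeats the grid from the cell above and relies on the scikit-learn defaults for `RandomizedSearchCV` (`n_iter=10`, `cv=5`), since the notebook does not set either.

```python
from math import prod

random_param_grid = {'bootstrap': [True, False],
                     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
                     'max_features': ['auto', 'sqrt'],
                     'min_samples_leaf': [1, 2, 4],
                     'min_samples_split': [2, 5, 10],
                     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

# An exhaustive grid search would try every combination of values.
n_combinations = prod(len(values) for values in random_param_grid.values())

# RandomizedSearchCV defaults (not set in the notebook): 10 sampled candidates, 5 folds.
n_iter, cv = 10, 5
n_fits = n_iter * cv

print(n_combinations, n_fits)
```

With the defaults, only 10 of the 3960 combinations are evaluated, so the reported best parameters are the best of a small random sample rather than of the whole grid.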

In [23]:
#fitting the training data to the randomized search
forest_grid.fit(x_train, y_train)



In [24]:
forest_grid.best_params_

{'n_estimators': 1200,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': None,
 'bootstrap': False}

In [37]:
#predicting the test data with the best estimator found in the randomized search
forest_model = forest_grid.best_estimator_

y_test_forest_pred = forest_model.fit(x_train, y_train).predict(x_test)

In [28]:
#calculating the RMSE and R^2 values for the random forest regressor.
rmse_forest = sqrt(mean_squared_error(y_test, y_test_forest_pred))
r2_forest = r2_score(y_test, y_test_forest_pred)

In [38]:
print('RMSE of random forest:', rmse_forest)
print('R^2 of random forest:', r2_forest)

print('RMSE of polynomial regression:', rmse_poly)
print('R^2 of polynomial regression:', r2_poly)

print('RMSE of linear regression:', rmse_linear)
print('R^2 of linear regression:', r2_linear)

print('RMSE of baseline:', baseline_rms)
print('R^2 of baseline:', baseline_r2)

RMSE of random forest: 1.2693003462040438
R^2 of random forest: 0.9791005238823449
RMSE of polynomial regression: 0.21769848653700608
R^2 of polynomial regression: 0.9993852232673677
RMSE of linear regression: 0.9420616042289477
R^2 of linear regression: 0.988487617518314
RMSE of baseline: 7.157590840042378
R^2 of baseline: 0.3023229070437503


As we can see from above, the best Random Forest Regressor found by the search has a higher RMSE and a slightly lower R^2 than both the linear and polynomial models. Even though the random forest regressor outperforms the baseline, it does not perform better than the linear or polynomial models.

In conclusion, the baseline performs the worst and every other model outperforms it, whereas the best polynomial model found using the grid search performs the best and outperforms all other models.
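
The reported metrics can be collected into one table for a side-by-side view. This is a small sketch using the values printed above; the row labels are chosen here for readability.

```python
import pandas as pd

# Metrics copied from the printed outputs in Parts 2-4.
results = pd.DataFrame(
    {
        "RMSE": [7.157590840042378, 0.9420616042289477,
                 0.21769848653700608, 1.2693003462040438],
        "R^2": [0.3023229070437503, 0.988487617518314,
                0.9993852232673677, 0.9791005238823449],
    },
    index=["baseline", "linear regression",
           "polynomial (degree 2)", "random forest"],
)

# Sort from best to worst by RMSE.
ranked = results.sort_values("RMSE")
print(ranked.round(4))
```

Sorting by RMSE gives the same ordering as sorting by R^2 in descending order: polynomial, linear, random forest, baseline.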