### Three models are compared to see which best predicts rain rate.

#### - Linear Regression Model: Used in Part 2 to perform multiple linear regression on the dataset.

#### - Polynomial Regression Model: Used in Part 3, where a grid search finds the best polynomial degree in terms of R^2.

#### - Random Forest Regressor Model: Used in Part 4, where a grid search finds the best combination of hyperparameters in terms of R^2.

# -------------------------------------------------------------

### Part 1: Split the data into a 70-30 split for training and testing data.

In [19]:
# Import modules

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


In [9]:
# Load the dataset
data = pd.read_csv('homework/radar_parameters.csv') 

data.columns

Index(['Unnamed: 0', 'Zh (dBZ)', 'Zdr (dB)', 'Ldr (dB)', 'Kdp (deg km-1)',
       'Ah (dBZ/km)', 'Adr (dB/km)', 'R (mm/hr)'],
      dtype='object')

In [10]:
# Split the data into features and target
X = data[['Zh (dBZ)', 'Zdr (dB)', 'Ldr (dB)', 'Kdp (deg km-1)', 'Ah (dBZ/km)', 'Adr (dB/km)']] # Features
y = data['R (mm/hr)'] # Target

In [11]:
# Split the data into training and testing sets (70-30 split).
# test_size=0.3 holds out 30% of the rows for testing;
# random_state=42 makes the split reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ----------------------------------------------------

### Part 2: Using the split created in Part 1, train a multiple linear regression model on the training dataset and validate it on the testing dataset. Compare the R^2 and root mean square error of the model on the training and testing sets to a baseline prediction of rain rate using the formula Z = 200 * R^1.6.


In [14]:
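# Baseline prediction using the provided equation Z = 200 * R^1.6
# NOTE: as written, this evaluates the right-hand side on the observed rain rate,
# so the "prediction" is a reflectivity-like value Z, not a rain-rate estimate.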
baseline_pred_train = 200 * np.power(y_train, 1.6)
baseline_pred_test = 200 * np.power(y_test, 1.6)

# Baseline R^2
baseline_r2_train = r2_score(y_train, baseline_pred_train)
baseline_r2_test = r2_score(y_test, baseline_pred_test)

# Baseline RMSE for the baseline prediction
baseline_rmse_train = np.sqrt(mean_squared_error(y_train, baseline_pred_train))
baseline_rmse_test = np.sqrt(mean_squared_error(y_test, baseline_pred_test))


print("Baseline R^2 (Train):", baseline_r2_train)
print("Baseline R^2 (Test):", baseline_r2_test)
print("Baseline RMSE (Train):", baseline_rmse_train)
print("Baseline RMSE (Test):", baseline_rmse_test)

Baseline RMSE (Train): 22008.970472421162
Baseline RMSE (Test): 24078.46941419433
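
For reference, a rain-rate baseline built on Z = 200 * R^1.6 (the Marshall-Palmer relation) would usually invert it to R = (Z / 200)^(1 / 1.6), with Z in linear units obtained from reflectivity in dBZ as Z = 10^(dBZ / 10). The sketch below shows that inversion in plain Python. The function name `rain_rate_from_dbz` and the sample rain rate are illustrative assumptions, and this is not the code that produced the numbers above.

```python
import math

def rain_rate_from_dbz(zh_dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b for R (mm/hr), given reflectivity in dBZ.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    z_linear = 10.0 ** (zh_dbz / 10.0)  # dBZ -> linear Z
    return (z_linear / a) ** (1.0 / b)

# Round trip: start from a known rain rate, compute Z, convert to dBZ, invert.
r_true = 10.0                          # mm/hr (made-up sample value)
z_linear = 200.0 * r_true ** 1.6       # forward relation
zh_dbz = 10.0 * math.log10(z_linear)   # linear Z -> dBZ
r_est = rain_rate_from_dbz(zh_dbz)

print("Zh (dBZ):", round(zh_dbz, 2))
print("Recovered R (mm/hr):", round(r_est, 6))
```

Because the helper uses only `**` and `/`, it should also accept a NumPy array or a pandas column such as `X_test['Zh (dBZ)']`, assuming that column really holds reflectivity in dBZ.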


In [16]:
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_reg_pred_train = lin_reg.predict(X_train)
lin_reg_pred_test = lin_reg.predict(X_test)
lin_reg_rmse_train = np.sqrt(mean_squared_error(y_train, lin_reg_pred_train))
lin_reg_rmse_test = np.sqrt(mean_squared_error(y_test, lin_reg_pred_test))
lin_reg_r2_train = r2_score(y_train, lin_reg_pred_train)
lin_reg_r2_test = r2_score(y_test, lin_reg_pred_test)

print("Baseline R^2 (Train):", baseline_r2_train)
print("Baseline R^2 (Test):", baseline_r2_test)
print("Linear Regression R^2 (Train):", lin_reg_r2_train)
print("Linear Regression R^2 (Test):", lin_reg_r2_test)
print("Linear Regression RMSE (Train):", lin_reg_rmse_train)
print("Linear Regression RMSE (Test):", lin_reg_rmse_test)

Baseline R^2 (Train): -6875917.308315697
Baseline R^2 (Test): -7216632.894739466
Linear Regression R^2 (Train): 0.9879085512445995
Linear Regression R^2 (Test): 0.9890992951689396
Linear Regression RMSE (Train): 0.922940159028789
Linear Regression RMSE (Test): 0.9358124742086971


##### The negative baseline R^2 scores mean the baseline performs far worse than a horizontal line (a model that always predicts the mean of the target variable). Scores this extreme (about -6.9 million on the training set and -7.2 million on the test set) arise because the baseline code evaluates 200 * R^1.6 on the observed rain rate, which yields a reflectivity-like value Z on a very different scale from the target rain rate.

##### As computed, the baseline is therefore a poor fit for the data. In contrast, the R^2 scores for the linear regression model are about 0.988 on the training set and 0.989 on the testing set, so the model explains nearly 99% of the variance in the target variable.

##### The RMSE values for the linear regression model are also low, about 0.92 mm/hr on the training set and 0.94 mm/hr on the testing set, so the model's predictions are close to the actual values.

##### Overall, the linear regression model for rain prediction performs well on both the training and testing datasets, with nearly identical scores on the two (no sign of overfitting), and it outperforms the baseline as computed by a wide margin. This indicates that the linear regression model captures the underlying patterns in the data effectively.
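
The negative R^2 values discussed above follow directly from the definition R^2 = 1 - SS_res / SS_tot: predicting the mean gives exactly 0, and any prediction with a larger squared error than the mean goes below 0. The sketch below works this out by hand on a few made-up rain rates (the values are illustrative assumptions, not rows from the dataset).

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 5.0, 10.0, 20.0]               # made-up rain rates (mm/hr)
mean_pred = [sum(y_true) / len(y_true)] * 4   # "horizontal line" model
off_scale = [200 * y ** 1.6 for y in y_true]  # Z-scale values used as predictions

r2_mean = r2(y_true, mean_pred)
r2_off_scale = r2(y_true, off_scale)

print("R^2, always predicting the mean:", r2_mean)
print("R^2, off-scale predictions:", r2_off_scale)
```

The second score is hugely negative for the same reason as the baseline scores in the notebook: the squared errors dwarf the variance of the target.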

# -----------------------------------------------------

### Part 3: Using the split from Part 1, run a grid search over polynomial orders 0-21 with 7-fold cross-validation. Does the best polynomial model in terms of R^2 outperform the baseline and the linear regression model in terms of R^2 and root mean square error?

In [20]:
# Polynomial Regression with Grid Search
poly_reg = make_pipeline(PolynomialFeatures(), LinearRegression())

param_grid_poly = {'polynomialfeatures__degree': np.arange(0, 22)}

grid_search_poly = GridSearchCV(poly_reg, param_grid_poly, cv=7, scoring='r2')
grid_search_poly.fit(X_train, y_train)

best_poly_model = grid_search_poly.best_estimator_
best_poly_pred_train = best_poly_model.predict(X_train)
best_poly_pred_test = best_poly_model.predict(X_test)
best_poly_rmse_train = np.sqrt(mean_squared_error(y_train, best_poly_pred_train))
best_poly_rmse_test = np.sqrt(mean_squared_error(y_test, best_poly_pred_test))
best_poly_r2_train = r2_score(y_train, best_poly_pred_train)
best_poly_r2_test = r2_score(y_test, best_poly_pred_test)

print("Best Polynomial Regression R^2 (Train):", best_poly_r2_train)
print("Best Polynomial Regression R^2 (Test):", best_poly_r2_test)
print("Best Polynomial Regression RMSE (Train):", best_poly_rmse_train)
print("Best Polynomial Regression RMSE (Test):", best_poly_rmse_test)

# Comparison with Baseline and Linear Regression
# (Random Forest metrics are not computed until Part 4, so they are not printed here.)
print("\nComparison with Baseline and Linear Regression:")
print("Baseline R^2 (Test):", baseline_r2_test)
print("Linear Regression R^2 (Test):", lin_reg_r2_test)
print("Best Polynomial Regression R^2 (Test):", best_poly_r2_test)
print("Baseline RMSE (Test):", baseline_rmse_test)
print("Linear Regression RMSE (Test):", lin_reg_rmse_test)
print("Best Polynomial Regression RMSE (Test):", best_poly_rmse_test)
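
It can help to see how large this search is before running it. With the default settings of `PolynomialFeatures` (bias column included, all interaction terms kept), a degree-d expansion of n input features has C(n + d, d) columns. The sketch below tabulates that count for the six radar features; the helper name `n_poly_terms` is an illustrative assumption.

```python
import math

def n_poly_terms(n_features, degree):
    """Number of columns in a full polynomial expansion, bias term included."""
    return math.comb(n_features + degree, degree)

n_features = 6  # Zh, Zdr, Ldr, Kdp, Ah, Adr
for degree in (0, 1, 2, 3, 5, 10, 21):
    print(f"degree {degree:2d}: {n_poly_terms(n_features, degree):,} columns")
```

At degree 21 the expansion has 296,010 columns, so the upper end of the grid is likely to be slow and memory-hungry, and high-degree fits are prone to overfitting.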


### Part 4: Using the split from Part 1, fit a Random Forest Regressor and perform a grid search on the following parameters:

In [None]:
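# Random Forest Regressor with Grid Search
# random_state makes the forest's random sampling reproducible
rf = RandomForestRegressor(random_state=42)

param_grid_rf = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant
    # "use all features", which is written as 1.0 in current versions
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# n_jobs=-1 runs the cross-validation fits in parallel on all available cores
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='r2', n_jobs=-1)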
grid_search_rf.fit(X_train, y_train)

best_rf_model = grid_search_rf.best_estimator_
best_rf_pred_train = best_rf_model.predict(X_train)
best_rf_pred_test = best_rf_model.predict(X_test)
best_rf_rmse_train = np.sqrt(mean_squared_error(y_train, best_rf_pred_train))
best_rf_rmse_test = np.sqrt(mean_squared_error(y_test, best_rf_pred_test))
best_rf_r2_train = r2_score(y_train, best_rf_pred_train)
best_rf_r2_test = r2_score(y_test, best_rf_pred_test)

print("Best Random Forest Regression R^2 (Train):", best_rf_r2_train)
print("Best Random Forest Regression R^2 (Test):", best_rf_r2_test)
print("Best Random Forest Regression RMSE (Train):", best_rf_rmse_train)
print("Best Random Forest Regression RMSE (Test):", best_rf_rmse_test)
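
The cost of this exhaustive search can be estimated up front: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per cross-validation fold. The sketch below does that arithmetic for the grid above, using only the count of values for each parameter (the variable names are illustrative).

```python
import math

# Number of values listed for each hyperparameter in param_grid_rf
values_per_param = {
    'bootstrap': 2,
    'max_depth': 11,
    'max_features': 2,
    'min_samples_leaf': 3,
    'min_samples_split': 3,
    'n_estimators': 10,
}
cv_folds = 5

n_candidates = math.prod(values_per_param.values())
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print("Candidate combinations:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With 3,960 combinations and 19,800 fits of forests containing up to 2,000 trees each, a randomized search over the same ranges may be a more practical alternative if the full grid takes too long.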

In [None]:
# Compare the results of all three models to the baseline

# Baseline metrics
print("Baseline R^2 (Test):", baseline_r2_test)
print("Baseline RMSE (Test):", baseline_rmse_test)
print()

# Linear Regression metrics
print("Linear Regression R^2 (Test):", lin_reg_r2_test)
print("Linear Regression RMSE (Test):", lin_reg_rmse_test)
print()

# Polynomial Regression metrics
print("Best Polynomial Regression R^2 (Test):", best_poly_r2_test)
print("Best Polynomial Regression RMSE (Test):", best_poly_rmse_test)
print()

# Random Forest Regression metrics
print("Best Random Forest Regression R^2 (Test):", best_rf_r2_test)
print("Best Random Forest Regression RMSE (Test):", best_rf_rmse_test)
print()

# Comparing to Baseline
print("Linear Regression Improvement in R^2 over Baseline:", lin_reg_r2_test - baseline_r2_test)
print("Linear Regression Improvement in RMSE over Baseline:", baseline_rmse_test - lin_reg_rmse_test)
print()

print("Polynomial Regression Improvement in R^2 over Baseline:", best_poly_r2_test - baseline_r2_test)
print("Polynomial Regression Improvement in RMSE over Baseline:", baseline_rmse_test - best_poly_rmse_test)
print()

print("Random Forest Regression Improvement in R^2 over Baseline:", best_rf_r2_test - baseline_r2_test)
print("Random Forest Regression Improvement in RMSE over Baseline:", baseline_rmse_test - best_rf_rmse_test)