## Module 5 Homework

In [None]:
# Importing necessary packages

import numpy as np  # used below for np.sqrt and np.inf
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

# Loading the data
df_raw = pd.read_csv('/content/radar_parameters.csv')
df = df_raw.drop(columns=['Unnamed: 0'])
# Separating the features from the target, the rain rate R (mm/hr)

X = df.drop('R (mm/hr)', axis=1)
y = df['R (mm/hr)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
#Training a multiple linear regression model:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)

y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print("Training R^2:", r2_score(y_train, y_train_pred))
print("Testing R^2:", r2_score(y_test, y_test_pred))
print("Training RMSE:", np.sqrt(mean_squared_error(y_train, y_train_pred)))
print("Testing RMSE:", np.sqrt(mean_squared_error(y_test, y_test_pred)))

Training R^2: 0.9879085512445995
Testing R^2: 0.9890992951689396
Training RMSE: 0.9229401590287888
Testing RMSE: 0.9358124742086974
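
For reference, the two metrics printed above can be written out by hand. The sketch below is a minimal pure-Python version, assuming equal-length lists of observed and predicted values. The function names `r2` and `rmse` are illustrative and are not part of scikit-learn.

```python
import math

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

observed = [1.0, 2.0, 3.0, 4.0]   # toy values, not taken from the radar data
mean_only = [2.5, 2.5, 2.5, 2.5]  # always predicting the mean of the observations

print("Perfect fit:", r2(observed, observed), rmse(observed, observed))
print("Mean-only prediction:", r2(observed, mean_only), rmse(observed, mean_only))
```

A model that always predicts the mean of the observations scores $R^2 = 0$, and a model that does worse than that scores below zero, so $R^2$ has no lower bound.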


In [None]:
# Baseline prediction from reflectivity alone, using the Z-R relationship Z = 200 * R**1.6
# Zh is stored in dBZ (a logarithmic scale), so it is converted to linear units before the relationship is applied

def baseline_prediction(Zh_dBZ):
    # Convert Zh from dBZ to linear Z
    Z_linear = 10**(Zh_dBZ / 10)
    # Now apply the Z-R relationship
    R = (Z_linear / 200)**(1/1.6)
    return R


baseline_train_pred = baseline_prediction(X_train['Zh (dBZ)'])
baseline_test_pred = baseline_prediction(X_test['Zh (dBZ)'])

print("Baseline Training R^2:", r2_score(y_train, baseline_train_pred))
print("Baseline Testing R^2:", r2_score(y_test, baseline_test_pred))
print("Baseline Training RMSE:", np.sqrt(mean_squared_error(y_train, baseline_train_pred)))
print("Baseline Testing RMSE:", np.sqrt(mean_squared_error(y_test, baseline_test_pred)))

Baseline Training R^2: 0.27555056242697507
Baseline Testing R^2: 0.35664291868109677
Baseline Training RMSE: 7.143950117300888
Baseline Testing RMSE: 7.189316160047872
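
The baseline inverts the power law $Z = 200\,R^{1.6}$, a relationship commonly attributed to Marshall and Palmer, after converting reflectivity from dBZ to linear units. As a quick sanity check on that conversion, the sketch below runs a few rain rates forward to dBZ and back again. The helper names are illustrative and the rain rates are arbitrary example values.

```python
import math

A, B = 200.0, 1.6  # coefficients of Z = A * R**B, as used in baseline_prediction above

def rain_rate_to_dbz(rain_rate):
    """Forward direction: rain rate (mm/hr) -> reflectivity in dBZ."""
    z_linear = A * rain_rate ** B
    return 10.0 * math.log10(z_linear)

def dbz_to_rain_rate(zh_dbz):
    """Inverse direction, the same arithmetic as baseline_prediction."""
    z_linear = 10.0 ** (zh_dbz / 10.0)
    return (z_linear / A) ** (1.0 / B)

for rain_rate in (0.5, 1.0, 10.0, 50.0):
    dbz = rain_rate_to_dbz(rain_rate)
    recovered = dbz_to_rain_rate(dbz)
    print(f"R = {rain_rate:5.1f} mm/hr -> {dbz:5.2f} dBZ -> R = {recovered:5.1f} mm/hr")
```

Because the relationship has fixed coefficients and uses a single input, any scatter in the data that depends on the other radar parameters is left unexplained, which is consistent with the low baseline $R^2$.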


## Using HPC (CUDA via GPUs on Google Colab for grid search)

### Environment Sanity Check

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` in the cell below to see which GPU you have. Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
!nvidia-smi

Sun Mar 31 04:06:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 460, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (100/100), done.
remote: Total 460 (delta 131), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (460/460), 126.19 KiB | 4.67 MiB/s, done.
Resolving deltas: 100% (233/233), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 1.9 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
***********************************************************************
Woo! Your instance has a Tesla T4 GPU!
We will install the latest stable RAPIDS via pip 24.2.*!  Please stand by, should be quick...
***********************************************************************

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cudf-cu12==24.2.*
  Downloading https://pypi.nvidia.

In [None]:
import cudf
cudf.__version__

'24.02.02'

In [None]:
import cuml
cuml.__version__

'24.02.00'

In [None]:
import cugraph
cugraph.__version__

'24.02.00'

In [None]:
import cuspatial
cuspatial.__version__

'24.02.00'

In [None]:
import cuxfilter
cuxfilter.__version__

'24.02.00'

In [None]:
# Grid search over polynomial degrees
# Note: this pipeline is built from scikit-learn estimators, so it runs on the CPU even on a GPU runtime

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression

# Storing the best degree and score
best_degree = None
best_score = -np.inf

# Defining the total range of degrees
total_degrees = range(22)

# Splitting the total degrees into smaller batches if needed
degree_batches = [total_degrees[i:i+5] for i in range(0, len(total_degrees), 5)]

for batch in degree_batches:
    try:
        # Defining the range of degrees for the grid search
        poly_params = {'polynomialfeatures__degree': batch}

        # Creating a pipeline with PolynomialFeatures and LinearRegression
        poly_model = make_pipeline(PolynomialFeatures(), LinearRegression())

        # Setting up the grid search with 7-fold cross-validation
        poly_grid = GridSearchCV(poly_model, poly_params, cv=7, scoring='r2', n_jobs=1)

        # Fitting the grid search to the training data
        poly_grid.fit(X_train, y_train)

        # Checking if the best score in this batch is better than the overall best score
        if poly_grid.best_score_ > best_score:
            best_score = poly_grid.best_score_
            best_degree = poly_grid.best_params_['polynomialfeatures__degree']

    except MemoryError:
        print(f"Memory error occurred with batch: {batch}")
        continue
    except Exception as e:
        print(f"An error occurred with batch: {batch}: {e}")
        continue

print("Best polynomial degree overall:", best_degree)
print("Best polynomial R^2 overall:", best_score)

Best polynomial degree overall: 2
Best polynomial R^2 overall: 0.9969985736508612
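
One reason to search the degrees in batches and guard against `MemoryError` is that the number of columns produced by `PolynomialFeatures` grows combinatorially. With its default settings (bias column included, no interaction-only restriction), $n$ input features at degree $d$ expand to $\binom{n+d}{d}$ columns. The sketch below tabulates that count. The value `n_features = 5` is hypothetical, since the number of columns in `X_train` is not shown above.

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Number of output columns for a full polynomial expansion, bias term included."""
    return comb(n_features + degree, degree)

n_features = 5  # hypothetical; the real value is X_train.shape[1]

for degree in (1, 2, 5, 10, 21):
    print(f"degree {degree:2d}: {n_polynomial_terms(n_features, degree)} columns")
```

Under this assumption the design matrix has a few dozen columns at degree 2 but tens of thousands at degree 21, so the highest degrees dominate both memory use and run time.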


In [None]:
#GPU enabled Random Forest Regression

import cudf
import numpy as np
from cuml.ensemble import RandomForestRegressor
from cuml.metrics import r2_score

# Assuming X_train and y_train are already defined and are Pandas DataFrames
# Convert the data to cuDF DataFrames for RAPIDS
X_train_cudf = cudf.DataFrame.from_pandas(X_train.astype('float32'))
y_train_cudf = cudf.Series(y_train.astype('float32'))

# Define your parameter grid
rf_params = {
    'bootstrap': [True, False],
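    # cuML expects an integer max_depth; the None entry used in the original run (valid in scikit-learn)
    # is the likely source of the TypeError shown in the output below, so it is dropped here
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],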
    'max_features': [1.0, X_train_cudf.shape[1] ** 0.5 / X_train_cudf.shape[1]],  # Use a float for percentage of features
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# Initialize best score and parameters
best_score = -np.inf
best_params = {}

# Loop over the grid
for bootstrap in rf_params['bootstrap']:
    for max_depth in rf_params['max_depth']:
        for max_features in rf_params['max_features']:
            for min_samples_leaf in rf_params['min_samples_leaf']:
                for min_samples_split in rf_params['min_samples_split']:
                    for n_estimators in rf_params['n_estimators']:
                        print(f"Training with: n_estimators={n_estimators}, max_depth={max_depth}, max_features={max_features}, min_samples_leaf={min_samples_leaf}, min_samples_split={min_samples_split}, bootstrap={bootstrap}")

                        # Create and train the model
                        rf = RandomForestRegressor(n_estimators=n_estimators,
                                                   max_depth=max_depth,
                                                   max_features=max_features,
                                                   min_samples_leaf=min_samples_leaf,
                                                   min_samples_split=min_samples_split,
                                                   bootstrap=bootstrap,
                                                   n_streams=1,  # For reproducibility
                                                   random_state=42)
                        rf.fit(X_train_cudf, y_train_cudf)

                        # Make predictions and evaluate on the training data
                        # (this gives a training R^2, not a validation or test score)
                        y_pred = rf.predict(X_train_cudf)
                        score = r2_score(y_train_cudf, y_pred)
                        print(f"R^2 Score: {score}")

                        # Compare with the best score
                        if score > best_score:
                            best_score = score
                            best_params = {
                                'n_estimators': n_estimators,
                                'max_depth': max_depth,
                                'max_features': max_features,
                                'min_samples_leaf': min_samples_leaf,
                                'min_samples_split': min_samples_split,
                                'bootstrap': bootstrap
                            }

# Output the best parameters and the best score
print("Best Random Forest parameters:", best_params)
print("Best Random Forest R^2 score:", best_score)

Training with: n_estimators=200, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.9527973532676697
Training with: n_estimators=400, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.9526132345199585
Training with: n_estimators=600, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.9526083469390869
Training with: n_estimators=800, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.9525827169418335
Training with: n_estimators=1000, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.9526445865631104
Training with: n_estimators=1200, max_depth=10, max_features=1.0, min_samples_leaf=1, min_samples_split=2, bootstrap=True
R^2 Score: 0.952664852142334
Training with: n_estimators=1400, max_depth=10, max_features=1.0, min_samples_leaf=1, min_sam

TypeError: '<=' not supported between instances of 'NoneType' and 'int'
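
The six nested loops above can be flattened with `itertools.product`, which also makes it easy to count the combinations before starting a long run. The sketch below is one way to do that, under two assumptions: `max_depth` is restricted to integers (the run above stopped with a `TypeError` involving `None`), and `n_features = 4` is a hypothetical stand-in for `X_train.shape[1]`.

```python
from itertools import product

n_features = 4  # hypothetical column count; the real value is X_train.shape[1]

rf_params = {
    'bootstrap': [True, False],
    'max_depth': list(range(10, 101, 10)),        # integers only
    'max_features': [1.0, n_features ** 0.5 / n_features],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': list(range(200, 2001, 200)),
}

# One dict per combination; the last key varies fastest, matching the nested loops
names = list(rf_params)
combinations = [dict(zip(names, values)) for values in product(*rf_params.values())]

print("Number of combinations:", len(combinations))
print("First combination:", combinations[0])
```

Each dictionary can then be unpacked into the estimator, for example `RandomForestRegressor(**params, n_streams=1, random_state=42)`. Counting first shows that the full grid contains thousands of forests, many with over a thousand trees, which may be worth trimming before a Colab session.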

# Interpretation

The table below summarizes $R^2$ and root mean square error (RMSE) for the baseline, multiple linear regression, best polynomial, and random forest models. A dash marks a metric that was not computed in the cells above.

| Model | Training $R^2$ | Testing $R^2$ | Training RMSE (mm/hr) | Testing RMSE (mm/hr) |
|-------|----------------|---------------|-----------------------|----------------------|
| **Baseline** | 0.2756 | 0.3566 | 7.1440 | 7.1893 |
| **Multiple Linear Regression** | 0.9879 | 0.9891 | 0.9229 | 0.9358 |
| **Best Polynomial (degree=2)** | - | 0.9970 (mean 7-fold CV score on the training split) | - | - |
| **Best Random Forest Regressor** | 0.9552 (scored on the training data) | - | - | - |

### Analysis:

- **Baseline vs. Models**: Every fitted model has a far higher $R^2$ than the baseline, which reaches only 0.36 on the test set. RMSE was computed only for multiple linear regression, where it drops from 7.19 to 0.94 mm/hr on the test set. The baseline applies a fixed Z-R relationship to reflectivity alone, whereas the fitted models use all of the radar parameters.

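- **Multiple Linear Regression vs. Polynomial Regression**: The best polynomial model (degree=2) reaches a cross-validated $R^2$ of 0.9970, above the 0.9891 test $R^2$ of multiple linear regression. The two scores come from different procedures (7-fold cross-validation on the training split versus the held-out test set), so the comparison is indicative rather than exact. High polynomial degrees can overfit, but the grid search over degrees 0 through 21 selected degree 2, so the higher-degree terms did not improve the cross-validated fit.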

- **Best Optimized Random Forest Regressor**: The random forest reaches an $R^2$ of 0.9552, well above the baseline but below both the linear and the polynomial model. The loop above scores each forest on the data it was trained on, and the search ended with a `TypeError` before covering the whole grid, so this value should be read as provisional.

**Possible explanation**: Polynomial models can capture more complex relationships between the features and the target than a purely linear model. If the underlying relationship is nonlinear (quadratic, cubic, etc.), polynomial regression can fit that curvature, which leads to higher accuracy. The Z-R relationship used for the baseline is itself a power law, so some nonlinearity between the radar parameters and rain rate is expected.
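
The point about curvature can be illustrated on synthetic data. The sketch below is not based on the radar dataset: it generates a noisy quadratic with made-up coefficients, fits a straight line and a degree-2 polynomial with NumPy, and compares their $R^2$ on points that were held out of the fit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic target with a quadratic component (all coefficients are made up)
x = np.linspace(-3.0, 3.0, 200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Even-indexed points are used for fitting, odd-indexed points are held out
x_fit, y_fit = x[::2], y[::2]
x_held, y_held = x[1::2], y[1::2]

def held_out_r2(degree):
    """Fit a polynomial of the given degree and score it on the held-out points."""
    coeffs = np.polyfit(x_fit, y_fit, deg=degree)
    residuals = y_held - np.polyval(coeffs, x_held)
    total = y_held - y_held.mean()
    return 1.0 - np.sum(residuals**2) / np.sum(total**2)

r2_linear = held_out_r2(1)
r2_quadratic = held_out_r2(2)

print(f"Linear fit    held-out R^2: {r2_linear:.4f}")
print(f"Quadratic fit held-out R^2: {r2_quadratic:.4f}")
```

Scoring both fits on the same held-out points is what makes the two numbers directly comparable, and the same approach would put all four models in the table above on an equal footing.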
