# Further Exploration of Machine Learning Models
As explained in `baselineModelTesting.ipynb`, we want to predict lap distance from the other lap data. We used Linear Regression for the baseline model and got RMSE and MAE values of 1.88 and 0.8, respectively. We now train the other candidate algorithms (KNN, Decision Trees, Random Forests, and Gradient Boosting), each with and without cross validation, and compare them to this baseline.

## Import Modules and Clean the Data

In [36]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold, train_test_split

# Load the lap data, then drop the columns we do not need
packetTrainingDataPath = "../training_data/Elysia.Laps.feather"
df = pd.read_feather(packetTrainingDataPath)
df = df.drop(
        columns=[
            "msgType",
            "_id.$oid",
            "averagepackCurrent.$numberDouble",
            "timestamp.$numberLong",
        ]
    )

# averagepackCurrent must be numeric rather than {"$numberDouble": "NaN"};
# errors='coerce' converts values that cannot be parsed into numeric NaN
df['averagepackCurrent'] = pd.to_numeric(df['averagepackCurrent'], errors='coerce')

#drop the 4 rows with null values
df = df.dropna(subset=['distance', 'averagepackCurrent', 'averagespeed'])

# Remove outliers from the data
# The threshold is 3 standard deviations of distance. The filter below measures it
# from zero rather than from the mean, so the "99.7% of data" rule of thumb
# (which is stated around the mean) does not apply to it directly.
threshold = 3 * np.std(df['distance'])

# keep only rows whose distance lies in [-threshold, threshold]
df = df[(df['distance'] >= -threshold) & (df['distance'] <= threshold)]
# remove negative distance values
df = df[df['distance'] >= 0]

# separate distance (the target) from the other features
X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', 
    'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]
y = df['distance']

display(df) #should be 199x9

| | secondsdifference | totalpowerin | totalpowerout | netpowerout | distance | amphours | batterysecondsremaining | averagespeed | averagepackCurrent |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 205000 | 793.532455 | 1803.385827 | 1009.853372 | 2.814824 | 97.800003 | 22425 | 49.584229 | 15.70 |
| 4 | 3517768 | 798.704841 | 685.739611 | -112.965230 | 4.118535 | 95.300003 | 56521 | 24.939703 | 6.07 |
| 5 | 263001 | 585.339547 | 2634.298876 | 2048.959329 | 3.974127 | 93.599998 | 14468 | 54.412831 | 23.29 |
| 6 | 259000 | 681.537550 | 2573.828806 | 1892.291255 | 3.987067 | 91.900002 | 14517 | 55.419749 | 22.79 |
| 7 | 258500 | 559.875099 | 2739.047764 | 2179.172665 | 3.994517 | 90.099998 | 13283 | 55.529298 | 24.42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 202 | 317495 | 537.555507 | 1679.129062 | 1141.573555 | 3.952078 | 3.800000 | 740 | 44.881966 | 18.49 |
| 203 | 325001 | 475.938651 | 1666.858482 | 1190.919832 | 3.963359 | 2.100000 | 406 | 43.894226 | 18.62 |
| 204 | 2494999 | 700.176665 | 566.324558 | -133.852107 | 0.181906 | 4.700000 | 100215 | 1.106129 | -5.78 |
| 205 | 343500 | 615.194700 | 1680.524756 | 1065.330056 | 3.963687 | 3.000000 | 611 | 41.645220 | 17.67 |
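
The cleaning cell above describes a three-standard-deviation outlier threshold. That rule of thumb is usually applied around the column mean, so for comparison here is a minimal sketch of a mean-centered variant. The function name `within_n_sigma` and the sample values are hypothetical, and this is not the filter that produced the results in this notebook.

```python
import numpy as np

def within_n_sigma(values, n_sigma=3.0):
    """Return a boolean mask that is True where a value lies within
    n_sigma standard deviations of the mean of `values`."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation, same default as np.std
    return np.abs(values - mean) <= n_sigma * std

# Hypothetical lap distances: eleven values near 4 and one extreme value.
sample = np.array([3.9, 4.0, 4.1, 3.95, 4.05, 4.0,
                   3.98, 4.02, 3.97, 4.03, 4.01, 40.0])

mask = within_n_sigma(sample)
filtered = sample[mask]  # the extreme value is dropped, the rest are kept

print(f"kept {filtered.size} of {sample.size} values")
```

On a DataFrame the same mask could be applied as `df[within_n_sigma(df['distance'])]`. Switching to it would change which rows are kept, so the scores reported below would need to be recomputed.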


## KNN

In [37]:
#split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

#train the model with KNN
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train, y_train)

#get predictions on the test data
y_pred_knn = knn_model.predict(X_test)

#evaluate using RMSE and MAE
rmse_knn = np.sqrt(mean_squared_error(y_test, y_pred_knn))
mae_knn = mean_absolute_error(y_test, y_pred_knn)

print(f"RMSE for KNN is: {rmse_knn}")
print(f"MAE for KNN is: {mae_knn}")

RMSE for KNN is: 0.9679120129289978
MAE for KNN is: 0.5085657314801215
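
With its default settings, `KNeighborsRegressor` finds neighbors by Euclidean distance, so features on large numeric scales (such as `secondsdifference`) can dominate features on small scales (such as `averagespeed`). The sketch below uses three rows from the table displayed earlier and only those two features to illustrate the effect of z-score standardization. It is an illustration under those assumptions, not a step performed in this notebook, and the helper name `speed_share` is hypothetical.

```python
import numpy as np

# Three laps from the displayed table: [secondsdifference, averagespeed]
laps = np.array([
    [205000.0, 49.584229],
    [263001.0, 54.412831],
    [259000.0, 55.419749],
])

def speed_share(a, b):
    """Fraction of the squared Euclidean distance between rows a and b
    that comes from the second feature (averagespeed)."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[1] / squared_diff.sum())

# Raw features: the distance is almost entirely the secondsdifference gap.
raw_distance = float(np.linalg.norm(laps[0] - laps[1]))
raw_share = speed_share(laps[0], laps[1])

# Z-score each column so both features have mean 0 and unit variance.
scaled = (laps - laps.mean(axis=0)) / laps.std(axis=0)
scaled_share = speed_share(scaled[0], scaled[1])

print(f"raw distance: {raw_distance:.1f}")
print(f"averagespeed share of squared distance, raw: {raw_share:.2e}")
print(f"averagespeed share of squared distance, scaled: {scaled_share:.2f}")
```

Whether scaling would actually improve the KNN scores on this data set is untested here.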


## KNN with Cross Validation

In [38]:
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
knn_model = KNeighborsRegressor(n_neighbors=5)

rmse_scores_knn = []
mae_scores_knn = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    knn_model.fit(X_train, y_train)
    y_pred_knn = knn_model.predict(X_val)
    
    rmse_scores_knn.append(np.sqrt(mean_squared_error(y_val, y_pred_knn)))
    mae_scores_knn.append(mean_absolute_error(y_val, y_pred_knn))

print(f"Average RMSE of KNN with Cross Validation is: {np.mean(rmse_scores_knn)}")
print(f"Average MAE of KNN with Cross Validation is: {np.mean(mae_scores_knn)}")

Average RMSE of KNN with Cross Validation is: 1.0697238402389337
Average MAE of KNN with Cross Validation is: 0.4079348846084489
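
The cross validation loop above relies on `KFold` to produce the train and validation indices. As a rough illustration of what a shuffled 5-fold split does, here is a NumPy-only sketch. The function `kfold_indices` is a hypothetical stand-in, its shuffling will not reproduce the exact folds that scikit-learn generates, and the sample count of 199 is taken from the comment on the `display(df)` call.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which every sample appears
    in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)  # fold sizes differ by at most 1
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != i]
        )
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=199, n_splits=5, seed=0))
fold_sizes = [len(val_idx) for _, val_idx in splits]

print("validation fold sizes:", fold_sizes)
```

Each model is refit on roughly 80% of the rows and scored on the remaining 20%, five times, and the reported value is the mean of those five scores.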


## Decision Trees

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

dt_model = DecisionTreeRegressor(random_state=69420)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)

rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
mae_dt = mean_absolute_error(y_test, y_pred_dt)

print(f"RMSE for Decision Trees is: {rmse_dt}")
print(f"MAE for Decision Trees is: {mae_dt}")

RMSE for Decision Trees is: 0.7029586285313585
MAE for Decision Trees is: 0.2519436813354493


## Decision Trees with Cross Validation

In [40]:
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
dt_model = DecisionTreeRegressor(random_state=69420)

rmse_scores_dt = []
mae_scores_dt = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    dt_model.fit(X_train, y_train)
    y_pred_dt = dt_model.predict(X_val)
    
    rmse_scores_dt.append(np.sqrt(mean_squared_error(y_val, y_pred_dt)))
    mae_scores_dt.append(mean_absolute_error(y_val, y_pred_dt))

print(f"Average RMSE of Decision Trees with Cross Validation is: {np.mean(rmse_scores_dt)}")
print(f"Average MAE of Decision Trees with Cross Validation is: {np.mean(mae_scores_dt)}")

Average RMSE of Decision Trees with Cross Validation is: 0.5440475844819632
Average MAE of Decision Trees with Cross Validation is: 0.16347548274385076


## Random Forests

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

rf_model = RandomForestRegressor(random_state=69420)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)

print(f"RMSE for Random Forests is: {rmse_rf}")
print(f"MAE for Random Forests is: {mae_rf}")

RMSE for Random Forests is: 0.9136314836987844
MAE for Random Forests is: 0.37793798720556715


## Random Forests with Cross Validation

In [42]:
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
rf_model = RandomForestRegressor(random_state=69420)

rmse_scores_rf = []
mae_scores_rf = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    rf_model.fit(X_train, y_train)
    y_pred_rf = rf_model.predict(X_val)
    
    rmse_scores_rf.append(np.sqrt(mean_squared_error(y_val, y_pred_rf)))
    mae_scores_rf.append(mean_absolute_error(y_val, y_pred_rf))

print(f"Average RMSE of Random Forests with Cross Validation is: {np.mean(rmse_scores_rf)}")
print(f"Average MAE of Random Forests with Cross Validation is: {np.mean(mae_scores_rf)}")

Average RMSE of Random Forests with Cross Validation is: 0.5354677709271433
Average MAE of Random Forests with Cross Validation is: 0.2040573819672439


## Gradient Boosting

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

gb_model = GradientBoostingRegressor(random_state=69420)
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)

rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
mae_gb = mean_absolute_error(y_test, y_pred_gb)

print(f"RMSE for Gradient Boosting is: {rmse_gb}")
print(f"MAE for Gradient Boosting is: {mae_gb}")

RMSE for Gradient Boosting is: 0.9272863715854762
MAE for Gradient Boosting is: 0.3428877079711604


## Gradient Boosting with Cross Validation

In [44]:
kf = KFold(n_splits=5, shuffle=True, random_state=69420)
gb_model = GradientBoostingRegressor(random_state=69420)

rmse_scores_gb = []
mae_scores_gb = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    gb_model.fit(X_train, y_train)
    y_pred_gb = gb_model.predict(X_val)
    
    rmse_scores_gb.append(np.sqrt(mean_squared_error(y_val, y_pred_gb)))
    mae_scores_gb.append(mean_absolute_error(y_val, y_pred_gb))

print(f"Average RMSE of Gradient Boosting with Cross Validation is: {np.mean(rmse_scores_gb)}")
print(f"Average MAE of Gradient Boosting with Cross Validation is: {np.mean(mae_scores_gb)}")

min_distance = df['distance'].min()
max_distance = df['distance'].max()
average_distance = df['distance'].mean()
percentile_25 = np.percentile(df['distance'], 25)
percentile_50 = np.percentile(df['distance'], 50)
percentile_75 = np.percentile(df['distance'], 75)

print("Minimum distance:", min_distance)
print("Maximum distance:", max_distance)
print("Average distance:", average_distance)
print(f"25th percentile of distance: {percentile_25}")
print(f"50th percentile (median) of distance: {percentile_50}")
print(f"75th percentile of distance: {percentile_75}")

Average RMSE of Gradient Boosting with Cross Validation is: 0.4070252257043531
Average MAE of Gradient Boosting with Cross Validation is: 0.14665131782949836
Minimum distance: 0.0162265625
Maximum distance: 8.745853515625
Average distance: 4.063722782813743
25th percentile of distance: 3.963570068359375
50th percentile (median) of distance: 3.9834296875
75th percentile of distance: 4.0039501953125


## Ranking and Evaluation on Performance

### Ranking Criteria
As mentioned in the conclusion of baselineModelTesting, we want low MAE and RMSE scores, with a priority on lower MAE over RMSE. 
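
To make the difference between the two metrics concrete, the following sketch computes both on two hypothetical sets of predictions with the same MAE. The values are invented for illustration; the point is only that RMSE grows when the error is concentrated in a few large misses, while MAE does not.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

y_true = np.array([4.0, 4.0, 4.0, 4.0])
pred_even = np.array([4.5, 3.5, 4.5, 3.5])   # every prediction is off by 0.5
pred_spiky = np.array([4.0, 4.0, 4.0, 6.0])  # one prediction is off by 2.0

mae_even, rmse_even = mae(y_true, pred_even), rmse(y_true, pred_even)
mae_spiky, rmse_spiky = mae(y_true, pred_spiky), rmse(y_true, pred_spiky)

print(f"even errors:  MAE={mae_even:.2f}, RMSE={rmse_even:.2f}")
print(f"spiky errors: MAE={mae_spiky:.2f}, RMSE={rmse_spiky:.2f}")
```

Both prediction sets have an MAE of 0.5, but the second has an RMSE of 1.0 compared with 0.5 for the first.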

### Model Performance
1. **Gradient Boosting with Cross Validation**
    - **MAE:** 0.1467
    - **RMSE:** 0.4070

2. **Decision Trees with Cross Validation**
    - **MAE:** 0.1635
    - **RMSE:** 0.5440

3. **Random Forests with Cross Validation**
    - **MAE:** 0.2041
    - **RMSE:** 0.5355

4. **Decision Trees**
    - **MAE:** 0.2519
    - **RMSE:** 0.7030

5. **Gradient Boosting**
    - **MAE:** 0.3429
    - **RMSE:** 0.9273

6. **Random Forests**
    - **MAE:** 0.3779
    - **RMSE:** 0.9136

7. **KNN with Cross Validation**
    - **MAE:** 0.4079
    - **RMSE:** 1.0697

8. **KNN**
    - **MAE:** 0.5086
    - **RMSE:** 0.9679

### Summary
Based on the MAE scores, the models are ranked as follows:
1. Gradient Boosting with Cross Validation
2. Decision Trees with Cross Validation
3. Random Forests with Cross Validation
4. Decision Trees
5. Gradient Boosting
6. Random Forests
7. KNN with Cross Validation
8. KNN
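
The ranking above can be reproduced from the reported scores with a short sort. This is a sketch that hard-codes the rounded values from the Model Performance list; the dictionary name `results` and the RMSE tie-breaker are assumptions, not part of the original analysis.

```python
# (MAE, RMSE) per model, copied from the Model Performance list
results = {
    "KNN": (0.5086, 0.9679),
    "KNN with Cross Validation": (0.4079, 1.0697),
    "Decision Trees": (0.2519, 0.7030),
    "Decision Trees with Cross Validation": (0.1635, 0.5440),
    "Random Forests": (0.3779, 0.9136),
    "Random Forests with Cross Validation": (0.2041, 0.5355),
    "Gradient Boosting": (0.3429, 0.9273),
    "Gradient Boosting with Cross Validation": (0.1467, 0.4070),
}

# Sort by MAE first; RMSE only breaks ties (an assumed tie-breaker).
ranked = sorted(results.items(), key=lambda item: (item[1][0], item[1][1]))

for position, (name, (mae_score, rmse_score)) in enumerate(ranked, start=1):
    print(f"{position}. {name}: MAE={mae_score:.4f}, RMSE={rmse_score:.4f}")
```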

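Gradient Boosting with Cross Validation performed best, with the lowest MAE and RMSE scores (0.1467 and 0.4070, respectively). Most of our distance values fall between about 3.9 and 4.0 (the 25th and 75th percentiles are 3.96 and 4.00), so an MAE of 0.1467 corresponds to a typical error of about 3.67% to 3.76% of the true distance. This is a significant improvement over the baseline model, whose predictions were off by about 20%. Note that the cross-validated scores are averages over five folds, while the other scores come from a single 80/20 split, so the two sets of numbers are not strictly comparable.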
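
The percentages quoted above come from dividing the MAE by a typical lap distance. The sketch below spells out that arithmetic; it assumes the baseline figure of 20% was derived the same way, from the baseline MAE of 0.8 and a typical distance of 4.0.

```python
mae_best = 0.1467      # Gradient Boosting with Cross Validation
mae_baseline = 0.8     # Linear Regression baseline
typical_low, typical_high = 3.9, 4.0  # range that holds most distance values

# Relative error is smallest against the larger distance and vice versa.
pct_low = mae_best / typical_high * 100
pct_high = mae_best / typical_low * 100

# Assumption: the baseline percentage uses the same typical distance of 4.0.
pct_baseline = mae_baseline / typical_high * 100

print(f"best model: {pct_low:.4f}% to {pct_high:.4f}% of the true distance")
print(f"baseline:   {pct_baseline:.1f}% of the true distance")
```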

In practice, this model predicts the distance of a lap with an average absolute error of about 0.1467 distance units, given features like `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. This prediction will be useful for data-driven decisions such as optimizing lap performance, analysing energy consumption, and assessing overall efficiency.