# Choice and Testing of a Baseline Machine Learning Model #
Using the lap data, we want to predict the distance given the other features. To do this, we require regression algorithms, which predict a continuous value, as opposed to classification models.

## Choice of a Baseline Model ##
For a baseline model, we want to use a popular yet simple machine learning algorithm. Candidates include Linear Regression, KNN, Decision Trees, Random Forests, and Gradient Boosting.

### **Linear Regression** ###
Linear Regression models the relationship between one or more input variables and a continuous output variable by fitting a linear equation to observed data.

Good: 
* Is great for seeing how changes to a single input variable affect one output variable. 
* It is commonly used for prediction (rather than categorization), which is exactly what we want.
* Works well for small datasets because relationships are easily interpretable.

Bad:
* If the relationships in the data are not linear, this algorithm performs poorly.
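
As a rough illustration of what "fitting a linear equation" means, the sketch below fits a single-feature least-squares line by hand. The `fit_line` helper and the numbers are made up for this example and are not taken from the lap data; the notebook itself relies on scikit-learn's `LinearRegression` with several features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for an unseen x = 5
print(slope, intercept, prediction)
```

Because the toy data is perfectly linear, the fitted line reproduces it exactly. On data with a curved relationship the same procedure would still return a line, which is the weakness listed above.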

### **KNN** ###
K-Nearest Neighbors (KNN) is an algorithm that predicts the outcome for a new data point based on the outcomes of the 'k' nearest points in the training data.

Good:
* Can be used for both prediction as well as categorization.

Bad:
* For large datasets with a high number of columns, this will be extremely computationally expensive, and may experience the "curse of dimensionality".
* Sensitive to noise and outliers. Performance varies based on the choice of 'k', which means we will have to tune.
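
The sensitivity to 'k' and to outliers can be seen in a small sketch of KNN used for regression. The `knn_predict` helper and the data points are hypothetical and only meant to show the mechanism; a real project would more likely use a library implementation.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict by averaging the targets of the k training points closest to `query`."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Made-up training set: three similar points and one far-away outlier
train_X = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (10.0, 10.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

query = (2.1, 2.1)
pred_k1 = knn_predict(train_X, train_y, query, k=1)  # uses only the closest point
pred_k4 = knn_predict(train_X, train_y, query, k=4)  # the outlier is now included
print(pred_k1, pred_k4)
```

With `k=1` the prediction follows the single closest point, while with `k=4` the outlier pulls the average upward, so the choice of 'k' changes the result noticeably.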

### **Decision Trees** ###
Decision Trees split data into branches based on feature values, ultimately assigning an outcome to each branch.

Good:
* Suitable for capturing non-linear relationships in small datasets, while maintaining interpretability.

Bad:
* Prone to overfitting. Performs much better with regularization or ensemble methods, so you may as well use a Random Forest. 

### **Random Forests** ###
Random Forests create an ensemble of decision trees, each trained on a random subset of the data to improve predictive accuracy.

Good:
* Reduces overfitting.
* Handles non-linear patterns.

Bad:
* With small datasets, overfitting may be even worse. 
* Much more computationally expensive than simple decision trees as it creates an ensemble. 

### **Gradient Boosting** ###
Gradient Boosting builds an ensemble of weak models (typically decision trees), where each subsequent model corrects errors of its predecessor.

Good:
* Capable of capturing complex relationships, even on small datasets.

Bad:
* Computationally expensive.
* Requires fine tuning, but with small datasets that is hard to do. 

## Conclusion ##
Our goal is to predict distance given lap data. Seeing as Elysia's lap data contains 208 rows and 9 columns (after having dropped useless columns), **Linear Regression** is a suitable choice for the base model. As we are just aiming to display a simple model on the website for now, we want something that is lightweight, simple, and easy to interpret, and Linear Regression matches all these criteria. 

We do not want to risk running into KNN's dimensionality issues. Decision Trees, Random Forests, and Gradient Boosting either require too much computation, or we may lack the data to fine-tune them.

### References ###
(1) Most Popular Machine Learning Algorithms: https://www.coursera.org/articles/machine-learning-algorithms \
(2) KNN and the Curse of Dimensionality: https://www.geeksforgeeks.org/k-nearest-neighbors-and-curse-of-dimensionality/



## Import Modules and Check the Data Size ##
Should be 208x9

In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold
import numpy as np

packetTrainingDataPath="../training_data/Elysia.Laps.feather"
df = pd.read_feather(packetTrainingDataPath)
df = df.drop(
        columns=[
            "msgType",
            "_id.$oid",
            "averagepackCurrent.$numberDouble",
            "timestamp.$numberLong",
        ]
    )

display(df)

| | secondsdifference | totalpowerin | totalpowerout | netpowerout | distance | amphours | batterysecondsremaining | averagespeed | averagepackCurrent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3264851 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 69.800003 | -1 | 0.000000 | |
| 1 | 88892470 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 126.800003 | -1 | 0.000000 | |
| 2 | 95990970 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 98.699997 | -1 | 0.000000 | |
| 3 | 205000 | 793.532455 | 1803.385827 | 1009.853372 | 2.814824 | 97.800003 | 22425 | 49.584229 | 15.70 |
| 4 | 3517768 | 798.704841 | 685.739611 | -112.965230 | 4.118535 | 95.300003 | 56521 | 24.939703 | 6.07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 203 | 325001 | 475.938651 | 1666.858482 | 1190.919832 | 3.963359 | 2.100000 | 406 | 43.894226 | 18.62 |
| 204 | 2494999 | 700.176665 | 566.324558 | -133.852107 | 0.181906 | 4.700000 | 100215 | 1.106129 | -5.78 |
| 205 | 343500 | 615.194700 | 1680.524756 | 1065.330056 | 3.963687 | 3.000000 | 611 | 41.645220 | 17.67 |
| 206 | 330500 | 624.008188 | 1543.656620 | 919.648432 | 3.966359 | 1.400000 | 300 | 43.276732 | 16.79 |


# Building the Model #

In [41]:
#averagepackCurrent must be numeric rather than {"$numberDouble": "NaN"}; errors='coerce' converts unparseable values to a numeric NaN
df['averagepackCurrent'] = pd.to_numeric(df['averagepackCurrent'], errors='coerce')

#drop the 4 rows with null values
df = df.dropna(subset=['distance', 'averagepackCurrent', 'averagespeed'])

#separate distance from the other features
X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', 
        'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]
y = df['distance']

#split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

#train the baseline model with linear regression
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, y_train)

#get predictions on test data
y_pred = linear_regression_model.predict(X_test)

#evaluate using RMSE and MAE
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
MAE = mean_absolute_error(y_test, y_pred)

print(f"RMSE (root mean squared error) is: {RMSE}")
print(f"MAE (mean absolute error) is: {MAE}")

def print_distance_statistics():
    #summarise the distance column of the current (global) df
    min_distance = df['distance'].min()
    max_distance = df['distance'].max()
    average_distance = df['distance'].mean()
    percentile_25 = np.percentile(df['distance'], 25)
    percentile_50 = np.percentile(df['distance'], 50)
    percentile_75 = np.percentile(df['distance'], 75)

    print("Minimum distance:", min_distance)
    print("Maximum distance:", max_distance)
    print("Average distance:", average_distance)
    print(f"25th percentile of distance: {percentile_25}")
    print(f"50th percentile (median) of distance: {percentile_50}")
    print(f"75th percentile of distance: {percentile_75}")

print_distance_statistics()

RMSE (root mean squared error) is: 36.22178511618355
MAE (mean absolute error) is: 6.9643536149352085
Minimum distance: -227.01091494750978
Maximum distance: 77.62573291015624
Average distance: 3.219106722576028
25th percentile of distance: 3.962990234375
50th percentile (median) of distance: 3.983119140625
75th percentile of distance: 4.00380712890625


# Interpretation of Performance #
The IQR (the middle 50% of distance values) is very narrow, roughly 3.96 to 4.00, yet RMSE is 36.22 and MAE is 6.96, so this baseline model is clearly very inaccurate. Very few actual distances differ much from 4, so errors this large suggest that there are large outliers in the data. Let's try this again after removing outliers and negative distance values (it does not make sense for distance to be negative).

In [42]:
#remove outliers from the data
#first define a threshold of 3 standard deviations (for normally distributed data, about 99.7% of values lie within 3 standard deviations of the mean)
threshold = 3 * np.std(df['distance'])

#keep only rows whose distance lies within the threshold of zero (note: this band is centred on 0, not on the mean)
df = df[(df['distance'] >= -threshold) & (df['distance'] <= threshold)]
#remove negative distance values
df = df[df['distance'] >= 0]
#separate distance from the other features
X = df[['secondsdifference', 'totalpowerin', 'totalpowerout', 'netpowerout', 'amphours', 
    'averagepackCurrent', 'batterysecondsremaining', 'averagespeed']]
y = df['distance']

#split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69420)

#retrain with linear regression
linear_regression_model.fit(X_train, y_train)

#get predictions on test data
y_pred = linear_regression_model.predict(X_test)

#re-evaluate RMSE and MAE
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
MAE = mean_absolute_error(y_test, y_pred)

print(f"RMSE (root mean squared error) after removing outliers is: {RMSE}")
print(f"MAE (mean absolute error) after removing outliers is: {MAE}")

print_distance_statistics()

RMSE (root mean squared error) after removing outliers is: 1.711367174002224
MAE (mean absolute error) after removing outliers is: 0.9110000515477485
Minimum distance: 0.0162265625
Maximum distance: 8.745853515625
Average distance: 4.063722782813743
25th percentile of distance: 3.963570068359375
50th percentile (median) of distance: 3.9834296875
75th percentile of distance: 4.0039501953125
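
The filter in the cell above keeps distances within 3 standard deviations of zero, whereas the usual form of the rule measures the deviation from the mean. The sketch below compares the two on made-up numbers, assuming a hypothetical helper `within_three_sigma`. It uses the population standard deviation, which matches the default of `np.std`.

```python
import statistics


def within_three_sigma(values, center):
    """Keep values no more than 3 population standard deviations from `center`."""
    threshold = 3 * statistics.pstdev(values)
    return [v for v in values if abs(v - center) <= threshold]


# Made-up laps: mostly about 4 units, plus one large negative outlier
laps = [4.0] * 20 + [-227.0]
zero_centered = within_three_sigma(laps, 0.0)
mean_centered = within_three_sigma(laps, statistics.fmean(laps))

# Made-up values far from zero with a small spread
shifted = [100.0, 101.0, 99.0, 100.0, 102.0, 98.0]
shifted_zero = within_three_sigma(shifted, 0.0)
shifted_mean = within_three_sigma(shifted, statistics.fmean(shifted))

print(len(zero_centered), len(mean_centered))
print(len(shifted_zero), len(shifted_mean))
```

On the first list both variants drop only the outlier, because the values sit close to zero relative to the spread. On the second list the zero-centred band rejects every value, so the two approaches are only interchangeable when the mean is small compared with the standard deviation.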


# Re-evaluation of Performance #
Accuracy has improved significantly: RMSE has gone from 36.22 to 1.71, and MAE has gone from 6.96 to 0.91. This means the predicted distances are much closer to the actual values. However, there are other simple techniques we can try, such as cross-validation.

In [43]:
#initialize the KFold cross validator
kf = KFold(n_splits=5, shuffle=True, random_state=69420)

#lists to store RMSE and MAE for each fold (each subset of data)
rmse_list = []
mae_list = []

#initialize a linear regression model
linear_regression_model = LinearRegression()

#cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    #train model
    linear_regression_model.fit(X_train, y_train)
    
    #make predictions
    y_pred = linear_regression_model.predict(X_test)
    
    #calculate RMSE and MAE for the fold
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    rmse_list.append(rmse)
    mae_list.append(mae)

#get the average RMSE and MAE across all folds
average_rmse = np.mean(rmse_list)
average_mae = np.mean(mae_list)

print(f"Cross-validated RMSE: {average_rmse}")
print(f"Cross-validated MAE: {average_mae}")

print_distance_statistics()

Cross-validated RMSE: 1.8882794269274583
Cross-validated MAE: 0.8029010278510377
Minimum distance: 0.0162265625
Maximum distance: 8.745853515625
Average distance: 4.063722782813743
25th percentile of distance: 3.963570068359375
50th percentile (median) of distance: 3.9834296875
75th percentile of distance: 4.0039501953125
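
To make concrete what each fold contains, here is a library-free sketch of how k-fold splitting partitions row indices. The `kfold_indices` helper is hypothetical and, unlike the `KFold(shuffle=True)` call above, it does not shuffle; it only shows that every row is used for testing exactly once.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(kfold_indices(n_samples=11, n_splits=5))
test_sizes = [len(test) for _, test in folds]
all_test_indices = sorted(i for _, test in folds for i in test)
print(test_sizes)
```

Each fold trains on the remaining rows and is scored on its own held-out rows, and the per-fold RMSE and MAE are then averaged, as in the loop above.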


# Final Evaluation of Performance #
With 5-fold cross-validation, RMSE rises from 1.71 to 1.89, while MAE falls from 0.91 to 0.80. Cross-validation does not change the model itself; it averages the error over five different train/test splits, so it gives a more reliable estimate of performance than a single split. RMSE squares each error before averaging, so it penalises large errors heavily, whereas MAE weights all errors equally and better reflects the typical error on the bulk of the data. Seeing as we will clean and remove the outliers anyway, we want to focus on predicting the majority of the input data, so we prioritise a low MAE. On that measure, the cross-validated result is the best so far.
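
The difference between the two metrics can be checked on a toy example. The error lists below are invented for illustration: both have the same MAE, but one concentrates all of its error in a single prediction.

```python
import math


def rmse(errors):
    """Root mean squared error: squares each error before averaging."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(e) for e in errors) / len(errors)


even_errors = [1.0, -1.0, 1.0, -1.0]   # every prediction is off by 1
one_big_error = [0.0, 0.0, 0.0, 4.0]   # three perfect predictions, one off by 4

print(mae(even_errors), rmse(even_errors))
print(mae(one_big_error), rmse(one_big_error))
```

Both lists have an MAE of 1, but the second has twice the RMSE, which is why a gap between RMSE and MAE hints at a few large errors rather than uniformly poor predictions.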

Since the majority of our data lies between 3.9 and 4.0 and MAE is 0.8, predictions are off by about 20% of the real distance value on average. In the real world, this model can be used to predict the distance of a lap given input features such as power, `amphours`, `averagepackCurrent`, `batterysecondsremaining`, and `averagespeed`. Such predictions can be useful for making data-driven decisions when optimizing lap performance, planning energy consumption, and improving overall efficiency. By reducing the MAE, we improve the model's predictions for the majority of the input data, which has many applications in optimizing different metrics of the car during a race.
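
The 20% figure is the MAE expressed relative to a typical lap distance. A minimal check, using the cross-validated MAE and the median distance printed earlier in the notebook (the variable names are just for this example):

```python
mae = 0.8029010278510377         # cross-validated MAE from the output above
median_distance = 3.9834296875   # median distance after cleaning, from the output above

# average absolute error as a fraction of a typical lap distance
relative_error = mae / median_distance
print(f"MAE relative to median distance: {relative_error:.1%}")
```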