# Project Deliverable 1 - Group 33
#### Group Members: Bethany Findlay, Charlotte Albert, Kaykay Akpama, Kosi Udechukwu

In [31]:
#Import necessary libraries 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_validate, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [32]:
#Read cleaned data from project_deliverable_1
df=pd.read_csv("project_deliverable_1_cleaned.csv")

## 1. Baseline Model 

Linear regression was chosen as the baseline model because it is one of the simplest and most interpretable machine learning algorithms for regression tasks. It provides a clear mathematical relationship between the input features and the target variable, making it easy to see how each factor influences trip duration. It also has a low computational cost and helps identify whether the relationships in the data are linear or will require non-linear modeling approaches later on.

In [33]:
day_mapping = {
    'Monday': 0, 'Tuesday': 1, 'Wednesday': 2,'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
df['pickup_day_num'] = df['pickup_day'].map(day_mapping)

In [34]:
#Build a simple, interpretable baseline model in scikit-learn
features=['trip_distance_km','pickup_hour','pickup_day_num']
target='trip_duration'

X=df[features]
y=df[target]

In [35]:
#Split the data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [36]:
#Train the model
linreg=LinearRegression()
linreg.fit(X_train,y_train)

In [37]:
#Print learned parameters
print("Intercept(b0)",linreg.intercept_)
print("Coefficient for trip_distance_km(b1)",linreg.coef_[0])
print("Coefficient for pickup_hour(b2)",linreg.coef_[1])
print("Coefficient for pickup_day_num(b3)",linreg.coef_[2])

Intercept(b0) 372.314192928335
Coefficient for trip_distance_km(b1) 131.6400959178373
Coefficient for pickup_hour(b2) 4.053814545218675
Coefficient for pickup_day_num(b3) -12.601269709007402


Training the linear regression model produced an intercept of approximately 372.31 and three coefficients, one per independent variable: 131.64 for trip_distance_km, 4.05 for pickup_hour, and -12.60 for pickup_day_num.

The intercept is the predicted trip duration, in seconds, when all features are zero (a zero-distance trip at hour 0 on a Monday).

The trip_distance_km coefficient means that each additional kilometre traveled adds about 131.6 seconds (≈2.2 minutes) to the predicted duration, holding the other features fixed. This aligns with expectations, since longer distances take more time.

The pickup_hour coefficient means that each later hour of the day adds about 4.1 seconds to the predicted duration, so trips later in the day tend to last slightly longer, possibly due to traffic or rush-hour timing.

The pickup_day_num coefficient implies that the predicted duration decreases by about 12.6 seconds for each day further into the week (Monday = 0 through Sunday = 6). This may be due to lighter traffic as the weekend approaches.
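As a worked example of how the fitted parameters combine, the sketch below hard-codes the values printed above and predicts the duration of one trip. The helper name `predict_duration` and the example trip (5 km, picked up at 17:00 on a Friday) are illustrative assumptions, not part of the original analysis.

```python
# Fitted parameters copied from the printed output above
INTERCEPT = 372.314192928335
COEF_DISTANCE_KM = 131.6400959178373
COEF_PICKUP_HOUR = 4.053814545218675
COEF_PICKUP_DAY = -12.601269709007402


def predict_duration(trip_distance_km, pickup_hour, pickup_day_num):
    """Hypothetical helper: predicted trip duration in seconds."""
    return (
        INTERCEPT
        + COEF_DISTANCE_KM * trip_distance_km
        + COEF_PICKUP_HOUR * pickup_hour
        + COEF_PICKUP_DAY * pickup_day_num
    )


# Assumed example trip: 5 km, 17:00, Friday (pickup_day_num = 4)
example = predict_duration(trip_distance_km=5, pickup_hour=17, pickup_day_num=4)
print(f"{example:.0f} s (about {example / 60:.1f} min)")
```

For this assumed trip the prediction comes to roughly 1,049 seconds, or about 17.5 minutes. Most of that comes from the distance term, which is why trip_distance_km dominates the model.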

In [38]:
#Predict on the full dataset (training and test rows combined)
y_pred=linreg.predict(X)

#Calculate regression metrics
mae=metrics.mean_absolute_error(y,y_pred)
mse=metrics.mean_squared_error(y,y_pred)
rmse=metrics.root_mean_squared_error(y,y_pred)
r2=metrics.r2_score(y,y_pred)

print("Mean absolute error (MAE):",mae)
print("Mean squared error (MSE):",mse)
print("Root mean squared error (RMSE):",rmse)
print("R-squared (R^2)",r2)


Mean absolute error (MAE): 283.4947152256648
Mean squared error (MSE): 172079.75096985762
Root mean squared error (RMSE): 414.82496425583844
R-squared (R^2) 0.592035246372939


The model was evaluated with MAE, MSE, RMSE, and R^2, all computed on predictions for the full dataset. Because MSE squares each error, it heavily penalizes large mistakes and is sensitive to outliers (in this case, unusually short or long trip durations). The MSE observed here (≈172,080) may look high at first, but this is expected because it is measured in seconds^2.

To make the result easier to interpret, we take the square root to obtain the RMSE, which is back in seconds. Here the RMSE is approximately 415 seconds (≈6.9 minutes), meaning that a typical prediction error is about seven minutes, with large errors weighted more heavily.

The MAE of about 283 seconds (≈4.7 minutes) is a more direct measure of the average absolute prediction error, since it does not square the errors. The R^2 value of 0.59 indicates that around 59% of the variability in trip durations is explained by our chosen features.
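To make the relationship between these metrics concrete, the sketch below computes them by hand with NumPy. The four durations are made-up toy values chosen so that one prediction misses badly; they are not drawn from the dataset.

```python
import numpy as np

# Toy trip durations in seconds (illustrative values, not from the dataset)
y_true = np.array([600.0, 900.0, 1200.0, 1500.0])
y_hat = np.array([650.0, 850.0, 1250.0, 1000.0])  # last prediction is a large miss

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error, in seconds
mse = np.mean(errors ** 2)      # average squared error, in seconds^2
rmse = np.sqrt(mse)             # back in seconds, but large errors weigh more
# R^2 compares the model's squared error to that of always predicting the mean
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:.1f} s")
print(f"MSE:  {mse:.1f} s^2")
print(f"RMSE: {rmse:.1f} s")
print(f"R^2:  {r2:.3f}")
```

Because of the single 500-second miss, the RMSE in this toy example is noticeably larger than the MAE. The same effect explains why the baseline's RMSE (≈415 s) exceeds its MAE (≈283 s).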

In [39]:
#Report performance via train–validation splits
train_rmse = np.sqrt(metrics.mean_squared_error(y_train, linreg.predict(X_train)))
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, linreg.predict(X_test)))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

Train RMSE: 415.3955489395688
Test RMSE: 412.53474169344736


The linear regression model appears to generalize well based on the evaluation metrics. The training RMSE (415.40 seconds) and testing RMSE (412.53 seconds) are very close, which suggests that the model performs consistently on both seen and unseen data and is not overfitting.

From a bias-variance trade-off perspective, the linear regression model has low variance, since its performance remains stable across datasets, and moderate bias, because it reduces trip duration prediction to a linear relationship. The model does not overfit, since it does not perform worse on new data, and it captures the main trend between trip duration, distance, and the time features. The moderate bias means that some non-linear structure in the data may remain unmodelled.
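One simple way to quantify "very close" is the relative gap between the two RMSE values. The sketch below uses the figures printed above. The 1% threshold mentioned afterwards is only an illustrative rule of thumb, not a standard cutoff.

```python
# RMSE values copied from the train/test output above (seconds)
train_rmse = 415.3955489395688
test_rmse = 412.53474169344736

# Absolute and relative difference between training and test error
gap = train_rmse - test_rmse
relative_gap = abs(gap) / train_rmse

print(f"Gap: {gap:.2f} s ({relative_gap:.2%} of training RMSE)")
```

The gap is under 1% of the training RMSE, and the test error is slightly lower than the training error, which is consistent with a low-variance model.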

## 2. Cross Validation 

We use standard K-fold cross-validation because this is a regression task. Stratification does not apply here: the target values are continuous, so there are no distinct classes whose proportions need to be preserved across folds. Standard KFold is appropriate.
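The sketch below shows what this splitting does on a small toy array, using the same `KFold` settings as the cells that follow. The arrays `X_toy` and `y_toy` are made-up placeholders rather than the project data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with a continuous target (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.linspace(300.0, 1200.0, 10)

cv_toy = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample lands in a validation fold
val_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(cv_toy.split(X_toy), start=1):
    val_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")

print("Validation appearances per sample:", val_counts.tolist())
```

Each sample is used for validation exactly once and for training in the other four folds. The split depends only on row positions, so the continuous target never needs to be binned into classes.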

In [None]:
#Define 5-fold cross validation splits
cv = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
#Building pipelines for each regression model
linreg_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linreg', LinearRegression())
])

knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=5))
])

In [None]:
#Evaluate linear regression with cross validation
linreg_results = cross_validate(
    linreg_pipe,
    X,
    y,
    cv=cv,
    scoring=['r2', 'neg_mean_absolute_error', 'neg_root_mean_squared_error'],
    return_train_score=True,
    n_jobs=-1
)

#Evaluate KNN with cross validation
knn_results = cross_validate(
    knn_pipe,
    X,
    y,
    cv=cv,
    scoring=['r2', 'neg_mean_absolute_error', 'neg_root_mean_squared_error'],
    return_train_score=True,
    n_jobs=-1
)

In [None]:
#Extract metrics for Linear Regression
r2_linreg = linreg_results['test_r2']
mae_linreg = -linreg_results['test_neg_mean_absolute_error']
rmse_linreg = -linreg_results['test_neg_root_mean_squared_error']

#Extract metrics for KNN Regression
r2_knn = knn_results['test_r2']
mae_knn = -knn_results['test_neg_mean_absolute_error']
rmse_knn = -knn_results['test_neg_root_mean_squared_error']


In [48]:
#Create DataFrame for Linear Regression results
cv_table_linreg = pd.DataFrame({
    'Fold': [f'Fold {i+1}' for i in range(len(linreg_results['test_r2']))],
    'R² Score': linreg_results['test_r2'],
    'MAE (s)': -linreg_results['test_neg_mean_absolute_error'],
    'RMSE (s)': -linreg_results['test_neg_root_mean_squared_error']
})

print("Linear Regression Cross Validation Results by Fold:")
display(cv_table_linreg)

#Create DataFrame for KNN Regression results
cv_table_knn = pd.DataFrame({
    'Fold': [f'Fold {i+1}' for i in range(len(knn_results['test_r2']))],
    'R² Score': knn_results['test_r2'],
    'MAE (s)': -knn_results['test_neg_mean_absolute_error'],
    'RMSE (s)': -knn_results['test_neg_root_mean_squared_error']
})

print("\n K-Nearest Neighbours Regression Cross Validation Results by Fold:")
display(cv_table_knn)

Linear Regression Cross Validation Results by Fold:


| | Fold | R² Score | MAE (s) | RMSE (s) |
|---|---|---|---|---|
| 0 | Fold 1 | 0.593599 | 282.4958 | 412.534742 |
| 1 | Fold 2 | 0.592821 | 283.277138 | 412.309412 |
| 2 | Fold 3 | 0.592045 | 284.3089 | 417.80228 |
| 3 | Fold 4 | 0.589063 | 283.68131 | 416.560437 |
| 4 | Fold 5 | 0.59264 | 283.729715 | 414.899942 |



 K-Nearest Neighbours Regression Cross Validation Results by Fold:


| | Fold | R² Score | MAE (s) | RMSE (s) |
|---|---|---|---|---|
| 0 | Fold 1 | 0.656521 | 252.010641 | 379.256354 |
| 1 | Fold 2 | 0.657596 | 252.403089 | 378.094034 |
| 2 | Fold 3 | 0.657662 | 252.610475 | 382.729688 |
| 3 | Fold 4 | 0.653005 | 253.072912 | 382.782328 |
| 4 | Fold 5 | 0.658261 | 252.781127 | 380.015255 |


In [None]:
#Printing Mean & STD for Linear Regression
print("Linear Regression Cross-Validation Results:")
print(f"R2:   Mean = {r2_linreg.mean():f},  STD = {r2_linreg.std():f}")
print(f"MAE:  Mean = {mae_linreg.mean():f} s,  STD = {mae_linreg.std():f} s")
print(f"RMSE: Mean = {rmse_linreg.mean():f} s,  STD = {rmse_linreg.std():f} s")

#Printing Mean & STD for KNN
print("\n K-Nearest Neighbours Regression Cross Validation Results:")
print(f"R2:   Mean = {r2_knn.mean():f},  STD = {r2_knn.std():f}")
print(f"MAE:  Mean = {mae_knn.mean():f} s,  STD = {mae_knn.std():f} s")
print(f"RMSE: Mean = {rmse_knn.mean():f} s,  STD = {rmse_knn.std():f} s")

Linear Regression Cross-Validation Results:
R2:   Mean = 0.592033,  STD = 0.001566
MAE:  Mean = 283.498572 s,  STD = 0.599673 s
RMSE: Mean = 414.821363 s,  STD = 2.165869 s

 K-Nearest Neighbours Regression Cross Validation Results:
R2:   Mean = 0.656609,  STD = 0.001887
MAE:  Mean = 252.575649 s,  STD = 0.357691 s
RMSE: Mean = 380.575532 s,  STD = 1.882674 s
