In [1]:
#Import necessary libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

ModuleNotFoundError: No module named 'sklearn'

In [None]:
#Read cleaned data from project_deliverable_1
df=pd.read_csv("project_deliverable_1.csv")
print(df.columns.tolist())

## 1. Baseline Model 

Linear regression was chosen as the baseline model because it is one of the simplest and most interpretable machine learning algorithms for regression tasks. It provides a clear mathematical relationship between the input features and the target variable, making it easy to understand how each factor influences trip duration. It also has a low computational cost and helps identify whether the relationships in the data are linear or require non-linear modeling approaches later on.

In [None]:
day_mapping = {
    'Monday': 0, 'Tuesday': 1, 'Wednesday': 2,'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6
}
df['pickup_day_num'] = df['pickup_day'].map(day_mapping)

In [None]:
#Build a simple, interpretable baseline model in scikit-learn
features=['trip_distance_km','pickup_hour','pickup_day_num']
target='trip_duration'

X=df[features]
y=df[target]

In [None]:
#Split the data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
#Train the model
linreg=LinearRegression()
linreg.fit(X_train,y_train)

In [None]:
#print learned parameters
print("Intercept(b0)",linreg.intercept_)
print("Coefficient for trip_distance_km(b1)",linreg.coef_[0])
print("Coefficient for pickup_hour(b2)",linreg.coef_[1])
print("Coefficient for pickup_day_num(b3)",linreg.coef_[2])

Training the linear regression model produced the following parameters: an intercept of approximately 372.31 and three coefficients for our independent variables: 131.64 for trip_distance_km, 4.05 for pickup_hour, and -12.6 for pickup_day_num.

The intercept represents the baseline predicted trip duration in seconds when all features are zero. 

The trip_distance_km coefficient means that, with the other features held constant, every additional km traveled increases the predicted trip duration by about 131.6 seconds (≈2.2 minutes), which aligns with expectations since longer distances take more time.

The pickup_hour coefficient means that each hour later in the day adds about 4.05 seconds to the predicted duration, so trips later in the day tend to last slightly longer, possibly due to traffic or rush hour timings.

The pickup_day_num coefficient implies that as the week progresses, the predicted trip duration decreases by about 12.6 seconds per day. This may be due to lighter traffic as the weekend approaches.
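To make the interpretation concrete, the fitted equation can be evaluated by hand. The sketch below plugs the rounded parameters reported above into the linear formula, taking the distance coefficient as positive, as the interpretation states. The function name `predict_duration` and the example trip are hypothetical, chosen only for illustration.

```python
# Parameters as reported in the text above (rounded); all values in seconds.
INTERCEPT = 372.31
COEF_DISTANCE_KM = 131.64   # seconds per additional kilometre
COEF_PICKUP_HOUR = 4.05     # seconds per hour later in the day
COEF_PICKUP_DAY = -12.6     # seconds per day later in the week (Monday = 0)


def predict_duration(distance_km, pickup_hour, pickup_day_num):
    """Evaluate the fitted linear equation for one trip; returns seconds."""
    return (
        INTERCEPT
        + COEF_DISTANCE_KM * distance_km
        + COEF_PICKUP_HOUR * pickup_hour
        + COEF_PICKUP_DAY * pickup_day_num
    )


# Hypothetical trip: 5 km, picked up at 17:00 on a Friday (day 4).
example = predict_duration(distance_km=5, pickup_hour=17, pickup_day_num=4)
print(f"Predicted duration: {example:.1f} s ({example / 60:.1f} min)")
```

Because the parameters are rounded, the result will differ slightly from what `linreg.predict` returns for the same inputs.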

In [None]:
#Predict on the full dataset (training and test rows combined)
y_pred=linreg.predict(X)

#Calculate regression metrics
mae=metrics.mean_absolute_error(y,y_pred)
mse=metrics.mean_squared_error(y,y_pred)
rmse=metrics.root_mean_squared_error(y,y_pred)
r2=metrics.r2_score(y,y_pred)

print("Mean absolute error (MAE):",mae)
print("Mean squared error (MSE):",mse)
print("Root mean squared error (RMSE):",rmse)
print("R-squared (R^2):",r2)




Four metrics were used to evaluate the model: MAE, MSE, RMSE, and R^2. Because the errors in the mean squared error are squared, this metric heavily penalizes large mistakes and is sensitive to outliers (in this case, unusually short or long trip durations). The MSE observed here (≈172,080) might initially appear high, but this is expected since it is measured in seconds^2.

To make the results easier to interpret, the square root of the MSE (the RMSE) was taken to convert it back to seconds. In this case, the RMSE is approximately 415 seconds (≈6.9 minutes), meaning that a typical prediction is off by roughly seven minutes, with large errors weighted more heavily than small ones.

Meanwhile, the MAE of about 283 seconds (≈4.7 minutes) provides a more direct measure of the typical prediction error without squaring, and the R^2 value of 0.59 indicates that around 59% of the variability in trip durations can be explained by our chosen features.
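The difference between MAE and RMSE is easiest to see on a small example. The sketch below computes the four metrics from their definitions in plain Python. The durations are made up for illustration and are not taken from the dataset, and `regression_metrics` is a hypothetical helper, not part of scikit-learn.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up durations in seconds, for illustration only.
actual = [600, 900, 1200, 1500, 3000]
close = [650, 850, 1250, 1450, 2950]      # every prediction is off by 50 s
one_miss = [600, 900, 1200, 1500, 2750]   # a single prediction is off by 250 s

uniform = regression_metrics(actual, close)
outlier = regression_metrics(actual, one_miss)

print("Uniform errors:", uniform)
print("One large miss:", outlier)
```

Both sets of predictions have the same MAE, but the set with one large miss has a much higher RMSE, which is the outlier sensitivity described above.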

In [None]:
#Report performance via train–validation splits
train_rmse = np.sqrt(metrics.mean_squared_error(y_train, linreg.predict(X_train)))
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, linreg.predict(X_test)))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

The linear regression model appears to generalize well based on the evaluation metrics. The training RMSE (415.4 seconds) and testing RMSE (412.53 seconds) are very close, which suggests that the model performs consistently on both seen and unseen data. A gap this small points to low variance, although it does not by itself say how much bias the model has.
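One way to quantify how close the two scores are is to express the gap as a fraction of the training RMSE. The sketch below does this with the values reported above. `relative_gap` is a hypothetical helper, and no particular threshold for an acceptable gap is implied.

```python
def relative_gap(train_rmse, test_rmse):
    """Test-minus-train RMSE as a fraction of the training RMSE."""
    return (test_rmse - train_rmse) / train_rmse


# RMSE values as reported in the text (seconds).
gap = relative_gap(train_rmse=415.4, test_rmse=412.53)
print(f"Relative train/test gap: {gap:.2%}")
```

A negative value means the test error is slightly lower than the training error, which can happen by chance with a single random split.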

From a bias-variance trade-off perspective, the linear regression model has low variance, since its performance remains fairly stable across datasets, and moderate bias, because it simplifies trip duration prediction into a linear relationship. The model shows no sign of overfitting, because it does not perform worse on new data. It captures the main trend between trip duration, distance, and time features, although an R^2 of 0.59 leaves a substantial share of the variability unexplained, so some underfitting cannot be ruled out.