#Environment Setup and Project Initialization

This section prepares the Google Colab environment for the project by connecting it to Google Drive and navigating to the project directory where all code and data files are stored. Mounting Google Drive ensures that files persist across Colab sessions, while changing the working directory allows the notebook to access local project files such as scripts and datasets. The directory check at the end serves as a simple sanity check to confirm that the correct project folder and expected files are available before proceeding.

In [1]:
from google.colab import drive
drive.mount("/content/drive")
PROJECT_PATH = "/content/drive/MyDrive/comp551-Ass1"
%cd $PROJECT_PATH

import os
print("Current directory:", os.getcwd())
print("Files here:", os.listdir())

Mounted at /content/drive
/content/drive/MyDrive/comp551-Ass1
Current directory: /content/drive/MyDrive/comp551-Ass1
Files here: ['.git', 'data_parser.py', '__pycache__', 'data', 'linear_regression.py', 'COMP551-1.ipynb']
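
The directory check can also be made explicit rather than read by eye. The sketch below uses a hypothetical helper, `missing_files`, which is not part of the project; it reports which expected names are absent from a folder. The file names in the commented usage are taken from the listing above.

```python
import os

def missing_files(directory, expected):
    """Return the expected names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]

# Hypothetical usage inside the notebook, after the %cd step:
# missing = missing_files(".", ["data_parser.py", "data", "linear_regression.py"])
# assert not missing, f"Missing project files: {missing}"
```

Failing early with a clear message is usually easier to debug than an ImportError several cells later.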


Importing the Data Processing Utilities

In this step, we import the preprocessing function responsible for loading and preparing the Bike Sharing dataset, as well as NumPy for numerical operations. The process_csv function handles data cleaning, encoding of categorical variables, feature scaling, and the train–test split, allowing the rest of the notebook to work directly with clean, machine-learning-ready NumPy arrays.

In [2]:
from data_parser import process_csv
import numpy as np

Loading and Preprocessing the Dataset

Here, we load the Bike Sharing dataset and apply all preprocessing steps in a single call. process_csv reads the raw CSV file, validates the data, encodes categorical variables, standardizes continuous features, and performs a time-aware train–test split. It returns four NumPy arrays (training inputs, testing inputs, training targets, and testing targets) that can be used directly for model training and evaluation.

In [3]:
X_train, X_test, y_train, y_test = process_csv("data/day.csv")
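
The internals of process_csv are not shown in this notebook, so the following is only a sketch of what a time-aware split with train-only standardization might look like. The names `chronological_split` and `standardize_with_train_stats` are hypothetical, and the 80% training fraction is an assumption.

```python
import numpy as np

def chronological_split(X, y, train_frac=0.8):
    """Split rows in their existing (time) order, with no shuffling."""
    n_train = int(len(X) * train_frac)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

def standardize_with_train_stats(X_train, X_test, cols):
    """Scale the given columns using mean/std computed on training rows only."""
    X_train, X_test = X_train.copy(), X_test.copy()
    mean = X_train[:, cols].mean(axis=0)
    std = X_train[:, cols].std(axis=0)
    X_train[:, cols] = (X_train[:, cols] - mean) / std
    X_test[:, cols] = (X_test[:, cols] - mean) / std
    return X_train, X_test

# Toy data: 10 "days" in order, two continuous columns.
X = np.arange(20, dtype=np.float64).reshape(10, 2)
y = np.arange(10, dtype=np.float64)
X_tr, X_te, y_tr, y_te = chronological_split(X, y)
X_tr, X_te = standardize_with_train_stats(X_tr, X_te, cols=[0, 1])
print(X_tr.shape, X_te.shape)
```

Computing the mean and standard deviation on the training rows only keeps information from the later (test) period out of the features. With 731 rows and a 0.8 fraction this sketch yields 584 and 147 rows, which matches the shapes printed later in the notebook, although that alone does not confirm how process_csv is implemented.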

Ensuring Numeric Data Types

Here, we convert all input and target arrays to float64 to guarantee compatibility with NumPy’s linear algebra operations and avoid issues caused by mixed data types during model training.

In [4]:
X_train = X_train.astype(np.float64)
X_test = X_test.astype(np.float64)
y_train = y_train.astype(np.float64)
y_test = y_test.astype(np.float64)

In [5]:
print("X_train dtype:", X_train.dtype)
print("y_train dtype:", y_train.dtype)

X_train dtype: float64
y_train dtype: float64
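
The four conversions above can be folded into one helper. This is a minimal sketch, assuming that a NaN or infinite value should be treated as an error; `as_float64` is a hypothetical name, not something defined in data_parser.py.

```python
import numpy as np

def as_float64(*arrays):
    """Convert each array to float64 and check that no NaN/inf slipped in."""
    converted = []
    for arr in arrays:
        arr = np.asarray(arr).astype(np.float64)
        if not np.isfinite(arr).all():
            raise ValueError("array contains NaN or infinite values")
        converted.append(arr)
    return converted

# An object-dtype array, as can result from mixing ints, floats, and bools.
mixed = np.array([[1, 2.5], [True, 4]], dtype=object)
(clean,) = as_float64(mixed)
print(mixed.dtype, "->", clean.dtype)
```

In the notebook this could be called as `X_train, X_test, y_train, y_test = as_float64(X_train, X_test, y_train, y_test)`.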


#Inspecting Dataset Dimensions

This cell prints the shapes of the training and testing inputs and targets to verify that the data has been split correctly and that each input matrix aligns with its corresponding label vector.

In [6]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (584, 29)
X_test shape: (147, 29)
y_train shape: (584,)
y_test shape: (147,)
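
The same alignment check can be expressed as assertions instead of a visual comparison. The helper below, `check_split`, is a hypothetical sketch; it uses placeholder arrays with the shapes printed above rather than the real data.

```python
import numpy as np

def check_split(X_train, X_test, y_train, y_test):
    """Raise if inputs and targets are misaligned; return the training fraction."""
    if X_train.shape[0] != y_train.shape[0]:
        raise ValueError("X_train and y_train have different numbers of rows")
    if X_test.shape[0] != y_test.shape[0]:
        raise ValueError("X_test and y_test have different numbers of rows")
    if X_train.shape[1] != X_test.shape[1]:
        raise ValueError("X_train and X_test have different numbers of columns")
    n_train, n_test = X_train.shape[0], X_test.shape[0]
    return n_train / (n_train + n_test)

# Placeholder arrays with the shapes printed above.
frac = check_split(np.zeros((584, 29)), np.zeros((147, 29)),
                   np.zeros(584), np.zeros(147))
print(f"Training fraction: {frac:.3f}")
```

With 584 training rows out of 731, the split is roughly 80/20.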


Previewing Sample Data

This step displays a few rows from the training data and their corresponding target values as a quick sanity check. It helps confirm that the features and labels look reasonable before training the model.

In [7]:
print(X_train[:3])
print(y_train[:3])

[[ 1.          0.          0.         -0.81385634  1.23613525 -0.44488315
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          1.          0.          0.        ]
 [ 1.          0.          0.         -0.71121466  0.49355952  0.71244778
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          0.        ]
 [ 1.          0.          0.         -1.59945776 -1.25765687  0.70942613
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          0.          0.
   0.          0.          0.          0.          0.     

#Implementing Linear Regression from Scratch

This part creates linear_regression.py and defines a simple linear regression model using NumPy. The fit method learns the weight vector by solving the least-squares problem with np.linalg.lstsq, which relies on an SVD-based solver and avoids explicitly inverting X^T X, while the predict method uses the learned weights to generate predictions through matrix multiplication.

In [8]:
%%writefile linear_regression.py
import numpy as np

class LinearRegression:
    def __init__(self):
        self.w = None

    def fit(self, X, y):
        self.w = np.linalg.lstsq(X, y, rcond=None)[0]

    def predict(self, X):
        return X @ self.w

Overwriting linear_regression.py
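
To illustrate why np.linalg.lstsq is a reasonable choice here, the sketch below compares it with the textbook normal-equation solution on synthetic data (not the Bike Sharing data), then duplicates a column to make the problem rank-deficient. Whether the real design matrix is rank-deficient depends on how process_csv encodes categories, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-conditioned problem: bias column plus three random features.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=50)

# On a full-rank problem, lstsq and the normal equations give the same weights.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
print("Solutions agree:", np.allclose(w_lstsq, w_normal))

# Duplicate a column so that X.T @ X becomes singular.
X_dup = np.column_stack([X, X[:, 1]])
w_dup, _, rank, _ = np.linalg.lstsq(X_dup, y, rcond=None)
print("Rank:", rank, "of", X_dup.shape[1], "columns")
print("Predictions unchanged:", np.allclose(X_dup @ w_dup, X @ w_lstsq))
```

When columns are linearly dependent, lstsq returns the minimum-norm solution instead of failing, so the fitted values stay the same even though the individual weights are no longer unique.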


#Training the Linear Regression Model

Here, we initialize the linear regression model and fit it to the training data. Fitting finds the weight vector that minimizes the squared error between the model's predictions and the target bike rental counts on the training set.

In [17]:
from linear_regression import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [11]:
print("Weight vector shape:", model.w.shape)

Weight vector shape: (29,)


Generating Model Predictions

Here, we use the trained linear regression model to generate predictions for both the training and testing datasets. These predicted values will be used to evaluate how well the model fits the data and how it generalizes to unseen examples.

In [12]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [13]:
print(y_train_pred.shape)
print(y_test_pred.shape)

(584,)
(147,)


In [14]:
print("True:", y_train[:5])
print("Pred:", y_train_pred[:5])

True: [ 985.  801. 1349. 1562. 1600.]
Pred: [1303.61781844  993.39686106 1206.20983334 1385.36075853 1567.21709833]


#Defining the Mean Squared Error Metric

This cell defines a simple function to compute Mean Squared Error, which measures the average squared difference between the true target values and the model’s predictions.

In [15]:
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
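
One pitfall with this formula is NumPy broadcasting: if one argument has shape (n,) and the other (n, 1), the subtraction silently produces an n-by-n matrix and the result is wrong. The variant below, `mse_checked`, is a hypothetical stricter version that rejects mismatched shapes, shown on a three-element example.

```python
import numpy as np

def mse_checked(y_true, y_pred):
    """MSE that refuses mismatched shapes instead of silently broadcasting."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# Errors are 0, 0, and -2, so the MSE is (0 + 0 + 4) / 3.
print(mse_checked(y_true, y_pred))

# A column-vector prediction broadcasts to a 3x3 matrix of differences.
print((y_true - y_pred.reshape(-1, 1)).shape)
```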



Evaluating Baseline Model Performance

This step computes and reports the Mean Squared Error on both the training and testing sets, providing a baseline measure of how well the linear regression model fits the data and generalizes to unseen observations.

In [16]:
train_mse = mse(y_train, y_train_pred)
test_mse = mse(y_test, y_test_pred)

print("Training MSE:", train_mse)
print("Test MSE:", test_mse)

Training MSE: 462669.79441381706
Test MSE: 1218080.5473418413


Interpreting Error in Original Units

Here, we compute the Root Mean Squared Error to express the model’s prediction error in the original unit of bike rentals, making the results easier to interpret in practical terms.

In [18]:
print("Training RMSE:", np.sqrt(train_mse))
print("Test RMSE:", np.sqrt(test_mse))

Training RMSE: 680.1983493171805
Test RMSE: 1103.6668642945847
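
As a small worked example, the sketch below defines a standalone `rmse` helper (a hypothetical name, not defined elsewhere in the notebook) and uses the MSE values reported above to compare the test and training errors.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the target."""
    diff = np.asarray(y_true, dtype=np.float64) - np.asarray(y_pred, dtype=np.float64)
    return np.sqrt(np.mean(diff ** 2))

# MSE values copied from the baseline evaluation above.
train_mse = 462669.79441381706
test_mse = 1218080.5473418413

train_rmse, test_rmse = np.sqrt(train_mse), np.sqrt(test_mse)
print(f"Test/train RMSE ratio: {test_rmse / train_rmse:.2f}")
```

The test RMSE is about 1.6 times the training RMSE, so the baseline model is off by roughly 1,100 rentals per day on unseen data versus roughly 680 on the data it was fitted to.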


#Feature Engineering
Identifying Continuous Feature Indices

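This part inspects the feature matrix to locate the indices of the continuous variables used for feature engineering. We examine feature magnitudes in a sample row to distinguish standardized continuous features from binary one-hot encoded columns, so that nonlinear transformations target only the appropriate variables. Note that the magnitude filter used here also lets through the bias column and any active one-hot column, since both equal exactly 1.0; the continuous features are the entries with non-integer values.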

In [19]:
print("Number of features:", X_train.shape[1])

Number of features: 29


In [20]:
X_train[0]

array([ 1.        ,  0.        ,  0.        , -0.81385634,  1.23613525,
       -0.44488315,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  1.        ,  0.        ,  0.        ])

In [21]:
for i, v in enumerate(X_train[0]):
    # Keep moderate magnitudes; entries equal to 1.0 (bias, active one-hot) also pass.
    if 0.2 < abs(v) < 2:
        print(i, v)

0 1.0
3 -0.8138563421371906
4 1.236135247071073
5 -0.4448831480680497
25 1.0
26 1.0


In this step, we explicitly record the column indices corresponding to temperature, humidity, and wind speed. These indices are used to apply polynomial and interaction-based feature engineering only to the continuous variables.

In [23]:
temp_idx = 3
hum_idx = 4
wind_idx = 5

In [24]:
print(X_train[0, temp_idx],
      X_train[0, hum_idx],
      X_train[0, wind_idx])

-0.8138563421371906 1.236135247071073 -0.4448831480680497


Adding Nonlinear Feature Engineering

This part defines a utility function that augments the original feature matrix with polynomial and interaction terms derived from temperature, humidity, and wind speed. These additional features allow the linear regression model to capture nonlinear effects and interactions between weather variables while keeping the model itself linear in its parameters.

In [25]:
%%writefile feature_engineering.py
import numpy as np

def add_nonlinear_features(X, temp_idx, hum_idx, wind_idx):
    """
    Adds polynomial and interaction features for selected continuous columns.

    Parameters:
    - X: NumPy array (already includes bias column)
    - temp_idx, hum_idx, wind_idx: indices of continuous features

    Returns:
    - X_new: expanded feature matrix
    """

    # Extract continuous features
    temp = X[:, temp_idx]
    hum = X[:, hum_idx]
    wind = X[:, wind_idx]

    # Polynomial features
    temp_sq = temp ** 2
    hum_sq = hum ** 2
    wind_sq = wind ** 2

    # Interaction features
    temp_hum = temp * hum
    temp_wind = temp * wind
    hum_wind = hum * wind

    # Concatenate original features with new ones
    X_new = np.column_stack([
        X,
        temp_sq,
        hum_sq,
        wind_sq,
        temp_hum,
        temp_wind,
        hum_wind
    ])

    return X_new

Writing feature_engineering.py
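
The same expansion can be written for any list of column indices. The sketch below is a hypothetical generalization, `add_degree2_terms`, and not a replacement for the function above: it produces the same six products for three columns but in a different column order (squares and interactions interleaved).

```python
from itertools import combinations_with_replacement

import numpy as np

def add_degree2_terms(X, idxs):
    """Append every product x_i * x_j (i <= j) of the selected columns to X."""
    products = [X[:, i] * X[:, j]
                for i, j in combinations_with_replacement(idxs, 2)]
    return np.column_stack([X] + products)

# One toy row: a bias-like column followed by three "continuous" values.
X_toy = np.array([[1.0, 2.0, 3.0, 4.0]])
X_new = add_degree2_terms(X_toy, [1, 2, 3])
print(X_new.shape)
print(X_new[0, 4:])
```

For three selected columns there are six such products (three squares and three pairwise interactions), which is why the feature count grows by six.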


Applying Feature Engineering to the Dataset

Here, we apply the nonlinear feature expansion to both the training and testing sets, ensuring consistency between them. Printing the shapes confirms that new features were added correctly while keeping the number of data points unchanged.

In [26]:
from feature_engineering import add_nonlinear_features

In [27]:
X_train_fe = add_nonlinear_features(X_train, temp_idx, hum_idx, wind_idx)
X_test_fe = add_nonlinear_features(X_test, temp_idx, hum_idx, wind_idx)

In [28]:
print("Original X_train shape:", X_train.shape)
print("Feature-engineered X_train shape:", X_train_fe.shape)

Original X_train shape: (584, 29)
Feature-engineered X_train shape: (584, 35)


Training and Evaluating the Feature-Engineered Model

This step retrains the linear regression model on the feature-engineered inputs and evaluates its performance. Comparing the MSE before and after feature engineering shows the impact of the polynomial and interaction terms: training MSE falls from about 462,670 to 376,557 and test MSE from about 1,218,081 to 1,023,758, so the added features improve the fit on both the training data and the held-out data.

In [29]:
from linear_regression import LinearRegression

model_fe = LinearRegression()
model_fe.fit(X_train_fe, y_train)

In [30]:
y_train_pred_fe = model_fe.predict(X_train_fe)
y_test_pred_fe = model_fe.predict(X_test_fe)

In [31]:
train_mse_fe = mse(y_train, y_train_pred_fe)
test_mse_fe = mse(y_test, y_test_pred_fe)

print("Training MSE (before):", train_mse)
print("Training MSE (after): ", train_mse_fe)

print("Test MSE (before):", test_mse)
print("Test MSE (after): ", test_mse_fe)

Training MSE (before): 462669.79441381706
Training MSE (after):  376557.30994621356
Test MSE (before): 1218080.5473418413
Test MSE (after):  1023757.6634041732
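
To put the improvement in relative terms, the sketch below computes the fractional reduction in MSE from the four values printed above. `relative_reduction` is a hypothetical helper introduced only for this calculation.

```python
def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

# Values copied from the output above.
train_gain = relative_reduction(462669.79441381706, 376557.30994621356)
test_gain = relative_reduction(1218080.5473418413, 1023757.6634041732)

print(f"Training MSE reduction: {train_gain:.1%}")
print(f"Test MSE reduction: {test_gain:.1%}")
```

The reductions are roughly 19% on the training set and 16% on the test set, so most of the gain carries over to unseen data.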


In [33]:
!pwd
!ls

/content/drive/MyDrive/comp551-Ass1
COMP551-1.ipynb  data_parser.py		 linear_regression.py
data		 feature_engineering.py  __pycache__


In [35]:
!git config --global user.name "Victor-Akinode"
!git config --global user.email "victorakinode@gmail.com"

In [None]:
%%writefile .gitignore
# Google Drive metadata files
*.gsheet
*.gdoc
*.gslides

# Colab / Jupyter
.ipynb_checkpoints/

# Python cache
__pycache__/
*.pyc

Overwriting .gitignore
