# Machine Learning Coursework 1
## Regression Analysis

**Student ID:** Z22590018   
**Module:** CMP-X303-0 Machine Learning  
**Objective:** This notebook demonstrates the use of supervised learning algorithms on the dataset `cw1data.csv`.       
**Date Updated:** 23/10/2025

We will:
1. Import and explore the dataset  
2. Visualize and prepare the data  
3. Train multiple regression models  
4. Evaluate and compare their performance  


## Step 1: Importing Libraries
We begin by importing essential Python libraries for data analysis, visualization, and machine learning.

In [65]:
# Import required libraries
#
# Pandas and NumPy for data manipulation
# Matplotlib for plotting
# Seaborn for statistical visualization
# scikit-learn for ready-to-use ML algorithms
#
# Data handling and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML algorithms
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score


## Step 2: Importing and Exploring the Dataset
We now load the dataset and inspect its structure, including data types and missing values. We visualize the dataset to understand relationships between variables and detect potential correlations.



In [71]:
# Load the dataset
df = pd.read_csv('cw1data.csv')

# Inspect column types and non-null counts (uncomment to run)
#df.info()

# Check whether any values are missing (uncomment to run)
#df.isnull().values.any()

# Show the first 5 rows
df.head()


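| Unnamed: 0 | y | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49.83 | 1.68 | 82.8 | 24 | 6.554 | 6.538 | 6.438 | 6.39 | 6.318 | 29.44 | 39.83 | 59.1 | 54.11 | 40.72 |
| 1 | 50.12 | 1.71 | 86.5 | 53 | 6.593 | 6.578 | 6.465 | 6.42 | 6.356 | 19.11 | 40.19 | 57.34 | 53.6 | 39.24 |
| 2 | 49.02 | 1.65 | 91.0 | 45 | 6.488 | 6.466 | 6.36 | 6.313 | 6.251 | 31.0 | 41.56 | 56.69 | 50.99 | 38.08 |
| 3 | 61.7 | 1.69 | 100.7 | 42 | 6.361 | 6.334 | 6.209 | 6.16 | 6.087 | 33.39 | 44.33 | 52.26 | 45.33 | 29.23 |
| 4 | 40.83 | 1.72 | 62.3 | 37 | 6.667 | 6.644 | 6.539 | 6.491 | 6.417 | 34.33 | 48.35 | 69.03 | 62.02 | 44.97 |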


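## Step 3: Visualizing the Data
A correlation heatmap and per-feature regression plots show how each variable relates to the target `y`.

In [None]: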
# Heatmap of the pairwise correlations between all columns
plt.figure(figsize=(12, 8))
# Use seaborn's heatmap and annotate each cell with its coefficient
sns.heatmap(df.corr(), annot=True, cmap='Purples', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
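
The coefficients annotated in the heatmap are Pearson correlations, which is what `df.corr()` computes by default. As a rough worked sketch, the helper below (`pearson` is a hypothetical name, not part of the notebook) applies the formula to the five rows shown by `df.head()`. Five rows are far too few to represent the full dataset, so the numbers are for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of cw1data.csv, copied from the df.head() output
y = [49.83, 50.12, 49.02, 61.70, 40.83]
x2 = [82.8, 86.5, 91.0, 100.7, 62.3]
x5 = [6.538, 6.578, 6.466, 6.334, 6.644]

r_x2 = pearson(x2, y)
r_x5 = pearson(x5, y)
print(f"corr(x2, y) = {r_x2:.2f}")
print(f"corr(x5, y) = {r_x5:.2f}")
```

On these five rows `x2` rises with `y` while `x5` falls as `y` rises, which is the kind of strong positive and strong negative pair worth plotting individually.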

In [None]:
# Regression plots for the features with the highest and lowest correlation with the y column
#
# Create a figure with two subplots to show them side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# seaborn regression plot of x2 against y, drawn on the left axes
sns.regplot(x='x2', y='y', data=df, ax=axes[0])
axes[0].set_title('Relationship between x2 and y')

sns.regplot(x='x5', y='y', data=df, ax=axes[1])
axes[1].set_title('Relationship between x5 and y')

# Tighten the layout so the subplots do not overlap
plt.tight_layout()
plt.show()

## Step 4: Data Preparation
We separate the dataset into input features and the target variable, then split into training and testing subsets.


In [None]:
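# Define the feature matrix X and the target y
# 'Unnamed: 0' appears to be the row index saved with the CSV, so it is dropped rather than used as a feature
X = df.drop(columns=['y', 'Unnamed: 0'])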
y = df['y']

# Split the dataset into training and testing sets: 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# For debugging: uncomment to check that the split was performed correctly
#print("Training set size:", X_train.shape)
#print("Testing set size:", X_test.shape)

## Step 5: Model Training and Comparison
### Linear Regression
We train a simple Linear Regression model as a baseline for comparison.
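
`LinearRegression` fits ordinary least squares. With a single feature the solution has a closed form, sketched below on the `x2` and `y` values from the five rows shown earlier. The helper name `fit_line` is hypothetical, and the notebook's model uses all features rather than one, so this only illustrates the principle.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# x2 and y from the first five rows of cw1data.csv
x2 = [82.8, 86.5, 91.0, 100.7, 62.3]
y = [49.83, 50.12, 49.02, 61.70, 40.83]

slope, intercept = fit_line(x2, y)
print(f"y = {slope:.3f} * x2 + {intercept:.3f}")
```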


In [None]:
# Initialize and train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on test data
y_pred_lr = lr_model.predict(X_test)

# Model Evaluation
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
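
For reference, the two metrics reduce to short formulas: MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both by hand on made-up values that do not come from the dataset. The function names `mse` and `r2` are illustrative.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

print(f"MSE = {mse(y_true, y_pred):.3f}")  # MSE = 0.125
print(f"R2  = {r2(y_true, y_pred):.3f}")   # R2  = 0.975
```

A lower MSE is better and is expressed in squared units of `y`. R² is unitless: 1 is a perfect fit, and 0 means the model does no better than always predicting the mean.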


### Decision Tree

We train and evaluate a decision tree to see whether modelling non-linear relationships improves performance.
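
A regression tree captures non-linearity by repeatedly splitting the data at a threshold and predicting the mean of each side. The sketch below finds the best single split on one feature by minimizing the summed squared error. It uses toy values and hypothetical helper names, and it is a simplified illustration rather than scikit-learn's actual implementation.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) of the best single split on one feature."""
    best = (None, float("inf"))
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy step-shaped data that a straight line cannot fit well
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]

threshold, total = best_split(xs, ys)
print(threshold, total)  # 6.5 0.0
```

A full tree applies the same search recursively to each side. With no depth limit it can keep splitting until it memorizes the training data, which is why depth is a common setting to tune.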

In [None]:
# Initialize the Decision Tree Regressor
dt_model = DecisionTreeRegressor(
    max_depth=None,      # None lets the tree grow fully; set an integer to limit depth and reduce overfitting
    random_state=42
)

# Model Training
dt_model.fit(X_train, y_train)

# Prediction
y_pred_dt = dt_model.predict(X_test)

# Model Evaluation
mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

### Random Forest

We train a Random Forest so that we can compare it with the single Decision Tree.
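
A random forest trains many trees on bootstrap samples (rows drawn with replacement) and averages their predictions. The sketch below shows that idea with one-split trees on toy data. All names and values are assumptions for illustration, and it omits details of the real algorithm such as deeper trees and feature subsampling.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # the sample had a single distinct x value: predict the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_forest(xs, ys, n_trees=25, seed=42):
    """Train n_trees stumps, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Toy, roughly linear data for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]

forest = fit_forest(xs, ys)
print(round(predict_forest(forest, 2), 2), round(predict_forest(forest, 7), 2))
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged, which is the usual argument for expecting a forest to generalize better than a single fully grown tree.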

In [None]:
# Random Forest Initialization
rf_model = RandomForestRegressor(
    n_estimators=100,      # number of trees
    random_state=42,       # reproducibility
    # Maximum tree depth. None means no limit: nodes are expanded until the leaves are pure
    # or hold fewer than min_samples_split samples
    max_depth=None,
    # -1 uses all CPU cores
    n_jobs=-1
)

# Model Training
rf_model.fit(X_train, y_train)

# Prediction
y_pred_rf = rf_model.predict(X_test)

# Model Evaluation
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

## Step 6: Model Comparison
We evaluate the models with two common regression metrics, MSE and R², and plot their predictions against the actual values. The same plot lets us compare the single Decision Tree with its ensemble counterpart, Random Forest.


In [None]:
# Comparison table
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Decision Tree'],
    'MSE': [mse_lr, mse_rf, mse_dt],
    'R²': [r2_lr, r2_rf, r2_dt]
})

print(results)

# Create a figure with 2 plots side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scatter plot of actual vs predicted values for each model, with a dashed line marking a perfect fit
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', label='Perfect Fit')
axes[0].scatter(y_test, y_pred_lr, alpha=0.6, label='Linear Regression')
axes[0].scatter(y_test, y_pred_rf, alpha=0.6, label='Random Forest')
axes[0].scatter(y_test, y_pred_dt, alpha=0.6, label='Decision Tree')
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].legend()
axes[0].set_title('Predicted vs Actual Comparison')

# Bar chart comparing the evaluation metrics of each model
results_melted = results.melt(id_vars='Model', var_name='Metric', value_name='Score')
sns.barplot(x='Model', y='Score', hue='Metric', data=results_melted, ax=axes[1])
axes[1].set_title('Model Performance Comparison')

# Show Plots with tight layout
plt.tight_layout()
plt.show()