# **Experiment Objectives and Hypotheses**

## **Hypotheses:**



*   H1: Training separate regression models for each of the three identified clusters will improve performance (RMSE, MAE, R²) over a single unified model.

* H2: Clustering using Mahalanobis distance will yield more effective segments for modeling than Euclidean distance.

* H3: Training separate models for each season (Spring, Summer, Fall, Winter) will improve predictive accuracy by capturing seasonal variation in bike rental behavior.

* H4: A neural network will outperform the XGBoost baseline on RMSE, MAE, and R².
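
H2 compares two distance metrics, but scikit-learn's `KMeans` only works with Euclidean distance. One possible way to obtain Mahalanobis-based clusters, sketched below on synthetic data, is to whiten the features with a single global covariance matrix and then run ordinary K-Means, since Euclidean distance in the whitened space equals Mahalanobis distance in the original space. The helper names `whiten` and `mahalanobis_kmeans` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

def whiten(X):
    """Transform X so Euclidean distance in the output equals Mahalanobis distance in the input."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Factor the inverse covariance as L @ L.T; then ||d @ L||^2 == d.T @ inv(cov) @ d
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return centered @ L

def mahalanobis_kmeans(X, n_clusters=3, random_state=42):
    """Run standard K-Means on whitened features (assumes one global covariance)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return km.fit_predict(whiten(X))

# Synthetic, correlated features stand in for the bike-sharing data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

labels_euclidean = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)
labels_mahalanobis = mahalanobis_kmeans(X_demo, n_clusters=3)
print("Rows assigned differently:", int((labels_euclidean != labels_mahalanobis).sum()))
```

Cluster ids are arbitrary, so the raw label comparison in the last line is only a rough indication; the hypothesis itself should be judged on downstream regression metrics.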




## **Research Questions:**
* RQ1: Does cluster-based modeling improve regression performance compared to a single global model?

* RQ2: Which clustering distance metric (Mahalanobis vs. Euclidean) produces better downstream model performance?

* RQ3: Can season-specific models capture trends in bike demand better than a unified model?

* RQ4: Among XGBoost, neural networks, and cluster-wise regressors, which method yields the best performance?
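
RQ1 and RQ3 both compare one global model against a set of per-segment models, where the segment is either a cluster id or a season. The sketch below shows one way to organize that comparison. It uses `LinearRegression` on synthetic data rather than XGBoost to stay lightweight, and `fit_segment_models` and `predict_by_segment` are hypothetical helper names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def fit_segment_models(X, y, segments, make_model=LinearRegression):
    """Fit one model per unique segment label (e.g. cluster id or season)."""
    return {s: make_model().fit(X[segments == s], y[segments == s])
            for s in np.unique(segments)}

def predict_by_segment(models, X, segments):
    """Route each row to the model trained on its segment; unseen segments stay NaN."""
    preds = np.full(len(X), np.nan)
    for s, model in models.items():
        mask = segments == s
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds

# Synthetic data: two segments whose slopes have opposite signs
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(400, 1))
seg_demo = rng.integers(0, 2, size=400)
y_demo = np.where(seg_demo == 0, 3.0, -3.0) * X_demo[:, 0] + rng.normal(scale=0.1, size=400)

global_model = LinearRegression().fit(X_demo, y_demo)
segment_models = fit_segment_models(X_demo, y_demo, seg_demo)

rmse_global = np.sqrt(mean_squared_error(y_demo, global_model.predict(X_demo)))
rmse_segment = np.sqrt(mean_squared_error(
    y_demo, predict_by_segment(segment_models, X_demo, seg_demo)))
print(f"global RMSE={rmse_global:.3f}, segment-wise RMSE={rmse_segment:.3f}")
```

The toy example scores on the training data for brevity; in the actual experiment the comparison should be made on the held-out 20% test split.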



## **Experimental Design:**
### Baseline Model:
XGBoost Regressor
### Target Variable:
Rented Bike Count
### Dataset:
Seoul Bike Sharing Demand
### Data Split:
80% training, 20% testing

### Method:
[Insert a brief description of your method here]

### Visualization:

[Insert if applicable]


### Performance Metrics:
RMSE, MAE, and R²
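
For reference, the three metrics can be written out directly. The sketch below is a plain NumPy version, with `regression_metrics` as a hypothetical helper name; the notebook itself relies on scikit-learn's implementations.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² from arrays of true and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # penalizes large errors more
    mae = np.mean(np.abs(residuals))                 # average absolute error
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                       # share of variance explained
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Toy values, not taken from the dataset
demo = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(demo)
```

RMSE and MAE are in the units of the target (rented bikes per hour), while R² is unitless.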

## **Expected Outcomes:**


*  [Insert expectations from experiments]
*  [Insert expectations from experiments]




# **Import Libraries**

Include any additional libraries needed for development in this section.

In [None]:
pip install ucimlrepo

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.decomposition import PCA
import math
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from scipy.spatial.distance import mahalanobis
from ucimlrepo import fetch_ucirepo
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor


from matplotlib.colors import LinearSegmentedColormap
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D


plt.style.use('seaborn-v0_8-whitegrid')
mpl.rcParams['axes.edgecolor'] = '#333333'
mpl.rcParams['axes.linewidth'] = 0.8
mpl.rcParams['xtick.color'] = '#333333'
mpl.rcParams['ytick.color'] = '#333333'


# **Import Data**

In [None]:
# Fetch Seoul Bike Sharing Demand dataset from UCI ML Repository
seoul_bike_sharing_demand = fetch_ucirepo(id=560)

In [None]:
# Data (as pandas dataframes)
X_original = seoul_bike_sharing_demand.data.features
y_original = seoul_bike_sharing_demand.data.targets


In [None]:
# Print dataset information
print("Dataset Metadata:")
print(seoul_bike_sharing_demand.metadata)
print("\nVariable Information:")
print(seoul_bike_sharing_demand.variables)

In [None]:
# Examine feature information
print("\nOriginal feature columns:")
print(X_original.columns.tolist())
print("\nOriginal target variable:")
print(y_original.columns.tolist())

# **Data Preprocessing**

**Data Preprocessing Steps**



* Make "Rented Bike Count" the new target and move the former target from the dataset API into the features
* Check for missing values
* Convert the date column to datetime and extract year, month, day, and day of week
* Transform categorical features using one-hot encoding
* Split the dataset into training (80%) and test (20%) sets
* Create a heatmap to identify correlated features
* Compute VIF (Variance Inflation Factor) to measure multicollinearity
* Define a function to remove collinear features using a 0.8 correlation threshold
* Compare VIF before and after removal
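
The VIF step in the list above rests on the formula VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all other features. A minimal NumPy sketch of that definition follows. The function name `vif` is hypothetical, the data are synthetic, and each auxiliary regression here includes an intercept, which is an assumption of this sketch.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Design matrix: intercept plus every column except j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)   # nearly a copy of `a`, so strongly collinear
vif_demo = vif(np.column_stack([a, b, c]))
print(np.round(vif_demo, 1))
```

Features `a` and `c` receive large VIF values because each is almost fully explained by the other, while the independent feature `b` stays close to 1.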





In [None]:
# Make 'Rented Bike Count' the new target if it exists

if 'Rented Bike Count' in X_original.columns:
    # Make 'Rented Bike Count' the new target
    y = X_original[['Rented Bike Count']]
    # Remove 'Rented Bike Count' from features
    X = X_original.drop('Rented Bike Count', axis=1)
    # Add original target to features
    X = pd.concat([X, y_original], axis=1)

else:
    # If 'Rented Bike Count' is already the target, just confirm
    print("'Rented Bike Count' is already the target variable.")
    y = y_original
    X = X_original

In [None]:
print("\nNew feature columns:")
print(X.columns.tolist())
print("\nNew target variable:")
print(y.columns.tolist())

In [None]:
# Check for missing values
print("\nMissing values in features:")
print(X.isnull().sum())
print("\nMissing values in target:")
print(y.isnull().sum())

In [None]:
# 1. Convert date column to datetime and extract useful components. date format is DD/MM/YYYY
if 'Date' in X.columns:
    # Specify the correct date format as DD/MM/YYYY
    X['Date'] = pd.to_datetime(X['Date'], format='%d/%m/%Y')
    X['Year'] = X['Date'].dt.year
    X['Month'] = X['Date'].dt.month
    X['Day'] = X['Date'].dt.day
    X['DayOfWeek'] = X['Date'].dt.dayofweek
    X = X.drop('Date', axis=1)

In [None]:
# 2. Convert categorical features to numeric using one-hot encoding
X = pd.get_dummies(X, drop_first=True)
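
With `drop_first=True`, `pd.get_dummies` drops the first level (in sorted order) of each categorical column, so that level becomes the implicit reference category. The toy frame below, which is illustrative and not taken from the dataset, shows which columns survive.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the bike-sharing features
toy = pd.DataFrame({
    "Hour": [0, 1, 2, 3],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

encoded = pd.get_dummies(toy, drop_first=True)
# 'Seasons_Autumn' and 'Holiday_Holiday' are dropped as reference levels
print(encoded.columns.tolist())
```

A row with all season dummies equal to zero therefore represents Autumn.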

In [None]:
X

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train_df = pd.DataFrame(X_train, columns=X_train.columns)

correlation_matrix = X_train_df.corr()

plt.figure(figsize=(12, 8))
# Use the calculated correlation matrix in the heatmap
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=False, fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
# Function to calculate VIF (Variance Inflation Factor)
def calculate_vif(X):
    # Select only numeric columns to avoid issues with non-numeric data types
    X_numeric = X.select_dtypes(include=np.number)
    # Note: non-finite values (NaN/inf) are not handled here; clean them beforehand if present

    # Check if there are still columns left
    if X_numeric.shape[1] == 0:
        print("Warning: No numeric columns found for VIF calculation.")
        return pd.DataFrame(columns=["feature", "VIF"])

    vif_data = pd.DataFrame()
    vif_data["feature"] = X_numeric.columns
    # Ensure the values passed to variance_inflation_factor are finite and the correct type
    vif_data["VIF"] = [variance_inflation_factor(X_numeric.values, i) for i in range(X_numeric.shape[1])]
    return vif_data.sort_values("VIF", ascending=False)

# Function to remove collinear features based on correlation threshold
def remove_collinear_features(X, threshold=0.8):
    # Select only numeric columns before calculating correlation
    X_numeric = X.select_dtypes(include=np.number)

    corr_matrix = X_numeric.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Plot correlation matrix
    plt.figure(figsize=(12, 10))
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm', annot=False,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.title('Feature Correlation Matrix')
    plt.tight_layout()
    plt.savefig('correlation_matrix.png')
    plt.show()

    # Find features with correlation greater than threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

    print(f"Dropping {len(to_drop)} collinear features: {', '.join(to_drop)}")

    # Return dataframe with collinear features removed (applied to original X)
    return X.drop(to_drop, axis=1)


In [None]:
# Calculate VIF before feature removal
print("Variance Inflation Factors before collinearity removal:")
# Ensure only numeric columns are passed to calculate_vif
X_train_numeric = X_train.select_dtypes(include=np.number)
vif_before = calculate_vif(X_train_numeric)
print(vif_before.head(10))  # Show top 10 highest VIF values

# Remove collinear features
# The remove_collinear_features function now handles numeric selection internally for correlation
X_train_filtered = remove_collinear_features(X_train, threshold=0.8)
# Ensure test set has the same columns as the filtered training set
X_test_filtered = X_test[X_train_filtered.columns]

# Calculate VIF after feature removal
print("\nVariance Inflation Factors after collinearity removal:")
# Ensure only numeric columns are passed to calculate_vif
X_train_filtered_numeric = X_train_filtered.select_dtypes(include=np.number)
vif_after = calculate_vif(X_train_filtered_numeric)
print(vif_after.head(10))

In [None]:
# Standardize the filtered features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_filtered)
X_test_scaled = scaler.transform(X_test_filtered)

# **Baseline Model Development**

In [None]:
# Baseline XGBoost model
print("Training baseline XGBoost regression model...")
baseline_model = xgb.XGBRegressor(random_state=42)
baseline_model.fit(X_train_scaled, y_train.values.ravel())
baseline_preds = baseline_model.predict(X_test_scaled)
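baseline_metrics = {
    'rmse': math.sqrt(mean_squared_error(y_test, baseline_preds)),
    # MAE computed with NumPy so that all three reported metrics are available
    'mae': float(np.mean(np.abs(y_test.values.ravel() - baseline_preds))),
    'r2': r2_score(y_test, baseline_preds)
}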

# **Model Development**

# **Evaluation and Visualizations**

In [None]:
# Create a bar chart comparing the metrics
fig = plt.figure(figsize=(12, 6))
fig.patch.set_facecolor('#f8f9fa')

# Define the colors from our palette
colors = ['#39A0ED', '#FF5E5B', '#D8D8D8']  # Blue, Coral, Light Gray

# RMSE comparison (lower is better)
plt.subplot(1, 2, 1)
models = ['Baseline', 'Model_developed_1', 'Model_developed_2']
rmse_values = [baseline_metrics['rmse'], Model_developed_1['rmse'], Model_developed_2['rmse']]
bars = plt.bar(models, rmse_values, color=colors,
               edgecolor='white', linewidth=0.8, width=0.7)

# Add values on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{height:.2f}', ha='center', va='bottom', fontweight='bold')


plt.title('RMSE Comparison (lower is better)', fontsize=16, pad=20, fontweight='bold', color='#333333')
plt.ylabel('RMSE', fontsize=12, color='#333333')
plt.grid(axis='y', alpha=0.3)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)


min_rmse = min(rmse_values)
plt.axhline(y=min_rmse, color='#333333', linestyle='--', alpha=0.5)
plt.text(len(models)-1, min_rmse-0.5, f'Best: {min_rmse:.2f}',
         ha='right', va='top', color='#333333', alpha=0.7)

# R² comparison (higher is better), drawn in the same cell so both panels share one figure
plt.subplot(1, 2, 2)
r2_values = [baseline_metrics['r2'], Model_developed_1['r2'], Model_developed_2['r2']]
bars = plt.bar(models, r2_values, color=colors,  # Using the same color palette
               edgecolor='white', linewidth=0.8, width=0.7)

# Add values on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
            f'{height:.4f}', ha='center', va='bottom', fontweight='bold')

# Customize appearance
plt.title('R² Comparison (higher is better)', fontsize=16, pad=20, fontweight='bold', color='#333333')
plt.ylabel('R²', fontsize=12, color='#333333')
plt.grid(axis='y', alpha=0.3)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Add a horizontal line at the maximum R² for reference
max_r2 = max(r2_values)
plt.axhline(y=max_r2, color='#333333', linestyle='--', alpha=0.5)
plt.text(len(models)-1, max_r2-0.02, f'Best: {max_r2:.4f}',
         ha='right', va='top', color='#333333', alpha=0.7)

plt.tight_layout()
plt.savefig('seoul_bike_metrics_comparison.png', dpi=300, bbox_inches='tight', facecolor='#f8f9fa')
plt.show()

In [None]:
# MAE comparison (lower is better), drawn on its own figure
plt.figure(figsize=(6, 6))

# Calculate MAE for each model
from sklearn.metrics import mean_absolute_error
baseline_mae = mean_absolute_error(y_test, baseline_preds)
Model_developed_1_mae = mean_absolute_error(y_test, Model_developed_1_preds)
Model_developed_2_mae = mean_absolute_error(y_test, Model_developed_2_preds)

mae_values = [baseline_mae, Model_developed_1_mae, Model_developed_2_mae]
bars = plt.bar(models, mae_values, color=colors,  # Using the same color palette
               edgecolor='white', linewidth=0.8, width=0.7)

# Add values on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.1,
            f'{height:.2f}', ha='center', va='bottom', fontweight='bold')


plt.title('MAE Comparison (lower is better)', fontsize=16, pad=20, fontweight='bold', color='#333333')
plt.ylabel('MAE', fontsize=12, color='#333333')
plt.grid(axis='y', alpha=0.3)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

# Add a horizontal line at the minimum MAE for reference
min_mae = min(mae_values)
plt.axhline(y=min_mae, color='#333333', linestyle='--', alpha=0.5)
plt.text(len(models)-1, min_mae-0.5, f'Best: {min_mae:.2f}',
         ha='right', va='top', color='#333333', alpha=0.7)

plt.tight_layout()
# Save under a separate name so the RMSE/R² figure is not overwritten
plt.savefig('seoul_bike_mae_comparison.png', dpi=300, bbox_inches='tight', facecolor='#f8f9fa')
plt.show()

In [None]:
# Imports for the neural network model
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Imports needed to rebuild the dataset from the local CSV file
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the Seoul Bike Sharing dataset
df = pd.read_csv('SeoulBikeData.csv')

# Inspect the column names, since the CSV header may differ from the UCI API version
print("Available columns in dataset:", df.columns.tolist())

# Get original features and target
X_original = df.drop('Rented Bike Count', axis=1)
y_original = df[['Rented Bike Count']]

# Load and prepare the data
y = df[['Rented Bike Count']]
X = df.drop('Rented Bike Count', axis=1)

# Convert date column to datetime and extract useful components
if 'Date' in X.columns:
    X['Date'] = pd.to_datetime(X['Date'], format='%d/%m/%Y')
    X['Year'] = X['Date'].dt.year
    X['Month'] = X['Date'].dt.month
    X['Day'] = X['Date'].dt.day
    X['DayOfWeek'] = X['Date'].dt.dayofweek
    X = X.drop('Date', axis=1)

# Convert categorical features to numeric using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Create train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the filtered dataframes with scaled values
X_train_filtered = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_filtered = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Print available columns in the filtered dataset to debug
print("\nAvailable columns in X_train_filtered:", X_train_filtered.columns.tolist())

# Create cyclical features for hour from the raw (unscaled) values;
# the standardized 'Hour' column no longer spans 0-23, so it cannot be used here
if 'Hour' in X_train.columns:
    train_angle = 2 * np.pi * X_train['Hour'].to_numpy() / 24
    test_angle = 2 * np.pi * X_test['Hour'].to_numpy() / 24
    X_train_filtered['Hour_sin'] = np.sin(train_angle)
    X_train_filtered['Hour_cos'] = np.cos(train_angle)
    X_test_filtered['Hour_sin'] = np.sin(test_angle)
    X_test_filtered['Hour_cos'] = np.cos(test_angle)

# Create interaction features. Column names carry unit suffixes that vary between
# dataset versions, so match on the name prefix instead of an exact string
temp_col = next((c for c in X_train_filtered.columns if c.startswith('Temperature')), None)
humidity_col = next((c for c in X_train_filtered.columns if c.startswith('Humidity')), None)

if temp_col in X_train_filtered.columns and 'Hour' in X_train_filtered.columns:
    X_train_filtered['Temp_Hour'] = X_train_filtered[temp_col] * X_train_filtered['Hour']
    X_test_filtered['Temp_Hour'] = X_test_filtered[temp_col] * X_test_filtered['Hour']

if temp_col in X_train_filtered.columns and humidity_col in X_train_filtered.columns:
    X_train_filtered['Humidity_Temp'] = X_train_filtered[humidity_col] * X_train_filtered[temp_col]
    X_test_filtered['Humidity_Temp'] = X_test_filtered[humidity_col] * X_test_filtered[temp_col]

print("\nEnhanced features created")
print("New feature columns:", [col for col in X_train_filtered.columns if col not in X_train.columns])
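
Cyclical encoding maps an hour h to the point (sin(2πh/24), cos(2πh/24)) on the unit circle, so that 23:00 and 00:00 end up adjacent instead of 23 units apart. The sketch below illustrates this on raw hour values; `encode_hour` is a hypothetical helper, and the encoding only has this property when applied to hours in the 0-23 range, not to standardized values.

```python
import numpy as np

def encode_hour(hour):
    """Map raw hours (0-23) to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

enc = encode_hour([0, 1, 12, 23])
dist_1_0 = np.linalg.norm(enc[1] - enc[0])    # neighbours on the clock
dist_23_0 = np.linalg.norm(enc[3] - enc[0])   # also neighbours after encoding
dist_12_0 = np.linalg.norm(enc[2] - enc[0])   # opposite sides of the circle
print(f"|1-0|={dist_1_0:.3f}, |23-0|={dist_23_0:.3f}, |12-0|={dist_12_0:.3f}")
```

Hours 1 and 23 are equally close to midnight in the encoded space, which is the behavior a model needs to learn the daily demand cycle.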

In [None]:
# Prepare features for neural network
# Column names below must match the CSV header printed earlier
numeric_features = ['Hour', 'Temperature(�C)', 'Humidity(%)', 'Wind speed (m/s)',
                   'Visibility (10m)', 'Dew point temperature(�C)', 'Solar Radiation (MJ/m2)',
                   'Rainfall(mm)', 'Snowfall (cm)', 'Year', 'Month', 'Day', 'DayOfWeek']

categorical_features = ['Seasons_Spring', 'Seasons_Summer', 'Seasons_Winter',
                       'Holiday_No Holiday', 'Functioning Day_Yes']

# First create the additional features
# Add cyclical hour features
X['Hour_sin'] = np.sin(2 * np.pi * X['Hour'] / 24)
X['Hour_cos'] = np.cos(2 * np.pi * X['Hour'] / 24)

# Add interaction features - using the correct column names from the dataset
X['Temp_Hour'] = X['Temperature(�C)'] * X['Hour']
X['Humidity_Temp'] = X['Humidity(%)'] * X['Temperature(�C)']

# Update numeric features to include new derived features
numeric_features.extend(['Hour_sin', 'Hour_cos', 'Temp_Hour', 'Humidity_Temp'])

# Create feature matrix for neural network
X_nn = pd.concat([X[numeric_features], X[categorical_features]], axis=1)

print("Neural network feature matrix shape:", X_nn.shape)
print("Features:", X_nn.columns.tolist())

In [None]:
# Split data for neural network
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data
X_train_val, X_test_nn, y_train_val, y_test_nn = train_test_split(
    X_nn, y, test_size=0.2, random_state=42
)
X_train_nn, X_val_nn, y_train_nn, y_val_nn = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_nn)
X_val_scaled = scaler.transform(X_val_nn)
X_test_scaled = scaler.transform(X_test_nn)

print("Neural Network Data Shapes:")
print(f"X_train: {X_train_scaled.shape}")
print(f"X_val: {X_val_scaled.shape}")
print(f"X_test: {X_test_scaled.shape}")
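
The two-stage split above yields a 60/20/20 train/validation/test partition, because taking 25% of the remaining 80% gives 20% of the full data. A small check on synthetic arrays (not the bike data) makes the arithmetic concrete:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows so the proportions are easy to read
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Stage 1: hold out 20% for testing
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
# Stage 2: 25% of the remaining 80% becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=42)

sizes = (len(X_tr), len(X_va), len(X_te))
print("train / val / test sizes:", sizes)
```

The validation set is used only for early stopping and checkpointing, so the test set remains untouched until the final evaluation.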


In [None]:
# Build and compile the neural network model
# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Define learning rate schedule
initial_learning_rate = 0.001
decay_steps = 1000
decay_rate = 0.9
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps, decay_rate
)

# Build the model
model = models.Sequential([
    layers.Input(shape=(X_train_scaled.shape[1],)),

    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),

    layers.Dense(1)
])

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss='mse',
    metrics=['mae']
)

print("Neural network model built and compiled successfully")
model.summary()
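
With its default non-staircase setting, `ExponentialDecay` computes the learning rate as `initial_learning_rate * decay_rate ** (step / decay_steps)`, where `step` counts optimizer updates (batches), not epochs. The pure-Python sketch below reproduces that formula with the values used above; `exponential_decay` is a hypothetical stand-in, not the Keras class.

```python
def exponential_decay(step, initial_lr=0.001, decay_steps=1000, decay_rate=0.9):
    """Learning rate after `step` optimizer updates (continuous, non-staircase decay)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Learning rate at a few update counts
lr_at = {step: exponential_decay(step) for step in (0, 500, 1000, 2000)}
for step, lr in lr_at.items():
    print(f"step {step:>4}: lr = {lr:.6f}")
```

Every 1000 updates the rate shrinks by another factor of 0.9, so the schedule decays gently over a run of 100 epochs.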


In [None]:
# Train the neural network
# Define callbacks
early_stopping = callbacks.EarlyStopping(
    monitor='val_loss',
    patience=15,
    restore_best_weights=True
)

model_checkpoint = callbacks.ModelCheckpoint(
    'best_seoul_bike_model.h5',
    monitor='val_loss',
    save_best_only=True
)

# Train the model
print("Training neural network...")
history = model.fit(
    X_train_scaled, y_train_nn,
    validation_data=(X_val_scaled, y_val_nn),
    epochs=100,
    batch_size=32,
    callbacks=[early_stopping, model_checkpoint],
    verbose=1
)

print("Neural network training completed!")
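
The early-stopping rule used above can be summarized as: stop once the validation loss has failed to improve for `patience` consecutive epochs, then restore the weights from the best epoch. The following simplified simulation illustrates that logic; `early_stopping_epoch` is a hypothetical function that ignores options such as `min_delta`, so it is a sketch of the idea rather than the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=15):
    """Return (stop_epoch, best_epoch), both 0-based, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Made-up validation losses for illustration
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

In this toy sequence the loss bottoms out at epoch 2, and training stops three non-improving epochs later.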


In [None]:
# Evaluate neural network performance
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Make predictions
y_pred_nn = model.predict(X_test_scaled).flatten()

# Calculate metrics
rmse_nn = np.sqrt(mean_squared_error(y_test_nn, y_pred_nn))
mae_nn = mean_absolute_error(y_test_nn, y_pred_nn)
r2_nn = r2_score(y_test_nn, y_pred_nn)

print("Neural Network Performance Metrics:")
print(f"RMSE: {rmse_nn:.2f}")
print(f"MAE: {mae_nn:.2f}")
print(f"R²: {r2_nn:.4f}")

# Store results for comparison
nn_results = {
    'Model': 'Neural Network',
    'RMSE': rmse_nn,
    'MAE': mae_nn,
    'R²': r2_nn
}


In [None]:
# Import matplotlib first
import matplotlib.pyplot as plt

# Visualize neural network results
plt.figure(figsize=(15, 5))

# Training history
plt.subplot(1, 3, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Neural Network Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 3, 2)
plt.plot(history.history['mae'], label='Training MAE')
plt.plot(history.history['val_mae'], label='Validation MAE')
plt.title('Neural Network Training MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()

# Predictions vs Actual
plt.subplot(1, 3, 3)
plt.scatter(y_test_nn, y_pred_nn, alpha=0.5)
plt.plot([y_test_nn.min(), y_test_nn.max()], [y_test_nn.min(), y_test_nn.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Neural Network: R² = {r2_nn:.4f}')

plt.tight_layout()
plt.show()

In [None]:
# Compare the neural network with the XGBoost baseline
# (assumes baseline_metrics from the baseline cell is still in memory)
print("\n" + "="*50)
print("MODEL COMPARISON RESULTS")
print("="*50)

# Display neural network results
print(f"Neural Network - RMSE: {rmse_nn:.2f}, MAE: {mae_nn:.2f}, R²: {r2_nn:.4f}")

# Display baseline results for comparison
print(f"XGBoost Baseline - RMSE: {baseline_metrics['rmse']:.2f}, R²: {baseline_metrics['r2']:.4f}")
