# Assignment 2

I have outlined each person's part and given them code cells.

Try not to change other people's code cells.

If you can, try not to add new code cells or markdown cells, as git can get confused when merging. If you absolutely need to add a new cell, make sure everyone else has saved and committed their work, commit your change with the added cells, and then make sure everyone has updated to that new change before working on the doc again.

For your part, try to define what inputs you need and what outputs you are giving to the next person.

E.g.,

    Preprocessing (Molly) should specify the processed dataset they want everyone to use with their models & the different pipelines they want performance metrics to be run on

    People defining models (Ross & Elsie) should clearly state what models they want performance and efficiency tests to be performed on

    And people doing performance metrics (Jude & Krishan) should save and plot the metrics clearly for the conclusion
    

In [1]:
# Import libraries
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_validate, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, SGDRegressor, RidgeCV
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
)

# For reproducibility
RNG = 42
np.random.seed(RNG)

Code cells for defining functions or importing special libraries:

In [None]:
# Molly's code cell

In [None]:
# Ross's code cell

In [None]:
# Elsie's code cell

In [None]:
# Jude's code cell

In [None]:
# Krishan's code cell

In [None]:
# Rawan's code cell

# Pre-Processing

I've just added my pre-processing from part 1. We can build on this or on your own pre-processing, whichever you prefer :)

### Molly:
    
    Removing outliers/ missing data 

    train-test split 

    Data analysis 

    Co-variance 

    Correlation 

    histograms/ box plots to find data distribution 

    Investigate column encoding vs number encoding efficiency 

    For every type of pre-processing you want performance and efficiency tested, define a pipeline with ridge regression so that Krishan & Jude can do the analysis
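
One way to hand the pre-processing variants on to Jude and Krishan is a dictionary of named pipelines that all end in the same Ridge model. This is only a rough sketch: it runs on a small synthetic frame rather than the abalone data, and the `make_variants` helper, the variant names and `alpha=1.0` are placeholders, not choices the group has made. It compares one-hot ("column") encoding with ordinal ("number") encoding, with and without robust scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler


def make_variants(cat_cols, num_cols, alpha=1.0):
    """Return {name: Pipeline}; every pipeline ends in the same Ridge model."""
    # Factories so each pipeline gets its own transformer instances
    encoders = {
        "onehot": lambda: OneHotEncoder(handle_unknown="ignore"),
        "ordinal": lambda: OrdinalEncoder(),
    }
    scalers = {
        "raw": lambda: "passthrough",
        "robust": lambda: RobustScaler(),
    }
    variants = {}
    for enc_name, make_encoder in encoders.items():
        for scale_name, make_scaler in scalers.items():
            pre = ColumnTransformer(
                transformers=[
                    ("cat", make_encoder(), cat_cols),
                    ("num", make_scaler(), num_cols),
                ]
            )
            variants[f"{enc_name}_{scale_name}"] = Pipeline(
                steps=[("pre", pre), ("model", Ridge(alpha=alpha))]
            )
    return variants


# Synthetic stand-in for the real dataset (assumption: not the abalone data)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "gender": rng.choice(["F", "I", "M"], size=n),
    "length": rng.uniform(0.1, 0.8, size=n),
    "height": rng.uniform(0.01, 0.25, size=n),
})
target = 10 * demo["length"] + 20 * demo["height"] + rng.normal(0, 0.5, size=n)

variants = make_variants(cat_cols=["gender"], num_cols=["length", "height"])
predictions = {
    name: pipe.fit(demo, target).predict(demo) for name, pipe in variants.items()
}
print(list(variants))
```

Keeping the model fixed means any difference in the metrics comes from the pre-processing alone.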

In [None]:
# Molly's code cell

# dataset path
DATA_PATH = r"abalone.data.csv" 

# Load dataset as pandas DataFrame
df = pd.read_csv(DATA_PATH, sep=",")
assert "Rings" in df.columns, "Expected 'Rings' column." # Assert Rings column exists

# Converting gender to 3 separate one-hot encoded columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['gender'])  # Apply OneHotEncoder to the 'gender' column
    ],
    remainder='passthrough'  # Keep the other columns
)

datEnc = preprocessor.fit_transform(df)

# Get the new column names created by the transformer
new_columns = preprocessor.get_feature_names_out()

# Create a new DataFrame with the transformed data and new column names
dfEnc = pd.DataFrame(datEnc, columns=new_columns)
dfEnc = dfEnc.rename(columns={
    "cat__gender_F": "gender_F",
    "cat__gender_I": "gender_I",
    "cat__gender_M": "gender_M",
    "remainder__Length": "length",
    "remainder__Diameter": "diameter",
    "remainder__Height": "height",
    "remainder__Whole weight": "whole_weight",
    "remainder__Shucked weight": "shucked_weight",
    "remainder__Viscera weight": "viscera_weight",
    "remainder__Shell weight": "shell_weight",
    "remainder__Rings": "Rings",
})

# Summary statistics and check for missing values
display(dfEnc.describe(include='all'))
print("\nMissing values per column:")
print(dfEnc.isna().sum().sort_values(ascending=False))

# Check for duplicates
num_duplicates = dfEnc.duplicated().sum()
print(f"Number of duplicates: {num_duplicates}")

# Plot histograms for each feature
df.hist(bins=15, figsize=(10, 10))
plt.suptitle("Feature Distributions", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Remove entries with outliers in the 'height' feature:
# drop rows above the 99th percentile and rows with a non-positive height
height_threshold = dfEnc['height'].quantile(0.99)  # 99th percentile
dfEnc = dfEnc[dfEnc['height'] <= height_threshold]
dfEnc = dfEnc[dfEnc['height'] > 0]

# Re-plot all feature distributions after removing outliers
print("After pre-processing:")
display(dfEnc.describe(include='all'))
dfEnc.hist(bins=15, figsize=(10, 10))
plt.suptitle("Feature Distributions After pre-processing", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Split data into features and target variable
X = dfEnc.drop(columns=["Rings"])
y_reg = dfEnc["Rings"]

# Split data into training and testing sets for regression
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y_reg, test_size=0.2, random_state=RNG
)

# Fold datasets for cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=RNG)

#

### Jude:

    Do performance analysis of different pre-processing
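
A possible starting point for the performance comparison: run every candidate pipeline through the same `KFold` split with `cross_validate` and collect the mean RMSE, MAE and R² in one table. This is a sketch on synthetic data from `make_regression`; the two pipelines in `candidates` are stand-ins for whatever Molly ends up defining.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic regression problem (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Hypothetical pre-processing variants, each ending in Ridge
candidates = {
    "ridge_robust": Pipeline([("scale", RobustScaler()), ("model", Ridge())]),
    "ridge_standard": Pipeline([("scale", StandardScaler()), ("model", Ridge())]),
}

# scikit-learn scorers are "higher is better", so error metrics are negated
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, pipe in candidates.items():
    scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=scoring)
    rows.append({
        "pipeline": name,
        "rmse": -scores["test_rmse"].mean(),  # flip the sign back
        "mae": -scores["test_mae"].mean(),
        "r2": scores["test_r2"].mean(),
    })

performance = pd.DataFrame(rows).set_index("pipeline")
print(performance.round(3))
```

Reusing one `cv` object for all pipelines keeps the folds identical, so the rows are directly comparable.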

In [None]:
# Jude's code cell

### Krishan:

    Do efficiency analysis of different pre-processing
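
For the efficiency side, one simple option is to time `fit` and `predict` separately with `time.perf_counter` and take the median over a few repeats. The `time_pipeline` helper below is hypothetical and the data is synthetic; treat it as a sketch rather than the agreed method.

```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


def time_pipeline(pipe, X, y, repeats=5):
    """Return median wall-clock fit and predict times in seconds."""
    fit_times, predict_times = [], []
    for _ in range(repeats):
        model = clone(pipe)  # fresh, unfitted copy for each repeat

        start = time.perf_counter()
        model.fit(X, y)
        fit_times.append(time.perf_counter() - start)

        start = time.perf_counter()
        model.predict(X)
        predict_times.append(time.perf_counter() - start)
    return {
        "fit_s": float(np.median(fit_times)),
        "predict_s": float(np.median(predict_times)),
    }


# Synthetic data (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
pipe = Pipeline([("scale", RobustScaler()), ("model", Ridge())])

timings = time_pipeline(pipe, X_demo, y_demo)
print(timings)
```

The median is less sensitive than the mean to a single slow run, which matters when the timings are this short.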

In [None]:
# Krishan's code cell

# Comparing Different Models


I'm just putting ideas for models we could use, feel free to change :)

## Ridge regression

### Ross:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict
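
As a rough idea of what "tuned hyperparameters" could look like for Ridge, the sketch below searches over `alpha` with `GridSearchCV`, then predicts on a held-out split. The alpha grid, the scaler and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

ridge_pipe = Pipeline([("scale", RobustScaler()), ("model", Ridge())])

# "model__alpha" targets the alpha parameter of the step named "model"
param_grid = {"model__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(
    ridge_pipe,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr, y_tr)  # tunes on the training split only

best_alpha = search.best_params_["model__alpha"]
y_pred = search.predict(X_te)  # uses the refitted best pipeline
print(f"best alpha: {best_alpha:.4g}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the held-out fold does not leak into the scaling.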

In [None]:
# Ross's code cell

### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

## Naive Bayes

### Ross:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict

In [None]:
# Ross's code cell

### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

## Decision Tree

### Ross:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict

In [None]:
# Ross's code cell


### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

## Forest of Trees

### Elsie:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict

In [None]:
# Elsie's code cell


### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

## Voting Regressor

### Elsie:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict
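
A minimal sketch of a voting regressor, assuming Ridge, a decision tree and a random forest as the base models (the group may well pick different ones). With no weights set, `VotingRegressor` averages the predictions of its fitted base estimators, which the last lines check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical choice of base models and hyperparameters
voter = VotingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=42)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]
)
voter.fit(X_tr, y_tr)
y_pred = voter.predict(X_te)

# The ensemble prediction is the unweighted mean of the base predictions
base_preds = np.column_stack([est.predict(X_te) for est in voter.estimators_])
matches_mean = np.allclose(y_pred, base_preds.mean(axis=1))
print("matches mean of base models:", matches_mean)
```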

In [None]:
# Elsie's code cell


### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

## Stacking Regressor

### Elsie:

    Implement model pipeline with tuned hyperparameters

    Train model

    Use Model to predict
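
A minimal stacking sketch, assuming Ridge and a decision tree as base models with `RidgeCV` as the final estimator. Unlike voting, the final estimator learns how much weight to give each base model from their cross-validated predictions. The model choices and the synthetic data are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: stands in for the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical choice of base models
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=42)),
    ],
    final_estimator=RidgeCV(),  # combines the base-model predictions
    cv=5,  # out-of-fold predictions are used to train the final estimator
)
stack.fit(X_tr, y_tr)
y_pred = stack.predict(X_te)

# One learned coefficient per base model
meta_weights = dict(zip(stack.named_estimators_, stack.final_estimator_.coef_))
print(meta_weights)
```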

In [None]:
# Elsie's code cell


### Jude:

    Performance analysis

In [None]:
# Jude's code cell

### Krishan:

    Efficiency analysis

In [None]:
# Krishan's code cell

# Model Comparison

### Rawan:

    Create any graphs to compare model metrics
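
One way the comparison could work, assuming Jude and Krishan hand over a table with one row per model and one column per metric: draw one bar chart per metric. The `plot_model_comparison` helper and the column names are hypothetical, and the numbers here come from two models fitted on synthetic data, not from our results.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor


def plot_model_comparison(summary):
    """Draw one bar-chart panel per metric column; rows are models."""
    n_metrics = len(summary.columns)
    fig, axes = plt.subplots(
        1, n_metrics, figsize=(5 * n_metrics, 4), squeeze=False
    )
    for ax, metric in zip(axes[0], summary.columns):
        summary[metric].plot.bar(ax=ax, rot=0)
        ax.set_title(metric)
        ax.set_xlabel("model")
    fig.tight_layout()
    return fig


# Synthetic data and two stand-in models (assumptions for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

rows = {}
for name, model in models.items():
    cv_out = cross_validate(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    rows[name] = {
        "rmse": -cv_out["test_score"].mean(),
        "fit_time_s": cv_out["fit_time"].mean(),
    }

summary = pd.DataFrame.from_dict(rows, orient="index")
fig = plot_model_comparison(summary)
```

Separate panels keep metrics with very different scales, such as RMSE and fit time, readable side by side.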


In [None]:
# Rawan's code cell