# Etivity 3 - Task 2: Regression
## Name: Brian Mortimer
## Student ID: 20258763

Open a new Jupyter notebook and name it etivity3_regression.ipynb. In this notebook, train three regression pipelines with Random Forest, Linear Regression and a third regressor of your choice as the final estimator, respectively, for predicting the value of `insurance_cost`.

Requirements:
- For each regressor, include data preparation and dimensionality reduction steps in the main pipeline.
- You can choose any regressor as the third one. Some options are SVR and MLPRegressor, but you are not limited to them.
- For the dimensionality reduction step use PCA, RFE and a third dimensionality reduction (incl. feature selection) technique in at least one pipeline.
- Use grid search for hyperparameter tuning and replicate the process in the example notebook Tutorial 3-2 - Regression and Dimensionality Reduction.ipynb to evaluate and compare the models you have trained and pick the best one.
- Summarise your experience in a markdown cell (max 150 words).

### Data Loading & Preparation

In [12]:
# Imports
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import (ColumnTransformer, TransformedTargetRegressor)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn import set_config
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE


from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

import matplotlib.pyplot as plt
%matplotlib inline


In [7]:
# Functions
def load_insurance_data():
    """
    Load the insurance dataset from a CSV file.
    Returns a pandas DataFrame.
    """
    # Load the dataset
    df = pd.read_csv('insurance.csv')

    return df

In [8]:
# Load the dataset
df_original = load_insurance_data()
df = df_original.copy()

df.head()

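| Unnamed: 0 | age | gender | bmi | children | smoker | region | insurance_cost |
|---|---|---|---|---|---|---|---|
| 0 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 1 | 18 | male | 34.1 | 0 | no | southeast | 1137.011 |
| 2 | 18 | female | 26.315 | 0 | no | northeast | 2198.18985 |
| 3 | 18 | female | 38.665 | 2 | no | northeast | 3393.35635 |
| 4 | 18 | female | 35.625 | 0 | no | northeast | 2211.13075 |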


### Define Preprocessing Pipeline

In [9]:
# transformers
# Transform gender to binary values "male"=0, "female"=1
gender_transformer = FunctionTransformer(
    lambda x: np.where(x == 'male', 0, 1)
)

# One-hot encode region, dropping the first category (three dummy columns for four regions)
region_transformer = FunctionTransformer(
    lambda x: pd.get_dummies(x, drop_first=True)
)

# Transform smoker to binary values "yes"=1, "no"=0
smoker_transformer = FunctionTransformer(
    lambda x: np.where(x == 'yes', 1, 0)
)

# Transform BMI using log transformation to reduce skewness and impact of outliers
bmi_transformer  = Pipeline(
    steps=[
        ("log_transform", FunctionTransformer(np.log)), 
        ("scaler", RobustScaler())
    ]
)

# Transform children using a cube root transformation to reduce skewness
children_transformer = Pipeline(
    steps = [
        ("cubic_root_transform", FunctionTransformer(np.cbrt)),
        ("scaler", RobustScaler())
    ]
)

# Define the preprocessor
preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('bmi', bmi_transformer, ['bmi']),
        ('age', StandardScaler(), ['age']),
        ('children', children_transformer, ['children']),
        ('gender', gender_transformer, ['gender']),
        ('region', region_transformer, ['region']),
        ('smoker', smoker_transformer, ['smoker'])
    ]
)
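
One caveat with the `pd.get_dummies` approach for `region`: it derives its output columns from whichever categories are present in the data it receives, so a cross-validation fold or test split that lacks a region would produce a matrix of a different width. The following is a minimal sketch of an alternative, assuming scikit-learn's `OneHotEncoder` is acceptable for this task. The toy frames `train` and `test` are invented for illustration and are not taken from the insurance data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration: the "test" frame lacks three regions.
train = pd.DataFrame({"region": ["northeast", "southeast", "southwest", "northwest"]})
test = pd.DataFrame({"region": ["southeast", "southeast"]})

# get_dummies infers columns from the data it sees, so the widths can differ.
dummies_train = pd.get_dummies(train, drop_first=True)
dummies_test = pd.get_dummies(test, drop_first=True)

# OneHotEncoder learns the categories during fit and reuses them in transform.
encoder = ColumnTransformer(
    transformers=[("region", OneHotEncoder(drop="first"), ["region"])]
)
encoded_train = encoder.fit_transform(train)
encoded_test = encoder.transform(test)

print("get_dummies widths:", dummies_train.shape[1], dummies_test.shape[1])
print("OneHotEncoder widths:", encoded_train.shape[1], encoded_test.shape[1])
```

Swapping `OneHotEncoder(drop="first")` in for `region_transformer` would keep the column count fixed across folds, at the cost of departing from the encoding used in the cells above.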


In [10]:
# split the data into features and target variable
X = df.drop(columns=['insurance_cost'])
y = df['insurance_cost']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Define & Optimise Models

#### Random Forest

In [13]:
# Define the model pipeline for a Random Forest Regressor
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_pipeline),
    ("reduce_dim", "passthrough"),
    ('ttr', TransformedTargetRegressor(
        regressor=RandomForestRegressor(random_state=42, n_jobs=-1, n_estimators=10),
        func=np.log,
        inverse_func=np.exp
    ))
])
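
The `TransformedTargetRegressor` step fits the inner regressor on `np.log(y)` and maps predictions back with `np.exp`. A small sketch of that behaviour, using synthetic data and a `LinearRegression` stand-in (both are assumptions for illustration, not part of the notebook's pipeline):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: y grows exponentially with x.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 2, size=(200, 1))
y_demo = np.exp(1.0 + 2.0 * X_demo[:, 0])

# The inner regressor is fitted on np.log(y); predictions pass through np.exp.
ttr_demo = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
ttr_demo.fit(X_demo, y_demo)

# regressor_ is the fitted inner model; it predicts on the log scale.
log_scale_pred = ttr_demo.regressor_.predict(X_demo[:5])
original_scale_pred = ttr_demo.predict(X_demo[:5])

# The wrapper's prediction equals exp() of the inner model's prediction.
print(np.allclose(np.exp(log_scale_pred), original_scale_pred))
```

Because the inverse transform is applied automatically, scores reported by `GridSearchCV` are computed on the original cost scale rather than the log scale.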

In [15]:
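# NOTE: the preprocessor appears to emit 8 columns (bmi, age, children, gender,
# smoker and 3 region dummies), so PCA cannot use n_components=10; those
# candidates fail to fit and are scored as nan by GridSearchCV.
N_FEATURES_OPTIONS = [2, 4, 6, 8, 10]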
MAX_DEPTH_OPTIONS = [2, 4, 6, 8]

param_grid = [
    {
        'reduce_dim': [PCA()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'ttr__regressor__max_depth': MAX_DEPTH_OPTIONS,
    },
    {
        'reduce_dim': [RFE(RandomForestRegressor(random_state=42, n_jobs=-1))],
        'reduce_dim__n_features_to_select': N_FEATURES_OPTIONS,
        'ttr__regressor__max_depth': MAX_DEPTH_OPTIONS,
    },
    {
        'reduce_dim': ["passthrough"],
        'ttr__regressor__n_estimators': [10, 50, 100],
        'ttr__regressor__max_depth': MAX_DEPTH_OPTIONS,
    }
]

search = GridSearchCV(rf_pipeline, param_grid, n_jobs=-1, cv=5, refit=True)
search.fit(X_train, y_train)

print("Best CV score = %0.3f:" % search.best_score_)
print("Best parameters: ", search.best_params_)

# store the best params and best model for later use
RF_best_params = search.best_params_
RF_best_model = search.best_estimator_

Best CV score = 0.852:
Best parameters:  {'reduce_dim': 'passthrough', 'ttr__regressor__max_depth': 6, 'ttr__regressor__n_estimators': 100}


20 fits failed out of a total of 260.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Brian\anaconda3\envs\ai_env\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Brian\anaconda3\envs\ai_env\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Brian\anaconda3\envs\ai_env\Lib\site-packages\sklearn\pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
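
The traceback above is truncated, so the underlying error is not shown. One plausible explanation is that `PCA` was asked for `n_components=10` while the preprocessed matrix has only 8 columns (assuming `region` has four categories): 4 `max_depth` values times 5 folds gives exactly 20 failed fits. A minimal sketch that reproduces this kind of failure on a random stand-in matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the preprocessed matrix, assuming 8 columns:
# bmi, age, children, gender, smoker, plus 3 region dummies.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))

failed = []
for n_components in [2, 4, 6, 8, 10]:
    try:
        PCA(n_components=n_components).fit(X_demo)
    except ValueError:
        # PCA rejects n_components above min(n_samples, n_features).
        failed.append(n_components)

print("n_components values that failed:", failed)
```

If this is indeed the cause, capping the options at the number of preprocessed features, or passing `error_score='raise'` to `GridSearchCV` as the warning suggests, would confirm it.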

#### Linear Regression
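
A minimal sketch of how a Linear Regression search could mirror the Random Forest search above. It runs on synthetic data from `make_regression` rather than the insurance data, and every name in it (`lr_demo_pipeline`, `lr_demo_grid`, `lr_demo_search`) is hypothetical. `SelectKBest` with `f_regression` is used as the third dimensionality reduction technique purely as an assumption; any other selector could take its place.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features (8 columns assumed).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=5.0, random_state=42
)

# Same layout as the Random Forest pipeline: prepare, reduce, regress.
lr_demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("reduce_dim", "passthrough"),
    ("regressor", LinearRegression()),
])

# Keep every option at or below the number of available features.
n_features_options = [2, 4, 6, 8]
lr_demo_grid = [
    {"reduce_dim": [PCA()],
     "reduce_dim__n_components": n_features_options},
    {"reduce_dim": [RFE(LinearRegression())],
     "reduce_dim__n_features_to_select": n_features_options},
    {"reduce_dim": [SelectKBest(f_regression)],
     "reduce_dim__k": n_features_options},
]

lr_demo_search = GridSearchCV(lr_demo_pipeline, lr_demo_grid, cv=5)
lr_demo_search.fit(X_demo, y_demo)

print("Best CV score = %0.3f" % lr_demo_search.best_score_)
print("Best parameters:", lr_demo_search.best_params_)
```

For the real data, the `scaler` step would be replaced by `preprocessor_pipeline` and the regressor could be wrapped in a `TransformedTargetRegressor` as in the Random Forest pipeline.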

#### MLP Regressor

### Model Evaluation & Comparison
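
A minimal sketch of one way to compare fitted models on a held-out test set, using RMSE and R². It runs on synthetic data with two untuned stand-in models. The names `demo_models`, `demo_results` and `best_name` are hypothetical, and the numbers it prints say nothing about the insurance dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook would reuse X_test and y_test.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stand-ins for the best estimators returned by each grid search.
demo_models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Linear Regression": LinearRegression(),
}

demo_results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rmse = float(np.sqrt(mean_squared_error(y_te, y_pred)))
    r2 = float(r2_score(y_te, y_pred))
    demo_results[name] = {"rmse": rmse, "r2": r2}
    print(f"{name}: RMSE = {rmse:.2f}, R2 = {r2:.3f}")

# Pick the model with the lowest test RMSE.
best_name = min(demo_results, key=lambda n: demo_results[n]["rmse"])
print("Lowest test RMSE:", best_name)
```

In the notebook itself, the dictionary would instead hold the stored best estimators such as `RF_best_model`, which are already fitted by `GridSearchCV` with `refit=True` and so would only need the `predict` call.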

# Conclusion