CAPSTONE PROJECT 1

Task: Develop a regression model to predict the Premium amount based on the data provided.
Objectives are:

1.) Clean and preprocess the data

2.) Explore feature importance and relationships

3.) Build and evaluate a robust predictive model

Import necessary libraries

In [15]:
import pandas as pd
import numpy as py
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
import matplotlib.patches as mpatches  # Import mpatches

Read the dataset into pandas and preview the first five rows

In [16]:
df = pd.read_csv("Insurance Premium Prediction Dataset.csv")
df.head()

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type
0,56.0,Male,99990.0,Married,1.0,Master's,,31.074627,Urban,Comprehensive,,13,320.0,5,308.0,2022-12-10 15:21:39.078837,Poor,Yes,Daily,Condo
1,46.0,Male,2867.0,Single,1.0,Bachelor's,,50.271335,Urban,Comprehensive,,3,694.0,4,517.0,2023-01-31 15:21:39.078837,Good,Yes,Monthly,House
2,32.0,Female,30154.0,Divorced,3.0,Bachelor's,,14.714909,Suburban,Comprehensive,2.0,16,652.0,8,849.0,2023-11-26 15:21:39.078837,Poor,No,Monthly,House
3,60.0,Female,48371.0,Divorced,0.0,PhD,Self-Employed,25.346926,Rural,Comprehensive,1.0,11,330.0,7,927.0,2023-02-27 15:21:39.078837,Poor,No,Rarely,Condo
4,25.0,Female,54174.0,Divorced,0.0,High School,Self-Employed,6.659499,Urban,Comprehensive,,9,,8,303.0,2020-11-25 15:21:39.078837,Poor,No,Rarely,Condo


peep into the dataset to get basic information about the data type and column names

In [17]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278860 entries, 0 to 278859
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Age                   274175 non-null  float64
 1   Gender                278860 non-null  object 
 2   Annual Income         264905 non-null  float64
 3   Marital Status        273841 non-null  object 
 4   Number of Dependents  250974 non-null  float64
 5   Education Level       278860 non-null  object 
 6   Occupation            197572 non-null  object 
 7   Health Score          268263 non-null  float64
 8   Location              278860 non-null  object 
 9   Policy Type           278860 non-null  object 
 10  Previous Claims       197572 non-null  float64
 11  Vehicle Age           278860 non-null  int64  
 12  Credit Score          250974 non-null  float64
 13  Insurance Duration    278860 non-null  int64  
 14  Premium Amount        277019 non-null  float64
 15  

Unnamed: 0,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount
count,274175.0,264905.0,250974.0,268263.0,197572.0,278860.0,250974.0,278860.0,277019.0
mean,41.020771,42089.085329,1.998048,28.58429,0.998117,9.520283,574.362049,5.007764,966.118667
std,13.549683,35444.517255,1.412312,15.966208,1.000795,5.767915,158.792037,2.581349,909.404567
min,18.0,0.0,0.0,0.035436,0.0,0.0,300.0,1.0,0.0
25%,29.0,13588.0,1.0,16.14989,0.0,5.0,437.0,3.0,286.0
50%,41.0,32191.0,2.0,26.451244,1.0,10.0,575.0,5.0,688.0
75%,53.0,62164.0,3.0,38.966369,2.0,15.0,712.0,7.0,1367.0
max,64.0,149997.0,4.0,93.87609,9.0,19.0,849.0,9.0,4999.0


Check for duplicates in the dataset

In [18]:
# Check for duplicates in the entire DataFrame
duplicates = df[df.duplicated(keep=False)]  # keep=False to show all duplicate rows

if not duplicates.empty:
    print("Duplicates found:")
    display(duplicates)
    initial_rows = df.shape[0]
    df.drop_duplicates(inplace=True)
    rows_after_dropping_duplicates = df.shape[0]

else:
    print("No duplicates found in the DataFrame.")

No duplicates found in the DataFrame.


Converting column values into appropriate formats

In [19]:
# 1. Examine data types
print("Original data types:")
print(df.dtypes)

# 2. Convert numerical columns currently as objects to numeric
numeric_cols_to_convert = ['Annual Income', 'Premium Amount', 'Credit Score', 'Health Score', 'Previous Claims']
for col in numeric_cols_to_convert:
    if col in df.columns and df[col].dtype == 'object':
        df[col] = pd.to_numeric(df[col], errors='coerce')

# 3. Ensure consistent formatting for categorical text data
text_cols_to_clean = ['Gender', 'Marital Status', 'Education Level', 'Occupation', 'Location', 'Policy Type', 'Customer Feedback', 'Smoking Status', 'Exercise Frequency', 'Property Type']
for col in text_cols_to_clean:
    if col in df.columns and df[col].dtype == 'object':
        df[col] = df[col].str.lower().str.strip()

# 4. Convert date columns to datetime objects
date_cols_to_convert = ['Policy Start Date']
for col in date_cols_to_convert:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

# 5. Display data types after conversions
print("\nData types after conversions and formatting:")
print(df.dtypes)

Original data types:
Age                     float64
Gender                   object
Annual Income           float64
Marital Status           object
Number of Dependents    float64
Education Level          object
Occupation               object
Health Score            float64
Location                 object
Policy Type              object
Previous Claims         float64
Vehicle Age               int64
Credit Score            float64
Insurance Duration        int64
Premium Amount          float64
Policy Start Date        object
Customer Feedback        object
Smoking Status           object
Exercise Frequency       object
Property Type            object
dtype: object

Data types after conversions and formatting:
Age                            float64
Gender                          object
Annual Income                  float64
Marital Status                  object
Number of Dependents           float64
Education Level                 object
Occupation                      object
Health

Understanding the percentage of missing values to the whole column values

In [20]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
missing_percentage = (missing_values / len(df)) * 100

missing_info = pd.DataFrame({'Missing Count': missing_values, 'Missing Percentage (%)': missing_percentage})
display(missing_info)

Unnamed: 0,Missing Count,Missing Percentage (%)
Occupation,81288,29.150111
Previous Claims,81288,29.150111
Number of Dependents,27886,10.0
Credit Score,27886,10.0
Customer Feedback,18349,6.580004
Annual Income,13955,5.004303
Health Score,10597,3.800115
Marital Status,5019,1.799828
Age,4685,1.680055
Premium Amount,1841,0.660188


Handling the missing values

In [21]:
# Impute numerical columns with median
numerical_cols_to_impute = ['Annual Income', 'Health Score', 'Age', 'Premium Amount', 'Credit Score', 'Number of Dependents', 'Previous Claims']
for col in numerical_cols_to_impute:
    if col in df.columns and df[col].isnull().any():
        median_val = df[col].median()
        df.fillna({col: median_val}, inplace=True)

# Impute categorical columns with mode
categorical_cols_to_impute = ['Marital Status', 'Customer Feedback', 'Occupation']
for col in categorical_cols_to_impute:
    if col in df.columns and df[col].isnull().any():
        mode_val = df[col].mode()[0]
        df.fillna(mode_val, inplace=True)

# Verify that missing values have been handled
missing_values_after = df.isnull().sum()
missing_values_after = missing_values_after[missing_values_after > 0]

if missing_values_after.empty:
    print("All missing values have been handled.")
else:
    print("Remaining missing values:")
    display(missing_values_after)

All missing values have been handled.


Identifying and handling skewness

In [22]:
import numpy as np
from scipy.stats import boxcox

# 1. Identify numerical columns
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()

# Exclude the target variable 'Premium Amount' for now if we are only transforming features
# Depending on the modeling approach, the target variable might also need transformation,
# but for feature skewness, we focus on predictors.
# Let's keep 'Premium Amount' for now as it's a numerical column and its skewness might be relevant.

# 2. For each numerical column, calculate its skewness.
skewness = df[numerical_cols].skew().sort_values(ascending=False)
print("Original Skewness:")
print(skewness)

# 3. Determine a threshold for skewness
# A common threshold is |skewness| > 1.0 for high skewness, or |skewness| > 0.5 for moderate skewness.
# Let's use 0.7 as a threshold for transformation.
skewness_threshold = 0.7
highly_skewed_cols = skewness[(abs(skewness) > skewness_threshold)].index.tolist()

print(f"\nHighly skewed columns (absolute skewness > {skewness_threshold}):")
print(highly_skewed_cols)

# 4. Apply appropriate transformations
transformed_df = df.copy()

for col in highly_skewed_cols:
    # Check if the column has non-negative values for log and Box-Cox transformations
    if (transformed_df[col] >= 0).all():
        # Apply log transformation if skewness is positive and data is non-negative
        # Adding a small constant to handle potential zero values
        if transformed_df[col].min() == 0:
            transformed_df[col + '_log'] = np.log1p(transformed_df[col]) # log1p(x) = log(1+x)
        else:
             transformed_df[col + '_log'] = np.log(transformed_df[col])

        # Apply Box-Cox transformation if skewness is positive and data is strictly positive
        # Box-Cox requires strictly positive data.
        # Let's check for strictly positive values before applying Box-Cox
        if (transformed_df[col] > 0).all():
            try:
                transformed_df[col + '_boxcox'], fitted_lambda = boxcox(transformed_df[col])
                print(f"Applied Box-Cox transformation to '{col}' with lambda={fitted_lambda:.4f}")
            except Exception as e:
                print(f"Could not apply Box-Cox transformation to '{col}': {e}")
        else:
             print(f"Column '{col}' contains non-positive values, skipping Box-Cox.")

    else:
        print(f"Column '{col}' contains negative values, skipping log and Box-Cox transformations.")
        # For negative values, consider other transformations like Yeo-Johnson or simply not transforming if not severely skewed

# 5. Recalculate and examine the skewness of the transformed columns
transformed_numerical_cols = transformed_df.select_dtypes(include=np.number).columns.tolist()
transformed_skewness = transformed_df[transformed_numerical_cols].skew().sort_values(ascending=False)

print("\nSkewness after Transformations:")
print(transformed_skewness)

# Update the original dataframe with the transformed columns
# You might choose to keep both original and transformed columns or replace
# For this task, let's keep the transformed columns and decide later which ones to use in modeling.
df = transformed_df

Original Skewness:
Premium Amount          1.510386
Previous Claims         1.204923
Annual Income           1.061455
Health Score            0.620481
Number of Dependents    0.000744
Credit Score            0.000251
Age                    -0.001830
Insurance Duration     -0.002477
Vehicle Age            -0.003884
dtype: float64

Highly skewed columns (absolute skewness > 0.7):
['Premium Amount', 'Previous Claims', 'Annual Income']
Column 'Premium Amount' contains non-positive values, skipping Box-Cox.
Column 'Previous Claims' contains non-positive values, skipping Box-Cox.
Column 'Annual Income' contains non-positive values, skipping Box-Cox.

Skewness after Transformations:
Premium Amount          1.510386
Previous Claims         1.204923
Annual Income           1.061455
Health Score            0.620481
Number of Dependents    0.000744
Credit Score            0.000251
Age                    -0.001830
Insurance Duration     -0.002477
Vehicle Age            -0.003884
Previous Claims_lo

Encoding categorical columns

In [23]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()

# Apply one-hot encoding to all categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display the head and info of the updated DataFrame
display(df_encoded.head())
display(df_encoded.info())

Unnamed: 0,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount,Policy Start Date,...,Policy Type_premium,Customer Feedback_good,Customer Feedback_poor,Customer Feedback_single,Smoking Status_yes,Exercise Frequency_monthly,Exercise Frequency_rarely,Exercise Frequency_weekly,Property Type_condo,Property Type_house
0,56.0,99990.0,1.0,31.074627,1.0,13,320.0,5,308.0,2022-12-10 15:21:39.078837,...,False,False,True,False,True,False,False,False,True,False
1,46.0,2867.0,1.0,50.271335,1.0,3,694.0,4,517.0,2023-01-31 15:21:39.078837,...,False,True,False,False,True,True,False,False,False,True
2,32.0,30154.0,3.0,14.714909,2.0,16,652.0,8,849.0,2023-11-26 15:21:39.078837,...,False,False,True,False,False,True,False,False,False,True
3,60.0,48371.0,0.0,25.346926,1.0,11,330.0,7,927.0,2023-02-27 15:21:39.078837,...,False,False,True,False,False,False,True,False,True,False
4,25.0,54174.0,0.0,6.659499,1.0,9,575.0,8,303.0,2020-11-25 15:21:39.078837,...,False,False,True,False,False,False,True,False,True,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278860 entries, 0 to 278859
Data columns (total 35 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   Age                          278860 non-null  float64       
 1   Annual Income                278860 non-null  float64       
 2   Number of Dependents         278860 non-null  float64       
 3   Health Score                 278860 non-null  float64       
 4   Previous Claims              278860 non-null  float64       
 5   Vehicle Age                  278860 non-null  int64         
 6   Credit Score                 278860 non-null  float64       
 7   Insurance Duration           278860 non-null  int64         
 8   Premium Amount               278860 non-null  float64       
 9   Policy Start Date            278860 non-null  datetime64[ns]
 10  Premium Amount_log           278860 non-null  float64       
 11  Previous Claims_log       

None

Adding new features for proper training 

In [24]:
# Extract year, month, and day from 'Policy Start Date'
df_encoded['Policy_Start_Year'] = df_encoded['Policy Start Date'].dt.year
df_encoded['Policy_Start_Month'] = df_encoded['Policy Start Date'].dt.month
df_encoded['Policy_Start_Day'] = df_encoded['Policy Start Date'].dt.day

# Create 'Age Group' bins from the 'Age' column
# Define bins and labels for age groups
age_bins = [0, 18, 30, 45, 60, 100]
age_labels = ['0-17', '18-29', '30-44', '45-59', '60+']
df_encoded['Age_Group'] = pd.cut(df_encoded['Age'], bins=age_bins, labels=age_labels, right=False)

# Display the head of the DataFrame with the new features
display(df_encoded[['Policy_Start_Year', 'Policy_Start_Month', 'Policy_Start_Day', 'Age_Group']].head())

Unnamed: 0,Policy_Start_Year,Policy_Start_Month,Policy_Start_Day,Age_Group
0,2022,12,10,45-59
1,2023,1,31,45-59
2,2023,11,26,30-44
3,2023,2,27,60+
4,2020,11,25,18-29


Check for need for text processing

In [25]:
# Examine the columns in the df_encoded DataFrame
print("Columns in df_encoded:")
print(df_encoded.columns)

# Check if there are any text columns left that might require text processing
# after initial cleaning and one-hot encoding
text_columns_remaining = df_encoded.select_dtypes(include='object').columns.tolist()

if text_columns_remaining:
    print("\nRemaining text columns that might require further processing:")
    print(text_columns_remaining)
    # Outline potential steps if needed (not executing code here as per instruction)
    print("\nFurther text data processing steps (if required for these columns):")
    print("1. Examine the content of these text columns to understand their nature.")
    print("2. If they contain free-form text, consider techniques like TF-IDF, Word Embeddings (e.g., Word2Vec, GloVe), or pre-trained transformer models (e.g., BERT) for feature extraction.")
    print("3. If they are structured text with specific patterns, consider using regular expressions or custom parsing functions.")
    print("4. Based on the chosen technique, implement the appropriate code to transform the text data into numerical features.")
else:
    print("\nNo text columns remaining in df_encoded that require further processing.")


Columns in df_encoded:
Index(['Age', 'Annual Income', 'Number of Dependents', 'Health Score',
       'Previous Claims', 'Vehicle Age', 'Credit Score', 'Insurance Duration',
       'Premium Amount', 'Policy Start Date', 'Premium Amount_log',
       'Previous Claims_log', 'Annual Income_log', 'Gender_male',
       'Marital Status_married', 'Marital Status_single',
       'Education Level_high school', 'Education Level_master's',
       'Education Level_phd', 'Occupation_self-employed', 'Occupation_single',
       'Occupation_unemployed', 'Location_suburban', 'Location_urban',
       'Policy Type_comprehensive', 'Policy Type_premium',
       'Customer Feedback_good', 'Customer Feedback_poor',
       'Customer Feedback_single', 'Smoking Status_yes',
       'Exercise Frequency_monthly', 'Exercise Frequency_rarely',
       'Exercise Frequency_weekly', 'Property Type_condo',
       'Property Type_house', 'Policy_Start_Year', 'Policy_Start_Month',
       'Policy_Start_Day', 'Age_Group'],
     

Separate features (X) and target variable (y), and split the data into training and testing sets.

In [26]:
from sklearn.model_selection import train_test_split
# 1. Define the features X by dropping the 'Premium Amount' column and other irrelevant columns.
# 'Policy Start Date' is a datetime object, and 'Age_Group' is an object (categorical)
# We will drop the original 'Premium Amount' and the transformed log columns for now, focusing on the original numerical features and one-hot encoded features.
columns_to_drop = ['Premium Amount', 'Policy Start Date', 'Age_Group', 'Premium Amount_log', 'Previous Claims_log', 'Annual Income_log']
X = df_encoded.drop(columns=columns_to_drop, errors='ignore')

# 2. Define the target variable y as the 'Premium Amount' column.
y = df_encoded['Premium Amount']

# 3. Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 4. Print the shapes of the resulting training and testing sets.
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (209145, 33)
Shape of X_test: (69715, 33)
Shape of y_train: (209145,)
Shape of y_test: (69715,)


Feature Importance

In [27]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate a RandomForestRegressor model
model = RandomForestRegressor(random_state=42)

# Fit the model to the training data
model.fit(X_train, y_train)

# Get feature importances
feature_importances = model.feature_importances_

# Create a pandas Series for better visualization
feature_importance_series = pd.Series(feature_importances, index=X_train.columns)

# Sort feature importances in descending order
sorted_feature_importances = feature_importance_series.sort_values(ascending=False)

# Display the top 15 most important features
top_n = 15
print(f"Top {top_n} Most Important Features:")
display(sorted_feature_importances.head(top_n))

Top 15 Most Important Features:


Health Score                 0.125994
Annual Income                0.125078
Credit Score                 0.107163
Age                          0.082520
Policy_Start_Day             0.075037
Vehicle Age                  0.068166
Policy_Start_Month           0.052050
Insurance Duration           0.045089
Policy_Start_Year            0.036087
Number of Dependents         0.032477
Previous Claims              0.027181
Gender_male                  0.012380
Smoking Status_yes           0.011930
Policy Type_comprehensive    0.010934
Marital Status_single        0.010932
dtype: float64

Feature selection based on importance

In [28]:
# 1. Based on the sorted_feature_importances series, decide on a number of top features to keep.
# Let's choose to keep the top 20 features as a starting point.
num_top_features = 20
selected_features = sorted_feature_importances.head(num_top_features).index.tolist()

# 2. Create a list of the selected feature names.
print(f"Selected Features ({num_top_features}):")
print(selected_features)

# 3. Filter the training and testing feature DataFrames (X_train and X_test) to include only the selected features.
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# 4. Print the shapes of the filtered X_train and X_test to confirm the feature selection.
print("\nShape of X_train after feature selection:", X_train_selected.shape)
print("Shape of X_test after feature selection:", X_test_selected.shape)

Selected Features (20):
['Health Score', 'Annual Income', 'Credit Score', 'Age', 'Policy_Start_Day', 'Vehicle Age', 'Policy_Start_Month', 'Insurance Duration', 'Policy_Start_Year', 'Number of Dependents', 'Previous Claims', 'Gender_male', 'Smoking Status_yes', 'Policy Type_comprehensive', 'Marital Status_single', 'Location_suburban', 'Policy Type_premium', 'Marital Status_married', 'Property Type_house', 'Occupation_single']

Shape of X_train after feature selection: (209145, 20)
Shape of X_test after feature selection: (69715, 20)


Experimenting with different regression algorithym

In [29]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Create a dictionary to store the instantiated models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(random_state=42),
    "Lasso": Lasso(random_state=42),
    "Elastic Net": ElasticNet(random_state=42),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "K-Neighbors Regressor": KNeighborsRegressor()
}

# Dictionary to store trained models
trained_models = {}

# Iterate through the models and train them
print("Training Models:")
for name, model in models.items():
    print(f"Training {name}...")
    try:
        model.fit(X_train_selected, y_train)
        trained_models[name] = model
        print(f"{name} training completed.")
    except Exception as e:
        print(f"Error training {name}: {e}")

print("\nAll specified models have been attempted for training.")

Training Models:
Training Linear Regression...
Linear Regression training completed.
Training Ridge...
Ridge training completed.
Training Lasso...
Lasso training completed.
Training Elastic Net...
Elastic Net training completed.
Training Decision Tree Regressor...
Decision Tree Regressor training completed.
Training Random Forest Regressor...
Random Forest Regressor training completed.
Training Gradient Boosting Regressor...
Gradient Boosting Regressor training completed.
Training Support Vector Regressor...
Support Vector Regressor training completed.
Training K-Neighbors Regressor...
K-Neighbors Regressor training completed.

All specified models have been attempted for training.


Model evaluation

In [30]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge # Including Ridge as it often performs well
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd

# Dictionary to store evaluation results
evaluation_results = {}

# Evaluate the trained models first to get evaluation_df
print("Evaluating Models for Hyperparameter Tuning Model Selection:")
for name, model in trained_models.items():
    print(f"Evaluating {name}...")
    try:
        # Make predictions on the testing data
        y_pred = model.predict(X_test_selected)

        # Calculate evaluation metrics
        mae = mean_absolute_error(y_test, y_pred)
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Store the results
        evaluation_results[name] = {
            'MAE': mae,
            'MSE': mse,
            'R-squared': r2
        }
        print(f"{name} evaluation completed.")

    except Exception as e:
        print(f"Error evaluating {name}: {e}")

# Create evaluation_df
evaluation_df = pd.DataFrame(evaluation_results).T
print("\nInitial Model Evaluation Results for Tuning Selection:")
display(evaluation_df.sort_values(by='R-squared', ascending=False))


# Identify the best performing models based on R-squared from the previous evaluation
# Let's select the top 3 models for hyperparameter tuning
top_models = evaluation_df.sort_values(by='R-squared', ascending=False).head(3).index.tolist()
print(f"Top models selected for hyperparameter tuning: {top_models}")

tuned_models = {}

for model_name in top_models:
    print(f"\nPerforming hyperparameter tuning for {model_name}...")

    # Define parameter grids for each model
    param_grid = {}
    model = None

    if model_name == "Random Forest Regressor":
        model = RandomForestRegressor(random_state=42)
        param_grid = {
            'n_estimators': [100, 200],
            'max_depth': [10, 20, None],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        }
    elif model_name == "Gradient Boosting Regressor":
        model = GradientBoostingRegressor(random_state=42)
        param_grid = {
            'n_estimators': [100, 200],
            'learning_rate': [0.01, 0.1],
            'max_depth': [3, 5],
            'min_samples_split': [2, 5],
            'min_samples_leaf': [1, 2]
        }
    elif model_name == "Support Vector Regressor":
        model = SVR()
        # SVR can be computationally expensive, use a smaller subset of data or a simpler grid for initial tuning
        param_grid = {
            'C': [0.1, 1],
            'epsilon': [0.1, 0.2],
            'kernel': ['rbf']
        }
        # For SVR, consider using a smaller sample of the training data if it takes too long
        # X_train_subset, _, y_train_subset, _ = train_test_split(X_train_selected, y_train, train_size=0.1, random_state=42)
        # print(f"Using a subset of training data ({X_train_subset.shape[0]} samples) for SVR tuning.")
        # X_train_tuned = X_train_subset
        # y_train_tuned = y_train_subset
        X_train_tuned = X_train_selected # Use full data for now
        y_train_tuned = y_train # Use full data for now

    elif model_name == "Ridge":
        model = Ridge(random_state=42)
        param_grid = {
            'alpha': [0.1, 1.0, 10.0]
        }
        X_train_tuned = X_train_selected
        y_train_tuned = y_train

    else:
        print(f"Skipping tuning for {model_name} as no parameter grid is defined.")
        continue

    if model is not None and param_grid:
        # Set up GridSearchCV
        grid_search = GridSearchCV(model, param_grid, cv=3, scoring='r2', n_jobs=-1, verbose=1)

        try:
            # Perform the grid search
            if model_name == "Support Vector Regressor":
                 grid_search.fit(X_train_tuned, y_train_tuned)
            else:
                 grid_search.fit(X_train_selected, y_train)


            # Store the best model and its parameters
            tuned_models[model_name] = grid_search.best_estimator_
            print(f"Best parameters for {model_name}: {grid_search.best_params_}")
            print(f"Best R-squared for {model_name}: {grid_search.best_score_:.4f}")

        except Exception as e:
            print(f"Error during GridSearchCV for {model_name}: {e}")

print("\nHyperparameter tuning completed for selected models.")

Evaluating Models for Hyperparameter Tuning Model Selection:
Evaluating Linear Regression...
Linear Regression evaluation completed.
Evaluating Ridge...
Ridge evaluation completed.
Evaluating Lasso...
Lasso evaluation completed.
Evaluating Elastic Net...
Elastic Net evaluation completed.
Evaluating Decision Tree Regressor...
Decision Tree Regressor evaluation completed.
Evaluating Random Forest Regressor...
Random Forest Regressor evaluation completed.
Evaluating Gradient Boosting Regressor...
Gradient Boosting Regressor evaluation completed.
Evaluating Support Vector Regressor...
Support Vector Regressor evaluation completed.
Evaluating K-Neighbors Regressor...


[WinError 2] The system cannot find the file specified
  File "C:\Users\EAC\anaconda\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "C:\Users\EAC\anaconda\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\EAC\anaconda\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\EAC\anaconda\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


K-Neighbors Regressor evaluation completed.

Initial Model Evaluation Results for Tuning Selection:


Unnamed: 0,MAE,MSE,R-squared
Lasso,693.342222,817299.8,-4.9e-05
Elastic Net,693.344722,817305.0,-5.6e-05
Ridge,693.363451,817358.9,-0.000121
Linear Regression,693.363452,817358.9,-0.000121
Gradient Boosting Regressor,693.401282,817498.6,-0.000292
Random Forest Regressor,721.05651,842247.9,-0.030576
Support Vector Regressor,655.533656,891777.4,-0.09118
K-Neighbors Regressor,754.493552,985899.8,-0.206348
Decision Tree Regressor,978.238342,1773784.0,-1.170404


Top models selected for hyperparameter tuning: ['Lasso', 'Elastic Net', 'Ridge']

Performing hyperparameter tuning for Lasso...
Skipping tuning for Lasso as no parameter grid is defined.

Performing hyperparameter tuning for Elastic Net...
Skipping tuning for Elastic Net as no parameter grid is defined.

Performing hyperparameter tuning for Ridge...
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best parameters for Ridge: {'alpha': 10.0}
Best R-squared for Ridge: -0.0002

Hyperparameter tuning completed for selected models.


Save the best model

In [34]:
import pickle

# Assuming 'tuned_models' is a dictionary containing the best trained models after tuning
# and you want to save the best performing one based on some criteria (e.g., R-squared).
# In the previous cell, only Ridge was tuned. Let's select Ridge as the best tuned model for saving.
if 'Ridge' in tuned_models:
    best_model_name = 'Ridge'
    best_model = tuned_models['Ridge']

    # Define the filename for the best model
    best_model_filename = f"{best_model_name.replace(' ', '_').lower()}_best_model.pkl"

    # Save the best model to a .pkl file
    try:
        with open(best_model_filename, 'wb') as f:
            pickle.dump(best_model, f)
        print(f"Best model ('{best_model_name}') saved to '{best_model_filename}' successfully.")
    except Exception as e:
        print(f"Error saving the best model: {e}")
else:
    print("Ridge model was not found in tuned_models. Cannot save the best model.")

Best model ('Ridge') saved to 'ridge_best_model.pkl' successfully.


Save the processed dataset (including engineered features) to an Excel file for download.

In [32]:
# Save the processed DataFrame to an Excel file
output_excel_file = "processed_insurance_premium_data.xlsx"
try:
    df_encoded.to_excel(output_excel_file, index=False)
    print(f"Processed dataset saved to '{output_excel_file}' successfully.")
except Exception as e:
    print(f"Error saving processed dataset to Excel: {e}")

Processed dataset saved to 'processed_insurance_premium_data.xlsx' successfully.
