## Objective
The objective is to build a model that predicts the cost of health insurance from a customer's demographics and habits.

In this exercise, we try several regression algorithms and compare which one gives the best result.

Reference:
https://www.kaggle.com/code/ahmetemirdundar/medical-cost-prediction/notebook

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## EDA

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Load Data
df = pd.read_csv('/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [None]:
# Basic Statistical Analysis
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


**Insight - 1**
- sex, smoker, and region are categorical features.
- age, bmi, children, and charges are numerical features.
- No null or missing values are present.
- Age IQR (75% - 25%) is 24.
- Age relative IQR (IQR / median) = 24/39 = 0.6154 = 61.5%. This means age has a wide spread. **_Tree-based models such as Random Forest and XGBoost are generally less sensitive to feature spread than linear models._**
- The mean and median of age are both ~39, suggesting minimal or no skew.

- BMI IQR (75% - 25%) is 8.3975.
- BMI relative IQR (IQR / median) = 8.3975/30.40 = 0.276 = 27.6%. Moderate variation.
- The mean and median of BMI are both ~30, suggesting minimal or no skew.

- The min-max range is 46 for age (64 - 18), about 37 for bmi (53.13 - 15.96), and about 62,649 for charges (63770.43 - 1121.87). Because the features sit on very different scales, scaling is advisable for scale-sensitive models such as linear regression.
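
The relative IQR figures above can be reproduced with a few lines of pandas. This is a minimal sketch: `relative_iqr` is a hypothetical helper (not part of the notebook), the quartiles are copied from the `df.describe()` output, and `sample` is made-up data used only to exercise the helper.

```python
import pandas as pd

def relative_iqr(series: pd.Series) -> float:
    """Return the IQR divided by the median (a unitless measure of spread)."""
    q1, q3 = series.quantile([0.25, 0.75])
    return (q3 - q1) / series.median()

# Quartiles and medians copied from the df.describe() output
age_iqr = 51.0 - 27.0
bmi_iqr = 34.69375 - 26.29625
print(f"age relative IQR: {age_iqr / 39.0:.4f}")
print(f"bmi relative IQR: {bmi_iqr / 30.4:.4f}")

# Same calculation on a small hypothetical sample
sample = pd.Series([18, 27, 39, 51, 64])
print(f"sample relative IQR: {relative_iqr(sample):.4f}")
```

On the real data, `relative_iqr(df['age'])` should give the same ratio as the manual calculation, assuming `df` is loaded as in the cell above.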

In [None]:
# Verify the skewness
print(df[['age', 'bmi', 'charges']].skew())

# Skewness interpretation:
#   ≈ 0     -> fairly symmetric
#   > 0.5   -> moderate right skew
#   > 1     -> strong right skew
#   < -0.5  -> moderate left skew
#   < -1    -> strong left skew

age        0.055673
bmi        0.284047
charges    1.515880
dtype: float64


In [None]:
# Verify outliers
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = series[(series < lower_bound) | (series > upper_bound)]
    return outliers

age_outliers = detect_outliers_iqr(df['age'])
bmi_outliers = detect_outliers_iqr(df['bmi'])
charges_outliers = detect_outliers_iqr(df['charges'])

print(f"Age outliers count: {len(age_outliers)}")
print(f"BMI outliers count: {len(bmi_outliers)}")
print(f"Charges outliers count: {len(charges_outliers)}")

# Interpretation:
# If the count of outliers is significant, it may indicate that the data has extreme values
# that could affect the model's performance. Further investigation is needed to determine
# whether to remove or treat these outliers.


Age outliers count: 0
BMI outliers count: 9
Charges outliers count: 139


**Insight - 2**
- The target feature (charges) is strongly right-skewed (skewness ≈ 1.52) and has 139 IQR outliers. Hence, a log transformation of the target (for example `np.log1p`) is recommended.
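
To illustrate why a log transformation helps with a right-skewed target, here is a small sketch on synthetic data. The lognormal sample `synthetic_charges` is an assumption standing in for `charges`; it is not the insurance dataset, so the printed values will differ from the skewness reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed target, standing in for `charges`
synthetic_charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000))

skew_raw = synthetic_charges.skew()
skew_log = np.log1p(synthetic_charges).skew()
print(f"skew before log1p: {skew_raw:.3f}")
print(f"skew after  log1p: {skew_log:.3f}")

# np.expm1 inverts np.log1p, so predictions made on the log scale
# can be mapped back to the original scale
round_trip = np.expm1(np.log1p(synthetic_charges))
print(np.allclose(round_trip, synthetic_charges))
```

The same two calls, `np.log1p` before fitting and `np.expm1` after predicting, are what one would apply to `df['charges']`.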


#### Perform Categorical Analysis

In [None]:
print(df['sex'].value_counts(normalize=True) * 100)
print(df['smoker'].value_counts(normalize=True) * 100)
print(df['region'].value_counts(normalize=True) * 100)

sex
male      50.523169
female    49.476831
Name: proportion, dtype: float64
smoker
no     79.521674
yes    20.478326
Name: proportion, dtype: float64
region
southeast    27.204783
southwest    24.289985
northwest    24.289985
northeast    24.215247
Name: proportion, dtype: float64


**Insight - 3**
- The sex and region classes are roughly evenly distributed.
- The smoker feature is imbalanced (about 80% non-smokers vs. 20% smokers). Hence, consider stratifying on smoker when splitting into train and test sets.
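
As a quick check of what stratification does, the sketch below splits a toy frame with an 80/20 smoker ratio. The `toy` DataFrame is hypothetical and only mimics the class balance reported above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with roughly the 80/20 smoker split reported above
toy = pd.DataFrame({
    "smoker": ["no"] * 80 + ["yes"] * 20,
    "age": list(range(100)),
})

train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["smoker"]
)

train_share = (train["smoker"] == "yes").mean()
test_share = (test["smoker"] == "yes").mean()
print(f"smoker share in train: {train_share:.2f}")
print(f"smoker share in test:  {test_share:.2f}")
```

With `stratify` set, both subsets keep the original smoker proportion; without it, a small test set could end up with noticeably more or fewer smokers by chance.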

## Preprocessing the data

In [23]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load Data
df = pd.read_csv('/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv')
df.head()

# Fix missing values
## No missing or null values
print(df.info())

# Features & target
X = df.drop(columns=["charges"])
y = df["charges"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=X["smoker"]
                                                    )


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None


## Linear Regression

In [25]:
"""
Insurance Cost Prediction using Linear Regression with a Scikit-Learn Pipeline
------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, a log1p transformation of the numerical
features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with Linear Regression.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from sklearn.linear_model import LinearRegression

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
# Separate numerical and categorical feature names
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------

# 1. Numerical Transformer:
#    - Fill missing values with median
#    - Apply log transformation to reduce skewness
#    - Standardize features (mean=0, variance=1)
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, validate=True)),
    ("scaler", StandardScaler())
])

# 2. Categorical Transformer:
#    - Apply OneHotEncoding to handle categorical variables
#    - Ignore unknown categories at prediction time
cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# 3. Combine Numerical & Categorical Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
# Combine preprocessing and regression into a single pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# -----------------------------
# Model Training
# -----------------------------
# Note: in this run the model is fitted on the target's original scale.
# To log-transform the target instead, fit on np.log1p(y_train):
# y_train_log = np.log1p(y_train)
# model.fit(X_train, y_train_log)
# Train the pipeline on training data
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
# Predict on test set
# Note: if the model was fitted on the log-transformed target,
#       map predictions back to the original scale with np.expm1:
# y_pred_log = model.predict(X_test)
# y_pred = np.expm1(y_pred_log)
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

# Print metrics
for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 32110691.4003
Root Mean Squared Error: 5666.6296
Mean Absolute Error: 4044.3056
Mean Absolute Percentage Error: 0.4177
R-Squared: 0.7823
Explained Variance: 0.7824
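
The commented-out lines in the cell above hint at log-transforming the target by hand. One alternative, sketched here under assumptions, is scikit-learn's `TransformedTargetRegressor`, which applies `np.log1p` before fitting and `np.expm1` after predicting. The data (`X_demo`, `y_demo`) is synthetic and chosen so that the target grows exponentially; the scores say nothing about how the insurance data would behave.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical data: the target grows exponentially with the single feature
X_demo = rng.uniform(0, 5, size=(200, 1))
y_demo = np.exp(1.0 + 0.8 * X_demo[:, 0] + rng.normal(0, 0.1, size=200))

# Fits LinearRegression on log1p(y) and applies expm1 to its predictions
log_target_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_target_model.fit(X_demo, y_demo)
plain_model = LinearRegression().fit(X_demo, y_demo)

r2_log = log_target_model.score(X_demo, y_demo)
r2_raw = plain_model.score(X_demo, y_demo)
print(f"R^2 with log target: {r2_log:.3f}")
print(f"R^2 with raw target: {r2_raw:.3f}")
```

In the notebook's pipeline, the same wrapper could take `model` as its `regressor`, so that `predict` returns charges on the original scale without a manual `np.expm1` step.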


## Decision Tree

In [26]:
"""
Insurance Cost Prediction using Decision Tree Regressor with a Scikit-Learn Pipeline
-----------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with Decision Tree Regressor.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from sklearn.tree import DecisionTreeRegressor

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
# Separate numerical and categorical feature names
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------

# 1. Numerical Transformer:
#    - Fill missing values with median
#    - Standardize features (optional for trees, but kept for consistency)
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# 2. Categorical Transformer:
#    - Apply OneHotEncoding to handle categorical variables
#    - Ignore unknown categories at prediction time
cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# 3. Combine Numerical & Categorical Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
# Combine preprocessing and regression into a single pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", DecisionTreeRegressor(random_state=42, max_depth=6))
])

# -----------------------------
# Model Training
# -----------------------------
# Train the pipeline on training data
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

# Print metrics
for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 21488822.8615
Root Mean Squared Error: 4635.6038
Mean Absolute Error: 2621.6771
Mean Absolute Percentage Error: 0.3361
R-Squared: 0.8543
Explained Variance: 0.8561
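
The preprocessing comment in the cell above notes that standardization is optional for trees. The sketch below illustrates this on synthetic data: a decision tree splits on the ordering of feature values, so an affine rescaling should leave its predictions unchanged. `X_raw` and `y_toy` are hypothetical values loosely shaped like age and bmi, not the insurance data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Hypothetical features loosely shaped like age and bmi
X_raw = np.column_stack([
    rng.integers(18, 65, size=300),
    np.round(rng.uniform(16, 53, size=300), 1),
]).astype(float)
y_toy = 250 * X_raw[:, 0] + 300 * X_raw[:, 1] + rng.normal(0, 500, size=300)

X_scaled = StandardScaler().fit_transform(X_raw)

# Same hyperparameters and seed, fitted once on raw and once on scaled features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_raw, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_scaled, y_toy)

same = np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print(f"identical predictions with and without scaling: {same}")
```

Keeping the scaler in the tree pipelines is therefore harmless, and it keeps the preprocessing identical across all models being compared.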


## Random Forest

In [27]:
"""
Insurance Cost Prediction using Random Forest Regressor with a Scikit-Learn Pipeline
-----------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with Random Forest Regressor.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from sklearn.ensemble import RandomForestRegressor

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
# Separate numerical and categorical feature names
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------

# 1. Numerical Transformer:
#    - Fill missing values with median
#    - Standardize features (not required for trees, but kept for consistency)
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# 2. Categorical Transformer:
#    - Apply OneHotEncoding to handle categorical variables
#    - Ignore unknown categories at prediction time
cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# 3. Combine Numerical & Categorical Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
# Combine preprocessing and regression into a single pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(
        n_estimators=200,       # number of trees
        max_depth=10,           # limit depth to control overfitting
        random_state=42,
        n_jobs=-1               # use all CPU cores
    ))
])

# -----------------------------
# Model Training
# -----------------------------
# Train the pipeline on training data
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

# Print metrics
for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 21999872.9028
Root Mean Squared Error: 4690.4022
Mean Absolute Error: 2787.3832
Mean Absolute Percentage Error: 0.3733
R-Squared: 0.8509
Explained Variance: 0.8542


## AdaBoost

In [28]:
"""
Insurance Cost Prediction using AdaBoost Regressor with a Scikit-Learn Pipeline
-------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with AdaBoost Regressor.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
# Separate numerical and categorical feature names
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------

# 1. Numerical Transformer:
#    - Fill missing values with median
#    - Standardize features (optional for trees, but included for consistency)
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# 2. Categorical Transformer:
#    - Apply OneHotEncoding to handle categorical variables
#    - Ignore unknown categories at prediction time
cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# 3. Combine Numerical & Categorical Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
# Use AdaBoost Regressor with DecisionTreeRegressor as base estimator
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", AdaBoostRegressor(
        estimator=DecisionTreeRegressor(max_depth=4),
        n_estimators=200,
        learning_rate=0.05,
        random_state=42
    ))
])

# -----------------------------
# Model Training
# -----------------------------
# Train the pipeline on training data
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

# Print metrics
for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 27208408.7183
Root Mean Squared Error: 5216.1680
Mean Absolute Error: 4360.7773
Mean Absolute Percentage Error: 0.7847
R-Squared: 0.8156
Explained Variance: 0.8620


## Gradient Boosting

In [29]:
"""
Insurance Cost Prediction using Gradient Boosting Regressor with a Scikit-Learn Pipeline
----------------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with Gradient Boosting Regressor.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from sklearn.ensemble import GradientBoostingRegressor

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", GradientBoostingRegressor(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=4,
        random_state=42
    ))
])

# -----------------------------
# Model Training
# -----------------------------
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 21701033.3464
Root Mean Squared Error: 4658.4368
Mean Absolute Error: 2681.1650
Mean Absolute Percentage Error: 0.3393
R-Squared: 0.8529
Explained Variance: 0.8541


## XGBoost

In [30]:
"""
Insurance Cost Prediction using XGBoost Regressor with a Scikit-Learn Pipeline
-------------------------------------------------------------------------------

This script demonstrates how to build a machine learning pipeline for regression
tasks (predicting insurance charges). It includes preprocessing steps for
numerical and categorical features, and model evaluation with regression metrics.

Key Steps:
1. Preprocessing numerical & categorical features.
2. Building a full ML pipeline with XGBoost Regressor.
3. Training & predicting on train/test data.
4. Evaluating performance using regression metrics.
"""

# -----------------------------
# Imports
# -----------------------------
import numpy as np
import pandas as pd

# Preprocessing & pipeline tools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model
from xgboost import XGBRegressor

# Evaluation metrics (for regression)
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
    explained_variance_score
)

# -----------------------------
# Load Dataset
# -----------------------------
# Example: load your insurance dataset
# df = pd.read_csv("/content/drive/MyDrive/edurekaai/_data/insurance_cost.csv")

# Define features and target
# X = df.drop(columns=["charges"])
# y = df["charges"]

# Assume X_train, X_test, y_train, y_test are already created
# (e.g., using train_test_split)

# -----------------------------
# Feature Types
# -----------------------------
numerical_features = X_train.select_dtypes(exclude=["object", "category"]).columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

# -----------------------------
# Preprocessing Pipelines
# -----------------------------
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numerical_features),
        ("cat", cat_transformer, categorical_features)
    ]
)

# -----------------------------
# Full Model Pipeline
# -----------------------------
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", XGBRegressor(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=4,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        n_jobs=-1
    ))
])

# -----------------------------
# Model Training
# -----------------------------
model.fit(X_train, y_train)

# -----------------------------
# Prediction
# -----------------------------
y_pred = model.predict(X_test)

# -----------------------------
# Model Evaluation
# -----------------------------
regression_metrics = {
    "Mean Squared Error": mean_squared_error(y_test, y_pred),
    "Root Mean Squared Error": np.sqrt(mean_squared_error(y_test, y_pred)),
    "Mean Absolute Error": mean_absolute_error(y_test, y_pred),
    "Mean Absolute Percentage Error": mean_absolute_percentage_error(y_test, y_pred),
    "R-Squared": r2_score(y_test, y_pred),
    "Explained Variance": explained_variance_score(y_test, y_pred)
}

for metric_name, metric_value in regression_metrics.items():
    print(f"{metric_name}: {metric_value:.4f}")


Mean Squared Error: 20670567.0032
Root Mean Squared Error: 4546.4895
Mean Absolute Error: 2716.9160
Mean Absolute Percentage Error: 0.3478
R-Squared: 0.8599
Explained Variance: 0.8608
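
The six cells above repeat the same preprocessing and are compared on a single train/test split. As a possible follow-up, the sketch below builds the pipeline in one hypothetical helper (`make_pipeline_for`) and ranks a few candidates by 5-fold cross-validated R². It runs on synthetic data (`X_syn`, `y_syn`) that only borrows the column names, so its scores are not comparable to the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 400
# Hypothetical stand-in for the insurance data (same column names, synthetic values)
X_syn = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.uniform(16, 53, size=n).round(1),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
y_syn = (
    250 * X_syn["age"]
    + 300 * X_syn["bmi"]
    + 20000 * (X_syn["smoker"] == "yes")
    + rng.normal(0, 2000, size=n)
)

def make_pipeline_for(regressor):
    """Build a fresh preprocessing + model pipeline for one regressor."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["age", "bmi"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
    ])
    return Pipeline(steps=[("preprocessor", preprocessor), ("regressor", regressor)])

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42),
}

# Mean R^2 over 5 folds for each candidate
cv_r2 = {
    name: cross_val_score(make_pipeline_for(reg), X_syn, y_syn, cv=5, scoring="r2").mean()
    for name, reg in candidates.items()
}
for name, score in sorted(cv_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean CV R^2 = {score:.4f}")
```

Averaging over folds makes the ranking less dependent on one particular split, which matters here because the single-split R² values of the tree-based models above differ only in the second or third decimal place.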
