## 💊 Medical Cost Prediction

Given *patient data*, let's try to predict the **charges** a given patient will incur.

We will use a variety of linear regression models to make our predictions.

Data source: https://www.kaggle.com/datasets/mirichoi0218/insurance

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV

In [2]:
data = pd.read_csv('archive/insurance.csv')
data

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


### Preprocessing

In [4]:
df = data.copy()

In [7]:
print("Total missing values:", df.isna().sum().sum())

Total missing values: 0


In [9]:
print("Total non-numeric columns:", len(df.select_dtypes('object').columns))

Total non-numeric columns: 3


In [10]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'sex': array(['female', 'male'], dtype=object),
 'smoker': array(['yes', 'no'], dtype=object),
 'region': array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)}

In [11]:
# Treat the number of children as a categorical feature so it can be one-hot encoded later
df['children'] = df['children'].astype(str)

In [12]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'sex': array(['female', 'male'], dtype=object),
 'children': array(['0', '1', '3', '2', '5', '4'], dtype=object),
 'smoker': array(['yes', 'no'], dtype=object),
 'region': array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)}

In [14]:
def binary_encode(df, column, positive_value):
    df = df.copy()
    df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df

def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df
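
To make the two encodings concrete, here is a minimal sketch that applies the same ideas with pandas built-ins on a toy frame. The rows are made up for illustration and are not taken from the insurance data.

```python
import pandas as pd

# Toy frame with made-up rows, only to show the shape of the encodings
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southwest'],
})

# Binary encoding: 1 for the positive value, 0 otherwise
toy['smoker'] = (toy['smoker'] == 'yes').astype(int)

# One-hot encoding: one indicator column per distinct category
dummies = pd.get_dummies(toy['region'], prefix='reg', dtype=int)
toy = pd.concat([toy.drop('region', axis=1), dummies], axis=1)

print(toy)
```

The binary column stays a single feature, while the one-hot step replaces `region` with one `reg_*` column per category.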

In [17]:
def preprocess_inputs(df, scaler, train_size=0.7):

    # Binary encode sex and smoker columns
    df = binary_encode(df, 'sex', 'male')
    df = binary_encode(df, 'smoker', 'yes')

    # One-hot encode the children and region columns
    df = onehot_encode(df, 'children', 'ch')
    df = onehot_encode(df, 'region', 'reg')

    # Split df into X and y
    y = df['charges'].copy()
    X = df.drop('charges', axis=1).copy()

    # Scale X with the given scaler (fit on all rows, i.e. before the train/test split)
    X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, shuffle=True, random_state=123)

    return X_train, X_test, y_train, y_test
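
`preprocess_inputs` fits the scaler on every row before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering. `split_then_scale` is a hypothetical helper, not part of the notebook, and the demo data is randomly generated rather than the insurance set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

def split_then_scale(X, y, scaler, train_size=0.7, random_state=123):
    """Hypothetical variant: split first, then fit the scaler on training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, shuffle=True, random_state=random_state
    )
    # Learn the scaling statistics from the training rows only
    X_train = pd.DataFrame(
        scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns
    )
    # Reuse those statistics for the test rows
    X_test = pd.DataFrame(
        scaler.transform(X_test), index=X_test.index, columns=X_test.columns
    )
    return X_train, X_test, y_train, y_test

# Randomly generated stand-in data (assumption: not the real insurance rows)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'age': rng.integers(18, 65, size=100),
    'bmi': rng.normal(30, 6, size=100),
})
y_demo = pd.Series(rng.normal(13000, 12000, size=100), name='charges')

X_tr, X_te, y_tr, y_te = split_then_scale(X_demo, y_demo, RobustScaler())
print(X_tr.shape, X_te.shape)
```

Swapping this ordering into the notebook would likely change the recorded scores slightly, so the outputs in this notebook correspond to the original function.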

In [18]:
df

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [27]:
X_train, X_test, y_train, y_test = preprocess_inputs(df, RobustScaler(), train_size=0.7)

In [28]:
X_train

,age,sex,bmi,smoker,ch_0,ch_1,ch_2,ch_3,ch_4,ch_5,reg_northeast,reg_northwest,reg_southeast,reg_southwest
300,-0.125000,0.0,-0.339387,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
904,0.875000,-1.0,0.559690,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
670,-0.375000,0.0,0.139327,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
617,0.416667,0.0,-0.571599,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
373,-0.541667,0.0,0.297708,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1238,-0.083333,0.0,-0.916344,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1147,-0.791667,-1.0,0.181006,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
106,-0.833333,-1.0,-0.238166,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1041,-0.875000,0.0,-0.871093,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


### Training

In [29]:
models = {
    '         OLS Model': LinearRegression(),
    '          L2 Model': Ridge(),
    '          L1 Model': Lasso(),
    '  ElasticNet Model': ElasticNet(),
    '       L2 CV Model': RidgeCV(),
    '       L1 CV Model': LassoCV(),
    'ElasticNetCV Model': ElasticNetCV()
}

for model in models.values():
    model.fit(X_train, y_train)

In [30]:
print("Model R^2 Scores:\n-----------------------")

for name, model in models.items():
    print(name, model.score(X_test, y_test))

Model R^2 Scores:
-----------------------
         OLS Model 0.7593545908497942
          L2 Model 0.75936221121579
          L1 Model 0.7593785392449717
  ElasticNet Model 0.34782883553342026
       L2 CV Model 0.7593622112157778
       L1 CV Model 0.759123205682648
ElasticNetCV Model 0.07230574315404448
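
The two ElasticNet rows score well below the others. scikit-learn's `ElasticNet` defaults to `alpha=1.0` and `l1_ratio=0.5`, and `ElasticNetCV` searches its alpha path at a single default `l1_ratio=0.5`, so one plausible explanation (not verified on this dataset) is that the combined penalty is too strong. Below is a minimal sketch of widening the search. It runs on synthetic data from `make_regression`, and the `l1_ratio` and `alphas` grids are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the insurance features)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123
)

# Search several L1/L2 mixes and an explicit alpha grid that reaches small penalties
model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 0.99, 1.0],
    alphas=np.logspace(-3, 1, 30),
    cv=5,
)
model.fit(X_train, y_train)

print("chosen alpha:", model.alpha_)
print("chosen l1_ratio:", model.l1_ratio_)
print("test R^2:", model.score(X_test, y_test))
```

Whether a wider grid closes the gap on the insurance data would need to be checked by rerunning the training cell with these settings.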


R^2 scores on the test set for each scaler:

| Model | RobustScaler | MinMaxScaler | StandardScaler |
|---|---|---|---|
| OLS Model | 0.7593545908497942 | 0.7593545908497942 | 0.7593545908497942 |
| L2 Model | 0.75936221121579 | 0.7595502079958969 | 0.7593579364036089 |
| L1 Model | 0.7593785392449717 | 0.759393648596581 | 0.7593697076110314 |
| ElasticNet Model | 0.34782883553342026 | 0.3082458597207589 | 0.6722813607835507 |
| L2 CV Model | 0.7593622112157778 | 0.7595502079958958 | 0.7593579364036882 |
| L1 CV Model | 0.759123205682648 | 0.7597708751156751 | 0.760087586650097 |
| ElasticNetCV Model | 0.07230574315404448 | 0.057339655178795135 | 0.13980401601000703 |
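
A comparison like this can be produced in one loop instead of rerunning the notebook once per scaler. The sketch below is one possible way to do it, under some assumptions: it uses synthetic data from `make_regression`, a single `Ridge` model, and a scikit-learn `Pipeline`. A pipeline fits the scaler on the training rows only, which differs from `preprocess_inputs`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic regression data (assumption: stands in for the insurance features)
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123
)

# Fit the same model once per scaler and record the test R^2
scores = {}
for scaler in (RobustScaler(), MinMaxScaler(), StandardScaler()):
    pipe = make_pipeline(scaler, Ridge())
    pipe.fit(X_train, y_train)
    scores[type(scaler).__name__] = pipe.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name:>15}: {score:.4f}")
```

Nesting a second loop over the `models` dictionary would extend this to the full grid of scalers and models.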