# Exploring Insurance Charge through Regression Analysis
This study uses supervised machine learning in Python to predict insurance charge amounts through regression analysis. Thirteen regression models are evaluated, covering linear regression, polynomial regression, regression with feature scaling, regularized regression, and regularized polynomial regression, in order to identify the best method for prediction.

## Table of Contents
- [Data Content](#Data-Content)
- [Importing Required Modules](#Importing-Required-Modules)
- [Exploring Dataset](#Exploring-Dataset)
- [Pre-processing Data](#Pre-processing-Data)
- [Model 1: Linear regression](#Model-1:-Linear-regression)
- [Model 2: Polynomial linear regression (degree = 2)](#Model-2:-Polynomial-linear-regression-(degree-=-2))
- [Model 3: Polynomial linear regression (degree = 3)](#Model-3:-Polynomial-linear-regression-(degree-=-3))
- [Model 4: Linear regression with feature scaling](#Model-4:-Linear-regression-with-feature-scaling)
- [Model 5: Regularized linear regression (Lasso)](#Model-5:-Regularized-linear-regression-(Lasso))
- [Model 6: Regularized polynomial linear regression (Lasso) (degree = 2)](#Model-6:-Regularized-polynomial-linear-regression-(Lasso)-(degree-=-2))
- [Model 7: Regularized polynomial linear regression (Lasso) (degree = 3)](#Model-7:-Regularized-polynomial-linear-regression-(Lasso)-(degree-=-3))
- [Model 8: Regularized linear regression (Ridge)](#Model-8:-Regularized-linear-regression-(Ridge))
- [Model 9: Regularized polynomial linear regression (Ridge) (degree = 2)](#Model-9:-Regularized-polynomial-linear-regression-(Ridge)-(degree-=-2))
- [Model 10: Regularized polynomial linear regression (Ridge) (degree = 3)](#Model-10:-Regularized-polynomial-linear-regression-(Ridge)-(degree-=-3))
- [Model 11: Regularized linear regression (ElasticNet)](#Model-11:-Regularized-linear-regression-(ElasticNet))
- [Model 12: Regularized polynomial linear regression (ElasticNet) (degree = 2)](#Model-12:-Regularized-polynomial-linear-regression-(ElasticNet)-(degree-=-2))
- [Model 13: Regularized polynomial linear regression (ElasticNet) (degree = 3)](#Model-13:-Regularized-polynomial-linear-regression-(ElasticNet)-(degree-=-3))
- [Summary](#Summary)

## Data Content 

age: age of primary beneficiary

sex: gender of the insurance contractor (female or male)

bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2). It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes or no)

region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

charges: Individual medical costs billed by health insurance
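
As a small illustration of the bmi field described above, the sketch below computes the index from weight and height and checks it against the 18.5 to 24.9 range. The helper names `bmi` and `in_ideal_range` and the sample person (70 kg, 1.75 m) are assumptions made for this example; they are not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(value: float, low: float = 18.5, high: float = 24.9) -> bool:
    """True when the index falls inside the range quoted in the data description."""
    return low <= value <= high


# Hypothetical person, used only to show the arithmetic.
example = bmi(70.0, 1.75)
print(round(example, 2))        # 22.86
print(in_ideal_range(example))  # True

# The first row of the dataset has bmi = 27.9, which is above the ideal range.
print(in_ideal_range(27.9))     # False
```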

Inspiration

Can you accurately predict insurance costs?

## Importing Required Modules

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score , mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures , LabelEncoder
from sklearn.tree import DecisionTreeRegressor , plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

sns.set()

## Exploring Dataset

In [2]:
ins = pd.read_csv("data/insurance.csv")
ins.head()

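|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |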


In [3]:
ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


## Pre-processing Data

The non-numerical labels are transformed into numerical labels as follows.

| Smoker | Label |
|----|---|
| No  | 0 |
| Yes  | 1 |

| Sex  | Label |
|----|---|
| Female  | 0 |
| Male  | 1 |

| Region  | Label |
|----|---|
| Northeast  | 0 |
| Northwest  | 1 |
| Southeast  | 2 |
| Southwest  | 3 |
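
The mappings in the tables above follow from the fact that `LabelEncoder` numbers the distinct labels in sorted order (after fitting, the sorted labels are available in its `classes_` attribute). The pure-Python sketch below mirrors that behaviour to show where the codes come from; `label_mapping` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def label_mapping(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Labels as they appear in the dataset (lower case).
print(label_mapping(["yes", "no", "no"]))
# {'no': 0, 'yes': 1}
print(label_mapping(["female", "male"]))
# {'female': 0, 'male': 1}
print(label_mapping(["southwest", "southeast", "northwest", "northeast"]))
# {'northeast': 0, 'northwest': 1, 'southeast': 2, 'southwest': 3}
```

Note that the region codes impose an order (0 to 3) on a variable that has no natural ordering, so a linear model treats them as a numeric scale.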

In [4]:
le = LabelEncoder()
ins["sex"] = le.fit_transform(ins["sex"])
ins["sex"].unique()

array([0, 1])

In [5]:
ins["smoker"] = le.fit_transform(ins["smoker"])
ins["smoker"].unique()

array([1, 0])

In [6]:
ins["region"] = le.fit_transform(ins["region"])
ins["region"].unique()

array([3, 2, 1, 0])

In [7]:
ins.head()

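|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 | 16884.924 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 | 1725.5523 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 | 4449.462 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 | 21984.47061 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 | 3866.8552 |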


In [8]:
ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   int32  
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   int32  
 5   region    1338 non-null   int32  
 6   charges   1338 non-null   float64
dtypes: float64(2), int32(3), int64(2)
memory usage: 57.6 KB


In [9]:
X = ins.drop('charges', axis=1)
y = ins['charges']

In [10]:
X.head()

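|   | age | sex | bmi | children | smoker | region |
|---|-----|-----|-----|----------|--------|--------|
| 0 | 19 | 0 | 27.9 | 0 | 1 | 3 |
| 1 | 18 | 1 | 33.77 | 1 | 0 | 2 |
| 2 | 28 | 1 | 33.0 | 3 | 0 | 2 |
| 3 | 33 | 1 | 22.705 | 0 | 0 | 1 |
| 4 | 32 | 1 | 28.88 | 0 | 0 | 1 |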


In [11]:
y.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_train shape: {}".format(y_train.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (936, 6)
X_test shape: (402, 6)
y_train shape: (936,)
y_test shape: (402,)


## Model 1: Linear regression

In [13]:
LR = LinearRegression().fit(X_train, y_train)
print("R2 train score: {}".format(LR.score(X_train, y_train)))
print("R2 test score: {}".format(LR.score(X_test, y_test)))

R2 train score: 0.7422571320172101
R2 test score: 0.7694415927057693
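
The scores reported throughout this notebook are R² values (the coefficient of determination), which is what the `score` method of scikit-learn regressors returns. As a minimal sketch of how that number is formed, the helper below computes R² from scratch; the function name `r2` and the toy values are assumptions made for illustration only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Toy targets, not taken from the insurance data.
y_true = [3.0, 5.0, 7.0, 9.0]

print(r2(y_true, y_true))                # 1.0: perfect predictions
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: always predicting the mean
print(round(r2(y_true, [2.5, 5.5, 6.5, 9.5]), 2))  # 0.95
```

A score of about 0.77 on the test set therefore means the linear model explains roughly 77% of the variation in charges around their mean.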


## Model 2: Polynomial linear regression (degree = 2)

In [14]:
Poly2_LR = make_pipeline(
    PolynomialFeatures(degree=2), 
    LinearRegression())
Poly2_LR.fit(X_train, y_train)
print("R2 train score (degree=2): {}".format(Poly2_LR.score(X_train, y_train)))
print("R2 test score (degree=2): {}".format(Poly2_LR.score(X_test, y_test)))

R2 train score (degree=2): 0.8384232211576609
R2 test score (degree=2): 0.8637557014627594


## Model 3: Polynomial linear regression (degree = 3)

In [15]:
Poly3_LR = make_pipeline(
    PolynomialFeatures(degree=3), 
    LinearRegression())
Poly3_LR.fit(X_train, y_train)
print("R2 train score (degree=3): {}".format(Poly3_LR.score(X_train, y_train)))
print("R2 test score (degree=3): {}".format(Poly3_LR.score(X_test, y_test)))

R2 train score (degree=3): 0.849434266026776
R2 test score (degree=3): 0.8553016870118224


In [16]:
# Higher-degree polynomial features risk overfitting on a comparatively small dataset:
# the feature count grows quickly with the degree, as the shapes below show
poly = PolynomialFeatures(degree=2).fit_transform(X_train)
print("Feature shape (degree=2): {}".format(poly.shape))
poly = PolynomialFeatures(degree=3).fit_transform(X_train)
print("Feature shape (degree=3): {}".format(poly.shape))

Feature shape (degree=2): (936, 28)
Feature shape (degree=3): (936, 84)
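
The feature counts above (28 and 84) can be checked by hand. With its default settings, `PolynomialFeatures` generates every monomial of total degree up to `degree`, including the constant term, so for n input features the output has C(n + degree, degree) columns. The sketch below verifies this for the six predictors used here; `n_polynomial_features` is a hypothetical helper name.

```python
from math import comb


def n_polynomial_features(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= degree in n_features variables
    (constant/bias column included)."""
    return comb(n_features + degree, degree)


for degree in (1, 2, 3):
    print(degree, n_polynomial_features(6, degree))
# 1 7
# 2 28
# 3 84
```

With 936 training rows, 84 columns leaves about 11 observations per coefficient, which helps explain why the degree 3 model gains on the training set but not on the test set.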


## Model 4: Linear regression with feature scaling

In [17]:
scaler=MinMaxScaler().fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)
LR_s = LinearRegression().fit(X_train_scaled, y_train)
print("R2 train score: {}".format(LR_s.score(X_train_scaled, y_train)))
print("R2 test score: {}".format(LR_s.score(X_test_scaled, y_test)))

R2 train score: 0.7422571320172101
R2 test score: 0.7694415927057693
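
The scores for Model 4 are identical to those of Model 1. This is expected: ordinary least squares with an intercept gives the same predictions after any min-max rescaling of the features, because the coefficients simply absorb the change of units. The sketch below illustrates this with a single feature in pure Python, using the ages and charges of the first five rows shown earlier; the helpers `simple_ols_r2` and `min_max_scale` are assumptions written for this example, not the notebook's method.

```python
def simple_ols_r2(x, y):
    """Fit y = a + b*x by least squares and return the R^2 of that fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def min_max_scale(x):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(x), max(x)
    return [(xi - lo) / (hi - lo) for xi in x]


# First five rows of the dataset (age, charges).
ages = [19, 18, 28, 33, 32]
charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

scaled_ages = min_max_scale(ages)
r2_raw = simple_ols_r2(ages, charges)
r2_scaled = simple_ols_r2(scaled_ages, charges)

# The two fits differ only by floating-point rounding.
print(abs(r2_raw - r2_scaled) < 1e-9)  # True
```

Scaling starts to matter once a penalty on the coefficients is added, as in the regularized models that follow, because the penalty depends on the units of each feature.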


## Model 5: Regularized linear regression (Lasso)

In [18]:
from sklearn.linear_model import Lasso

def PolynomialLasso(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         Lasso(**kwargs))

In [19]:
Lasso_LR=Lasso(alpha=.5).fit(X_train, y_train)
print("Train score", Lasso_LR.score(X_train, y_train))
print("Test score", Lasso_LR.score(X_test, y_test))

Train score 0.7422571133459799
Test score 0.7694414843791915


## Model 6: Regularized polynomial linear regression (Lasso) (degree = 2)

In [20]:
Lasso_Poly2 = PolynomialLasso(2, alpha=0.1, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Lasso_Poly2.fit(X_train, y_train)

print("Train score", Lasso_Poly2.score(X_train, y_train))
print("Test score", Lasso_Poly2.score(X_test, y_test))
k = Lasso_Poly2.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Lasso_Poly2.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Lasso_Poly2.steps[1][1].coef_ == 0))

Train score 0.8384231835776196
Test score 0.8637606920483564
Features all 28
Features used 27
Features NOT used 1


## Model 7: Regularized polynomial linear regression (Lasso) (degree = 3)

In [21]:
Lasso_Poly3 = PolynomialLasso(3, alpha=1, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Lasso_Poly3.fit(X_train, y_train)

print("Train score", Lasso_Poly3.score(X_train, y_train))
print("Test score", Lasso_Poly3.score(X_test, y_test))
k = Lasso_Poly3.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Lasso_Poly3.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Lasso_Poly3.steps[1][1].coef_ == 0))

Train score 0.8493280865435524
Test score 0.8573866769106779
Features all 84
Features used 80
Features NOT used 4




## Model 8: Regularized linear regression (Ridge)

In [22]:
from sklearn.linear_model import Ridge

def PolynomialRidge(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         Ridge(**kwargs))

In [23]:
Ridge_LR=Ridge(alpha=.5).fit(X_train, y_train)
print("Train score", Ridge_LR.score(X_train, y_train))
print("Test score", Ridge_LR.score(X_test, y_test))

Train score 0.7422505546331573
Test score 0.7693307567863181


## Model 9: Regularized polynomial linear regression (Ridge) (degree = 2)

In [24]:
Ridge_Poly2 = PolynomialRidge(2, alpha=0.1, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Ridge_Poly2.fit(X_train, y_train)
print("Train score", Ridge_Poly2.score(X_train, y_train))
print("Test score", Ridge_Poly2.score(X_test, y_test))
k = Ridge_Poly2.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Ridge_Poly2.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Ridge_Poly2.steps[1][1].coef_ == 0))

Train score 0.8384215247926726
Test score 0.8637749658204714
Features all 28
Features used 27
Features NOT used 1


## Model 10: Regularized polynomial linear regression (Ridge) (degree = 3)

In [25]:
Ridge_Poly3 = PolynomialRidge(3, alpha=1, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Ridge_Poly3.fit(X_train, y_train)

print("Train score", Ridge_Poly3.score(X_train, y_train))
print("Test score", Ridge_Poly3.score(X_test, y_test))
k = Ridge_Poly3.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Ridge_Poly3.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Ridge_Poly3.steps[1][1].coef_ == 0))

Train score 0.847559966712821
Test score 0.8602797338537619
Features all 84
Features used 83
Features NOT used 1


## Model 11: Regularized linear regression (ElasticNet)

In [26]:
from sklearn.linear_model import ElasticNet

def PolynomialElastic(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         ElasticNet(**kwargs))

In [27]:
Elastic_LR=ElasticNet(alpha=1, l1_ratio=0.3).fit(X_train, y_train)
print("Train score", Elastic_LR.score(X_train, y_train))
print("Test score", Elastic_LR.score(X_test, y_test))

Train score 0.33127749632613657
Test score 0.34862984284789267
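
The much lower score of Model 11 compared with the Ridge model (Model 8) is likely a consequence of how scikit-learn scales the two penalties rather than of ElasticNet itself. The objectives below are the ones given in the scikit-learn documentation, with ρ standing for `l1_ratio` and n for the number of training samples; the comparison that follows is an interpretation, not a result computed in this notebook.

$$
\text{ElasticNet:}\quad \min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\,\rho\,\lVert w\rVert_1 + \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw\rVert_2^2 + \alpha_{\text{ridge}}\,\lVert w\rVert_2^2
$$

Multiplying the ElasticNet objective by 2n puts its squared-error term on the same scale as Ridge, and the L2 part then carries the weight

$$
n\,\alpha\,(1-\rho) = 936 \times 1 \times 0.7 = 655.2 ,
$$

which is far larger than the `alpha=.5` used for Model 8. On unscaled features this shrinks the coefficients heavily, which is consistent with the drop in R² to about 0.33.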


## Model 12: Regularized polynomial linear regression (ElasticNet) (degree = 2)

In [28]:
Elastic_Poly2 = PolynomialElastic(2, alpha=1, l1_ratio=0.5, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Elastic_Poly2.fit(X_train, y_train)
print("Train score", Elastic_Poly2.score(X_train, y_train))
print("Test score", Elastic_Poly2.score(X_test, y_test))
k = Elastic_Poly2.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Elastic_Poly2.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Elastic_Poly2.steps[1][1].coef_ == 0))

Train score 0.8236641131918638
Test score 0.850662204346229
Features all 28
Features used 27
Features NOT used 1


## Model 13: Regularized polynomial linear regression (ElasticNet) (degree = 3)

In [29]:
Elastic_Poly3 = PolynomialElastic(3, alpha=1, l1_ratio=0.5, max_iter=100_000)  # max_iter must be an integer in recent scikit-learn
Elastic_Poly3.fit(X_train, y_train)
print("Train score", Elastic_Poly3.score(X_train, y_train))
print("Test score", Elastic_Poly3.score(X_test, y_test))
k = Elastic_Poly3.steps[1][1].coef_
print("Features all", len(k))
print("Features used", sum(Elastic_Poly3.steps[1][1].coef_ != 0))
print("Features NOT used", sum(Elastic_Poly3.steps[1][1].coef_ == 0))

Train score 0.8415443890964753
Test score 0.8648084174557463
Features all 84
Features used 81
Features NOT used 3


## Summary

In [30]:
col1 = pd.Series({"Model 1": round(LR.score(X_train, y_train),4), "Model 2": round(Poly2_LR.score(X_train, y_train),4), 
                  "Model 3": round(Poly3_LR.score(X_train, y_train),4), "Model 4": round(LR_s.score(X_train_scaled, y_train), 4), 
                  "Model 5": round(Lasso_LR.score(X_train, y_train),4), "Model 6": round(Lasso_Poly2.score(X_train, y_train),4), 
                  "Model 7": round(Lasso_Poly3.score(X_train, y_train),4), "Model 8": round(Ridge_LR.score(X_train, y_train),4), 
                  "Model 9": round(Ridge_Poly2.score(X_train, y_train),4), "Model 10": round(Ridge_Poly3.score(X_train, y_train),4), 
                  "Model 11": round(Elastic_LR.score(X_train, y_train),4), "Model 12": round(Elastic_Poly2.score(X_train, y_train),4), 
                  "Model 13": round(Elastic_Poly3.score(X_train, y_train),4)})
col2 = pd.Series({"Model 1": round(LR.score(X_test, y_test),4), "Model 2": round(Poly2_LR.score(X_test, y_test),4), 
                  "Model 3": round(Poly3_LR.score(X_test, y_test),4), "Model 4": round(LR_s.score(X_test_scaled, y_test),4), 
                  "Model 5": round(Lasso_LR.score(X_test, y_test),4), "Model 6": round(Lasso_Poly2.score(X_test, y_test),4), 
                  "Model 7": round(Lasso_Poly3.score(X_test, y_test),4), "Model 8": round(Ridge_LR.score(X_test, y_test),4), 
                  "Model 9": round(Ridge_Poly2.score(X_test, y_test),4), "Model 10": round(Ridge_Poly3.score(X_test, y_test),4), 
                  "Model 11": round(Elastic_LR.score(X_test, y_test),4), "Model 12": round(Elastic_Poly2.score(X_test, y_test),4), 
                  "Model 13": round(Elastic_Poly3.score(X_test, y_test),4)})  # test score, evaluated on the test split

In [31]:
df = pd.DataFrame(data={"Train score": col1, "Test Score": col2})
df

|   | Train score | Test Score |
|---|-------------|------------|
| Model 1 | 0.7423 | 0.7694 |
| Model 2 | 0.8384 | 0.8638 |
| Model 3 | 0.8494 | 0.8553 |
| Model 4 | 0.7423 | 0.7694 |
| Model 5 | 0.7423 | 0.7694 |
| Model 6 | 0.8384 | 0.8638 |
| Model 7 | 0.8493 | 0.8574 |
| Model 8 | 0.7423 | 0.7693 |
| Model 9 | 0.8384 | 0.8638 |
| Model 10 | 0.8476 | 0.8603 |


From the comparison above, the following models performed best among those listed, each reaching a test R² of about 0.864.

| Model No.  | Model Name  |
|------|-----|
| Model 2  | Polynomial linear regression (degree = 2) |
| Model 6  | Regularized polynomial linear regression (Lasso) (degree = 2) |
| Model 9  | Regularized polynomial linear regression (Ridge) (degree = 2) |
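
The selection above can be reproduced by ranking the test scores. The sketch below does this with the rounded values copied from the comparison table (Models 1 to 10 as displayed); the dictionary and variable names are assumptions made for this example.

```python
# Test R^2 values copied from the comparison table, rounded to 4 decimals.
test_scores = {
    "Model 1": 0.7694, "Model 2": 0.8638, "Model 3": 0.8553,
    "Model 4": 0.7694, "Model 5": 0.7694, "Model 6": 0.8638,
    "Model 7": 0.8574, "Model 8": 0.7693, "Model 9": 0.8638,
    "Model 10": 0.8603,
}

best_score = max(test_scores.values())
# Several models tie at this precision, so collect all of them.
best_models = [name for name, score in test_scores.items() if score == best_score]

print(best_score)   # 0.8638
print(best_models)  # ['Model 2', 'Model 6', 'Model 9']
```

The three models tie only at four decimal places; the unrounded cell outputs differ from the fifth decimal onward.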