# Insurance Charges Prediction
Download the data from: https://www.kaggle.com/teertha/ushealthinsurancedataset (insurance.csv)  

Although the dataset is small (1338 samples), a regularized regression model can still predict the insurance charge reasonably well: the model below reaches an R² of about 0.85 on the held-out test set.

Prepare the data:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("csv_data/insurance.csv")

#binary category encoding
ordinal_encoder = OrdinalEncoder()
data[['sex','smoker']] = ordinal_encoder.fit_transform(data[['sex','smoker']])

#normalization: cast the integer columns to float, then divide each numeric column by its maximum
data[['age','children']] = data[['age','children']] * 1.
data['age'] = data['age'] / data['age'].max()
data['bmi'] = data['bmi'] / data['bmi'].max()
data['children'] = data['children'] / data['children'].max()

#adding new columns: weighted averages of existing (already normalized) columns
data['age&bmi'] = data['age']*0.5 + data['bmi']*0.5
data['sex&children'] = data['sex']*0.3 + data['children']*0.7

#category encoding using one-hot encoder (one 0/1 column per region)
region_encoder = OneHotEncoder()
region_data = region_encoder.fit_transform(data[['region']]).toarray().T
for i, col in enumerate(region_data):
    data['region'+str(i)] = col

data = data.drop(['children','sex'],axis=1)
data = data.drop(['region'],axis=1)

#move the target column 'charges' (index 3) to the last position
cols = data.columns.tolist()
cols = cols[:3] + cols[4:] + cols[3:4]
new_data = data[cols]

#70/30 train/test split; the last column is the target, the rest are features
train,test = train_test_split(new_data, test_size=0.3, random_state=2)
Ytrain = train.iloc[:,-1].to_numpy()
Xtrain = train.iloc[:,:-1]
Ytest = test.iloc[:,-1].to_numpy()
Xtest = test.iloc[:,:-1]

new_data.describe()

| | age | bmi | smoker | age&bmi | sex&children | region0 | region1 | region2 | region3 | charges |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 0.61261 | 0.577139 | 0.204783 | 0.594874 | 0.304858 | 0.242152 | 0.2429 | 0.272048 | 0.2429 | 13270.422265 |
| std | 0.219531 | 0.114779 | 0.403694 | 0.129301 | 0.227742 | 0.428546 | 0.428995 | 0.445181 | 0.428995 | 12110.011237 |
| min | 0.28125 | 0.300395 | 0.0 | 0.290823 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1121.8739 |
| 25% | 0.421875 | 0.494942 | 0.0 | 0.491557 | 0.14 | 0.0 | 0.0 | 0.0 | 0.0 | 4740.28715 |
| 50% | 0.609375 | 0.572181 | 0.0 | 0.593421 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 9382.033 |
| 75% | 0.796875 | 0.652997 | 0.0 | 0.696659 | 0.44 | 0.0 | 0.0 | 1.0 | 0.0 | 16639.912515 |
| max | 1.0 | 1.0 | 1.0 | 0.914823 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 63770.42801 |
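
The preparation above scales and encodes the full dataset before splitting it. As a rough alternative sketch, the same encoding and max-scaling can be fitted on the training rows only with scikit-learn's `ColumnTransformer`, so the test rows never influence the scaling constants. The data frame `toy` is a synthetic stand-in with made-up values (only the column names follow insurance.csv), the category lists are assumptions, and the engineered `age&bmi` and `sex&children` columns are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, OrdinalEncoder

# Synthetic stand-in for insurance.csv: same column names, made-up values.
rng = np.random.default_rng(0)
n = 40
regions = ["northeast", "northwest", "southeast", "southwest"]
toy = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.uniform(16, 50, n),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["no", "yes"], n),
    "region": rng.choice(regions, n),
    "charges": rng.uniform(1000, 60000, n),
})

X = toy.drop(columns="charges")
y = toy["charges"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

# Categories are listed explicitly (an assumption about the data) so the
# encoders do not depend on which values happen to land in the training split.
preprocess = ColumnTransformer(
    [
        ("binary", OrdinalEncoder(categories=[["female", "male"], ["no", "yes"]]),
         ["sex", "smoker"]),
        # MaxAbsScaler divides by the largest absolute value, i.e. by the
        # column maximum for non-negative data, as in the cell above.
        ("scale", MaxAbsScaler(), ["age", "bmi", "children"]),
        ("region", OneHotEncoder(categories=[regions]), ["region"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Scaling constants are learned from the training rows only ...
X_train_prepared = preprocess.fit_transform(X_train)
# ... and reused unchanged on the test rows.
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the scaler is fitted on the training split, a test value can end up slightly above 1 when it exceeds the training maximum; this is expected and harmless for ridge regression.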


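Looking at the correlation matrix should help to understand the data a little better.  
After a few iterations of testing, adding the weighted-average columns 'age&bmi' and 'sex&children' helps a bit: 'age&bmi' correlates more strongly with the charges than 'age' or 'bmi' alone.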

In [2]:
corr_matrix = data.corr()
corr_matrix["charges"].sort_values(ascending=False)

charges         1.000000
smoker          0.787251
age&bmi         0.341865
age             0.299008
bmi             0.198341
sex&children    0.088137
region2         0.073982
region0         0.006349
region1        -0.039905
region3        -0.043210
Name: charges, dtype: float64
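
The values above are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. The small sketch below, on synthetic data with made-up coefficients (not the insurance data), shows the same ranking step and checks one entry against `np.corrcoef`. Pearson correlation only measures linear association, so a low value does not rule out a nonlinear or interaction effect.

```python
import numpy as np
import pandas as pd

# Synthetic example: charges are constructed so that smoking dominates
# and age adds a smaller linear effect. The coefficients are made up.
rng = np.random.default_rng(0)
n = 200
smoker = rng.integers(0, 2, n).astype(float)
age = rng.uniform(0.28, 1.0, n)
charges = 20000 * smoker + 8000 * age + rng.normal(0, 3000, n)
toy = pd.DataFrame({"smoker": smoker, "age": age, "charges": charges})

# Same ranking step as in the notebook cell.
ranking = toy.corr()["charges"].sort_values(ascending=False)
print(ranking)

# The same coefficient computed directly with NumPy.
manual = np.corrcoef(toy["smoker"], toy["charges"])[0, 1]
print(manual)
```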

Build and evaluate the model.
Use polynomial regression with L2 regularization: expand the features to degree 2, then fit a Ridge regression on the expanded features.

In [3]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge_reg", Ridge(alpha=0.1)),
])

#5-fold cross-validation on the training set (the default score for a regressor is R^2)
score = cross_val_score(polynomial_regression,Xtrain,Ytrain,cv=5)

#fit the final model once on the full training set
clf = polynomial_regression.fit(Xtrain,Ytrain)

print("Cross validation Score:",score.mean(),"+/-",score.std() * 2)
print("Test sample Score:",clf.score(Xtest,Ytest),"\n")

Cross validation Score: 0.8206528819195459 +/- 0.10226758467810575
Test sample Score: 0.8468059733423947 
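
Both scores above are coefficients of determination (R²), the default metric returned by `score` and `cross_val_score` for regressors. The sketch below uses the same pipeline on synthetic data from `make_regression` (an assumption, so the numbers do not match the insurance results) to show that `score` agrees with `r2_score`, and how an RMSE in the target's own units could be reported alongside it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression problem standing in for the insurance data.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge_reg", Ridge(alpha=0.1)),
])
model.fit(X_train, y_train)
pred = model.predict(X_test)

# R^2: fraction of the target variance explained (1.0 is a perfect fit).
r2 = r2_score(y_test, pred)
# RMSE: typical prediction error, in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, pred))

print("R^2 :", r2)
print("RMSE:", rmse)
```

On the insurance data, the RMSE would be expressed in the same currency units as `charges`, which can be easier to interpret than R² alone.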

