<center><h2>Medical Cost Personal Datasets</h2></center>


<center>Dataset Link <br><a href='https://www.kaggle.com/mirichoi0218/insurance'>Medical Cost Personal Datasets</a></center>

<b>Context</b><br>
<p style='font-family:verdana'>
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are borrowing the book from a library or a friend. All of these datasets are in the public domain but needed some cleaning and recoding to match the format in the book.
</p>
<b>Content</b><br>
<p style='font-family:verdana'>
    <b>Columns</b><br>
1. age:      Age of the primary beneficiary<br>
2. sex:      Gender of the insurance contractor (female, male)<br>
3. bmi:      Body mass index, an objective index of body weight relative to height (kg / m ^ 2);
              it indicates whether weight is relatively high or low for a given height, ideally 18.5 to 24.9<br>
4. children: Number of children covered by health insurance / number of dependents<br>
5. smoker:   Whether the beneficiary smokes<br>
6. region:   The beneficiary's residential area in the US: northeast, southeast, southwest, northwest<br>
7. charges:  Individual medical costs billed by health insurance<br>
    
</p>    
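
As a quick check of the BMI definition above (weight in kilograms divided by height in metres squared), here is a minimal sketch. The function name `bmi` and the sample person are illustrative only and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 1))  # 22.9
```

This value falls inside the 18.5 to 24.9 range described above.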

> Goal: Predict the insurance cost.

# STEP 1: Loading The Dataset

In [None]:
## Basic Libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
dark = sns.color_palette('dark')
bright = sns.color_palette('bright')
deep = sns.color_palette('deep')
pastel = sns.color_palette('pastel')

## Style to be used in plots
plt.style.use("ggplot")

import plotly.graph_objects as go
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Importing the dataset
df =pd.read_csv("../input/insurance/insurance.csv")


# Look at the dataset
df.head()

# STEP 2: EDA

## Basic EDA

In [None]:
## Checking The Shape of the data (Rows, Column)
df.shape

In [None]:
## Concise summary of a DataFrame.
df.info()

In [None]:
##  Description of data
df.describe()

In [None]:
## Checking For Any Null Values
df.isnull().sum()

Hurray 🎉, there are no null values.

## Checking Distribution of Each Column

In [None]:
## Numerical Columns
df.hist(bins=20,figsize=(20,10));

In [None]:
## Categorical Columns
categorical_columns = [feature for feature in df.columns if df[feature].dtype=='O']
for col in categorical_columns:
    sns.countplot(x=col, data=df)
    labels = (df[col].value_counts() / len(df))*100
    plt.title(col)
    plt.xlabel(f'{labels}')
    plt.show()
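
The x-axis label in the loop above prints the share of each category. A minimal sketch of the same calculation on a tiny made-up column (the values are hypothetical, not taken from `insurance.csv`), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical feature such as `smoker`
toy = pd.Series(["no", "no", "yes", "no"], name="smoker")

# normalize=True returns fractions; multiply by 100 for percentages
share = toy.value_counts(normalize=True) * 100
print(share.round(1))
```

This gives the same numbers as dividing `value_counts()` by `len(df)`, in one call.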

## Visualizing Relationship Between Features and Dependent Variable (charges)

In [None]:
## Prints All The Column Names In a List
df.columns

### Age vs Charges

In [None]:
df.groupby('age')['charges'].mean().plot()

**The graph above shows that insurance charges increase with age, which is plausible because older people generally carry higher health risks.**

In [None]:
## Age vs BMI
plt.figure(figsize=(17,7))
sns.lineplot(data=df,x="age",y="bmi",hue="sex",palette='dark')
plt.title("Body mass index with the Age")
plt.show()

### Sex vs Charges

In [None]:
temp = df.groupby('sex')['charges'].mean()
temp.plot(kind='bar',color=['pink','brown'])

### Smoker vs Charges

In [None]:
sns.barplot(data=df,x='smoker',y='charges',hue='sex')

### BMI vs Charges

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(data=df,x='bmi',y='charges')

### Children vs Charges

In [None]:
sns.catplot(x="children", y="charges",kind="swarm", data=df,height=10)

### Regions vs Charges

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(data=df,x='region',y='charges')

### Regions With Amount of People

In [None]:
plt.figure(figsize=(12,5))
ax = sns.countplot(data=df,x='region')
ax.bar_label(ax.containers[0])

### Region vs Smoker 
→ Insurance Charge Based on Region and Their Habit of Smoking

In [None]:
plt.figure(figsize=(12,5))
ax = sns.barplot(data=df,x='region',y='charges',hue='smoker')

### Region With Oldest People (Age >50)

In [None]:
ax = df[df['age']>50]['region'].value_counts().plot.barh(color=pastel,figsize=(10,8))
plt.title('Regions With Oldest People')

## Finding Relationship Between Multiple Features and Charges

In [None]:
sns.pairplot(df)

## Sex, Smoker, Region vs Charges

In [None]:
## Making a Group 
temp=df.groupby(["sex","smoker","region"])["charges"].mean().round(2)
ax = temp.plot(kind="bar", figsize=(20,7),color=pastel)
ax.bar_label(ax.containers[0])
plt.title('Person With Average Charges Based On Sex, Region and Smoking Habit');

In [None]:
df.columns

In [None]:
## Plotly creates its own figure, so no plt.figure() call is needed
px.scatter(data_frame=df,
           x='bmi', 
           y='charges',
           color="sex",
           size="children",
           symbol='smoker',
           hover_name='region',
           text='age',
           title='Group Information Of Insurance Data On Different Scatter Points')

# Results From The Analysis 📊

* The dataset doesn't have any missing values.
* BMI (body mass index) follows a close-to-Gaussian distribution.
* There are three categorical columns - ['sex', 'smoker', 'region']
* There are four numerical columns - ['age','bmi','children','charges']
* As the age of a person increases, insurance charges also increase.
* A smoker has higher insurance charges than a non-smoker.
* Males smoke more than females.
* Insurance charges for males and females are mostly similar.
* People with fewer children are more likely to opt for insurance.
* Insurance charges are also similar across regions.
* The southeast region has more smokers and older people than the other regions.

# STEP 3: Feature Engineering

In [None]:
## Handling Categorical Data
categorical_columns

Label encoding is a technique in which each category of a categorical variable is given a numerical label (0, 1, 2, 3, ...). We will use this method to convert our categories into numerical form.

In [None]:
## Label Encoding All Categorical Columns
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])
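
`LabelEncoder` assigns integers in sorted (alphabetical) order of the class labels. Because the loop above reuses a single encoder, only the mapping of the last column remains on `le` afterwards. Here is a minimal sketch, on a made-up frame, of keeping one encoder per column so that each mapping can be inspected later. The `toy` frame and the `encoders` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows using the category values listed in the column description
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
})

encoders = {}  # one fitted encoder per column
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Show which integer each category received
for col, enc in encoders.items():
    mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
    print(col, mapping)
```

Keeping the fitted encoders also makes it possible to call `inverse_transform` later to recover the original labels.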

In [None]:
df.head()

# STEP 4: Data Splitting

In [None]:
## Data splitting
X = df.drop('charges',axis=1).values
y = df['charges'].values

In [None]:
X

In [None]:
y

In [None]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
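
With `test_size=0.25`, one quarter of the rows is held out for testing, and `random_state` makes the shuffle repeatable. A small sketch on made-up arrays (100 hypothetical rows with 6 features, matching the number of feature columns here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 6 feature columns
X_demo = np.arange(600).reshape(100, 6)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # (75, 6) (25, 6)

# The same random_state reproduces the same split
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
print(np.array_equal(X_tr, X_tr2))  # True
```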

In [None]:
X_train.shape

In [None]:
X_test.shape

# STEP 5: Model Training
We will first train models from several families with their default parameters, then tune whichever gives the highest test score to improve the results further.

In [None]:
## Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression  

linreg=LinearRegression()
linreg.fit(X_train,y_train)

print("Score the X-train with Y-train is : ", linreg.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", linreg.score(X_test,y_test))
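
For regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal sketch on made-up numbers, checking a hand calculation against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for very poor fits.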

### Ridge Regression

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(X_train,y_train)

print("Score the X-train with Y-train is : ", ridge.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", ridge.score(X_test,y_test))

Well, our linear regression models underfit the data, so let's try tree-based models.

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
## 'squared_error' replaces the older 'mse' name (scikit-learn >= 1.0)
dtr = DecisionTreeRegressor(criterion='squared_error',splitter='best',random_state=42)
dtr.fit(X_train,y_train)


print("Score the X-train with Y-train is : ", dtr.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", dtr.score(X_test,y_test))

y_pred = dtr.predict(X_test)
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))
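
The error printed at the end of the cell above is computed on log-transformed values, so it measures relative rather than absolute error. A minimal sketch with made-up numbers; it assumes all values are strictly positive, since `np.log` is undefined otherwise:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical charges and predictions (must be > 0 for np.log)
y_true = np.array([1000.0, 10000.0])
y_hat = np.array([2000.0, 20000.0])

# Plain MSE is dominated by the large-valued row
mse_raw = mean_squared_error(y_true, y_hat)

# MSE on logs treats both rows equally: each prediction is off by a factor of 2
mse_log = mean_squared_error(np.log(y_true), np.log(y_hat))

print(mse_raw, mse_log, np.log(2) ** 2)
```

Both rows contribute the same squared log error, which is why this number is not comparable to an MSE in the original units of `charges`.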

**Decision Tree is Overfitting, But We Can Improve It Using Ensemble Methods**
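
One common way to see this kind of overfitting is to compare train and test scores for an unrestricted tree and a depth-limited one. The sketch below uses synthetic data from `make_regression`, not the insurance data, and `max_depth=3` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression problem used only for illustration
X_syn, y_syn = make_regression(n_samples=300, n_features=6, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

scores = {}
for depth in (None, 3):  # None lets the tree grow without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A near-perfect train score next to a clearly lower test score is the usual sign that the tree has memorised the training rows.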

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100,random_state=42)
rfr.fit(X_train,y_train)


print("Score the X-train with Y-train is : ", rfr.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", rfr.score(X_test,y_test))

y_pred = rfr.predict(X_test)
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))

### Extra Trees Regressor

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor(n_estimators=100,random_state=42)
etr.fit(X_train,y_train)


print("Score the X-train with Y-train is : ", etr.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", etr.score(X_test,y_test))

y_pred =etr.predict(X_test)
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))

**Our bagging models now give good accuracy. Let's try boosting models and see whether they perform better.**

### AdaBoost Regressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor
abr = AdaBoostRegressor(random_state=42)
abr.fit(X_train,y_train)

print("Score the X-train with Y-train is : ", abr.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", abr.score(X_test,y_test))

y_pred = abr.predict(X_test)
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))

### Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train,y_train)

print("Score the X-train with Y-train is : ", gbr.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", gbr.score(X_test,y_test))

y_pred = gbr.predict(X_test)
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))

### XGB Regressor

In [None]:
from xgboost import XGBRegressor
xgb=XGBRegressor(random_state=42)

xgb.fit(X_train,y_train)

print("Score the X-train with Y-train is : ", xgb.score(X_train,y_train))
print("Score the X-test  with Y-test  is : ", xgb.score(X_test,y_test))

**Since we got the highest score from the `Gradient Boosting Regressor`, we will tune its parameters to improve it further.**

#### Hyperparameter Tuning On GB



In [None]:
from sklearn.model_selection import GridSearchCV
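## Only valid values are kept: subsample must lie in (0, 1], min_samples_split
## must be at least 2, and a float min_samples_leaf must be a fraction (so 1.5 is invalid)
param_grid = {'learning_rate':[0.5,0.1,0.01],
              'n_estimators':[25,50,75,100,125],
              'max_depth':[5,7,9,10],
              'subsample':[1.0],
              'min_samples_split':[2,3],
              'min_samples_leaf':[1,2]
             }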

g_search = GridSearchCV(estimator = gbr, param_grid = param_grid,cv = 3, n_jobs = 1,verbose = True, return_train_score=True)
g_search.fit(X_train, y_train);

print(g_search.best_params_)
print(g_search.score(X_test, y_test))
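
Grid search fits one model per parameter combination per fold, so the cost grows multiplicatively with each list in the grid. A minimal sketch that counts the fits with `ParameterGrid` for a small hypothetical grid (the values below are illustrative, not the grid used above):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical, deliberately small grid
demo_grid = {
    "learning_rate": [0.1, 0.01],
    "n_estimators": [50, 100],
    "max_depth": [3, 5],
}
cv_folds = 3

n_candidates = len(ParameterGrid(demo_grid))  # 2 * 2 * 2 combinations
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 8 24
```

Counting the fits before calling `GridSearchCV.fit` gives a rough idea of how long the search will take.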

**Accuracy improved by 0.19%.**

In [None]:
y_pred = g_search.predict(X_test)

In [None]:
sns.residplot(x=y_test, y=y_pred)

In [None]:
print("R2 Score Gradient Boost Regressor" ,r2_score(y_test,y_pred))

In [None]:
print("MSE (log scale): " ,mean_squared_error(np.log(y_test),np.log(y_pred)))

### Predicting For New Data

Let's Give Some Input Based On Our Analysis

* New Input Data 1 : [61,1,35,3,1,2]
* New Input Data 2: [19,0,23,0,0,0]



In [None]:
data = [61,1,35,3,1,2]
new_data = pd.DataFrame([data],columns=['age', 'sex', 'bmi', 'children', 'smoker', 'region'])
g_search.predict(new_data)

In [None]:
data = [19,0,23,0,0,0]
new_data = pd.DataFrame([data],columns=['age', 'sex', 'bmi', 'children', 'smoker', 'region'])
g_search.predict(new_data)
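
The integer codes in these inputs follow the label encoding from Step 3 (for example `1` for male, `1` for smoker, `2` for southeast). Here is a minimal sketch of a helper that builds such a row from readable values. `encode_person` and the mapping dictionaries are hypothetical names, and the codes assume the alphabetical ordering that `LabelEncoder` produces for this dataset's categories.

```python
import pandas as pd

# Assumed codes, matching alphabetical label encoding of each column
SEX = {"female": 0, "male": 1}
SMOKER = {"no": 0, "yes": 1}
REGION = {"northeast": 0, "northwest": 1, "southeast": 2, "southwest": 3}
COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]

def encode_person(age, sex, bmi, children, smoker, region):
    """Return a one-row DataFrame in the column order the model was trained on."""
    row = [age, SEX[sex], bmi, children, SMOKER[smoker], REGION[region]]
    return pd.DataFrame([row], columns=COLUMNS)

new_row = encode_person(61, "male", 35, 3, "yes", "southeast")
print(new_row.values.tolist())  # [[61, 1, 35, 3, 1, 2]]
```

The resulting frame can be passed to `g_search.predict(...)` in place of `new_data`. An unknown category raises a `KeyError` instead of silently producing a wrong code.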

You can see from both new inputs:

-> If a person is older, smokes, and is male, the charges will be high.

-> If a person is younger, doesn't smoke, and is female, the charges will be low.

Our model predicts well and agrees with the analysis.

Uncomment the code below to test with new sample inputs.

In [None]:
# age = int(input("Enter Your Age \n"))
# sex = int(input("What's Your Gender (1: Male, 0: Female) \n"))
# bmi = float(input("Enter Your Body Mass Index \n"))
# children = int(input("How Many Children Do You Have (If None, Enter 0) \n"))
# smoker = int(input("Do You Smoke? (1: Yes, 0: No) \n"))
# region = int(input("What's Your Region (northeast: 0, northwest: 1, southeast: 2, southwest: 3) \n"))

# data = [age,sex,bmi,children,smoker,region]
# new_data = pd.DataFrame([data],columns=['age', 'sex', 'bmi', 'children', 'smoker', 'region'])
# g_search.predict(new_data)

## FUTURE WORK

* Improve the accuracy with another hyperparameter tuning technique, such as Bayesian optimization.
* Try splitting the data 90-10 or 70-30 and see whether the accuracy changes.
* Try hyperparameter tuning on XGB, RF, and AdaBoost and see whether performance improves.
* Try adding other efficient models such as LightGBM and CatBoost.
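
As a starting point for the split-ratio idea above, here is a sketch that compares several `test_size` values. It uses synthetic data from `make_regression` and a default `GradientBoostingRegressor`, so the numbers say nothing about the insurance data itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only
X_syn, y_syn = make_regression(n_samples=400, n_features=6, noise=20.0, random_state=42)

results = {}
for test_size in (0.10, 0.25, 0.30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=test_size, random_state=42
    )
    model = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
    results[test_size] = model.score(X_te, y_te)  # R2 on the held-out part

for test_size, r2 in results.items():
    print(f"test_size={test_size:.2f}: test R2={r2:.3f}")
```

Swapping `X_syn` and `y_syn` for the notebook's `X` and `y` would run the same comparison on the real data.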

## VOTE
* Give an upvote 🙌 if you liked the notebook

### CONNECT WITH ME

[LinkedIN](https://www.linkedin.com/in/abhayparashar31/) | [Medium](https://medium.com/@abhayparashar31) | [Twitter](https://twitter.com/abhayparashar31) | [Github](https://github.com/Abhayparashar31)

**HOPE TO SEE YOU IN MY NEXT KAGGLE NOTEBOOK 😀**