
## Project Name: Medical Cost (Insurance Cost) Regression


### What is the objective of the machine learning model?

The objective of this model is to predict the insurance cost for a customer
from their personal information: age, sex, BMI, number of children,
whether they smoke, and the region they live in.

### How do I download the dataset?

The dataset is available on Kaggle at https://www.kaggle.com/mirichoi0218/insurance .
Download `insurance.csv` and place it in the same directory as this notebook.


In [1]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

import pickle

## Import the dataset

In [2]:
df = pd.read_csv('insurance.csv')

In [3]:
df.head()

index,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Data Preprocessing

In [3]:
# features: every column except the target
X = df.drop(columns='charges')
# target: the insurance cost
y = df['charges']

In [29]:
X.head()

index,age,sex,bmi,children,smoker,region
0,19,female,27.9,0,yes,southwest
1,18,male,33.77,1,no,southeast
2,28,male,33.0,3,no,southeast
3,33,male,22.705,0,no,northwest
4,32,male,28.88,0,no,northwest


In [33]:
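# create the transformer for X: one-hot encode the categorical columns,
# standardize age and bmi, and pass 'children' through unchanged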
X_col_trans = make_column_transformer(
    (OneHotEncoder(drop='first'), ['sex','smoker','region']),
    (StandardScaler(), ['age','bmi']),
    remainder='passthrough'
)
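To make the effect of the transformer concrete, here is a minimal sketch that applies the same column transformer to a few toy rows. The `toy` frame and the name `col_trans` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows in the same column layout as insurance.csv (illustration only).
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

col_trans = make_column_transformer(
    (OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
    (StandardScaler(), ["age", "bmi"]),
    remainder="passthrough",
)

encoded = col_trans.fit_transform(toy)

# Column count for this toy frame:
#   sex (2 categories)    -> 1 column after drop='first'
#   smoker (2 categories) -> 1 column
#   region (3 categories) -> 2 columns
#   age, bmi (scaled)     -> 2 columns
#   children (passthrough)-> 1 column
print(encoded.shape)
```

The full dataset contains four regions, so there the `region` column would expand to three columns instead of two.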

## Create the Random Forest model

In [36]:
# create the Random Forest model using the best hyperparameters
# found by grid search (n_estimators=35, max_depth=5)

regressor = RandomForestRegressor(n_estimators=35, max_depth=5)
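The grid search itself is not shown in this notebook. The following is a sketch of how such a search could be run over the whole pipeline with `GridSearchCV`. The candidate values in `param_grid`, the 3-fold setting, and the synthetic data are all assumptions made for illustration; only `n_estimators=35` and `max_depth=5` come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for insurance.csv (made-up values, same column layout).
rng = np.random.RandomState(0)
n = 200
X_demo = pd.DataFrame({
    "age": rng.randint(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.uniform(16, 50, n),
    "children": rng.randint(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["southwest", "southeast", "northwest", "northeast"], n),
})
y_demo = (250 * X_demo["age"]
          + 20000 * (X_demo["smoker"] == "yes")
          + rng.normal(0, 1000, n))

search_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
        (StandardScaler(), ["age", "bmi"]),
        remainder="passthrough",
    ),
    RandomForestRegressor(random_state=0),
)

# make_pipeline names each step after its lowercased class name,
# so the forest's parameters are prefixed 'randomforestregressor__'.
param_grid = {
    "randomforestregressor__n_estimators": [10, 35],  # assumed candidates
    "randomforestregressor__max_depth": [3, 5],       # assumed candidates
}

search = GridSearchCV(search_pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Because the data here is synthetic, the selected parameters say nothing about which values are best on the real dataset.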

## Create the Machine Learning Pipeline

In [37]:
pipe = make_pipeline(X_col_trans, regressor)

## Train the model and save it

In [39]:
# fit the pipeline on the full dataset and report its R^2 score
# note: this is the score on the training data, not on held-out data
pipe.fit(X, y)
pipe.score(X, y)

0.8858485448183654
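The score above is computed on the same rows the pipeline was fitted on, so it tends to overstate how well the model generalizes. A minimal sketch of a held-out evaluation with `train_test_split` follows. The synthetic data, the 80/20 split, and the fixed `random_state` values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for insurance.csv (made-up values, same column layout).
rng = np.random.RandomState(0)
n = 300
X_demo = pd.DataFrame({
    "age": rng.randint(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.uniform(16, 50, n),
    "children": rng.randint(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["southwest", "southeast", "northwest", "northeast"], n),
})
y_demo = (250 * X_demo["age"]
          + 20000 * (X_demo["smoker"] == "yes")
          + rng.normal(0, 1000, n))

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
        (StandardScaler(), ["age", "bmi"]),
        remainder="passthrough",
    ),
    RandomForestRegressor(n_estimators=35, max_depth=5, random_state=0),
)
demo_pipe.fit(X_train, y_train)

train_r2 = demo_pipe.score(X_train, y_train)  # R^2 on rows used for fitting
test_r2 = demo_pipe.score(X_test, y_test)     # R^2 on unseen rows
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

On the real data, the same pattern would use `X` and `y` in place of `X_demo` and `y_demo`.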

In [41]:
# Save the trained pipeline; the context manager closes the file after writing
model_bundle = {
    'model': pipe
}
with open('MedicalCostRandomForest.pkl', 'wb') as f:
    pickle.dump(model_bundle, f)

In [63]:
!ls

07_Medical_Cost_Random_Forest.ipynb  MedicalCost_model.ipynb
insurance.csv			     MedicalCostRandomForest.pkl
medical_cost.ipynb		     medical_cost_regression-master


## Test the pickled model

In [4]:
with open('MedicalCostRandomForest.pkl', 'rb') as f:
    data = pickle.load(f)

In [5]:
model = data['model']

In [6]:
# build a one-row DataFrame for a single customer;
# the column names must match those used for training
x = [19, 'female', 27.9, 0, 'yes', 'southwest']
col = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']

new_df = pd.DataFrame(data=[x], columns=col)

In [7]:
dollar = model.predict(new_df)

In [8]:
dollar

array([17637.34883671])

In [9]:
# predict for several customers at once (first five rows of X)
temp = X.iloc[:5]
temp

index,age,sex,bmi,children,smoker,region
0,19,female,27.9,0,yes,southwest
1,18,male,33.77,1,no,southeast
2,28,male,33.0,3,no,southeast
3,33,male,22.705,0,no,northwest
4,32,male,28.88,0,no,northwest


In [10]:
model.predict(temp)

array([17637.34883671,  4076.4886682 ,  6359.38334883,  8085.24845083,
        4451.22048226])

In [11]:
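# record the scikit-learn version used to train and pickle the model;
# the pickle should be loaded with the same version
import sklearn
print(sklearn.__version__)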

0.22.1
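Pickled scikit-learn estimators are not guaranteed to load correctly under a different scikit-learn version. One possible way to guard against this is to store the version next to the model and compare it at load time. In the sketch below, the `sklearn_version` key, the temporary file path, and the tiny stand-in model are assumptions for illustration; the notebook's own pickle only contains the `model` key.

```python
import os
import pickle
import tempfile
import warnings

import sklearn
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model; in the notebook this would be the fitted pipeline.
model = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0)
model.fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0])

# 'sklearn_version' is a hypothetical extra key added for this sketch.
payload = {"model": model, "sklearn_version": sklearn.__version__}

path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(payload, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Warn if the environment that loads the model differs from the one that saved it.
if loaded["sklearn_version"] != sklearn.__version__:
    warnings.warn(
        "Model was pickled with scikit-learn %s but is being loaded with %s"
        % (loaded["sklearn_version"], sklearn.__version__)
    )

restored = loaded["model"]
print(restored.predict([[1.5]]))
```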
