## Machine Learning

### Importing Libraries

In [50]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.utils import shuffle
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

### Loading Dataset

In [15]:
all_data = pd.read_csv('../Data Cleaning/cleaned_data.csv')
all_data.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

### Data Preprocessing

1. Shuffle the rows so that any ordering in the file does not bias the splits.
2. One-hot encode the categorical columns (get_dummies) so the models can use them.

In [16]:
all_data = shuffle(all_data)

all_data_dum = pd.get_dummies(all_data)
all_data_dum

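| | age | bmi | children | smoker | charges | sex_female | sex_male | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 28 | 34.77 | 0 | 0 | 3557.0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 8 | 37 | 29.83 | 2 | 0 | 6406.0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 123 | 44 | 31.35 | 1 | 1 | 39556.0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 938 | 18 | 26.18 | 2 | 0 | 2304.0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 321 | 26 | 29.64 | 4 | 0 | 24672.0 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1313 | 19 | 34.70 | 2 | 1 | 36398.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 248 | 19 | 20.90 | 1 | 0 | 1832.0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 581 | 19 | 30.59 | 0 | 0 | 1640.0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 287 | 63 | 26.22 | 0 | 0 | 14256.0 | 1 | 0 | 0 | 1 | 0 | 0 |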
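To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` data is invented for illustration and is not part of the dataset). It assumes `dtype=int` is wanted so the dummy columns show 0/1, since recent pandas versions produce booleans by default.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "age": [19, 44, 63],
    "sex": ["female", "male", "female"],
    "region": ["southwest", "northeast", "northwest"],
})

# Numeric columns pass through unchanged; each categorical column is
# replaced by one 0/1 column per category it contains.
toy_dum = pd.get_dummies(toy, dtype=int)
print(toy_dum.columns.tolist())
```

Each row ends up with exactly one 1 among the `sex_*` columns and exactly one among the `region_*` columns.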


### Train, Test Splitting

In [17]:
X = all_data_dum.drop(columns=['charges'])
y = all_data_dum[['charges']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)



### Model Selection

In this part, I apply k-fold cross-validation to choose the best-performing algorithm for our data. Here is an example of how cross-validation splits a dataset, using dummy data:

In [21]:
kf = KFold(n_splits = 3) # setting the number of splits into 3

for train_index, test_index in kf.split([1, 2, 3, 4, 5, 6]):
    print(train_index, test_index)

[2 3 4 5] [0 1]
[0 1 4 5] [2 3]
[0 1 2 3] [4 5]


As you can see, KFold partitions the data so that each sample lands in the test set of exactly one fold and is used for training in all the other folds.
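The same property can be checked programmatically. The sketch below reuses the six-element dummy data and counts how often each index appears in a test partition; the name `test_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(6)          # same size as the dummy list above
kf = KFold(n_splits=3)

# Count how many times each index is placed in a test partition.
test_counts = np.zeros(len(data), dtype=int)
for train_index, test_index in kf.split(data):
    # Train and test partitions of one fold never share an index.
    assert set(train_index).isdisjoint(test_index)
    test_counts[test_index] += 1

print(test_counts)  # every entry is 1
```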

#### Applying cross_val_score to Evaluate Our Models

In [39]:
cross_val_score(LinearRegression(), X, y)

array([0.75229225, 0.78832069, 0.7388152 , 0.69943941, 0.7462162 ])

In [49]:
cross_val_score(Lasso(), X, y)

array([0.75224359, 0.78835653, 0.73883871, 0.69949229, 0.74622792])

In [43]:
cross_val_score(RandomForestRegressor(n_estimators = 100), X, np.ravel(y))

array([0.84178716, 0.86191325, 0.8137814 , 0.8308621 , 0.82653594])

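As we can see, RandomForestRegressor achieves the highest score in every fold. The scores are R² values, the default metric for regressors in cross_val_score: roughly 0.81 to 0.86 for the random forest, against roughly 0.70 to 0.79 for LinearRegression and Lasso. It is therefore the model used in the rest of this section.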
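Comparing five numbers per model by eye is error-prone, so it can help to reduce each array to a mean and a standard deviation. The sketch below does that for the fold scores printed above (copied by hand, not recomputed); the names `scores`, `summary`, and `best` are illustrative.

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above.
scores = {
    "LinearRegression": [0.75229225, 0.78832069, 0.7388152, 0.69943941, 0.7462162],
    "Lasso": [0.75224359, 0.78835653, 0.73883871, 0.69949229, 0.74622792],
    "RandomForestRegressor": [0.84178716, 0.86191325, 0.8137814, 0.8308621, 0.82653594],
}

# Mean and spread of the R^2 score across the five folds.
summary = {name: (np.mean(vals), np.std(vals)) for name, vals in scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean R^2 = {mean:.3f}, std = {std:.3f}")

# Model with the highest mean score.
best = max(summary, key=lambda name: summary[name][0])
print("best:", best)
```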

### Model Tuning

In [55]:
for i in range(100, 200, 10):
    print((i, np.mean(cross_val_score(RandomForestRegressor(n_estimators = i), X, np.ravel(y)))))

(100, 0.831846125141681)
(110, 0.833364289735879)
(120, 0.8339154009224192)
(130, 0.8333984367876605)
(140, 0.834614042038114)
(150, 0.8335295861801286)
(160, 0.8340482963171366)
(170, 0.8332182585103117)
(180, 0.8341442666308645)
(190, 0.8333433079839091)
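The mean scores above differ only in the third decimal place (about 0.832 to 0.835). Because each forest is built with a different random seed, differences this small may be noise rather than an effect of n_estimators. One way to make such a comparison repeatable is to fix `random_state` and let `GridSearchCV` run the loop. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for X and y, and a deliberately small grid so it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and y (assumption: the real CSV is not loaded here).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Small illustrative grid; the notebook's range(100, 200, 10) would work the same way.
param_grid = {"n_estimators": [20, 40, 60]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),  # fixed seed so reruns give the same scores
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With a fixed seed, rerunning the search reproduces the same ranking, which makes small differences between settings easier to interpret.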


### Model Building Using RandomForestRegressor

In [57]:
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators = 100).fit(X_train, np.ravel(y_train))

### Model Testing

In [59]:
# Double brackets select row 10 as a one-row DataFrame, so the feature names seen during fit are kept
clf.predict(X_test.iloc[[10]])

array([26852.88])

The above inputs were:

- age: 47.0
- bmi: 29.8
- children: 3.0
- smoker: 1.0
- sex_female: 0.0
- sex_male: 1.0
- region_northeast: 0.0
- region_northwest: 0.0
- region_southeast: 0.0
- region_southwest: 1.0

The predicted charge is: 26852.88
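A single prediction says little about overall quality. The `r2_score` function imported at the top of this section can score the whole held-out set instead. The sketch below shows the pattern on synthetic data from `make_regression` (an assumption, since the cleaned CSV is not available here); the column names `f0` to `f3` and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the charges column.
X_arr, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

# Same split settings as the notebook: 20% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# R^2 over the entire held-out set.
test_r2 = r2_score(y_te, model.predict(X_te))
print(f"test R^2: {test_r2:.3f}")

# Prediction for one held-out row, selected as a one-row DataFrame.
single = model.predict(X_te.iloc[[10]])
print(single)
```

Applied to the notebook's own objects, the equivalent call would be `r2_score(y_test, clf.predict(X_test))`.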