<a href="https://colab.research.google.com/github/alanpirotta/freecodecamp_certif/blob/main/First_tests_fcc_predict_health_costs_with_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the `train_dataset` and 20% of the data as the `test_dataset`.

`pop` off the "expenses" column from these datasets to create new datasets called `train_labels` and `test_labels`. Use these labels when training your model.

Create a model and train it with the `train_dataset`. Run the final cell in this notebook to check your model. The final cell will use the unseen `test_dataset` to check how well the model generalizes.

To pass the challenge, `model.evaluate` must return a Mean Absolute Error of under 3500. This means that, on average, the predicted health care costs are off by less than $3500.
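
As a quick illustration of the metric, here is a minimal sketch of how a Mean Absolute Error is computed. The helper name `mean_abs_error` and all the numbers are made up for illustration; they are not taken from the dataset.

```python
def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical expenses and predictions (illustrative values only).
true_expenses = [12000.0, 4000.0, 30000.0]
predicted = [10000.0, 5000.0, 33000.0]

mae = mean_abs_error(true_expenses, predicted)
print(mae)  # (2000 + 1000 + 3000) / 3 = 2000.0
```

An MAE of 2000 here would pass the 3500 threshold even though one individual prediction is off by 3000.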

The final cell will also predict expenses using the `test_dataset` and graph the results.

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

### Change categorical data to numbers.
Binary if possible

In [None]:
feature = "region"
classes = dataset[feature].unique().tolist()
print(f"Feature classes: {classes}")

dataset[feature] = dataset[feature].map(classes.index)

# More manual Alternative 
# dataset['region'] = dataset['region'].replace({'southeast': 0,
#                                               'southwest': 1,
#                                               'northwest': 2,
#                                               'northeast': 3
#                                             })
print(dataset['region'].value_counts())
dataset.tail()

In [None]:
dataset['smoker'] = np.where(dataset['smoker']=='yes',1,0)
print(dataset['smoker'].value_counts())
dataset.tail()

In [None]:
dataset['sex'] = np.where(dataset['sex']=='male',1,0)
print(dataset['sex'].value_counts())
dataset.tail()

### Split between train and test datasets

In [None]:
from sklearn.model_selection import train_test_split
train_dataset, test_dataset = train_test_split(dataset, test_size=0.2)
print(f'train shape: {train_dataset.shape}')
print(f'test shape: {test_dataset.shape}')
train_dataset.tail()
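
Because the split above is random, the rows that land in each set change on every run. `train_test_split` accepts a `random_state` argument to make it repeatable. The same idea can be sketched with pandas alone; the frame below is a made-up stand-in, not the insurance data.

```python
import pandas as pd

# Stand-in frame with 10 rows (hypothetical values).
df = pd.DataFrame({"age": range(20, 30), "expenses": range(1000, 11000, 1000)})

# Sample 80% of the rows with a fixed seed, then use the remaining rows as the test set.
train_df = df.sample(frac=0.8, random_state=0)
test_df = df.drop(train_df.index)

print(train_df.shape, test_df.shape)  # (8, 2) (2, 2)
```

With a fixed seed, any later cell that refers to specific row indices sees the same rows on every run.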

### Checking collinearity and outliers

In [None]:
import seaborn as sns
corr = train_dataset.corr()
plt.figure(figsize=(5,5))
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
plt.show()

BMI and region appear to be negatively correlated, which probably means that people in the northern regions have a lower BMI. I don't think the effect is very significant.
If I invert the numeric codes assigned to the regions, the correlation changes to around 0.3.
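
The sign flip comes from the arbitrary integer codes given to a feature that has no natural order. One common alternative, shown here only as a sketch on a tiny made-up frame (only the column name `region` mirrors the dataset), is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative only.
df = pd.DataFrame({
    "bmi": [27.9, 33.8, 33.0, 22.7],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# One indicator column per region, so no artificial ordering is implied.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row then has exactly one region column set to 1, and the correlation of `bmi` with each region can be read separately.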

In [None]:
fig, ax = plt.subplots(1,2)
# ax[0].set_title('bmi')
# ax[1].set_title('age')
# ax[2].set_title('children')
# sns.boxplot(ax=ax[0], data=train_dataset['bmi'])
# sns.boxplot(ax=ax[1], data=train_dataset['age'])
# sns.boxplot(ax=ax[2], data=train_dataset['children'])
sns.boxplot(ax=ax[0], data=train_dataset[['age','bmi']])
sns.boxplot(ax=ax[1], data=train_dataset[['sex','children','smoker','region']])
plt.show()

In [None]:
train_dataset['smoker'].value_counts()

The 'outlier' in the smoker feature surprised me, since the feature is binary, but after checking the value counts I considered it sufficiently balanced.
Next, I'll check for and remove the BMI outliers.
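
A likely explanation for the binary 'outlier' is that a boxplot flags anything beyond 1.5 times the interquartile range. The sketch below uses a hypothetical column with 80% zeros and 20% ones (not the actual counts) to show what happens when the minority class is below 25%:

```python
import pandas as pd

# Hypothetical binary column: 80 zeros and 20 ones.
smoker = pd.Series([0] * 80 + [1] * 20)

q1, q3 = smoker.quantile(0.25), smoker.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Both quartiles are 0, so the IQR is 0 and every 1 lies above the fence.
flagged = smoker[smoker > upper_fence]
print(q1, q3, iqr, len(flagged))  # 0.0 0.0 0.0 20
```

So the point drawn on the boxplot reflects class imbalance rather than a data error.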

In [None]:
plt.figure(figsize=(15,5))
# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it.
sns.histplot(x=train_dataset['bmi'], kde=True, edgecolor='w', linewidth=3, line_kws={"linewidth": 3})
# labels and title
plt.xlabel('bmi')
plt.ylabel('frequency')
plt.title('Distribution of bmi')
plt.show()

In [None]:
Q1 = train_dataset['bmi'].quantile(0.25)
Q3 = train_dataset['bmi'].quantile(0.75)
IQR = Q3 - Q1
# Tukey fences: values more than 1.5 * IQR outside the quartiles count as outliers.
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
print(lower_fence)
print(upper_fence)
outliers = train_dataset[(train_dataset['bmi'] < lower_fence) | (train_dataset['bmi'] > upper_fence)]
outliers

In [None]:
train_dataset = train_dataset.drop(outliers.index)

In [None]:
print(train_dataset.loc[train_dataset.index == 1088])
print()
print(train_dataset.loc[train_dataset.index == 88])

### Splitting the labels from the train/test dataset

In [None]:
train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')

### First Test with linear regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
lr = LinearRegression()

In [None]:
lr.fit(train_dataset, train_labels)
train_prediction = lr.predict(train_dataset)
train_mae = mean_absolute_error(train_labels, train_prediction)  # (y_true, y_pred)
train_mae

In [None]:
test_pred = lr.predict(test_dataset)
test_mae = mean_absolute_error(test_labels, test_pred)  # (y_true, y_pred)
test_mae

In [None]:
print(f'Train score: {lr.score(train_dataset,train_labels)}')
print(f'Test score: {lr.score(test_dataset,test_labels)}')

In [None]:
print(lr.intercept_)
print(lr.coef_)

### Testing with RidgeRegression

In [None]:
from sklearn.linear_model import Ridge
rdg = Ridge(alpha=1.0)
rdg.fit(train_dataset, train_labels)
train_ridge_pred = rdg.predict(train_dataset)
train_ridge_mae = mean_absolute_error(train_labels, train_ridge_pred)  # (y_true, y_pred)
train_ridge_mae

Ridge regression doesn't help here, since there is no strong collinearity between the features and all of them affect the result.
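
For contrast, here is a minimal sketch of a case where the ridge penalty does matter. It uses synthetic data with two nearly identical features and a closed-form solution without an intercept. The helper `ridge_fit` is hypothetical and is not how scikit-learn's `Ridge` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: x2 is almost a copy of x1, so the two columns are collinear.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without intercept: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_fit(X, y, alpha=0.0)    # alpha = 0 reduces to ordinary least squares
w_ridge = ridge_fit(X, y, alpha=1.0)  # same alpha value as the Ridge call above

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With collinear columns the least-squares weights can become large and split arbitrarily between the two features, while the penalty pulls them toward a smaller, more stable solution. When the features are not collinear, as in this notebook, the two fits end up close to each other.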

### Testing with Random Forest, using TensorFlow

In [None]:
!pip install -q tensorflow_decision_forests

In [None]:
!pip install -q wurlitzer

In [None]:
try:
  from wurlitzer import sys_pipes
except ImportError:
  # Fall back to Colab's log capture when wurlitzer is not installed.
  from colabtools.googlelog import CaptureLog as sys_pipes

In [None]:
import tensorflow_decision_forests as tfdf
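# The label column must be inside the DataFrame, so re-attach the popped 'expenses' labels.
train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
    train_dataset.assign(expenses=train_labels), label='expenses', task=tfdf.keras.Task.REGRESSION)
test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_dataset.assign(expenses=test_labels), label='expenses', task=tfdf.keras.Task.REGRESSION)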

In [None]:
model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
model.compile(metrics=["mae"])
with sys_pipes():
  model.fit(x=train_dataset)

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
