<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/03_regression/09_regression_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Regression Project

We have learned about regression and how to build regression models using both scikit-learn and TensorFlow. Now we'll build a regression model from start to finish. We will acquire data and perform exploratory data analysis and data preprocessing. We'll build and tune our model and measure how well our model generalizes.

### Overview

*Friendly Insurance, Inc.* has requested we do a study for them to help predict the cost of their policyholders. They have provided us with sample [anonymous data](https://www.kaggle.com/mirichoi0218/insurance) about some of their policyholders for the previous year. The dataset includes the following information:

Column   | Description
---------|-------------
age      | age of primary beneficiary
sex      | gender of the primary beneficiary (male or female)
bmi      | body mass index of the primary beneficiary
children | number of children covered by the plan
smoker   | is the primary beneficiary a smoker (yes or no)
region   | geographic region of the beneficiaries (northeast, southeast, southwest, or northwest)
charges  | costs to the insurance company

We have been asked to create a model that, given the first six columns, can predict the charges the insurance company might incur.

The company wants to see how accurate we can get with our predictions. If we can make a case for our model, they will provide us with the full dataset of all of their customers for the last ten years to see if we can improve on our model and possibly even predict cost per client year over year.

With the predicted costs, the company can compare the charges it expects to incur against the funds it has available. When a prospective customer applies, the company can use the model to estimate that customer's charges from the six attributes above and then decide whether it can offer coverage.

---

## Exploratory Data Analysis

Now that we have considered the societal implications of our model, we can start looking at the data to get a better understanding of what we are working with.

The data we'll be using for this project can be [found on Kaggle](https://www.kaggle.com/mirichoi0218/insurance). Upload your `kaggle.json` file and run the code block below.

In [None]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

### Exercise 2: EDA and Data Preprocessing

Using as many code and text blocks as you need, download the dataset, explore it, and do any model-independent preprocessing that you think is necessary. Feel free to use any of the tools for data analysis and visualization that we have covered in this course so far. Be sure to do individual column analysis and cross-column analysis. Explain your findings.
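
As a rough illustration of the difference between individual column analysis and cross-column analysis, here is a minimal sketch on a few made-up rows. The `toy` frame and its values are hypothetical and are not taken from the insurance dataset; only the column names mirror the table in the overview.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT taken from the real dataset.
toy = pd.DataFrame({
    "age": [19, 45, 33, 52],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16000.0, 7000.0, 4000.0, 30000.0],
})

# Individual column analysis: look at one column at a time.
smoker_counts = toy["smoker"].value_counts()
charges_summary = toy["charges"].describe()

# Cross-column analysis: how the target varies with a feature.
mean_charges_by_smoker = toy.groupby("smoker")["charges"].mean()

print(smoker_counts)
print(charges_summary)
print(mean_charges_by_smoker)
```

The same calls can be applied to the real `DataFrame` once it is loaded; which columns are worth comparing is something the exploration itself should reveal.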

#### **Student Solution**

In [None]:
# Add code and text blocks to explore the data and explain your work
import pandas as pd
import numpy as np

# Assumes insurance.csv from the Kaggle dataset is already in the working directory
df = pd.read_csv("insurance.csv")
df

In [None]:
df.isnull().any()

In [None]:
df.region.unique()

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df['bmi'].hist()

In [None]:
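# One-hot encode the nominal columns; dtype=int keeps the new columns numeric (0/1) rather than boolean
df = pd.get_dummies(df, columns=['sex', 'region'], dtype=int)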

In [None]:
from sklearn import preprocessing

# Encode the binary 'smoker' column as integers (classes are sorted, so 'no' -> 0 and 'yes' -> 1)
le = preprocessing.LabelEncoder()
df['smoker'] = le.fit_transform(df['smoker'])
df

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,10))
_ = sns.heatmap(df.corr(), cmap='coolwarm', annot=True)

In [None]:
df.describe(include='all')

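Next, we try both min-max normalization and standardization of the features. The resulting arrays, `x_scaled` and `x_std`, are computed here for comparison; the models below are trained on the unscaled features.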

In [None]:
from sklearn.preprocessing import MinMaxScaler
# define min max scaler
normalization = MinMaxScaler()
# transform data
x_scaled = normalization.fit_transform(df.drop('charges', axis=1))

In [None]:
from sklearn.preprocessing import StandardScaler
# define standard scaler
scaler = StandardScaler()
# transform data
x_std = scaler.fit_transform(df.drop('charges', axis=1))
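
The two cells above fit each scaler on the full feature table. A common alternative, which this notebook does not do and which is shown here only as an assumption about good practice, is to compute the scaling statistics from the training rows and reuse them on held-out rows. The sketch below spells out the arithmetic with NumPy on a made-up matrix; all variable names are hypothetical.

```python
import numpy as np

# Made-up feature matrix (rows = samples, columns = features), for illustration only.
X_train = np.array([[20.0, 22.0], [40.0, 30.0], [60.0, 38.0]])
X_test = np.array([[30.0, 26.0]])

# Min-max normalization: the statistics come from the training rows only.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_train_minmax = (X_train - col_min) / (col_max - col_min)
X_test_minmax = (X_test - col_min) / (col_max - col_min)

# Standardization: subtract the training mean, divide by the training standard deviation.
col_mean = X_train.mean(axis=0)
col_std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
X_train_std = (X_train - col_mean) / col_std
X_test_std = (X_test - col_mean) / col_std

print(X_test_minmax)
print(X_test_std)
```

With scikit-learn, the equivalent pattern would be calling `fit` on the training split and `transform` on both splits, rather than `fit_transform` on everything.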

---

## Modeling

Now that we understand our data a little better, we can build a model. We are trying to predict 'charges', which is a continuous variable. We'll use a regression model to predict 'charges'.

### Exercise 3: Modeling

Using as many code and text blocks as you need, build a model that can predict 'charges' given the features that we have available. To do this, feel free to use any of the toolkits and models that we have explored so far.

You'll be expected to:
1. Prepare the data for the model (or models) that you choose. Remember that some of the data is categorical. In order for your model to use it, you'll need to convert the data to some numeric representation.
1. Build a model or models and adjust parameters.
1. Validate your model with holdout data. Hold out some percentage of your data (10-20%), and use it as a final validation of your model. Print the root mean squared error. We were able to get an RMSE between `3500` and `4000`, but your final RMSE will likely be different.
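
The holdout validation step in the list above can be sketched as follows. This is a minimal illustration on synthetic data, not the insurance dataset: the single feature, the noise level, and the straight-line fit via `np.polyfit` are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: one feature with a noisy linear target.
x = rng.uniform(18, 64, size=200)
y = 250.0 * x + rng.normal(0, 1000, size=200)

# Hold out 20% of the rows as a final validation set.
indices = rng.permutation(len(x))
n_holdout = int(0.2 * len(x))
holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

# Fit a straight line on the training rows only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# RMSE on the holdout rows: square root of the mean squared error.
y_pred = slope * x[holdout_idx] + intercept
rmse = np.sqrt(np.mean((y[holdout_idx] - y_pred) ** 2))
print(f"Holdout RMSE: {rmse:.1f}")
```

Because RMSE is in the same units as the target, it can be read directly as a typical prediction error in `charges`.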

#### **Student Solution**

### Linear Regression

In [None]:
from sklearn.model_selection import train_test_split
X = df.drop('charges', axis=1)
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Add code and text blocks to build and validate a model and explain your work
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_predict = lin_reg.predict(X_test)

In [None]:
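# Plot against a single feature (age); X_test has many columns on different scales
plt.plot(X_test['age'], y_test, 'b.', label='Actual')
plt.plot(X_test['age'], y_predict, 'r.', label='Predicted')
plt.xlabel('age')
plt.legend()
plt.show()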

In [None]:
lin_reg.coef_, lin_reg.intercept_

#### Performance Evaluation

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

print("R2 Score:", r2_score(y_test, y_predict))
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, y_predict))
print("RMSE:", math.sqrt(mean_squared_error(y_test, y_predict)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,y_predict)))

#### Predicted vs Actual Plot

In [None]:
plt.plot(y_predict, y_test, 'b.')
plt.plot([y_predict.min(), y_predict.max()], [y_predict.min(), y_predict.max()], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### Residual Plot

In [None]:
RESIDUALS = y_test - y_predict
plt.plot(y_predict, RESIDUALS, 'b.')
plt.plot([0, y_predict.max()], [0, 0], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.show()

### Neural Network

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
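# ReLU activations make the hidden layers nonlinear; without them the stacked
# Dense layers collapse into a single linear map.
model = keras.Sequential([
  layers.Dense(128, activation='relu', input_shape=[10]),
  layers.Dense(64, activation='relu'),
  layers.Dense(32, activation='relu'),
  layers.Dense(1)
])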

model.compile(
  loss='mse',
  optimizer=Adam(learning_rate=0.01),
  metrics=['mae', 'mse'],
)

model.fit(X_train,y_train, epochs=100, validation_split=0.2)

In [None]:
predictions = model.predict(X_test)

predictions

#### Performance Evaluation

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import math

print("R2 Score:", r2_score(y_test, predictions))
print("Mean Squared Error: %.3f" % mean_squared_error(y_test, predictions))
print("RMSE:", math.sqrt(mean_squared_error(y_test, predictions)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,predictions)))

#### Predicted vs Actual Plot

In [None]:
plt.plot(predictions, y_test, 'b.')
plt.plot([predictions.min(), predictions.max()], [predictions.min(), predictions.max()], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### Residual Plot

In [None]:
RESIDUALS = y_test - predictions.ravel()
plt.plot(predictions, RESIDUALS, 'b.')
plt.plot([0, predictions.max()], [0, 0], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.show()

### Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbr_pred = gbr.predict(X_test)

#### Performance Evaluation

In [None]:
print("R2 Score:", r2_score(y_test, gbr_pred))
print("Mean Squared Error:", mean_squared_error(y_test, gbr_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, gbr_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,gbr_pred)))

#### Predicted vs Actual Plot

In [None]:
plt.plot(gbr_pred, y_test, 'b.')
plt.plot([gbr_pred.min(), gbr_pred.max()], [gbr_pred.min(), gbr_pred.max()], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### Residual Plot

In [None]:
RESIDUALS = y_test - gbr_pred
plt.plot(gbr_pred, RESIDUALS, 'b.')
plt.plot([0, gbr_pred.max()], [0, 0], 'r-')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.show()
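
The gradient boosting model above uses scikit-learn's default hyperparameters. One way to approach the "adjust parameters" part of the exercise is a small cross-validated grid search, sketched below on synthetic data. The grid values are hypothetical placeholders rather than tuned recommendations, and the `X_demo`/`y_demo` names are introduced only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data for illustration only (not the insurance dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical search grid; the values are placeholders, not tuned recommendations.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, so RMSE is negated
    cv=3,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
# search.score uses the same (negated) scorer on the refit best model.
print("Holdout RMSE:", -search.score(X_te, y_te))
```

Keeping the holdout split out of the search means the final RMSE is still measured on rows the tuning never saw.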