*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the `train_dataset` and 20% of the data as the `test_dataset`.

`pop` off the "expenses" column from these datasets to create new datasets called `train_labels` and `test_labels`. Use these labels when training your model.

Create a model and train it with the `train_dataset`. Run the final cell in this notebook to check your model. The final cell will use the unseen `test_dataset` to check how well the model generalizes.

To pass the challenge, `model.evaluate` must return a Mean Absolute Error of under 3500. This means it predicts health care costs correctly within $3500.

The final cell will also predict expenses using the `test_dataset` and graph the results.

In [1]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [2]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

--2021-05-05 10:01:52--  https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.2.33, 172.67.70.149, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.2.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50264 (49K) [text/csv]
Saving to: ‘insurance.csv’


2021-05-05 10:01:53 (9.30 MB/s) - ‘insurance.csv’ saved [50264/50264]



Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
1333,50,male,31.0,3,no,northwest,10600.55
1334,18,female,31.9,0,no,northeast,2205.98
1335,18,female,36.9,0,no,southeast,1629.83
1336,21,female,25.8,0,no,southwest,2007.95
1337,61,female,29.1,0,yes,northwest,29141.36


In [19]:
# define the feature columns and target column
categorical_c = ['sex','smoker','region']
numerical_c = ['age','bmi','children']
target = ['expenses']
feature_c = []

In [20]:
for feature_name in categorical_c:
  vocabulary = dataset[feature_name].unique()  # gets a list of all unique values from given feature column
  feature_c.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

In [21]:
feature_c

[VocabularyListCategoricalColumn(key='sex', vocabulary_list=('female', 'male'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 VocabularyListCategoricalColumn(key='smoker', vocabulary_list=('yes', 'no'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 VocabularyListCategoricalColumn(key='region', vocabulary_list=('southwest', 'southeast', 'northwest', 'northeast'), dtype=tf.string, default_value=-1, num_oov_buckets=0)]

In [22]:
for feature_name in numerical_c:
    feature_c.append(tf.feature_column.numeric_column(feature_name, dtype = tf.float32))

In [23]:
feature_c

[VocabularyListCategoricalColumn(key='sex', vocabulary_list=('female', 'male'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 VocabularyListCategoricalColumn(key='smoker', vocabulary_list=('yes', 'no'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 VocabularyListCategoricalColumn(key='region', vocabulary_list=('southwest', 'southeast', 'northwest', 'northeast'), dtype=tf.string, default_value=-1, num_oov_buckets=0),
 NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='bmi', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='children', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

In [50]:
# define the linear regression model
estimator = tf.estimator.LinearRegressor(feature_columns = feature_c, model_dir = 'LinRegTrain')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'LinRegTrain', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [30]:
#make a copy of the data without the target column
df = dataset.copy()
y = df.pop('expenses')

In [31]:
# define the training size of the dataset to be 0.8%
training_size = int(len(df)*0.8)

In [65]:
#define the train and test sets
train = dataset.iloc[:training_size, :]
test = dataset.iloc[training_size:, :]

In [66]:
train

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
1065,42,female,25.3,1,no,southwest,7045.50
1066,48,male,37.3,2,no,southeast,8978.19
1067,39,male,42.7,0,no,northeast,5757.41
1068,63,male,21.7,1,no,northwest,14349.85


In [67]:
print(train.shape)
print(test.shape)

(1070, 7)
(268, 7)


In [56]:
print(tf.__version__)

2.4.1


In [68]:
#make an input function
def input_fn(dataset , batch_size = 50, num_epochs = None, shuffle = True):
    #return  tf.estimator.inputs.pandas_input_fn(
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
        x = dataset[df.columns]
        , y = dataset['expenses']
        , batch_size = batch_size
        , num_epochs = num_epochs
        , shuffle = shuffle  
    )

In [51]:
? estimator.train

In [75]:
# train the model
estimator.train(input_fn = input_fn(train,num_epochs=None), steps = 50)


INFO:tensorflow:Calling model_fn.




INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from LinRegTrain/model.ckpt-100
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100...
INFO:tensorflow:Saving checkpoints for 100 into LinRegTrain/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100...
INFO:tensorflow:loss = 258491440.0, step = 100
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 150...
INFO:tensorflow:Saving checkpoints for 150 into LinRegTrain/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 150...
INFO:tensorflow:Loss for final step: 265749950.0.


<tensorflow_estimator.python.estimator.canned.linear.LinearRegressorV2 at 0x7f87966d6580>

In [76]:
# evaluate the model
evaluation = estimator.evaluate(input_fn = input_fn(test,num_epochs=20,shuffle = True))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2021-05-05T18:16:57Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from LinRegTrain/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.75696s
INFO:tensorflow:Finished evaluation at 2021-05-05-18:16:58
INFO:tensorflow:Saving dict for global step 150: average_loss = 326601600.0, global_step = 150, label/mean = 13495.188, loss = 325937150.0, prediction/mean = 335.13773
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 150: LinRegTrain/model.ckpt-150


In [72]:
test_labels = feature_c

In [73]:
model = estimator
test_dataset = test

In [74]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)


TypeError: evaluate() got an unexpected keyword argument 'verbose'