<a href="https://colab.research.google.com/github/estrella-mooney/ma-learning-python/blob/main/ex3a_multiple_glms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Health Insurance Costs


This is a longer exercise which you will start in class and finish during your self-driven learning time after class. There may also be in-class time tomorrow to ask for help or feedback from the tutor.

This is not an assessment. It is a learning exercise that will prepare you for assessments later on.

This exercise requires that you write much more code than you have before. It is intended to be more difficult, but this struggle will improve your skills.

To make things easier and learn:
* Use other notebooks you have previously completed as a reference to help you do this work. This is NOT cheating - it's how people learn code.
* Ask the tutor for help when you get stuck
* Other references like slides and Google are OK too, though you should not need them
* Do not copy other students' answers - you need to learn now so that you do not struggle with exercises in later weeks


## The scenario

_You’ve been hired by a USA-based financial planning service._
_They want to be able to tell clients how much they can expect to pay in medical insurance charges in the future._

_They have historical data from clients with information like age, smoking status, and how much they paid in medical insurance._

_They need a model that can predict most people’s costs within $5000._
_Can you build a model that can do this?_

## Our steps

This exercise is broken into two notebooks. These walk you through building a model in a similar way to how you might in the real world.

### Notebook One

We will:

1. Import pandas
1. Open the data and look at it
1. Graph some of the data
1. Split the data into training and test sets
1. Make a simple linear regression model that predicts healthcare charges from one feature
1. Train the simple linear regression model
1. Visualise the simple linear regression model
1. Test the simple linear regression model

### Notebook Two

We will:

1. Import all packages needed
1. Open the data
1. Graph some more of the data to see other features
1. Build a second simple linear regression model using another feature
1. Train and test it
1. Build a multiple linear regression model that uses two features
1. Train it
1. Test to see whether our model can predict most people's costs within $5000 or not
1. Demonstrate that training and test sets are not a perfect approach


# Notebook One

## Step 1: Import Pandas

Our data are in a text file and we need pandas to open it.

First, you need to import the pandas package

### Instructions

1. Write code below that imports the pandas package
1. Run the cell

### Hints

* There is an example of a package called `sys` being imported in the top line of the cell. You want something similar, but for `pandas`
* If you cannot recall how to do this, open an old notebook you previously finished and look for where pandas is imported

In [3]:
# Import the packages we need: sys (used by the checker below) and pandas
import sys
import pandas


# ===========================================================
# You can ignore the code below. It will simply check your code
if 'pandas' in sys.modules:
    print("Well done! Pandas was imported successfully")
else:
    print("Pandas was not imported. Try again")


Well done! Pandas was imported successfully


## Step 2a: Open the data

Our data is in a text file at `insurance.csv`.

We want to:
* open it with a function called `read_csv()` which is inside `pandas`
* store it in a variable called `data`.

### Instructions
1. Upload the file called `insurance.csv` from the data folder on your computer
1. Replace question marks with code so that you open `insurance.csv` using pandas, and store it to something called `data`
1. Run the cell

### Hints
* To use a method in a package, you write the package name, then a full stop, then the method name. For example, if my package is called `my_package` and the method is called `read_file`, I would write `my_package.read_file()`
* Remember that `"insurance.csv"` needs to be in quotes "" because it's a string
* Look at previous notebooks to remind yourself how this was done before. These old notebooks might say `, sep="\t"` or similar. You should not write that in this case.
* Make sure your variable is called `data` (all lower case letters)
* If you get an error, try to read what it says for a hint on what's wrong
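
As a rough illustration of the `package.method()` pattern, the sketch below calls `pandas.read_csv()` on a tiny made-up table held in memory rather than on `insurance.csv`. The values and the names `example_csv` and `example_data` are invented for this example; in the exercise you pass the file name as a string instead.

```python
import io
import pandas

# A tiny, made-up CSV held in memory (not the real insurance data)
example_csv = io.StringIO(
    "age,smoker,charges\n"
    "25,no,3000.0\n"
    "40,yes,21000.0\n"
    "60,no,12500.0\n"
)

# Package name, then a full stop, then the method name
example_data = pandas.read_csv(example_csv)

# shape is (number of rows, number of columns)
print(example_data.shape)
```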

In [4]:
# Load the insurance dataset with pandas' read_csv and store it in a variable called data

data = pandas.read_csv("insurance.csv")


# ===========================================================
# You can ignore the code below. It will simply check your code
if 'data' in locals() and type(data) is pandas.DataFrame:
    if data.size == 520:
        print("It looks like you opened the wrong file")
    else:
        print("Well done! You opened the file")
else:
    print("Not quite right! Try again")


Well done! You opened the file


## Step 2b: Look at the data as a table

Let's take a look at the dataframe you've made, as a table.

In Jupyter notebooks there are two easy ways to do this:
* By writing the name of the variable at the end of a cell
* By using `print()` to print your variable

### Instructions
1. In the cell below, use one of these methods in order to print out the pandas dataframe you've just made
1. Run the cell and take a look at the data you have
1. Make sure the table has printed out before you continue

### Hint:
* You only need to write one word to do this! You wrote the name of your table in the last cell
* It will not print the entire table, just a handful of rows. This is ok.

In [5]:
# Display the data variable to inspect the data set (and to find the features and labels)
# print(data) also works, but ending the cell with the variable name gives a nicer table in Google Colab
data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [9]:
data.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


## Step 3a: Graph some data - import graphing

The first thing we are interested in is whether older people pay more for health insurance than younger people.

We will start by graphing the data we have. To do this, we first need the course's `graphing` package.

### Instructions
1. In the cell below, write code that will import the package called `graphing`
1. Run the cell

### Hints
* This will be very similar to when you imported pandas, except this time the package is called `graphing`
* `graphing` is entirely lower case

In [12]:
# Please execute this cell. It downloads a file called graphing for us to easily use graphing
!wget https://raw.githubusercontent.com/TomReidNZ/QRC_Datasets/main/graphing.py

--2023-08-28 22:11:31--  https://raw.githubusercontent.com/TomReidNZ/QRC_Datasets/main/graphing.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18769 (18K) [text/plain]
Saving to: ‘graphing.py’


2023-08-28 22:11:31 (71.9 MB/s) - ‘graphing.py’ saved [18769/18769]



In [13]:
# Replace the question marks with code that will import graphing
import graphing

# ===========================================================
# You can ignore the code below. It will simply check your code
if 'graphing' in sys.modules:
    print("Well done! Graphing was imported successfully")
else:
    print("Graphing was not imported. Try again")


Well done! Graphing was imported successfully


## Step 3b: Graph some data

Time to see whether older people pay more for health insurance than younger people.

Let's make a scatter plot with age on the x axis and charges on the y axis.

### Instructions

1. Look at the print out of your table above. Notice that we have the feature `age` and the label `charges`
1. In the cell below, replace the left-most question mark with the name of the variable holding our table of data
1. Replace the second set of question marks with the name of our feature (`age`)
1. Replace the final set of question marks with the name of our label (`charges`)
1. Give the graph a better title
1. Run the cell
1. Look at the graph. Do prices seem to change much with age?

### Hints
* You just used the name of the data table in the previous step
* `label_x`, `label_y`, and `title` need strings, so what you write will need to be in quotes "". For example `label_x="bmi"`

In [14]:
# Plot a 2D scatter graph with age (the feature) on the x axis and charges (the label) on the y axis
graphing.scatter_2D(data, label_x='age', label_y='charges', title="Age Vs Charges")

## Step 4: Train and test sets

Before we train a model, we need to split our data into training and test sets.

This means we take our dataframe called `data` and split it into two dataframes:
* one called `training_data`
* one called `testing_data`

There are lots of ways to do this, but you don't need to memorise how for this course.

You can check this worked properly by checking that the number of rows in `training_data` plus the number of rows in `testing_data` add to how much data you started with.

### Instructions

1. Read all comments in each cell below
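1. Run Cell 1 to split the data into different datasets
1. Complete Cell 2 by replacing the question marks with the word `size`. Note that `size` counts every value in the table (rows multiplied by columns), so the numbers are larger than the number of rows.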
1. Run Cell 2
1. Complete Cell 3 so that it will print out the amount of _testing_ data
1. Run Cell 3
1. Check that if you add the numbers from Cell 2 and Cell 3, you get the amount of data in the whole data set (printed in Cell 1)

### Hints
* You can see how to complete the Cell 3 by looking at Cell 2
* Make sure you print the testing data in the last cell, not the training data
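
The check described above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the data are invented, and `random_state=0` is added here (it is not in the exercise code) so that the same rows are picked each time the sketch runs. It also shows the difference between `len()`, which counts rows, and `.size`, which counts every value.

```python
import pandas

# A small made-up table (10 rows, 2 columns), not the insurance data
example_data = pandas.DataFrame({
    "age": [18, 22, 30, 35, 41, 47, 52, 58, 60, 64],
    "charges": [1500.0, 2100.0, 3900.0, 5200.0, 6400.0,
                8000.0, 9800.0, 11500.0, 12700.0, 14100.0],
})

# random_state is an assumption added here so the split is repeatable
example_training = example_data.sample(frac=0.8, random_state=0)
example_testing = example_data.drop(example_training.index)

# len() counts rows; .size counts every value (rows x columns)
print(len(example_training), len(example_testing))
print(example_training.size, example_testing.size)

# Either way, the two parts should add up to the whole
assert len(example_training) + len(example_testing) == len(example_data)
assert example_training.size + example_testing.size == example_data.size
```

Without `random_state`, `sample()` picks different rows on every run, which is why your own results in later steps will not match another student's exactly.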


In [6]:
# CELL 1

# Print the amount of data in the whole data set
# Note: .size counts every value in the table (rows x columns), not just rows
print("total number of values in data (rows x columns)")
print(data.size)

# Create a training data set
training_data = data.sample(frac=0.8)

# Create a testing data set
testing_data = data.drop(training_data.index)

total number of values in data (rows x columns)
9366


In [7]:
# CELL 2
# Change this code so it prints how much training data there is
print("Amount of training data")
print(training_data.size)

Amount of training data
7490


In [15]:
# CELL 3
# Change this code so it prints how much testing data there is
print("Amount of testing data")
print(testing_data.size)

Amount of testing data
1876


## Step 5a: Build a simple linear regression model - Import StatsModels

We're almost ready to build our model.

We want to build our model using a package called `statsmodels`.

`statsmodels` is a package that contains lots of smaller packages. We only want one of them, called `statsmodels.formula.api`.

We can import this in two different ways.

#### Option 1
The first way is to do what we've done before:
```
import statsmodels.formula.api
```
When we use something inside of it, we write it out in full. For example, let's say we want to use a method called `ols()`. We would write
```
statsmodels.formula.api.ols()
```

#### Option 2
The second way is to do what we've done before, but add a nickname:
```
import statsmodels.formula.api as smf
```
When we use something inside of it, we can use its nickname. For example
```
smf.ols()
```

### Instructions
1. Choose one option above to import `statsmodels.formula.api`
1. Write code in the cell below to import the package
1. Run the cell

Note: the way you do this will change how you answer questions that follow. Use whichever option is more intuitive for you.


In [16]:
# Complete the line below to import statsmodels.formula.api
import statsmodels.formula.api as smf



# ===========================================================
# You can ignore the code below. It will simply check your code
if 'statsmodels.formula.api' in sys.modules:
    print("Well done! The package was imported successfully")
else:
    print("The package was not imported. Try again")


Well done! The package was imported successfully


## Step 5b: Build a simple linear regression model

Let's build our model. We do that using a function called `ols` inside of `statsmodels.formula.api`. It makes a model, but does not train it.

In previous exercises we saw how to use formulae to build these models. For example, when we built a model to predict bean harvest sizes (label) using the number of workers (feature) our formula was:

```"tons_of_beans_harvested ~ number_of_workers"```

Now we want to be able to predict `charges` (label) using the feature `age`. This means our formula is
```"charges ~ age"```

#### A note about 'formulas'
These are short-hand ways to write the proper mathematical formulas you were shown in class that have slope and intercept. Statsmodels prefers us to write the short-hand versions.

### Instructions
1. Replace the left-most question marks with the name you gave `statsmodels.formula.api` in the previous step.
1. Replace the second question marks with the short-hand formula we need. Remember we want to predict `charges` using `age`.
1. Replace the third question marks with the name of our _training_ dataset.
1. Run the cell

### Hints
1. If you didn't give the package a nickname in the previous step, its name is `statsmodels.formula.api`
1. We've provided you with the formula needed above. Make sure you understand what it means, though.
1. The name of the training data set variable is in Step 4, Cell 2


In [17]:
# Complete the code below
untrained_model = smf.ols(formula="charges ~ age", data=training_data)


# ===========================================================
# You can ignore the code below. It will simply check your code
if 'untrained_model' in locals():
    if untrained_model.data.frame.size == 7490:
        print("Well done! Your model is ready to train")
    else:
        print("Try to use the training data. It looks like you used something else")
else:
    print("Not quite right! Try again")

Well done! Your model is ready to train


## Step 6: Train your model

Your model is ready to train!

Your model has a method called `fit()` which:
* Runs training to completion using:
    * The formula you provided
    * Sum of squared errors as an error function
* Returns a newly trained model, which we need to save to a variable

Once we have run training, we can print out the trained model's parameters (its slope and intercept)

### Instructions

1. Replace the question marks below with code that will train your model
1. Run the cell
1. Read the output:
    * The slope for age is the line's slope. This is how much the predicted charge increases for each additional year of age
    * The intercept is the predicted charge at age 0. You can think of it as a base amount that the model adds for everyone

### Hint

Remember the example in class about robots being given instructions.

If your robot was called `my_robot` and the list of instructions was called `clean_my_room()` we would write
```my_robot.clean_my_room()```

In this case, our robot is our model, called `untrained_model`, and the list of instructions is called `fit()`.
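
For intuition about what `fit()` works out, the sketch below finds the slope and intercept that minimise the sum of squared errors for a single feature, using the standard closed-form least-squares formulas in plain Python. It is not how `statsmodels` is implemented internally, and the function name `fit_line` and the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (co-variation of x and y) divided by (variation of x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up ages and charges that lie exactly on charges = 1000 + 200 * age
ages = [20, 30, 40, 50]
charges = [5000, 7000, 9000, 11000]

slope, intercept = fit_line(ages, charges)
print(slope, intercept)
```

Because the made-up points sit exactly on a line, the sketch recovers that line. Real data are noisy, so the fitted line only passes as close to the points as it can.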


In [18]:
# Replace the question marks with code that runs
# training on your model
trained_model = untrained_model.fit()

# Print out the trained model's parameters
print("slope for age:")
print(trained_model.params.age)
print("intercept (offset):")
print(trained_model.params.Intercept)

slope for age:
261.6736106645178
intercept (offset):
2891.503669401164


## Step 7: Graph your model

Let's graph your model.

This is just like the previous graph we made, but this time:
* We will have the trained_model's predictions as a trendline (we have done this for you)
* We only want the testing data

### Instructions
1. Replace the first question marks with the name of the variable holding your _testing_ data.
1. Replace the second question marks with the name of the `age` variable
1. Replace the third question marks with the name of the `charges` variable

### Hints

* This is just like Step 3B, except you must use the testing data, _not_ `data` and _not_ `training_data`.
* You can find the name of the testing data in Step 4
* Remember that `label_x` and `label_y` should be strings - put the names in quotes.

In [19]:
# Replace question marks below with code that makes this work
graphing.scatter_2D(testing_data, label_x='age', label_y='charges', trendline=trained_model.predict, title="Test Data (dots) and model (line)")

## Step 8: Test the model

You've just graphed how well your model does at predicting the test dataset.

We want to know how close the model is to predicting charges for the average person, though.

For this, we will calculate the mean absolute difference. This tells us how far off the model is, on average, at predicting charges. We can even ask `statsmodels` to calculate it for us.

We will need to:
1. Make predictions for each person in the testing dataset
1. Ask statsmodels to compare those predictions to the actual values in the testing dataset

We will do this in three steps.


### Instructions (1 of 3)

Let's make predictions for each person in the testing set.

In the cell below:
1. Read the code carefully
1. Replace the first set of question marks with the name of the method we use to `predict` things
1. Replace the second set of question marks with the name of our testing data
1. Run the cell
1. Once it works, check you understand what it is doing

### Hints
* The name of the method that we use to `predict` things has a fairly obvious name that we have mentioned repeatedly in class. It was used in Step 7. It starts with `pr` and ends in `ict`...
* You just used the testing data in the previous step. If you've forgotten its name, see your answers there.

In [20]:
# Make predictions using the testing data
predictions = trained_model.predict(testing_data)

print("Predictions:")
print(predictions)


# ===========================================================
# You can ignore the code below. It will simply check your code
if 'predictions' in locals():
    if predictions.size == len(testing_data):
        print("Well done!")
    else:
        print("Did you predict using the correct dataset? Try again")
else:
    print("Try again!")

Predictions:
3       11526.732821
7       12573.427264
14       9956.691157
18      17545.225867
29      11003.385600
            ...     
1317     7601.628661
1318    12050.080043
1323    13881.795317
1325    18853.593920
1335     7601.628661
Length: 268, dtype: float64
Well done!
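
Each prediction is simply the trained line evaluated at that person's age. As a check, the sketch below plugs the slope and intercept printed in Step 6 into the line equation for row 3 (age 33) and row 1335 (age 18) from the table in Step 2b. Your own numbers will differ, because the training set is sampled at random. The helper name `predict_charges` is made up for this example.

```python
# Parameters printed in Step 6 of this run (yours will differ)
slope = 261.6736106645178
intercept = 2891.503669401164

def predict_charges(age):
    """Evaluate the trained line: intercept + slope * age."""
    return intercept + slope * age

# Row 3 is 33 years old and row 1335 is 18 years old
print(round(predict_charges(33), 2))   # 11526.73
print(round(predict_charges(18), 2))   # 7601.63
```

These match the first and last values in the printed predictions, which confirms that a simple linear regression model uses nothing but age to make its estimate.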


### Instructions (2 of 3)

The part of `statsmodels` that can compare our predictions and the actual answers is in `statsmodels.tools.eval_measures`
1. In the cell below, import `statsmodels.tools.eval_measures`
1. Run the cell

### Hints

We've imported a few things now. There is nothing special about this time. The name of the package is just different.

In [21]:
# Replace the ??? with code to import statsmodels.tools.eval_measures
# This time, don't give it a nickname
import statsmodels.tools.eval_measures
# This module provides error measures such as meanabs (mean absolute error)

### Instructions (3 of 3)

Time to calculate the mean (average) absolute difference between the predictions and the correct answers.

To do this, we need to use the package we just imported. It contains a method called `meanabs` that can do the job for us.

In the cell below we compare the results to the actual answers
1. Read all the code and its comments
1. Replace the first set of question marks with the name of the package we just imported
1. Replace the second set of question marks with the method we want to use (`meanabs`).
1. Replace the third set of question marks with the variable holding predictions we just made
1. Run the cell
1. Once it works, review what you've just written
1. Write down the error value. You will need it in the next notebook

### Hints

* When you use the package, you need to write it out in full (including the full stops)
* You made the variable holding the predictions a minute or two ago. Scroll up if you've forgotten its name.
* Your error value should be more than $8000. If it is less, something is wrong.

In [25]:
# Get the correct answers
correct_answers = testing_data.charges

# Replace question marks below so that we can compare
# these correct answers to the predictions we have made
statsmodels.tools.eval_measures.meanabs(correct_answers, predictions)

9356.249186826748
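
To see what `meanabs` measures, the sketch below computes the mean absolute difference by hand in plain Python on a few made-up numbers. It also counts what fraction of predictions land within a chosen distance of the true value, which is one possible way to read the client's "within $5000" target. The function names and data are invented for illustration.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def fraction_within(actual, predicted, limit):
    """Share of predictions that are at most `limit` away from the actual value."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= limit)
    return hits / len(actual)

# Made-up charges, not taken from the insurance data
actual = [4000.0, 12000.0, 30000.0, 9000.0]
predicted = [6000.0, 11000.0, 18000.0, 10000.0]

print(mean_abs_error(actual, predicted))          # 4000.0
print(fraction_within(actual, predicted, 5000))   # 0.75
```

The two measures answer slightly different questions: a single large miss raises the mean even when most predictions are close.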

## Final Step

Wow! You've covered a lot.

Here's what you've done:

1. Imported a number of ML libraries
1. Opened raw data into a format the computer understands
1. Looked at it
1. Graphed it
1. Made a model
1. Trained the model
1. Looked at how well it works

The only downside is that the model has not worked as well as we want. Remember, we wanted most estimations of healthcare insurance charges to be within $5000 of the real thing.

In the next notebook we will build a model that uses more than one feature, hoping that it will achieve our goal.

### Instructions

1. Take a quick breather, then read this completed notebook from start to finish. Try to see everything you've done and understand how you did it.
1. Move on to the second notebook. Keep this one open though - you can use it as a reference to help you complete the second one.