# Lab 10 - Linear Models

In [1]:
%matplotlib inline

## Directions


The due dates for each Lab are indicated in the Syllabus and the course calendar. If anything is unclear, please email EN685.648@gmail.com, the official email for the course, or ask questions in the Lab discussion area on Blackboard.

The Labs also present technical material that augments the lectures and "book".  You should read through the entire lab at the start of each module.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
Please follow the directions and make sure you provide the requested output. Failure to do so may result in a lower grade, or even 0 points, even if the code is correct.
</div>

### General Instructions

1.  You will be submitting your assignment to Blackboard. If there are no accompanying files, you should submit *only* your notebook and it should be named using *only* your JHED id: fsmith79.ipynb for example if your JHED id were "fsmith79". If the assignment requires additional files, you should name the *folder/directory* your JHED id and put all items in that folder/directory, ZIP it up (only ZIP...no other compression), and submit it to Blackboard.
    
    * do **not** use absolute paths in your notebooks. All resources should appear in the same directory as the rest of your assignments.
    * the directory **must** be named your JHED id and **only** your JHED id.
    
2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.

3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.

4. Submit a cleanly executed notebook. It should say `In [1]` for the first codecell and increase by 1 throughout.

## Linear Regression

In a previous module (Lab 5), you performed EDA on the insurance data set. In this Lab, you should build a linear regression model trying to estimate `charges`.

In [2]:
import numpy as np
import random as py_random
import numpy.random as np_random
import time
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats

sns.set(style="whitegrid")

In [29]:
import models

# Answer

Let's start by reading in the data and looking at the variables we have in the dataframe.

In [38]:
insurance = pd.read_csv("https://raw.githubusercontent.com/fundamentals-of-data-science/datasets/master/insurance.csv", header=0)

In [4]:
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


So we have 6 variables not including our target variable. We can view their types as well.

In [5]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


We see no missing entries. The columns age and children are integers and bmi is a float, so all three are numeric. Smoker and sex are binary categorical variables, and region is categorical with 4 values (as we saw in our previous EDA). Let's take a quick look at region again.

In [6]:
insurance['region'].value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64

We'll first start by converting the categorical variables into a one hot encoding. We'll look at transforming region first.

In [39]:
# One-hot encode region and append the four indicator columns to the dataframe.
insurance = pd.concat([insurance, pd.get_dummies(insurance['region'])], axis=1)
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,northeast,northwest,southeast,southwest
0,19,female,27.9,0,yes,southwest,16884.924,False,False,False,True
1,18,male,33.77,1,no,southeast,1725.5523,False,False,True,False
2,28,male,33.0,3,no,southeast,4449.462,False,False,True,False
3,33,male,22.705,0,no,northwest,21984.47061,False,True,False,False
4,32,male,28.88,0,no,northwest,3866.8552,False,True,False,False


Looks pretty good. Now we can do a similar transformation for sex and smoker. We'll make sure the transformations don't cause unexpected changes.
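One way to check that a binary encoding introduces no unexpected changes is to cross-tabulate the original labels against the encoded values. The sketch below uses a small made-up frame (the `toy` dataframe and its values are assumptions for illustration, not the insurance data) and a vectorized comparison in place of `apply`.

```python
import pandas as pd

# Toy data standing in for the insurance frame (values are made up).
toy = pd.DataFrame({"sex": ["female", "male", "male", "female", "male"]})

# Vectorized equivalent of apply(lambda x: 1 if x == "female" else 0).
toy["female"] = (toy["sex"] == "female").astype(int)

# Cross-tabulate original labels against encoded values: every row should
# land on the expected (label, code) pairing and nowhere else.
check = pd.crosstab(toy["sex"], toy["female"])
print(check)

# Count rows that landed on the wrong pairing; this should be zero.
mismatches = int(check.loc["female", 0] + check.loc["male", 1])
print("mismatched rows:", mismatches)
```

Comparing `value_counts()` before and after, as done in this lab, checks the totals; the cross-tabulation additionally confirms that each individual label maps to the intended code.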

In [21]:
pd.DataFrame(insurance['sex'].value_counts())

Unnamed: 0_level_0,count
sex,Unnamed: 1_level_1
male,676
female,662


In [41]:
insurance['female'] = insurance['sex'].apply(lambda x: 1 if x == 'female' else 0)

pd.DataFrame(insurance['female'].value_counts())

Unnamed: 0_level_0,count
female,Unnamed: 1_level_1
0,676
1,662


Looks like we have the same value counts after the encoding. We can apply the same thing for smoker.

In [23]:
pd.DataFrame(insurance['smoker'].value_counts())

Unnamed: 0_level_0,count
smoker,Unnamed: 1_level_1
no,1064
yes,274


In [42]:
insurance['smokes'] = insurance['smoker'].apply(lambda x: 1 if x == 'yes' else 0)

pd.DataFrame(insurance['smokes'].value_counts())

Unnamed: 0_level_0,count
smokes,Unnamed: 1_level_1
0,1064
1,274


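Looks good. When we put the encoded variables into the regression model, we leave out one level of each encoding to serve as the reference category, because including every level alongside the intercept would make the columns perfectly collinear. For region we will leave out southeast, for sex we will leave out male, and for smoker we will leave out no.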
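The reason for leaving one level out can be seen by checking the rank of the design matrix. The following is a minimal sketch on a made-up six-row frame (the `toy` data is an assumption for illustration): with an intercept and all four region indicators, the indicators sum to the intercept column, so one column is redundant.

```python
import numpy as np
import pandas as pd

# Toy region column (made up) with all four levels present.
toy = pd.DataFrame({"region": ["southeast", "southwest", "northwest",
                               "northeast", "southeast", "northwest"]})
dummies = pd.get_dummies(toy["region"], dtype=float)

intercept = np.ones((len(toy), 1))

# Intercept plus all four indicators: the indicators add up to the
# intercept column, so the matrix cannot have full column rank.
full = np.hstack([intercept, dummies.to_numpy()])

# Dropping one level (southeast, the reference category) removes the redundancy.
reduced = np.hstack([intercept, dummies.drop(columns="southeast").to_numpy()])

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)
print("full:    columns =", full.shape[1], "rank =", full_rank)
print("reduced: columns =", reduced.shape[1], "rank =", reduced_rank)
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined, which is why one level per categorical variable is held out.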

In [43]:
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,northeast,northwest,southeast,southwest,female,smokes
0,19,female,27.9,0,yes,southwest,16884.924,False,False,False,True,1,1
1,18,male,33.77,1,no,southeast,1725.5523,False,False,True,False,0,0
2,28,male,33.0,3,no,southeast,4449.462,False,False,True,False,0,0
3,33,male,22.705,0,no,northwest,21984.47061,False,True,False,False,0,0
4,32,male,28.88,0,no,northwest,3866.8552,False,True,False,False,0,0


Here we just confirm that all of our encodings are present in the dataframe.

Before we create our model, we can think about how each variable could affect charges (note we've already seen the pairwise EDA, so I will just summarize here):
<ul>
    <li> age - positive most likely, as age increases, so do health issues</li>
    <li> bmi - positive, charges would increase for obese people</li>
    <li> children - positive, as number of children increase, charges are more costly</li>
    <li> northeast - ? we don't have enough to go on here, possibly positive if there are higher population centers in the northeast</li>
    <li> northwest - again we don't know</li>
    <li> southwest - let's guess negative, just to oppose the north</li>
    <li> female - negative, women tend to have higher life expectancy than men</li>
    <li> smokes - positive, we know from EDA that smoking largely increases charges</li>
</ul>

Let's start with all the variables in the model:

In [44]:
model = '''charges ~ age + bmi + children + northeast
        + northwest + southwest + female + smokes'''

result = models.bootstrap_linear_regression(model, data=insurance)

In [46]:
models.describe_bootstrap_lr(result)

Coefficients,,Mean,95% BCI Lo,95% BCI Hi
,$\beta_{0}$,-13104.87,-15432.92,-11554.32
age,$\beta_{1}$,1035.02,-60.66,1960.3
bmi,$\beta_{2}$,682.06,-485.90,1663.38
children,$\beta_{3}$,74.97,-877.87,1010.67
northeast,$\beta_{4}$,256.86,228.94,280.49
northwest,$\beta_{5}$,339.19,276.99,402.77
southwest,$\beta_{6}$,475.5,212.20,694.02
female,$\beta_{7}$,131.31,-696.74,631.17


We see the $R^2$ is 0.75, so the model explains 75% of the variability in charges. We should also look at the adjusted $R^2$ since we have so many variables in the model. I will copy the `adjusted_r_squared` function from *Fundamentals, page 779*.

In [47]:
def adjusted_r_squared(result):
    adjustment = (result['n'] - 1)/ (result['n'] - len(result['coefficients']) - 1 - 1)
    return 1 - (1 - result['r_squared']) * adjustment

In [48]:
adjusted_r_squared(result)

0.7490359662835133

We see about the same result here, but note that the adjusted $R^2$ could be handy to look at as we proceed to refine our model.
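For reference, the usual textbook form of adjusted $R^2$ is $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $p$ counts the predictors excluding the intercept. The sketch below implements that form as a standalone function; the name `adjusted_r_squared_textbook` and its arguments are hypothetical and not part of `models`. The degrees-of-freedom term in the function above is written a little differently, so the two may disagree slightly in the later decimal places.

```python
def adjusted_r_squared_textbook(r_squared, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


# Inputs taken from this lab: R^2 reported as 0.75 (rounded), 1338 rows,
# and 8 predictors in the model formula.
adj = adjusted_r_squared_textbook(0.75, n=1338, p=8)
print(round(adj, 4))
```

With 1338 observations and only 8 predictors the penalty is small, which is why the adjusted value stays so close to the unadjusted one.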

Let's describe our charges variable again as a reminder.

In [49]:
insurance['charges'].describe()

count     1338.000000
mean     13270.422265
std      12110.011237
min       1121.873900
25%       4740.287150
50%       9382.033000
75%      16639.912515
max      63770.428010
Name: charges, dtype: float64

We see a std of 12,110 for charges, while our model's $\sigma$ is 6062.10, almost half. The coefficients are all positive, but only smokes is hugely so. Aside from the region variables and smokes, the credible intervals of the other variables contain 0. We can look at how strong, weak, or mixed the evidence is that we predicted the signs of these coefficients correctly.

In [50]:
predictions = {'age': '+', 'bmi': '+', 'children': '+', 'northeast': '+',
            'northwest': '+', 'southwest': '-', 'female': '-', 'smokes': '+'}

models.evaluate_coefficient_predictions(predictions, result)

age P(>0)=0.970 (strong)
bmi P(>0)=0.940 (strong)
children P(>0)=0.620 (mixed)
northeast P(>0)=1.000 (strong)
northwest P(>0)=1.000 (strong)
southwest P(<0)=0.000 (weak)
female P(<0)=0.370 (mixed)
smokes P(>0)=1.000 (strong)


We have strong evidence that age, bmi, northeast, northwest, and smokes are all positive. We have weak evidence that southwest is negative, though we weren't really positive (heh) about that guess, and mixed results for children and female. From our EDA, we saw that men and women didn't differ much in charges, and that the correlation between number of children and charges was weak.
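The probabilities reported above can be read as the share of bootstrap resamples in which a coefficient has the predicted sign. How `models.evaluate_coefficient_predictions` computes them, and where it draws the strong/mixed/weak cutoffs, is not shown in this lab, so the following is only a sketch under that assumption, using synthetic data and plain NumPy least squares rather than the `models` module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y depends strongly on x1 and not at all on x2.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x1, x2])


def fit_ols(X, y):
    """Least-squares coefficients for y ~ X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Resample rows with replacement and refit, giving one coefficient
# vector per bootstrap resample.
n_boot = 500
boot = np.empty((n_boot, X.shape[1]))
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[i] = fit_ols(X[idx], y[idx])

# Share of resamples with a positive coefficient, plus a 95% percentile interval.
p_positive = (boot > 0).mean(axis=0)
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)

for name, p, l, h in zip(["intercept", "x1", "x2"], p_positive, lo, hi):
    print(f"{name}: P(>0)={p:.3f}, 95% interval=({l:.2f}, {h:.2f})")
```

A coefficient whose resampled values sit almost entirely on one side of zero gives a probability near 0 or 1, while one whose interval straddles zero gives something in between, which matches the "mixed" cases above.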

According to our table for evaluating coefficients, we should remove variables that have an unexpected sign and whose credible interval includes 0. Female falls into that category, so I will take it out of the model (based on our EDA, we should not expect to see much of a difference between men and women anyway).