<a href="https://colab.research.google.com/github/datacommonsorg/api-python/blob/master/notebooks/intro_data_science/Regression_Evaluation_and_Interpretation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2022 Google LLC.
SPDX-License-Identifier: Apache-2.0

# Regression: Evaluation and Interpretation
In part 1, we saw how powerful regression can be as a tool for prediction. In this colab, we'll take that exploration one step further: what can regression models tell us about the statistical relationships between variables?

In particular, this colab will take a more rigorous statistical approach to regressions. We'll look at how to evaluate and interpret our regression models using statistical methods.

## Learning Objectives:
* Hypothesis testing with regression
* Regression tables
* Pearson Correlation coefficient, $r$
* $R^2$ and adjusted $R^2$
* Interpreting weights and intercepts
* How correlated variables affect models
---
**Need extra help?**

If you're new to Google Colab, take a look at [this getting started tutorial](https://colab.research.google.com/notebooks/intro.ipynb).

To build more familiarity with the Data Commons API, check out these [Data Commons Tutorials](https://docs.datacommons.org/tutorials/).

And for help with Pandas and manipulating data frames, take a look at the [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html).

We'll be using the scikit-learn and statsmodels libraries for implementing and evaluating our models today. Documentation for scikit-learn can be found [here](https://scikit-learn.org/stable/modules/classes.html).

As usual, if you have any other questions, please reach out to your course staff!

# Getting Set Up


Run the following code boxes to load the Python libraries and data we'll be using today.

In [None]:
# Setup/Imports
!pip install datacommons --upgrade --quiet
!pip install datacommons_pandas --upgrade --quiet

In [None]:
# Data Commons Python and Pandas APIs
import datacommons
import datacommons_pandas

# For manipulating data
import numpy as np
import pandas as pd

# For implementing models and evaluation methods
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels import api as sm


# For plotting/printing
from matplotlib import pyplot as plt
import seaborn as sns

## The Data

In this assignment, we'll be returning to the scenario we started in Part 1. As a refresher, we'll be exploring how obesity rates vary with different health or societal factors across US cities.

Our data science question: **What can we learn about the relationship of those health and lifestyle factors to obesity rates?**

In [None]:
# Load the data we'll be using
city_dcids = datacommons.get_property_values(["CDC500_City"],
                                             "member",
                                             limit=500)["CDC500_City"]

# We've compiled a list of some nice Data Commons Statistical Variables
# to use as features for you
stat_vars_to_query = [
  "Count_Person",
  "Percent_Person_PhysicalInactivity",
  "Percent_Person_SleepLessThan7Hours",
  "Percent_Person_WithHighBloodPressure",
  "Percent_Person_WithMentalHealthNotGood",
  "Percent_Person_WithHighCholesterol",
  "Percent_Person_Obesity"
                      
]

# Query Data Commons for the data and remove any NaN values
raw_features_df = datacommons_pandas.build_multivariate_dataframe(city_dcids,stat_vars_to_query)
raw_features_df.dropna(inplace=True)

# order columns alphabetically
raw_features_df = raw_features_df.reindex(sorted(raw_features_df.columns), axis=1)

# Add city name as a column for readability.
# --- First, we'll get the "name" property of each dcid
# --- Then add the returned dictionary to our data frame as a new column
df = raw_features_df.copy(deep=True)
city_name_dict = datacommons.get_property_values(city_dcids, 'name')
city_name_dict = {key:value[0] for key, value in city_name_dict.items()}
df.insert(0, 'City Name', pd.Series(city_name_dict))

# Display results
display(df)

place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood
geoId/0107000,Birmingham,200733,40.1,33.9,44.0,44.9,34.6,17.4
geoId/0135896,Hoover,92606,28.8,20.2,33.4,32.4,33.2,12.0
geoId/0137000,Huntsville,215006,34.7,28.4,38.7,37.4,33.5,15.4
geoId/0150000,Mobile,187041,38.0,29.9,41.0,42.2,35.2,16.3
geoId/0151000,Montgomery,200603,36.7,30.7,42.6,41.2,34.5,17.0
...,...,...,...,...,...,...,...,...
geoId/5548000,Madison,269840,24.9,17.1,31.2,22.1,28.2,13.0
geoId/5553000,Milwaukee,577222,38.9,28.9,38.8,31.2,29.9,16.6
geoId/5566000,Racine,77816,43.2,28.8,36.7,31.6,33.1,15.1
geoId/5584250,Waukesha,71158,28.6,19.3,29.5,26.3,30.8,12.5


## The Model

Run the following code box to fit an [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression model to our data.

In [None]:
# fit a regression model
dep_var = "Percent_Person_Obesity"
y = df[dep_var].to_numpy().reshape(-1, 1)
x = df.loc[:, ~df.columns.isin([dep_var, "City Name"])]
x = sm.add_constant(x)


model = sm.OLS(y, x)
results = model.fit()
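
To make the fitting step less of a black box, here is a minimal sketch of what ordinary least squares does, written with NumPy on synthetic data rather than the city data. The variable names and the "true" coefficients are made up for illustration; the column of ones plays the same role as `sm.add_constant` above.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3 + 2*x1 - 1*x2 + a little noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_demo = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the model can learn an intercept
X_demo = np.column_stack([np.ones(n), x1, x2])

# OLS picks the coefficients that minimize the sum of squared residuals
coef_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(coef_demo)  # should land close to [3, 2, -1]
```

Because the noise is small, the recovered coefficients should sit very close to the values used to generate the data.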

# Part 0) Regression Tables

When performing regression analyses, statistical packages will usually provide a _**regression table**_, which summarizes the results of the analysis.

Run the following code box to display the regression table for our original model. In this colab, we'll go over some of the statistics included in the table.


In [None]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.715
Model:                            OLS   Adj. R-squared:                  0.711
Method:                 Least Squares   F-statistic:                     205.7
Date:                Tue, 11 Jan 2022   Prob (F-statistic):          1.29e-130
Time:                        23:46:37   Log-Likelihood:                -1283.7
No. Observations:                 499   AIC:                             2581.
Df Residuals:                     492   BIC:                             2611.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------

# Part 1) Hypothesis Testing


## 1.1) Null Hypotheses

When performing a statistical analysis, one usually starts by stating the null hypothesis. For regression models, there is typically one null hypothesis per independent variable, each stating that the coefficient for that variable equals zero.
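
In symbols, the general pattern can be sketched as follows. The notation is an assumption introduced here for illustration: $\beta_j$ is the coefficient on the $j$-th of $p$ independent variables, $\beta_0$ is the intercept, and $\varepsilon$ is the error term.

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \\
H_0: \beta_j = 0 \qquad \text{versus} \qquad H_1: \beta_j \neq 0 $$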

**Q1.1)** Write out the null hypotheses for each of our independent variables.

## 1.2) T-test

So how do we test our null hypotheses? We use the [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Slope_of_a_regression_line).

Take a look at the regression table above to answer the following questions:

**Q1.2A)** According to the t-test, which variables are statistically significant?

**Q1.2B)** For variables that are not statistically significant, should we keep them in our model? Why or why not?
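
If you're curious where the `t` column of a regression table comes from, here is a minimal sketch using NumPy on synthetic data (not the city data). The names `x_signal` and `x_noise` are hypothetical: the first actually influences the dependent variable, the second does not. The t-statistic for each coefficient is the coefficient divided by its standard error.

```python
import numpy as np

# Synthetic data (hypothetical): y depends on x_signal but not on x_noise
rng = np.random.default_rng(1)
n = 300
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y_demo = 1.0 + 0.5 * x_signal + rng.normal(size=n)

# Fit OLS with an intercept column
X_demo = np.column_stack([np.ones(n), x_signal, x_noise])
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Residual variance uses n minus the number of fitted parameters
residuals = y_demo - X_demo @ coef
dof = n - X_demo.shape[1]
sigma2 = residuals @ residuals / dof

# Standard errors: square roots of the diagonal of sigma^2 * (X'X)^-1
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X_demo.T @ X_demo)))

# t-statistic: coefficient divided by its standard error
t_stats = coef / std_err
for name, t in zip(["const", "x_signal", "x_noise"], t_stats):
    print(f"{name:>8}: t = {t: .2f}")
```

As a rough rule of thumb for large samples, a $|t|$ above about 2 corresponds to a p-value below 0.05.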

## 1.3) F-test

Beyond testing the significance of our individual variables independently, we can also test the significance of our model overall using the [F-test](https://en.wikipedia.org/wiki/F-test#Regression_problems). In particular, the F-test compares our model to one without predictors (aka, just an intercept). In other words, can our model do statistically better than just predicting the mean?
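
For an OLS model with an intercept, the F-statistic can also be written in terms of $R^2$ and the degrees of freedom. As a quick sanity check, the sketch below recomputes it from numbers already printed in this colab (the variable names are our own).

```python
# Values read off the regression table and the R^2 printout in this colab
r_squared = 0.7149364375531281
n_obs = 499                      # No. Observations
df_model = 6                     # Df Model: predictors, excluding the intercept
df_resid = n_obs - df_model - 1  # Df Residuals: 492

# Explained variance per predictor, divided by
# unexplained variance per residual degree of freedom
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
print(round(f_stat, 1))  # 205.7, matching the F-statistic in the table
```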

Again use the regression table above to answer the following questions:

**Q1.3A)** What is the null hypothesis for the F-test?

**Q1.3B)** Can we reject the null hypothesis for our model?

# Part 2) Statistical Measures

## 2.1) Correlation Coefficient $r$

We can quantify predictiveness of variables using a _correlation coefficient_, a number that represents the degree to which two variables have a statistical relationship. The most common correlation coefficient used is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), also known as _Pearson's r_, which measures the strength of linear relationships between variables.

Mathematically, the correlation coefficient is defined as:
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}
$$

where $x$ and $y$ are the two variables.

Those of you with a statistics background might recognize this as the ratio of the covariance of the two variables to the product of their standard deviations.
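
The formula translates almost line for line into NumPy. The sketch below is one possible implementation, checked against NumPy's built-in `np.corrcoef`; the function name `pearson_r` and the example data are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, following the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = np.sum(x_centered * y_centered)
    denominator = np.sqrt(np.sum(x_centered ** 2)) * np.sqrt(np.sum(y_centered ** 2))
    return numerator / denominator

# Made-up example data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print(pearson_r(hours, scores))
print(np.corrcoef(hours, scores)[0, 1])  # NumPy's built-in, for comparison
```

Both lines should print the same value, a little above 0.98 for this data.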

**Q2.1A)** Either using the mathematical definition or by exploring with code, explain what the correlation coefficient would be in the following cases:

A) $x = y$

B) $x = -y$

C) $x$ and $y$ are both normally distributed variables with mean 0 and variance 1, randomly sampled independently from each other.

In [None]:
"""
Optional cell for Q2.1A
"""

# Hint: Try writing code to generate values for x and y, then either write or import
# a function to calculate the correlation coefficient

# Your code here

Now run the following code box to use the pandas `.corr()` function to calculate the correlation coefficient between our variables. Note that pandas outputs the results as a matrix.

In [None]:
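# calculate correlation
# (drop the non-numeric "City Name" column first; newer pandas versions
# raise an error if .corr() is called on a frame containing strings)
df.drop(columns=["City Name"]).corr()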

,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood
Count_Person,1.0,-0.020728,0.041395,0.08387,-0.006448,-0.02794,-0.01112
Percent_Person_Obesity,-0.020728,1.0,0.797189,0.651754,0.71957,0.428489,0.719002
Percent_Person_PhysicalInactivity,0.041395,0.797189,1.0,0.76309,0.750962,0.505813,0.740931
Percent_Person_SleepLessThan7Hours,0.08387,0.651754,0.76309,1.0,0.696953,0.380698,0.675784
Percent_Person_WithHighBloodPressure,-0.006448,0.71957,0.750962,0.696953,1.0,0.751325,0.545973
Percent_Person_WithHighCholesterol,-0.02794,0.428489,0.505813,0.380698,0.751325,1.0,0.233295
Percent_Person_WithMentalHealthNotGood,-0.01112,0.719002,0.740931,0.675784,0.545973,0.233295,1.0



**Q2.1B)** Explain why the diagonals of the matrix have the value 1.

**Q2.1C)** What is the correlation coefficient between `Count_Person` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between population and obesity rate?

**Q2.1D)** What is the correlation coefficient between `Percent_Person_PhysicalInactivity` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between physical inactivity and obesity rate?

**Q2.1E)** In general, would you prefer to include features that correlate strongly with the dependent variable, or features with no correlation in a regression model?

**Q2.1F)** You find a new feature with correlation coefficient $r=-0.97$ between it and obesity rates. Would it be a good idea to add this new feature to your model?


## 2.2) $R^2$ Score

One way to quantify how predictive a linear regression model is overall is to use the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), $R^2$ (pronounced "R squared").

Mathematically, the $R^2$ score is defined as:

$$S_{residuals} = \sum_i{(y_i - f_i)^2} \\
S_{total} = \sum_i{(y_i - \bar{y})^2}\\
R^2 = 1 - \frac{S_{residuals}}{S_{total}}$$

where the $y_i$ are the actual dependent variable values, the $f_i$ are the predicted dependent variable values, and $\bar{y}$ is the mean of the $y_i$.

Conceptually, the $R^2$ score is a measure of explained variance. If $R^2=0.75$, that means 75% of the variance in the dependent variable is accounted for by our model, while the remaining 25% is not.
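
As a small worked example, the definition can be sketched directly in NumPy. The function name and the four data points below are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 computed directly from the definition above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residuals = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_residuals / ss_total

# Made-up values: predictions that are close to, but not exactly, the truth
y_actual = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.1, 1.9, 3.2, 3.8]

# S_residuals = 0.10 and S_total = 5.0, so R^2 = 1 - 0.10 / 5.0
print(r_squared(y_actual, y_predicted))  # approximately 0.98
```

The `r2_score` function imported from scikit-learn at the top of this colab computes the same quantity.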

**Q2.2A)** Based on the mathematical definition, what is the range of values possible for $R^2$?

**Q2.2B)** Come up with a situation (e.g. what would the data look like) where:

A) $R^2 = 1.0$

B) $R^2 = 0.0$

Let's now analyze what the $R^2$ value is for our model.

In [None]:
# calculate R^2
print("Model R^2 =", results.rsquared)

Model R^2 = 0.7149364375531281


**Q2.2C)** Is the model's $R^2$ a "good" score?

**Q2.2D)** Can you think of any ways we can change our model that would improve the $R^2$ score?

## 2.3) Adjusted $R^2$

There's an issue with $R^2$ scores that one needs to be aware of when working with multiple independent variables: the number of independent variables used can affect the $R^2$ score.

Let's see this in practice. Let's create a new dataframe with an extra 100 dummy variables (randomly sampled from a 0-mean 1-variance normal distribution) tacked on.

In [None]:
# Pad our dataframe with more random variables
df_padded = df.copy()
num_rows = len(df.index)
for i in range(100):
  var_name = f"Random Variable {i}"
  random_data = np.random.normal(size=(num_rows, 1))
  df_padded[var_name] = random_data
display(df_padded)


place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood,Random Variable 0,Random Variable 1,Random Variable 2,Random Variable 3,Random Variable 4,Random Variable 5,Random Variable 6,Random Variable 7,Random Variable 8,Random Variable 9,Random Variable 10,Random Variable 11,Random Variable 12,Random Variable 13,Random Variable 14,Random Variable 15,Random Variable 16,Random Variable 17,Random Variable 18,Random Variable 19,Random Variable 20,Random Variable 21,Random Variable 22,Random Variable 23,Random Variable 24,Random Variable 25,Random Variable 26,Random Variable 27,Random Variable 28,Random Variable 29,Random Variable 30,Random Variable 31,...,Random Variable 60,Random Variable 61,Random Variable 62,Random Variable 63,Random Variable 64,Random Variable 65,Random Variable 66,Random Variable 67,Random Variable 68,Random Variable 69,Random Variable 70,Random Variable 71,Random Variable 72,Random Variable 73,Random Variable 74,Random Variable 75,Random Variable 76,Random Variable 77,Random Variable 78,Random Variable 79,Random Variable 80,Random Variable 81,Random Variable 82,Random Variable 83,Random Variable 84,Random Variable 85,Random Variable 86,Random Variable 87,Random Variable 88,Random Variable 89,Random Variable 90,Random Variable 91,Random Variable 92,Random Variable 93,Random Variable 94,Random Variable 95,Random Variable 96,Random Variable 97,Random Variable 98,Random Variable 99
geoId/0107000,Birmingham,200733,40.1,33.9,44.0,44.9,34.6,17.4,-1.775024,1.448878,0.751420,-0.337571,-1.630007,-0.791610,-0.764136,1.001150,-0.497411,1.137086,0.388179,1.490722,0.479426,0.885647,2.139369,0.980127,-0.024894,-0.594549,0.057923,-0.206539,-0.676724,-2.132997,-0.659994,-1.009111,-0.611383,0.865748,-1.102444,0.305733,1.471427,-1.673124,-0.362989,0.163016,...,-1.721955,0.007987,-1.713295,-0.960122,0.449155,1.645943,-1.917358,0.040276,0.664790,-0.296875,0.954265,-1.586074,-1.319715,-1.551457,-0.404962,-0.581372,1.094329,-1.063519,1.249983,0.010912,-0.910707,0.438521,-0.202320,1.334286,-0.084193,0.414590,-0.209882,-1.603098,0.403013,-0.350851,-0.904914,1.062353,-0.370768,-0.365375,0.908413,-0.311293,0.084229,-0.442015,0.031837,-0.767270
geoId/0135896,Hoover,92606,28.8,20.2,33.4,32.4,33.2,12.0,0.694015,1.150745,0.025378,-0.283279,-1.503832,0.690609,-1.694068,0.082391,0.749624,-0.887541,-0.499350,-0.361166,0.946467,1.292919,-0.237458,2.216770,-1.168924,0.115793,0.216861,-1.016504,0.238576,-1.032536,0.744089,-0.619010,-0.562913,0.367186,0.847115,1.787353,0.734604,-0.638914,-0.946534,-0.530487,...,-0.979663,-0.370279,-1.037029,1.415770,1.118652,-0.677876,1.302612,-1.908806,1.756797,1.210258,1.005200,0.789094,-1.573405,2.561985,0.498816,-1.946839,-0.120957,-0.996368,-0.184924,-1.579735,0.392238,0.430946,-0.416738,1.785403,-2.073653,0.066519,1.128013,-0.126520,-0.112078,0.911456,-1.264666,-0.370690,1.162111,-0.785190,0.546058,0.844673,-0.371177,-1.715979,0.975173,-0.480216
geoId/0137000,Huntsville,215006,34.7,28.4,38.7,37.4,33.5,15.4,0.541813,-0.882819,0.958483,-0.261622,-0.318740,-0.296085,1.120216,0.625789,-0.255850,-0.295485,0.135865,0.272344,1.187588,-1.112298,0.068120,0.952634,0.141764,-0.210001,-0.478465,0.546696,-0.439181,0.087348,-1.092643,-0.371697,2.281920,-1.947865,-0.327443,0.040106,0.839419,0.002466,-0.918838,1.645719,...,1.755844,0.440961,-0.859049,-1.748428,0.187462,-0.103037,1.659586,-0.516001,-0.926824,1.735668,1.421723,-0.243201,-0.173084,0.046470,-0.996634,-0.168138,-0.195259,1.266686,-1.125591,0.749213,-0.993767,0.052149,0.014974,-0.610232,1.284871,1.321688,2.192499,-1.349410,0.056769,0.868246,0.590667,0.343471,-0.236944,-0.183300,-0.755913,0.290493,-1.092547,0.499298,0.546659,-1.824621
geoId/0150000,Mobile,187041,38.0,29.9,41.0,42.2,35.2,16.3,-1.411763,0.970479,0.526453,0.589572,1.195750,-1.162790,-0.946892,-0.184070,1.021928,0.232984,1.202840,0.139808,-1.334441,0.770988,0.189486,0.103376,-0.381955,-1.625051,0.199944,0.262585,-0.734874,-0.060794,-0.451282,0.203035,-0.612601,-0.970844,2.443790,1.528910,1.020285,1.013306,-0.827678,-0.444616,...,0.706610,1.619422,0.657103,0.770255,-1.270909,1.961774,-1.133165,0.846602,0.975800,-1.371879,1.105116,0.644582,-0.229678,-1.338127,0.406340,1.147806,0.124070,1.227782,0.159141,0.518456,0.179980,-0.583644,1.061271,1.375698,0.287154,1.024046,-0.792434,0.177761,0.822325,0.637097,-0.177783,-0.571769,0.108268,0.521113,-0.791119,0.998221,0.634474,0.111280,0.514318,0.718273
geoId/0151000,Montgomery,200603,36.7,30.7,42.6,41.2,34.5,17.0,0.274708,-0.180782,0.313885,-1.061191,0.712677,-0.992857,-1.264189,1.681904,-1.009600,2.518111,0.508708,1.014424,1.090256,0.348207,-1.711203,1.352788,0.211015,0.432492,-1.727934,0.254936,-1.506159,0.246940,1.057653,0.068258,-0.035257,-1.248899,-0.363321,-0.824603,0.282785,-0.866543,-0.069885,-0.473962,...,-0.159499,0.120937,-0.127973,0.141808,0.496150,-0.572781,0.237349,-0.167445,0.180610,-0.754127,0.920394,0.358120,-0.693735,0.304582,1.649599,2.005993,0.046118,-0.494053,-0.930855,0.450508,1.993434,0.624035,0.571500,0.082674,-0.678352,-0.615773,0.857199,0.228284,-2.445451,-0.192423,-0.855647,-0.462500,-0.692991,1.084134,-0.359452,-0.147825,0.243505,0.202438,0.317707,-2.382052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
geoId/5548000,Madison,269840,24.9,17.1,31.2,22.1,28.2,13.0,-0.969009,1.412804,1.001098,-0.754356,1.113830,-0.541527,-0.293943,-1.027716,-0.642341,2.029056,0.857605,-1.037140,-0.793341,1.935037,-0.672720,0.667762,-0.068366,1.338460,1.047006,1.488816,-1.174426,0.837342,-0.743928,1.283631,-0.302360,0.478022,-0.361372,-0.675101,-0.660942,-0.051469,0.626492,1.470935,...,1.652755,-0.497184,-1.935963,1.281359,-0.818733,1.705189,0.800860,0.471286,-0.652267,-1.525656,0.311640,0.762192,0.680809,1.878626,0.652062,0.987393,-1.661159,0.736757,-0.440298,-0.064345,-0.357752,-0.303827,0.576510,0.528060,-1.513916,-1.766208,0.544944,-1.292090,-0.376568,-0.870507,0.759292,-1.398415,-0.107914,0.372533,1.394782,1.581284,1.155661,-0.614553,-2.088723,1.847830
geoId/5553000,Milwaukee,577222,38.9,28.9,38.8,31.2,29.9,16.6,-0.239748,-1.351544,1.373149,0.630010,0.313671,1.059540,0.304985,1.834013,-0.238911,-0.408768,-0.924046,-0.993349,-0.663866,-0.013103,-0.594201,1.996893,-0.895648,0.846737,-0.165234,-1.255032,1.052939,0.103962,0.160159,0.877918,1.521560,1.010804,1.500163,0.528522,1.053596,1.060611,0.196371,0.401724,...,0.159087,0.267534,0.947967,1.948878,-0.819116,1.411872,1.164725,-1.345878,0.434531,-0.701864,-1.116246,-0.247241,-1.014739,0.064561,0.594934,1.303606,-0.913499,-0.618747,1.222729,0.309434,-1.402033,0.947968,-2.250798,0.825114,-0.429653,-1.632259,-0.769433,0.357641,0.789376,-0.219585,-1.032197,-1.242889,-0.410550,-0.482256,-0.189732,0.003207,0.637076,0.939126,0.286405,0.024044
geoId/5566000,Racine,77816,43.2,28.8,36.7,31.6,33.1,15.1,-0.525982,-0.862656,-0.275084,0.252393,-0.057526,0.978272,-0.387141,-0.429862,0.710882,-0.918138,-0.114354,-1.412459,-0.313694,-1.806375,2.158390,-0.080033,-1.028974,-0.932772,1.386584,-0.376287,0.559161,-0.364964,0.062053,0.309001,-1.156470,-1.063613,-0.223070,1.784257,-1.016282,0.193039,-0.248497,0.943596,...,-1.710233,-1.497015,0.923087,-0.745110,-1.832465,-2.133669,0.292899,-0.617494,1.303770,0.707003,0.489196,-0.355318,-0.245810,-0.785993,0.737171,1.374875,0.539105,-0.670215,-0.082543,1.160429,1.370529,-2.556183,0.023618,-1.711679,-0.911774,-0.209436,-1.486000,-0.009568,-0.576010,-0.681486,-1.852981,-1.295352,-0.342304,-0.403603,0.074907,-0.441652,-0.646397,-0.039066,0.408928,-0.570368
geoId/5584250,Waukesha,71158,28.6,19.3,29.5,26.3,30.8,12.5,0.497990,-0.601220,-0.534763,-0.227176,0.220348,0.492816,-0.161077,-1.187640,-0.561108,1.171325,-0.722342,0.236320,-0.056901,-0.896256,2.266473,0.594716,0.525131,-0.297379,0.075712,0.944833,0.323573,-0.777092,-1.348715,1.141381,0.148966,0.726743,1.658158,0.605021,0.543027,0.061568,-0.276480,1.381142,...,-0.268136,-0.424532,-0.532448,-0.958664,-0.035628,0.950755,-0.191116,0.548828,-0.609989,-0.242135,-0.552404,-1.306694,0.205696,0.106275,-1.188721,-1.022715,-1.721808,0.577816,0.175367,-1.724745,0.854490,-0.586973,0.951074,-2.036884,0.989204,0.836019,0.143775,-1.125518,-0.355169,0.089212,-0.032642,0.663702,0.275836,-0.409020,-1.294200,0.602293,-0.485648,-0.123183,0.444515,1.782210


Now let's fit a new model to the data and compare $R^2$ scores.

In [None]:
# New R^2
y_padded = df_padded[dep_var].to_numpy().reshape(-1, 1)
x_padded = df_padded.loc[:, ~df_padded.columns.isin([dep_var, "City Name"])]
x_padded = sm.add_constant(x_padded)

padded_model = sm.OLS(y_padded, x_padded)
padded_results = padded_model.fit()

print("Original Model R^2 = ", results.rsquared)
print("Padded Model R^2 =", padded_results.rsquared)


Original Model R^2 =  0.7149364375531281
Padded Model R^2 = 0.769737804849763


**Q2.3A)** Which model had a better $R^2$ score?

**Q2.3B)** Think about the variables used in each model. Should one model be much more predictive than another?

**Q2.3C)** In general, how would you expect $R^2$ to change as we increase the number of independent variables?



So how do we fix this? We can adjust our $R^2$ metric to account for the number of variables. The most popular way to define the _**adjusted $R^2$**_ score is as follows:

$$R^{2}_{adj}=1-(1-R^{2}){n-1 \over n-p-1}$$

where $n$ is the number of data points and $p$ is the number of independent variables.
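
To see the formula in action, the sketch below plugs in the $R^2$ values printed earlier in this colab, with $n = 499$ cities, $p = 6$ for the original model, and $p = 106$ for the padded model (6 real variables plus 100 random ones). The function name is our own.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n data points and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# R^2 values printed earlier in this colab
original_adj = adjusted_r_squared(0.7149364375531281, n=499, p=6)
padded_adj = adjusted_r_squared(0.769737804849763, n=499, p=106)

print("Original model:", original_adj)
print("Padded model:  ", padded_adj)
```

The penalty factor $(n-1)/(n-p-1)$ grows with $p$, so a model only earns a higher adjusted $R^2$ if its extra variables reduce the unexplained variance by more than the penalty costs.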

Now let's compare the adjusted $R^2$ of our models.

In [None]:
# Adjusted R^2
print("Original Model Adjusted R^2 = ", results.rsquared_adj)
print("Padded Model Adjusted R^2 =", padded_results.rsquared_adj)

Original Model Adjusted R^2 =  0.7114600526452394
Padded Model Adjusted R^2 = 0.7074730275897498


**Q2.3D)** Which model had a better adjusted $R^2$ score?

**Q2.3E)** When would you prefer to use adjusted $R^2$ over $R^2$ to evaluate model fit?

# Part 3) Interpreting Regression Models


## 3.1) Analyzing Weights and Intercepts
The parameters of the regression model itself can also yield important insights.

Run the following code box to display the weights and intercept of our original model.

In [None]:
# Display weights/coefficients
display(results.params.round(5))

const                                     9.04913
Count_Person                             -0.00000
Percent_Person_PhysicalInactivity         0.41560
Percent_Person_SleepLessThan7Hours       -0.12346
Percent_Person_WithHighBloodPressure      0.47605
Percent_Person_WithHighCholesterol       -0.25152
Percent_Person_WithMentalHealthNotGood    0.70090
dtype: float64

**Q3.1A)** What is the intercept of our model? What are its units?

**Q3.1B)** What are the units on each of the model weights (aka coefficients)?

**Q3.1C)** Which variables matter most to our model?

**Q3.1D)** In words, describe what a weight/coefficient in a linear regression means.

**Q3.1E)** Our model is used to generate a predicted obesity rate for a fictional city named Dataopolis. If we increased `Percent_Person_WithMentalHealthNotGood` for Dataopolis by 1 unit, _while keeping the values for all remaining variables constant_, by how much would we expect our predicted obesity rate to change?
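
If it helps to experiment, here is a minimal sketch of the "change one variable, hold the rest constant" idea on a toy linear model. The intercept, weights, and feature values are hypothetical and are not the ones fit in this colab.

```python
import numpy as np

# Hypothetical toy model (NOT the model fit in this colab):
# prediction = intercept + w_a * feature_a + w_b * feature_b
intercept = 10.0
weights = np.array([0.5, -0.2])

def predict(features):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + weights @ np.asarray(features, dtype=float)

baseline = predict([30.0, 40.0])
bumped = predict([31.0, 40.0])  # feature_a up by 1 unit, feature_b unchanged

# The change in the prediction equals the weight on feature_a
print(bumped - baseline)
```

Try bumping `feature_b` instead and compare the change in the prediction with its weight.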

## 3.2) The effect of correlated variables

When interpreting weights, one thing to look out for is if we have independent variables that are highly correlated with each other.

Let's illustrate why this might be a problem by adding a new variable that is correlated with one of the existing variables.

In [None]:
# New variable correlated with Percent_Person_WithMentalHealthNotGood
correlated_df = df.copy()
target_var = "Percent_Person_WithMentalHealthNotGood"
noise = np.random.normal(size=(len(correlated_df.index),))
correlated_df["Correlated Variable"] = correlated_df[target_var] + noise

# show new data frame
print("New dataframe to fit:")
display(correlated_df)

# Create a new model
y_corr = correlated_df[dep_var].to_numpy().reshape(-1, 1)
x_corr = correlated_df.loc[:, ~correlated_df.columns.isin([dep_var, "City Name"])]
x_corr = sm.add_constant(x_corr)

correlated_model = sm.OLS(y_corr, x_corr)
correlated_results = correlated_model.fit()

print("Correlated Model Weights and Intercept:")
display(correlated_results.params.round(5))

New dataframe to fit:


place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood,Correlated Variable
geoId/0107000,Birmingham,200733,40.1,33.9,44.0,44.9,34.6,17.4,16.774543
geoId/0135896,Hoover,92606,28.8,20.2,33.4,32.4,33.2,12.0,13.701440
geoId/0137000,Huntsville,215006,34.7,28.4,38.7,37.4,33.5,15.4,15.562815
geoId/0150000,Mobile,187041,38.0,29.9,41.0,42.2,35.2,16.3,17.077843
geoId/0151000,Montgomery,200603,36.7,30.7,42.6,41.2,34.5,17.0,17.946941
...,...,...,...,...,...,...,...,...,...
geoId/5548000,Madison,269840,24.9,17.1,31.2,22.1,28.2,13.0,12.058370
geoId/5553000,Milwaukee,577222,38.9,28.9,38.8,31.2,29.9,16.6,17.109574
geoId/5566000,Racine,77816,43.2,28.8,36.7,31.6,33.1,15.1,14.787989
geoId/5584250,Waukesha,71158,28.6,19.3,29.5,26.3,30.8,12.5,15.173866


Correlated Model Weights and Intercept:


const                                     9.03472
Count_Person                             -0.00000
Percent_Person_PhysicalInactivity         0.41299
Percent_Person_SleepLessThan7Hours       -0.12546
Percent_Person_WithHighBloodPressure      0.47722
Percent_Person_WithHighCholesterol       -0.25037
Percent_Person_WithMentalHealthNotGood    0.81148
Correlated Variable                      -0.10457
dtype: float64

**Q3.2A)** Compare the new weights of the correlated model with the weights of our original model. What happened to the weight corresponding to `Percent_Person_WithMentalHealthNotGood`?

**Q3.2B)** Thinking back to your answers for Q3.1C-E, how might correlated variables affect the interpretation of model weights?