<a href="https://colab.research.google.com/github/datacommonsorg/api-python/blob/colabs/Regression_Evaluation_and_Interpretation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2025 Google LLC.
SPDX-License-Identifier: Apache-2.0

# Regression: Evaluation and Interpretation
In the [previous notebook](https://github.com/datacommonsorg/api-python/blob/master/notebooks/v2/intro_data_science/Regression_Basics_and_Prediction.ipynb), we saw how powerful regression can be as a tool for prediction. In this Colab, we'll take that exploration one step further: what can regression models tell us about the statistical relationships between variables?

In particular, this Colab will take a more rigorous statistical approach to regression. We'll look at how to evaluate and interpret our regression models using statistical methods.

## Learning objectives:
* Hypothesis testing with regression
* Regression tables
* Pearson correlation coefficient, $r$
* $R^2$ and adjusted $R^2$
* Interpreting weights and intercepts
* How correlated variables affect models
---
**Need extra help?**

If you're new to Google Colab, take a look at [this getting started tutorial](https://colab.research.google.com/notebooks/intro.ipynb).

To build more familiarity with the Data Commons API, check out these [Data Commons tutorials](https://docs.datacommons.org/api/python/v2/tutorials.md).

And for help with Pandas and manipulating data frames, take a look at the [Pandas documentation](https://pandas.pydata.org/docs/reference/index.html).

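We'll be fitting our models with the [statsmodels](https://www.statsmodels.org/stable/index.html) library today, and we also import evaluation metrics from scikit-learn. The scikit-learn documentation can be found [here](https://scikit-learn.org/stable/modules/classes.html).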

As usual, if you have any other questions, please reach out to your course staff!

## Getting set up


Run the following code boxes to load the Python libraries and data we'll be using today.

In [None]:
# Setup/Imports
!pip install "datacommons-client[Pandas]" --upgrade --quiet

In [None]:
# Data Commons Python and Pandas APIs
from datacommons_client.client import DataCommonsClient
client = DataCommonsClient(api_key="your API key")

# For manipulating data
import numpy as np
import pandas as pd

# For implementing models and evaluation methods
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels import api as sm


# For plotting/printing
from matplotlib import pyplot as plt
import seaborn as sns

### The data

In this assignment, we'll be returning to the scenario we started in the previous notebook. As a refresher, we'll be exploring how obesity rates vary with different health or societal factors across US cities.

Our data science question: **What can we learn about the relationship of those health and lifestyle factors to obesity rates?**

In [None]:
# Load the data we'll be using

# Fetch the population of the US cities
city_pop = client.observation.fetch_observations_by_entity_type(
    date="latest",
    parent_entity="country/USA",
    entity_type="City",
    variable_dcids="Count_Person",
    filter_facet_ids="2176550201" # USCensusPEP_Annual_Population
).byVariable["Count_Person"].byEntity
city_pop_dict = {
    city: data["orderedFacets"][0].observations[0].value
    for city, data in city_pop.items()
    }

# Filter to the top 500 cities
cities = [
    item[0]
    for item in sorted(
        city_pop_dict.items(),
        key=lambda item: item[1],
        reverse=True)[:500]
    ]

# We've compiled a list of some nice Data Commons Statistical Variables
# to use as features for you
stat_vars_to_query = [
  "Count_Person",
  "Percent_Person_PhysicalInactivity",
  "Percent_Person_SleepLessThan7Hours",
  "Percent_Person_WithHighBloodPressure",
  "Percent_Person_WithMentalHealthNotGood",
  "Percent_Person_WithHighCholesterol",
  "Percent_Person_Obesity"

]

# Query Data Commons for the data
raw_features_df = client.observations_dataframe(
    variable_dcids=stat_vars_to_query,
    date="latest",
    entity_dcids=cities)

# Filter to highest ranked facet for each entity and variable
df = raw_features_df.copy(deep=True)
df = df.groupby(["entity", "entity_name", "variable"]).first().reset_index()

# Select required columns and pivot by variable
df = df[["entity", "entity_name", "variable", "value"]]
df = df.pivot(index=["entity", "entity_name"], columns="variable", values="value")
df = df.dropna()

# Rename columns and order alphabetically
df = df.reset_index()
df.rename(columns={"entity":"place", "entity_name": "City Name"}, inplace=True)
df.set_index("place", inplace=True)
df = df.reindex(sorted(df.columns), axis=1)

# Display results
display(df)

place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood
geoId/0103076,Auburn,82025.0,33.0,23.6,36.0,34.3,30.6,17.8
geoId/0107000,Birmingham,196644.0,44.9,32.9,42.9,45.0,31.6,19.7
geoId/0135896,Hoover,92448.0,32.5,19.7,33.6,32.6,31.0,15.4
geoId/0137000,Huntsville,225564.0,37.5,24.0,40.0,36.5,31.6,18.0
geoId/0150000,Mobile,182595.0,44.2,28.7,43.4,39.8,32.5,19.9
...,...,...,...,...,...,...,...,...
geoId/5531000,Green Bay,105744.0,38.9,26.7,33.1,28.1,30.7,17.9
geoId/5539225,Kenosha,98211.0,43.7,23.8,36.6,29.9,30.0,18.6
geoId/5548000,Madison,280305.0,32.1,18.7,29.9,26.6,28.5,15.6
geoId/5553000,Milwaukee,561385.0,43.4,28.8,40.0,36.7,30.1,19.0


### The model

Run the following code box to fit an [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression model to our data.

In [None]:
# Fit a regression model
dep_var = "Percent_Person_Obesity"
y = df[dep_var].to_numpy().reshape(-1, 1)
x = df.loc[:, ~df.columns.isin([dep_var, "City Name"])]
x = sm.add_constant(x)


model = sm.OLS(y, x)
results = model.fit()

## 0) Regression tables

When performing regression analyses, statistical packages will usually provide a _**regression table**_, which summarizes the results of the analysis.

Run the following code box to display the regression table for our original model. In this Colab, we'll go over some of the statistics included in the table.


In [None]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.758
Model:                            OLS   Adj. R-squared:                  0.755
Method:                 Least Squares   F-statistic:                     256.0
Date:                Thu, 22 May 2025   Prob (F-statistic):          1.18e-147
Time:                        16:02:06   Log-Likelihood:                -1275.0
No. Observations:                 498   AIC:                             2564.
Df Residuals:                     491   BIC:                             2593.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------

## 1) Hypothesis testing


### 1.1) Null hypotheses

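When performing statistical analyses, one usually starts with a statement of the null hypothesis. For regression models, the null hypothesis for a variable typically states that its coefficient equals zero, meaning the variable has no linear relationship with the dependent variable once the other variables are accounted for.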
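
As a sketch in symbols (the $\beta$ notation here is shorthand introduced for illustration, not something the regression table prints), suppose a model with $p$ independent variables is written as:

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon $$

Then the null and alternative hypotheses for the $j$-th variable take the form:

$$ H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0 $$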

**1.1)** Write out the null hypotheses for each of our independent variables.

### 1.2) T-test

So how do we test our null hypotheses? We use the [T-test](https://en.wikipedia.org/wiki/Student%27s_t-test#Slope_of_a_regression_line).
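
For reference, the t-statistic for a single coefficient is usually written as shown below. The symbols are assumptions introduced for illustration: $\hat{\beta}_j$ is the estimated coefficient (the `coef` column of a regression table) and $\mathrm{SE}(\hat{\beta}_j)$ is its standard error (the `std err` column).

$$ t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} $$

Under the usual OLS assumptions, and if the true coefficient is zero, $t_j$ follows a t-distribution with $n - p - 1$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of independent variables (this is the `Df Residuals` entry in the table). The `P>|t|` column reports the resulting two-sided p-value.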

Take a look at the regression table above to answer the following questions:

**1.2A)** According to the t-test, which variables are statistically significant?

**1.2B)** For variables that are not statistically significant, should we keep them in our model? Why or why not?

### 1.3) F-test

Beyond testing the significance of our individual variables independently, we can also test the significance of our model overall using the [F-test](https://en.wikipedia.org/wiki/F-test#Regression_problems). In particular, the F-test compares our model to one without predictors (aka, just an intercept). In other words, can our model do statistically better than just predicting the mean?
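
To make the comparison concrete, here is a minimal sketch of how the overall F-statistic relates to $R^2$ for a model with an intercept. The helper name `f_statistic` is our own, and the inputs are copied from the regression table above (498 observations, 6 independent variables).

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model that includes an intercept."""
    # Explained variance per predictor
    explained = r_squared / n_predictors
    # Unexplained variance per residual degree of freedom
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# Values copied from the regression table above
f_value = f_statistic(r_squared=0.7577718062114178, n_obs=498, n_predictors=6)
print(f"F = {f_value:.1f}")
```

The result should agree with the `F-statistic` entry in the table, up to rounding.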

Again use the regression table above to answer the following questions:

**1.3A)** What is the null hypothesis for the F-test?

**1.3B)** Can we reject the null hypothesis for our model?

## 2) Statistical measures

### 2.1) Correlation coefficient $r$

We can quantify predictiveness of variables using a _correlation coefficient_, a number that represents the degree to which two variables have a statistical relationship. The most common correlation coefficient used is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), also known as _Pearson's r_, which measures the strength of linear relationships between variables.

Mathematically, the correlation coefficient is defined as:
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}
$$

where $x$ and $y$ are the two variables.

Those of you with a statistics background might recognize this as the covariance of $x$ and $y$ divided by the product of their standard deviations.
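
The definition translates almost line for line into code. Below is a minimal sketch in plain Python; the function name `pearson_r` and the toy data are made up for illustration and are not part of the Data Commons dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the root sums of squared deviations
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)

# Made-up toy data, for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
print(round(pearson_r(toy_x, toy_y), 4))  # 0.7746
```

Note that the function divides by zero if either sequence is constant, since the correlation is undefined in that case.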

**2.1A)** Either using the mathematical definition or by exploring with code, explain what the correlation coefficient would be in the following cases:

A) $x = y$

B) $x = -y$

C) $x$ and $y$ are both normally distributed variables with mean 0 and variance 1, randomly sampled independently from each other.

In [None]:
"""
Optional cell for 2.1A
"""

# Hint: Try writing code to generate values for x and y, then either write or import
# a function to calculate the correlation coefficient

# Your code here


Now run the following code box to use the pandas `.corr()` method to calculate the correlation coefficient between each pair of our variables. Note that pandas outputs the results as a matrix.

In [None]:
# calculate correlation
df[stat_vars_to_query].corr()

variable,Count_Person,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithMentalHealthNotGood,Percent_Person_WithHighCholesterol,Percent_Person_Obesity
Count_Person,1.0,0.059668,0.073807,0.025619,-0.006579,0.0481,-0.032606
Percent_Person_PhysicalInactivity,0.059668,1.0,0.778834,0.744643,0.700776,0.436643,0.753156
Percent_Person_SleepLessThan7Hours,0.073807,0.778834,1.0,0.745474,0.619343,0.369433,0.657111
Percent_Person_WithHighBloodPressure,0.025619,0.744643,0.745474,1.0,0.690294,0.381626,0.825544
Percent_Person_WithMentalHealthNotGood,-0.006579,0.700776,0.619343,0.690294,1.0,0.214004,0.735612
Percent_Person_WithHighCholesterol,0.0481,0.436643,0.369433,0.381626,0.214004,1.0,0.299001
Percent_Person_Obesity,-0.032606,0.753156,0.657111,0.825544,0.735612,0.299001,1.0



**2.1B)** Explain why the diagonals of the matrix have the value 1.

**2.1C)** What is the correlation coefficient between `Count_Person` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between population and obesity rate?

**2.1D)** What is the correlation coefficient between `Percent_Person_PhysicalInactivity` and `Percent_Person_Obesity`? What does the correlation coefficient imply about the relationship between physical inactivity and obesity rate?

**2.1E)** In general, would you prefer to include features that correlate strongly with the dependent variable, or features with no correlation in a regression model?

**2.1F)** You find a new feature with correlation coefficient $r=-0.97$ between it and obesity rates. Would it be a good idea to add this new feature to your model?


### 2.2) $R^2$ score

To quantify how predictive a linear regression model is overall, we can use the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), $R^2$ (pronounced "R squared").

Mathematically, the $R^2$ score is defined as:

$$S_{residuals} = \sum_i{(y_i - f_i)^2} \\
S_{total} = \sum_i{(y_i - \bar{y})^2}\\
R^2 = 1 - \frac{S_{residuals}}{S_{total}}$$

where $y_i$s are the actual dependent variable values, $f_i$ are the predicted dependent variable values, and $\bar{y}$ is the average of the $y_i$'s.

Conceptually, the $R^2$ score is a measure of explained variance. If $R^2=0.75$, that means 75% of the variance in the dependent variable is accounted for by our model, while the remaining 25% is not.
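
As a minimal sketch of the definition, the snippet below computes $R^2$ by hand. The function name `r_squared` and the toy values are made up for illustration; in practice, `results.rsquared` or scikit-learn's `r2_score` does this for you.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination from actual and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals (actual minus predicted)
    ss_residuals = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    # Total sum of squares (actual minus mean)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_residuals / ss_total

# Made-up toy data, for illustration only
toy_actual = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(toy_actual, toy_predicted), 3))  # 0.98
```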

**2.2A)** Based on the mathematical definition, what is the range of possible values for $R^2$?

**2.2B)** Come up with a situation (e.g. what would the data look like) where:

A) $R^2 = 1.0$

B) $R^2 = 0.0$

Let's now analyze what the $R^2$ value is for our model.

In [None]:
# calculate R^2
print("Model R^2 =", results.rsquared)

Model R^2 = 0.7577718062114178


**2.2C)** Is the model's $R^2$ a "good" score?

**2.2D)** Can you think of any ways we can change our model that would improve the $R^2$ score?

### 2.3) Adjusted $R^2$

There's an issue with $R^2$ scores that one needs to be aware of when working with multiple independent variables: namely, that the number of independent variables used can affect the $R^2$ score.

Let's see this in practice. Let's create a new dataframe with an extra 100 dummy variables (randomly sampled from a 0-mean 1-variance normal distribution) tacked on.

In [None]:
# Pad our dataframe with more random variables
num_rows = len(df.index)
num_new_columns = 100
random_data = np.random.normal(loc=0, scale=1, size=(num_rows, num_new_columns))
new_column_names = [f"Random Variable {i}" for i in range(num_new_columns)]
random_data_df = pd.DataFrame(
    random_data,
    columns=new_column_names,
    index=df.index
)
df_padded = pd.concat([df, random_data_df], axis=1)
display(df_padded)

place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood,Random Variable 0,Random Variable 1,...,Random Variable 90,Random Variable 91,Random Variable 92,Random Variable 93,Random Variable 94,Random Variable 95,Random Variable 96,Random Variable 97,Random Variable 98,Random Variable 99
geoId/0103076,Auburn,82025.0,33.0,23.6,36.0,34.3,30.6,17.8,-1.564312,0.288515,...,0.762521,0.251051,-0.697129,-1.697195,0.399706,-0.557155,0.444760,1.787642,0.340410,1.535658
geoId/0107000,Birmingham,196644.0,44.9,32.9,42.9,45.0,31.6,19.7,-0.580159,0.849181,...,0.215843,1.553184,-1.766115,1.152941,0.712426,0.936660,0.576485,-0.127241,-0.543845,1.536037
geoId/0135896,Hoover,92448.0,32.5,19.7,33.6,32.6,31.0,15.4,-0.322616,-1.748737,...,2.036116,0.993741,-1.786077,-0.264808,-1.922278,-1.227397,-1.723762,0.847944,-0.446194,-0.320127
geoId/0137000,Huntsville,225564.0,37.5,24.0,40.0,36.5,31.6,18.0,0.768514,-0.534476,...,0.950064,0.730344,0.007471,3.514180,0.145648,-1.254448,0.275048,-1.241024,-0.163577,0.376057
geoId/0150000,Mobile,182595.0,44.2,28.7,43.4,39.8,32.5,19.9,0.207217,1.028760,...,-0.775507,1.338210,-0.395432,-0.830337,-0.558512,-0.367606,-1.049303,-3.161325,-0.586668,0.934307
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
geoId/5531000,Green Bay,105744.0,38.9,26.7,33.1,28.1,30.7,17.9,-0.104983,-0.856795,...,-0.945322,-0.219595,-2.113165,0.614379,0.110795,-0.250010,0.926896,-0.526254,-0.359181,-1.424956
geoId/5539225,Kenosha,98211.0,43.7,23.8,36.6,29.9,30.0,18.6,-0.355349,0.348573,...,-0.789575,0.590118,-0.193587,0.502188,0.124404,-0.376209,-0.331331,0.697165,1.029427,-1.143744
geoId/5548000,Madison,280305.0,32.1,18.7,29.9,26.6,28.5,15.6,-0.648119,0.025662,...,-0.151965,0.835380,-1.381286,0.303114,0.540398,-0.359988,0.007904,0.010788,-0.276071,0.979319
geoId/5553000,Milwaukee,561385.0,43.4,28.8,40.0,36.7,30.1,19.0,-0.154089,-0.339432,...,2.255458,1.357828,0.692794,0.924034,0.951688,-0.071096,0.097582,0.952135,-1.019633,-0.778193


Now let's fit a new model to the padded data and compare $R^2$ scores.

In [None]:
# New R^2
y_padded = df_padded[dep_var].to_numpy().reshape(-1, 1)
x_padded = df_padded.loc[:, ~df_padded.columns.isin([dep_var, "City Name"])]
x_padded = sm.add_constant(x_padded)

padded_model = sm.OLS(y_padded, x_padded)
padded_results = padded_model.fit()

print("Original Model R^2 =", results.rsquared)
print("Padded Model R^2 =", padded_results.rsquared)


Original Model R^2 = 0.7577718062114178
Padded Model R^2 = 0.7988444670439291


**2.3A)** Which model had a better $R^2$ score?

**2.3B)** Think about the variables used in each model. Should one model be much more predictive than another?

**2.3C)** In general, how would you expect $R^2$ to change as we increase the number of independent variables?



So how do we fix this? We can adjust our $R^2$ metric to account for the number of variables. The most common way to define the _**adjusted $R^2$**_ score is as follows:

$$R^{2}_{adj}=1-(1-R^{2}){n-1 \over n-p-1}$$

where $n$ is the number of data points and $p$ is the number of independent variables.
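
As a quick check of the formula, here is a minimal sketch that plugs in the values for our original model from the regression table above ($R^2$, 498 observations, 6 independent variables). The helper name `adjusted_r_squared` is our own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2, penalizing R^2 for the number of independent variables."""
    penalty = (n_obs - 1) / (n_obs - n_predictors - 1)
    return 1.0 - (1.0 - r_squared) * penalty

# Values copied from the regression table for the original model
original_adj = adjusted_r_squared(0.7577718062114178, n_obs=498, n_predictors=6)
print(f"{original_adj:.4f}")  # 0.7548
```

This agrees with the `Adj. R-squared` entry in the regression table, up to rounding.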

Now let's compare the adjusted $R^2$ of our models.

In [None]:
# Adjusted R^2
print("Original Model Adjusted R^2 =", results.rsquared_adj)
print("Padded Model Adjusted R^2 =", padded_results.rsquared_adj)

Original Model Adjusted R^2 = 0.7548117875500502
Padded Model Adjusted R^2 = 0.7443112535059662


**2.3D)** Which model had a better adjusted $R^2$ score?

**2.3E)** When would you prefer to use adjusted $R^2$ over $R^2$ to evaluate model fit?

## 3) Interpreting regression models


### 3.1) Analyzing weights and intercepts
The parameters of the regression model itself can also yield important insights.
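
As a reminder of how those parameters enter a prediction, here is a minimal sketch of a linear model's prediction step. The intercept, weights, feature names, and feature values are all hypothetical, chosen only to keep the arithmetic easy to follow; they are not the fitted values from our model.

```python
def predict(intercept, weights, features):
    """Linear model prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(weights[name] * features[name] for name in weights)

# Hypothetical intercept, weights, and feature values, for illustration only
toy_intercept = 1.0
toy_weights = {"feature_a": 0.5, "feature_b": -0.25}
toy_city = {"feature_a": 10.0, "feature_b": 4.0}

# 1.0 + 0.5 * 10.0 - 0.25 * 4.0 = 5.0
print(predict(toy_intercept, toy_weights, toy_city))
```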

Run the following code box to display the weights and intercept of our original model.

In [None]:
# Display weights/coefficients
display(results.params.round(5))

parameter,value
const,-0.19367
Count_Person,-0.0
Percent_Person_PhysicalInactivity,0.30528
Percent_Person_SleepLessThan7Hours,-0.12455
Percent_Person_WithHighBloodPressure,0.75717
Percent_Person_WithHighCholesterol,-0.1352
Percent_Person_WithMentalHealthNotGood,0.69012


**3.1A)** What is the intercept of our model? What are its units?

**3.1B)** What are the units on each of the model weights (aka coefficients)?

**3.1C)** Which variables matter most to our model?

**3.1D)** In words, describe what a weight/coefficient in a linear regression means.

**3.1E)** Our model is used to generate a predicted obesity rate for a fictional city named Dataopolis. If we increased `Percent_Person_WithMentalHealthNotGood` for Dataopolis by 1 unit, _while keeping the values for all remaining variables constant_, by how much would we expect our predicted obesity rate to change?

### 3.2) The effect of correlated variables

When interpreting weights, one thing to look out for is independent variables that are highly correlated with each other.

Let's illustrate why this might be a problem by adding a variable that is correlated with one of the existing variables.

In [None]:
# New variable correlated with Percent_Person_WithMentalHealthNotGood
correlated_df = df.copy()
target_var = "Percent_Person_WithMentalHealthNotGood"
noise = np.random.normal(size=(len(correlated_df.index),))
correlated_df["Correlated Variable"] = correlated_df[target_var] + noise

# show new data frame
print("New dataframe to fit:")
display(correlated_df)

# Create a new model
y_corr = correlated_df[dep_var].to_numpy().reshape(-1, 1)
x_corr = correlated_df.loc[:, ~correlated_df.columns.isin([dep_var, "City Name"])]
x_corr = sm.add_constant(x_corr)

correlated_model = sm.OLS(y_corr, x_corr)
correlated_results = correlated_model.fit()

print("Correlated Model Weights and Intercept:")
display(correlated_results.params.round(5))

New dataframe to fit:


place,City Name,Count_Person,Percent_Person_Obesity,Percent_Person_PhysicalInactivity,Percent_Person_SleepLessThan7Hours,Percent_Person_WithHighBloodPressure,Percent_Person_WithHighCholesterol,Percent_Person_WithMentalHealthNotGood,Correlated Variable
geoId/0103076,Auburn,82025.0,33.0,23.6,36.0,34.3,30.6,17.8,18.761300
geoId/0107000,Birmingham,196644.0,44.9,32.9,42.9,45.0,31.6,19.7,17.655787
geoId/0135896,Hoover,92448.0,32.5,19.7,33.6,32.6,31.0,15.4,14.736255
geoId/0137000,Huntsville,225564.0,37.5,24.0,40.0,36.5,31.6,18.0,16.549451
geoId/0150000,Mobile,182595.0,44.2,28.7,43.4,39.8,32.5,19.9,20.277958
...,...,...,...,...,...,...,...,...,...
geoId/5531000,Green Bay,105744.0,38.9,26.7,33.1,28.1,30.7,17.9,18.645080
geoId/5539225,Kenosha,98211.0,43.7,23.8,36.6,29.9,30.0,18.6,17.067335
geoId/5548000,Madison,280305.0,32.1,18.7,29.9,26.6,28.5,15.6,15.665917
geoId/5553000,Milwaukee,561385.0,43.4,28.8,40.0,36.7,30.1,19.0,19.073143


Correlated Model Weights and Intercept:


parameter,value
const,-0.28192
Count_Person,-0.0
Percent_Person_PhysicalInactivity,0.30604
Percent_Person_SleepLessThan7Hours,-0.12529
Percent_Person_WithHighBloodPressure,0.75756
Percent_Person_WithHighCholesterol,-0.13345
Percent_Person_WithMentalHealthNotGood,0.55372
Correlated Variable,0.13921


**3.2A)** Compare the new weights of the correlated model with the weights of our original model. What happened to the weights corresponding to `Percent_Person_WithMentalHealthNotGood`?

**3.2B)** Thinking back to your answers for Q3.1C-E, how might correlated variables affect the interpretation of model weights?