In [1]:
# nbviz (course display helper) + common imports
try:
    from hickernellclasslib.nbviz import nbviz
except Exception:
    try:
        from nbviz import nbviz
    except Exception as e:
        nbviz = None
        print("nbviz not available in this environment:", e)

if nbviz is not None:
    nbviz()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf


nbviz not available in this environment: No module named 'nbviz'


# Health and Wealth: Life Expectancy vs GDP per Capita (Gapminder)

**Question.** Across countries, do richer countries tend to have higher life expectancy?

We'll use the Gapminder country dataset and focus on one year (2007) to keep this a clean **cross-sectional** linear regression example.

We will fit the model
$$
\text{lifeExp}_i = \beta_0 + \beta_1 \log(\text{gdpPercap}_i) + \varepsilon_i.
$$

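Why $\log(\text{gdpPercap})$? GDP per capita is strongly right-skewed: a few very rich countries sit far above the rest. Taking natural logs compresses that upper tail, which often makes the relationship with life expectancy closer to linear, and it lets us read the slope in terms of proportional changes in GDP per capita.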
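
The effect of the log transform on skewness can be sketched with synthetic data. The lognormal sample below is an assumption standing in for GDP per capita (it is not the Gapminder data), and `skewness` is a small helper defined here only for illustration.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    centered = values - values.mean()
    return (centered**3).mean() / values.std()**3

# Synthetic, right-skewed "income-like" data (NOT Gapminder values).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=8.5, sigma=1.0, size=5000)

skew_raw = skewness(income)
skew_log = skewness(np.log(income))

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

The raw values should show a large positive skewness, while the logged values should be close to symmetric.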


## Load data (Gapminder)

In [None]:
# We'll load Gapminder from plotly's built-in dataset.
# If plotly isn't installed, you'll get an import error; in that case install it or load a CSV copy.
import plotly.express as px

gap = px.data.gapminder()
gap.head()


## Filter to year 2007

In [None]:
df = gap.query("year == 2007").copy()

# Create log GDP per capita
df["log_gdpPercap"] = np.log(df["gdpPercap"])

df[["country","continent","year","lifeExp","gdpPercap","log_gdpPercap","pop"]].head()


## Plot: life expectancy vs log(GDP per capita)

In [None]:
plt.figure(figsize=(7,5))
plt.scatter(df["log_gdpPercap"], df["lifeExp"])
plt.xlabel("log(GDP per capita)")
plt.ylabel("Life expectancy (years)")
plt.title("Gapminder 2007: Health and Wealth")
plt.show()


## Fit the linear regression model

In [None]:
model = smf.ols("lifeExp ~ log_gdpPercap", data=df).fit()
model.summary()
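
For a single predictor, the OLS estimates have a simple closed form: the slope is the sample covariance of predictor and response divided by the sample variance of the predictor, and the intercept makes the line pass through the point of means. The sketch below checks this on synthetic data (made-up values, not Gapminder) against `np.polyfit`. It illustrates what the fit computes, not how statsmodels implements it internally.

```python
import numpy as np

# Synthetic data standing in for (log GDP per capita, life expectancy); values are made up.
rng = np.random.default_rng(1)
x = rng.uniform(6.0, 11.0, size=100)
y = 5.0 + 7.0 * x + rng.normal(scale=5.0, size=100)

# Closed-form simple-regression estimates.
x_centered = x - x.mean()
slope = (x_centered * (y - y.mean())).sum() / (x_centered**2).sum()
intercept = y.mean() - slope * x.mean()

# Cross-check against NumPy's degree-1 least-squares polynomial fit
# (coefficients are returned highest degree first: slope, then intercept).
slope_check, intercept_check = np.polyfit(x, y, deg=1)

print(slope, intercept)
print(slope_check, intercept_check)
```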


### Interpreting the slope

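Because the predictor is $\log(\text{gdpPercap})$ (natural log), the slope $\beta_1$ is the expected change in life expectancy, in years, per one-unit increase in $\log(\text{gdpPercap})$. A one-unit increase in the log corresponds to multiplying GDP per capita by $e \approx 2.72$, so $\beta_1$ describes the effect of a **multiplicative** change in GDP per capita rather than an additive one.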

Two handy interpretations:

- A **doubling** of GDP per capita changes $\log(\text{gdpPercap})$ by $\log(2)$, so the expected change in life expectancy is $\beta_1 \log(2)$ years.
- A **10% increase** changes $\log(\text{gdpPercap})$ by $\log(1.10)$, so the expected change is $\beta_1 \log(1.10)$ years.
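
Both interpretations follow from the same short derivation. Writing $x$ for a country's GDP per capita and $k > 0$ for the multiplicative factor (both symbols are introduced here for illustration), the model gives

$$
\begin{aligned}
\Delta \text{lifeExp} &= \bigl[\beta_0 + \beta_1 \log(kx)\bigr] - \bigl[\beta_0 + \beta_1 \log(x)\bigr] \\
&= \beta_1 \bigl[\log(k) + \log(x) - \log(x)\bigr] = \beta_1 \log(k).
\end{aligned}
$$

Setting $k = 2$ gives the doubling effect and $k = 1.10$ the 10% effect. The result does not depend on the starting level $x$. Here $\log$ is the natural logarithm, matching `np.log`.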


In [None]:
beta1 = model.params["log_gdpPercap"]

effect_double = beta1 * np.log(2)
effect_10pct  = beta1 * np.log(1.10)

effect_double, effect_10pct


## Plot with fitted regression line

In [None]:
x = df["log_gdpPercap"]
y = df["lifeExp"]

# Evaluate the fitted line on an evenly spaced grid spanning the observed x range
x_grid = np.linspace(x.min(), x.max(), 200)
y_hat  = model.params["Intercept"] + model.params["log_gdpPercap"] * x_grid

plt.figure(figsize=(7,5))
plt.scatter(x, y)
plt.plot(x_grid, y_hat)
plt.xlabel("log(GDP per capita)")
plt.ylabel("Life expectancy (years)")
plt.title("Gapminder 2007: OLS fit")
plt.show()


## Diagnostics: residuals vs fitted values

In [None]:
fitted = model.fittedvalues
resid  = model.resid

plt.figure(figsize=(7,5))
plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
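
When reading this plot, the thing to look for is structure: under a well-specified linear model the residuals scatter around zero with no trend. As a hedged sketch of what a problem looks like, the example below fits a straight line to hypothetical noiseless data with a curved relationship (not Gapminder) and inspects the sign of the residuals.

```python
import numpy as np

# Hypothetical data with a curved (quadratic) relationship and no noise.
x = np.linspace(0.0, 2.0, 21)
y = x**2

# Fit a straight line by least squares, then compute fitted values and residuals.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Residuals at the low end, the middle, and the high end of the fitted values.
print(resid[0], resid[len(x) // 2], resid[-1])
```

The residuals are positive at both ends and negative in the middle. Plotted against the fitted values, that U-shape signals the straight line is missing curvature.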


## Diagnostics: normal Q–Q plot (optional)

In [None]:
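# fit=True standardizes the residuals (fitted location and scale),
# so the 45-degree line is the reference for normality.
sm.qqplot(resid, line="45", fit=True)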
plt.title("Normal Q–Q plot of residuals")
plt.show()


## Which countries are above/below the fitted line?

In [None]:
df2 = df.copy()
df2["fitted"] = fitted
df2["resid"] = resid

# Most negative residuals: life expectancy furthest below the fitted line
df2.sort_values("resid").head(10)[["country","continent","lifeExp","gdpPercap","resid"]]


In [None]:
# Most positive residuals: life expectancy furthest above the fitted line
df2.sort_values("resid", ascending=False).head(10)[["country","continent","lifeExp","gdpPercap","resid"]]


## Extension: add continent indicators (optional)

This model asks: after accounting for GDP per capita, are there systematic differences in life expectancy across continents?

$$
\text{lifeExp}_i = \beta_0 + \beta_1 \log(\text{gdpPercap}_i) + \gamma_{\text{continent}(i)} + \varepsilon_i.
$$
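
In the formula interface, a categorical term is by default expanded into 0/1 indicator columns with one level left out as the baseline, so each continent coefficient is an intercept shift relative to that baseline. The sketch below builds such a design matrix by hand on hypothetical toy data (three made-up groups, no noise) and recovers the offsets with NumPy least squares. The group names and numbers are assumptions chosen for illustration.

```python
import numpy as np

# Toy data (hypothetical): three groups "A", "B", "C" with group A as the baseline.
group = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
x = np.array([1.0, 2.0, 4.0, 1.5, 3.0, 3.5, 2.0, 2.5, 5.0])

# Noiseless response: common slope 5, baseline intercept 50, offsets +3 (B) and -2 (C).
offset = np.where(group == "B", 3.0, np.where(group == "C", -2.0, 0.0))
y = 50.0 + 5.0 * x + offset

# Design matrix: intercept, x, and one 0/1 indicator per non-baseline group.
X = np.column_stack([
    np.ones_like(x),
    x,
    (group == "B").astype(float),
    (group == "C").astype(float),
])

# Least-squares coefficients, ordered as [intercept, slope, offset_B, offset_C].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 6))
```

Because the baseline group has no indicator column, its level is absorbed into the intercept, and the remaining coefficients are differences from that group at the same value of `x`.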


In [None]:
model_cont = smf.ols("lifeExp ~ log_gdpPercap + C(continent)", data=df).fit()
model_cont.summary()


## Discussion / homework questions

1. What does the slope in the simple model mean in words?
2. Use the fitted model to estimate the expected difference in life expectancy between two countries whose GDP per capita differs by a factor of 2.
3. Look at the countries with the largest positive/negative residuals. Pick one and offer a plausible explanation (policy, geography, conflict, inequality, etc.).
4. In the continent-extended model, how do the continent coefficients change the story?
5. Why does this analysis not establish that higher GDP **causes** longer life expectancy?
