# Predictive Modeling

## Predicting $\hat{y}$

Most of the machine learning models we will cover focus on prediction. In this setting, we are not primarily interested in marginal effects (such as the increase in wages attributable to schooling), standard errors, or even interpretability. Instead, we focus on some measure of predictive accuracy, such as mean squared error on data the model has not seen. A black-box model is fine, and a simpler model might be preferred only when parsimony serves as a tiebreaker.

Let's focus on regression problems where $y$ is a continuous scalar value. For concreteness, say we are predicting midterm election vote share, as in {cite}`tufte1975determinants`. The $y$ variable is $\frac{\text{Votes for Incumbent's Party in House Races}}{\text{Total Votes in House Races}}$, and suppose we have a measure of economic growth as the single predictor variable $x$. We can use simple linear regression for this.



# Tufte

In [77]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

In [79]:
url = 'https://raw.githubusercontent.com/alexanderthclark/pols4728/refs/heads/main/data/Tufte_Midterms_Extended.csv'
df1 = pd.read_csv(url)

df1 

| | year | pres_party | vote_share_incumbent | normal_vote_prev8 | vote_loss | pres_approval | delta_rdi | in_original | rgdp_growth_yoy | cpi_inflation_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1938 | D | 50.82 | 57.18 | -6.36 | 57.0 | -82.0 | True | -3.3 | -2.8 |
| 1 | 1946 | D | 45.27 | 52.57 | -7.30 | 32.0 | -36.0 | True | -11.6 | 8.3 |
| 2 | 1950 | D | 50.04 | 52.04 | -2.00 | 43.0 | 99.0 | True | 8.7 | 1.3 |
| 3 | 1954 | R | 47.46 | 49.79 | -2.33 | 65.0 | -12.0 | True | -0.6 | 0.7 |
| 4 | 1958 | R | 43.90 | 49.83 | -5.93 | 56.0 | -13.0 | True | -0.7 | 2.8 |
| 5 | 1962 | D | 52.42 | 51.63 | 0.79 | 67.0 | 60.0 | True | 6.1 | 1.0 |
| 6 | 1966 | D | 51.33 | 53.06 | -1.73 | 48.0 | 96.0 | True | 6.6 | 2.9 |
| 7 | 1970 | R | 45.68 | 46.66 | -0.98 | 56.0 | 69.0 | True | 0.2 | 5.7 |
| 8 | 1974 | R | 41.50 | 46.26 | -4.77 | NaN | NaN | False | -0.5 | 11.0 |
| 9 | 1978 | D | 54.43 | 54.34 | 0.09 | NaN | NaN | False | 5.5 | 7.6 |


In [15]:
df

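| | year | vote_share | normal_vote | approval_pct | delta_rdi_dollars | std_loss |
|---|---|---|---|---|---|---|
| 0 | 1938 | 50.82 | 57.18 | 57 | -82 | -6.36 |
| 1 | 1946 | 45.27 | 52.57 | 32 | -36 | -7.30 |
| 2 | 1950 | 50.04 | 52.04 | 43 | 99 | -2.00 |
| 3 | 1954 | 47.46 | 49.79 | 65 | -12 | -2.33 |
| 4 | 1958 | 43.90 | 49.83 | 56 | -13 | -5.93 |
| 5 | 1962 | 52.42 | 51.63 | 67 | 60 | 0.79 |
| 6 | 1966 | 51.33 | 53.06 | 48 | 96 | -1.73 |
| 7 | 1970 | 45.68 | 46.66 | 56 | 69 | -0.98 |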
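
The `std_loss` column is the standardized vote loss: the vote share $V_i$ minus the normal vote $N_i^*$, using the symbols that appear in the code comments. As a quick worked check on the 1938 row (the subscript notation here is only illustrative):

$$
Y_i = V_i - N_i^{*}, \qquad Y_{1938} = 50.82 - 57.18 = -6.36 .
$$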


In [7]:
# Fit on the first eight rows of df1 (the midterms flagged in_original)
res = smf.ols(formula='vote_share_incumbent ~ delta_rdi', data=df1.head(8)).fit()
res.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | vote_share_incumbent | R-squared: | 0.105 |
| Model: | OLS | Adj. R-squared: | -0.044 |
| Method: | Least Squares | F-statistic: | 0.7037 |
| Date: | Thu, 14 Aug 2025 | Prob (F-statistic): | 0.434 |
| Time: | 13:28:53 | Log-Likelihood: | -19.675 |
| No. Observations: | 8 | AIC: | 43.35 |
| Df Residuals: | 6 | BIC: | 43.51 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 48.0161 | 1.228 | 39.098 | 0.000 | 45.011 | 51.021 |
| delta_rdi | 0.0154 | 0.018 | 0.839 | 0.434 | -0.030 | 0.060 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.365 | Durbin-Watson: | 2.255 |
| Prob(Omnibus): | 0.505 | Jarque-Bera (JB): | 0.651 |
| Skew: | 0.032 | Prob(JB): | 0.722 |
| Kurtosis: | 1.604 | Cond. No. | 71.0 |
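
To connect the fitted line back to the prediction goal, the sketch below refits the same one-predictor model with NumPy on the eight rows used in the regression above, then forms $\hat{y}$ and an in-sample mean squared error. The values are hard-coded so the block runs without network access. The names `x`, `y`, and `predict`, and the use of `np.polyfit` in place of statsmodels, are assumptions made for illustration and are not part of the original notebook. In-sample error is also likely to be an optimistic stand-in for accuracy on new elections.

```python
import numpy as np

# First eight midterms (1938-1970), copied from the data shown earlier.
x = np.array([-82, -36, 99, -12, -13, 60, 96, 69], dtype=float)          # delta_rdi
y = np.array([50.82, 45.27, 50.04, 47.46, 43.90, 52.42, 51.33, 45.68])   # vote_share_incumbent

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)


def predict(x_new):
    """Hypothetical helper: predicted vote share for a given delta_rdi."""
    return intercept + slope * np.asarray(x_new, dtype=float)


y_hat = predict(x)
mse_model = np.mean((y - y_hat) ** 2)      # in-sample MSE of the fitted line
mse_mean = np.mean((y - y.mean()) ** 2)    # baseline: always predict the mean

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"in-sample MSE: line = {mse_model:.3f}, mean-only = {mse_mean:.3f}")
```

The ratio of the two errors equals $1 - R^2$, so with $R^2 = 0.105$ the fitted line removes only about a tenth of the mean-only error on these eight points.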


In [14]:
# Tufte (1975) replication – standardized vote loss model
import pandas as pd
import statsmodels.api as sm

# ------------------------------------------------------------------
# Data from Tufte (1975) Table 1   (1942 excluded)
# ------------------------------------------------------------------
d = {
    "year":                  [1938, 1946, 1950, 1954, 1958, 1962, 1966, 1970],
    "vote_share":            [50.82, 45.27, 50.04, 47.46, 43.90, 52.42, 51.33, 45.68],   # V_i
    "normal_vote":           [57.18, 52.57, 52.04, 49.79, 49.83, 51.63, 53.06, 46.66],   # N_i*
    "approval_pct":          [57,    32,    43,    65,    56,    67,    48,    56],      # Gallup
    "delta_rdi_dollars":     [-82,   -36,   99,    -12,   -13,   60,    96,    69]       # $ change
}

df = pd.DataFrame(d)
df["std_loss"]         = df["vote_share"] - df["normal_vote"]     # Y_i

# ------------------------------------------------------------------
# OLS replication of Tufte Table 2
# ------------------------------------------------------------------
X = sm.add_constant(df[["approval_pct", "delta_rdi_dollars"]])
y = df["std_loss"]
tufte = sm.OLS(y, X).fit()
print(tufte.summary())         # R² ≈ .91, coeffs ≈ .133 and .035

                            OLS Regression Results                            
Dep. Variable:               std_loss   R-squared:                       0.912
Model:                            OLS   Adj. R-squared:                  0.876
Method:                 Least Squares   F-statistic:                     25.84
Date:                Thu, 14 Aug 2025   Prob (F-statistic):            0.00231
Time:                        13:33:15   Log-Likelihood:                -9.6628
No. Observations:                   8   AIC:                             25.33
Df Residuals:                       5   BIC:                             25.56
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const               -11.0830      1.81



# James-Stein

In [76]:
import numpy as np

rng = np.random.default_rng(0)

p = 3
theta = np.zeros(p)                       # true mean; the output below is for theta = 0
# theta = np.array([50, -1.0, 10])        # alternative: any fixed 3-vector

R = 1_000_000                             # Monte Carlo replications

# One-shot normal-means: Y ~ N(theta, I_p)
Y  = theta + rng.normal(size=(R, p))
r2 = np.sum(Y**2, axis=1)                 # squared norm of each draw

# James–Stein shrinkage factors (σ² = 1)
a  = 1 - (p - 2) / r2                     # untruncated JS
ap = np.maximum(0.0, a)                   # positive-part JS

# Squared-error losses, one per replication
mle_se = np.sum((Y - theta)**2, axis=1)
js_se  = np.sum(((a[:, None]  * Y) - theta)**2, axis=1)
jsp_se = np.sum(((ap[:, None] * Y) - theta)**2, axis=1)

print("Normal-means (p=3, σ²=1)")
print(f"MLE risk ≈ {mle_se.mean():.6f}")
print(f"JS  risk ≈ {js_se.mean():.6f}")
print(f"JS+ risk ≈ {jsp_se.mean():.6f}")


Normal-means (p=3, σ²=1)
MLE risk ≈ 2.998622
JS  risk ≈ 1.982706
JS+ risk ≈ 1.601375
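
The simulated values can be checked against known closed forms. With $Y \sim N(\theta, I_p)$ and squared-error loss, the MLE $Y$ has risk $p$, and the untruncated James–Stein estimator has risk $p - (p-2)^2\,E\!\left[1/\|Y\|^2\right]$. The short derivation below covers only the $\theta = 0$ case used in the simulation, where $\|Y\|^2 \sim \chi^2_p$ and $E[1/\chi^2_p] = 1/(p-2)$ for $p > 2$:

$$
\hat{\theta}^{JS} = \left(1 - \frac{p-2}{\|Y\|^2}\right) Y,
\qquad
R\big(\hat{\theta}^{JS}, 0\big) = p - (p-2)^2 \cdot \frac{1}{p-2} = 2 .
$$

For $p = 3$ this gives an MLE risk of $3$ and a James–Stein risk of $2$, in line with the Monte Carlo estimates of about $2.999$ and $1.983$. The James–Stein loss contains a $1/\|Y\|^2$ term whose variance is infinite when $p = 3$, so its Monte Carlo average converges slowly even with a million replications.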
