
In [2]:
import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=666)

# Let's create some fake data with a categorical variable
n = 2170

X = rs.choice(["A", "B"], size=n)

y = 10 - 3 * (X == "A") + rs.normal(size=n)

df = pd.DataFrame(
  {
    "y": y,
    "X": X,
  }
)

df

,y,X
0,7.473104,A
1,7.127036,A
2,9.510308,B
3,7.111323,A
4,6.026016,A
...,...,...
2165,7.320367,A
2166,11.326728,B
2167,5.324269,A
2168,5.759181,A


Fit a simple linear model with one categorical covariate to demonstrate that
the fitted coefficients encode the conditional means of `y` given `X`.
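
A short derivation may help show why this should hold. It is a sketch under one assumption: the categorical variable enters the model as an indicator $d_i$ that equals 1 when $X_i = \text{B}$ and 0 when $X_i = \text{A}$. The symbols $d_i$, $\beta_0$, $\beta_1$, $\bar{y}_A$ and $\bar{y}_B$ are notation introduced here, not taken from the notebook. Ordinary least squares minimizes the sum of squared errors, which splits into one sum per group:

$$
\begin{aligned}
\mathrm{SSE}(\beta_0, \beta_1)
  &= \sum_{i} \left(y_i - \beta_0 - \beta_1 d_i\right)^2 \\
  &= \sum_{i:\, X_i = \text{A}} \left(y_i - \beta_0\right)^2
   + \sum_{i:\, X_i = \text{B}} \left(y_i - (\beta_0 + \beta_1)\right)^2
\end{aligned}
$$

The first sum is smallest when $\beta_0$ equals the mean of $y$ in group A, and the second is smallest when $\beta_0 + \beta_1$ equals the mean of $y$ in group B. Both conditions can be met at once, so

$$
\hat{\beta}_0 = \bar{y}_A, \qquad \hat{\beta}_1 = \bar{y}_B - \bar{y}_A .
$$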

In [3]:
import statsmodels.formula.api as smf

model = smf.ols(formula="y ~ X", data=df)
fitted_model = model.fit()
fitted_model.summary()

# Note that statsmodels internally parameterizes the categorical variable
# as an indicator of whether the value of X == "B" or not

Dep. Variable:,y,R-squared:,0.71
Model:,OLS,Adj. R-squared:,0.71
Method:,Least Squares,F-statistic:,5310.0
Date:,"Sun, 10 Nov 2024",Prob (F-statistic):,0.0
Time:,18:37:34,Log-Likelihood:,-3025.9
No. Observations:,2170,AIC:,6056.0
Df Residuals:,2168,BIC:,6067.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.9534,0.030,234.618,0.000,6.895,7.012
X[T.B],3.0541,0.042,72.868,0.000,2.972,3.136

Omnibus:,0.116,Durbin-Watson:,2.064
Prob(Omnibus):,0.944,Jarque-Bera (JB):,0.124
Skew:,0.018,Prob(JB):,0.94
Kurtosis:,2.99,Cond. No.,2.62
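
To make the `X[T.B]` indicator concrete, here is a minimal sketch that builds such a design matrix by hand and solves the least-squares problem with NumPy. The four data points and the names `design`, `intercept` and `coef_b` are hypothetical and chosen for illustration. This is not how statsmodels builds its design matrix internally, only an equivalent hand-rolled version.

```python
import numpy as np

# Tiny hand-made data set (hypothetical values, not from the notebook)
X = np.array(["A", "A", "B", "B"])
y = np.array([6.0, 8.0, 9.0, 11.0])

# Indicator coding with "A" as the reference level:
# a column of ones (the intercept) plus a 0/1 column for X == "B"
design = np.column_stack([np.ones(len(X)), (X == "B").astype(float)])

# Ordinary least squares via NumPy
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef_b = beta

print(design)
print(intercept, coef_b)
```

The group means here are 7 for `"A"` and 10 for `"B"`, so the fitted intercept should come out at approximately 7 and the indicator coefficient at approximately 3.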


In [4]:
# Let's look at the fitted parameters
fitted_model.params

Intercept,6.953426
X[T.B],3.054149


In [5]:
# Compare these parameter estimates to the _mean_ of y, for each different value of X
df.groupby("X").agg(mean_y=("y", "mean"))

X,mean_y
A,6.953426
B,10.007575


In [6]:
# Notice that the mean of y when X[T.B] == 0 (i.e., X == "A") is the intercept.
# Index by label: integer keys on a label-indexed Series are deprecated in pandas.
fitted_model.params["Intercept"]

6.953426117133354

In [7]:
# And when X[T.B] == 1 (i.e., X == "B"), the mean is the sum of the parameters:
fitted_model.params.sum()

10.007575152368565

In other words, the `Intercept` is the conditional mean of `y` when `X` takes its reference value (`"A"`), and the `X[T.B]` coefficient is the _difference_ in conditional means between `X="B"` and `X="A"`. The conditional mean for `X="B"` is therefore the sum of the two parameters.
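
The same relationship can be checked without statsmodels. The sketch below simulates data in the same way as the notebook, but with an arbitrary seed and sample size, so the numbers will differ from those above. It fits two hand-built parameterizations with `np.linalg.lstsq`: intercept plus indicator, and one indicator per level with no intercept. The second is the one that should return the conditional means directly. In formula syntax it would presumably correspond to dropping the intercept, e.g. `y ~ 0 + X`, though that is not run here. The variable names are illustrative.

```python
import numpy as np

rs = np.random.RandomState(seed=0)  # seed chosen arbitrarily for this sketch
n = 1000
X = rs.choice(["A", "B"], size=n)
y = 10 - 3 * (X == "A") + rs.normal(size=n)

# Conditional means computed directly from the data
mean_a = y[X == "A"].mean()
mean_b = y[X == "B"].mean()

# Parameterization 1: intercept + indicator for "B"
treatment = np.column_stack([np.ones(n), X == "B"]).astype(float)
beta_treatment, *_ = np.linalg.lstsq(treatment, y, rcond=None)

# Parameterization 2: one indicator per level, no intercept
cell_means = np.column_stack([X == "A", X == "B"]).astype(float)
beta_cell, *_ = np.linalg.lstsq(cell_means, y, rcond=None)

# Intercept is the mean for "A"; the slope is the difference in means
print(np.allclose(beta_treatment, [mean_a, mean_b - mean_a]))
# Without an intercept, each coefficient is a conditional mean
print(np.allclose(beta_cell, [mean_a, mean_b]))
```

Both parameterizations describe the same fitted values. They differ only in whether the second coefficient is reported as a mean or as a difference from the reference level.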