
#### What is `statsmodels`?

`statsmodels` is a library focused on **traditional (frequentist)** statistical modeling. It's great for:
- Fitting statistical models (like regression)
- Hypothesis testing
- Data analysis and visualization (from a statistics perspective)

#### Models You Can Use in `statsmodels`

Here are some common types of models provided:

1. **Linear Models**  
   - Standard OLS (Ordinary Least Squares) regression
   - Robust linear models (resistant to outliers)
   - Generalized Linear Models (GLMs) like logistic regression

2. **Linear Mixed Effects Models**  
   - Useful when data has both fixed and random effects (e.g., repeated measures)

3. **ANOVA (Analysis of Variance)**  
   - Used to compare means across multiple groups

4. **Time Series Models**  
   - ARIMA, SARIMA, state space models, etc.

5. **Generalized Method of Moments (GMM)**  
   - A flexible estimation technique used often in econometrics

#### Key Tools in `statsmodels`

You'll often use:

- **pandas DataFrames**: to provide structured tabular data
- **Patsy formulas**: R-style string syntax (`y ~ x1 + x2`) for defining model equations
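
A minimal sketch of how the two tools combine: a formula string refers to DataFrame columns by name. The column names `x1`, `x2`, `y`, the coefficients, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical data set with two predictors and one response
df = pd.DataFrame({
    "x1": rng.standard_normal(50),
    "x2": rng.standard_normal(50),
})
df["y"] = 1.0 + 0.5 * df["x1"] - 0.25 * df["x2"] + 0.1 * rng.standard_normal(50)

# The formula names columns of df; an intercept term is included automatically
results = smf.ols("y ~ x1 + x2", data=df).fit()
print(results.params)  # a Series indexed by Intercept, x1, x2
```

Because the formula interface adds the intercept for you, there is no need to build a column of ones by hand.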


## [ Estimating Linear Models ]
- `statsmodels` offers several kinds of linear regression models, from basic ones (e.g., ordinary least squares) to more complex ones (e.g., iteratively reweighted least squares).

- Linear models in `statsmodels` have two main interfaces: array-based and formula-based. They are accessed through the following API module imports:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
```

```python
# To show how to use these, we generate a linear model from some random data.

# Seed the generator to make the example reproducible
rng = np.random.default_rng(seed=12345)

def dnorm(mean, variance, size=1):
    """Draw normally distributed samples with the given mean and variance."""
    if isinstance(size, int):
        size = (size,)
    # Pass the shape tuple directly; unpacking it fails for multi-dimensional sizes
    return mean + np.sqrt(variance) * rng.standard_normal(size)

N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)  # noise term
beta = [0.1, 0.3, 0.5]       # "true" coefficients

y = np.dot(X, beta) + eps
```

Here, I wrote down the “true” model with known parameters `beta`. In this case, `dnorm` is a helper function for generating normally distributed data with a particular mean and variance. So now we have:

```python
X[:5]
```

```
array([[-0.90050602, -0.18942958, -1.0278702 ],
       [ 0.79925205, -1.54598388, -0.32739708],
       [-0.55065483, -0.12025429,  0.32935899],
       [-0.16391555,  0.82403985,  0.20827485],
       [-0.04765129, -0.21314698, -0.04824364]])
```

```python
y[:5]
```

```
array([-0.59952668, -0.58845445,  0.18563386, -0.00747657, -0.01537445])
```