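# Detection

## Overall check

When you call `summary`, statsmodels itself flags possible multicollinearity; see note 2 in the `Notes` section of the output of the code below.

For more, see the [official documentation](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html#Multicollinearity), which describes more detailed methods.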

```python
import statsmodels.api as sm
from statsmodels.datasets.longley import load_pandas

# The Longley dataset is well known to have high multicollinearity, that is,
# the exogenous predictors are highly correlated. This is problematic because
# it can affect the stability of the coefficient estimates when the model
# specification changes slightly.
data = load_pandas()
y = data.endog
X = sm.add_constant(data.exog)

ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
print(ols_results.summary())
```

```
                            OLS Regression Results                            
Dep. Variable:                 TOTEMP   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Sun, 13 Feb 2022   Prob (F-statistic):           4.98e-10
Time:                        12:56:02   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3.482e+06    8.9e+05     -3.911      0.0
```
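
As far as I understand, the multicollinearity note in the summary is based on the condition number of the design matrix. The sketch below only illustrates what that quantity measures. It uses synthetic data rather than the Longley dataset, and the helper name `condition_number` and all variable names are made up for this illustration.

```python
import numpy as np

# Synthetic design matrices (illustrative data, not the Longley dataset).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 1e-4 * rng.normal(size=n)  # almost an exact copy of x1


def condition_number(X):
    """Ratio of the largest to the smallest singular value of X."""
    singular_values = np.linalg.svd(X, compute_uv=False)
    return singular_values.max() / singular_values.min()


# Both matrices contain a constant column plus two predictors.
X_ok = np.column_stack([np.ones(n), x1, x2_independent])
X_bad = np.column_stack([np.ones(n), x1, x2_collinear])

cond_ok = condition_number(X_ok)
cond_bad = condition_number(X_bad)
print(f"independent predictors:    {cond_ok:.1f}")
print(f"near-collinear predictors: {cond_bad:.1f}")
```

A near-duplicate column pushes the smallest singular value toward zero, so the ratio grows by several orders of magnitude. On a fitted statsmodels model, the same kind of quantity should be available as `ols_results.condition_number`.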



## Per-variable check with `vif`

If $vif > 10$, the variable is considered to show `severe multicollinearity`.
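
For reference, the usual textbook definition of the variance inflation factor can be written as follows. The symbols $x_j$ and $R_j^2$ are notation introduced here for illustration: $R_j^2$ is the $R^2$ obtained by regressing predictor $x_j$ on all the other predictors.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Under this definition, $vif > 10$ is equivalent to $R_j^2 > 0.9$, meaning that more than 90% of the variance of $x_j$ is explained by the other predictors.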

Both examples below use `variance_inflation_factor`. In practice, first build X with an `R-style formula`, then pass it to the function.

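
One possible way to do the formula step is sketched here, assuming that "R-style formula" refers to a patsy formula and that `patsy` is installed alongside statsmodels. The data frame is the same small one as in the first example below; the formula string and variable names are illustrative choices.

```python
import pandas as pd
from patsy import dmatrix
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data; the column names are placeholders.
df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

# Right-hand-side-only formula; patsy adds an 'Intercept' column by default.
X = dmatrix('a + b + c + d', data=df, return_type='dataframe')

# One VIF per column of the design matrix, intercept included.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
    name='VIF',
)
print(vif)
```

Building X from a formula means interactions or transformed terms written in the formula end up as columns of the design matrix, so each of them gets its own VIF.
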
### Example 1

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame(
    {'a': [1, 1, 2, 3, 4],
     'b': [2, 2, 3, 2, 1],
     'c': [4, 6, 7, 8, 9],
     'd': [4, 3, 4, 5, 4]}
)

# Add the intercept column, then compute one VIF per column of X
X = add_constant(df)
pd.Series([variance_inflation_factor(X.values, i)
           for i in range(X.shape[1])],
          index=X.columns)
```


```
const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
dtype: float64
```

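### Example 2

A lightly modified version of the library source (which is only a few lines long), more convenient for working with pandas data. Source: [stackoverflow](https://stackoverflow.com/questions/42658379/variance-inflation-factor-in-python)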

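```python
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.tools.tools import add_constant


def variance_inflation_factors(exog_df):
    '''
    Parameters
    ----------
    exog_df : dataframe, (nobs, k_vars)
        design matrix with all explanatory variables, as for example used in
        regression.

    Returns
    -------
    vif : Series
        variance inflation factors
    '''
    # Add an intercept column so each auxiliary regression includes a constant
    exog_df = add_constant(exog_df)
    # Regress every column on all the others and convert R^2 into a VIF
    vifs = pd.Series(
        [1 / (1. - OLS(exog_df[col].values,
                       exog_df.loc[:, exog_df.columns != col].values).fit().rsquared)
         for col in exog_df],
        index=exog_df.columns,
        name='VIF'
    )
    return vifs
```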

Usage:

```python
>>> variance_inflation_factors(df)
const    136.875
a         22.950
b          3.000
c         12.950
d          3.000
Name: VIF, dtype: float64
```

# Remedies

To be continued

## Forward Selection (forward stepwise regression)

## Backward Elimination (backward stepwise regression)