# Analyzing variable importance using logistic regression
### Christian Igel, 2021

This notebook demonstrates how to analyze variable importance using logistic regression. It reimplements in Python the R example from https://stats.idre.ucla.edu/r/dae/logit-regression/. It is reassuring to reproduce the R results.

It is important that we use the logistic regression from [statsmodels](https://www.statsmodels.org/) rather than the one from scikit-learn, because statsmodels reports standard errors and p-values for the coefficients, which the analysis below relies on. First, we preprocess the data ourselves using [pandas](https://pandas.pydata.org/); then we let [patsy](https://patsy.readthedocs.io) build the design matrix for us.

Any suggestions for improvements are more than welcome.

In [4]:
import pandas as pd
import statsmodels.api as sm

The task is to predict admission into graduate school based on GRE (Graduate Record Exam) scores, GPA (grade point average), and the prestige of the undergraduate institution. Let's load the data:

In [5]:
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

Let's inspect the data:

In [6]:
df.head(5)

| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |


In [7]:
df.describe()

| | admit | gre | gpa | rank |
|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 0.3175 | 587.7 | 3.3899 | 2.485 |
| std | 0.466087 | 115.516536 | 0.380567 | 0.94446 |
| min | 0.0 | 220.0 | 2.26 | 1.0 |
| 25% | 0.0 | 520.0 | 3.13 | 2.0 |
| 50% | 0.0 | 580.0 | 3.395 | 2.0 |
| 75% | 1.0 | 660.0 | 3.67 | 3.0 |
| max | 1.0 | 800.0 | 4.0 | 4.0 |


#### First version
In the first version, we do the main data preprocessing steps ourselves.

The goal is to predict/explain the binary variable `admit` given the other variables. Thus, we split data into input (predictor variables) and target (response/dependent variable):

In [None]:
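# Columns 1-3 are the predictors gre, gpa, and rank; copy so that later edits do not touch df
X = df.iloc[:, 1:4].copy()
# Column 0 is the binary target admit
y = df.iloc[:, 0]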

We treat the `rank`, which indicates the prestige of the undergraduate institution, as a categorical variable:

In [None]:
X["rank"] = X["rank"].astype('category')

In [None]:
X

In [None]:
display(y)

Now we transform the categorical variable into dummy (indicator) variables. Note that, in contrast to a full one-hot encoding, one column is dropped to avoid a linear dependency with the intercept.

In [None]:
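# dtype=float keeps the dummies numeric; recent pandas versions return booleans by default
X_transformed = pd.get_dummies(X, prefix=['rank'], drop_first=True, dtype=float)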
X_transformed
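
To see why one dummy column has to go, here is a minimal, library-free sketch. The helper `dummy_columns` is hypothetical and only mimics what `pd.get_dummies` does for a single variable. With all four `rank` levels encoded, the indicator columns sum to 1 in every row, which duplicates a constant intercept column; dropping the first level removes that dependency.

```python
def dummy_columns(values, levels, drop_first=False):
    """Encode values as 0/1 indicator columns, one per level (hypothetical helper)."""
    used = levels[1:] if drop_first else levels
    return {f"rank_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in used}

ranks = [3, 3, 1, 4, 4]   # the rank values of the first five rows shown earlier
levels = [1, 2, 3, 4]

full = dummy_columns(ranks, levels)
reduced = dummy_columns(ranks, levels, drop_first=True)

# Row-wise sums of the indicator columns
full_row_sums = [sum(col[i] for col in full.values()) for i in range(len(ranks))]
reduced_row_sums = [sum(col[i] for col in reduced.values()) for i in range(len(ranks))]

print(full_row_sums)     # all ones: identical to an intercept column
print(reduced_row_sums)  # 0 for rows with rank 1, the reference level
```

With the first level dropped, rank 1 becomes the reference category, so each remaining `rank` coefficient is read relative to rank 1.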

The logistic regression model we are using does not have a built-in intercept (bias/offset) parameter. Thus, we augment our input data with a constant dummy variable.

In [None]:
X_transformed = sm.add_constant(X_transformed)
X_transformed

Now we can compute and inspect the logistic regression model: 

In [None]:
logit_model = sm.Logit(y, X_transformed)
result = logit_model.fit()
print(result.summary2())

#### Second version

In the second version, we use a library function for creating design matrices:

In [None]:
from patsy import dmatrices

We can use a formula syntax to specify the design matrix, as in R. When a formula is used to specify the terms to include in the design matrix, a constant for the intercept term is included by default. The function `C` handles the encoding of the categorical variable for us:

In [None]:
y, X = dmatrices('admit ~ gre + gpa + C(rank)', df, return_type='dataframe')
X.head()

In [None]:
logit = sm.Logit(y, X)
result = logit.fit()
print(result.summary2())

#### Interpreting the results

The results table shows the coefficients, their standard errors, the z-statistics, the associated p-values (P), and the confidence intervals for the coefficient estimates. The z-statistic, sometimes called the Wald z-statistic, is computed here as the coefficient divided by its standard error; this is a concept we have not discussed in the lecture.
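
As a minimal sketch of how such a z-value and its two-sided p-value are obtained, assuming the statistic is compared against a standard normal distribution: the function name `wald_test` and the input numbers are illustrative only and are not taken from the fitted model.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z-statistic and its two-sided p-value under N(0, 1)."""
    z = coef / std_err
    # P(|Z| >= |z|) for Z ~ N(0, 1), written with the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers only, not results from this notebook
z, p = wald_test(coef=0.8, std_err=0.4)
print(f"z = {z:.3f}, p = {p:.4f}")
```

For a fitted statsmodels result, the same ratio is available as `result.params / result.bse`, and the p-values as `result.pvalues`.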

Assuming a 5% significance level, one can argue that all predictor variables are statistically significant, because their p-values are all smaller than 0.05. (The null hypothesis is that the coefficient is zero. If the p-value is smaller than our significance level, we reject the null hypothesis.)

"The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable. For every one unit change in `gre`, the log odds of admission (versus non-admission) increases by 0.002. For a one unit increase in `gpa`, the log odds of being admitted to graduate school increases by 0.804. The indicator variables for `rank` have a slightly different interpretation. For example, having attended an undergraduate institution with `rank` of 2, versus an institution with a `rank` of 1, changes the log odds of admission by -0.675" [(quoted from here)](https://stats.idre.ucla.edu/r/dae/logit-regression/).

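Log odds are often easier to read after exponentiation, which turns each coefficient into an odds ratio. The sketch below uses only the rounded coefficients quoted above; the dictionary keys are labels chosen here for illustration. For the fitted model, `np.exp(result.params)` would give the corresponding values for all coefficients. An odds ratio above 1 means the odds of admission increase with the predictor, and a value below 1 means they decrease.

```python
import math

# Coefficients (changes in log odds) as quoted in the text above
log_odds = {"gre": 0.002, "gpa": 0.804, "rank 2 vs. rank 1": -0.675}

# Exponentiating a log-odds change gives the multiplicative change in the odds
odds_ratios = {name: math.exp(beta) for name, beta in log_odds.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```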
The accuracy of the model on the training set, which is *not* an unbiased estimate of the generalization performance, can be computed as follows:

In [None]:
from sklearn.metrics import accuracy_score

y_prob = result.predict(X)  # predicted admission probabilities
y_pred = (y_prob >= 0.5).astype(int)  # threshold at 0.5 to obtain class labels
print('Accuracy on training set:', accuracy_score(y.values.ravel(), y_pred))
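
To put the training accuracy into perspective, it can help to compare it with a trivial classifier that always predicts the majority class. This small sketch is an addition for illustration; it uses only the mean of `admit` reported by `df.describe()` above.

```python
admit_rate = 0.3175  # mean of admit from df.describe()

# Always predicting the more frequent class is correct for that share of the data
baseline_accuracy = max(admit_rate, 1 - admit_rate)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```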

#### Additional comments
In the following, we demonstrate that linear rescaling does not change the significance of the predictor variables.

In [None]:
from sklearn.preprocessing import StandardScaler
cols_to_norm = ['gre','gpa']
X[cols_to_norm] = StandardScaler().fit_transform(X[cols_to_norm])

In [None]:
X.head()

In [None]:
logit = sm.Logit(y, X)
result = logit.fit()
print(result.summary2())

Note that the p-values of the predictor variables did not change, as expected.
(However, we can no longer reject the null hypothesis that the intercept parameter is zero: zero now lies within the confidence interval of the intercept estimate.)
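
A short derivation, added here as a sketch, indicates why this happens. The symbols $m$ and $s$ are introduced only for this argument and denote the mean and standard deviation used by the scaler for a predictor $x$ with coefficient $\beta$. Writing the rescaled predictor as $x' = (x - m)/s$, the linear predictor can be rewritten as

$$
\beta_0 + \beta x = \underbrace{(\beta_0 + \beta m)}_{\beta_0'} + \underbrace{(s \beta)}_{\beta'} \, x' .
$$

The rescaled model is therefore the same model in different parameters: the slope estimate and its standard error are both multiplied by $s$, so the z-statistic is unchanged,

$$
z' = \frac{\hat{\beta}'}{\mathrm{se}(\hat{\beta}')} = \frac{s \, \hat{\beta}}{s \, \mathrm{se}(\hat{\beta})} = z .
$$

The intercept, in contrast, becomes $\beta_0' = \beta_0 + \beta m$ (summed over all rescaled predictors), which is a different quantity, so its estimate, confidence interval, and p-value do change.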