In [8]:
# Execute before using this notebook if using google colab
kernel = str(get_ipython())
if 'google.colab' in kernel:    
    !wget https://raw.githubusercontent.com/fredzett/rmqa/master/utils.py -P local_modules -nc 
    !npx degit fredzett/rmqa/data data
    import sys
    sys.path.append('local_modules')

In [9]:
# import necessary modules
import pandas as pd
import numpy as np

# Exercises

Consider the dataset on US colleges below. Fit a logistic regression model that explains which factors differentiate private from public colleges, using the following variables:

- Private (1 = yes, 0 = no)
- AdmissionRate: number of applications accepted divided by number of applications received (a fraction between 0 and 1)
- Top10perc: % of new students from top 10% of their high school class
- Expend: Instructional expenditure per student
- S_F_Ratio: student to faculty ratio

Specifically answer the following questions:

1. Calculate the model and show a model summary with coefficients, goodness of fit, etc.

2. Explain how good the overall model is

3. Explain which coefficients are significant

4. Explain how the coefficients indicate whether a school is private or public

In [23]:
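df = (pd.read_csv("./data/college.csv")
     .drop(columns=["Unnamed: 0"])
     # encode the target: 1 = private ("Yes"), 0 = public
     .assign(Private=lambda x: np.where(x["Private"] == "Yes", 1, 0),
            AdmissionRate=lambda x: x["Accept"] / x["Apps"])
     )
# replace the literal "." in column names (regex=False) so they can be used in formulas
df.columns = df.columns.str.replace(".", "_", regex=False)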

In [24]:
df.head()

,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F_Undergrad,P_Undergrad,Outstate,Room_Board,Books,Personal,PhD,Terminal,S_F_Ratio,perc_alumni,Expend,Grad_Rate,AdmissionRate
0,1,1660,1232,721,23,52,2885,537,7440,3300,450,2200,70,78,18.1,12,7041,60,0.742169
1,1,2186,1924,512,16,29,2683,1227,12280,6450,750,1500,29,30,12.2,16,10527,56,0.880146
2,1,1428,1097,336,22,50,1036,99,11250,3750,400,1165,53,66,12.9,30,8735,54,0.768207
3,1,417,349,137,60,89,510,63,12960,5450,450,875,92,97,7.7,37,19016,59,0.83693
4,1,193,146,55,16,44,249,869,7560,4120,800,1500,76,72,11.9,2,10922,15,0.756477


In [25]:
df.tail()

,Private,Apps,Accept,Enroll,Top10perc,Top25perc,F_Undergrad,P_Undergrad,Outstate,Room_Board,Books,Personal,PhD,Terminal,S_F_Ratio,perc_alumni,Expend,Grad_Rate,AdmissionRate
772,0,2197,1515,543,4,26,3089,2029,6797,3900,500,1200,60,60,21.0,14,4469,40,0.689577
773,1,1959,1805,695,24,47,2849,1107,11520,4960,600,1250,73,75,13.3,31,9189,83,0.921388
774,1,2097,1915,695,34,61,2793,166,6900,4200,617,781,67,75,14.4,20,8323,49,0.913209
775,1,10705,2453,1317,95,99,5217,83,19840,6510,630,2115,96,96,5.8,49,40386,99,0.229145
776,1,2989,1855,691,28,63,2988,1726,4990,3560,500,1250,75,75,18.1,28,4509,99,0.620609


## Solutions

In [26]:
import statsmodels.api as sm
from patsy import dmatrices

In [27]:
y, X = dmatrices("Private ~ AdmissionRate + Top10perc + Expend + S_F_Ratio", df)

In [28]:
model = sm.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: 0.449228
         Iterations 7


In [29]:
model.summary()

Dep. Variable:,Private,No. Observations:,777.0
Model:,Logit,Df Residuals:,772.0
Method:,MLE,Df Model:,4.0
Date:,"Wed, 09 Dec 2020",Pseudo R-squ.:,0.2335
Time:,18:32:05,Log-Likelihood:,-349.05
converged:,True,LL-Null:,-455.37
Covariance Type:,nonrobust,LLR p-value:,7.159e-45

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.0292,1.040,1.952,0.051,-0.008,4.067
AdmissionRate,3.1587,0.745,4.239,0.000,1.698,4.619
Top10perc,0.0020,0.008,0.246,0.805,-0.014,0.018
Expend,0.0001,4.69e-05,2.147,0.032,8.78e-06,0.000
S_F_Ratio,-0.2905,0.035,-8.204,0.000,-0.360,-0.221


__How good is the model?__

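- The pseudo $R^2$ is $0.2335$. This is McFadden's pseudo $R^2$ (the one statsmodels reports); values above roughly $0.2$ are commonly regarded as a fairly good fit. It is not directly comparable to $R^2$ in linear regression.

- The model is statistically significant: the likelihood-ratio test rejects $H_0: \beta_1 = \ldots = \beta_4 = 0$ (all slope coefficients are zero) with an LLR p-value of about $7.2 \cdot 10^{-45}$.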
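
Both goodness-of-fit figures can be reproduced by hand from the log-likelihoods in the summary table. The following is a minimal sketch that uses only the rounded values printed there (Log-Likelihood, LL-Null, Df Model), so it matches the summary only up to rounding. The helper `chi2_sf_even_df` is a hypothetical name; it relies on the closed-form chi-squared survival function, which holds for even degrees of freedom only.

```python
import math

# Values copied from the summary table (rounded as printed there)
ll_model = -349.05   # Log-Likelihood of the fitted model
ll_null = -455.37    # LL-Null: log-likelihood of the intercept-only model
df_model = 4         # number of slope coefficients

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic; chi-squared with df_model degrees of freedom under H0
llr = 2 * (ll_model - ll_null)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-squared variable; this closed form is valid for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form only holds for even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


llr_pvalue = chi2_sf_even_df(llr, df_model)

print(f"pseudo R2   = {pseudo_r2:.4f}")
print(f"LLR         = {llr:.2f}")
print(f"LLR p-value = {llr_pvalue:.3e}")
```

On a fitted statsmodels result the same quantities are available as `model.prsquared`, `model.llr` and `model.llr_pvalue`.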

__Explain coefficients:__

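- two coefficients are highly significant (p-value < 1%): AdmissionRate and S_F_Ratio
- one coefficient is significant at the 5% level (p-value = 3.2%): Expend
- Top10perc is not significant (p-value = 80.5%), and the intercept narrowly misses the 5% level (p-value = 5.1%)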
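
The p-values in the summary follow from the z statistics via the standard normal distribution. The sketch below recomputes them from the rounded z values in the table, so small deviations in the third decimal are expected. The thresholds in `significance_label` (1% and 5%) are an illustrative convention, not something the model prescribes.

```python
import math

# Reported (z statistic, p-value) pairs, copied from the summary table
reported = {
    "Intercept": (1.952, 0.051),
    "AdmissionRate": (4.239, 0.000),
    "Top10perc": (0.246, 0.805),
    "Expend": (2.147, 0.032),
    "S_F_Ratio": (-8.204, 0.000),
}


def two_sided_p(z):
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2))


def significance_label(p):
    """Illustrative thresholds (1% and 5%); other conventions exist."""
    if p < 0.01:
        return "significant at 1%"
    if p < 0.05:
        return "significant at 5%"
    return "not significant at 5%"


p_values = {name: two_sided_p(z) for name, (z, _) in reported.items()}

for name, p in p_values.items():
    print(f"{name:<14} p = {p:.3f}  ({significance_label(p)})")
```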

__Explain effect of coefficients on probability of school type__

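_S-F-Ratio:_

The coefficient for S_F_Ratio is $-0.29$. The model is linear in the log-odds, so if S_F_Ratio increases by one (holding the other variables constant), the log-odds that the school is private fall by $0.29$, i.e. the odds are multiplied by

$$e^{-0.29} \approx 0.75$$

so the odds of being a private school drop by about $25\%$. The resulting change in probability depends on the starting point: it is roughly $0.29 / 4 \approx 7$ percentage points at most (reached near a probability of $50\%$) and smaller elsewhere.

Interpretation: the higher the student to faculty ratio, the more likely it is that the school is a public school.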
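
To make the link between a coefficient, the odds and the probability concrete, the sketch below converts the S_F_Ratio coefficient into an odds ratio and shows how the same one-unit change moves the probability of `Private = 1` from different starting points. The function name `shifted_probability` and the starting probabilities are hypothetical choices for illustration.

```python
import math

beta_sf = -0.2905  # S_F_Ratio coefficient from the summary table


def logistic(z):
    """Map log-odds to a probability."""
    return 1 / (1 + math.exp(-z))


def shifted_probability(p, beta, delta=1.0):
    """Probability after a predictor changes by delta, starting from probability p."""
    log_odds = math.log(p / (1 - p)) + beta * delta
    return logistic(log_odds)


# Multiplicative change in the odds of Private = 1 per one-unit increase
odds_ratio = math.exp(beta_sf)
print(f"odds ratio per unit of S_F_Ratio: {odds_ratio:.3f}")

# The same one-unit increase shifts the probability by different amounts
for p in (0.2, 0.5, 0.9):
    new_p = shifted_probability(p, beta_sf)
    print(f"P(private) {p:.2f} -> {new_p:.3f} (change {new_p - p:+.3f})")
```

The odds ratio is constant, whereas the change in probability is largest near $50\%$ and shrinks towards the extremes.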

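_AdmissionRate:_

The coefficient for AdmissionRate is $3.16$. Because Private is coded as 1 for private schools, the positive sign means that a higher admission rate increases the odds that the school is private, holding the other variables constant. AdmissionRate is measured as a fraction between 0 and 1, so a move "by one" would span its whole range (odds multiplied by $e^{3.16} \approx 23.6$). A more useful step is one percentage point ($0.01$):

$$e^{3.16 \cdot 0.01} \approx 1.03$$

i.e. an AdmissionRate that is $1$ percentage point higher increases the odds of being a private school by about $3\%$.

Interpretation: the higher the admission rate, the more likely it is that the school is a private school, given the same Top10perc, Expend and S_F_Ratio.

The S_F_Ratio result intuitively makes sense, as we would expect (in the USA) public schools to have higher student to faculty ratios. The AdmissionRate result is a partial effect: it holds after controlling for the other variables in the model and does not by itself show that private schools have higher admission rates overall.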
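
As a rough plausibility check, the fitted coefficients can be applied by hand to two rows shown in `df.tail()`. This is a sketch under stated assumptions: the coefficients are the rounded values from the summary table (Expend in particular is printed as 0.0001), so the probabilities are approximate, and `predict_private` is a hypothetical helper rather than part of statsmodels.

```python
import math

# Coefficients as printed in the summary table (rounded, so results are approximate)
coef = {
    "Intercept": 2.0292,
    "AdmissionRate": 3.1587,
    "Top10perc": 0.0020,
    "Expend": 0.0001,
    "S_F_Ratio": -0.2905,
}

# Two rows taken from the df.tail() output
colleges = {
    772: {"AdmissionRate": 0.689577, "Top10perc": 4, "Expend": 4469, "S_F_Ratio": 21.0},    # Private = 0
    775: {"AdmissionRate": 0.229145, "Top10perc": 95, "Expend": 40386, "S_F_Ratio": 5.8},   # Private = 1
}


def predict_private(row):
    """Estimated P(Private = 1): logistic function of the linear predictor."""
    log_odds = coef["Intercept"] + sum(coef[name] * value for name, value in row.items())
    return 1 / (1 + math.exp(-log_odds))


p_772 = predict_private(colleges[772])
p_775 = predict_private(colleges[775])
print(f"row 772: P(private) = {p_772:.3f}")
print(f"row 775: P(private) = {p_775:.3f}")

# Effect of a 10 percentage point higher admission rate for row 772, all else equal
raised = dict(colleges[772], AdmissionRate=colleges[772]["AdmissionRate"] + 0.10)
p_772_raised = predict_private(raised)
print(f"row 772 with AdmissionRate + 0.10: P(private) = {p_772_raised:.3f}")
```

With the fitted model itself, `model.predict(X)` returns these probabilities for every row without the rounding error introduced here.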