# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [188]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)
import statsmodels as sm
import math

from sklearn import preprocessing
import statsmodels.formula.api as smf

from sklearn import linear_model

In [189]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [190]:
pd.crosstab(df.admit, df.prestige,margins=True)

admit \ prestige,1.0,2.0,3.0,4.0,All
0,28,95,93,55,271
1,33,53,28,12,126
All,61,148,121,67,397


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [191]:
df_2 = pd.get_dummies(df['prestige'],prefix='Prestige')


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We need three of these binary variables to represent the four possible values of `prestige`; the omitted level serves as the reference category. Since 2.0 is the most common value, `Prestige_2.0` is one reasonable choice to drop, although Question 12 asks for the most prestigious schools (`Prestige_1.0`) as the reference.

> ### Question 4.  Why are we doing this?

Answer: We create a binary (dummy) variable for each prestige level so that the model treats the levels as separate categories rather than as a single numeric scale with evenly spaced steps. For example, the change in admission odds between a prestige 4 and a prestige 3 school is probably not equal to the change between prestige 3 and prestige 2, and separate binary variables let the model assign a different weight to each level. We drop one of the four variables because it is redundant: exactly one of them is 1 for every applicant, so the values of any three determine the fourth. Keeping all four alongside an intercept makes the features perfectly collinear, and the coefficients can no longer be estimated uniquely.
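The redundancy among the four dummy variables can be checked with a short sketch. It uses a small made-up list of prestige values (hypothetical, not the admissions data) and a hand-written `one_hot` helper, which is only an illustrative stand-in for `pd.get_dummies`.

```python
# Hypothetical prestige values for six applicants (not taken from the dataset).
prestige = [3, 3, 1, 4, 2, 2]
levels = [1, 2, 3, 4]

def one_hot(value, levels):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == level else 0 for level in levels]

encoded = [one_hot(p, levels) for p in prestige]

# Exactly one dummy is 1 in every row, so each row sums to 1.
row_sums = [sum(row) for row in encoded]

# Drop the first dummy (prestige 1) and rebuild it from the other three.
reduced = [row[1:] for row in encoded]
rebuilt_first = [1 - sum(row) for row in reduced]

print(row_sums)
print(rebuilt_first == [row[0] for row in encoded])
```

Because the dropped column can always be rebuilt from the remaining three, it carries no extra information for the model.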

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [192]:
df = pd.concat([df, df_2], axis=1)
df = df.drop('prestige',axis=1)
df

Unnamed: 0,admit,gre,gpa,Prestige_1.0,Prestige_2.0,Prestige_3.0,Prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [193]:
pd.crosstab(df.admit, df['Prestige_1.0'],margins=True)

admit \ Prestige_1.0,0.0,1.0,All
0,243,28,271
1,93,33,126
All,336,61,397


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [194]:
prob_prest_1 = 33/61
odds_prest_1 = prob_prest_1 / (1-prob_prest_1)
print(odds_prest_1)

1.1785714285714288


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [195]:
prob_prest_234 = 93/336
odds_prest_234 = prob_prest_234 / (1-prob_prest_234)
print(odds_prest_234)

0.3827160493827161


> ### Question 9.  Finally, what's the odds ratio?

In [196]:
odds_ratio = odds_prest_1 / odds_prest_234
print('Odds ratio = {:3.2f}'.format(odds_ratio))

Odds ratio = 3.08


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for an applicant who attended a #1 ranked college are about three times (3.08 times) the odds of admission for an applicant who did not attend a #1 ranked college.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [197]:
prob_prest_4 = 12/67
odds_prest_4 = prob_prest_4 / (1-prob_prest_4)
print(odds_prest_4)

0.21818181818181817


In [198]:
prob_prest_123 = 114/330
odds_prest_123 = prob_prest_123 / (1-prob_prest_123)
print(odds_prest_123)

0.5277777777777778


In [199]:
odds_ratio_4 = odds_prest_4 / odds_prest_123
print(odds_ratio_4)

0.4133971291866028


Answer: The odds of admission to UCLA for an applicant who attended a prestige 4 college are less than half (0.41 times) the odds of admission for an applicant who did not attend a prestige 4 college.
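The hand calculations in this part can also be written as a small helper. This is a minimal sketch: the function names `odds` and `odds_ratio` are illustrative, and it relies on the fact that p / (1 - p) equals the number admitted divided by the number not admitted.

```python
def odds(successes, failures):
    """Odds of an event: number of successes divided by number of failures."""
    return successes / failures

def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of group A divided by odds of group B."""
    return odds(a_yes, a_no) / odds(b_yes, b_no)

# Counts read from the frequency tables above: (admitted, not admitted).
or_prestige_1 = odds_ratio(33, 28, 93, 243)    # prestige 1 vs. everyone else
or_prestige_4 = odds_ratio(12, 55, 114, 216)   # prestige 4 vs. everyone else

print('{:.2f}'.format(or_prestige_1))
print('{:.2f}'.format(or_prestige_4))
```

Working from counts rather than probabilities avoids one division per group and makes the source of each number easy to trace back to the table.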

## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [200]:
model_1 = sm.discrete.discrete_model.Logit(df.admit,
                        df[['Prestige_2.0','Prestige_3.0','Prestige_4.0','gpa','gre']]).fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [201]:
model_1.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Tue, 31 Jan 2017",Pseudo R-squ.:,0.05722
Time:,00:43:14,Log-Likelihood:,-233.88
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.039e-05

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Prestige_2.0,-0.9562,0.302,-3.171,0.002,-1.547 -0.365
Prestige_3.0,-1.5375,0.332,-4.627,0.000,-2.189 -0.886
Prestige_4.0,-1.8699,0.401,-4.658,0.000,-2.657 -1.083
gpa,-0.1323,0.195,-0.680,0.497,-0.514 0.249
gre,0.0014,0.001,1.308,0.191,-0.001 0.003


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$

$\operatorname{logit}(p) = 0$ when $p = 0.5$

$e^{\text{coef}}$ = multiplicative change in the odds for a one-unit change in $x$
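These relationships can be verified numerically. The sketch below uses a made-up coefficient and starting probability (both hypothetical) to show that adding the coefficient to the log-odds multiplies the odds by the exponentiated coefficient.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Probability corresponding to log-odds z."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficient and starting probability, chosen only for illustration.
coef = 0.5
p_before = 0.3

# A one-unit increase in x adds `coef` to the log-odds ...
p_after = inv_logit(logit(p_before) + coef)

# ... which multiplies the odds by exp(coef).
odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)

print(logit(0.5))
print(odds_after / odds_before, math.exp(coef))
```

Note that the multiplier applies to the odds, not to the probability: the change in probability depends on where the applicant starts.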

In [202]:
odds_gre = math.exp(model_1.params['gre'])
# 95% CI bounds on the log-odds scale, copied (rounded) from the summary table.
ci_low = math.exp(-0.001)
ci_high = math.exp(0.003)
print('Odds_ratio of gre: {:.4f}'.format(odds_gre))
print('95% Confidence interval of gre: {:.4f} to {:.4f}'.format(ci_low, ci_high))

Odds_ratio of gre: 1.0014
95% Confidence interval of gre: 0.9990 to 1.0030


In [203]:
odds_gpa = math.exp(model_1.params['gpa'])
# 95% CI bounds on the log-odds scale, copied (rounded) from the summary table.
ci_low = math.exp(-0.514)
ci_high = math.exp(0.249)
print('Odds_ratio of gpa: {:.4f}'.format(odds_gpa))
print('95% Confidence interval of gpa: {:.4f} to {:.4f}'.format(ci_low, ci_high))

Odds_ratio of gpa: 0.8761
95% Confidence interval of gpa: 0.5981 to 1.2827


In [204]:
odds_prestige_2 = math.exp(model_1.params['Prestige_2.0'])
# 95% CI bounds on the log-odds scale, copied (rounded) from the summary table.
ci_low = math.exp(-1.547)
ci_high = math.exp(-0.365)
print('Odds_ratio of Prestige_2.0: {:.4f}'.format(odds_prestige_2))
print('95% Confidence interval of Prestige_2.0: {:.4f} to {:.4f}'.format(ci_low, ci_high))

Odds_ratio of Prestige_2.0: 0.3843
95% Confidence interval of Prestige_2.0: 0.2129 to 0.6942


In [205]:
odds_prestige_3 = math.exp(model_1.params['Prestige_3.0'])
# 95% CI bounds on the log-odds scale, copied (rounded) from the summary table.
ci_low = math.exp(-2.189)
ci_high = math.exp(-0.886)
print('Odds_ratio of Prestige_3.0: {:.4f}'.format(odds_prestige_3))
print('95% Confidence interval of Prestige_3.0: {:.4f} to {:.4f}'.format(ci_low, ci_high))

Odds_ratio of Prestige_3.0: 0.2149
95% Confidence interval of Prestige_3.0: 0.1120 to 0.4123


In [206]:
odds_prestige_4 = math.exp(model_1.params['Prestige_4.0'])
# 95% CI bounds on the log-odds scale, copied (rounded) from the summary table.
ci_low = math.exp(-2.657)
ci_high = math.exp(-1.083)
print('Odds_ratio of Prestige_4.0: {:.4f}'.format(odds_prestige_4))
print('95% Confidence interval of Prestige_4.0: {:.4f} to {:.4f}'.format(ci_low, ci_high))

Odds_ratio of Prestige_4.0: 0.1541
95% Confidence interval of Prestige_4.0: 0.0702 to 0.3386
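The interval bounds in the cells above were typed in by hand from the summary table. As a rough alternative, the sketch below derives them from a coefficient and its standard error using the normal approximation; the helper name `odds_ratio_with_ci` is illustrative, and the default multiplier of 1.96 is the usual value for a 95% interval. With the fitted model available, the results object's `conf_int()` method should give the same log-odds bounds without manual copying.

```python
import math

def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logit coefficient.

    Uses the normal approximation coef +/- z * std_err on the log-odds scale,
    then exponentiates each value.
    """
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return math.exp(coef), lower, upper

# Coefficient and standard error for Prestige_2.0, copied from the summary table.
ratio, lower, upper = odds_ratio_with_ci(-0.9562, 0.302)
print('{:.3f} ({:.3f} to {:.3f})'.format(ratio, lower, upper))
```

Small differences from the hand-typed bounds are expected, because the summary table rounds both the coefficient and the standard error.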


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: The odds ratio for `Prestige_2.0` is 0.384. Because `Prestige_1.0` is the omitted reference category, this means that, holding `gpa` and `gre` fixed, the odds of admission to UCLA graduate school for an applicant from a prestige 2 school are about 0.38 times the odds for an applicant from a prestige 1 school. The 95% confidence interval (0.213 to 0.694) lies entirely below 1, so the lower odds are statistically significant, which makes intuitive sense.

> ### Question 16.  Interpret the odds ratio of `gpa`.

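Answer: The odds ratio of `gpa` is 0.876, which means that each one-point increase in GPA multiplies the odds of admission by 0.876, holding the other features fixed. This does not make intuitive sense, but the estimate is not statistically significant: the p-value is 0.497 and the 95% confidence interval (0.598 to 1.283) includes 1. Note also that this model was fit without an intercept, because the statsmodels `Logit` class does not add a constant automatically, so the `gpa` and `gre` coefficients have to absorb the baseline admission rate; this is a likely cause of the unexpected sign. Another possibility is that a high GPA is correlated with easier classes or lower-prestige schools.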

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [207]:
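# Keys are listed in the same order as the model's features so that
# positional predictions line up regardless of how pandas orders dict keys.
df_test = pd.DataFrame({'Prestige_2.0': [0,1,0,0],
            'Prestige_3.0': [0,0,1,0],
            'Prestige_4.0': [0,0,0,1],
            'gpa': [4, 4, 4, 4],
            'gre': [800, 800, 800, 800]
          })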
df_test

Unnamed: 0,Prestige_2.0,Prestige_3.0,Prestige_4.0,gpa,gre
0,0,0,0,4,800
1,1,0,0,4,800
2,0,1,0,4,800
3,0,0,1,4,800


In [208]:
model_1.predict(df_test)

array([ 0.63739858,  0.40320425,  0.27420161,  0.21318433])

Answer: For a student with a GRE of 800 and a GPA of 4.0, the probability of admission for each of the four prestige level schools is: Prestige 1: 0.637, Prestige 2: 0.403, Prestige 3: 0.274, and Prestige 4: 0.213. 

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [209]:
X = df[['Prestige_2.0','Prestige_3.0','Prestige_4.0','gpa','gre']]
c = df.admit

model_2 = linear_model.LogisticRegression(C = 10 ** 2).\
    fit(X, c)

In [210]:
model_2.score(X, c)

0.70528967254408059

In [211]:
model_2.predict(df_test)

array([1, 1, 0, 0])

In [212]:
print(model_2.coef_)
print(model_2.intercept_)

[[-0.62882239 -1.25222745 -1.56879212  0.67315496  0.00215822]]
[-3.51478687]


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [213]:
odds_gre = math.exp(0.00215822)
print('Odds_ratio of gre: {:.4f}'.format(odds_gre))

Odds_ratio of gre: 1.0022


In [214]:
odds_gpa = math.exp(0.67315496)
print('Odds_ratio of gpa: {:.4f}'.format(odds_gpa))

Odds_ratio of gpa: 1.9604


In [215]:
odds_Prestige_4 = math.exp(-1.56879212)
print('Odds_ratio of Prestige_4: {:.4f}'.format(odds_Prestige_4))

Odds_ratio of Prestige_4: 0.2083


In [216]:
odds_Prestige_3 = math.exp(-1.25222745)
print('Odds_ratio of Prestige_3: {:.4f}'.format(odds_Prestige_3))

Odds_ratio of Prestige_3: 0.2859


In [217]:
odds_Prestige_2 = math.exp(-0.62882239)
print('Odds_ratio of Prestige_2: {:.4f}'.format(odds_Prestige_2))

Odds_ratio of Prestige_2: 0.5332


Answer: 

The odds ratio of `gre` is 1.0022 for sklearn, whereas for statsmodels it was 1.0014. Both are only slightly greater than 1, but GRE has a slightly larger positive effect in the second model.

The odds ratio of `gpa` is 1.9604 in model 2, much higher than the 0.8761 in model 1: the estimated association changes from negative to positive. A likely reason is that `LogisticRegression` fits an intercept by default (the printed intercept is -3.51), whereas model 1 was fit without one.

The odds ratio of Prestige 4 was 0.2083 in model 2 and 0.1541 in model 1. The odds ratio of Prestige 3 was 0.2859 in model 2 and 0.2149 in model 1. The odds ratio of Prestige 2 was 0.5332 in model 2 and 0.3843 in model 1. Models 1 and 2 therefore show the same pattern for the three lower prestige levels, with each lower level having a more pronounced negative effect on the odds of admission, but the odds ratios are lower in model 1 than in model 2 at every level.

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [237]:
model_2.predict(df_test)

array([1, 1, 0, 0])

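In [243]:
def calc_prob_m2(row):
    # Linear predictor (log-odds): intercept plus the coefficient-weighted features,
    # with the features taken in the same order used to fit model_2.
    log_odds = model_2.intercept_[0] + np.dot(model_2.coef_[0], row[X.columns])
    # The inverse logit converts log-odds into a probability.
    return 1 / (1 + math.exp(-log_odds))

In [245]:
# axis=1 applies the function to each row (one student per row).
result = df_test.apply(calc_prob_m2, axis=1)
result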

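Answer: `predict` returns class labels, not probabilities. Applying the inverse logit to the linear predictor built from the coefficients and intercept printed in Question 18 gives approximate admission probabilities of 0.712 (Prestige 1), 0.568 (Prestige 2), 0.414 (Prestige 3), and 0.340 (Prestige 4). These are consistent with the predicted classes [1, 1, 0, 0] at a 0.5 threshold.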
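The probabilities for Question 20 can be worked out by hand from the coefficients and intercept printed in Question 18. The sketch below is one way to do that without the fitted model; the values are copied from that output (so they are rounded to eight decimals), and the helper name `admit_probability` is illustrative. With the fitted model available, `model_2.predict_proba(df_test)[:, 1]` should return the same quantities.

```python
import math

# Coefficients and intercept copied from the Question 18 output, in feature order:
# Prestige_2.0, Prestige_3.0, Prestige_4.0, gpa, gre.
coefs = [-0.62882239, -1.25222745, -1.56879212, 0.67315496, 0.00215822]
intercept = -3.51478687

# One row per prestige tier (1 to 4), each with GPA 4 and GRE 800.
students = [
    [0, 0, 0, 4, 800],
    [1, 0, 0, 4, 800],
    [0, 1, 0, 4, 800],
    [0, 0, 1, 4, 800],
]

def admit_probability(features):
    """Inverse logit of the linear predictor (intercept + coefficients . features)."""
    log_odds = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1 / (1 + math.exp(-log_odds))

probabilities = [admit_probability(row) for row in students]
for tier, p in enumerate(probabilities, start=1):
    print('Prestige {}: {:.3f}'.format(tier, p))
```

Thresholding these probabilities at 0.5 reproduces the class labels returned by `model_2.predict(df_test)`.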

## Part F.  Further Model Exploration

In [219]:
df.head()

Unnamed: 0,admit,gre,gpa,Prestige_1.0,Prestige_2.0,Prestige_3.0,Prestige_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


In [235]:
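# Scale every column except the dropped reference dummy to [0, 1].
# Column order after the drop: admit, gre, gpa, Prestige_2.0, Prestige_3.0, Prestige_4.0.
features = df.drop('Prestige_1.0', axis=1)
scaler = preprocessing.MinMaxScaler().fit(features)
df_np_array = scaler.transform(features)
print(df_np_array)
# Column 0 is the target (admit); columns 1-5 appear as x1..x5 in the summary.
model_3 = sm.discrete.discrete_model.Logit(df_np_array[:,0],df_np_array[:,1:]).fit()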

[[ 0.          0.27586207  0.77586207  0.          1.          0.        ]
 [ 1.          0.75862069  0.81034483  0.          1.          0.        ]
 [ 1.          1.          1.          0.          0.          0.        ]
 ..., 
 [ 0.          0.4137931   0.21264368  1.          0.          0.        ]
 [ 0.          0.82758621  0.79885057  1.          0.          0.        ]
 [ 0.          0.65517241  0.93678161  0.          1.          0.        ]]
Optimization terminated successfully.
         Current function value: 0.585761
         Iterations 5


In [236]:
model_3.summary()

Dep. Variable:,y,No. Observations:,397.0
Model:,Logit,Df Residuals:,392.0
Method:,MLE,Df Model:,4.0
Date:,"Tue, 31 Jan 2017",Pseudo R-squ.:,0.0626
Time:,01:00:00,Log-Likelihood:,-232.55
converged:,True,LL-Null:,-248.08
,,LLR p-value:,2.976e-06

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
x1,0.2626,0.528,0.498,0.619,-0.772 1.297
x2,0.5692,0.507,1.122,0.262,-0.425 1.563
x3,-1.1193,0.280,-3.992,0.000,-1.669 -0.570
x4,-1.7517,0.312,-5.619,0.000,-2.363 -1.141
x5,-2.0335,0.382,-5.317,0.000,-2.783 -1.284


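The fit is only slightly better with min-max scaling (pseudo R-squared of 0.0626 versus 0.0572 for model 1). In the summary, x1 is `gre`, x2 is `gpa`, and x3 to x5 are the prestige 2, 3, and 4 dummies. This time the coefficients for GRE and GPA are both positive, although neither is statistically significant (p = 0.619 and p = 0.262), and the coefficients for the lower-prestige schools are somewhat more negative. The signs make this model more intuitive for coefficient interpretation.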
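For reference, min-max scaling maps each value onto [0, 1] using the column's minimum and maximum. The sketch below reproduces two of the scaled GRE values printed above; the GRE range of 220 to 800 is an assumption inferred from that scaled output, not a figure stated in the notebook, and the helper name `min_max_scale` is illustrative.

```python
def min_max_scale(value, lowest, highest):
    """Map `value` from the range [lowest, highest] onto [0, 1]."""
    return (value - lowest) / (highest - lowest)

# Assumed column range for gre (inferred from the scaled output, not stated in the notebook).
GRE_MIN, GRE_MAX = 220, 800

scaled_380 = min_max_scale(380, GRE_MIN, GRE_MAX)
scaled_660 = min_max_scale(660, GRE_MIN, GRE_MAX)
print(scaled_380, scaled_660)
```

One consequence for interpretation: a one-unit change in a scaled feature corresponds to moving from the column minimum to the column maximum, so the coefficients of this model are not directly comparable to the per-point coefficients of the unscaled models.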