# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
prestige_freq=pd.crosstab(index=df.prestige,columns=df.admit)

In [4]:
# frequency table
prestige_freq

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [5]:
one_hot = pd.get_dummies(df.prestige)

In [6]:
one_hot.head()

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: We need n-1 binary variables for modeling (here, three variables for the four undergrad school tiers). One column can be dropped without losing information: whenever all remaining indicators equal 0, the applicant belongs to the dropped category.
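A minimal sketch of why one indicator is redundant, written with plain Python lists rather than the notebook's `one_hot` frame. The sample tiers are the first five `prestige` values shown in the data preview; the helper name `one_hot_row` is hypothetical.

```python
# Tiers of the first five applicants in the dataset (3, 3, 1, 4, 4).
tiers = [3, 3, 1, 4, 4]
levels = [1, 2, 3, 4]

def one_hot_row(tier):
    """Return one 0/1 indicator per level (hypothetical helper)."""
    return [1 if tier == level else 0 for level in levels]

full = [one_hot_row(t) for t in tiers]

# Drop the last indicator (tier 4), keeping n - 1 = 3 columns.
reduced = [row[:-1] for row in full]

# The dropped column is fully determined by the kept ones:
# an applicant is tier 4 exactly when all kept indicators are 0.
recovered = [1 - sum(row) for row in reduced]

print(recovered)                               # [0, 0, 0, 1, 1]
print(recovered == [row[-1] for row in full])  # True
```

Because the four indicators always sum to 1, keeping all of them next to an intercept would make the predictors perfectly collinear, which is the practical reason for dropping one.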

In [7]:
one_hot.drop(4.0,axis=1,inplace=True)

> ### Question 4.  Why are we doing this?

Answer: Linear models such as linear and logistic regression require numeric predictors, but the numeric codes of a categorical variable are only labels. Entering `prestige` as a single number would force the model to assume that each step between tiers has the same effect on the log-odds. Dummy variables let the model estimate a separate, meaningful coefficient for each category relative to the dropped reference category.

> ### Question 5.  Add all these binary variables to the dataset and remove the now redundant `prestige` feature.

In [11]:
df = df.drop('prestige', axis=1)
df = df.join(one_hot)
df=df.rename(columns={1.0: "prestige_1", 2.0: "prestige_2", 3.0: "prestige_3"})
df.head()

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3
0,0,380.0,3.61,0.0,0.0,1.0
1,1,660.0,3.67,0.0,0.0,1.0
2,1,800.0,4.0,1.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0
4,0,520.0,2.93,0.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [12]:
pr1_table = prestige_freq.loc[1.0]
pr1_table=pd.DataFrame(pr1_table)
pr1_table

admit,1.0
0,28
1,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [13]:
pr1_odds=33.0/28
print(pr1_odds)

1.17857142857


In [15]:
#Alternative way below

In [16]:
pr1_table=pd.DataFrame({'probability':float(pr1_table.iloc[1,0])/pr1_table.sum(axis=0)})

In [17]:
pr1_table['odds']=pr1_table.probability/(1 - pr1_table.probability)
pr1_table

Unnamed: 0,probability,odds
1.0,0.540984,1.178571
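The two routes above (counts and probability) can be sketched as two small helpers that should agree. This is an illustrative sketch only; the function names are hypothetical and the counts come from the `prestige = 1` row of the frequency table.

```python
def odds_from_counts(successes, failures):
    """Odds = successes / failures (hypothetical helper)."""
    return float(successes) / failures

def odds_from_probability(p):
    """Odds = p / (1 - p) (hypothetical helper)."""
    return p / (1.0 - p)

admitted, rejected = 33, 28  # prestige = 1 row of the frequency table
p_admit = float(admitted) / (admitted + rejected)

odds_a = odds_from_counts(admitted, rejected)
odds_b = odds_from_probability(p_admit)

print(round(p_admit, 6))             # 0.540984
print(round(odds_a, 6))              # 1.178571
print(abs(odds_a - odds_b) < 1e-12)  # True
```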


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [18]:
# aggregate applicant admission records for those who attended undergrad school tiers 2 to 4
non_prestige=prestige_freq.drop(1.0,axis=0)
non_prestige=non_prestige.T
non_prestige=non_prestige.sum(axis=1)
non_prestige

admit
0    243
1     93
dtype: int64

In [19]:
non_pr_odds=93.0/243
print(non_pr_odds)

0.382716049383


In [20]:
# Alternative way to produce odds
pr2_table=pd.DataFrame({'probability':float(non_prestige[1])/non_prestige.sum()},index=['2_4'])
pr2_table['odds']=pr2_table.probability/(1 - pr2_table.probability)
pr2_table

Unnamed: 0,probability,odds
2_4,0.276786,0.382716


> ### Question 9.  Finally, what's the odds ratio?

In [21]:
prestige_freq

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [22]:
# odds ratio for applicants who attended tier 1 schools
# 33.0 forces float division; in Python 2, 33/28 evaluates to 1
pr1_table['odds_ratio']=(33.0/28)/((53.0+28+12)/(95+93+55))
pr1_table

Unnamed: 0,probability,odds,odds_ratio
1.0,0.540984,1.178571,3.079493


In [23]:
# odds ratio for applicants who attended tier 2 to 4 schools
pr2_table['odds_ratio']=0.382716/1.178571
pr2_table

Unnamed: 0,probability,odds,odds_ratio
2_4,0.276786,0.382716,0.324729


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission to UCLA for applicants who attended the most prestigious undergrad schools are about 3.08 times the odds for those who did not, i.e. about 208% higher. Equivalently, the odds for applicants from tier 2 to 4 schools are about 68% lower (odds ratio of about 0.32).

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [24]:
prestige_freq

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [25]:
pr4_table=pd.DataFrame({'probability':12.0/(12+55)},index=['4'])
pr4_table['odds']=pr4_table.probability/(1 - pr4_table.probability)
pr4_table['odds_ratio']=(12.0/55)/((28.0+53+33)/(93.0+95+28))
pr4_table

Unnamed: 0,probability,odds,odds_ratio
4,0.179104,0.218182,0.413397


Answer: The odds of admission for applicants who attended the least prestigious undergrad schools are 12 to 55 (about 0.22). Their odds of being admitted to UCLA are about 59% lower than those of all other applicants (odds ratio of about 0.41).
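The tier-4 calculation can be sketched as a reusable "one tier versus the rest" odds ratio. This is a minimal sketch with hypothetical helper names; the counts are copied from the frequency table.

```python
def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds ratio of group A relative to group B from a 2x2 table of counts."""
    return (float(a_yes) / a_no) / (float(b_yes) / b_no)

# Counts taken from the frequency table: (not admitted, admitted) per tier.
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

def one_vs_rest(tier):
    """Odds ratio for one tier versus all other tiers pooled together."""
    no, yes = counts[tier]
    rest_no = sum(n for t, (n, y) in counts.items() if t != tier)
    rest_yes = sum(y for t, (n, y) in counts.items() if t != tier)
    return odds_ratio(yes, no, rest_yes, rest_no)

or_tier4 = one_vs_rest(4)
print(round(or_tier4, 6))                # 0.413397
print(int(round((1 - or_tier4) * 100)))  # 59 (percent lower odds)
```

Using explicit floats keeps the division exact under Python 2 as well, where dividing two integers would silently truncate.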

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [26]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3'], dtype='object')

In [27]:
import statsmodels.api as sm

In [28]:
formula = "admit ~ gre + gpa + prestige_1 + prestige_2 + prestige_3"
model = smf.glm(formula=formula, data=df, family=sm.families.Binomial())
result = model.fit()

> ### Question 13.  Print the model's summary results.

In [29]:
print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  admit   No. Observations:                  397
Model:                            GLM   Df Residuals:                      391
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -227.82
Date:                Tue, 17 Jan 2017   Deviance:                       455.64
Time:                        12:24:06   Pearson chi2:                     394.
No. Iterations:                     6                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -5.4303      1.140     -4.764      0.000        -7.664    -3.196
gre            0.0022      0.001      2.028      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [30]:
odds_ratios_sm = pd.DataFrame({'odds_ratio_sm': np.exp(result.params)})

In [31]:
odds_ratios_sm=odds_ratios_sm[1:]
odds_ratios_sm

Unnamed: 0,odds_ratio_sm
gre,1.002221
gpa,2.180027
prestige_1,4.727566
prestige_2,2.394738
prestige_3,1.239531
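Question 14 also asks for 95% confidence intervals. A minimal sketch of the usual Wald interval on the exponentiated scale, `exp(coef ± z * std_err)`, follows. It uses the Intercept row because that is the only row whose standard error is fully visible in the printed summary; note that exponentiating the intercept gives baseline odds rather than an odds ratio. The helper name is hypothetical. With the fitted `result` object available, `np.exp(result.conf_int())` should give the same kind of interval for every term.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal

def exp_wald_interval(coef, std_err, z=Z_95):
    """Return exp(coef - z*se), exp(coef), exp(coef + z*se) (hypothetical helper)."""
    lower = coef - z * std_err
    upper = coef + z * std_err
    return math.exp(lower), math.exp(coef), math.exp(upper)

# Intercept row as printed in the summary: coef = -5.4303, std err = 1.140.
low, point, high = exp_wald_interval(-5.4303, 1.140)

# On the log-odds scale this reproduces the printed interval (-7.664, -3.196)
# up to the rounding of the standard error.
print(round(math.log(low), 2), round(math.log(high), 2))  # -7.66 -3.2
print(low < point < high)                                 # True
```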


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding GRE and GPA fixed, the odds of being admitted to UCLA for applicants who attended undergrad schools with prestige = 2 are 2.39 times the odds for applicants from prestige = 4 schools, which is the dropped reference category in this model. In other words, their odds are about 139% higher than those of tier-4 applicants.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige fixed, a one-point increase in GPA is associated with a 118% increase in the odds of being admitted to UCLA (odds ratio of 2.18).
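The percentage quoted for GPA comes from exponentiating the coefficient. A small sketch of that conversion, assuming the `gpa` coefficient printed by `result.params`; the helper name and its `delta` parameter are hypothetical.

```python
import math

def percent_change_in_odds(coef, delta=1.0):
    """Percent change in the odds when the predictor increases by `delta`."""
    return (math.exp(coef * delta) - 1.0) * 100.0

gpa_coef = 0.779337  # from result.params

print(round(percent_change_in_odds(gpa_coef)))          # 118
print(round(percent_change_in_odds(gpa_coef, 0.1), 1))  # 8.1
```

The effect is multiplicative, so a 0.1-point increase in GPA corresponds to roughly 8.1% higher odds, not one tenth of 118%.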

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [32]:
result.params

Intercept    -5.430265
gre           0.002218
gpa           0.779337
prestige_1    1.553411
prestige_2    0.873274
prestige_3    0.214733
dtype: float64

In [33]:
intercept=result.params['Intercept']
intercept

-5.4302645821411151

In [34]:
test=pd.DataFrame({'GRE': 800,'GPA': 4.0,'tier_1':[1,0,0,0],'tier_2':[0,1,0,0],'tier_3':[0,0,1,0],
                   'GRE_coef':result.params['gre'],'GPA_coef':result.params['gpa'],
                  't1_coef':result.params['prestige_1'],'t2_coef':result.params['prestige_2'],'t3_coef':result.params['prestige_3']},index=[1,2,3,4])
test.index.name='tier'
test

tier,GPA,GPA_coef,GRE,GRE_coef,t1_coef,t2_coef,t3_coef,tier_1,tier_2,tier_3
1,4.0,0.779337,800,0.002218,1.553411,0.873274,0.214733,1,0,0
2,4.0,0.779337,800,0.002218,1.553411,0.873274,0.214733,0,1,0
3,4.0,0.779337,800,0.002218,1.553411,0.873274,0.214733,0,0,1
4,4.0,0.779337,800,0.002218,1.553411,0.873274,0.214733,0,0,0


In [35]:
test['logodds']=intercept+test.GPA*test.GPA_coef+test.GRE*test.GRE_coef+test.tier_1*test.t1_coef+test.tier_2*test.t2_coef+test.tier_3*test.t3_coef

In [36]:
test['odds']=np.exp(test['logodds'])

In [37]:
test['prob']=test['odds']/(1+test['odds'])

Answer: below

In [38]:
# probabilities of a student with 4.0 GPA and 800 GRE being admitted to UCLA, for every undergrad school tier.
test.prob

tier
1    0.734040
2    0.582995
3    0.419833
4    0.368608
Name: prob, dtype: float64
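The same calculation can be sketched as one inverse-logit function instead of a table of intermediate columns. This is a minimal sketch that hard-codes the rounded coefficients printed by `result.params`, so its output may differ from the values above in the fourth decimal; the function name is hypothetical. With the fitted model at hand, `result.predict` on a small frame of new applicants would be the library route.

```python
import math

# Coefficients as printed by result.params (tier 4 is the reference level).
params = {
    'Intercept': -5.430265,
    'gre': 0.002218,
    'gpa': 0.779337,
    'prestige_1': 1.553411,
    'prestige_2': 0.873274,
    'prestige_3': 0.214733,
}

def admission_probability(gre, gpa, tier):
    """Inverse-logit of the linear predictor for one applicant."""
    log_odds = params['Intercept'] + params['gre'] * gre + params['gpa'] * gpa
    # Tier 4 has no coefficient of its own; it is absorbed by the intercept.
    log_odds += params.get('prestige_%d' % tier, 0.0)
    return 1.0 / (1.0 + math.exp(-log_odds))

for tier in (1, 2, 3, 4):
    print(tier, round(admission_probability(800, 4.0, tier), 3))
```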

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [39]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3'], dtype='object')

In [40]:
feature_cols = ['gre', 'gpa','prestige_1','prestige_2','prestige_3']
X = df[feature_cols]
y = df['admit']

In [41]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C = 10 ** 2)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [42]:
logreg.fit(X, y)
list(zip(feature_cols, logreg.coef_[0]))

[('gre', 0.002155689762905874),
 ('gpa', 0.76513053943782638),
 ('prestige_1', 1.5066410420832834),
 ('prestige_2', 0.84896280512071187),
 ('prestige_3', 0.18248514690109693)]

In [43]:
# sklearn odds ratios without using StandardScaler
odds_ratios_skl=pd.DataFrame({'odds_ratio_sklearn':np.exp(logreg.coef_[0])},index=feature_cols)
odds_ratios_skl

Unnamed: 0,odds_ratio_sklearn
gre,1.002158
gpa,2.149275
prestige_1,4.511551
prestige_2,2.337221
prestige_3,1.200196


In [45]:
# join the two tables and calculate the ratio between statsmodels and sklearn odds ratios
ORs = odds_ratios_sm.join(odds_ratios_skl)
ORs['OR_dif']=ORs.odds_ratio_sm/ORs.odds_ratio_sklearn
ORs.sort_values('odds_ratio_sklearn',ascending=False)

Unnamed: 0,odds_ratio_sm,odds_ratio_sklearn,OR_dif
prestige_1,4.727566,4.511551,1.04788
prestige_2,2.394738,2.337221,1.024609
gpa,2.180027,2.149275,1.014308
prestige_3,1.239531,1.200196,1.032774
gre,1.002221,1.002158,1.000063


Answer: The odds ratio for prestige_1 (students who attended first-tier undergrad schools) is the highest among all features. GPA also has a high odds ratio compared to the other features: a one-point increase in GPA raises the odds of being accepted by about 115% (sklearn odds ratio of 2.15).

Between statsmodels and sklearn, the odds ratios are very similar, differing by at most about 5%.
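One plausible source of the small gap: `LogisticRegression` applies an L2 penalty by default, with `C` as the inverse regularization strength, while the `statsmodels` GLM fit is plain maximum likelihood. That would be consistent with every sklearn odds ratio in the table being slightly smaller. The sketch below illustrates the shrinkage effect only; it is not sklearn's solver, it fits a single coefficient without an intercept, and the data are made up.

```python
import math

def fit_logistic_1d(xs, ys, C=None, iterations=50):
    """Newton's method for a one-coefficient logistic model without intercept.

    Minimizes the negative log-likelihood plus w**2 / (2*C);
    C=None means no penalty (plain maximum likelihood).
    """
    penalty = 0.0 if C is None else 1.0 / C
    w = 0.0
    for _ in range(iterations):
        grad = penalty * w
        hess = penalty
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-w * x))
            grad += (p - y) * x
            hess += p * (1.0 - p) * x * x
        w -= grad / hess
    return w

# Small made-up, non-separable data set (illustrative only).
xs = [-2.0, -1.0, -1.0, 0.5, 1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w_mle = fit_logistic_1d(xs, ys)            # unpenalized fit
w_c100 = fit_logistic_1d(xs, ys, C=100.0)  # mild penalty, as with C = 10 ** 2
w_c1 = fit_logistic_1d(xs, ys, C=1.0)      # much stronger penalty

print(abs(w_c1) < abs(w_c100) < abs(w_mle))  # True
```

With `C = 100` the penalty is weak, so the coefficient moves only slightly toward zero, which matches the size of the discrepancy seen here.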

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [46]:
logreg.intercept_

array([-5.31428638])

In [47]:
coefs=pd.DataFrame({'feature':feature_cols,'coef':logreg.coef_[0]})
coefs

Unnamed: 0,coef,feature
0,0.002156,gre
1,0.765131,gpa
2,1.506641,prestige_1
3,0.848963,prestige_2
4,0.182485,prestige_3


In [48]:
test2=pd.DataFrame({'GRE': 800,'GPA': 4.0,'tier_1':[1,0,0,0],'tier_2':[0,1,0,0],'tier_3':[0,0,1,0],
                   'GRE_coef':coefs['coef'][0],'GPA_coef':coefs['coef'][1],
                  't1_coef':coefs['coef'][2],'t2_coef':coefs['coef'][3],'t3_coef':coefs['coef'][4]},index=[1,2,3,4])
test2.index.name='tier'
test2

tier,GPA,GPA_coef,GRE,GRE_coef,t1_coef,t2_coef,t3_coef,tier_1,tier_2,tier_3
1,4.0,0.765131,800,0.002156,1.506641,0.848963,0.182485,1,0,0
2,4.0,0.765131,800,0.002156,1.506641,0.848963,0.182485,0,1,0
3,4.0,0.765131,800,0.002156,1.506641,0.848963,0.182485,0,0,1
4,4.0,0.765131,800,0.002156,1.506641,0.848963,0.182485,0,0,0


In [49]:
test2['logodds']=logreg.intercept_+test2.GPA*test2.GPA_coef+test2.GRE*test2.GRE_coef+test2.tier_1*test2.t1_coef+test2.tier_2*test2.t2_coef+test2.tier_3*test2.t3_coef

In [50]:
test2['odds']=np.exp(test2['logodds'])

In [51]:
test2['prob']=test2['odds']/(1+test2['odds'])

Answer: below

In [52]:
# probabilities of a student with 4.0 GPA and 800 GRE being admitted to UCLA, for every undergrad school tier.
test2['prob']

tier
1    0.726598
2    0.579263
3    0.414176
4    0.370701
Name: prob, dtype: float64