# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
pd.crosstab(df.prestige, df.admit)

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
c = df.prestige

In [5]:
cs = pd.get_dummies(c, prefix = None)

In [6]:
cs

,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

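Answer: Only three of the four. Every row has exactly one 1 across the four columns, so the fourth column can always be inferred from the other three and adds no information. Keeping all four alongside an intercept would make the features perfectly collinear.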
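To make the redundancy concrete, here is a minimal sketch on a few hand-written one-hot rows (the `rows` list is illustrative, copied from the first five rows of the table above). The fourth indicator is always one minus the sum of the other three.

```python
# One-hot rows for prestige levels 3, 3, 1, 4, 4 (first rows of the table above).
rows = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

# Rebuild the fourth indicator from the first three.
inferred_fourth = [1 - sum(row[:3]) for row in rows]
actual_fourth = [row[3] for row in rows]

print(inferred_fourth)
print(inferred_fourth == actual_fourth)
```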

> ### Question 4.  Why are we doing this?

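Answer: `prestige` is an ordinal category, not a measured quantity. Entering it as a single number would force the model to assume that every one-step change in rank shifts the log-odds of admission by the same amount. One-hot encoding lets each prestige level have its own effect relative to a reference level.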

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [7]:
df1 = pd.concat([df, cs], axis=1)

In [8]:
df1.drop('prestige', axis=1, inplace=True)


In [9]:
df1.columns

Index([u'admit', u'gre', u'gpa', 1.0, 2.0, 3.0, 4.0], dtype='object')

In [10]:
df1_cols = ['admit', 'gre', 'gpa', 'prestige_1', 'prestige_2', 'prestige_3', 'prestige_4']
df1.columns = df1_cols
df1

,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0
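As an aside, `pd.get_dummies` can name the columns itself through its `prefix` argument, which avoids the manual rename. The sketch below uses a small stand-in frame (`toy` is hypothetical, not the admissions data) with integer prestige values. In this notebook `prestige` is stored as a float, so it would probably need an `.astype(int)` first to avoid names like `prestige_1.0`.

```python
import pandas as pd

# Hypothetical stand-in for the admissions data.
toy = pd.DataFrame({
    'admit': [0, 1, 1, 1, 0],
    'prestige': [3, 3, 1, 4, 2],
})

# prefix='prestige' yields columns prestige_1 ... prestige_4 directly.
dummies = pd.get_dummies(toy['prestige'], prefix='prestige')

# Join the indicators and drop the original column.
encoded = toy.drop('prestige', axis=1).join(dummies)

print(list(encoded.columns))
```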


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [11]:
pd.crosstab(df[df.prestige==1].prestige, df[df.prestige==1].admit)

prestige,admit=0,admit=1
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

Odds = p / (1 - p), where p is the probability of admission.

In [12]:
odds_A = (33. / (28. + 33)) / (28. / (28. + 33))
odds_A

1.1785714285714286
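The same calculation can be wrapped in a small helper. This is a sketch, and the function name `odds` is an illustrative choice. It also shows that the shared denominator cancels, so the odds reduce to admitted divided by not admitted.

```python
def odds(successes, failures):
    """Odds of success, computed as p / (1 - p)."""
    p = float(successes) / (successes + failures)
    return p / (1 - p)

# Prestige 1: 33 admitted, 28 not admitted.
odds_prestige_1 = odds(33, 28)
print(odds_prestige_1)

# The total cancels out, so this is the same number.
print(33.0 / 28.0)
```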

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [13]:
pd.crosstab(df[df.prestige!=1].prestige, df[df.prestige!=1].admit)

prestige,admit=0,admit=1
2.0,95,53
3.0,93,28
4.0,55,12


p / (1 - p)

In [14]:
odds_B = ((53. + 28. + 12.) / (95. + 93. + 55. + 53. + 28. + 12.)) / ((95. + 93. + 55.) / (95. + 93. + 55. + 53. + 28. + 12.))
odds_B

0.3827160493827161

> ### Question 9.  Finally, what's the odds ratio?

In [15]:
odds_A / odds_B

3.0794930875576036
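For a 2x2 table the odds ratio can be computed directly from the four counts. The sketch below uses a hypothetical helper, `odds_ratio`, with the counts taken from the two frequency tables above.

```python
def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of 'yes' in group A divided by the odds of 'yes' in group B."""
    return (float(a_yes) / a_no) / (float(b_yes) / b_no)

# Group A: prestige 1 (33 admitted, 28 not).
# Group B: prestige 2-4 pooled (53 + 28 + 12 admitted, 95 + 93 + 55 not).
ratio = odds_ratio(33, 28, 53 + 28 + 12, 95 + 93 + 55)
print(round(ratio, 4))
```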

> ### Question 10.  Write this finding in a sentence.

Applicants from schools with `prestige = 1` (1 is the highest rank) have odds of admission about 3.1 times those of applicants from schools with prestige 2, 3, or 4. An odds ratio is not a ratio of probabilities, though. To say how many times more likely admission is, use the relative risk (the ratio of the two admission rates), which shows they are roughly twice as likely to be admitted:

In [16]:
round((33. / (33. + 28.)) / ((53. + 28. + 12.) / (53. + 28. + 12. + 95. + 93. + 55.)), 4)

1.9545

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [17]:
pd.crosstab(df[df.prestige==4].prestige, df[df.prestige==4].admit)

prestige,admit=0,admit=1
4.0,55,12


p / (1 - p)

In [18]:
odds_A = (12. / (55. + 12)) / (55. / (55. + 12))
odds_A

0.21818181818181817

In [19]:
pd.crosstab(df[df.prestige!=4].prestige, df[df.prestige!=4].admit)

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28


p / (1 - p)

In [20]:
odds_B = ((33. + 53. + 28.) / (33. + 53. + 28. + 28. + 95. + 93.)) / ((28. + 95. + 93.) / (33. + 53. + 28. + 28. + 95. + 93.))
odds_B

0.5277777777777778

In [21]:
odds_A / odds_B

0.4133971291866028

Applicants from schools with `prestige = 4` (4 is the lowest rank) have odds of admission about 0.41 times those of applicants from schools with prestige 1, 2, or 3. In terms of probabilities, the relative risk (the ratio of the two admission rates) shows they are roughly half as likely to be admitted:

In [22]:
round((12. / (12. + 55.)) / ((33. + 53. + 28.) / (33. + 53. + 28. + 28. + 95. + 93.)), 4)

0.5185
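The difference between the two measures is easier to see side by side. The sketch below defines two hypothetical helpers, `relative_risk` and `odds_ratio`, and applies them to the prestige 4 comparison, using the counts from the frequency tables in cells 17 and 19 (the pooled "not admitted" count is 28 + 95 + 93).

```python
def relative_risk(a_yes, a_no, b_yes, b_no):
    """Admission rate in group A divided by the admission rate in group B."""
    rate_a = float(a_yes) / (a_yes + a_no)
    rate_b = float(b_yes) / (b_yes + b_no)
    return rate_a / rate_b

def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of admission in group A divided by the odds in group B."""
    return (float(a_yes) / a_no) / (float(b_yes) / b_no)

# Group A: prestige 4 (12 admitted, 55 not).
# Group B: prestige 1-3 pooled.
others_yes = 33 + 53 + 28
others_no = 28 + 95 + 93

rr = relative_risk(12, 55, others_yes, others_no)
orr = odds_ratio(12, 55, others_yes, others_no)

print(round(rr, 4))
print(round(orr, 4))
```

The odds ratio sits further from 1 than the relative risk here. That is expected whenever the outcome is not rare, which is why the two should not be read interchangeably.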

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [23]:
df = df1
train_df = df1
train_df.columns

Index([u'admit', u'gre', u'gpa', u'prestige_1', u'prestige_2', u'prestige_3',
       u'prestige_4'],
      dtype='object')

In [24]:
names_X = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']

def X_c(df):
    X = df[ names_X ]
    c = df.admit
    return X, c

train_X, train_c = X_c(train_df)
## test_X, test_c = X_c(test_df)

> ### Question 13.  Print the model's summary results.

In [25]:
# model = linear_model.LogisticRegression().\
#    fit(train_X, train_c)

# print model.intercept_
# print model.coef_

In [26]:
logit = smf.Logit(train_c, train_X)

result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


In [27]:
print result.summary()

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Sun, 05 Feb 2017   Pseudo R-squ.:                 0.05722
Time:                        12:03:16   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.308      0.191        -0.001     0.003
gpa           -0.1323      0.195     -0.680      0.497        -0.514     0.249
prestige_2    -0.9562      0.302     -3.171      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

The odds ratios are the exponentials of the fitted coefficients above (in the order gre, gpa, prestige_2, prestige_3, prestige_4).

In [28]:
print np.exp(result.params)

gre           1.001368
gpa           0.876073
prestige_2    0.384342
prestige_3    0.214918
prestige_4    0.154135
dtype: float64


In [29]:
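# These bounds are on the coefficient (log-odds) scale;
# apply np.exp to them to get bounds for the odds ratios.
conf_int = result.conf_int()
print conf_int[0]
print conf_int[1]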

gre          -0.000680
gpa          -0.513657
prestige_2   -1.547279
prestige_3   -2.188769
prestige_4   -2.656743
Name: 0, dtype: float64
gre           0.003414
gpa           0.249045
prestige_2   -0.365166
prestige_3   -0.886230
prestige_4   -1.083112
Name: 1, dtype: float64
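The bounds printed above are for the coefficients, so they still need to be exponentiated to answer the question in terms of odds ratios. The sketch below does this with the printed values copied in by hand (the dictionary names are illustrative). With the fitted model in memory, `np.exp(result.conf_int())` should give the same thing.

```python
import math

# 95% bounds on the coefficient (log-odds) scale, copied from the output above.
coef_bounds = {
    'gre':        (-0.000680, 0.003414),
    'gpa':        (-0.513657, 0.249045),
    'prestige_2': (-1.547279, -0.365166),
    'prestige_3': (-2.188769, -0.886230),
    'prestige_4': (-2.656743, -1.083112),
}

# exp is monotonic, so exponentiating each bound gives the odds-ratio interval.
or_bounds = {}
for name, (low, high) in coef_bounds.items():
    or_bounds[name] = (math.exp(low), math.exp(high))

for name in ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']:
    low, high = or_bounds[name]
    print('%-10s  %.3f  %.3f' % (name, low, high))
```

The intervals for `gre` and `gpa` contain 1, which matches their large p-values in the summary. The three prestige intervals lie entirely below 1.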


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

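The odds ratio for `prestige_2` is about 0.38. Holding GRE and GPA fixed, an applicant from a prestige 2 school has odds of admission about 0.38 times those of an applicant from a prestige 1 school (the reference level), a decrease of about 62%.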

> ### Question 16.  Interpret the odds ratio of `gpa`.

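The odds ratio for `gpa` is about 0.88, so each 1.0-point increase in GPA multiplies the odds of admission by about 0.88, a decrease of about 12%. The estimate is not statistically significant (p = 0.497, and the 95% confidence interval for the coefficient spans zero), so this model does not show a clear GPA effect. Note that `Logit` was given no intercept column here, which may be why the sign is counterintuitive.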
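An odds ratio converts to a percent change in the odds as (OR - 1) × 100, and a k-unit change in a feature multiplies the odds by OR to the power k. A minimal sketch using the odds ratios printed in cell 28 (the helper name `percent_change_in_odds` is illustrative):

```python
# Odds ratios reported by the statsmodels fit above.
odds_ratios = {
    'gre': 1.001368,
    'gpa': 0.876073,
    'prestige_2': 0.384342,
    'prestige_3': 0.214918,
    'prestige_4': 0.154135,
}

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the feature."""
    return (odds_ratio - 1.0) * 100.0

for name in ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']:
    print('%-10s %+.1f%%' % (name, percent_change_in_odds(odds_ratios[name])))

# A 100-point GRE increase multiplies the odds by OR ** 100.
gre_100_points = odds_ratios['gre'] ** 100
print(round(gre_100_points, 3))
```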

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [30]:
predict_X = [ [800, 4.0, 0, 0, 0] ] # tier 1

print result.predict(predict_X)

[ 0.63739858]


In [31]:
predict_X = [ [800, 4.0, 1, 0, 0] ] # tier 2

print result.predict(predict_X)

[ 0.40320425]


In [32]:
predict_X = [ [800, 4.0, 0, 1, 0] ] # tier 3

print result.predict(predict_X)

[ 0.27420161]


In [33]:
predict_X = [ [800, 4.0, 0, 0, 1] ] # tier 4

print result.predict(predict_X)

[ 0.21318433]


For Tiers 1-4 respectively: 64%, 40%, 27%, 21%
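These predictions can be reproduced by hand, which helps show what `result.predict` is doing. The sketch below recovers the coefficients as the logarithms of the printed odds ratios and applies the logistic function to the linear predictor. It assumes the model has no intercept term, consistent with the five-element feature vectors used above. Small differences from the printed probabilities come from the rounding of the odds ratios.

```python
import math

# Coefficients recovered as log(odds ratio) from the values printed in cell 28,
# in the order gre, gpa, prestige_2, prestige_3, prestige_4.
coefs = [math.log(v) for v in (1.001368, 0.876073, 0.384342, 0.214918, 0.154135)]

def predict_proba(features):
    """Logistic function applied to the linear predictor (no intercept term)."""
    z = sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

applicants = {
    'tier 1': [800, 4.0, 0, 0, 0],
    'tier 2': [800, 4.0, 1, 0, 0],
    'tier 3': [800, 4.0, 0, 1, 0],
    'tier 4': [800, 4.0, 0, 0, 1],
}

probabilities = {}
for tier, features in applicants.items():
    probabilities[tier] = predict_proba(features)

for tier in sorted(probabilities):
    print('%s  %.3f' % (tier, probabilities[tier]))
```

Without an intercept, an all-zero feature vector gives a linear predictor of 0 and therefore a probability of exactly 0.5.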

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [34]:
X = df[ ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4'] ]
c = df.admit

model = linear_model.LogisticRegression(fit_intercept=False, C=100).\
    fit(X, c)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [35]:
print np.exp(model.coef_)

[[ 1.00128195  0.87660575  0.41070667  0.22500931  0.16769503]]


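The odds ratios are slightly different from those calculated with `statsmodels`, because `sklearn` applies L2 regularization by default (`C = 100` makes the penalty weak but not zero). For comparison, the `statsmodels` values were: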

gre           1.001368

gpa           0.876073

prestige_2    0.384342

prestige_3    0.214918

prestige_4    0.154135
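To quantify the gap, the sketch below compares the two sets of odds ratios, with values copied from the outputs above (the variable names are illustrative). An L2 penalty shrinks coefficients toward zero, so one would expect every `sklearn` odds ratio to sit slightly closer to 1. Solver differences may also contribute.

```python
statsmodels_or = {
    'gre': 1.001368,
    'gpa': 0.876073,
    'prestige_2': 0.384342,
    'prestige_3': 0.214918,
    'prestige_4': 0.154135,
}
sklearn_or = {
    'gre': 1.00128195,
    'gpa': 0.87660575,
    'prestige_2': 0.41070667,
    'prestige_3': 0.22500931,
    'prestige_4': 0.16769503,
}

# Relative difference of the sklearn odds ratio, in percent.
relative_diff = {}
for name in statsmodels_or:
    relative_diff[name] = (sklearn_or[name] - statsmodels_or[name]) / statsmodels_or[name] * 100.0

# Check whether each sklearn odds ratio is closer to 1 (a shrunken coefficient).
closer_to_one = {}
for name in statsmodels_or:
    closer_to_one[name] = abs(sklearn_or[name] - 1.0) < abs(statsmodels_or[name] - 1.0)

for name in ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']:
    print('%-10s %+.2f%%  closer to 1: %s' % (name, relative_diff[name], closer_to_one[name]))
```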

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [36]:
predict_X = [ [800, 4.0, 0, 0, 0] ] # tier 1

print model.predict(predict_X)
print model.predict_proba(predict_X)

[1]
[[ 0.37798533  0.62201467]]


In [37]:
predict_X = [ [800, 4.0, 1, 0, 0] ] # tier 2

print model.predict(predict_X)
print model.predict_proba(predict_X)

[0]
[[ 0.59670818  0.40329182]]


In [38]:
predict_X = [ [800, 4.0, 0, 1, 0] ] # tier 3

print model.predict(predict_X)
print model.predict_proba(predict_X)

[0]
[[ 0.72977971  0.27022029]]


In [39]:
predict_X = [ [800, 4.0, 0, 0, 1] ] # tier 4

print model.predict(predict_X)
print model.predict_proba(predict_X)

[0]
[[ 0.78372373  0.21627627]]


For Tiers 1-4 respectively: 62%, 40%, 27%, 22%
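A quick side-by-side of the two models' predictions, using the probabilities printed above (the list names are illustrative), shows how little the weak regularization changes the answer:

```python
tiers = ['tier 1', 'tier 2', 'tier 3', 'tier 4']
statsmodels_p = [0.63739858, 0.40320425, 0.27420161, 0.21318433]
sklearn_p = [0.62201467, 0.40329182, 0.27022029, 0.21627627]

# Gap in percentage points (sklearn minus statsmodels).
gaps = [(s - m) * 100.0 for s, m in zip(sklearn_p, statsmodels_p)]

# Class predicted at the usual 0.5 threshold.
statsmodels_class = [int(p > 0.5) for p in statsmodels_p]
sklearn_class = [int(p > 0.5) for p in sklearn_p]

for tier, gap in zip(tiers, gaps):
    print('%s  %+.2f points' % (tier, gap))
print(statsmodels_class == sklearn_class)
```

The largest gap is about 1.5 percentage points (tier 1), and both models predict admission only for the tier-1 applicant.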