# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
# TODO
pd.crosstab(df.admit, df.prestige)

admit \ prestige,1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
# TODO
c = df.prestige
one_hot = pd.get_dummies(c, prefix = "Prestige")
one_hot

,Prestige_1.0,Prestige_2.0,Prestige_3.0,Prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

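Answer: 3. With four prestige levels, three binary variables carry all the information, because the fourth level is implied whenever the other three are 0. Including all four alongside an intercept makes the columns perfectly collinear (the dummy variable trap), so one level is left out and serves as the reference category.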
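
The collinearity behind this answer can be checked numerically. The sketch below is only an illustration: the six prestige values are made up (they are not taken from the admissions file), and it uses `numpy` rather than `pd.get_dummies` so that the design matrix is explicit.

```python
import numpy as np

# Hypothetical prestige tiers for six applicants (illustrative only).
prestige = np.array([3, 3, 1, 4, 2, 2])

# One 0/1 column per tier, plus a column of ones for the intercept.
dummies = np.column_stack([(prestige == k).astype(float) for k in (1, 2, 3, 4)])
intercept = np.ones((len(prestige), 1))

all_four = np.hstack([intercept, dummies])            # 5 columns
drop_first = np.hstack([intercept, dummies[:, 1:]])   # 4 columns, tier 1 is the reference

# The four dummies always sum to the intercept column, so the 5-column
# matrix cannot have full column rank; dropping one dummy fixes that.
rank_all_four = np.linalg.matrix_rank(all_four)
rank_drop_first = np.linalg.matrix_rank(drop_first)

print(all_four.shape[1], rank_all_four)
print(drop_first.shape[1], rank_drop_first)
```

The five-column matrix has rank 4, while the four-column matrix has full column rank, which is why three dummies plus an intercept are enough.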

> ### Question 4.  Why are we doing this?

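Answer: `prestige` is a categorical (ordinal) variable. Entering the codes 1 to 4 as a single number would force the model to assume that every step in rank changes the log-odds of admission by the same amount. One-hot encoding lets the model estimate a separate effect for each prestige level relative to a reference level.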

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [8]:

# TODO
df = df.join([one_hot])


In [9]:
df.drop(['prestige'], axis = 1, inplace = True)
df.columns

Index([u'admit', u'gre', u'gpa', u'Prestige_1.0', u'Prestige_2.0',
       u'Prestige_3.0', u'Prestige_4.0'],
      dtype='object')

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [10]:
# TODO
pd.crosstab(df.admit, df['Prestige_1.0'])

admit \ Prestige_1.0,0.0,1.0
0,243,28
1,93,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [38]:
# Odds = (number admitted) / (number not admitted).
# 33 / 61 = 54% is the probability of admission; the odds are 33 / 28.
# float() avoids Python 2 integer division, which made the ratio return 0.
admit_prestige = df[(df['Prestige_1.0'] == 1) & (df.admit == 1)].shape[0]
reject_prestige = df[(df['Prestige_1.0'] == 1) & (df.admit == 0)].shape[0]

odds_prestige = float(admit_prestige) / reject_prestige
round(odds_prestige, 3)

1.179

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [47]:
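# Odds for applicants outside the top tier: admitted / not admitted.
# float() avoids Python 2 integer division, which is why 93 / 336 printed 0.
admit_noprestige = df[(df['Prestige_1.0'] != 1) & (df.admit == 1)].shape[0]
reject_noprestige = df[(df['Prestige_1.0'] != 1) & (df.admit == 0)].shape[0]

odds_noprestige = float(admit_noprestige) / reject_noprestige
print(round(odds_noprestige, 3))

0.383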


> ### Question 9.  Finally, what's the odds ratio?

In [48]:
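# Odds ratio = odds for tier 1 / odds for all other tiers,
# using the counts from the Question 6 frequency table.
odds_ratio = (33.0 / 28) / (93.0 / 243)
round(odds_ratio, 2)

3.08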

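> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from the most prestigious (tier-1) undergraduate schools are about 3.1 times the odds for applicants from all other schools (33/28 ≈ 1.18 versus 93/243 ≈ 0.38).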

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [None]:
# TODO

Answer:
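
A sketch of how this calculation could go, using only the counts from the Question 1 frequency table (tier 4: 12 admitted, 55 not admitted). The helper name `odds` and the choice to compare tier 4 against tiers 1 to 3 combined are assumptions made for illustration.

```python
def odds(successes, failures):
    """Odds = count of successes divided by count of failures."""
    return float(successes) / failures

# Counts read from the Question 1 frequency table.
admitted_tier4, rejected_tier4 = 12, 55
admitted_other = 33 + 53 + 28   # admitted, tiers 1-3
rejected_other = 28 + 95 + 93   # not admitted, tiers 1-3

odds_tier4 = odds(admitted_tier4, rejected_tier4)
odds_other = odds(admitted_other, rejected_other)
odds_ratio_tier4 = odds_tier4 / odds_other

print(round(odds_tier4, 3))
print(round(odds_other, 3))
print(round(odds_ratio_tier4, 3))
```

Under these assumptions the odds are about 0.22 for tier-4 applicants and about 0.53 for everyone else, an odds ratio of about 0.41: the odds of admission for applicants from the least prestigious schools are roughly 59% lower than for applicants from all other schools.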

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [55]:
# NOTE: the question asks for statsmodels, but this cell fits the model with sklearn.
# It also keeps all four prestige dummies instead of dropping Prestige_1.0 as the
# reference level, so the coefficients are not measured against tier-1 schools.

X = df[['gre', 'gpa', 'Prestige_1.0', 'Prestige_2.0', 'Prestige_3.0', 'Prestige_4.0']]

model_Admit = linear_model.LogisticRegression().fit(X, df.admit)
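
For reference, here is a minimal sketch of what a `statsmodels` fit with tier 1 as the reference level might look like. It runs on synthetic data generated inside the block, not on the UCLA file, so the data frame `toy`, the simulated coefficients, and every number it produces are illustrative assumptions only. It uses the formula interface (`smf.logit`) with patsy's `C(..., Treatment(reference=1))` to dummy-code `prestige`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (NOT the real UCLA file).
rng = np.random.RandomState(0)
n = 400
toy = pd.DataFrame({
    'gre': rng.randint(300, 801, size=n).astype(float),
    'gpa': rng.uniform(2.0, 4.0, size=n),
    'prestige': rng.randint(1, 5, size=n),
})

# Simulate admissions from an assumed logistic relationship.
log_odds = -4.0 + 0.003 * toy.gre + 0.8 * toy.gpa - 0.5 * (toy.prestige - 1)
toy['admit'] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# C(prestige, Treatment(reference=1)) dummy-codes prestige inside the
# formula and leaves tier 1 out as the reference category.
result = smf.logit(
    'admit ~ gre + gpa + C(prestige, Treatment(reference=1))',
    data=toy,
).fit(disp=0)

print(result.summary())

# Odds ratios and their 95% confidence intervals are the exponentiated
# coefficients and the exponentiated interval bounds.
odds_ratios = np.exp(result.params)
conf_int = np.exp(result.conf_int())
print(odds_ratios)
print(conf_int)
```

With the real data, the same pattern would apply to the original `prestige` column before it is replaced by the one-hot columns; `result.summary()` addresses Question 13, and the exponentiated `params` and `conf_int()` address Question 14.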

> ### Question 13.  Print the model's summary results.

In [56]:
# TODO
print(model_Admit.coef_)
print(model_Admit.intercept_)

[[ 0.00172658  0.19596439  0.35943456 -0.31929464 -0.88821548 -1.10459122]]
[-1.95266678]


In [57]:
model_Admit.score(X, df.admit)

0.7128463476070529

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
# TODO
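
One way to start on this question, assuming the `sklearn` coefficients printed in the previous cell are the ones of interest, is to exponentiate each coefficient to get its odds ratio. The dictionary below copies those printed values by hand. `sklearn` does not report standard errors, so the 95% confidence intervals cannot be obtained this way; they would have to come from a `statsmodels` fit (its results object has a `conf_int()` method) or from a resampling procedure.

```python
import math

# Coefficients copied from the sklearn output printed earlier in this notebook.
coefficients = {
    'gre': 0.00172658,
    'gpa': 0.19596439,
    'Prestige_1.0': 0.35943456,
    'Prestige_2.0': -0.31929464,
    'Prestige_3.0': -0.88821548,
    'Prestige_4.0': -1.10459122,
}

# An odds ratio is the exponential of a logistic regression coefficient.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name in sorted(odds_ratios):
    print(name, round(odds_ratios[name], 3))
```

Rounded to two decimals, these come out to about 1.00 for `gre`, 1.22 for `gpa`, and 1.43, 0.73, 0.41, and 0.33 for prestige tiers 1 through 4.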


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

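Answer: The coefficient for `Prestige_2.0` (-0.319) is a change in log-odds, so the corresponding odds ratio is exp(-0.319) ≈ 0.73, not a 31% decrease. Because this model contains all four prestige dummies rather than a reference level, the comparison with tier 1 uses the difference between the two coefficients: exp(-0.319 - 0.359) ≈ 0.51. Holding `gre` and `gpa` fixed, the odds of admission for an applicant from a tier-2 school are about half those of an applicant from a tier-1 school.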

> ### Question 16.  Interpret the odds ratio of `gpa`.

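Answer: The `gpa` coefficient (0.196) is a change in log-odds. Holding the other features fixed, each one-point increase in GPA multiplies the odds of admission by exp(0.196) ≈ 1.22, an increase of about 22% in the odds (not in the probability).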

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO see answers above
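
The model fitted earlier used the default `C = 1`, not the `C = 10 ** 2` this question asks for. A minimal sketch of the requested setup follows, run on synthetic data generated inside the block (not the UCLA file). The simulated coefficients, the `max_iter` value, and the names `model` and `applicants` are assumptions for illustration. Tier 1 is the reference level, so only the tier 2 to 4 dummies enter the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions data (NOT the real UCLA file).
rng = np.random.RandomState(0)
n = 400
gre = rng.randint(300, 801, size=n).astype(float)
gpa = rng.uniform(2.0, 4.0, size=n)
tier = rng.randint(1, 5, size=n)

# Dummies for tiers 2-4 only; tier 1 is the reference level.
dummies = np.column_stack([(tier == k).astype(float) for k in (2, 3, 4)])
X = np.column_stack([gre, gpa, dummies])

# Simulate admissions from an assumed logistic relationship.
log_odds = -4.0 + 0.003 * gre + 0.8 * gpa - 0.5 * (tier - 1)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# A large C means weak regularization.
model = LogisticRegression(C=10 ** 2, max_iter=5000).fit(X, y)

# Odds ratios are the exponentiated coefficients.
odds_ratios = np.exp(model.coef_[0])

# One row per tier for an applicant with GRE 800 and GPA 4.0.
applicants = np.array([
    [800, 4.0, 0, 0, 0],   # tier 1 (reference)
    [800, 4.0, 1, 0, 0],   # tier 2
    [800, 4.0, 0, 1, 0],   # tier 3
    [800, 4.0, 0, 0, 1],   # tier 4
], dtype=float)

# Column 1 of predict_proba is the probability of the positive class (admit = 1).
probs = model.predict_proba(applicants)[:, 1]
print(odds_ratios)
print(probs)
```

Using `predict_proba` avoids retyping coefficients by hand and returns probabilities directly rather than log-odds.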

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [58]:
print(model_Admit.coef_)
print(model_Admit.intercept_)

[[ 0.00172658  0.19596439  0.35943456 -0.31929464 -0.88821548 -1.10459122]]
[-1.95266678]


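Answer: The odds ratios are the exponentials of the coefficients printed above: about 1.002 for `gre`, 1.22 for `gpa`, and 1.43, 0.73, 0.41, and 0.33 for prestige tiers 1 through 4. They cannot be compared with `statsmodels` results here because no `statsmodels` model was fitted in Part C. Note also that this model was fitted with the default `C = 1` rather than the requested `C = 10 ** 2`, so its coefficients are more strongly regularized (shrunk toward zero) than those of an essentially unregularized fit.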

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [62]:
# The linear predictor gives log-odds; the logistic function 1 / (1 + exp(-z))
# converts log-odds to a probability.
# Intercept and coefficients are copied from the model output above.
intercept = -1.95266678
base = intercept + (.00172658 * 800) + (.19596439 * 4)

Admit_1 = 1 / (1 + np.exp(-(base + .35943456)))
Admit_2 = 1 / (1 + np.exp(-(base - .31929464)))
Admit_3 = 1 / (1 + np.exp(-(base - .88821548)))
Admit_4 = 1 / (1 + np.exp(-(base - 1.10459122)))


print(round(Admit_1, 3))
print(round(Admit_2, 3))
print(round(Admit_3, 3))
print(round(Admit_4, 3))

0.639
0.473
0.337
0.291


Answer: About 64% if he/she comes from a tier-1 school, 47% from a tier-2 school, 34% from a tier-3 school, and 29% from a tier-4 school. The linear combination of the coefficients is a log-odds value, so it has to be passed through the logistic function before it can be read as a probability.