# DS-SF-25 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [137]:
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn import linear_model

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

In [138]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [139]:
pd.crosstab(df.prestige, df.admit, margins = True) 

admit,0,1,All
prestige,,,
1.0,28,33,61
2.0,95,53,148
3.0,93,28,121
4.0,55,12,67
All,271,126,397


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [140]:
dum_prestige = pd.get_dummies(df.prestige, prefix = 'pres')
dum_prestige

,pres_1.0,pres_2.0,pres_3.0,pres_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [141]:
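# rename() returns a new DataFrame. The result is only displayed here, not assigned,
# so dum_prestige (and later df) keeps the original 'pres_1.0' ... 'pres_4.0' names.
dum_prestige.rename(columns = {'pres_1.0':'pres_1','pres_2.0':'pres_2','pres_3.0':'pres_3','pres_4.0':'pres_4'})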


,pres_1,pres_2,pres_3,pres_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

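Answer: We need three of the four binary variables. Schools are ranked from one to four, but the rank is a category label, not a quantity. For a categorical variable with k levels, k - 1 indicators carry all the information, and the omitted level becomes the reference category. Including all four together with an intercept would make the columns perfectly collinear.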
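
A quick check of how many indicators carry independent information, as a minimal sketch in plain Python. The `one_hot` helper is a hypothetical name used only for illustration; the sample values are the first five `prestige` entries shown by `df.head()` above. With four levels, any three indicators already determine the fourth.

```python
# Illustrative helper (hypothetical name): one-hot encode a prestige tier in 1..4.
def one_hot(tier, levels=(1, 2, 3, 4)):
    return [1 if tier == level else 0 for level in levels]

# The first five prestige values from df.head() above.
sample = [3, 3, 1, 4, 4]

for tier in sample:
    row = one_hot(tier)
    # The last indicator can always be recovered from the other three.
    recovered = 1 - sum(row[:3])
    print(tier, row, recovered == row[3])
```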

> ### Question 4.  Why are we doing this?

Answer: As noted above, we cannot treat the prestige ranking of the school as a numeric value even though it is represented by a number. Doing so would assume that each one-step change in rank has the same effect on admission. One-hot encoding instead turns it into a series of yes/no indicators, one per tier.

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [142]:
df.drop('prestige',axis = 1, inplace = True) 

In [143]:
df = df.join([dum_prestige])

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [144]:
df.head()

,admit,gre,gpa,pres_1.0,pres_2.0,pres_3.0,pres_4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
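
The frequency table this question asks for can be rebuilt from the counts in the Part A table by collapsing tiers 2 to 4 into a single row. This is a sketch in plain Python with illustrative variable names. In the notebook itself, something like `pd.crosstab(df['pres_1.0'], df.admit, margins = True)` should give the same counts, assuming the dummy columns keep the names shown above.

```python
# (not admitted, admitted) counts per prestige tier, copied from the Part A table.
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

top = counts[1]
# Sum the remaining tiers column by column.
rest = tuple(sum(col) for col in zip(*(counts[t] for t in (2, 3, 4))))

print("pres_1  admit=0  admit=1  All")
for label, (no, yes) in (("1", top), ("0", rest)):
    print("{:<7} {:>7} {:>8} {:>4}".format(label, no, yes, no + yes))
```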


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [145]:
# 33 of the 61 tier-1 applicants were admitted
p0 = 33 / 61.
print(p0)

oddsa = p0 / (1 - p0)
print(oddsa)

0.540983606557
1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [146]:
# 93 of the 336 applicants from tiers 2-4 were admitted
p1 = 93 / 336.
print(p1)
oddsb = p1 / (1 - p1)
print(oddsb)

0.276785714286
0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [147]:
oddsr = oddsa / oddsb
print(oddsr)

3.07949308756


> ### Question 10.  Write this finding in a sentence.

The odds ratio of about 3.08 means that applicants who attended a top-ranked undergraduate school have roughly three times the odds of admission of applicants who did not. An odds ratio of 1 would indicate no difference in admittance between the two groups. Note that this is a ratio of odds, not of probabilities: the admission probabilities are 0.54 and 0.28, so applicants from top-ranked schools are about twice as likely (0.54 / 0.28 ≈ 1.95) to get in.
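
A short check of the difference between a ratio of odds and a ratio of probabilities, using the counts from the Part A table. The variable names are illustrative.

```python
# Counts from the Part A frequency table.
admitted_top, total_top = 33, 61      # prestige = 1
admitted_rest, total_rest = 93, 336   # prestige = 2, 3 or 4

p_top = admitted_top / float(total_top)
p_rest = admitted_rest / float(total_rest)

odds_ratio = (p_top / (1 - p_top)) / (p_rest / (1 - p_rest))
risk_ratio = p_top / p_rest

print(round(odds_ratio, 2))  # 3.08
print(round(risk_ratio, 2))  # 1.95
```

The two measures only agree when admission is rare in both groups, which is not the case here.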

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [148]:
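# The counts could be read off the Part A table, but a dedicated table is clearer.
# The dummy columns kept their original names (e.g. 'pres_4.0'), which are not
# valid attribute names, so use bracket indexing instead of df.pres_4.
pd.crosstab(df['pres_4.0'], df.admit, margins = True)

admit,0,1,All
pres_4.0,,,
0.0,216,114,330
1.0,55,12,67
All,271,126,397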

In [149]:
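# Tier 4: 12 of 67 applicants admitted
p2 = 12 / 67.
oddsc = p2 / (1 - p2)
# Tiers 1-3 combined: 114 of 330 applicants admitted
p3 = 114 / 330.
oddsd = p3 / (1 - p3)

# Odds ratio of tiers 1-3 versus tier 4
oddsr = oddsd / oddsc
print(oddsr)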

2.41898148148


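Answer: By the same reasoning as above, applicants who did not attend one of the least prestigious schools have about 2.4 times the odds of getting into grad school of those who did. Equivalently, applicants from the least prestigious schools have odds of admission of 12 / 55 ≈ 0.22, and their odds ratio relative to everyone else is 1 / 2.42 ≈ 0.41.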
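
The question asks for the odds of the least prestigious group and for their odds ratio, which is the reciprocal of the value printed above. A minimal sketch with illustrative variable names, using counts derived from the Part A table:

```python
admitted_low, rejected_low = 12, 55      # prestige = 4
admitted_rest, rejected_rest = 114, 216  # prestige = 1, 2 or 3

odds_low = admitted_low / float(rejected_low)
odds_rest = admitted_rest / float(rejected_rest)

print(round(odds_low, 3))              # 0.218
print(round(odds_low / odds_rest, 3))  # 0.413
print(round(odds_rest / odds_low, 3))  # 2.419
```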

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

> ### Question 13.  Print the model's summary results.

In [150]:
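# Every column except 'admit' is a predictor. This keeps all four prestige
# dummies and adds no intercept, so each dummy acts as a tier-specific intercept.
train_cols = df.columns[1:]

logit = smf.Logit(df['admit'], df[train_cols])

result = logit.fit()
print(result.summary())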


Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Fri, 19 Aug 2016   Pseudo R-squ.:                 0.08166
Time:                        21:14:38   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [151]:
print(np.exp(result.params))

gre         1.002221
gpa         2.180027
pres_1.0    0.020716
pres_2.0    0.010494
pres_3.0    0.005432
pres_4.0    0.004382
dtype: float64


In [152]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

              2.5%     97.5%        OR
gre       1.000074  1.004372  1.002221
gpa       1.136120  4.183113  2.180027
pres_1.0  0.002207  0.194440  0.020716
pres_2.0  0.001183  0.093045  0.010494
pres_3.0  0.000569  0.051880  0.005432
pres_4.0  0.000469  0.040919  0.004382


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

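Answer: Because the model above includes all four prestige dummies and no intercept, the value 0.0105 for `pres_2.0` is not an odds ratio against another tier. It is the baseline odds for a tier-2 applicant (the odds at `gre` = 0 and `gpa` = 0). To compare against the highest-prestige tier, divide by the tier-1 value: 0.010494 / 0.020716 ≈ 0.51. So, holding `gre` and `gpa` fixed, an applicant from a tier-2 school has about half (roughly 50 percent lower) the odds of admission of an applicant from a tier-1 school.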
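
Since the fitted model has one dummy per tier and no intercept, odds ratios relative to tier 1 can be recovered by dividing the exponentiated coefficients. The sketch below copies the rounded values printed by `np.exp(result.params)`; the dictionary names are illustrative.

```python
# Exponentiated coefficients as printed by np.exp(result.params) above.
exp_coef = {1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

# Odds ratio of each tier against tier 1 (the reference), at equal gre and gpa.
vs_tier1 = {tier: value / exp_coef[1] for tier, value in exp_coef.items()}

for tier in sorted(vs_tier1):
    print(tier, round(vs_tier1[tier], 2))
```

This is equivalent to exponentiating the difference of two coefficients, because exp(b2 - b1) = exp(b2) / exp(b1).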

Overall explanation: Taking the exponential of each coefficient gives its odds ratio. This tells you by what factor the odds of being admitted are multiplied for a one-unit increase in that variable, holding the other variables constant.

> ### Question 16.  Interpret the odds ratio of `gpa`.

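Answer: Holding `gre` and prestige fixed, each one-point increase in `gpa` multiplies the odds of getting into graduate school by about 2.18 (95% confidence interval roughly 1.14 to 4.18), and each one-point decrease divides the odds by the same factor.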

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
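
One way to approach this, sketched in plain Python under the assumption that the model is the one fitted above (four prestige dummies, no intercept). The odds are the product of the exponentiated coefficients raised to the feature values, and the probability is odds / (1 + odds). The inputs are the rounded values printed by `np.exp(result.params)`, so the results are approximate; `admission_probability` is a hypothetical helper name. In the notebook, `result.predict` on a matching row of features should give the unrounded figures.

```python
# Exponentiated coefficients as printed by np.exp(result.params) above (rounded).
or_gre, or_gpa = 1.002221, 2.180027
or_tier = {1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

gre, gpa = 800, 4

def admission_probability(tier):
    # With no intercept, odds = OR_gre**gre * OR_gpa**gpa * OR_tier.
    odds = (or_gre ** gre) * (or_gpa ** gpa) * or_tier[tier]
    return odds / (1 + odds)

for tier in sorted(or_tier):
    print(tier, round(admission_probability(tier), 2))
```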

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO
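
A possible starting point for this part, not a worked answer. The sketch below runs on synthetic stand-in data (randomly generated, not the UCLA dataset) so that it is self-contained; in the notebook, `X` and `y` would come from `df` instead. It assumes tier 1 is used as the reference by leaving its dummy out, since `LogisticRegression` fits an intercept by default. The `max_iter` value is an assumption added to help the solver converge on unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the UCLA dataset): gre, gpa and a prestige tier.
rng = np.random.RandomState(0)
n = 400
gre = rng.uniform(300, 800, n)
gpa = rng.uniform(2.0, 4.0, n)
tier = rng.randint(1, 5, n)

# Dummies for tiers 2-4 only; tier 1 is the reference level.
X = np.column_stack([gre, gpa] + [(tier == t).astype(float) for t in (2, 3, 4)])

# Labels drawn from an arbitrary, made-up logistic model.
logit = -4.0 + 0.002 * gre + 0.8 * gpa - 0.5 * (tier - 1)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(C=10 ** 2, max_iter=5000)
model.fit(X, y)

# Odds ratios: exponentiate the fitted coefficients.
names = ['gre', 'gpa', 'pres_2', 'pres_3', 'pres_4']
for name, odds_ratio in zip(names, np.exp(model.coef_[0])):
    print(name, round(odds_ratio, 3))

# Probability of admission for gre = 800, gpa = 4 in each tier.
for t in (1, 2, 3, 4):
    row = [800.0, 4.0] + [float(t == k) for k in (2, 3, 4)]
    print(t, model.predict_proba([row])[0, 1])
```

Because `C` is the inverse of the regularization strength, a large value such as 10 ** 2 applies only light regularization, so the odds ratios would be expected to land close to, but not exactly on, an unregularized fit with the same coding.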

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: