# DS-NYC-45 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [96]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [97]:
df = pd.read_csv(os.path.join('..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [98]:
# TODO
print(pd.crosstab(df['admit'], df['prestige'], rownames=['admit']))

prestige  1.0  2.0  3.0  4.0
admit                       
0          28   95   93   55
1          33   53   28   12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [99]:
# TODO
pd.get_dummies(df['prestige'])

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

Answer: 3

> ### Question 4.  Why are we doing this?

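Answer: `prestige` is a categorical variable, so its codes 1 to 4 should not enter the model as a single numeric feature; one-hot encoding gives each tier its own indicator. We only need 3 of the 4 indicators because the fourth is redundant: every applicant belongs to exactly one tier, so the four columns always sum to 1 and any one of them can be recovered from the other three. Including all four alongside an intercept would make the features perfectly collinear. The dropped tier becomes the reference category.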
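The redundancy can be checked by hand. The sketch below is illustrative only: it one-hot encodes the prestige values of the first five applicants from the table above in plain Python (the variable names are made up for this example) and rebuilds the tier-1 column from the other three.

```python
# Prestige tiers of the first five applicants shown in the table above.
prestige = [3, 3, 1, 4, 4]
tiers = [1, 2, 3, 4]

# One-hot encode by hand: one 0/1 indicator per tier for each applicant.
one_hot = [[int(p == t) for t in tiers] for p in prestige]

# Every row has exactly one 1, so the four indicators always sum to 1.
row_sums = [sum(row) for row in one_hot]

# Drop the tier-1 column and rebuild it from the remaining three.
reduced = [row[1:] for row in one_hot]
rebuilt_tier1 = [1 - sum(row) for row in reduced]

print(row_sums)
print(rebuilt_tier1 == [row[0] for row in one_hot])
```

Because the dropped column is fully determined by the others, it carries no extra information for the model.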

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [100]:
df_prestige1 = pd.get_dummies(df.prestige, drop_first=True)
df_prestige1

Unnamed: 0,2.0,3.0,4.0
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
...,...,...,...
395,1.0,0.0,0.0
396,0.0,1.0,0.0
397,1.0,0.0,0.0
398,1.0,0.0,0.0


In [101]:
# TODO
df_prestige=pd.get_dummies(df['prestige'])

In [102]:
df = df.join(df_prestige)

In [103]:
df= df.drop('prestige', axis=1)

In [91]:
df.head()

Unnamed: 0,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [104]:
# TODO
print(pd.crosstab(df['admit'], df[1.0], rownames=['admit']))

1.0    0.0  1.0
admit          
0      243   28
1       93   33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [106]:
# TODO
33.0/28

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [107]:
# TODO
93.0/243

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [108]:
# TODO
1.17/0.38

3.078947368421052

> ### Question 10.  Write this finding in a sentence.

Answer: For applicants who attended a top-ranked undergraduate school, the odds of being admitted to graduate school are about 3 times the odds for applicants who did not.
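The arithmetic in Questions 7 to 9 can be written as one small calculation that avoids rounding the intermediate odds. This is a minimal sketch using the counts from the frequency table above; the helper name `odds` is made up for this example.

```python
def odds(admitted, rejected):
    """Odds of admission: admitted applicants per rejected applicant."""
    return admitted / float(rejected)

# Counts taken from the prestige = 1 frequency table above.
odds_tier1 = odds(33, 28)    # attended a tier-1 school
odds_other = odds(93, 243)   # attended any other tier

# The odds ratio compares the two groups.
odds_ratio = odds_tier1 / odds_other
print(round(odds_ratio, 2))
```

Working from the raw counts gives an odds ratio of about 3.08, in line with the value computed from the rounded odds.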

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [109]:
# TODO
print(pd.crosstab(df['admit'], df[4.0], rownames=['admit']))

4.0    0.0  1.0
admit          
0      216   55
1      114   12


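Answer: The odds of admission for applicants from the least prestigious schools are 12.0/55 ≈ 0.22, versus 114.0/216 ≈ 0.53 for everyone else, so the odds ratio is (12.0/55)/(114.0/216) ≈ 0.41. For applicants who attended a lowest-ranked undergraduate school, the odds of being admitted to graduate school are about 0.4 times the odds for applicants who attended a better-ranked school.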

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [113]:
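# TODO
# Features: gre, gpa, and all four prestige dummies (every column after `admit`).
# NOTE: no intercept is added and no dummy is dropped, so each prestige
# coefficient acts as a per-tier intercept, not a contrast against tier 1.
train_cols = df.columns[1:]
logit = smf.Logit(df['admit'], df[train_cols])
result = logit.fit()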

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [114]:
# TODO
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Sun, 15 Jan 2017   Pseudo R-squ.:                 0.08166
Time:                        16:01:18   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043      7.44e-05     0.004
gpa            0.7793      0.333      2.344      0.019         0.128     1.431
1.0           -3.8769      1.142     -3.393      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [115]:
# TODO
print(result.conf_int())

            0         1
gre  0.000074  0.004362
gpa  0.127619  1.431056
1.0 -6.116077 -1.637631
2.0 -6.739307 -2.374674
3.0 -7.472244 -2.958819
4.0 -7.664377 -3.196152


In [118]:
# odds ratios only
print(np.exp(result.params))

gre    1.002221
gpa    2.180027
1.0    0.020716
2.0    0.010494
3.0    0.005432
4.0    0.004382
dtype: float64


In [119]:
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))

         2.5%     97.5%        OR
gre  1.000074  1.004372  1.002221
gpa  1.136120  4.183113  2.180027
1.0  0.002207  0.194440  0.020716
2.0  0.001183  0.093045  0.010494
3.0  0.000569  0.051880  0.005432
4.0  0.000469  0.040919  0.004382
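The interval bounds in this table are the exponentiated endpoints of each coefficient's 95% confidence interval. As a rough check, the sketch below rebuilds the `gpa` row by hand, assuming a normal-approximation (Wald) interval with z ≈ 1.96 and using the rounded coefficient and standard error printed in the summary, so the result only agrees with the table up to that rounding.

```python
import math

# Coefficient and standard error for gpa, as printed in the summary table.
coef, std_err = 0.7793, 0.333
z = 1.96  # approximate 97.5th percentile of the standard normal

# Wald interval on the log-odds scale, then exponentiate to the odds-ratio scale.
lower, upper = coef - z * std_err, coef + z * std_err
or_lower, or_point, or_upper = math.exp(lower), math.exp(coef), math.exp(upper)

print(or_lower, or_point, or_upper)
```

Exponentiating the endpoints rather than the standard error is what keeps the interval asymmetric around the odds ratio.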


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

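Answer: Because the model was fit with all four prestige dummies and no intercept, the value 0.0105 for `prestige = 2` is not an odds ratio against tier 1; it is the odds of admission for a tier-2 applicant when `gre` and `gpa` are both 0. Comparing tiers requires dividing the two values: 0.010494 / 0.020716 ≈ 0.51. Holding GRE and GPA constant, the odds of admission for an applicant from a tier-2 school are about half (roughly 49% lower than) the odds for an applicant from a tier-1 school.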
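Since the fit above includes all four prestige dummies and no intercept, one way to recover odds ratios relative to tier 1 is to divide each tier's exponentiated coefficient by that of tier 1. A minimal sketch, using the values copied from the odds-ratio table above (the dictionary names are made up for this example):

```python
# Exponentiated prestige coefficients copied from the odds-ratio table above.
exp_coef = {1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

# Odds ratio of each tier relative to tier 1 (the intended reference category).
relative_or = {tier: value / exp_coef[1] for tier, value in exp_coef.items()}

for tier in sorted(relative_or):
    print(tier, round(relative_or[tier], 2))
```

On these numbers the ratios fall steadily from 1 for tier 1 to about 0.21 for tier 4.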

> ### Question 16.  Interpret the odds ratio of `gpa`.

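Answer: Holding GRE and prestige constant, each one-point increase in GPA multiplies the odds of being admitted to graduate school by about 2.18, an increase of about 118% (95% confidence interval for the odds ratio: 1.14 to 4.18).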

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
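One way to approach this, sketched from the exponentiated coefficients printed under Question 14 rather than from the fitted `result` object: take the log of each value to recover the coefficient, add up the log-odds for the applicant, and apply the logistic function. The function name `admission_probability` is made up for this example.

```python
import math

# Exponentiated coefficients copied from the odds-ratio output above.
exp_coef = {'gre': 1.002221, 'gpa': 2.180027,
            1: 0.020716, 2: 0.010494, 3: 0.005432, 4: 0.004382}

def admission_probability(gre, gpa, tier):
    """Logistic probability for one applicant; `tier` is 1 to 4."""
    # The model has no separate intercept, so the tier term plays that role.
    log_odds = (gre * math.log(exp_coef['gre'])
                + gpa * math.log(exp_coef['gpa'])
                + math.log(exp_coef[tier]))
    return 1.0 / (1.0 + math.exp(-log_odds))

probabilities = {tier: admission_probability(800, 4.0, tier)
                 for tier in (1, 2, 3, 4)}
for tier in sorted(probabilities):
    print(tier, round(probabilities[tier], 2))
```

Under these assumptions the probabilities come out at roughly 0.73, 0.58, 0.42, and 0.37 for tiers 1 through 4.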

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again assume a student has a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: