# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [36]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.api as sm
import statsmodels.formula.api as smf


from sklearn import linear_model

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import seaborn as sns

In [37]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


In [38]:
df.duplicated().sum()

5

## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [39]:
df.prestige.value_counts(dropna = False).sort_index()

1.0     61
2.0    148
3.0    121
4.0     67
Name: prestige, dtype: int64

In [40]:
pd.crosstab(df.prestige, df.admit,
           rownames = ['Prestige'], 
           colnames = ['Admit'])

Prestige,Admit = 0,Admit = 1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [41]:
# TODO
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

In [42]:
prestige_df

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [43]:
prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                           'prestige_2.0': 'prestige_2',
                           'prestige_3.0': 'prestige_3',
                           'prestige_4.0': 'prestige_4'}, inplace = True)

In [44]:
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

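Answer: Three. The four categories are mutually exclusive, so any three indicators determine the fourth: if `prestige_1`, `prestige_2`, and `prestige_3` are all 0, then `prestige_4` must be 1. Including all four alongside an intercept would make the columns perfectly collinear.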

> ### Question 4.  Why are we doing this?

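Answer: `prestige` is a categorical ranking, not a quantity, so using it as a single numeric feature would force every step in rank to have the same effect. One-hot encoding turns its four possible values into binary features, exactly one of which is 1 for each applicant, so the model can estimate a separate effect for each tier.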
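
The encoding and the role of the dropped reference level can be sketched in plain Python. This is an illustrative sketch only: the helper `one_hot` and its parameters are hypothetical names, not part of the notebook, and in practice `pd.get_dummies` (optionally with `drop_first = True`) does this work.

```python
def one_hot(values, levels, prefix, drop_first=False):
    """Encode categorical values as 0/1 indicator columns.

    Hypothetical helper for illustration. With drop_first=True the first
    level is omitted and becomes the reference category (all zeros).
    """
    kept = list(levels[1:]) if drop_first else list(levels)
    names = ['%s_%d' % (prefix, level) for level in kept]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return names, rows


# First five prestige values shown in the dataset preview
prestige = [3, 3, 1, 4, 4]
levels = [1, 2, 3, 4]

# Full encoding: four columns, exactly one of them is 1 in every row
full_names, full_rows = one_hot(prestige, levels, 'prestige')

# Reference coding: prestige_1 is dropped, so a tier-1 applicant is all zeros
ref_names, ref_rows = one_hot(prestige, levels, 'prestige', drop_first=True)

print(full_names)
print(full_rows)
print(ref_names)
print(ref_rows)
```

Dropping one column loses no information, because the omitted tier is recovered whenever the remaining three indicators are all 0.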

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [45]:
# TODO
#df = df.drop('column_name', 1)
df = df.join([prestige_df])

In [46]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige', u'prestige_1', u'prestige_2',
       u'prestige_3', u'prestige_4'],
      dtype='object')

In [47]:
df.drop('prestige', axis = 1, inplace = True)

In [48]:
df

Unnamed: 0,admit,gre,gpa,prestige_1,prestige_2,prestige_3,prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


In [49]:
#df.drop('prestige_1', axis = 1, inplace = True)

In [50]:
#df

In [51]:
#df.columns

## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [53]:
# TODO
pd.crosstab(df.prestige_1, df.admit,
           rownames = ['Prestige'], 
           colnames = ['Admit'])

Prestige,Admit = 0,Admit = 1
0.0,243,93
1.0,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [54]:
prob_1 = 33 / 61.
odds_1 = prob_1 / (1 - prob_1)

print prob_1
print odds_1

0.540983606557
1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [55]:
# TODO
prob_2 = 93 / 336.
odds_2 = prob_2 / (1 - prob_2)

print prob_2
print odds_2

0.276785714286
0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [56]:
# TODO
odds_1 / odds_2

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

Answer: Of the 61 applicants from the most prestigious (tier-1) undergraduate schools, 33 were admitted and 28 were not, which gives a 54% probability of admission and odds of 1.18. Of the 336 applicants from other schools, 93 were admitted, which gives a 28% probability of admission and odds of 0.38. The odds ratio is about 3.08: the odds of admission for an applicant from a tier-1 school are roughly three times the odds for an applicant from a lower-ranked school.
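
The hand calculation above generalizes to any 2x2 frequency table. The following is a minimal sketch; the function names `odds` and `odds_ratio` are hypothetical, and the counts are taken from the `prestige_1` frequency table.

```python
def odds(successes, failures):
    """Odds of an event: successes per failure, which equals p / (1 - p)."""
    return successes / float(failures)


def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds of the event in group A divided by the odds in group B."""
    return odds(a_yes, a_no) / odds(b_yes, b_no)


# Counts from the prestige_1 frequency table
tier1_admitted, tier1_rejected = 33, 28
other_admitted, other_rejected = 93, 243

odds_tier1 = odds(tier1_admitted, tier1_rejected)
odds_other = odds(other_admitted, other_rejected)
ratio = odds_ratio(tier1_admitted, tier1_rejected,
                   other_admitted, other_rejected)

print(round(odds_tier1, 4))  # 1.1786
print(round(odds_other, 4))  # 0.3827
print(round(ratio, 4))       # 3.0795
```

Working from counts avoids the intermediate probability, since p / (1 - p) reduces to admitted / rejected.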

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants who attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [57]:
# TODO
pd.crosstab(df.prestige_4, df.admit,
           rownames = ['Prestige'], 
           colnames = ['Admit'])

Prestige,Admit = 0,Admit = 1
0.0,216,114
1.0,55,12


In [58]:
prob_3 = 12 / 67.
odds_3 = prob_3 / (1 - prob_3)

print prob_3
print odds_3

0.179104477612
0.218181818182


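In [None]:
# Applicants who did not attend a tier-4 school: 114 admitted out of 216 + 114 = 330
prob_4 = 114 / 330.
odds_4 = prob_4 / (1 - prob_4)

print prob_4
print odds_4
print odds_3 / odds_4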

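Answer: For applicants who attended the least prestigious (tier-4) undergraduate schools, the probability of admission is 18% (12 of 67) and the odds are 0.22. For all other applicants the odds are 114 / 216, or about 0.53, so the odds ratio is about 0.41: the odds of admission for a tier-4 applicant are less than half the odds for applicants from higher-ranked schools.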

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [59]:
# TODO
model = smf.logit(formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 24 Oct 2016",Pseudo R-squ.:,0.08166
Time:,13:18:59,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735


> ### Question 13.  Print the model's summary results.

In [60]:
# TODO
model.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Mon, 24 Oct 2016",Pseudo R-squ.:,0.08166
Time:,13:21:55,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-3.8769,1.142,-3.393,0.001,-6.116 -1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05 0.004
gpa,0.7793,0.333,2.344,0.019,0.128 1.431
prestige_2,-0.6801,0.317,-2.146,0.032,-1.301 -0.059
prestige_3,-1.3387,0.345,-3.882,0.000,-2.015 -0.663
prestige_4,-1.5534,0.417,-3.721,0.000,-2.372 -0.735


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
# TODO
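
One way to approach this, sketched under stated assumptions: a logit coefficient is a log odds ratio, so exponentiating a coefficient and the endpoints of its confidence interval gives the odds ratio and its interval. The numbers below are copied from the summary table and are therefore rounded; with the fitted result in hand, `np.exp(model.params)` and `np.exp(model.conf_int())` would give the unrounded values. The dictionary name `summary` is a hypothetical label for this illustration.

```python
import math

# (coef, ci_low, ci_high) copied from the summary table; values are rounded
summary = {
    'gre': (0.0022, 7.44e-05, 0.004),
    'gpa': (0.7793, 0.128, 1.431),
    'prestige_2': (-0.6801, -1.301, -0.059),
    'prestige_3': (-1.3387, -2.015, -0.663),
    'prestige_4': (-1.5534, -2.372, -0.735),
}

# exp() maps each log odds ratio (and its interval endpoints) to an odds ratio
odds_ratios = {
    name: tuple(math.exp(value) for value in values)
    for name, values in summary.items()
}

for name in sorted(odds_ratios):
    ratio, low, high = odds_ratios[name]
    print('%-10s OR = %.3f  95%% CI = [%.3f, %.3f]' % (name, ratio, low, high))
```

Because `exp` is increasing, the interval endpoints keep their order, and an interval that excludes 1 on the odds-ratio scale corresponds to one that excludes 0 on the coefficient scale.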

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [61]:
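# Use exactly the columns named in the formula
# (prestige_1 is the reference level and is not a model term)
features = ['gre', 'gpa', 'prestige_2', 'prestige_3', 'prestige_4']
# Tier-1 applicant: all prestige dummies are 0
model.predict(pd.DataFrame([[800, 4, 0, 0, 0]], columns = features))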
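
As a cross-check that does not depend on the fitted `model` object, the prediction can be approximated by hand: sum the intercept and the coefficient-weighted features to get the log-odds, then apply the logistic function. This is a sketch that assumes the rounded coefficients printed in the summary table, so the probabilities are approximate (the `gre` coefficient in particular is shown to only two significant digits). `COEF` and `admission_probability` are hypothetical names.

```python
import math

# Coefficients copied from the summary table (rounded)
COEF = {
    'Intercept': -3.8769,
    'gre': 0.0022,
    'gpa': 0.7793,
    'prestige_2': -0.6801,
    'prestige_3': -1.3387,
    'prestige_4': -1.5534,
}


def admission_probability(gre, gpa, tier):
    """Approximate P(admit) for an applicant from the given prestige tier."""
    log_odds = COEF['Intercept'] + COEF['gre'] * gre + COEF['gpa'] * gpa
    if tier != 1:
        # Tier 1 is the reference level, so it contributes no extra term
        log_odds += COEF['prestige_%d' % tier]
    # Logistic function converts log-odds to a probability
    return 1.0 / (1.0 + math.exp(-log_odds))


probabilities = {tier: admission_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier in sorted(probabilities):
    print('tier %d: %.2f' % (tier, probabilities[tier]))
```

Only the prestige term changes between tiers, so the four probabilities differ solely through the tier coefficients.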

Answer:

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO
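
For context on the `C = 10 ** 2` setting: in `sklearn`, `C` is the inverse of the regularization strength. As a sketch, assuming the default L2 penalty and labels recoded as $y_i \in \{-1, +1\}$, the objective described in the scikit-learn user guide can be written as follows, where $w$ holds the feature coefficients and $b$ the intercept (symbols chosen here for illustration):

$$
\min_{w,\, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

A large `C` such as 100 puts most of the weight on the data-fit term, so the fitted coefficients should land close to, though not exactly on, the unregularized maximum-likelihood estimates from `statsmodels`.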

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: