# DS-SF-36 | Unit Project | 3 | Machine Learning Modeling and Executive Summary | Starter Code

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Project 1 and 2.  You will summarize and present your findings and the methods you used.

In [22]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as sm

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression, LinearRegression

In [23]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [24]:
df_frequency = pd.crosstab(index=df["prestige"], columns=df["admit"])
df_frequency

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


Answer: The frequency table above cross-tabulates `prestige` against admission status (0 = not admitted, 1 = admitted).
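Raw counts are easier to compare once each row is turned into proportions. The following is a minimal sketch that rebuilds the table from the counts printed above instead of re-reading the CSV; the names `counts` and `rates` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Counts copied from the frequency table above (rows: prestige, columns: admit).
counts = pd.DataFrame(
    {0: [28, 95, 93, 55], 1: [33, 53, 28, 12]},
    index=pd.Index([1.0, 2.0, 3.0, 4.0], name="prestige"),
)
counts.columns.name = "admit"

# Divide each row by its total to get the share not admitted / admitted.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(3))
```

On the original data, `pd.crosstab(df["prestige"], df["admit"], normalize="index")` should produce the same proportions directly.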

## Part B.  Feature Engineering

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [25]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'Prestige')
prestige_df.rename(columns = {'Prestige_1.0': 'Prestige_1',
    'Prestige_2.0': 'Prestige_2',
    'Prestige_3.0': 'Prestige_3',
    'Prestige_4.0': 'Prestige_4'}, inplace = True)
df = df.join([prestige_df])
df.drop('prestige', axis = 1, inplace = True)
df

Unnamed: 0,admit,gre,gpa,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


In [17]:
sm.ols(formula = 'admit ~ Prestige_2 + Prestige_3 + Prestige_4', data = df).fit().summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,8.899
Date:,"Tue, 15 Aug 2017",Prob (F-statistic):,1.02e-05
Time:,20:05:04,Log-Likelihood:,-246.67
No. Observations:,397,AIC:,501.3
Df Residuals:,393,BIC:,517.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.5410,0.058,9.333,0.000,0.427 0.655
Prestige_2,-0.1829,0.069,-2.655,0.008,-0.318 -0.047
Prestige_3,-0.3096,0.071,-4.355,0.000,-0.449 -0.170
Prestige_4,-0.3619,0.080,-4.517,0.000,-0.519 -0.204

0,1,2,3
Omnibus:,218.144,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.254
Skew:,0.725,Prob(JB):,3.69e-13
Kurtosis:,1.834,Cond. No.,6.1


In [18]:
sm.ols(formula = 'admit ~ Prestige_1 + Prestige_3 + Prestige_4', data = df).fit().summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,8.899
Date:,"Tue, 15 Aug 2017",Prob (F-statistic):,1.02e-05
Time:,20:05:09,Log-Likelihood:,-246.67
No. Observations:,397,AIC:,501.3
Df Residuals:,393,BIC:,517.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.3581,0.037,9.624,0.000,0.285 0.431
Prestige_1,0.1829,0.069,2.655,0.008,0.047 0.318
Prestige_3,-0.1267,0.055,-2.284,0.023,-0.236 -0.018
Prestige_4,-0.1790,0.067,-2.685,0.008,-0.310 -0.048

0,1,2,3
Omnibus:,218.144,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.254
Skew:,0.725,Prob(JB):,3.69e-13
Kurtosis:,1.834,Cond. No.,4.2


In [19]:
sm.ols(formula = 'admit ~ Prestige_1 + Prestige_2 + Prestige_4', data = df).fit().summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,8.899
Date:,"Tue, 15 Aug 2017",Prob (F-statistic):,1.02e-05
Time:,20:05:14,Log-Likelihood:,-246.67
No. Observations:,397,AIC:,501.3
Df Residuals:,393,BIC:,517.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.2314,0.041,5.623,0.000,0.150 0.312
Prestige_1,0.3096,0.071,4.355,0.000,0.170 0.449
Prestige_2,0.1267,0.055,2.284,0.023,0.018 0.236
Prestige_4,-0.0523,0.069,-0.759,0.449,-0.188 0.083

0,1,2,3
Omnibus:,218.144,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.254
Skew:,0.725,Prob(JB):,3.69e-13
Kurtosis:,1.834,Cond. No.,4.59


In [20]:
sm.ols(formula = 'admit ~ Prestige_1 + Prestige_2 + Prestige_3', data = df).fit().summary()

0,1,2,3
Dep. Variable:,admit,R-squared:,0.064
Model:,OLS,Adj. R-squared:,0.056
Method:,Least Squares,F-statistic:,8.899
Date:,"Tue, 15 Aug 2017",Prob (F-statistic):,1.02e-05
Time:,20:05:17,Log-Likelihood:,-246.67
No. Observations:,397,AIC:,501.3
Df Residuals:,393,BIC:,517.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,0.1791,0.055,3.238,0.001,0.070 0.288
Prestige_1,0.3619,0.080,4.517,0.000,0.204 0.519
Prestige_2,0.1790,0.067,2.685,0.008,0.048 0.310
Prestige_3,0.0523,0.069,0.759,0.449,-0.083 0.188

0,1,2,3
Omnibus:,218.144,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,57.254
Skew:,0.725,Prob(JB):,3.69e-13
Kurtosis:,1.834,Cond. No.,5.87


Answer: I ran four regressions, each including three of the four binary variables, so that every prestige level serves as the omitted (reference) category once.
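The four summaries above can be checked by hand: when the only predictors are dummies for one categorical variable, the OLS intercept is the admission rate of the omitted level and each coefficient is the difference from that rate. A small sketch using the counts from the Question 1 frequency table (the dictionary names are illustrative):

```python
# Counts from the Question 1 frequency table: prestige -> (not admitted, admitted).
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

# Admission rate per prestige level.
rate = {k: admitted / (rejected + admitted) for k, (rejected, admitted) in counts.items()}

for reference in counts:
    # With `reference` omitted, the intercept is its admission rate and each
    # remaining coefficient is the difference from that rate.
    coefs = {k: round(rate[k] - rate[reference], 4) for k in counts if k != reference}
    print(reference, round(rate[reference], 4), coefs)
```

For example, with `Prestige_1` omitted the sketch gives an intercept of 33/61 and a `Prestige_2` coefficient of 53/148 - 33/61, which agree with the first regression table after rounding.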

In [26]:
df.corr()

Unnamed: 0,admit,gre,gpa,Prestige_1,Prestige_2,Prestige_3,Prestige_4
admit,1.0,0.181202,0.174116,0.204689,0.067459,-0.122302,-0.133859
gre,0.181202,1.0,0.382408,0.088277,0.058454,-0.07438,-0.069046
gpa,0.174116,0.382408,1.0,0.068304,-0.050507,0.070881,-0.087671
Prestige_1,0.204689,0.088277,0.068304,1.0,-0.328493,-0.28212,-0.191989
Prestige_2,0.067459,0.058454,-0.050507,-0.328493,1.0,-0.510469,-0.347385
Prestige_3,-0.122302,-0.07438,0.070881,-0.28212,-0.510469,1.0,-0.298345
Prestige_4,-0.133859,-0.069046,-0.087671,-0.191989,-0.347385,-0.298345,1.0


> ### Question 3.  How many of these binary variables do we need for modeling?

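Answer: We need 3 of the 4 binary variables. The four dummies always sum to 1, so including all of them alongside an intercept makes the design matrix perfectly collinear. The four regressions above illustrate this: each uses three dummies and produces the same fit (R-squared of 0.064), and only the reference category changes.

> ### Question 4.  Why are we doing this?

Answer: `prestige` is a categorical ranking rather than a continuous measurement, so one-hot encoding lets the model estimate a separate effect for each level instead of assuming equal steps between ranks. Dropping one dummy avoids perfect multicollinearity and makes the omitted level the reference category; each remaining coefficient is then read relative to that level.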
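One way to check how many dummy columns a model with an intercept can support is to look at the rank of the design matrix. The sketch below uses a small made-up `prestige` column (not the admissions data), and `design_rank` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

# Toy prestige column (illustrative values, not the admissions data).
prestige = pd.Series([1, 2, 3, 4, 2, 3, 1, 4], name="prestige")

all_four = pd.get_dummies(prestige, prefix="Prestige").astype(float)
three = pd.get_dummies(prestige, prefix="Prestige", drop_first=True).astype(float)

def design_rank(dummies):
    """Return (rank, number of columns) of the matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(X), X.shape[1]

print(design_rank(all_four))  # rank is lower than the number of columns
print(design_rank(three))     # full column rank
```

With all four dummies plus an intercept, the rank falls one short of the column count, which is the collinearity that dropping one dummy removes.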

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

Answer: binary variables were added to the dataset and removed the redundant prestige feature under question 2. DF readout below confirms.

In [27]:
df

Unnamed: 0,admit,gre,gpa,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [28]:
df_frequency = pd.crosstab(index=df["Prestige_1"], columns=df["admit"])
df_frequency

Prestige_1,admit=0,admit=1
0,243,93
1,28,33


Answer: The row `Prestige_1 = 1` holds the applicants from prestige = 1 alma maters. Of those 61 applicants, 33 were admitted (33/61 = 54%).

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

Answer: The odds of admission to graduate school for applicants who attended the most prestigious undergraduate schools (prestige = 1) are 33/28 = 1.18.
- 33 is the number of prestige = 1 applicants who were admitted
- 28 is the number of prestige = 1 applicants who were not admitted

Another way to calculate odds is probability of acceptance / (1 - probability of acceptance):
    = 0.54/(1 - 0.54) = 1.17 ~ 1.18 (the small difference comes from rounding the probability to 0.54)

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

Answer: 0.38
From the frequency table in Question 6 above, 336 (= 243 + 93) applicants did not attend a #1 ranked college. The odds of admission are the number of admitted applicants (93) divided by the number of non-admitted applicants (243)
    = 93/243 = 0.38.

> ### Question 9.  Finally, what's the odds ratio?

Answer: The odds ratio compares the odds of admission for applicants from a prestige = 1 undergraduate institution with the odds for applicants who are NOT from a prestige = 1 institution:
    = (33/28)/(93/243) = 3.08 (or 1.18/0.38 = 3.10 when the rounded odds are used)

> ### Question 10.  Write this finding in a sentence.

Answer: Based on the UCLA admissions data, the odds of admission to graduate school for applicants who attended a prestige = 1 undergraduate institution are about 3 times the odds for applicants who did NOT attend a prestige = 1 institution.
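The hand calculations in Questions 7 to 9 can be reproduced in a few lines. This is a minimal sketch using the counts from the `Prestige_1` frequency table; the variable names are illustrative.

```python
# Counts from the Prestige_1 frequency table above.
admitted_p1, rejected_p1 = 33, 28        # prestige = 1
admitted_rest, rejected_rest = 93, 243   # prestige = 2, 3 or 4

odds_p1 = admitted_p1 / rejected_p1
odds_rest = admitted_rest / rejected_rest
odds_ratio = odds_p1 / odds_rest

# Odds can also be recovered from the probability p as p / (1 - p).
p_p1 = admitted_p1 / (admitted_p1 + rejected_p1)
assert abs(p_p1 / (1 - p_p1) - odds_p1) < 1e-12

print(round(odds_p1, 2), round(odds_rest, 2), round(odds_ratio, 2))
```

Working from the unrounded counts gives an odds ratio of about 3.08; dividing the rounded odds (1.18/0.38) gives the slightly larger 3.10.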

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [33]:
df_frequency = pd.crosstab(index=df["Prestige_4"], columns=df["admit"])
df_frequency

Prestige_4,admit=0,admit=1
0,216,114
1,55,12


Answer: 
- The frequency table above covers the least prestigious undergraduate schools (prestige = 4); its counts agree with the original table.
- The odds of admission for an applicant from a prestige = 4 undergraduate school are 12/55 = 0.22.
- The odds of admission for an applicant from a more prestigious institution are 114/216 = 0.53.
- The odds ratio for an applicant from a prestige = 4 school versus an applicant from a more prestigious institution (prestige = 1, 2, or 3) is (12/55)/(114/216) = 0.41 (0.22/0.53 = 0.42 with the rounded odds). Their odds of admission are roughly 40% of the odds for everyone else.
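The same "one level versus everyone else" comparison can be repeated for each prestige level in a loop. The sketch below works from the counts in the Question 1 frequency table; `odds_ratios` and the other names are illustrative.

```python
# Counts from the Question 1 frequency table: prestige -> (not admitted, admitted).
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

total_rejected = sum(r for r, _ in counts.values())
total_admitted = sum(a for _, a in counts.values())

odds_ratios = {}
for level, (rejected, admitted) in counts.items():
    odds_level = admitted / rejected
    # Everyone not at this prestige level forms the comparison group.
    odds_rest = (total_admitted - admitted) / (total_rejected - rejected)
    odds_ratios[level] = odds_level / odds_rest

for level, value in odds_ratios.items():
    print(level, round(value, 2))
```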

## Part D. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

Dropping `Prestige_1` makes it the reference level. The new DataFrame is called `df_onehot` (for one-hot encoding).

In [34]:
df_onehot = df.drop('Prestige_1', axis=1)
df_onehot

Unnamed: 0,admit,gre,gpa,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0,1,0
1,1,660.0,3.67,0,1,0
2,1,800.0,4.00,0,0,0
3,1,640.0,3.19,0,0,1
4,0,520.0,2.93,0,0,1
...,...,...,...,...,...,...
395,0,620.0,4.00,1,0,0
396,0,560.0,3.04,0,1,0
397,0,460.0,2.63,1,0,0
398,0,700.0,3.65,1,0,0


In [44]:
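# Every column except `admit` (the first column) is a predictor.
train_cols = df_onehot.columns[1:]
# Index([gre, gpa, Prestige_2, Prestige_3, Prestige_4], dtype=object)

# Note: no constant column is added, so this model is fit without an intercept.
logit = sm.Logit(df_onehot['admit'], df_onehot[train_cols])

# fit the model
result = logit.fit()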

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [45]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Tue, 15 Aug 2017   Pseudo R-squ.:                 0.05722
Time:                        21:22:16   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.308      0.191        -0.001     0.003
gpa           -0.1323      0.195     -0.680      0.497        -0.514     0.249
Prestige_2    -0.9562      0.302     -3.171      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [41]:
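# Put the coefficients next to their 95% confidence bounds,
# then exponentiate to move from log odds to odds ratios.
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']

print(np.exp(conf))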

                2.5%     97.5%        OR
gre         0.999320  1.003420  1.001368
gpa         0.598303  1.282800  0.876073
Prestige_2  0.212826  0.694082  0.384342
Prestige_3  0.112055  0.412207  0.214918
Prestige_4  0.070176  0.338540  0.154135


See above for the odds ratios and 95% confidence intervals. Each odds ratio is e^(coef), which is what exponentiating the coefficients and their confidence bounds computes.

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer: Holding GRE and GPA constant, the odds of admission for an applicant from a prestige = 2 school are 0.38 times (38% of) the odds for an applicant from a prestige = 1 school.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: The coefficient for gpa is -0.1323, which is the expected change in the log odds of admission for a one-unit increase in GPA. The odds ratio is 0.876, so a one-unit increase in GPA multiplies the odds of admission by 0.876, a decrease of about 12%. The 95% confidence interval (0.60 to 1.28) includes 1, so this effect is not statistically significant in this model.
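An odds ratio converts to a percent change in odds as (OR - 1) * 100, and an interval that contains 1 is consistent with no effect. A short sketch using three rows copied from the Question 14 output (the `summary` dictionary is an illustrative name):

```python
import math

# (2.5%, OR, 97.5%) copied from the Question 14 output.
odds_ratios = {
    "gre": (0.999320, 1.001368, 1.003420),
    "gpa": (0.598303, 0.876073, 1.282800),
    "Prestige_2": (0.212826, 0.384342, 0.694082),
}

summary = {}
for name, (low, point, high) in odds_ratios.items():
    summary[name] = {
        "pct_change": (point - 1) * 100,     # percent change in odds per unit
        "log_odds": math.log(point),         # back on the coefficient scale
        "ci_includes_one": low <= 1 <= high, # True: no clear effect
    }
    print(name, summary[name])
```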

> ### Question 17.  Assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

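Odds ratios cannot simply be multiplied together to get a probability. Instead, compute the log odds from the coefficients (the natural log of each odds ratio in Question 14) and convert them with the logistic function, p = 1/(1 + e^(-log odds)). This model was fit without an intercept, so for GRE = 800 and GPA = 4:

log odds = ln(1.001368)*800 + ln(0.876073)*4 + ln(prestige odds ratio)

where the prestige term is 0 for tier 1 (the reference level).

Tier 1: p = 0.64
Tier 2: p = 0.40
Tier 3: p = 0.27
Tier 4: p = 0.21

Answer: Approximately 0.64, 0.40, 0.27, and 0.21 for tier 1 through tier 4, computed from the rounded odds ratios printed above.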
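One way to turn the fitted coefficients into probabilities is to build the linear predictor and pass it through the logistic function. This is a minimal sketch, assuming the coefficients are recovered as the log of the odds ratios printed in Question 14 and that the model has no intercept (none was added when it was fit); `admission_probability` is a hypothetical helper.

```python
import math

# Coefficients recovered as log(odds ratio) from the Question 14 output.
# Prestige_1 is the reference level, so it contributes 0.
coef = {
    "gre": math.log(1.001368),
    "gpa": math.log(0.876073),
    "Prestige_1": 0.0,
    "Prestige_2": math.log(0.384342),
    "Prestige_3": math.log(0.214918),
    "Prestige_4": math.log(0.154135),
}

def admission_probability(gre, gpa, tier):
    """Logistic transform of the linear predictor (this model has no intercept)."""
    log_odds = coef["gre"] * gre + coef["gpa"] * gpa + coef[f"Prestige_{tier}"]
    return 1 / (1 + math.exp(-log_odds))

probabilities = {tier: admission_probability(800, 4, tier) for tier in (1, 2, 3, 4)}
for tier, p in probabilities.items():
    print(tier, round(p, 2))
```

With the fitted model still in memory, `result.predict(...)` on a four-row DataFrame with the same columns as `train_cols` should give matching values up to rounding.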

## Part E. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [55]:
X=df.drop("admit", axis=1)
y=df.admit

lr = LogisticRegression()
lr.fit(X,y)
score = lr.score(X,y)

In [56]:
# Pair each feature name with its fitted coefficient (list() keeps this working on Python 3).
coef = pd.DataFrame(list(zip(X.columns, lr.coef_[0])), columns=["coef", "value"])
coef

Unnamed: 0,coef,value
0,gre,0.001727
1,gpa,0.195964
2,Prestige_1,0.359435
3,Prestige_2,-0.319295
4,Prestige_3,-0.888215
5,Prestige_4,-1.104591


> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

Odds ratios using sklearn are as follows:

In [50]:
# Exponentiate the coefficients to get odds ratios.
coef["odds_ratio"] = np.exp(coef["value"])
coef

Unnamed: 0,coef,value,odds_ratio
0,gre,0.001727,1.001728
1,gpa,0.195964,1.216484
2,Prestige_1,0.359435,1.432519
3,Prestige_2,-0.319295,0.726661
4,Prestige_3,-0.888215,0.411389
5,Prestige_4,-1.104591,0.331346


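Answer: 
- These odds ratios differ from the `statsmodels` results because the two models are not the same. The `sklearn` model above keeps all four prestige dummies, fits an intercept, and applies L2 regularization with the default `C = 1.0` (not the `C = 10 ** 2` the question asks for), whereas the `statsmodels` model drops `Prestige_1` as the reference level and is fit without an intercept or regularization.
- As a result, `Prestige_1` has its own odds ratio (1.43) instead of being fixed at 1, and the `gpa` odds ratio changes direction (1.22 versus 0.88).
- Even relative to prestige 1, the odds ratios differ: for prestige 2 the `sklearn` ratio is 0.73/1.43 = 0.51, compared with 0.38 from `statsmodels`.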
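To compare the two sets of results on an equal footing, the `sklearn` coefficients can be re-expressed relative to `Prestige_1` by subtracting its coefficient before exponentiating. A small sketch using the values printed above (the dictionary names are illustrative):

```python
import math

# Coefficients copied from the sklearn output above.
sklearn_coef = {
    "Prestige_1": 0.359435,
    "Prestige_2": -0.319295,
    "Prestige_3": -0.888215,
    "Prestige_4": -1.104591,
}

# Odds ratios copied from the statsmodels output (Prestige_1 is the reference).
statsmodels_or = {"Prestige_2": 0.384342, "Prestige_3": 0.214918, "Prestige_4": 0.154135}

# Re-express the sklearn coefficients relative to Prestige_1 so both sets
# of odds ratios use the same reference level.
relative_or = {
    name: math.exp(sklearn_coef[name] - sklearn_coef["Prestige_1"])
    for name in statsmodels_or
}

for name in statsmodels_or:
    print(name, round(relative_or[name], 3), statsmodels_or[name])
```

The two columns still disagree after this adjustment, which points to the differences in how the models were specified rather than to the choice of reference level alone.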

> ### Question 20.  Again, assuming a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission  if he/she come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer: TODO
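The fitted `sklearn` intercept is not shown above, so the probabilities cannot be worked out by hand here. As a rough sketch of the mechanics only, the example below fits `LogisticRegression(C = 10 ** 2)` on synthetic stand-in data (random values, NOT the admissions dataset) and calls `predict_proba` for four hypothetical students; the `toy` and `students` frames and the `max_iter` setting are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "gre": rng.integers(220, 801, size=n).astype(float),
    "gpa": rng.uniform(2.0, 4.0, size=n).round(2),
    "prestige": rng.integers(1, 5, size=n),
})
toy["admit"] = rng.integers(0, 2, size=n)

X = pd.concat(
    [toy[["gre", "gpa"]], pd.get_dummies(toy["prestige"], prefix="Prestige").astype(float)],
    axis=1,
)
model = LogisticRegression(C=10 ** 2, max_iter=5000).fit(X, toy["admit"])

# One row per tier for a student with GRE 800 and GPA 4.
students = pd.DataFrame(0.0, index=range(4), columns=X.columns)
students["gre"], students["gpa"] = 800.0, 4.0
for i, tier in enumerate((1, 2, 3, 4)):
    students.loc[i, f"Prestige_{tier}"] = 1.0

probabilities = model.predict_proba(students)[:, 1]  # column 1 = P(admit = 1)
print(dict(zip((1, 2, 3, 4), probabilities.round(3))))
```

Applied to the notebook's own objects, the same pattern would be `lr.predict_proba(students)[:, 1]` with `students` built from `X.columns`; the numbers printed by this sketch say nothing about the real admissions data.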

## Part F.  Executive Summary

> ## Question 21.  Introduction
>
> Write a problem statement for this project.

Answer: TODO

> ## Question 22.  Dataset
>
> Write up a description of your data and any cleaning that was completed.

Answer: TODO

> ## Question 23.  Demo
>
> Provide a table that explains the data by admission status.

Answer: TODO

> ## Question 24.  Methods
>
> Write up the methods used in your analysis.

Answer: TODO

> ## Question 25.  Results
>
> Write up your results.

Answer: TODO

> ## Question 26.  Visuals
>
> Provide a table or visualization of these results.

Answer: TODO

> ## Question 27.  Discussion
>
> Write up your discussion and future steps.

Answer: TODO