# DS-SF-27 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [30]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [31]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [32]:
# TODO
df[['prestige', 'admit']]


Unnamed: 0,prestige,admit
0,3.0,0
1,3.0,1
2,1.0,1
3,4.0,1
4,4.0,0
...,...,...
395,2.0,0
396,3.0,0
397,2.0,0
398,2.0,0


In [19]:
df[(df.admit == 0)].prestige.count()

271

In [20]:
df[(df.admit == 1)].prestige.count()

126

In [21]:
my_tab = pd.crosstab(index=df["admit"],  # Make a crosstab
                              columns="count")      # Name the count column

my_tab

col_0,count
admit,Unnamed: 1_level_1
0,271
1,126


In [22]:
admit_tab = pd.crosstab(index=df["admit"],
                        columns=df["prestige"])

# The rows are the two values of `admit` (0 and 1), so label them accordingly
admit_tab.index = ["not admitted", "admitted"]

admit_tab

prestige,1.0,2.0,3.0,4.0
not admitted,28,95,93,55
admitted,33,53,28,12
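
The frequency table can also carry totals and within-tier admission rates. The sketch below is a minimal illustration that rebuilds one row per applicant from the counts shown above, so that it runs without the CSV file; the names `counts`, `applicants`, `freq`, and `rates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the admit-by-prestige crosstab above:
# {prestige: (not admitted, admitted)}
counts = {1: (28, 33), 2: (95, 53), 3: (93, 28), 4: (55, 12)}

# Rebuild one row per applicant so the example runs without the CSV file
rows = []
for prestige, (n_rejected, n_admitted) in counts.items():
    rows += [{"prestige": prestige, "admit": 0}] * n_rejected
    rows += [{"prestige": prestige, "admit": 1}] * n_admitted
applicants = pd.DataFrame(rows)

# Raw counts with row and column totals (the "All" margins)
freq = pd.crosstab(applicants["admit"], applicants["prestige"], margins=True)
print(freq)

# Share admitted / not admitted within each prestige tier (each column sums to 1)
rates = pd.crosstab(applicants["admit"], applicants["prestige"], normalize="columns")
print(rates.round(3))
```

Normalizing by column makes the tiers comparable even though they contain different numbers of applicants.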


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [23]:
# TODO

c = df.prestige
cs = pd.get_dummies(c, prefix = None)

cs

Unnamed: 0,1.0,2.0,3.0,4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


> ### Question 3.  How many of these binary variables do we need for modeling?

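Answer: Three of the four. With k = 4 prestige levels we only need k - 1 = 3 binary variables, because the fourth is implied by the other three (it equals 1 exactly when the others are all 0). Including all four alongside an intercept would make the features perfectly collinear. The omitted level becomes the reference category.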
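
The redundancy among the dummy variables can be checked numerically. The following is a minimal sketch on a small made-up sample of prestige values (not the admissions data); `with_intercept` and `reduced` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small illustrative sample of prestige values (not the real dataset)
prestige = pd.Series([3, 3, 1, 4, 4, 2, 3, 2, 2, 1], name="prestige")

all_dummies = pd.get_dummies(prestige, prefix="prestige").astype(int)

# Every row has exactly one 1, so the four columns always sum to 1 ...
print(all_dummies.sum(axis=1).unique())

# ... which duplicates the intercept column: 5 columns but only rank 4
with_intercept = np.column_stack([np.ones(len(all_dummies)), all_dummies.values])
print(with_intercept.shape[1], np.linalg.matrix_rank(with_intercept))

# Dropping one level (here the first, prestige_1) removes the redundancy;
# the dropped level becomes the reference category
reduced = pd.get_dummies(prestige, prefix="prestige", drop_first=True).astype(int)
print(list(reduced.columns))
```

Which level to drop is a modeling choice; it only changes what the remaining coefficients are compared against.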

> ### Question 4.  Why are we doing this?

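Answer: `prestige` is a categorical ranking, not a true numeric quantity. Entering it as a single number would force every step in rank (1 to 2, 2 to 3, 3 to 4) to have the same effect on the log-odds of admission. One-hot encoding lets the model estimate a separate effect for each tier relative to the reference tier.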

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [35]:
# TODO
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                             'prestige_2.0': 'prestige_2',
                             'prestige_3.0': 'prestige_3',
                             'prestige_4.0': 'prestige_4'}, inplace = True)
 
df = df[ ['admit', 'gre', 'gpa'] ].join(prestige_df)

In [36]:
prestige_df

Unnamed: 0,prestige_1,prestige_2,prestige_3,prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [48]:
# TODO
admit_tab = pd.crosstab(index=df["admit"], 
                           columns=df["prestige_1"])

#admit_tab.index= ["admit", 'prestige_1']

admit_tab




prestige_1,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,243,28
1,93,33


In [50]:
admit2_tab = pd.crosstab(index=df["admit"], 
                           columns=df["prestige_2"])

#admit_tab.index= ["admit", 'prestige_1']

admit2_tab

prestige_2,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,176,95
1,73,53


In [51]:
admit3_tab = pd.crosstab(index=df["admit"], 
                           columns=df["prestige_3"])

#admit_tab.index= ["admit", 'prestige_1']

admit3_tab

prestige_3,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,178,93
1,98,28


In [52]:
admit4_tab = pd.crosstab(index=df["admit"], 
                           columns=df["prestige_4"])

#admit_tab.index= ["admit", 'prestige_1']

admit4_tab

prestige_4,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,216,55
1,114,12



> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [None]:
# TODO
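# Odds = p / (1 - p), where p is the admission rate among prestige = 1 applicants.
# From the prestige_1 table: 33 admitted, 28 not admitted, 61 in total.
p_tier1 = 33.0 / 61
odds_tier1 = p_tier1 / (1 - p_tier1)   # = 33 / 28, about 1.18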

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [None]:
# TODO

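# Applicants from all other schools: 93 admitted, 243 not admitted, 336 in total.
p_other = 93.0 / 336
odds_other = p_other / (1 - p_other)   # = 93 / 243, about 0.38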

> ### Question 9.  Finally, what's the odds ratio?

In [None]:
# TODO

# Odds ratio = odds(prestige = 1) / odds(prestige != 1)
odds_ratio = (33.0 / 28) / (93.0 / 243)   # about 3.08
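
The three hand calculations above follow one pattern, which can be written as a small helper. This is a minimal sketch; the function name `odds` and its parameter names are illustrative, and the counts are the ones in the `prestige_1` frequency table.

```python
def odds(successes, failures):
    """Odds of success: p / (1 - p), which simplifies to successes / failures."""
    total = successes + failures
    p = successes / total
    return p / (1 - p)

# Counts from the prestige_1 frequency table above
odds_tier1 = odds(successes=33, failures=28)    # attended a prestige = 1 school
odds_other = odds(successes=93, failures=243)   # attended any other school

# The odds ratio compares the two groups
odds_ratio = odds_tier1 / odds_other

print(round(odds_tier1, 2), round(odds_other, 2), round(odds_ratio, 2))
# prints: 1.18 0.38 3.08
```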

> ### Question 10.  Write this finding in a sentence.

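Answer: The odds of admission for applicants who attended the most prestigious (tier-1) undergraduate schools are about 3.08 times the odds for applicants from all other schools.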

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [53]:
# TODO

admit4_tab

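# prestige = 4: 12 admitted, 55 not admitted, 67 in total.
# Odds = (12 / 67) / (1 - 12 / 67) = 12 / 55, about 0.22
# Other tiers: 114 admitted, 216 not admitted, so odds = 114 / 216, about 0.53
# Odds ratio = (12 / 55) / (114 / 216), about 0.41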



prestige_4,0.0,1.0
admit,Unnamed: 1_level_1,Unnamed: 2_level_1
0,216,55
1,114,12


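Answer: The odds of admission for applicants who attended the least prestigious (tier-4) undergraduate schools are 12/55, or about 0.22. Their odds ratio relative to applicants from all other schools is about 0.41, so their odds of being admitted to UCLA are roughly 59% lower.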
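
The same comparison can be repeated for every tier against all remaining tiers. The sketch below is a minimal illustration that uses only the admitted and not-admitted counts from the frequency tables above; `counts` and `odds_ratios` are illustrative names.

```python
# {prestige tier: (admitted, not admitted)}, copied from the tables above
counts = {1: (33, 28), 2: (53, 95), 3: (28, 93), 4: (12, 55)}

total_admitted = sum(a for a, _ in counts.values())
total_rejected = sum(r for _, r in counts.values())

odds_ratios = {}
for tier, (admitted, rejected) in counts.items():
    odds_in = admitted / rejected
    # Everyone outside this tier forms the comparison group
    odds_out = (total_admitted - admitted) / (total_rejected - rejected)
    odds_ratios[tier] = odds_in / odds_out

for tier, ratio in sorted(odds_ratios.items()):
    print("prestige {}: odds ratio vs. all other tiers = {:.2f}".format(tier, ratio))
```

An odds ratio above 1 means that tier has higher odds of admission than the rest of the applicants combined, and a value below 1 means lower odds.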

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [39]:
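# TODO

# NOTE: this cell fits an ordinary least squares model (a linear probability model)
# and includes only `prestige_4`. The question asks for a logistic regression with
# `prestige_1` as the reference, i.e. smf.logit with prestige_2, prestige_3, prestige_4.
smf.ols(formula = 'admit ~ gre + gpa + prestige_4', data = df).fit().summary()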

0,1,2,3
Dep. Variable:,admit,R-squared:,0.059
Model:,OLS,Adj. R-squared:,0.052
Method:,Least Squares,F-statistic:,8.181
Date:,"Tue, 25 Oct 2016",Prob (F-statistic):,2.7e-05
Time:,17:56:37,Log-Likelihood:,-247.69
No. Observations:,397,AIC:,503.4
Df Residuals:,393,BIC:,519.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,-0.4413,0.211,-2.092,0.037,-0.856 -0.027
gre,0.0005,0.000,2.443,0.015,0.000 0.001
gpa,0.1404,0.065,2.158,0.032,0.012 0.268
prestige_4,-0.1428,0.061,-2.337,0.020,-0.263 -0.023

0,1,2,3
Omnibus:,410.966,Durbin-Watson:,1.938
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.503
Skew:,0.696,Prob(JB):,1.98e-13
Kurtosis:,1.735,Cond. No.,5740.0
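
Question 12 asks for a logistic regression with the highest-prestige schools as the reference, which in `statsmodels` corresponds to `smf.logit` with `prestige_1` left out of the formula. The following is a minimal sketch under a loud assumption: the data frame `demo` is SYNTHETIC, generated only so the example runs without `ucla-admissions.csv`, and its fitted numbers say nothing about the real admissions data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# SYNTHETIC stand-in data (assumption): generated only so the example runs
# without ucla-admissions.csv. The coefficients it produces are meaningless.
rng = np.random.RandomState(0)
n = 400
demo = pd.DataFrame({
    "gre": rng.randint(300, 801, size=n).astype(float),
    "gpa": rng.uniform(2.0, 4.0, size=n).round(2),
    "prestige": rng.randint(1, 5, size=n),
})
log_odds = -5.0 + 0.003 * demo["gre"] + 1.0 * demo["gpa"] - 0.5 * (demo["prestige"] - 1)
demo["admit"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# One-hot encode prestige as integer 0/1 columns named prestige_1 ... prestige_4
demo = demo.join(pd.get_dummies(demo["prestige"], prefix="prestige").astype(int))

# Leaving prestige_1 out of the formula makes it the reference category
model = smf.logit("admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4", data=demo)
result = model.fit(disp=0)   # disp=0 silences the optimizer's progress output

print(result.summary())
```

With the notebook's own `df`, the same formula string should apply as long as the `prestige_2`, `prestige_3`, and `prestige_4` columns are numeric.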


> ### Question 13.  Print the model's summary results.

In [None]:
# TODO

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [None]:
# TODO
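
Logistic regression coefficients are on the log-odds scale, so odds ratios and their confidence intervals come from exponentiating the coefficients and the interval bounds. This is a minimal sketch, again fitted on SYNTHETIC stand-in data (an assumption made so the block is runnable on its own); only the last step is the point of the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# SYNTHETIC stand-in data (assumption), used only to obtain a fitted model
rng = np.random.RandomState(1)
n = 400
demo = pd.DataFrame({
    "gre": rng.randint(300, 801, size=n).astype(float),
    "gpa": rng.uniform(2.0, 4.0, size=n),
    "prestige": rng.randint(1, 5, size=n),
})
log_odds = -5.0 + 0.003 * demo["gre"] + 1.0 * demo["gpa"] - 0.5 * (demo["prestige"] - 1)
demo["admit"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)
demo = demo.join(pd.get_dummies(demo["prestige"], prefix="prestige").astype(int))

result = smf.logit(
    "admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4", data=demo
).fit(disp=0)

# Coefficients are log-odds; exponentiating gives odds ratios.
# conf_int() returns the 95% interval by default, in columns 0 and 1.
conf = result.conf_int()
odds_ratios = pd.DataFrame({
    "odds_ratio": np.exp(result.params),
    "ci_lower": np.exp(conf[0]),
    "ci_upper": np.exp(conf[1]),
})
print(odds_ratios)
```

An interval that excludes 1 indicates that the feature's association with admission is statistically distinguishable from no effect at the 5% level.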

> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

Answer:

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer:

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
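
A predicted probability is the inverse logit of the linear predictor, p = 1 / (1 + exp(-(b0 + b1 * gre + b2 * gpa + b_tier))). The sketch below shows that arithmetic for the four tiers. The coefficient values are PLACEHOLDERS chosen for illustration (an assumption), not the fitted model's values, and `admission_probability` is a hypothetical helper name.

```python
import math

# PLACEHOLDER coefficients (assumption), on the log-odds scale, with
# prestige_1 as the reference tier. Substitute the fitted model's values.
coef = {
    "Intercept": -4.0,
    "gre": 0.002,
    "gpa": 0.8,
    "prestige_2": -0.7,
    "prestige_3": -1.3,
    "prestige_4": -1.6,
}

def admission_probability(gre, gpa, tier):
    """Inverse logit of the linear predictor for one applicant."""
    log_odds = coef["Intercept"] + coef["gre"] * gre + coef["gpa"] * gpa
    if tier != 1:                       # tier 1 is the reference: no extra term
        log_odds += coef["prestige_%d" % tier]
    return 1.0 / (1.0 + math.exp(-log_odds))

probabilities = {tier: admission_probability(800, 4.0, tier) for tier in (1, 2, 3, 4)}
for tier, p in sorted(probabilities.items()):
    print("tier %d: %.3f" % (tier, p))
```

Because tier 1 is the reference category, its probability uses only the intercept, `gre`, and `gpa` terms.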

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [None]:
# TODO
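
In `sklearn`, the same features go into a feature matrix and `C` sets the inverse regularization strength, so `C = 10 ** 2` applies only weak regularization. The following is a minimal sketch on SYNTHETIC stand-in data (an assumption, so the block runs on its own); `max_iter=1000` is also an addition made here to give the solver room on unscaled features, not something the question specifies.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# SYNTHETIC stand-in data (assumption); results say nothing about real admissions
rng = np.random.RandomState(2)
n = 400
demo = pd.DataFrame({
    "gre": rng.randint(300, 801, size=n).astype(float),
    "gpa": rng.uniform(2.0, 4.0, size=n),
    "prestige": rng.randint(1, 5, size=n),
})
log_odds = -5.0 + 0.003 * demo["gre"] + 1.0 * demo["gpa"] - 0.5 * (demo["prestige"] - 1)
demo["admit"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)
demo = demo.join(pd.get_dummies(demo["prestige"], prefix="prestige").astype(int))

# prestige_1 is omitted, so it is the reference tier
features = ["gre", "gpa", "prestige_2", "prestige_3", "prestige_4"]
X = demo[features]
y = demo["admit"]

# C is the inverse regularization strength; larger C means weaker regularization
model = LogisticRegression(C=10 ** 2, max_iter=1000)
model.fit(X, y)

# coef_ has shape (1, n_features) for a binary target; exponentiate for odds ratios
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=features)
print(odds_ratios)
```

Because `sklearn` regularizes by default, its odds ratios are expected to be close to, but not identical to, those from an unregularized `statsmodels` fit.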

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO
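
With a fitted `sklearn` model, `predict_proba` returns the class probabilities directly, so the four hypothetical applicants can be scored in one call. This is a minimal sketch on SYNTHETIC stand-in data (an assumption); `max_iter=1000` and the name `applicants` are additions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# SYNTHETIC stand-in data (assumption), used only to obtain a fitted model
rng = np.random.RandomState(3)
n = 400
demo = pd.DataFrame({
    "gre": rng.randint(300, 801, size=n).astype(float),
    "gpa": rng.uniform(2.0, 4.0, size=n),
    "prestige": rng.randint(1, 5, size=n),
})
log_odds = -5.0 + 0.003 * demo["gre"] + 1.0 * demo["gpa"] - 0.5 * (demo["prestige"] - 1)
demo["admit"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)
demo = demo.join(pd.get_dummies(demo["prestige"], prefix="prestige").astype(int))

features = ["gre", "gpa", "prestige_2", "prestige_3", "prestige_4"]
model = LogisticRegression(C=10 ** 2, max_iter=1000).fit(demo[features], demo["admit"])

# One row per tier for a GRE of 800 and a GPA of 4; tier 1 has all dummies at 0
applicants = pd.DataFrame(
    [[800, 4.0, 0, 0, 0],
     [800, 4.0, 1, 0, 0],
     [800, 4.0, 0, 1, 0],
     [800, 4.0, 0, 0, 1]],
    columns=features,
    index=["tier 1", "tier 2", "tier 3", "tier 4"],
)

# predict_proba returns one column per class; column 1 is P(admit = 1)
p_admit = pd.Series(model.predict_proba(applicants)[:, 1], index=applicants.index)
print(p_admit.round(3))
```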

Answer: