# DS-SF-25 | Unit Project 3: Basic Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn import linear_model
import pylab as pl

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df.head()

,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether or not an applicant was admitted.

In [3]:
pd.crosstab(df.admit, df.prestige)

admit \ prestige,1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
df_prestige_dummy = pd.get_dummies(df.prestige, prefix = 'prestige')

> ### Question 3.  How many of these binary variables do we need for modeling?

In [5]:
df_prestige_dummy.sum()

prestige_1.0     61.0
prestige_2.0    148.0
prestige_3.0    121.0
prestige_4.0     67.0
dtype: float64

Answer: We only need three of them.

> ### Question 4.  Why are we doing this?

Answer: 1) Including all four would introduce perfect multicollinearity: the four dummies always sum to 1, which duplicates the constant term. 2) The omitted category serves as the baseline against which the coefficients of the remaining binary variables are interpreted.
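The collinearity argument can be checked numerically. The following is a minimal sketch on a handful of toy `prestige` values (made up for illustration, not the admissions data): with a constant plus all four dummies the design matrix loses a rank, while dropping the first level restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy prestige values, invented for illustration only
prestige = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0], name='prestige')

# All four dummies: each row has exactly one 1, so the columns sum to the constant
all_dummies = pd.get_dummies(prestige, prefix='prestige').astype(float)
row_sums = all_dummies.sum(axis=1)

# Constant + four dummies: 5 columns, but they are linearly dependent
design_all = np.column_stack([np.ones(len(prestige)), all_dummies.values])
rank_all = np.linalg.matrix_rank(design_all)

# Dropping the first level (prestige 1) leaves it as the baseline category
reduced = pd.get_dummies(prestige, prefix='prestige', drop_first=True).astype(float)
design_reduced = np.column_stack([np.ones(len(prestige)), reduced.values])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design_all.shape[1], rank_all)
print(design_reduced.shape[1], rank_reduced)
```

Here `drop_first=True` is one convenient way to pick the baseline; selecting the three columns by hand, as done later in this notebook, is equivalent.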

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [6]:
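# Note: `prestige` itself is kept because a later cell (In [41]) still uses it; drop it with df.drop('prestige', axis = 1) when no longer needed.
df = df.join(df_prestige_dummy)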

In [7]:
df

,admit,gre,gpa,prestige,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,380.0,3.61,3.0,0.0,0.0,1.0,0.0
1,1,660.0,3.67,3.0,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,4.0,0.0,0.0,0.0,1.0
4,0,520.0,2.93,4.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0.0,1.0,0.0,0.0
396,0,560.0,3.04,3.0,0.0,0.0,1.0,0.0
397,0,460.0,2.63,2.0,0.0,1.0,0.0,0.0
398,0,700.0,3.65,2.0,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether or not an applicant was admitted.

In [8]:
pd.crosstab(df.admit, df['prestige_1.0'][(df['prestige_1.0']==1)])

admit \ prestige_1.0,1.0
0,28
1,33


In [9]:
pd.crosstab(df.admit, df['prestige_1.0'][(df['prestige_1.0']==0)])

admit \ prestige_1.0,0.0
0,243
1,93


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [10]:
# Odds
33/28.

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [11]:
93/243.

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [12]:
# Float division throughout: in Python 2, 33/28 is integer division and evaluates to 1
round(33/28.*(243/93.), 4)

3.0795

> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission are roughly 3 times as high for applicants from a #1-ranked college as for all other applicants.
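The hand calculation can also be wrapped in a small helper. This is only a sketch: the function names `odds` and `odds_ratio_from_counts` are hypothetical, and the counts are the ones from the two frequency tables above. Converting to `float` avoids truncation when both counts are integers under Python 2.

```python
def odds(admitted, rejected):
    # Odds = admitted / rejected; float() guards against integer division
    return admitted / float(rejected)

def odds_ratio_from_counts(admitted_a, rejected_a, admitted_b, rejected_b):
    # Odds of group A divided by odds of group B
    return odds(admitted_a, rejected_a) / odds(admitted_b, rejected_b)

# Counts from the frequency tables: prestige = 1 versus everyone else
odds_tier1 = odds(33, 28)
odds_others = odds(93, 243)
ratio_tier1 = odds_ratio_from_counts(33, 28, 93, 243)

print(odds_tier1, odds_others, ratio_tier1)
```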

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [13]:
pd.crosstab(df.admit, df['prestige_4.0'])

admit \ prestige_4.0,0.0,1.0
0,216,55
1,114,12


In [14]:
# Odds of admission for applicants from the least prestigious schools
12/55.

0.21818181818181817

In [15]:
# Odds of admission for applicants from all other schools
114/216.

0.5277777777777778

In [16]:
# Odds ratio minus 1, i.e. the relative change in odds
12/55.*(216/114.) - 1

-0.5866028708133972

Answer: The odds of admission are almost 60% lower for applicants from a #4-ranked college than for all other applicants.

## Part C.  Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the prestige of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [17]:
df.columns

Index([u'admit', u'gre', u'gpa', u'prestige', u'prestige_1.0', u'prestige_2.0',
       u'prestige_3.0', u'prestige_4.0'],
      dtype='object')

In [18]:
X = df[['gre', 'gpa', 'prestige_2.0','prestige_3.0','prestige_4.0']]
X=sm.add_constant(X)
y = df.admit

In [19]:
X

,const,gre,gpa,prestige_2.0,prestige_3.0,prestige_4.0
0,1,380.0,3.61,0.0,1.0,0.0
1,1,660.0,3.67,0.0,1.0,0.0
2,1,800.0,4.00,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,1.0
4,1,520.0,2.93,0.0,0.0,1.0
...,...,...,...,...,...,...
395,1,620.0,4.00,1.0,0.0,0.0
396,1,560.0,3.04,0.0,1.0,0.0
397,1,460.0,2.63,1.0,0.0,0.0
398,1,700.0,3.65,1.0,0.0,0.0


In [20]:
model = sm.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


> ### Question 13.  Print the model's summary results.

In [21]:
model.summary()

Dep. Variable:,admit,No. Observations:,397.0
Model:,Logit,Df Residuals:,391.0
Method:,MLE,Df Model:,5.0
Date:,"Wed, 07 Sep 2016",Pseudo R-squ.:,0.08166
Time:,18:25:08,Log-Likelihood:,-227.82
converged:,True,LL-Null:,-248.08
,,LLR p-value:,1.176e-07

,coef,std err,z,P>|z|,95% CI lower,95% CI upper
const,-3.8769,1.142,-3.393,0.001,-6.116,-1.638
gre,0.0022,0.001,2.028,0.043,7.44e-05,0.004
gpa,0.7793,0.333,2.344,0.019,0.128,1.431
prestige_2.0,-0.6801,0.317,-2.146,0.032,-1.301,-0.059
prestige_3.0,-1.3387,0.345,-3.882,0.000,-2.015,-0.663
prestige_4.0,-1.5534,0.417,-3.721,0.000,-2.372,-0.735


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [22]:
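# conf_int() defaults to alpha = 0.05, a 95% interval (conf_int(.5) would give a 50% interval).
# The bounds below are on the log-odds scale and match the summary table above.
model.conf_int().\
    rename(columns = {0: '2.5%', 1: '97.5%'})

,2.5%,97.5%
const,-6.116,-1.638
gre,7.44e-05,0.004
gpa,0.128,1.431
prestige_2.0,-1.301,-0.059
prestige_3.0,-2.015,-0.663
prestige_4.0,-2.372,-0.735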


In [23]:
model.params

const          -3.876854
gre             0.002218
gpa             0.779337
prestige_2.0   -0.680137
prestige_3.0   -1.338677
prestige_4.0   -1.553411
dtype: float64
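Because the coefficients and their confidence bounds are on the log-odds scale, odds ratios and their 95% intervals come from exponentiating them. The sketch below assumes the rounded `coef` and `std err` values printed in the summary and a normal-approximation (Wald) interval; the helper name `odds_ratio_with_ci` is hypothetical, and the results are approximate because the inputs are rounded.

```python
import math

# (coef, std err) pairs copied from the model summary; rounded as printed
estimates = {
    'gre': (0.0022, 0.001),
    'gpa': (0.7793, 0.333),
    'prestige_2.0': (-0.6801, 0.317),
    'prestige_3.0': (-1.3387, 0.345),
    'prestige_4.0': (-1.5534, 0.417),
}

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal

def odds_ratio_with_ci(coef, std_err, z=Z_95):
    # Exponentiate the point estimate and the Wald interval bounds
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

odds_ratio_table = {name: odds_ratio_with_ci(coef, se)
                    for name, (coef, se) in estimates.items()}

for name in sorted(odds_ratio_table):
    print(name, odds_ratio_table[name])
```

With the fitted model in hand, `np.exp(model.params)` and `np.exp(model.conf_int())` give the same quantities at full precision.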

In [24]:
def odds_ratio(x):
    # Exponentiate a log-odds coefficient (or array of them) and print the result
    print(np.exp(x))

In [25]:
#gre:
odds_ratio(model.params[1])

1.00222086578


In [26]:
#gpa:
odds_ratio(model.params[2])

2.18002692743


In [27]:
#prestige_2.0:
odds_ratio(model.params[3])

0.506547791048


In [28]:
#prestige_3.0:
odds_ratio(model.params[4])

0.262192267325


In [29]:
#prestige_4.0:
odds_ratio(model.params[5])

0.211525333998


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

In [30]:
odds_ratio(model.params[3])

0.506547791048


Answer: For an applicant from a `prestige = 2` college, the odds of admission are about 0.51 times (that is, roughly 49% lower than) the odds for an applicant with the same GRE and GPA from a `prestige = 1` college.

> ### Question 16.  Interpret the odds ratio of `gpa`.

Answer: Holding GRE and prestige constant, each additional grade point multiplies the odds of admission by about 2.18.

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [31]:
predict_X = pd.DataFrame({'const': [1, 1, 1, 1], 'gre':[800, 800, 800, 800], 'gpa': [4, 4 ,4 ,4], 'prestige_2.0': [1, 0, 0, 0],
                          'prestige_3.0': [0, 1, 0, 0], 'prestige_4.0':[0, 0, 1, 0]})

In [32]:
#predict_X = sm.add_constant(predict_X)

In [33]:
predict_X

,const,gpa,gre,prestige_2.0,prestige_3.0,prestige_4.0
0,1,4,800,1,0,0
1,1,4,800,0,1,0
2,1,4,800,0,0,1
3,1,4,800,0,0,0


In [45]:
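# Caution: columns are matched by position; predict_X is ordered (const, gpa, gre, ...) but the model was fit on (const, gre, gpa, ...), so every prediction saturates at 1. Reorder with predict_X[X.columns].
model.predict(predict_X)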

array([ 1.,  1.,  1.,  1.])

In [35]:
np.exp(1.67315319e-179)/(1+np.exp(1.67315319e-179))

0.5
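The admission probabilities can also be sketched directly from the fitted coefficients by applying the logistic function to the linear predictor. The snippet below assumes the rounded values printed by `model.params`, so the results are approximate; the helper name `admission_probability` is hypothetical. Looking coefficients up by name sidesteps any dependence on column order.

```python
import math

# Coefficients as printed by model.params (rounded to six decimals)
coefficients = {
    'const': -3.876854,
    'gre': 0.002218,
    'gpa': 0.779337,
    'prestige_2.0': -0.680137,
    'prestige_3.0': -1.338677,
    'prestige_4.0': -1.553411,
}

def admission_probability(gre, gpa, tier):
    # Linear predictor (log-odds); tier 1 is the reference category
    logit = (coefficients['const']
             + coefficients['gre'] * gre
             + coefficients['gpa'] * gpa)
    if tier != 1:
        logit += coefficients['prestige_%d.0' % tier]
    # Logistic function maps log-odds to a probability
    return 1.0 / (1.0 + math.exp(-logit))

probabilities = {tier: admission_probability(800, 4, tier) for tier in (1, 2, 3, 4)}

for tier in (1, 2, 3, 4):
    print(tier, probabilities[tier])
```

Since all three prestige coefficients are negative and decreasing, the probability falls with each step down in tier.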

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [36]:
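# Caution: X already contains `const`, and LogisticRegression also fits its own intercept, so the intercept is split between intercept_ and the `const` coefficient.
model_SK = linear_model.LogisticRegression(C = 10**2).fit(X, y)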

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [37]:
print(model_SK.intercept_)
print(model_SK.coef_)

[-1.77303041]
[[-1.77303041  0.00209769  0.71809211 -0.72960187 -1.40029508 -1.60285974]]


In [38]:
odds_ratio(model_SK.coef_)

[[ 0.16981759  1.0020999   2.05051732  0.48210089  0.24652421  0.20131997]]


In [39]:
#Stats model:
odds_ratio(model.params)

const           0.020716
gre             1.002221
gpa             2.180027
prestige_2.0    0.506548
prestige_3.0    0.262192
prestige_4.0    0.211525
dtype: float64


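Answer: The odds ratios for the features are close in the two models, with the `sklearn` values slightly lower than the `statsmodels` ones (for example, 2.05 versus 2.18 for `gpa`). The `const` entries are not comparable, because `sklearn` also fits its own `intercept_`. Some difference is expected in any case, since `LogisticRegression` applies L2 regularization by default, although it is weak at `C = 10 ** 2`.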
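One way to keep the two fits comparable is to let only one component handle the intercept. The sketch below uses synthetic data invented for illustration (not the admissions data): it fits `LogisticRegression` on a feature matrix with no constant column, exponentiates the coefficients, and reads probabilities from `predict_proba` rather than class labels from `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, made up for this sketch: two features and a binary outcome
rng = np.random.RandomState(0)
features = rng.normal(size=(200, 2))
true_logit = -0.5 + 1.0 * features[:, 0] - 2.0 * features[:, 1]
labels = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# No constant column here: sklearn fits the intercept itself (fit_intercept=True by default)
model_sk = LogisticRegression(C=10 ** 2).fit(features, labels)

# Odds ratios for the two features
odds_ratios = np.exp(model_sk.coef_[0])

# predict_proba returns one column per class; column 1 is P(y = 1)
probabilities = model_sk.predict_proba(features[:4])[:, 1]

# predict returns hard class labels (0 or 1), not probabilities
predicted_labels = model_sk.predict(features[:4])

print(model_sk.intercept_, odds_ratios)
print(probabilities, predicted_labels)
```

An alternative, if the constant column is kept in the design matrix, is to pass `fit_intercept=False` so that the intercept appears only once, as a regular coefficient.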

> ### Question 20.  Again assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [40]:
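# Caution: predict() returns class labels (0 or 1), not probabilities, so the output below is just exp(1) for every row.
# For admission probabilities, use model_SK.predict_proba(predict_X[X.columns])[:, 1].
odds_ratio(model_SK.predict(predict_X))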

[ 2.71828183  2.71828183  2.71828183  2.71828183]


Answer:

>### Unit-project-4

In [41]:
pd.crosstab(df.admit, df['prestige'])

admit \ prestige,1.0,2.0,3.0,4.0
0,28,95,93,55
1,33,53,28,12


In [42]:
df[df.admit==1].describe()

,admit,gre,gpa,prestige,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
count,126.0,126.0,126.0,126.0,126.0,126.0,126.0,126.0
mean,1.0,618.571429,3.489206,2.150794,0.261905,0.420635,0.222222,0.095238
std,0.0,109.257233,0.371655,0.921455,0.441426,0.495632,0.417399,0.294715
min,1.0,300.0,2.42,1.0,0.0,0.0,0.0,0.0
25%,1.0,540.0,3.22,1.0,0.0,0.0,0.0,0.0
50%,1.0,620.0,3.545,2.0,0.0,0.0,0.0,0.0
75%,1.0,680.0,3.7575,3.0,1.0,1.0,0.0,0.0
max,1.0,800.0,4.0,4.0,1.0,1.0,1.0,1.0


In [43]:
df[df.admit==0].describe()

,admit,gre,gpa,prestige,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
count,271.0,271.0,271.0,271.0,271.0,271.0,271.0,271.0
mean,0.0,573.579336,3.347159,2.645756,0.103321,0.350554,0.343173,0.202952
std,0.0,116.052798,0.376355,0.918922,0.304941,0.478026,0.475647,0.402941
min,0.0,220.0,2.26,1.0,0.0,0.0,0.0,0.0
25%,0.0,500.0,3.08,2.0,0.0,0.0,0.0,0.0
50%,0.0,580.0,3.34,3.0,0.0,0.0,0.0,0.0
75%,0.0,660.0,3.61,3.0,0.0,1.0,1.0,0.0
max,0.0,800.0,4.0,4.0,1.0,1.0,1.0,1.0
