# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [2]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [3]:
df.groupby(['prestige','admit'])['admit'].count()

prestige  admit
1.0       0        28
          1        33
2.0       0        95
          1        53
3.0       0        93
          1        28
4.0       0        55
          1        12
Name: admit, dtype: int64
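
An equivalent view of this frequency table is a two-way cross-tabulation, with one row per `prestige` level and one column per `admit` value. A minimal sketch using `pd.crosstab` on a small made-up frame (the `toy` data is illustrative, not the admissions data):

```python
import pandas as pd

# Made-up rows for illustration only; not the admissions data.
toy = pd.DataFrame({'prestige': [1.0, 1.0, 2.0, 2.0, 2.0],
                    'admit':    [1,   0,   0,   0,   1]})

# Rows are prestige levels, columns are admit values, cells are counts.
table = pd.crosstab(toy['prestige'], toy['admit'])
print(table)
```

On the notebook's data, the same call would be `pd.crosstab(df.prestige, df.admit)`.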

## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [4]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'Prestige')

In [5]:
prestige_df

Unnamed: 0,Prestige_1.0,Prestige_2.0,Prestige_3.0,Prestige_4.0
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
395,0.0,1.0,0.0,0.0
396,0.0,0.0,1.0,0.0
397,0.0,1.0,0.0,0.0
398,0.0,1.0,0.0,0.0


In [6]:
prestige_df.rename(columns = {'Prestige_1.0': 'Prestige_1',
                           'Prestige_2.0': 'Prestige_2',
                           'Prestige_3.0': 'Prestige_3',
                           'Prestige_4.0': 'Prestige_4'}, inplace = True)

> ### Question 3.  How many of these binary variables do we need for modeling?

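Answer: Three. With four prestige levels we need k - 1 = 3 binary variables; the omitted level serves as the reference category. Including all four alongside an intercept makes them perfectly collinear (the dummy variable trap).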
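
The reason one dummy is left out can be checked on a toy example: the one-hot columns of every row sum to 1, so taken together they duplicate an intercept column. A minimal sketch, assuming a made-up `toy_prestige` series and tier 1 as the reference level:

```python
import pandas as pd

# Made-up prestige values for illustration only.
toy_prestige = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0])

dummies = pd.get_dummies(toy_prestige, prefix='Prestige')

# Each row has exactly one indicator switched on, so the columns sum to 1.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# Drop the reference level (tier 1) and keep k - 1 = 3 indicators.
reduced = dummies.drop('Prestige_1.0', axis=1)
print(list(reduced.columns))
```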

> ### Question 4.  Why are we doing this?

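Answer: `prestige` is a categorical ranking, not a quantity. One-hot encoding lets the logistic regression estimate a separate effect for each prestige level instead of assuming that every step in rank shifts the log-odds of admission by the same amount.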

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [7]:
df = df.join([prestige_df])

In [9]:
df

Unnamed: 0,admit,gre,gpa,prestige,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,3.0,0.0,0.0,1.0,0.0
1,1,660.0,3.67,3.0,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,1.0,0.0,0.0,0.0
3,1,640.0,3.19,4.0,0.0,0.0,0.0,1.0
4,0,520.0,2.93,4.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...
395,0,620.0,4.00,2.0,0.0,1.0,0.0,0.0
396,0,560.0,3.04,3.0,0.0,0.0,1.0,0.0
397,0,460.0,2.63,2.0,0.0,1.0,0.0,0.0
398,0,700.0,3.65,2.0,0.0,1.0,0.0,0.0


In [13]:
# Equivalent: df.drop('prestige', axis = 1, inplace = True)

del df['prestige']

In [14]:
df

Unnamed: 0,admit,gre,gpa,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0,380.0,3.61,0.0,0.0,1.0,0.0
1,1,660.0,3.67,0.0,0.0,1.0,0.0
2,1,800.0,4.00,1.0,0.0,0.0,0.0
3,1,640.0,3.19,0.0,0.0,0.0,1.0
4,0,520.0,2.93,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0.0,1.0,0.0,0.0
396,0,560.0,3.04,0.0,0.0,1.0,0.0
397,0,460.0,2.63,0.0,1.0,0.0,0.0
398,0,700.0,3.65,0.0,1.0,0.0,0.0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [22]:
df2 = df.groupby(['Prestige_1','admit'])['admit'].count()

df2

Prestige_1  admit
0.0         0        243
            1         93
1.0         0         28
            1         33
Name: admit, dtype: int64

> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [25]:
# Odds = admitted / not admitted (admitted / total would be a probability)
admit_odds = 33.0 / 28.0

print admit_odds

1.17857142857


> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [26]:
# Counts from the table above: 93 admitted, 243 not admitted
admit_nonodds = 93.0 / 243.0

print admit_nonodds

0.382716049383


> ### Question 9.  Finally, what's the odds ratio?

In [29]:
print admit_odds / admit_nonodds

3.07949308756


> ### Question 10.  Write this finding in a sentence.

Answer: The odds of admission for applicants from the most prestigious undergraduate schools are about 3.1 times the odds for applicants from all other schools.

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

Answer:
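
One way to work through Question 11 is sketched below, assuming the counts from the Part A frequency table: 12 admitted and 55 not admitted for `prestige = 4`, against 33 + 53 + 28 admitted and 28 + 95 + 93 not admitted for the other three tiers combined. The helper name `odds` is illustrative.

```python
def odds(admitted, not_admitted):
    """Odds of admission: admitted count divided by not-admitted count."""
    return float(admitted) / not_admitted

# Least prestigious schools (prestige = 4), counts from the Part A table.
odds_tier4 = odds(12, 55)

# All other tiers combined (prestige = 1, 2, 3).
odds_others = odds(33 + 53 + 28, 28 + 95 + 93)

odds_ratio_tier4 = odds_tier4 / odds_others

print("odds, tier 4:      %.3f" % odds_tier4)
print("odds, other tiers: %.3f" % odds_others)
print("odds ratio:        %.3f" % odds_ratio_tier4)
```

Under these counts, the odds of admission for applicants from the least prestigious schools are roughly 0.41 times the odds for applicants from all other schools.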

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [41]:
df.columns

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 399
Data columns (total 7 columns):
admit         397 non-null int64
gre           397 non-null float64
gpa           397 non-null float64
Prestige_1    397 non-null float64
Prestige_2    397 non-null float64
Prestige_3    397 non-null float64
Prestige_4    397 non-null float64
dtypes: float64(6), int64(1)
memory usage: 24.8 KB


In [44]:
prestige_df.head(2)

Unnamed: 0,Prestige_1,Prestige_2,Prestige_3,Prestige_4
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0


In [47]:
del df['Prestige_1']

In [52]:
train_df = df.sample(frac = .6, random_state = 0)
test_df = df.drop(train_df.index)



In [53]:
train_cols = df.columns[1:]

In [54]:
logit = smf.Logit(df['admit'], df[train_cols])

In [55]:
result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.589121
         Iterations 5


> ### Question 13.  Print the model's summary results.

In [56]:
print result.summary()

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      392
Method:                           MLE   Df Model:                            4
Date:                Tue, 07 Feb 2017   Pseudo R-squ.:                 0.05722
Time:                        17:09:55   Log-Likelihood:                -233.88
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.039e-05
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.308      0.191        -0.001     0.003
gpa           -0.1323      0.195     -0.680      0.497        -0.514     0.249
Prestige_2    -0.9562      0.302     -3.171      0.0

> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [57]:
print np.exp(result.params)

gre           1.001368
gpa           0.876073
Prestige_2    0.384342
Prestige_3    0.214918
Prestige_4    0.154135
dtype: float64


In [58]:
print result.conf_int()

                   0         1
gre        -0.000680  0.003414
gpa        -0.513657  0.249045
Prestige_2 -1.547279 -0.365166
Prestige_3 -2.188769 -0.886230
Prestige_4 -2.656743 -1.083112
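
`result.conf_int()` reports its bounds on the log-odds scale; exponentiating them gives the odds-ratio intervals that Question 14 asks for. In the notebook, `np.exp(result.conf_int())` would do this directly. The sketch below instead applies `math.exp` to the rounded bounds printed above, so its values are approximate.

```python
import math

# 95% bounds on the log-odds scale, copied from the printed conf_int() output.
log_odds_ci = {
    'gre':        (-0.000680,  0.003414),
    'gpa':        (-0.513657,  0.249045),
    'Prestige_2': (-1.547279, -0.365166),
    'Prestige_3': (-2.188769, -0.886230),
    'Prestige_4': (-2.656743, -1.083112),
}

# Exponentiate each bound to move from log-odds to odds ratios.
odds_ratio_ci = {}
for name in sorted(log_odds_ci):
    low, high = log_odds_ci[name]
    odds_ratio_ci[name] = (math.exp(low), math.exp(high))
    print("%-10s %.3f  %.3f" % (name, odds_ratio_ci[name][0], odds_ratio_ci[name][1]))
```

An interval that contains 1 (as for `gre` and `gpa`) is consistent with no effect on the odds; the three prestige intervals lie entirely below 1.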


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

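Answer: The odds ratio for `Prestige_2` is about 0.38. Holding `gre` and `gpa` constant, the odds of admission for an applicant from a tier-2 school are about 0.38 times the odds for an applicant from a tier-1 school (the reference level), i.e. about 62% lower.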

> ### Question 16.  Interpret the odds ratio of `gpa`.

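Answer: The odds ratio for `gpa` is about 0.88: in this model, each additional GPA point multiplies the odds of admission by about 0.88 (a 12% decrease), holding the other features constant. The effect is not statistically significant (p = 0.497, and the confidence interval for the coefficient spans zero), so the data do not distinguish it from no effect. Note that this model was fitted without an intercept, which can distort the `gre` and `gpa` coefficients.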

> ### Question 17.  Assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [62]:
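# Columns follow train_cols: gre, gpa, Prestige_2, Prestige_3, Prestige_4.
# Rows: tier 2, tier 3, tier 4, then tier 1 (all dummies 0 = reference level).
predict = pd.DataFrame([[800, 4, 1, 0, 0], [800, 4, 0, 1, 0], [800, 4, 0, 0, 1],
                        [800, 4, 0, 0, 0]])
print result.predict(predict)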

[ 0.40320425  0.27420161  0.21318433  0.63739858]


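Answer: The predicted probabilities of admission are about 0.64 for tier 1, 0.40 for tier 2, 0.27 for tier 3, and 0.21 for tier 4. The printed array follows the row order of `predict` (tier 2, tier 3, tier 4, tier 1):
[ 0.40320425  0.27420161  0.21318433  0.63739858]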
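
These probabilities can be cross-checked by hand, since a logistic model's prediction is the sigmoid of its linear predictor. The sketch below rebuilds the coefficients as logs of the rounded odds ratios printed in Question 14, so it only approximately reproduces `result.predict`; the function name `admission_probability` is illustrative.

```python
import math

# Odds ratios as printed by np.exp(result.params); tier 1 is the reference level.
odds_ratios = {'gre': 1.001368, 'gpa': 0.876073,
               'Prestige_2': 0.384342, 'Prestige_3': 0.214918,
               'Prestige_4': 0.154135}

# Coefficients are the natural logs of the odds ratios.
coef = dict((name, math.log(value)) for name, value in odds_ratios.items())

def admission_probability(gre, gpa, tier):
    """Sigmoid of the linear predictor (this model has no intercept)."""
    z = coef['gre'] * gre + coef['gpa'] * gpa
    if tier != 1:
        z += coef['Prestige_%d' % tier]
    return 1.0 / (1.0 + math.exp(-z))

probabilities = dict((tier, admission_probability(800, 4, tier))
                     for tier in (1, 2, 3, 4))
for tier in (1, 2, 3, 4):
    print("tier %d: %.3f" % (tier, probabilities[tier]))
```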



## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [64]:
train_df = df.sample(frac = .6, random_state = 0)
test_df = df.drop(train_df.index)

# Features only: `admit` is the response and must not be part of X
X = df[train_cols]
c = df.admit

new_model = linear_model.LogisticRegression(C = 10 ** 2).\
    fit(X, c)

In [66]:
new_model

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again, assume a student with a GRE of 800 and a GPA of 4.  What is his/her probability of admission if he/she comes from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
# TODO

Answer:
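
The mechanics for Questions 19 and 20 with `sklearn` can be sketched as follows. The data here is synthetic (two made-up standardized features with made-up true coefficients), not the admissions data, so the printed numbers say nothing about UCLA admissions; the names `x`, `y`, `true_beta`, and `new_rows` are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: two standardized features and a binary outcome.
rng = np.random.RandomState(0)
n = 2000
x = rng.normal(size=(n, 2))
true_beta = np.array([1.0, -0.5])   # made-up coefficients
p = 1.0 / (1.0 + np.exp(-(0.2 + x.dot(true_beta))))
y = (rng.uniform(size=n) < p).astype(int)

model = linear_model.LogisticRegression(C=10 ** 2).fit(x, y)

# Odds ratios: exponentiate the fitted coefficients (one per feature).
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)

# Predicted probability of the positive class for new observations;
# columns must be in the same order as the training features.
new_rows = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = model.predict_proba(new_rows)[:, 1]
print(probs)
```

On the admissions data, the same two steps applied to `new_model` would be `np.exp(new_model.coef_)` and `new_model.predict_proba(...)`, with rows laid out in the same column order as `X`. Because `sklearn` fits an intercept and applies L2 regularization by default, its odds ratios need not match the `statsmodels` ones exactly.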