# DS-SF-30 | Unit Project 3: Machine Learning Modeling

In this project, you will perform a logistic regression on the admissions data we've been working with in Unit Projects 1 and 2.

In [7]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [8]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)

df

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.00,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0
...,...,...,...,...
395,0,620.0,4.00,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0


## Part A.  Frequency Table

> ### Question 1.  Create a frequency table for `prestige` and whether an applicant was admitted.

In [9]:
pd.crosstab(df['prestige'],df['admit'], margins = True)

prestige,admit=0,admit=1,All
1.0,28,33,61
2.0,95,53,148
3.0,93,28,121
4.0,55,12,67
All,271,126,397


In [92]:
freq_table = pd.crosstab(index = df['prestige'], columns = df['admit'])
freq_table

prestige,admit=0,admit=1
1.0,28,33
2.0,95,53
3.0,93,28
4.0,55,12


In [10]:
pd.crosstab(df.prestige,df.admit, normalize = True, margins = True)

prestige,admit=0,admit=1,All
1.0,0.070529,0.083123,0.153652
2.0,0.239295,0.133501,0.372796
3.0,0.234257,0.070529,0.304786
4.0,0.138539,0.030227,0.168766
All,0.68262,0.31738,1.0


In [15]:
pd.crosstab(df.prestige,df.admit, normalize = 'columns')

prestige,admit=0,admit=1
1.0,0.103321,0.261905
2.0,0.350554,0.420635
3.0,0.343173,0.222222
4.0,0.202952,0.095238
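The proportions above are normalized over the whole table or within each `admit` column. To read off an admission rate per prestige tier, each row can be normalized instead. A minimal sketch, assuming the cell counts from the frequency table above; `toy` is a hypothetical stand-in rebuilt from those counts, not the original file:

```python
import pandas as pd

# Cell counts copied from the frequency table above: (prestige, admit) -> count
counts = {(1, 0): 28, (1, 1): 33, (2, 0): 95, (2, 1): 53,
          (3, 0): 93, (3, 1): 28, (4, 0): 55, (4, 1): 12}

# Rebuild one row per applicant so pd.crosstab can be used as in the notebook
rows = [(p, a) for (p, a), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=['prestige', 'admit'])

# normalize='index' divides every row by its row total,
# so the admit = 1 column is the admission rate within each tier
rates = pd.crosstab(toy['prestige'], toy['admit'], normalize='index')
print(rates.round(3))
```

The `admit = 1` column of `rates` is then directly comparable across tiers, which the column-normalized table is not.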


## Part B.  Variable Transformations

> ### Question 2.  Create a one-hot encoding for `prestige`.

In [18]:
one_hot = pd.get_dummies(df['prestige'], prefix = 'prestige')
one_hot

Unnamed: 0,prestige_1.0,prestige_2.0,prestige_3.0,prestige_4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


In [21]:
df = df.join(other=one_hot)
df

ValueError: columns overlap but no suffix specified: Index(['prestige_1.0', 'prestige_2.0', 'prestige_3.0', 'prestige_4.0'], dtype='object')
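`DataFrame.join` raises this `ValueError` when both frames share column names, which happens, for example, when the join cell is run a second time. A minimal sketch of a join that can be re-run safely; `join_once` and `toy` are hypothetical names used only for illustration:

```python
import pandas as pd

# Hypothetical mini-sample with one applicant per prestige tier
toy = pd.DataFrame({'prestige': [3.0, 1.0, 4.0, 2.0]})
one_hot = pd.get_dummies(toy['prestige'], prefix='prestige')


def join_once(frame, extra):
    """Join only the columns of `extra` that `frame` does not already have."""
    new_cols = [c for c in extra.columns if c not in frame.columns]
    return frame.join(extra[new_cols])


toy = join_once(toy, one_hot)
toy = join_once(toy, one_hot)  # the second call adds nothing instead of raising
print(toy.columns.tolist())
```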

In [95]:
pd.get_dummies(df['prestige'])

Unnamed: 0,1.0,2.0,3.0,4.0
0,0,0,1,0
1,0,0,1,0
2,1,0,0,0
3,0,0,0,1
4,0,0,0,1
...,...,...,...,...
395,0,1,0,0
396,0,0,1,0
397,0,1,0,0
398,0,1,0,0


> ### Question 3.  How many of these binary/dummy variables do we need for modeling?

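Answer: 3, that is, n - 1, where n = 4 is the number of categories. The omitted category becomes the reference level; including all four would make the dummies perfectly collinear with the intercept, since they always sum to 1.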
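A minimal sketch of producing only the n - 1 dummies in one step, assuming the `drop_first` option of `pd.get_dummies`; the `prestige` values below are a hypothetical mini-sample, not the full dataset:

```python
import pandas as pd

# Hypothetical mini-sample of prestige values covering all four tiers
prestige = pd.Series([3.0, 3.0, 1.0, 4.0, 4.0, 2.0], name='prestige')

# drop_first=True omits the first category (1.0), which then acts as the reference level
dummies = pd.get_dummies(prestige, prefix='prestige', drop_first=True).astype(int)
print(dummies)
```

A tier-1 applicant shows up as a row of zeros, so every coefficient on a remaining dummy is read relative to tier 1.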

> ### Question 4.  Why are we doing this?

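Answer: `prestige` is a categorical rank, not a quantity, so feeding it to the model as a single number would assume that every step between tiers has the same effect. One-hot encoding gives each tier its own binary variable, and therefore its own coefficient.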

> ### Question 5.  Add all these binary variables in the dataset and remove the now redundant `prestige` feature.

In [96]:
df = df.join(pd.get_dummies(df['prestige']))
df.pop('prestige')
df

Unnamed: 0,admit,gre,gpa,1.0,2.0,3.0,4.0
0,0,380.0,3.61,0,0,1,0
1,1,660.0,3.67,0,0,1,0
2,1,800.0,4.00,1,0,0,0
3,1,640.0,3.19,0,0,0,1
4,0,520.0,2.93,0,0,0,1
...,...,...,...,...,...,...,...
395,0,620.0,4.00,0,1,0,0
396,0,560.0,3.04,0,0,1,0
397,0,460.0,2.63,0,1,0,0
398,0,700.0,3.65,0,1,0,0


## Part C.  Hand calculating odds ratios

Let's develop our intuition about expected outcomes by hand calculating odds ratios.

> ### Question 6.  Create a frequency table for `prestige = 1` and whether an applicant was admitted.

In [None]:
# After Question 5 the tier-1 dummy is the column named 1.0; keep only the tier-1 applicants
df2 = df[df[1.0] == 1][['admit', 1.0]]
pd.crosstab(df2[1.0], df2['admit'])

In [97]:
freq_table1 = pd.crosstab(index = df[1], columns = df['admit'])
freq_table1

1.0,admit=0,admit=1
0,243,93
1,28,33


> ### Question 7.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the most prestigious undergraduate schools.

In [7]:
33/28

1.1785714285714286

> ### Question 8.  Now calculate the odds of admission for undergraduates who did not attend a #1 ranked college.

In [8]:
93/243

0.38271604938271603

> ### Question 9.  Finally, what's the odds ratio?

In [9]:
(33/28)/(93/243)

3.079493087557604

> ### Question 10.  Write this finding in a sentence.

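Answer:  The odds of admission for applicants from the most prestigious (tier-1) undergraduate schools are about 3.08 times the odds for applicants from all other schools. Since the ratio is greater than 1, attending a tier-1 school is associated with higher odds of admission.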
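The three hand calculations in Questions 7 to 9 can be collected into one small helper. A minimal sketch, using the counts from the `prestige = 1` frequency table; `odds` is a hypothetical helper name:

```python
def odds(successes, failures):
    """Odds = number of successes divided by number of failures."""
    return successes / failures


# Counts taken from the prestige = 1 frequency table above
odds_tier1 = odds(33, 28)    # admitted / not admitted, tier-1 schools
odds_other = odds(93, 243)   # admitted / not admitted, all other schools

odds_ratio = odds_tier1 / odds_other
print(round(odds_ratio, 3))  # 3.079
```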

> ### Question 11.  Use the frequency table above to calculate the odds of being admitted to graduate school for applicants that attended the least prestigious undergraduate schools.  Then calculate their odds ratio of being admitted to UCLA.  Finally, write this finding in a sentence.

In [10]:
(93/243)/(33/28)

0.3247287691732136

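Answer:  The value above, 0.325, is the reciprocal of the tier-1 odds ratio: the odds of admission for applicants who did not attend a tier-1 school are about a third of the odds for those who did. For the least prestigious schools specifically, the Part A frequency table gives odds of 12/55 ≈ 0.22, against 114/216 ≈ 0.53 for all other applicants, so the odds ratio is about 0.41. Because it is below 1, attending a tier-4 school is associated with lower odds of admission.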
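A short check of the tier-4 calculation, using only the counts from the Part A frequency table. This is a sketch of the comparison "tier 4 versus all other tiers"; the variable names are illustrative:

```python
# Counts from the Part A frequency table, keyed by prestige tier
not_admitted = {1: 28, 2: 95, 3: 93, 4: 55}
admitted = {1: 33, 2: 53, 3: 28, 4: 12}

tier = 4

# Odds of admission within the chosen tier
odds_tier = admitted[tier] / not_admitted[tier]

# Odds of admission for everyone outside the chosen tier
rest_admitted = sum(n for t, n in admitted.items() if t != tier)
rest_not_admitted = sum(n for t, n in not_admitted.items() if t != tier)
odds_rest = rest_admitted / rest_not_admitted

odds_ratio = odds_tier / odds_rest
print(round(odds_tier, 3), round(odds_rest, 3), round(odds_ratio, 3))
```

Changing `tier` to 1 reproduces the odds ratio from Question 9.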

## Part C. Analysis using `statsmodels`

> ### Question 12.  Fit a logistic regression model predicting admission into UCLA using `gre`, `gpa`, and the `prestige` of the undergraduate schools.  Use the highest prestige undergraduate schools as your reference point.

In [98]:
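import statsmodels.api as sm

# Caution: X is the same object as df, so pop() also drops the column from df,
# and X still contains 'admit', which leaks the target into the features
X = df
X.pop(1.0)
y = df['admit']

# OLS is a linear fit, not a logistic regression
results = sm.OLS(y, X).fit()

print(X.shape, y.shape)
print(' ')
print(results.summary())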

(397, 6) (397,)
 
                            OLS Regression Results                            
Dep. Variable:                  admit   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.618e+26
Date:                Thu, 20 Apr 2017   Prob (F-statistic):               0.00
Time:                        21:37:10   Log-Likelihood:                 10974.
No. Observations:                 397   AIC:                        -2.194e+04
Df Residuals:                     391   BIC:                        -2.191e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
admit          1.0000   2.73e-14  

In [101]:
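import statsmodels.formula.api as smf
df.rename(columns={2.0: 'prestige2', 3.0:'prestige3', 4.0:'prestige4'}, inplace=True)

# Tier 1 is the reference level, so it has no dummy in the formula.
# Note that smf.ols fits a linear probability model, not a logistic regression.
results2 = smf.ols(formula='admit ~ gre + gpa + prestige2 + prestige3 + prestige4', data=df).fit()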

> ### Question 13.  Print the model's summary results.

In [102]:
results2.summary()

Dep. Variable:,admit,R-squared:,0.099
Model:,OLS,Adj. R-squared:,0.087
Method:,Least Squares,F-statistic:,8.594
Date:,"Thu, 20 Apr 2017",Prob (F-statistic):,9.71e-08
Time:,21:38:59,Log-Likelihood:,-239.02
No. Observations:,397,AIC:,490.0
Df Residuals:,391,BIC:,513.9
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.2377,0.217,-1.095,0.274,-0.665,0.189
gre,0.0004,0.000,1.997,0.047,6.48e-06,0.001
gpa,0.1508,0.064,2.349,0.019,0.025,0.277
prestige2,-0.1635,0.068,-2.407,0.017,-0.297,-0.030
prestige3,-0.2910,0.070,-4.139,0.000,-0.429,-0.153
prestige4,-0.3240,0.079,-4.082,0.000,-0.480,-0.168

Omnibus:,152.312,Durbin-Watson:,1.946
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50.314
Skew:,0.678,Prob(JB):,1.19e-11
Kurtosis:,1.904,Cond. No.,6070.0


> ### Question 14.  What are the odds ratios of the different features and their 95% confidence intervals?

In [103]:
# Exponentiated coefficients; these are odds ratios only when the model is a logit fit
print(np.exp(results2.params))

# 95% confidence intervals for the raw (not exponentiated) coefficients
print(results2.conf_int())

Intercept    0.788400
gre          1.000422
gpa          1.162792
prestige2    0.849131
prestige3    0.747545
prestige4    0.723254
dtype: float64
                  0         1
Intercept -0.664512  0.189012
gre        0.000006  0.000837
gpa        0.024562  0.277086
prestige2 -0.297107 -0.029977
prestige3 -0.429169 -0.152753
prestige4 -0.480056 -0.167932
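Exponentiated coefficients are odds ratios when the coefficients come from a logit fit, and the matching interval is the exponentiated coefficient interval. A minimal sketch, assuming `smf.logit` from the formula API already imported in this notebook; the data below is synthetic, with hypothetical generating coefficients, so its numbers say nothing about the real admissions data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (NOT the real dataset)
rng = np.random.RandomState(0)
n = 400
toy = pd.DataFrame({
    'gre': rng.randint(22, 81, size=n) * 10.0,
    'gpa': rng.uniform(2.26, 4.0, size=n).round(2),
    'prestige': rng.randint(1, 5, size=n),
})
# Hypothetical generating coefficients, used only to draw the admit labels
log_odds = -4.0 + 0.002 * toy['gre'] + 0.8 * toy['gpa'] - 0.5 * (toy['prestige'] - 1)
toy['admit'] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# C(prestige) expands prestige into dummies and, by default,
# uses the first level (tier 1) as the reference
logit_fit = smf.logit('admit ~ gre + gpa + C(prestige)', data=toy).fit(disp=0)

# Odds ratios and their 95% confidence intervals, both on the exponentiated scale
odds_ratios = np.exp(logit_fit.params)
ci = np.exp(logit_fit.conf_int())
ci.columns = ['2.5%', '97.5%']
print(pd.concat([odds_ratios.rename('OR'), ci], axis=1))
```

An interval that excludes 1 corresponds to a coefficient interval that excludes 0.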


> ### Question 15.  Interpret the odds ratio for `prestige = 2`.

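Answer:  The value for `prestige2` (0.849) is below 1, so, holding `gre` and `gpa` fixed, applicants from tier-2 schools are less likely to be admitted than applicants from tier-1 schools, the reference category. The drop is smaller than for `prestige3` (0.748) and `prestige4` (0.723). Because `results2` is an OLS fit, the underlying coefficient of -0.1635 is a change of about 16 percentage points in admission probability rather than a log-odds; a true odds ratio requires a logistic model.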

> ### Question 16.  Interpret the odds ratio of `gpa`.

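Answer: The value for `gpa` (1.163) is above 1, so a higher GPA is associated with a higher chance of admission, holding `gre` and prestige fixed. Because `results2` is an OLS fit, the coefficient of 0.1508 reads as roughly a 15 percentage-point increase in admission probability per one-point increase in GPA. The values for different features are not directly comparable, since `gre`, `gpa`, and the prestige dummies are measured on different scales.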

> ### Question 17.  Assume a student has a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

In [None]:
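# One row per tier for GRE 800 / GPA 4; tier 1 is the reference, so its dummies are all 0
test_df = pd.DataFrame({'gre': [800] * 4, 'gpa': [4.0] * 4, 'prestige2': [0, 1, 0, 0],
                        'prestige3': [0, 0, 1, 0], 'prestige4': [0, 0, 0, 1]})
results2.predict(test_df)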

In [108]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gres = np.linspace(df['gre'].min(), df['gre'].max(), 10)
print(gres)

[ 220.          284.44444444  348.88888889  413.33333333  477.77777778
  542.22222222  606.66666667  671.11111111  735.55555556  800.        ]


In [109]:
# instead of generating all possible values of GRE and GPA, we're going
# to use an evenly spaced range of 10 values from the min to the max 
gpas = np.linspace(df['gpa'].min(), df['gpa'].max(), 10)
print(gpas)

[ 2.26        2.45333333  2.64666667  2.84        3.03333333  3.22666667
  3.42        3.61333333  3.80666667  4.        ]


In [133]:
#cartesian(([1, 2, 3, 4], [1.]))
np.array(np.meshgrid(gres, gpas, [1, 2, 3, 4], [1.])).T.reshape(-1,4)

array([[ 220.        ,    2.26      ,    1.        ,    1.        ],
       [ 220.        ,    2.45333333,    1.        ,    1.        ],
       [ 220.        ,    2.64666667,    1.        ,    1.        ],
       ..., 
       [ 800.        ,    3.61333333,    4.        ,    1.        ],
       [ 800.        ,    3.80666667,    4.        ,    1.        ],
       [ 800.        ,    4.        ,    4.        ,    1.        ]])

In [140]:
# enumerate all possibilities
combos = pd.DataFrame(np.array(np.meshgrid(gres, gpas, [1, 2, 3, 4], [1.])).T.reshape(-1,4))
combos.columns = ['gre', 'gpa', 'prestige', 'intercept']
combos

Unnamed: 0,gre,gpa,prestige,intercept
0,220.0,2.260000,1.0,1.0
1,220.0,2.453333,1.0,1.0
2,220.0,2.646667,1.0,1.0
3,220.0,2.840000,1.0,1.0
4,220.0,3.033333,1.0,1.0
...,...,...,...,...
395,800.0,3.226667,4.0,1.0
396,800.0,3.420000,4.0,1.0
397,800.0,3.613333,4.0,1.0
398,800.0,3.806667,4.0,1.0


In [141]:
# recreate the dummy variables
dummy_ranks = pd.get_dummies(combos['prestige'], prefix='prestige')
dummy_ranks.columns = ['prestige1', 'prestige2', 'prestige3', 'prestige4']

In [142]:
# keep only what we need for making predictions
cols_to_keep = ['gre', 'gpa', 'prestige', 'intercept']
combos = combos[cols_to_keep].join(dummy_ranks.loc[:, 'prestige2':])

In [144]:
train_cols = df.columns[1:]

In [150]:
# make predictions on the enumerated dataset
combos['admit_pred'] = results2.predict(combos[train_cols])

print(combos[(combos.gre==800) & (combos.gpa==4.0)])

       gre  gpa  prestige  intercept  prestige2  prestige3  prestige4  \
99   800.0  4.0       1.0        1.0          0          0          0   
199  800.0  4.0       2.0        1.0          1          0          0   
299  800.0  4.0       3.0        1.0          0          1          0   
399  800.0  4.0       4.0        1.0          0          0          1   

     admit_pred  
99     0.702806  
199    0.539264  
299    0.411845  
399    0.378812  


Answer: The predicted probability of admission is about 70% for tier 1, 54% for tier 2, 41% for tier 3, and 38% for tier 4.
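Because `results2` is a linear (OLS) fit, the tier-to-tier differences in these predictions equal the dummy coefficients from the summary table. A quick hand check, using only numbers printed above; the coefficients are rounded to four decimals, so agreement is approximate:

```python
# Tier-1 prediction and dummy coefficients copied from the outputs above
pred_tier1 = 0.702806
coef = {'prestige2': -0.1635, 'prestige3': -0.2910, 'prestige4': -0.3240}

# In a linear model, switching a dummy from 0 to 1 shifts the prediction by its coefficient
preds = {1: pred_tier1}
for tier, name in [(2, 'prestige2'), (3, 'prestige3'), (4, 'prestige4')]:
    preds[tier] = pred_tier1 + coef[name]

for tier, p in preds.items():
    print(tier, round(p, 3))
```

In a logistic model the shift would be constant on the log-odds scale instead, so the gaps in probability would depend on `gre` and `gpa`.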

## Part D. Moving the model from `statsmodels` to `sklearn`

> ### Question 18.  Let's assume we are satisfied with our model.  Remodel it (same features) using `sklearn`.  Create the logistic regression model with `LogisticRegression(C = 10 ** 2)`.

In [155]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Weak regularization, as specified in the question
lm = LogisticRegression(C = 10 ** 2)
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))
df.dropna(inplace = True)
df = df.join(pd.get_dummies(df['prestige']))
df.pop('prestige')
df.head()

# Same features as the statsmodels model; tier 1 (column 1.0) is the omitted reference
X = df[['gre',  'gpa', 2.0, 3.0, 4.0]]
y = df['admit']

print (X.shape, y.shape)
print (' ')

lm.fit(X, y)

(397, 5) (397,)
 


LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
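Once a `LogisticRegression` is fitted, odds ratios come from exponentiating `coef_`, and admission probabilities from `predict_proba`. A minimal sketch of both steps on synthetic data; the generating coefficients, the `toy` frame, and the `prestige_2` to `prestige_4` column names are assumptions for illustration, so the printed numbers are not answers for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the admissions data (NOT the real dataset)
rng = np.random.RandomState(0)
n = 400
toy = pd.DataFrame({
    'gre': rng.randint(22, 81, size=n) * 10.0,
    'gpa': rng.uniform(2.26, 4.0, size=n).round(2),
    'prestige': rng.randint(1, 5, size=n),
})
# Hypothetical generating coefficients, used only to draw the admit labels
log_odds = -4.0 + 0.002 * toy['gre'] + 0.8 * toy['gpa'] - 0.5 * (toy['prestige'] - 1)
toy['admit'] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# String-named dummies for tiers 2 to 4; tier 1 is the omitted reference
dummies = pd.get_dummies(toy['prestige'], prefix='prestige', drop_first=True).astype(int)
X = toy[['gre', 'gpa']].join(dummies)
y = toy['admit']

model = LogisticRegression(C=10 ** 2, max_iter=1000).fit(X, y)

# exp(coefficient) is the odds ratio for a one-unit increase in that feature
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)

# One row per tier for a student with GRE 800 and GPA 4.0
student = pd.DataFrame({
    'gre': [800.0] * 4, 'gpa': [4.0] * 4,
    'prestige_2': [0, 1, 0, 0], 'prestige_3': [0, 0, 1, 0], 'prestige_4': [0, 0, 0, 1],
})[list(X.columns)]
probs = model.predict_proba(student)[:, 1]  # column 1 is P(admit = 1)
print(probs.round(3))
```

With `C = 10 ** 2` the L2 penalty is weak, so the coefficients would be expected to land close to an unpenalized logit fit on the same data.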

> ### Question 19.  What are the odds ratios for the different variables and how do they compare with the odds ratios calculated with `statsmodels`?

In [None]:
# TODO

Answer:

> ### Question 20.  Again, assume a student has a GRE of 800 and a GPA of 4.  What is their probability of admission if they come from a tier-1, tier-2, tier-3, or tier-4 undergraduate school?

Answer: