In [1]:
# import the necessary packages
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
import numpy as np
from plotnine import *

from sklearn.linear_model import LogisticRegression # Logistic Regression Model
from sklearn.preprocessing import StandardScaler #Z-score variables
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix

from sklearn.model_selection import train_test_split # simple TT split cv
from sklearn.model_selection import KFold # k-fold cv
from sklearn.model_selection import LeaveOneOut #LOO cv
from sklearn.model_selection import cross_val_score # cross validation metrics
from sklearn.model_selection import cross_val_predict # cross validation metrics


%matplotlib inline

## 1. Building a Logistic Regression Model

Using the grad admissions [data](https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/GradAdmissions.csv), build a logistic regression model that predicts `Admission` (whether or not a student was admitted) based on ALL the other variables in the data set (EXCEPT `Serial No`, information on the variables can be found [here](https://www.kaggle.com/mohansacharya/graduate-admissions#Admission_Predict_Ver1.1.csv), note that I've added the `Admissions` column to our dataset.)

### 1.1
Z-score your continuous variables
### 1.2
Use Train Test Split to validate your model
### 1.3
Put your coefficients in a data frame and for EACH predictor variable, interpret the coefficient in terms of Log Odds
### 1.4
Add a row to your coefficients data frame and add the coefficients in terms of Odds. For EACH predictor variable, interpret the coefficient in terms of Odds
### 1.5
(MARKDOWN) How well did your model do? Which metrics did you use to support your assessment?

    
## 2. Exploring Logistic Regression Coefficients

### 2.1

Using the coefficients from grad admissions model, manually (using code but not any built-in sklearn functions) calculate the predicted *log odds* of admission for the two students with the following (already z-scored) values:

(hint: if you're confused about how to do this, click [here](https://github.com/cmparlettpelleriti/CPSC392ParlettPelleriti/blob/master/Extras/Hint_HW3.ipynb) for a hint)

|           | GRE Score | TOEFL Score | University Rating | SOP  | LOR  | CGPA | Research |
|-----------|-----------|-------------|-------------------|------|------|------|----------|
| Student 1 | 0.60      | 0.05        | 0.75              | 0.65 | 1.02 | 0.25 | 1        |
| Student 2 | 1.60      | 0.05        | 0.75              | 0.65 | 1.02 | 0.25 | 1        |

Note that the only difference between Student 1 and Student 2 is a *1 unit (standard deviation) increase in GRE score*. 

* 2.1.1 (MARKDOWN) What is the difference (Student 2 - Student 1) in the log odds of the two students? (does that number look familiar?)

Now calculate the predicted *odds* of admission for the two students.

* 2.1.2 (MARKDOWN) What is the ratio (Student 2/Student 1) in the log odds of the two students? (does that number look familiar?)

Now calculate the predicted *probability* of admission for the two students.

### 2.2
The following students are from the same data set. Similarly to Students 1&2, the *only difference between Student 3&4 is a 1 unit (standard deviation) increase in GRE score*. To reiterate: the difference between Student 1 and Student 2 is *the same* as the difference between Student 3 and Student 4, but Students 1&2 have different values for the other variables than Students 3&4 do.

|           | GRE Score | TOEFL Score | University Rating | SOP  | LOR  | CGPA | Research |
|-----------|-----------|-------------|-------------------|------|------|------|----------|
| Student 3 | -1.25     | 0.24        | 0                 | 0.56 | -1   | -0.1 | 0        |
| Student 4 | -0.25     | 0.24        | 0                 | 0.56 | -1   | -0.1 | 0        |

Calculate the *log odds* of admission for the two students.

* 2.2.1 (MARKDOWN) What is the difference (Student 4 - Student 3) in the log odds of the two students? (does that number look familiar?)

Now calculate the predicted *odds* of admission for the two students.

* 2.2.2 (MARKDOWN) What is the ratio (Student 4/Student 3) in the log odds of the two students? (does that number look familiar?)

Now calculate the predicted *probability* of admission for the two students.

### 2.3

* 2.3.1 (MARKDOWN) Is the difference in log odds the same for the two pairs (1&2 and 3&4) of students?
* 2.3.2 (MARKDOWN) Is the ratio of  odds the same for the two pairs (1&2 and 3&4) of students?
* 2.3.3 (MARKDOWN) Is the difference in probability OR the ratio of probabilities the same for the two pairs (1/2 and 3/4) of students? 
* 2.3.4 (MARKDOWN) Using the information you learned in lectures and classwork, explain *why* the differences/ratios may be constant for some of these measures(log odds, odds, probabilities...) but not others.

In [3]:
d.columns

Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'LOR ', 'CGPA', 'Research', 'Admitted'],
      dtype='object')

In [7]:
# 1.1
d = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/GradAdmissions.csv")

predictors = ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR ", "CGPA", "Research"]
outcome = "Admitted"

X = d[predictors]
y = d[[outcome]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

z = StandardScaler()
X_train[["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR ", "CGPA"]] = z.fit_transform(X_train[["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR ", "CGPA"]])
X_test[["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR ", "CGPA"]] = z.transform(X_test[["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR ", "CGPA"]])

lr = LogisticRegression()

lr.fit(X_test, y_test)

coef = pd.DataFrame({"Coefs": lr.coef_[0], "Names": predictors})
coef = coef.append({"Coefs": lr.intercept_[0], "Names": "intercept"}, ignore_index = True)
coef

Unnamed: 0,Coefs,Names
0,0.577337,GRE Score
1,0.513078,TOEFL Score
2,-0.273198,University Rating
3,0.389323,SOP
4,0.43977,LOR
5,1.467234,CGPA
6,0.543061,Research
7,1.733991,intercept


In [None]:
# 1.2

In [None]:
# 1.3

In [None]:
# 1.4

In [None]:
# 1.5

In [14]:
# 2.1

coefs_vec = coef["Coefs"]
vals1 = np.array([0.60,0.05,0.75,0.65,1.02,0.25,1,1])
vals2 = np.array([1.60,0.05,0.75,0.65,1.02,0.25,1,1])

lo_pred1 = np.sum(coefs_vec*vals1)
lo_pred2 = np.sum(coefs_vec*vals2)

print(lo_pred2 - lo_pred1)

o_pred1 = np.exp(lo_pred1)
o_pred2 = np.exp(lo_pred2)

print(o_pred2/o_pred1)

p_pred1 = lo_pred1/(1 + lo_pred1)
p_pred2 = lo_pred2/(1 + lo_pred2)

print(p_pred2 - p_pred1)
print(p_pred2/p_pred1)

0.5773365374419104
1.7812877137493675
0.025135185754579847
1.0322908199879592


In [15]:
# 2.2
coefs_vec = coef["Coefs"]
vals3 = np.array([-1.25,0.24,0,0.56,-1,-0.1,0,1])
vals4 = np.array([-0.25,0.24,0,0.56,-1,-0.1,0,1])

lo_pred3 = np.sum(coefs_vec*vals3)
lo_pred4 = np.sum(coefs_vec*vals4)

print(lo_pred4 - lo_pred3)

o_pred3 = np.exp(lo_pred3)
o_pred4 = np.exp(lo_pred4)

print(o_pred4/o_pred3)

p_pred3 = lo_pred3/(1 + lo_pred3)
p_pred4 = lo_pred4/(1 + lo_pred4)

print(p_pred4 - p_pred3)
print(p_pred4/p_pred3)

0.5773365374419099
1.7812877137493666
0.13937302668789003
1.3210883304187624


In [1]:
# 2.3