# Project 4 (Ellein Cheng)

In this project, you will summarize and present your analysis from Projects 1-3.

### Intro: Write a problem Statement/ Specific Aim for this project

Answer: 

PROBLEM STATEMENT - The aim of this project is to model the probability of admission into grad school based on independent variables (predictors) such as GRE scores, undergrad GPAs, and prestige (ranking) of undergraduate school.



### Dataset:  Write up a description of your data and any cleaning that was completed

Answer: This is a data set with 400 rows and 4 columns.  The outcome variable is admit, and the input variables are GRE, GPA, and prestige.  Here is the data dictionary:

Variable | Description | Type of Variable
---| ---| ---
admit | 0=not admitted, 1=admitted | categorical
gre | GRE score based on the 200-800 scale range | continuous 
gpa | Grade Point Average based on 0-4.0 scale range | continuous 
prestige | Prestige of undergrad school: 1=not prestige, 2=low prestige, 3=good prestige, 4=high prestige | categorical

It is unclear what the timeframe is for the dataset. One thing we know is that the GRE score is based on the 200-800 scale, which was the old scale used prior to August 2011. Therefore, the data would have to be older than that. In order to get the exact timeframe, we might need to reach out to the original source of the dataset.

Data cleaning consisted of dropping the rows with missing values, which removed 3 rows and left 397 data points.


### Demo: Provide a table that explains the data by admission status

In [20]:
#Import
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline

df_raw = pd.read_csv("../assets/admissions.csv")
df = df_raw.dropna()
df.describe()

Statistic | admit | gre | gpa | prestige
---| ---| ---| ---| ---
count | 397.0 | 397.0 | 397.0 | 397.0
mean | 0.31738 | 587.858942 | 3.392242 | 2.488665
std | 0.466044 | 115.717787 | 0.380208 | 0.947083
min | 0.0 | 220.0 | 2.26 | 1.0
25% | 0.0 | 520.0 | 3.13 | 2.0
50% | 0.0 | 580.0 | 3.4 | 2.0
75% | 1.0 | 660.0 | 3.67 | 3.0
max | 1.0 | 800.0 | 4.0 | 4.0


In [9]:
#average
df.pivot_table(columns='admit')

Variable | admit=0 | admit=1
---| ---| ---
gpa | 3.347159 | 3.489206
gre | 573.579336 | 618.571429
prestige | 2.645756 | 2.150794


In [14]:
#standard dev
df.pivot_table(columns='admit',aggfunc='std')

Variable | admit=0 | admit=1
---| ---| ---
gpa | 0.376355 | 0.371655
gre | 116.052798 | 109.257233
prestige | 0.918922 | 0.921455


#Counts by prestige
print(pd.crosstab(df['admit'], df['prestige']))

Mean (SD) of GPA and GRE, and share of applicants at each prestige level, by admission status

Variable | Not Admitted | Admitted
---| ---|---
GPA | 3.35 (0.38) | 3.49 (0.37)
GRE | 573.6 (116.1) | 618.6 (109.3)
Prestige 1 | 46% | 54%
Prestige 2 | 64% | 36%
Prestige 3 | 77% | 23%
Prestige 4 | 82% | 18%
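
The prestige percentages in the table can also be read as odds. The sketch below is a rough illustration that uses only the rounded percentages from the table, so the results are approximate and are not adjusted for GRE or GPA. The names `admit_share` and `to_odds` are illustrative and do not come from the original notebook.

```python
# Share of applicants admitted at each prestige level, copied from the
# summary table (rounded to whole percentages).
admit_share = {1: 0.54, 2: 0.36, 3: 0.23, 4: 0.18}


def to_odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1.0 - p)


# Odds of admission at each level, and the odds ratio relative to prestige 1.
odds = {level: to_odds(p) for level, p in admit_share.items()}
unadjusted_or = {level: odds[level] / odds[1] for level in odds}

for level in sorted(unadjusted_or):
    print("prestige %d: odds=%.3f, odds ratio vs prestige 1=%.3f"
          % (level, odds[level], unadjusted_or[level]))
```

Because these ratios ignore GRE and GPA, they need not match the odds ratios from the fitted model exactly.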

### Methods: Write up the methods used in your analysis

Answer:

1) The correlations were explored first.  Below is the correlation matrix.  From the table, the correlation coefficients between admit and the other variables are all weak (below 0.25 in absolute value), which suggests that a linear model would fit poorly.  In addition, since the output variable is binary, linear regression is not a good model to use for this analysis.

In [19]:
#Correlation Matrix
df.corr()

Variable | admit | gre | gpa | prestige
---| ---| ---| ---| ---
admit | 1.0 | 0.181202 | 0.174116 | -0.243563
gre | 0.181202 | 1.0 | 0.382408 | -0.124533
gpa | 0.174116 | 0.382408 | 1.0 | -0.060976
prestige | -0.243563 | -0.124533 | -0.060976 | 1.0


2) With a binary output variable, a better model to use is logistic regression.
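
For reference, the standard logistic regression model for this set of predictors can be written as follows. Here $p$ is the probability of admission, prestige 1 is taken as the reference level (its dummy column is left out of the fit), and the symbols $\beta_0, \dots, \beta_5$ are notation introduced here rather than taken from the notebook.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{gre} + \beta_2\,\mathrm{gpa} + \beta_3\,\mathrm{pres}_2 + \beta_4\,\mathrm{pres}_3 + \beta_5\,\mathrm{pres}_4
$$

Under this model, $e^{\beta_j}$ is the odds ratio associated with a one-unit increase in predictor $j$, holding the other predictors fixed.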

In [23]:
#Logistic regression with prestige 1 as the reference level
cols_to_keep = ['admit', 'gre', 'gpa']
dummy_pres = pd.get_dummies(df['prestige'], prefix='pres')
#Keep the dummy columns from pres_2 onward (.loc replaces the deprecated .ix)
data = df[cols_to_keep].join(dummy_pres.loc[:, 'pres_2':])
#statsmodels does not add an intercept automatically
data['intercept'] = 1.0
train_cols = data.columns[1:]
logit = sm.Logit(data['admit'], data[train_cols])
model_result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


### Results: Write up your results

Answer: 

From the Logit Regression Results table, the p-values for all the input variables are less than 0.05, which means each variable has a statistically significant association with admission at the 5% level.  Prestige rankings 3 and 4 have p-values closest to 0, which is the strongest evidence of an association with the output variable; a small p-value does not by itself measure predictive strength.  The p-value of GRE (0.043) is closest to 0.05, so its significance is borderline.

Observing the 95% confidence intervals of the coefficients (bounded by the 2.5% and 97.5% limits), none of them include zero.  Exponentiating these parameters gives the odds ratios, and their 95% confidence intervals do not include one.  Both results confirm that each input variable has a statistically detectable association with the output variable.
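
The limits labelled 0.025 and 0.975 in the summary table bound a 95% interval. The sketch below is a rough check, assuming the intervals are Wald intervals of the form coef ± 1.96 × std err (an assumption about how they were computed). It uses the rounded GPA values printed in the summary table, and `wald_interval` is an illustrative helper name.

```python
import math

Z_95 = 1.96  # approximate 97.5th percentile of the standard normal


def wald_interval(coef, std_err, z=Z_95):
    """Return (lower, upper) bounds of coef +/- z * std_err."""
    return coef - z * std_err, coef + z * std_err


# GPA coefficient and standard error as printed (rounded) in the summary.
gpa_coef, gpa_se = 0.7793, 0.333

low, high = wald_interval(gpa_coef, gpa_se)
print("coefficient interval: [%.3f, %.3f]" % (low, high))

# Exponentiating the bounds moves the interval to the odds-ratio scale.
or_low, or_high = math.exp(low), math.exp(high)
print("odds-ratio interval:  [%.3f, %.3f]" % (or_low, or_high))
```

Because the inputs are rounded, the results agree only approximately with the reported bounds (0.128 to 1.431 for the coefficient, 1.136 to 4.183 for the odds ratio).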

GPA:  The odds ratio of GPA is about 2.18, which means that each one-point increase in undergraduate GPA roughly doubles the odds of admission, holding GRE and prestige constant.

Prestige 2: The odds ratio is about 0.51, which means a candidate who went to a prestige 2 school has about half the odds of admission of a candidate who went to a prestige 1 school (the reference level), holding GRE and GPA constant.
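
These odds-ratio readings follow from exponentiating the fitted coefficients. The short sketch below uses the rounded GPA and prestige 2 coefficients from the summary table. The 0.1-point GPA step is an illustrative choice, and prestige 1 is the comparison level because its dummy column is left out of the fit.

```python
import math

# Coefficients as printed (rounded) in the Logit summary table.
coef = {"gpa": 0.7793, "pres_2": -0.6801}

# exp(coefficient) is the multiplicative change in the odds of admission
# for a one-unit increase in the predictor, other predictors held fixed.
odds_ratio = {name: math.exp(b) for name, b in coef.items()}

# A smaller step scales the coefficient before exponentiating:
# a 0.1-point GPA increase multiplies the odds by exp(0.1 * coef).
gpa_step = 0.1
or_gpa_step = math.exp(gpa_step * coef["gpa"])

print("GPA, +1.0 point: odds x %.2f" % odds_ratio["gpa"])
print("GPA, +%.1f point: odds x %.3f" % (gpa_step, or_gpa_step))
print("prestige 2 vs prestige 1: odds x %.2f" % odds_ratio["pres_2"])
```

An odds ratio multiplies odds, not probabilities, so doubling the odds does not double the probability of admission.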

### Visuals: Provide a table or visualization of these results

In [24]:
print(model_result.summary())

                           Logit Regression Results                           
Dep. Variable:                  admit   No. Observations:                  397
Model:                          Logit   Df Residuals:                      391
Method:                           MLE   Df Model:                            5
Date:                Mon, 15 Jan 2018   Pseudo R-squ.:                 0.08166
Time:                        11:20:42   Log-Likelihood:                -227.82
converged:                       True   LL-Null:                       -248.08
                                        LLR p-value:                 1.176e-07
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
gre            0.0022      0.001      2.028      0.043    7.44e-05       0.004
gpa            0.7793      0.333      2.344      0.019       0.128       1.431
pres_2.0      -0.6801      0.317     -2.146      0.0

In [25]:
print(np.exp(model_result.params))

gre          1.002221
gpa          2.180027
pres_2.0     0.506548
pres_3.0     0.262192
pres_4.0     0.211525
intercept    0.020716
dtype: float64
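
Since the stated aim is to model the probability of admission, the odds ratios printed above can be turned back into a predicted probability. The sketch below recovers each coefficient as the log of the printed odds ratio. The applicant profile (GRE 600, GPA 3.5) is a made-up example rather than a row from the dataset, and `predict_admit` is an illustrative helper name.

```python
import math

# Odds ratios as printed by np.exp(model_result.params)
# (keys shortened from pres_2.0, pres_3.0, pres_4.0).
odds_ratios = {
    "gre": 1.002221,
    "gpa": 2.180027,
    "pres_2": 0.506548,
    "pres_3": 0.262192,
    "pres_4": 0.211525,
    "intercept": 0.020716,
}

# Taking logs recovers the fitted coefficients.
beta = {name: math.log(value) for name, value in odds_ratios.items()}


def predict_admit(gre, gpa, prestige):
    """Predicted admission probability for one applicant.

    prestige is 1, 2, 3 or 4; level 1 is the reference and adds no term.
    """
    log_odds = beta["intercept"] + beta["gre"] * gre + beta["gpa"] * gpa
    if prestige != 1:
        log_odds += beta["pres_%d" % prestige]
    # The logistic function maps log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical applicant, evaluated at each prestige level.
for level in (1, 2, 3, 4):
    print("prestige %d: p=%.3f" % (level, predict_admit(600, 3.5, level)))
```

Holding GRE and GPA fixed, the predicted probability falls as the prestige code rises from 1 to 4, in line with the odds ratios below one for the prestige dummies.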


In [26]:
#Odds ratios with the bounds of their 95% confidence intervals
params = model_result.params
conf_interval = model_result.conf_int()
conf_interval['OddsRatio'] = params
conf_interval.columns = ['2.5%', '97.5%', 'OddsRatio']
print(np.exp(conf_interval))

               2.5%     97.5%  OddsRatio
gre        1.000074  1.004372   1.002221
gpa        1.136120  4.183113   2.180027
pres_2.0   0.272168  0.942767   0.506548
pres_3.0   0.133377  0.515419   0.262192
pres_4.0   0.093329  0.479411   0.211525
intercept  0.002207  0.194440   0.020716


<img src='placeholder.png' height= 25% width= 25%>

<img src='placeholder.png' height= 25% width= 25%>

### Discussion: Write up your discussion and future steps

Answer: 

This analysis indicates some level of association between the inputs (GRE, GPA, and prestige) and admission to grad school.  However, other variables that are not in the dataset, such as gender, years of work experience, age, and ethnicity, could confound these associations.  If data on these factors were collected, a more rigorous and relevant model could be developed to provide a stronger analysis.

Another potential project that could stem from this one is an analysis of the association between admission into grad school and career outcomes such as salary projections and title changes.  Since attending grad school is an investment, applicants might not necessarily accept an offer even if they are admitted.  Whether they end up enrolling depends on how much the program is worth to their career growth.