# UCLA Admissions 
#### Data Source: UCLA's Logistic Regression in R tutorial <br/>
http://www.ats.ucla.edu/stat/r/dae/logit.htm <br/><br/>
## **Research Question: What helps you get into UCLA the most - the GPA score, GRE score, or the prestige of your undergraduate school?** <br/>
With a dataset of 400 observations we will explore the dataset and use Logistic Regression to predict admittance into UCLA 

## The Data

### Metadata / Data Dictionary

Variable | Summary | Description | Type of Variable | Role in Model
---| ---| --- | --- | ---
admit | admitted to UCLA or not | 1 admitted, 0 not admitted | binary | $Y$, Target, Response, Dependent Variable
GRE | Graduate Record Examinations - standardized test | integers - range from 200 to 800 | discrete* | $X_0$, Predictor, Feature, Independent Variable
GPA | Grade Point Average - summary of course grades | floats with precision to the hundredths - range from 0.00 to 4.00 | continuous | $X_1$, Predictor, Feature, Independent Variable
prestige | rank of the applicant's undergraduate university | integers 4 to 1 (highest) | ordinal | $X_2$, $X_3$, $X_4$, Predictors, Features, Independent Variables

*although GRE score is technically discrete it will be treated as continuous

This dataset is hypothetical (it was generated by UCLA), so the data has no timeframe.

For Logistic Regression:<br/>

the model estimates one coefficient ($B$) for each predictor, which gives a linear score <br/>
$y^* = B_0 + B_1*GRE + B_2*GPA + B_3*prestige$ <br/>

the logistic function then turns that score into a probability of admission <br/>
$p = e^{y^*} / (e^{y^*} + 1)$ <br/>

so $y^*$ is the log odds of admission: $ln(p / (1 - p)) = y^*$
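
To make the relationship between the linear score and the probability concrete, here is a minimal sketch. The coefficient values and the function names `logistic` and `log_odds` are made up for illustration; they are not fitted to the admissions data.

```python
import numpy as np

def logistic(y_star):
    """Turn a linear score y* into a probability: e^y* / (e^y* + 1)."""
    return np.exp(y_star) / (np.exp(y_star) + 1.0)

def log_odds(p):
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# hypothetical coefficients, chosen only for illustration (NOT fitted values)
B0, B1, B2, B3 = -3.5, 0.002, 0.8, -0.5

# one applicant, in the style of the rows in the data
gre, gpa, prestige = 660.0, 3.67, 3.0

y_star = B0 + B1 * gre + B2 * gpa + B3 * prestige  # linear score
p = logistic(y_star)                               # probability of admission

print("linear score:", y_star)
print("probability:", p)
print("log odds recovered from p:", log_odds(p))
```

Taking the log odds of `p` returns the linear score, which is why the coefficients of a logistic regression are read on the log odds scale.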

In [1]:
import pandas as pd
import numpy as np
# pandas is a very important API for data analysis; it is optimized for working with DataFrames
# (a DataFrame is a data table, like a spreadsheet in Excel)
# numpy provides the math functions
# it is a best practice to import all your libraries at the beginning - but I like to import them as they are needed

https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# pretty cool - now I can have multiple outputs from one notebook cell

In [51]:
data = pd.read_csv('assets/admissions.csv')
data.head()
data.tail()
# look the head and tail BOTH show up - awesome!

Unnamed: 0,admit,gre,gpa,prestige
0,0,380.0,3.61,3.0
1,1,660.0,3.67,3.0
2,1,800.0,4.0,1.0
3,1,640.0,3.19,4.0
4,0,520.0,2.93,4.0


Unnamed: 0,admit,gre,gpa,prestige
395,0,620.0,4.0,2.0
396,0,560.0,3.04,3.0
397,0,460.0,2.63,2.0
398,0,700.0,3.65,2.0
399,0,600.0,3.89,3.0


## The hypothesis
Does an increase or decrease in GRE, GPA, and Prestige affect admittance rates?
* $h_0^a$ There is not a trend in GRE and admittance rate
* $h_1^a$ There is a trend in GRE and admittance rate <br/><br/>
* $h_0^b$ There is not a trend in GPA and admittance rate
* $h_1^b$ There is a trend in GPA and admittance rate <br/><br/>
* $h_0^c$ There is not a trend in prestige and admittance rate
* $h_1^c$ There is a trend in prestige and admittance rate <br/><br/>

### Data Preprocessing / Cleaning
Not much is needed - this data set is nearly good to go. I am just going to drop the few rows that have null values.

In [52]:
data.shape
# (rows, columns)

(400, 4)

In [53]:
print("GRE null indices", np.where(data['gre'].isnull())[0])
print("GPA null indices", np.where(data['gpa'].isnull())[0])
print("Prestige null indices", np.where(data['prestige'].isnull())[0])
# the rows where there is a null value

GRE null indices [187 212]
GPA null indices [187 236]
Prestige null indices [236]
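
As an alternative to checking one column at a time, pandas can count the nulls per column and pull out every row that has at least one. This is a small sketch on a made-up frame named `toy` that only borrows the column names of the admissions data.

```python
import numpy as np
import pandas as pd

# tiny made-up frame with the same column names as the admissions data
toy = pd.DataFrame({
    "admit": [0, 1, 0, 1],
    "gre": [380.0, np.nan, 520.0, 660.0],
    "gpa": [3.61, 3.67, np.nan, np.nan],
    "prestige": [3.0, 3.0, 4.0, np.nan],
})

null_counts = toy.isnull().sum()            # number of nulls in each column
null_rows = toy[toy.isnull().any(axis=1)]   # rows with at least one null

print(null_counts)
print(null_rows)
```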


In [54]:
data.loc[187]
data.loc[212]
data.loc[236]
# note: it is interesting that one "person" (row 236) was admitted without a GPA or prestige score

admit       0.0
gre         NaN
gpa         NaN
prestige    2.0
Name: 187, dtype: float64

admit       0.00
gre          NaN
gpa         2.87
prestige    2.00
Name: 212, dtype: float64

admit         1.0
gre         660.0
gpa           NaN
prestige      NaN
Name: 236, dtype: float64

In [55]:
# drop nulls
df = data.dropna()
df.shape
# 397 records remain - lost 3 rows 

(397, 4)

## Exploratory Analysis

In [57]:
statsDF = df.describe()
statsDF

Unnamed: 0,admit,gre,gpa,prestige
count,397.0,397.0,397.0,397.0
mean,0.31738,587.858942,3.392242,2.488665
std,0.466044,115.717787,0.380208,0.947083
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.4,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


http://www.statisticshowto.com/probability-and-statistics/interquartile-range/

In [60]:
q75 = statsDF.loc["75%"]
q25 = statsDF.loc["25%"]

# Interquartile Range per column: Q3 - Q1
IQR = [x75 - x25 for x75, x25 in zip(q75, q25)]

IQRindex = {"admit": IQR[0], 
            "gre": IQR[1], 
            "gpa": IQR[2], 
            "prestige": IQR[3]}

# one-row frame (index 0) added below the summary statistics
dfIQR = pd.DataFrame(IQRindex, index = range(1))
A = pd.concat([statsDF, dfIQR])

In [61]:
std = statsDF.loc["std"]
mean = statsDF.loc["mean"]

# relative standard deviation per column: std as a percentage of the mean
Rstd = [(s / m) * 100 for s, m in zip(std, mean)]

Rstd_index = {"admit": Rstd[0], 
            "gre": Rstd[1], 
            "gpa": Rstd[2], 
            "prestige": Rstd[3]}

# one-row frame (index 0) added below the IQR row
df_Rstd = pd.DataFrame(Rstd_index, index = range(1))
pd.concat([A, df_Rstd])

Unnamed: 0,admit,gpa,gre,prestige
count,397.0,397.0,397.0,397.0
mean,0.31738,3.392242,587.858942,2.488665
std,0.466044,0.380208,115.717787,0.947083
min,0.0,2.26,220.0,1.0
25%,0.0,3.13,520.0,2.0
50%,0.0,3.4,580.0,2.0
75%,1.0,3.67,660.0,3.0
max,1.0,4.0,800.0,4.0
0,1.0,0.54,140.0,1.0
0,146.840899,11.208172,19.684618,38.055886


The first row labeled 0 above is the **Interquartile Range** (IQR), which helps describe the distribution of the data. 75% is Quartile 3 (Q3) and 25% is Quartile 1 (Q1), and the IQR is Q3 - Q1. So the middle 50% of the data spans 0.54 for GPA and 140 points for GRE.

The second (last) row labeled 0 is the **relative standard deviation**, which tells you how much variability the data has relative to the mean, as a percentage. Comparing the two, GRE (19.7%) has a greater percentage than GPA (11.2%), so GRE scores vary more than GPAs. Prestige has a larger percentage still (38.1%), but it is categorical data, so you cannot really interpret its variability this way - see the frequency tables below the box plots for that.

Back to GRE and GPA: their distributions, including the Interquartile Range, can be visualized with the box plots below. The point of looking at the distribution of the numerical data is to see whether it is fairly normally distributed, and GPA and GRE essentially pass that test. In both, most of the values sit toward the higher end of the scale, which makes sense.
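
Both statistics can also be computed directly on pandas rows, without building dictionaries by hand. The sketch below re-enters the GRE and GPA summary values from the `describe()` output above into a small frame named `summary` (a name chosen here for illustration).

```python
import pandas as pd

# summary values copied from the describe() output above
summary = pd.DataFrame(
    {
        "gre": [587.858942, 115.717787, 520.0, 660.0],
        "gpa": [3.392242, 0.380208, 3.13, 3.67],
    },
    index=["mean", "std", "25%", "75%"],
)

# Interquartile Range: Q3 - Q1, computed for every column at once
iqr = summary.loc["75%"] - summary.loc["25%"]

# relative standard deviation: std as a percentage of the mean
rel_std = summary.loc["std"] / summary.loc["mean"] * 100

print(iqr)
print(rel_std)
```

Because pandas aligns the two rows by column name, the subtraction and division apply to each variable without any explicit loop.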

### Visualize 

GRE and GPA Box Plots <br/>
https://plot.ly/~akell47/34/gre/ <br/>
https://plot.ly/~akell47/50/gpa/

In [62]:
import plotly 
import plotly.plotly as py
import plotly.graph_objs as go
# these box plots use plotly as a dependency
# (in plotly 4 and later, the plotly.plotly module lives in the separate chart_studio package)
plotly.tools.set_credentials_file(username='yourusername', api_key='your api key')

In [64]:
GRE = [
    go.Box(
        x= df.gre,
        name = 'GRE',
        boxpoints='all',
        jitter=0.5,
        pointpos=0,
        marker = dict(
            color = 'rgb(199, 21, 133)',
#             I <3 pink
        )
    )
]
py.iplot(GRE)


GPA = [
    go.Box(
        x= df.gpa,
        name = 'GPA',
        boxpoints='all',
        jitter=0.5,
        pointpos=0,
        marker = dict(
            color = 'rgb(50, 30, 112)',
#             this is a really nice color
        )
    )
]
py.iplot(GPA)
# these box plots are really nice because you can actually see the data points 
# and hover over the plot for each quartile pop-up
# You can see the discreteness of values in GRE and the more continuous values in GPA 

Prestige of 1 is the highest ranked prestige and 4 is the lowest prestige. Most applicants had a 2 or 3 prestige.

In [73]:
df.prestige.value_counts()

2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64

Two way Frequency Table of Prestige and Admit <br/>

In [81]:
admit_prestige = pd.crosstab(index = df.admit, columns = df['prestige'], margins = True)

admit_prestige.index = ["rejected", "accepted", "Total"]

admit_prestige

prestige,1.0,2.0,3.0,4.0,All
rejected,28,95,93,55,271
accepted,33,53,28,12,126
Total,61,148,121,67,397


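So 3.39 is the average for GPA and 587.86 is the average for GRE. Crossing admit with a single mean value, as in the next two cells, only reproduces the overall rejected and accepted counts.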

In [87]:
admit_gpa = pd.crosstab(index = df.admit, columns = df['gpa'].mean(), margins = True)

admit_gpa.index = ["rejected", "accepted", "Total"]

admit_gpa

col_0,3.3922418136,All
rejected,271,271
accepted,126,126
Total,397,397


In [88]:
admit_gre = pd.crosstab(index = df.admit, columns = df['gre'].mean(), margins = True)

admit_gre.index = ["rejected", "accepted", "Total"]

admit_gre

col_0,587.858942065,All
rejected,271,271
accepted,126,126
Total,397,397
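
A crosstab against the mean becomes more informative when each applicant is first flagged as above or below it. The sketch below assumes that is the comparison of interest. The six rows in `toy` are illustrative values rather than the full dataset, and the column labels are invented here.

```python
import pandas as pd

# a handful of illustrative rows, not the full admissions data
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 0],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93, 3.04],
})

# True when the applicant's GPA is above the mean GPA of this toy frame
above_mean = toy["gpa"] > toy["gpa"].mean()

table = pd.crosstab(index=toy["admit"], columns=above_mean, margins=True)
table.index = ["rejected", "accepted", "Total"]
table.columns = ["at or below mean", "above mean", "Total"]

print(table)
```

The same pattern would apply to GRE by swapping the column name.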


Two way Frequency Table of Prestige and Admit as a RATIO of all applicants<br/>
The largest share of accepted applicants came from prestige 2, followed by prestige 1. <br/>
The largest share of rejected applicants also came from prestige 2, followed closely by prestige 3. 

In [91]:
admit_prestige/admit_prestige.loc["Total", "All"]


prestige,1.0,2.0,3.0,4.0,All
rejected,0.070529,0.239295,0.234257,0.138539,0.68262
accepted,0.083123,0.133501,0.070529,0.030227,0.31738
Total,0.153652,0.372796,0.304786,0.168766,1.0
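
Dividing by the grand total gives each cell's share of all 397 applicants. Dividing each column by its own total instead gives the acceptance rate within each prestige level, which may be the more direct comparison. The sketch below re-enters the counts from the Prestige and Admit frequency table. The names `counts`, `share_of_all` and `rate_within_prestige` are chosen here for illustration.

```python
import pandas as pd

# counts copied from the Prestige and Admit two-way frequency table
counts = pd.DataFrame(
    {1.0: [28, 33], 2.0: [95, 53], 3.0: [93, 28], 4.0: [55, 12]},
    index=["rejected", "accepted"],
)

# each cell as a share of all applicants
share_of_all = counts / counts.values.sum()

# each cell as a share of its own prestige column
rate_within_prestige = counts / counts.sum(axis=0)

print(share_of_all)
print(rate_within_prestige)
```

Within each prestige level the acceptance rate is 33/61 for prestige 1 and 12/67 for prestige 4, so the rate falls as prestige goes from 1 to 4 even though prestige 2 has the most accepted applicants in absolute terms.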


Let's compare the rate of acceptance between prestige of 4 and of 1 using an odds ratio. 

In [77]:
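# prestige_dummy was never defined above; assuming it is the dummy-coded prestige column
prestige_dummy = pd.get_dummies(df['prestige'], prefix='prestige')

cols_to_keep = ['admit', 'gre', 'gpa']
handCalc = df[cols_to_keep].join(prestige_dummy.loc[:, 'prestige_1.0':])
handCalc.head()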
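
The odds ratio between prestige 1 and prestige 4 can also be worked by hand from the counts in the Prestige and Admit frequency table. This is a minimal sketch of that arithmetic; the variable names are chosen here for illustration.

```python
# counts copied from the Prestige and Admit two-way frequency table
accepted_1, rejected_1 = 33, 28   # prestige 1
accepted_4, rejected_4 = 12, 55   # prestige 4

# odds = accepted / rejected within each prestige level
odds_1 = accepted_1 / float(rejected_1)
odds_4 = accepted_4 / float(rejected_4)

# odds ratio: how many times larger the odds are for prestige 1 than for prestige 4
odds_ratio = odds_1 / odds_4

print("odds for prestige 1:", odds_1)
print("odds for prestige 4:", odds_4)
print("odds ratio (1 vs 4):", odds_ratio)
```

With these counts the odds of admission for a prestige 1 applicant are roughly 5.4 times the odds for a prestige 4 applicant.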