# DS-SF-27 | Unit Project 4: Notebook with Executive Summary

In this project, you will summarize and present your analysis from Unit Projects 1-3.

> ## Question 1.  Introduction
> Write a problem statement for this project.

Answer: We want to determine what influences the likelihood of a student being admitted to UCLA Graduate School. Applicants must submit many materials, including their undergraduate GPA and their GRE score. In addition, the admissions process considers where applicants studied as undergraduates. We want to see whether any of these aspects of the application has a significant influence on whether a student is admitted. This can help students better understand their chances of admission, and help undergraduate programs better inform and assist students with their graduate school applications.

> ## Question 2.  Dataset
> Write up a description of your data and any cleaning that was completed.

Answer: We have a dataset containing data on 400 applicants. The variables in the dataset are gre, gpa, prestige, and admit. gre is the applicant's Graduate Record Exam (GRE) score, with observed values from 220 to 800. gpa is the applicant's undergraduate Grade Point Average, with observed values from 2.26 to 4.0. prestige is the ranking or merit of the applicant's undergraduate school on a scale of 1-4, where 1 marks the highest prestige and 4 the lowest (not prestigious). admit records whether the applicant was admitted (1) or not (0).

To clean the data, we used the dropna function to remove all rows containing null values. This left 397 observations out of the original 400.
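
To make the cleaning step concrete, here is a minimal plain-Python sketch of the rule that dropna applies by default: a row is discarded if any of its fields is missing. The helper names `is_missing` and `drop_incomplete_rows` and the four rows are hypothetical and for illustration only; they are not taken from the dataset.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_rows(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows
            if not any(is_missing(v) for v in row.values())]


# Hypothetical rows for illustration only (not the real applicants).
rows = [
    {"admit": 0, "gre": 380.0, "gpa": 3.61, "prestige": 3.0},
    {"admit": 1, "gre": float("nan"), "gpa": 3.20, "prestige": 2.0},
    {"admit": 0, "gre": 540.0, "gpa": None, "prestige": 1.0},
    {"admit": 1, "gre": 660.0, "gpa": 3.67, "prestige": 3.0},
]

clean = drop_incomplete_rows(rows)
print(len(rows), "rows before,", len(clean), "rows after")
```

In this sketch two of the four rows survive, in the same way that 397 of the 400 applicants remain after cleaning.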

> ## Question 3.  Demo
> Provide a table that explains the data by admission status.

Answer:

Variable | Description | Type of Variable
---|---|---
`admit` | 0 = Not admitted, 1 = Admitted | Categorical
`gre` | GRE score (possible range: 200-800; observed: 220-800) | Continuous
`gpa` | Undergraduate GPA (possible range: 0-4.0; observed: 2.26-4.0) | Continuous
`prestige` | 1 = High prestige, 2 = Good prestige, 3 = Low prestige, 4 = Not prestigious | Categorical

> ## Question 4. Methods
> Write up the methods used in your analysis.

Answer: We used box plots to check for outliers, and Q-Q plots and histograms to check the distribution of each variable. We then used odds ratios to calculate the odds of admission into UCLA Graduate School based on prestige. Lastly, we used logistic regression to determine the influence of gpa, gre, and prestige on an applicant's admission.
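
As an illustration of the outlier check, the sketch below applies the conventional 1.5 × IQR rule that box plots commonly use to draw their whiskers. It is an assumption that this rule matches the plots used in the analysis. The function name `iqr_outliers` is hypothetical, and the sample contains only the nine gre values visible in the truncated table in the Visuals section, not the full dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# The nine gre values shown in the truncated DataFrame display
# (a small sample, not the full 397 observations).
gre_sample = [380, 660, 800, 640, 520, 620, 560, 460, 700]

print(iqr_outliers(gre_sample))
```

On this small sample no value falls outside the fences; the same function could be applied to the full gre and gpa columns.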

> ## Question 5. Results
> Write up your results.

Answer: Calculating the odds ratios manually, we found that the odds of being admitted into UCLA Graduate School depend strongly on prestige. The higher the prestige of the undergraduate school, the higher the odds of admission; the lower the prestige, the lower the odds.

After fitting a logistic regression, we found that all variables (gre, gpa, prestige_2, prestige_3, and prestige_4) are statistically significant at the 0.05 level, so all play a part in determining admission. Judged on significance alone, no single variable stood out from the others.

One interesting finding was that when we converted the coefficients to odds ratios, gpa returned the largest value (about 2.18), which indicates that each one-point increase in gpa roughly doubles the odds of admission when the other variables are held constant. This suggests that gpa potentially has the most influence out of all the variables.

> ## Question 6. Visuals
> Provide a table or visualization of these results.

In [13]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.notebook_repr_html', True)

import statsmodels.formula.api as smf

from sklearn import linear_model

In [14]:
df = pd.read_csv(os.path.join('..', '..', 'dataset', 'ucla-admissions.csv'))
df.dropna(inplace = True)

df

index | admit | gre | gpa | prestige
---|---|---|---|---
0 | 0 | 380.0 | 3.61 | 3.0
1 | 1 | 660.0 | 3.67 | 3.0
2 | 1 | 800.0 | 4.00 | 1.0
3 | 1 | 640.0 | 3.19 | 4.0
4 | 0 | 520.0 | 2.93 | 4.0
... | ... | ... | ... | ...
395 | 0 | 620.0 | 4.00 | 2.0
396 | 0 | 560.0 | 3.04 | 3.0
397 | 0 | 460.0 | 2.63 | 2.0
398 | 0 | 700.0 | 3.65 | 2.0


In [19]:
prestige_df = pd.get_dummies(df.prestige, prefix = 'prestige')

prestige_df.rename(columns = {'prestige_1.0': 'prestige_1',
                              'prestige_2.0': 'prestige_2',
                              'prestige_3.0': 'prestige_3',
                              'prestige_4.0': 'prestige_4'}, inplace = True)

In [20]:

prestige_df.head()

index | prestige_1 | prestige_2 | prestige_3 | prestige_4
---|---|---|---|---
0 | 0.0 | 0.0 | 1.0 | 0.0
1 | 0.0 | 0.0 | 1.0 | 0.0
2 | 1.0 | 0.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 0.0 | 1.0
4 | 0.0 | 0.0 | 0.0 | 1.0


In [24]:
df = df[ ['admit', 'gre', 'gpa'] ].join(prestige_df)

df.head()

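index | admit | gre | gpa | prestige_1 | prestige_2 | prestige_3 | prestige_4
---|---|---|---|---|---|---|---
0 | 0 | 380.0 | 3.61 | 0.0 | 0.0 | 1.0 | 0.0
1 | 1 | 660.0 | 3.67 | 0.0 | 0.0 | 1.0 | 0.0
2 | 1 | 800.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0
3 | 1 | 640.0 | 3.19 | 0.0 | 0.0 | 0.0 | 1.0
4 | 0 | 520.0 | 2.93 | 0.0 | 0.0 | 0.0 | 1.0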


In [25]:
pd.crosstab(df.admit, df.prestige_4)

admit | prestige_4 = 0.0 | prestige_4 = 1.0
---|---|---
0 | 216 | 55
1 | 114 | 12
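
The manual odds-ratio calculation can be sketched directly from the counts in the crosstab above. This is a minimal illustration for prestige_4 versus all other tiers combined; the variable and function names are hypothetical, and the original analysis may have grouped the tiers differently.

```python
# Counts copied from the crosstab of admit by prestige_4 above.
not_admitted = {"other": 216, "prestige_4": 55}
admitted = {"other": 114, "prestige_4": 12}


def odds(successes, failures):
    """Odds are the ratio of successes to failures."""
    return successes / failures


odds_p4 = odds(admitted["prestige_4"], not_admitted["prestige_4"])
odds_other = odds(admitted["other"], not_admitted["other"])

# Odds ratio: odds of admission from a prestige_4 school
# relative to the odds from all other schools combined.
odds_ratio = odds_p4 / odds_other

print(round(odds_p4, 3), round(odds_other, 3), round(odds_ratio, 3))
```

The ratio comes out at roughly 0.41, meaning the odds of admission from a prestige 4 school are less than half the odds from the other tiers combined. This unadjusted figure is not directly comparable to the prestige_4 odds ratio from the regression, which uses prestige 1 as the baseline and controls for gre and gpa.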


In [26]:
model = smf.logit(formula = 'admit ~ gre + gpa + prestige_2 + prestige_3 + prestige_4', data = df).fit()

Optimization terminated successfully.
         Current function value: 0.573854
         Iterations 6


Below are the results from our logistic regression:

In [27]:
model.summary()

Statistic | Value
---|---
Dep. Variable | admit
Model | Logit
Method | MLE
Date | Thu, 03 Nov 2016
Time | 17:50:57
converged | True
No. Observations | 397
Df Residuals | 391
Df Model | 5
Pseudo R-squ. | 0.08166
Log-Likelihood | -227.82
LL-Null | -248.08
LLR p-value | 1.176e-07
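
The fit statistics in this summary are related by simple arithmetic, which the sketch below reproduces from the rounded values shown. It assumes the reported pseudo R-squared is McFadden's version and that the "Current function value" printed during fitting is the average negative log-likelihood per observation; small differences from the reported figures come from rounding.

```python
# Values copied from the model summary above.
n_obs = 397
ll_model = -227.82   # Log-Likelihood of the fitted model
ll_null = -248.08    # LL-Null: log-likelihood of the intercept-only model

# McFadden's pseudo R-squared: 1 - LL_model / LL_null.
mcfadden_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic behind the LLR p-value
# (compared against a chi-squared distribution with Df Model = 5).
llr_stat = 2 * (ll_model - ll_null)

# Average negative log-likelihood per observation.
mean_neg_ll = -ll_model / n_obs

print(round(mcfadden_r2, 5), round(llr_stat, 2), round(mean_neg_ll, 6))
```

A pseudo R-squared of about 0.08 indicates that the predictors improve on the intercept-only model only modestly, even though the likelihood-ratio test shows the improvement is statistically significant.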

Term | coef | std err | z | P>\|z\| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | -3.8769 | 1.142 | -3.393 | 0.001 | -6.116 -1.638
gre | 0.0022 | 0.001 | 2.028 | 0.043 | 7.44e-05 0.004
gpa | 0.7793 | 0.333 | 2.344 | 0.019 | 0.128 1.431
prestige_2 | -0.6801 | 0.317 | -2.146 | 0.032 | -1.301 -0.059
prestige_3 | -1.3387 | 0.345 | -3.882 | 0.000 | -2.015 -0.663
prestige_4 | -1.5534 | 0.417 | -3.721 | 0.000 | -2.372 -0.735
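
To show how these coefficients turn into a prediction, the sketch below plugs them into the logistic function. The function name `admit_probability` is hypothetical, and because the coefficients are the rounded values from the table, the probabilities are approximate. prestige_1 is the baseline category, so it contributes no extra term.

```python
import math

# Coefficients copied from the logistic regression summary above.
COEF = {
    "Intercept": -3.8769,
    "gre": 0.0022,
    "gpa": 0.7793,
    "prestige_2": -0.6801,
    "prestige_3": -1.3387,
    "prestige_4": -1.5534,
}


def admit_probability(gre, gpa, prestige):
    """Predicted probability of admission for one applicant."""
    z = COEF["Intercept"] + COEF["gre"] * gre + COEF["gpa"] * gpa
    if prestige != 1:  # prestige 1 is the reference level
        z += COEF[f"prestige_{prestige}"]
    return 1 / (1 + math.exp(-z))  # logistic (sigmoid) function


# Same gre and gpa, different undergraduate prestige.
p_top = admit_probability(gre=800, gpa=4.0, prestige=1)
p_low = admit_probability(gre=800, gpa=4.0, prestige=4)

print(round(p_top, 3), round(p_low, 3))
```

Holding gre and gpa fixed at their maximum values, moving from a prestige 1 school to a prestige 4 school cuts the predicted probability of admission roughly in half under this model.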


Below are the odds ratios, calculated by exponentiating the model coefficients. gpa has the highest value:

In [28]:
np.exp(model.params)

Intercept     0.020716
gre           1.002221
gpa           2.180027
prestige_2    0.506548
prestige_3    0.262192
prestige_4    0.211525
dtype: float64
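
Each odds ratio above is the multiplicative change in the odds of admission for a one-unit increase in that variable. Because gre and gpa are measured on very different scales, the sketch below also rescales the gre effect to a 100-point increase; the choice of 100 points is an illustrative assumption, not part of the original analysis. The coefficients are the rounded values from the regression summary, so the results differ slightly from the output of `np.exp(model.params)`.

```python
import math

# Rounded coefficients copied from the logistic regression summary.
coef = {
    "Intercept": -3.8769,
    "gre": 0.0022,
    "gpa": 0.7793,
    "prestige_2": -0.6801,
    "prestige_3": -1.3387,
    "prestige_4": -1.5534,
}

# Odds ratio per one-unit increase: exp(coefficient).
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

# Odds ratio for a 100-point gre increase: exp(100 * coefficient).
or_gre_100 = math.exp(coef["gre"] * 100)

for name, value in odds_ratios.items():
    print(f"{name:<12}{value:.4f}")
print(f"gre (+100)  {or_gre_100:.4f}")
```

On this scale a 100-point gre increase multiplies the odds by about 1.25, compared with about 2.18 for a full one-point gpa increase, which is worth keeping in mind when comparing the variables.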

> ## Question 7.  Discussion
> Write up your discussion and future steps.

Answer: Our problem statement asked whether any of the variables in our dataset has a significant influence on an applicant's admission into UCLA Graduate School. As discussed in our results, the logistic regression shows that all of the variables are significant, and that none should be discounted. However, gpa had the highest odds ratio, which indicates that as gpa increases, the odds of admission go up.

Given these results, additional analysis should be conducted. We need to see if other models show similar significance for gpa, or if they show it for other variables. Pinpointing one or more variables would benefit students and graduate school advisors, who could then focus strategically on the aspects of the application most likely to help with admission.

Additionally, we only had access to one dataset for this analysis, and it did not indicate a time period. In order to establish a strong hypothesis, we would need multiple datasets like this one, from consecutive years of UCLA Graduate School applicants and admissions results. With multiple datasets, we could compare variable significance by year, and then look at trends in the data. Once this analysis is completed, we can draw better-informed conclusions about which variables matter the most in the admissions process.