# Project 1

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

### Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and activity score: 1 = active user, 0 = inactive user), based on Hooli data from Jan-Apr 2015.


#### 1. What is the outcome?

Answer: The outcome will tell us how likely it is that a free-tier customer will become a paying customer (or, the expected conversion rate from free-tier to paying customer).

#### 2. What are the predictors/covariates? 

Answer: The predictors are all the demographic and customer usage data in the system:
- Age
- Gender
- Location
- Profession
- Days since last log in
- Activity score (whether the user is active or inactive)

#### 3. What timeframe is this data relevant for?

Answer: This data is relevant for January 2015 to April 2015. 

#### 4. What is the hypothesis?

Answer: Since we know which users are active and which aren't, we can predict that the users who are active (activity score = 1) will be most likely to convert to paying customers. 

## Let's get started with our dataset

In [24]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

df = pd.read_csv("../assets/admissions.csv")
df.head()

Unnamed: 0 | admit | gre | gpa | prestige
---| ---| ---| ---| ---
0 | 0 | 380.0 | 3.61 | 3.0
1 | 1 | 660.0 | 3.67 | 3.0
2 | 1 | 800.0 | 4.0 | 1.0
3 | 1 | 640.0 | 3.19 | 4.0
4 | 0 | 520.0 | 2.93 | 4.0


#### 1. Create a data dictionary 

Answer: 

Variable | Description | Type of Variable
---| ---| ---
Admit | 0 = not admitted 1 = admitted | categorical
GRE | GRE Score | continuous 
GPA | Academic GPA 0.0 - 4.0 | continuous 
Prestige | 1 = Not prestigious, 2 = Somewhat prestigious, 3 = Prestigious, 4 = Very prestigious | categorical 


We can explore the association between any of the predictors and a student's admission status (i.e. do higher or lower GRE scores lead to admission? Does a higher undergraduate GPA lead to admission?). The more interesting question here that may not be as obvious is whether there is an association between the prestige of the undergraduate school of the student and that student's admission into graduate school. 

#### 2. What is the outcome?

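Answer: The outcome is whether a student is admitted to graduate school (admit: 1 = admitted, 0 = not admitted). We want to determine how the likelihood of admission varies with the prestige of the student's undergraduate university.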

#### 3. What are the predictors/covariates? 

Answer: The predictor in this specific question is the prestige of the undergraduate university a student attended -- the data set also contains two other predictors, the student's GRE score and GPA.

#### 4. What timeframe is this data relevant for?

Answer: Since there isn't a date of acceptance or some other time stamp associated with each data point in this set, we don't have a clear timeframe. We'd need to establish one before we started the analysis to make sure our data wasn't out-of-date or in some other way unfit for use in answering the question.

#### 5. What is the hypothesis?

Answer: The hypothesis is that students who attended an undergraduate university with higher prestige will be more likely to be accepted to graduate school.

Using the above information, write a well-formed problem statement.

Determine if there is an association between the prestige of a student's undergraduate university and their acceptance to graduate school, using the student information provided in the admissions data set.

## Problem Statement

### Exploratory Analysis Plan

Using the lab from a class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

Answer: Exploratory analysis helps you summarize and understand your data. This early analysis will allow you to determine if there's anything missing from the data, if there are any mistakes in the dataset, and if any of the data is already aggregated or needs to be aggregated prior to analysis. It should also provide familiarity with the variables and data types for the data set. 

#### 2a. What are the assumptions of the distribution of data? 

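Answer: We'll assume the continuous variables (GRE and GPA) are approximately normally distributed. Admit and prestige are categorical, so a normality assumption does not apply to them.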

#### 2b. How will you determine the distribution of your data? 

Answer: We can quickly plot the data to determine the distribution.
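As a rough sketch of that plotting step, the cell below draws a histogram for each continuous variable. To stay self-contained it rebuilds only the five rows shown by `df.head()` above; in the actual notebook you would use the full `df` loaded from the CSV. The bin count is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the full dataset: only the five rows printed by df.head().
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# One histogram per continuous variable, side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, ["gre", "gpa"]):
    # dropna() so missing values do not break the plot.
    df[col].dropna().plot.hist(ax=ax, bins=5, title=col)
fig.tight_layout()
```

A roughly symmetric, bell-shaped histogram would be consistent with the normality assumption; strong skew or multiple peaks would suggest revisiting it.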

#### 3a. How might outliers impact your analysis? 

Answer: Outliers could skew your data and any statistical analysis towards the outlier -- in the case of admissions, for example, a few very high GRE scores could skew the average GRE for certain prestige levels higher.

#### 3b. How will you test for outliers? 

Answer: We can use the plotted data from above to help determine outliers. We can also use a box and whisker plot, which would show us the median and general range of data.
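One possible way to make the box plot check numeric is the 1.5 × IQR rule, which is the convention box plot whiskers commonly follow. The helper `iqr_outliers` below is a hypothetical name introduced for this sketch, and the data frame again contains only the five rows shown by `df.head()`.

```python
import pandas as pd

# Stand-in for the full dataset: only the five rows printed by df.head().
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Flag candidate outliers in each continuous variable.
gre_outliers = iqr_outliers(df["gre"])
gpa_outliers = iqr_outliers(df["gpa"])
```

Values flagged this way are candidates to inspect, not automatic deletions; a very high GRE score may be perfectly valid.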

#### 4a. What is colinearity? 

Answer: Colinearity is when two independent variables, or predictors, are highly correlated (for example, GPA is likely to be colinear with GRE score). 

#### 4b. How will you test for colinearity? 

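Answer: Based on some quick research, numpy has a built-in function, numpy.corrcoef, that returns the matrix of Pearson correlation coefficients between variables. (numpy.correlate, despite its name, computes the cross-correlation of two one-dimensional sequences and does not produce a correlation matrix.) Pairs of predictors with coefficients close to 1 or -1 are candidates for colinearity.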
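A minimal sketch of the correlation check, assuming the predictors are the `gre`, `gpa`, and `prestige` columns: pandas' `DataFrame.corr` gives the full pairwise matrix, and `numpy.corrcoef` gives the same Pearson coefficient for a single pair. Only the five rows shown by `df.head()` are used here, so the numbers are illustrative, not results for the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the full dataset: only the five rows printed by df.head().
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Pairwise Pearson correlations between the predictors.
corr_matrix = df[["gre", "gpa", "prestige"]].corr()

# The same coefficient for one pair, taken from numpy's 2x2 matrix.
gre_gpa_r = np.corrcoef(df["gre"], df["gpa"])[0, 1]
```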

#### 5. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. 

Answer: 
- Check the data for any missing data or errors in reporting
- Determine whether the distribution of the data fits the normal distribution assumption
- Test for outliers 
- Identify any colinear variables that may affect the analysis
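The first step of the plan could be sketched as follows. The valid ranges come from the data dictionary above (GPA from 0.0 to 4.0, admit in {0, 1}, prestige in {1, 2, 3, 4}); the variable names are illustrative, and the data frame holds only the five rows shown by `df.head()`.

```python
import pandas as pd

# Stand-in for the full dataset: only the five rows printed by df.head().
df = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Rows whose values fall outside the ranges in the data dictionary.
# Missing values are excluded here because missing_counts already covers them.
bad_gpa = df[df["gpa"].notnull() & ~df["gpa"].between(0.0, 4.0)]
bad_admit = df[df["admit"].notnull() & ~df["admit"].isin([0, 1])]
bad_prestige = df[df["prestige"].notnull() & ~df["prestige"].isin([1, 2, 3, 4])]
```

Recording these counts alongside the date the file was read makes it easier for a colleague to confirm later that they are working from the same data.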

## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model

For this bonus question, we'll predict a new outcome: we'll determine the likelihood of graduate admission based on a student's GRE score (our primary predictor). The new hypothesis would be that students with higher GRE scores are more likely to be accepted into graduate school.

- New Analysis Method
The exploratory analysis for determining this new outcome won't be much different from the first analysis. We'll want to make sure there are no missing or incorrectly entered scores. We'll also want to determine the distribution of the data, check for outliers, and examine the colinearity of the predictors. In this case, GRE score could be highly colinear with GPA, so we'll want to take this into consideration when performing the analysis.


- New Problem Statement
Determine if there is an association between a student's GRE score and their acceptance to graduate school based on the information provided in the admissions data set.


- Assumptions & Risks
Because the GRE is a standardized test, there could be risks associated with assumptions about the results of the analysis (i.e. a higher GRE score may not be associated with a high GPA or undergraduate prestige -- a student may simply be a stronger test taker than others). Another assumption is that all graduate schools evaluate the GRE the same way, or consider it at all. Many schools no longer rely on the GRE alone (if they use it at all), so it's possible another test was a stronger predictor of acceptance than the GRE.