# Project 1

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

## Read and evaluate the following problem statement: 
Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last login, and activity score: 1 = active user, 0 = inactive user), based on Hooli data from Jan-Apr 2015.


#### 1. What is the outcome?

Answer: Whether a free-tier customer converts to a paying customer

#### 2. What are the predictors/covariates? 

Answer: age, gender, location, profession, days since last login, activity score

#### 3. What timeframe is this data relevant for?

Answer: January to April 2015

#### 4. What is the hypothesis?

Answer: It will be possible to predict which free tier customers will convert to paying customers based on demographic & usage data.

>**Comments**: could also pick a more specific angle.  E.g., users who were active more recently are more likely to convert

## Let's get started with our `admissions.csv` dataset

#### 1. Create a data dictionary 

Answer: 

Variable | Description | Type of Variable
---| ---| ---
Var 1 | 0 = not thing 1 = thing | categorical
Var 2 | thing in unit X | continuous 


In [1]:
import pandas as pd
df = pd.read_csv('assets/admissions.csv')

>**Comments**: the question wanted you to create a data dictionary for the admissions.csv data!

In [33]:
data_dict = pd.DataFrame(columns = ['Variable', 'Description', 'Type of Variable'])

In [36]:
data_dict['Variable'] = df.columns

In [40]:
# Print the pandas dtype of each column listed in the data dictionary
for x in data_dict['Variable']:
    print(df[x].dtype)

int64
float64
float64
float64


In [41]:
data_dict['Type of Variable'] = data_dict['Variable'].apply(lambda x: df[x].dtype)

In [43]:
data_dict

| | Variable | Description | Type of Variable |
|---|---|---|---|
| 0 | admit | | int64 |
| 1 | gre | | float64 |
| 2 | gpa | | float64 |
| 3 | prestige | | float64 |


In [44]:
descriptions = ['Boolean. 1 for admitted, 0 for not admitted', 'Applicant\'s GRE score', 'Applicant\'s GPA', 'Measure of prestige of undergrad institution, 1-4']

In [45]:
data_dict['Description'] = descriptions

In [46]:
data_dict

| | Variable | Description | Type of Variable |
|---|---|---|---|
| 0 | admit | Boolean. 1 for admitted, 0 for not admitted | int64 |
| 1 | gre | Applicant's GRE score | float64 |
| 2 | gpa | Applicant's GPA | float64 |
| 3 | prestige | Measure of prestige of undergrad institution, 1-4 | float64 |
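The dictionary above records the pandas dtype of each column, while the template asks for a statistical type (categorical or continuous). One rough way to fill that in is to count distinct values per column. The sketch below is only a heuristic: the helper `classify_variable`, its `max_categories` threshold, and the sample values are all hypothetical and not part of the original notebook.

```python
import pandas as pd

def classify_variable(series, max_categories=5):
    """Heuristic (an assumption, not a rule): treat a column with few
    distinct values as categorical, otherwise as continuous."""
    n_unique = series.dropna().nunique()
    return "categorical" if n_unique <= max_categories else "continuous"

# Made-up illustrative rows; these are NOT values from admissions.csv
sample = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1, 0, 0, 1],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0, 760.0, 560.0, 400.0],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0, 1.0, 2.0],
})

# Map each column name to its heuristic statistical type
stat_types = {col: classify_variable(sample[col]) for col in sample.columns}
print(stat_types)
```

A threshold like this should be checked by eye: a column such as `prestige` is stored as `float64` but takes only the values 1-4, so its dtype alone would suggest the wrong statistical type.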


We would like to explore the association between X and Y 

#### 2. What is the outcome?

Answer: admission status

#### 3. What are the predictors/covariates? 

Answer: GRE, GPA, and prestige

#### 4. What timeframe is this data relevant for?

Answer: The dataset does not specify a timeframe

#### 5. What is the hypothesis?

Answer: Presumably, that a statistical model will be able to predict admission status based on an applicant's GRE score, GPA, and the prestige of their undergraduate institution

>**Comments**: again, could be more specific. E.g., students with higher undergrad GPAs are more likely to be admitted to grad school

#### 6. Using the above information, write a well-formed problem statement. 


Answer: Determine which grad school applicants were admitted (admitted = 1, not admitted = 0) based on their GRE scores, GPA, and the prestige level of their undergraduate institution.

### Exploratory Analysis Plan

Using the lab from a class as a guide, create an exploratory analysis plan. 

#### 1. What are the goals of the exploratory analysis? 

Answer: To look at attributes of the data set to get a better idea of how to proceed with a potential model.

#### 2a. What are the assumptions regarding the distribution of the data? 

Answer: Theoretically, I would assume the data is totally random. Practically, I would start with the assumption that it has a normal distribution.

#### 2b. How will you determine the distribution of your data? 

Answer: I would look at the skew and kurtosis of each variable to see whether they are close to the values for a normal distribution (skew = 0; kurtosis = 3, i.e. excess kurtosis = 0)

>**Comments**: you could also plot histograms
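As a minimal sketch of the skew and kurtosis check described above, pandas exposes `DataFrame.skew()` and `DataFrame.kurt()`. Note that `kurt()` reports excess kurtosis, so a normal distribution scores near 0 rather than 3. The helper name `distribution_summary` and the sample values are hypothetical.

```python
import pandas as pd

def distribution_summary(df):
    """Return skewness and excess kurtosis for each numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "skew": numeric.skew(),             # 0 for a symmetric distribution
        "excess_kurtosis": numeric.kurt(),  # 0 for a normal distribution
    })

# Made-up illustrative values; these are NOT taken from admissions.csv
sample = pd.DataFrame({
    "gre": [400.0, 500.0, 600.0, 700.0, 800.0],
    "gpa": [2.9, 3.1, 3.4, 3.6, 4.0],
})
print(distribution_summary(sample))

# For the visual check suggested in the comment, df.hist(bins=20)
# draws one histogram per numeric column (requires matplotlib).
```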

#### 3a. How might outliers impact your analysis? 

Answer: Outliers can skew the data by pulling the mean toward them. Their effect can be controlled either by excluding values that lie more than a certain number of standard deviations from the mean, or by studying both the mean and the median, since the median is far less sensitive to outliers

#### 3b. How will you test for outliers? 

Answer: I will check for values more than three standard deviations from the mean of each variable (using the `mean` and `std` functions in numpy/pandas)
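The three-standard-deviation check could be sketched as follows. The helper name `flag_outliers`, its `n_std` parameter, and the sample values are hypothetical. pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which differs slightly from numpy's default of `ddof=0`.

```python
import pandas as pd

def flag_outliers(series, n_std=3.0):
    """Return a boolean mask marking values more than n_std sample
    standard deviations away from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return z_scores.abs() > n_std

# Made-up illustrative values with one obvious data-entry error (30.0)
gpa = pd.Series([3.0] * 19 + [30.0])
mask = flag_outliers(gpa)
print(gpa[mask])
```

One caveat about this rule: in a very small sample no point can lie three standard deviations from the mean, because the outlier itself inflates the standard deviation, so the check is only informative with a reasonable number of rows.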

#### 4. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. 

Answer: 

Find the mean, mode, median, and standard deviation of GRE, GPA, and prestige, as preliminary descriptive data of those sets.

Plot each predictor/covariate against admission data and determine the correlation coefficient, to determine how linear the relationship is between the data sets, and which of the three has the strongest linear relationship with admission data.

Use matrix multiplication to combine each set of two predictors/covariates and determine the correlation coefficient between the combined predictors/covariates and the outcome (admissions), and compare to the correlation coefficients for each individual predictor/covariate.

Finally, use matrix multiplication to combine all three predictors/covariates, and determine the relationship between the combined predictor/covariate set and the outcome.

The relationships between single and combined predictors/covariates should give a good idea of where to start with creating a predictive model.

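The pandas `DataFrame.describe()` method can be used for the initial descriptive statistics (it reports the mean, standard deviation, and the median as the 50% quantile, but not the mode), and `DataFrame.corr()` can be used for the correlation coefficients.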
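A minimal sketch of the first two steps of the plan, assuming the outcome column is named `admit` as in the data dictionary. The helper name `explore` and the sample values are hypothetical. The correlation is the Pearson coefficient, which is pandas' default for `corr()`.

```python
import pandas as pd

def explore(df, outcome="admit"):
    """Return descriptive statistics, the first mode of each column, and
    each predictor's Pearson correlation with the outcome."""
    numeric = df.select_dtypes(include="number")
    summary = numeric.describe()   # count, mean, std, min, quartiles, max
    modes = numeric.mode().iloc[0] # first (smallest) mode of each column
    # Correlation of every predictor with the outcome, outcome row dropped
    correlations = numeric.corr()[outcome].drop(outcome)
    return summary, modes, correlations

# Made-up illustrative rows; these are NOT values from admissions.csv
sample = pd.DataFrame({
    "admit": [0, 0, 1, 1],
    "gre": [400.0, 500.0, 600.0, 700.0],
    "gpa": [3.0, 3.2, 3.1, 3.8],
})
summary, modes, correlations = explore(sample)
print(summary.loc[["mean", "50%", "std"]])
print(correlations.sort_values(ascending=False))
```

Because `admit` is binary, a Pearson coefficient against it only gives a rough ranking of the predictors, so the result is best read as a starting point for modeling rather than a conclusion.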

## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model

## Feedback


| Requirements | Incomplete (0) | Does Not Meet Expectations (1) | Meets Expectations (2) | Exceeds Expectations (3) |
|---|---|---|---|---|
| Create a data dictionary with classification of available variables | X| | | |
| Correctly identify features of the dataset, including the outcome and covariates/predictors | | | X| |
| Write a high-quality problem statement | | | X| |
| State the risks and assumptions of your data | | | X| |
| Outline exploratory analysis methods | | | |X |