# DS-SF-30 | Unit Project 1: Research Design

## Part A.  Evaluate the following problem statement:

> "Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log in, and `activity score 1 = active user`, `0 = inactive user`) based on Hooli data from January-April 2015."


> ### Question 1.  What is the outcome?

Answer: Converted to paying customer or not 

> ### Question 2.  What are the predictors/covariates?

Answer: Age, gender, location, profession, days since last log in, and activity score

> ### Question 3.  What timeframe is this data relevant for?

Answer: January - April 2015

> ### Question 4.  What is the hypothesis?

Answer: Demographic and customer usage data can be used to predict whether a free-tier customer will convert to be a paying customer.

## Part B.  Let's start exploring our UCLA dataset and answer some simple questions:

In [2]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', '..', 'dataset', 'dataset-ucla-admissions.csv'))

df.head()

Unnamed: 0 | admit | gre | gpa | prestige
---|---|---|---|---
0 | 0 | 380.0 | 3.61 | 3.0
1 | 1 | 660.0 | 3.67 | 3.0
2 | 1 | 800.0 | 4.0 | 1.0
3 | 1 | 640.0 | 3.19 | 4.0
4 | 0 | 520.0 | 2.93 | 4.0


> ### Question 5.  Create a data dictionary.

Answer:

Variable | Description | Type of Variable
---|---|---
GRE score | Score on Graduate Record Examination, integer from 0 to 800 | Continuous
GPA score | Grade Point Average, number from 0 to 4.0 | Continuous
prestige | Integer score from 1 (highest prestige) to 4 (lowest) | Categorical
admit | binary, 1 = admitted, 0 = rejected | Categorical
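
One way I might sanity-check the "Type of Variable" column is to look at the storage type and the number of distinct values in each column. This is only a sketch: `sample` is a hypothetical name, and it holds just the five rows shown by `df.head()` above, retyped by hand, so the counts would differ on the full dataset.

```python
import pandas as pd

# ASSUMPTION: only the five rows shown by df.head() above, retyped by hand.
# On the real data, use the DataFrame returned by pd.read_csv instead.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Storage type of each column (integer or float).
print(sample.dtypes)

# Number of distinct values per column: a column with only a handful of
# distinct values is a candidate for being treated as categorical.
distinct = sample.nunique()
print(distinct)
```

A small number of distinct values does not prove a variable is categorical, but it is a quick hint about which columns to look at more closely.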

We would like to explore the association between Admission decision and the above academic record data.

> ### Question 6.  What is the outcome?

Answer: Admission decision (`1` for admitted, `0` for denied admission)

> ### Question 7.  What are the predictors/covariates?

Answer: GRE (Graduate Record Examination) score, GPA (Grade Point Average) score, and prestige of alma mater

> ### Question 8.  What timeframe is this data relevant for?

Answer: The data is hypothetical and was generated for the purpose of the R tutorial, thus it is not relevant for any timeframe. 

Source: http://www.ats.ucla.edu/stat/sas/dae/logit.htm

> ### Question 9.  What is the hypothesis?

Answer: Past academic data will allow us to predict whether a UCLA grad school applicant will be admitted. 

> ### Question 10.  What's the problem statement?

> Using your answers to the above questions, write a well-formed problem statement.

Answer: "Determine which applicants will be admitted to UCLA for graduate school, using academic record data included with the application (GRE score, GPA, and prestige of alma mater) based on data from UCLA's Logit Regression in R tutorial."

## Part C.  Create an exploratory analysis plan by answering the following questions:

Because the answers to these questions haven't been covered in class yet, this section is optional.  This is by design: guessing or looking around for these answers will help the material make sense once we cover it in class.  You will not be penalized for wrong answers, but we encourage you to give it a try!

> ### Question 11. What are the goals of the exploratory analysis?

Answer: The goals of the exploratory analysis are to understand my data well enough to feel comfortable moving forward with building a predictive model. The steps include getting an idea of the range and distribution of each variable, exploring missing data (if any) and its possible implications, exploring any outliers and possible explanations for their extreme values, and deciding whether any data points should be left out of my model (because of missing values or because they are outliers). I will also plot pairs of variables against each other (one as Y, one as X) to see whether any pair seems to be correlated, and calculate correlation coefficients for pairs that look promising.

> ### Question 12.  What are the assumptions of the distribution of data?

Answer: I assume that the scores (`gpa` and `gre`) will be positive and fall within their expected ranges (0 to 4.0 and 0 to 800, respectively), and that each is distributed normally or with a slight skew. The `admit` variable is assumed to be binary, and 0 will likely be a more common occurrence than 1, but this is a guess. The `prestige` variable is assumed to be an integer from 1 to 4 (inclusive), likely with a normal or skewed distribution.

The risks of the data are that there might not be enough data points (or enough of one value for admit) to properly train a model, that the list of features is too small to predict admission outcomes, or the presence of outliers and/or missing data might interfere with getting enough clean data for the model. 
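
A minimal sketch of how I might check these assumptions in code. The expected ranges are the assumptions stated in this answer, not facts taken from the data source, and `sample` is a hypothetical name holding only the five rows shown by `df.head()` above.

```python
import pandas as pd

# ASSUMPTION: only the five rows shown by df.head() above, retyped by hand.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# One boolean Series per assumption; True means the row satisfies it.
# Series.between is inclusive on both ends by default.
# Missing values evaluate to False, so they are counted as violations here.
checks = {
    "gre in [0, 800]": sample["gre"].between(0, 800),
    "gpa in [0, 4.0]": sample["gpa"].between(0, 4.0),
    "admit is 0 or 1": sample["admit"].isin([0, 1]),
    "prestige is 1 to 4": sample["prestige"].isin([1, 2, 3, 4]),
}

# Number of rows violating each assumption.
violations = {name: int((~ok).sum()) for name, ok in checks.items()}
print(violations)

# Share of each admit value, to see how balanced the outcome is.
admit_share = sample["admit"].value_counts(normalize=True)
print(admit_share)
```

On the full dataset, any nonzero entry in `violations` would point to rows worth inspecting before modeling.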

> ### Question 13.  How will you determine the distribution of your data?

Answer: I will plot a histogram and make a boxplot for each of the variables in my dataset, and use those graphs to get a sense of how the values of each of the variables are distributed.
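
As a rough sketch, the numbers behind a histogram and a boxplot can be computed directly, without drawing anything. The bin edges below are my own arbitrary choice, and `sample` is a hypothetical name holding only the five rows shown by `df.head()` above.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: only the five rows shown by df.head() above, retyped by hand.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Histogram counts for GRE in 100-point bins (bin edges chosen arbitrarily).
# Each bin includes its left edge; only the last bin includes its right edge.
counts, edges = np.histogram(sample["gre"], bins=[300, 400, 500, 600, 700, 800])
print(dict(zip(edges[:-1].tolist(), counts.tolist())))

# Five-number summary (min, Q1, median, Q3, max): the values a boxplot draws.
five_num = sample["gre"].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
```

With matplotlib installed, `sample.hist()` and `sample.boxplot()` draw the corresponding plots for every numeric column.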

> ### Question 14.  How might outliers impact your analysis?

Answer: Outliers, if present, might be errors (for instance, a value outside the possible range for a variable), or they might be valid data points that represent an extraordinary circumstance. If they are judged unrepresentative of this data set and excluded from the analysis, I will have fewer data points for my predictive model; if they are included, they might dominate the results. I might try running the model with and without the outliers to see how much they skew the results. A large difference would be a warning sign, but if the outliers are just a few data points out of many, I do not expect that to be the case.
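
To illustrate how a single extreme value can dominate a summary, here is a small sketch comparing the mean and median of GRE with and without one bad value. The value 1500 is hypothetical and added only for illustration; the other five scores are the ones shown by `df.head()` above.

```python
import pandas as pd

# The five GRE scores shown by df.head() above.
gre = pd.Series([380.0, 660.0, 800.0, 640.0, 520.0])

# HYPOTHETICAL: one impossible score appended to mimic a data-entry error.
gre_with_outlier = pd.concat([gre, pd.Series([1500.0])], ignore_index=True)

# Compare a statistic that is sensitive to outliers (mean)
# with one that is robust to them (median).
comparison = pd.DataFrame(
    {
        "without outlier": [gre.mean(), gre.median()],
        "with outlier": [gre_with_outlier.mean(), gre_with_outlier.median()],
    },
    index=["mean", "median"],
)
print(comparison)
```

In this toy example the mean moves from 600 to 750 while the median only moves from 640 to 650, which is the kind of shift I would look for when comparing results with and without outliers.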

> ### Question 15.  How will you test for outliers?

Answer: I will compute the inter-quartile range (IQR) for each of the variables in the dataset and calculate the range of "typical" values, which runs from 1.5 * IQR below the 1st quartile to 1.5 * IQR above the 3rd quartile. Any data point outside that range is flagged as an outlier. I will also check whether any data points fall outside the expected range for each variable, which might indicate a bad data point.
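
The IQR rule described above could be sketched as follows. `iqr_outliers` is a hypothetical helper name, and the data is the five rows shown by `df.head()` above plus one hypothetical bad row (`gre` of 1500) added only so that the check has something to flag.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# ASSUMPTION: the first five rows come from df.head() above;
# the sixth row is HYPOTHETICAL and exists only to illustrate the check.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0, 1500.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, 3.5],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

gre_outliers = iqr_outliers(sample["gre"])
gpa_outliers = iqr_outliers(sample["gpa"])
print(gre_outliers)
print(gpa_outliers)
```

Here only the hypothetical `gre` value of 1500 is flagged, and no `gpa` value is. The returned Series keeps the original row index, so flagged rows can be looked up in the full DataFrame.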

> ### Question 16.  What is colinearity?

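Answer: Colinearity is a state where two predictor variables are strongly linearly related, so that a change in one is accompanied by a predictable (close to proportional) change in the other. This makes it hard to separate the individual effect of each variable in a model.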

> ### Question 17.  How will you test for covariance?

Answer: I will calculate the correlation coefficient for any pair of variables that, when graphed together, seem (visually) to have some degree of colinearity. The correlation coefficient is the covariance rescaled by the two standard deviations, so it gives a unit-free quantitative measure of how the pair varies together. I will also make a correlation matrix to get a full picture of how all of the variables are related.
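
A minimal sketch of the covariance and correlation matrices in pandas, assuming the same column names as the dataset. `sample` is a hypothetical name holding only the five rows shown by `df.head()` above, so the actual values say nothing reliable about the full data.

```python
import pandas as pd

# ASSUMPTION: only the five rows shown by df.head() above, retyped by hand.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Pairwise covariance (depends on the units of each variable).
cov = sample.cov()

# Pairwise Pearson correlation (unit-free, between -1 and 1).
corr = sample.corr()
print(corr)

# Correlation is covariance divided by the product of the standard deviations.
manual = cov.loc["gre", "gpa"] / (sample["gre"].std() * sample["gpa"].std())
print(manual, corr.loc["gre", "gpa"])
```

The last two printed numbers agree, which shows why the correlation matrix is the more convenient of the two for comparing pairs measured on different scales.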

> ### Question 18.  What is your exploratory analysis plan?

> Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis one year from now.

Answer: My exploratory analysis plan is to follow the data exploration steps below:

>1) Print summary information (`df.info()` and `df.describe()`) to see how many data points we have, how many values are missing in each column, and the basic statistics for each variable

>2) Plot histograms and boxplots for the variables to see the distributions and any outliers we might have

>3) Look at any outliers and determine the best action: leaving them in if they seem significant, or removing them if they seem unrepresentative of the data or might be the result of an error

>4) Look at data points where any values are missing and determine if there's anything to be done besides dropping those rows. If not, drop those rows

>5) Plot variables against each other to get a sense of colinearity between them

>6) Create correlation matrix

>7) Decide which variables should be features in the model based on the correlations. With so few variables in this data set I will likely keep all of them, but in general some variables might be dropped at this step.
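
Steps 1 and 4 of the plan could be sketched like this. `sample` is a hypothetical name: its first five rows are the ones shown by `df.head()` above, and the sixth row, with a missing `gpa`, is hypothetical and added only to show how missing values surface.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: the first five rows come from df.head() above;
# the sixth row is HYPOTHETICAL and has a deliberately missing gpa.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0, 1],
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0, 700.0],
    "gpa": [3.61, 3.67, 4.0, 3.19, 2.93, np.nan],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# Step 1: row count, dtypes, and non-null counts, then basic statistics.
sample.info()
summary = sample.describe()
print(summary)

# Step 4: count missing values per column and inspect the affected rows.
missing_per_column = sample.isnull().sum()
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(missing_per_column)
print(rows_with_missing)

# If nothing better can be done, drop the rows that have any missing value.
clean = sample.dropna()
print(len(sample), "rows before,", len(clean), "rows after dropping")
```

Keeping `rows_with_missing` around before calling `dropna()` makes it possible to record exactly which rows were removed, which helps with reproducing the analysis later.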