# <p style="text-align: center;"> Unit Project 1: Research Design Write-Up </p>

---

# Part A. Deconstructing Problem Statements

## Problem Statement

Determine which free-tier customers will convert to paying customers, using demographic data collected at signup (age, gender, location, and profession) and customer usage data (days since last log-in, and activity score, where 1 = active user and 0 = inactive user), based on Hooli data from January-April 2015.

### business objectives
To better understand which segment of customers is most likely to convert to a paid subscription, so that marketing resources can be focused where they are most effective.

### research goals (outcome)
To determine the association between customer demographic and usage factors and conversion to the paid Hooli service.

>**predictors & covariates**
>* age
>* gender
>* location
>* profession
>* days since last log-in
>* activity score

> **timeframe:** January - April 2015 <p>

### hypothesis
Customers with higher activity scores are more likely to convert to the paid service.

### <p style="text-align: center;"> ---------------------- </p>

# Part B. UCLA Admissions Research

## acquire the data

In [3]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('..', 'dataset', 'ucla-admissions.csv'))

df.head()

| | admit | gre | gpa | prestige |
|---| ---| ---| ---| ---|
| 0 | 0 | 380.0 | 3.61 | 3.0 |
| 1 | 1 | 660.0 | 3.67 | 3.0 |
| 2 | 1 | 800.0 | 4.0 | 1.0 |
| 3 | 1 | 640.0 | 3.19 | 4.0 |
| 4 | 0 | 520.0 | 2.93 | 4.0 |


## parse the data

###  data dictionary
Variable | Description | Type of Variable
---| ---| ---
admit |admitted(1), not admitted(0)  | binary
gre | applicant's GRE score, integer | continuous
gpa | applicant's GPA, decimal | continuous
prestige | prestige of applicant's alma mater, from 1 (highest tier, most prestigious) to 4 (lowest tier, least prestigious) | categorical

### distribution of the data

In [5]:
df.describe()



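| | admit | gre | gpa | prestige |
|---| ---| ---| ---| ---|
| count | 400.0 | 398.0 | 398.0 | 399.0 |
| mean | 0.3175 | 588.040201 | 3.39093 | 2.486216 |
| std | 0.466087 | 115.628513 | 0.38063 | 0.945333 |
| min | 0.0 | 220.0 | 2.26 | 1.0 |
| 25% | 0.0 | | | |
| 50% | 0.0 | | | |
| 75% | 1.0 | | | |
| max | 1.0 | 800.0 | 4.0 | 4.0 |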


> **Summary:**
* The range of values in each column fits within what you would expect (no negative numbers, and no numbers above the GPA or GRE score maximums)
* There are some null values in the gre, gpa, and prestige columns: of the 400 rows, only 398, 398, and 399 respectively have a value
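
The null-value and range checks summarized above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual cell: the frame `sample` holds the five rows shown by `df.head()` plus one hypothetical row with missing values added for demonstration, and the bounds (200 to 800 for GRE, 0 to 4 for GPA) are assumptions about the scoring scales.

```python
import numpy as np
import pandas as pd

# Toy frame: the five rows from df.head(), plus one HYPOTHETICAL row
# with missing gre/gpa values, added only to illustrate the check.
sample = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1],
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0, np.nan],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, np.nan],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0, 2.0],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Range check on the non-null values (bounds are assumed, see above)
gre_in_range = sample["gre"].dropna().between(200, 800).all()
gpa_in_range = sample["gpa"].dropna().between(0, 4).all()
print(gre_in_range, gpa_in_range)
```

On the full dataset, the same two calls would be applied to `df` instead of `sample`.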

## explore the data

### research goals
To determine the association between an applicant's GPA, the prestige of their alma mater, and their GRE scores, and the likelihood of admission to UCLA.

>**predictors & covariates**
* GRE scores
* GPA
* Prestige of alma mater

> **timeframe:** unclear<p>

## hypothesis

The prestige of an applicant's alma mater will have the highest impact on their likelihood of gaining admission to the UCLA graduate school program.
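
A first, informal look at this hypothesis could compare admission rates across prestige tiers. The sketch below uses only the five rows shown by `df.head()`, which is far too few to draw any conclusion; it only shows the shape of the calculation. The names `sample` and `summary` are illustrative.

```python
import pandas as pd

# The five rows displayed by df.head(); NOT the full 400-row dataset
sample = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0],
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Because admit is coded 0/1, its mean within a group is the admission rate.
# The count column shows how many applicants each rate is based on.
summary = sample.groupby("prestige")["admit"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to each rate makes it easier to see when a tier has too few applicants for its rate to be trusted.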

## problem statement

Determine the likelihood that an applicant will gain admission to UCLA's graduate school, using application data (GRE score, GPA, and prestige of the applicant's alma mater), based on UCLA admissions data.

>**assumptions & risks**<p>
>&nbsp;&nbsp;&nbsp;(a) we have no other information about the applicants that may have affected their chances of admission (full application, the degree program they applied to, how competitive that program is, etc.) <p>
>&nbsp;&nbsp;&nbsp;(b) the data set only represents 400 applicants, and the timing of their applications is unclear <p>
>&nbsp;&nbsp;&nbsp;(c) it is unclear how prestige is calculated

### <p style="text-align: center;"> ---------------------- </p>

# Part C. Exploratory Analysis Plan

## goals
To better understand:
>* the types of data in the dataframe
>* the kind of data munging that is required
>* the distribution of the data
>* any outliers that might impact analysis

## distribution & outliers

### distribution
>**assumptions**
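>* the data is approximately normally distributed, with most applicants having GRE scores and GPAs near the average and fewer applicants having very high or very low scores
>* applicants from more prestigious alma maters will tend to have higher GRE scores and GPAs (numerically this is a negative correlation with the prestige column, since 1 is the most prestigious tier)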

>**testing**
>* describe the data to identify the min, max, mean, and standard deviation
>* visualize the data to better understand the distribution
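
The testing steps above can be sketched without a plotting library by combining `describe()` with a skewness value and binned counts, which act as a text-only histogram. The series below is just the five GRE values from `df.head()`, and the bin edges are an arbitrary choice for illustration.

```python
import pandas as pd

# GRE values from the five rows shown by df.head()
gre = pd.Series([380, 660, 800, 640, 520], name="gre", dtype=float)

# count, mean, std, min, quartiles, and max in one call
stats = gre.describe()

# Skewness is near 0 for a symmetric distribution,
# negative for a long left tail, positive for a long right tail
skewness = gre.skew()

# Text-only histogram: count the values in each equal-width bin
# (bin edges chosen arbitrarily for this illustration)
binned = pd.cut(gre, bins=[200, 400, 600, 800])
counts = binned.value_counts().sort_index()

print(stats, skewness, counts, sep="\n\n")
```

With a plotting library available, `gre.hist()` would give the same picture graphically.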

### outliers
>**impact**
>* outliers distort the mean and standard deviation far more than they distort the median
>* outliers can indicate bad data (for example, a GRE score of 1200 is impossible on a scale that tops out at 800, as it does in this dataset)

>**testing**
>* compare the mean to the median and mode of the data set
>* visualize the data
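>* apply a statistical rule of thumb, such as flagging values more than 1.5 × IQR beyond the quartiles or more than 3 standard deviations from the mean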
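
One common way to carry out these checks is the 1.5 × IQR rule, sketched below. The scores are hypothetical values made up for this example, with 1200 included as a deliberately bad entry; the 1.5 multiplier is a convention, not something specified in this write-up.

```python
import pandas as pd

# HYPOTHETICAL GRE scores; 1200 is a deliberately invalid value
gre = pd.Series([480, 520, 580, 600, 640, 660, 700, 1200], dtype=float)

# A mean well above the median hints at a high outlier
mean, median = gre.mean(), gre.median()

# 1.5 * IQR rule: flag values far outside the middle 50% of the data
q1, q3 = gre.quantile(0.25), gre.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = gre[(gre < lower) | (gre > upper)]

print(f"mean={mean}, median={median}")
print(f"bounds=({lower}, {upper})")
print(outliers)
```

Flagged values still need a judgment call: an impossible score is bad data, while an extreme but valid score may be worth keeping.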

## collinearity & covariance
### collinearity
Collinearity occurs when two predictor variables are highly correlated with each other. This makes it hard to separate their individual effects and can make a model's coefficient estimates unstable.
>example: in the case of the UCLA data, I would guess GPA and GRE scores are highly correlated 
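
A quick way to check that guess is a pairwise correlation matrix. The sketch below runs on the five rows from `df.head()` only, so the resulting number says nothing reliable about the full dataset. The 0.8 cutoff is an arbitrary assumption for illustration, not an established threshold.

```python
import pandas as pd

# The five rows displayed by df.head(); NOT the full dataset
sample = pd.DataFrame({
    "gre":      [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3.0, 3.0, 1.0, 4.0, 4.0],
})

# Pearson correlation between every pair of predictors
corr_matrix = sample.corr()
r_gre_gpa = corr_matrix.loc["gre", "gpa"]

# Arbitrary cutoff (an assumption) for calling a pair "highly correlated"
THRESHOLD = 0.8
is_collinear = abs(r_gre_gpa) > THRESHOLD

print(corr_matrix.round(3))
print(f"gre/gpa correlation: {r_gre_gpa:.3f}, collinear: {is_collinear}")
```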

### covariance
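Covariance measures how two variables vary together: it is positive when they tend to move in the same direction and negative when they move in opposite directions. It describes association, not the effect of one variable on the other, and its size depends on the units of both variables.
>testing: you can visualize the data, compute it from the formula, or use the pandas `DataFrame.cov()` method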
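
As a small sketch of the pandas route, the example below computes the covariance of GRE and GPA on the five rows from `df.head()` and then shows how it relates to correlation. Rescaling GRE by 100 is a made-up step, included only to show that covariance changes with units while correlation does not.

```python
import pandas as pd

# The five rows displayed by df.head(); NOT the full dataset
sample = pd.DataFrame({
    "gre": [380.0, 660.0, 800.0, 640.0, 520.0],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# Sample covariance matrix (pandas divides by n - 1)
cov_matrix = sample.cov()
cov_gre_gpa = cov_matrix.loc["gre", "gpa"]

# Correlation is covariance divided by both standard deviations
r_from_cov = cov_gre_gpa / (sample["gre"].std() * sample["gpa"].std())

# Rescaling gre changes the covariance but leaves the correlation alone
rescaled = sample.assign(gre=sample["gre"] / 100)
cov_rescaled = rescaled.cov().loc["gre", "gpa"]
r_rescaled = rescaled["gre"].corr(rescaled["gpa"])

print(cov_gre_gpa, cov_rescaled)
print(r_from_cov, r_rescaled)
```

Because covariance is unit-dependent, correlation is usually the easier number to compare across pairs of variables.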

## the plan



### (1) describe the data
* understand the basic distribution
* calculate the mean, mode, median, max, min, etc.

### (2) visualize the data
* plot the data to get a better understanding of the distribution, the relationship between various variables, and to spot any outliers

### (3) testing
* test for collinearity and covariance