### Read and evaluate the following problem statement: 
Determine which passengers will survive on the Titanic, using all available data (age, passenger class, fare, etc.).



# Project 1
## You may have to do some clever Google searches, Slack posting, and collaboration to find answers to these questions! (Post in Slack, talk with each other, ask me for help!)

In this first project you will create a framework to scope out data science projects. This framework will provide you with a guide to develop a well-articulated problem statement and analysis plan that will be robust and reproducible.

#### 1. What is the outcome? (Rather morbid, I realize this)

Answer: The outcome is whether a given Titanic passenger survived. The analysis aims to determine which passengers were more likely to survive, based on the traits/characteristics (e.g. age, gender, etc.) that we know about them.

#### 2. What are the predictors/covariates? 

Answer: The predictors/covariates include:
- Ticket class (1st, 2nd, 3rd)
- Gender
- Age in years	
- The total number of siblings/spouses the passenger had aboard the Titanic	
- The total number of parents & children the passenger had aboard the Titanic	
- The passenger fare	
- Port of Embarkation (Cherbourg, Queenstown, Southampton)

We also know each passenger's name, ticket number, and cabin number, but I don't expect to use those as predictors: they mostly identify individual passengers rather than describe a trait shared by groups of passengers.

#### 3. What timeframe is this data relevant for?

Answer: The timeframe for this data is April 15, 1912, the day the Titanic sank. While it may help us understand other shipwrecks/travel disasters, it only includes data for this single event.

#### 4. What is the hypothesis?

Answer: In "plain English", my null hypotheses are:

- Gender is unrelated to survival rate on the Titanic.
- Age is unrelated to survival rate on the Titanic.
- Passenger class (a proxy for socioeconomic status) is unrelated to survival rate on the Titanic.
- Passenger fare (a proxy for SES) is unrelated to survival rate on the Titanic. 
- Port of embarkation is unrelated to survival rate on the Titanic.
- The number of siblings/spouses a passenger had aboard the Titanic is unrelated to survival rate. 
- The number of parents/children a passenger had aboard the Titanic is unrelated to survival rate.

## Let's get started with our dataset

In [1]:
# Let's look at our data so we can create a data dictionary

import pandas as pd
import numpy as np

file_path = '/Users/thshih/ds_class/data_sets/'

titanic_df = pd.read_csv(file_path + "titanic_train.csv")
test_df = pd.read_csv(file_path + "titanic_test.csv")

In [2]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### 1. Create a data dictionary 

We would like to explore the association between X and Y 

Answer: 

Outcome/Covariate | Variable | Description | Type of Variable
--- | --- | --- | ---
outcome | Survived | 0 = No, 1 = Yes | categorical/discrete
neither | PassengerId | ID associated with the passenger | categorical/discrete
covariate | Pclass (ticket class) | 1 = 1st class, 2 = 2nd class, 3 = 3rd class | categorical/discrete
neither | Name | Passenger name | categorical/discrete
covariate | Sex | male/female | categorical/discrete (though the "gender studies" voice in me is saying "gender is a spectrum and there aren't just 2 binary genders :)". But, given the existing data set, categorical/discrete)
covariate | Age | Age in years | continuous (mostly recorded in whole years, but fractional values such as the minimum of 0.42 appear in the data)
covariate | SibSp | # of siblings / spouses on the Titanic | categorical/discrete
covariate | Parch | # of parents / children on the Titanic | categorical/discrete
neither | Ticket | Ticket number | categorical/discrete
covariate | Fare | Passenger fare | This one is tricky. In theory, there are infinitely many values between 0.01 [units of money] and 0.02 [units of money]. In practice, because of how we conceptualize money/currency, there __is__ a discrete, countable number of values between any two amounts (e.g. we don't think in fractions of the smallest unit of money, say a penny). __But also__, money has no upper limit (in theory it can go on to infinity), so its countable values aren't truly "limited", which is a key part of discrete/categorical variables. So, I'm going to say continuous (but continue to internally debate with myself).
neither | Cabin | Cabin number | categorical/discrete
covariate | Embarked | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) | categorical/discrete
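
To double-check the "Type of Variable" column, it can help to list each column's storage type, number of distinct values, and missing count. The sketch below is a minimal example, not the definitive approach: it rebuilds a few columns from the five rows printed by `titanic_df.head()` above (the names `sample_df` and `summary` are mine). On the real data, the same three calls would be run on `titanic_df`.

```python
import pandas as pd

# The five rows shown by titanic_df.head() above (subset of columns)
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# One row per column: storage type, number of distinct values, missing count
summary = pd.DataFrame({
    "dtype": sample_df.dtypes.astype(str),
    "n_unique": sample_df.nunique(),
    "n_missing": sample_df.isna().sum(),
})
print(summary)
```

A column with only a handful of distinct values (like `Pclass` or `Sex`) is a good candidate for "categorical", while a float column with many distinct values (like `Fare`) behaves more like a continuous one.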


#### 2. What is the outcome?

Answer: The outcome is whether or not a passenger survives. We can look at the outcome in tandem with the different variables we have to understand how certain variables impacted/influenced survival rate on the Titanic. 

#### 3. What are the predictors/covariates? 

Answer: Based on the above data, the predictors/covariates include:
- Ticket class (1st, 2nd, 3rd)
- Gender
- Age in years	
- The total number of siblings/spouses the passenger had aboard the Titanic	
- The total number of parents & children the passenger had aboard the Titanic	
- The passenger fare	
- Port of Embarkation (Cherbourg, Queenstown, Southampton)

We also know each passenger's name, ticket number, and cabin number, but I don't expect to use those as predictors: they mostly identify individual passengers rather than describe a trait shared by groups of passengers.

#### 4. What timeframe is this data relevant for?

Answer: The timeframe for this data is April 15, 1912, the day the Titanic sank. While it may help us understand other shipwrecks/travel disasters, it only includes data for this single event.

#### 5. What is the hypothesis?

Answer: In "plain English", my null hypotheses are:

- Gender is unrelated to survival rate on the Titanic.
- Age is unrelated to survival rate on the Titanic.
- Passenger class (a proxy for socioeconomic status) is unrelated to survival rate on the Titanic.
- Passenger fare (a proxy for SES) is unrelated to survival rate on the Titanic. 
- Port of embarkation is unrelated to survival rate on the Titanic.
- The number of siblings/spouses a passenger had aboard the Titanic is unrelated to survival rate. 
- The number of parents/children a passenger had aboard the Titanic is unrelated to survival rate.
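
One way to put a null hypothesis like "gender is unrelated to survival" to the test is a chi-square test of independence on a table of counts. The sketch below computes the statistic by hand with NumPy. The counts in `observed` are made up for illustration and are not taken from the Titanic data, and the variable names are my own.

```python
import numpy as np

# HYPOTHETICAL counts, for illustration only (not the real Titanic numbers)
# rows: female, male; columns: died, survived
observed = np.array([[20, 60],
                     [90, 30]])

# Counts we would expect in each cell if sex and survival were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()

# A 2x2 table has 1 degree of freedom; 3.841 is the 5% critical value
reject_null = chi2 > 3.841
print(round(chi2, 2), reject_null)
```

With these made-up counts the statistic is far above the critical value, so the null hypothesis of "no relationship" would be rejected. The same calculation could be repeated with a real table of counts for each categorical covariate.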

Using the above information, write a well-formed problem statement.


## Problem Statement

### Exploratory Analysis Plan

Using the lab from class as a guide, create an exploratory analysis plan.

#### 1. What are the goals of the exploratory analysis? 

Answer: The goals of this exploratory analysis are:
- Explore what a "typical" passenger aboard the Titanic was like (e.g. what was the distribution of ages? How many men / women? What were the ranges of fares, etc.)
- Take a deeper dive, and look at what a "typical" passenger who survived vs. didn't survive was like (e.g. was there a higher prevalence of men vs. women who survived? How did age influence survival, for both the very young and the very old? What were survival rates like for different fares / passenger classes?)
- Look at cross-sections of the data, e.g. what do survival rates look like by age & gender? By gender & class? We can't examine every combination of variables, and we need to watch out for "confounding" variables, but it's very possible that a combination of factors "tells the story" of who was most likely to survive.
- Build models that will help us test our above hypotheses about which variables do/don't have an impact on survival rate.
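
The survival-rate comparisons in the goals above can be sketched with a `groupby`. This is a minimal example that reuses the five rows printed by `titanic_df.head()` earlier, so the rates themselves mean very little; the variable names are my own, and on the real data `titanic_df` would take the place of `sample_df`.

```python
import pandas as pd

# The five rows shown by titanic_df.head() (only the columns needed here)
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate
rate_by_sex = sample_df.groupby("Sex")["Survived"].mean()

# Cross-section: survival rate for each combination of sex and passenger class
rate_by_sex_class = sample_df.groupby(["Sex", "Pclass"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```

Five rows are far too few to conclude anything; the point is only the shape of the calculation.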

#### 2a. What are the assumptions of the distribution of data? 

Answer: Our working assumption is that the continuous variables (e.g. age, fare) are approximately "normally" distributed. This assumption needs to be checked, because it affects how we interpret the statistics we will calculate from the data set:
- averages / "measures of center"
- standard deviation
- confidence intervals
- interquartile range / "outliers"

#### 2b. How will you determine the distribution of your data?

Answer: There are a couple of ways I can determine the distribution of my data:

1) Use the ".describe()" method to get the count, mean, standard deviation, median, and quartile values of each of the numeric covariates I'm looking to analyze
2) Use histogram plots to get a visual representation of how the data is distributed

In [4]:
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
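
In the summary above, the mean fare (32.20) is more than double the median (14.45), which hints that `Fare` is right-skewed rather than normal. A quick numeric check is to compare the mean with the median and to look at the sample skewness. The sketch below does this on just the five fares shown by `titanic_df.head()`; on the full data the same calls would be made on `titanic_df['Fare']`.

```python
import pandas as pd

# The five fares shown by titanic_df.head() above
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

mean_fare = fares.mean()
median_fare = fares.median()

# Sample skewness: roughly 0 for symmetric data, positive for a long right tail
skewness = fares.skew()

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```

A mean well above the median together with positive skewness suggests a long right tail, in which case the median and IQR are safer summaries than the mean and standard deviation.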


In [6]:
# Histogram of passenger class (pandas plotting requires matplotlib)
titanic_df['Pclass'].plot(kind='hist')

#### 3a. How might outliers impact your analysis? 

Answer: Outliers might positively or negatively skew my data, so that the mean is much higher or much lower than the median. If I use the mean as my "measure of center", this could lead me to think that the typical value of a covariate, such as age or fare, is much higher/lower than the median suggests, which could cause me to misrepresent which age/fare/family size, etc. was "most likely" to survive the Titanic sinking.

#### 3b. How will you test for outliers? 

Answer: Flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. Alternatively, flag values more than 2 standard deviations from the mean, depending on the "error rate" we are willing to accept.
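
The 1.5 × IQR rule (flag anything below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR) can be sketched as follows. The fares here are a small hypothetical sample that I put together for illustration (a few values are borrowed from the outputs above), not the real column.

```python
import pandas as pd

# HYPOTHETICAL sample of fares, for illustration only
fares = pd.Series([7.25, 7.925, 8.05, 13.0, 14.4542, 26.0, 31.0, 512.3292])

# Quartiles and interquartile range
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# "Fences" beyond which a value is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print(outliers.tolist())

# How much the flagged values move the mean
print(fares.mean(), fares.drop(outliers.index).mean())
```

Comparing the mean with and without the flagged values shows how strongly a single extreme fare can pull the "measure of center".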

#### 4a. What is collinearity?

Answer: Collinearity is when there is a strong linear relationship between two or more covariates/predictor variables. This can make it difficult to isolate the individual impact of each covariate, and including redundant variables can lead us to "overfit" our data. It can also make our coefficient estimates less reliable, with standard errors that are very large or erratic.
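
To make the definition concrete, one common first look at collinearity is a pairwise correlation matrix of the numeric covariates. The sketch below uses a small made-up table in which fare falls as the class number rises; the data, the variable names, and the 0.8 cutoff are all my own assumptions (the cutoff is a rule of thumb, not a fixed standard).

```python
import pandas as pd

# HYPOTHETICAL data, for illustration only
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare": [80.0, 60.0, 26.0, 13.0, 8.05, 7.925, 7.25, 7.75],
    "SibSp": [1, 0, 0, 1, 0, 1, 0, 0],
})

# Pearson correlation between every pair of columns
corr = toy_df.corr()

# List each pair once if its absolute correlation exceeds the cutoff
threshold = 0.8
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

A pair that shows up in `high_pairs` carries largely the same information, so a model that includes both may struggle to separate their individual effects.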

#### 4b. How will you test for collinearity?

Answer: 

#### 5. What is your exploratory analysis plan?
Using the above information, write an exploratory analysis plan that would allow you or a colleague to reproduce your analysis 1 year from now. 

Answer: 

## Bonus Questions:
1. Outline your analysis method for predicting your outcome
2. Write an alternative problem statement for your dataset
3. Articulate the assumptions and risks of the alternative model