## Getting started with data analysis in Python
Previously, I've only done data analysis in R. This is my first real attempt at doing something beyond linear regression and GLMs in Python.

The Titanic dataset describes the passengers on board the ill-fated Titanic when it sank over a century ago. Let's see if we can predict whether a passenger survived based on the other characteristics we know about them.

Let's use pandas to import the classic Titanic dataset, then print the column names and the top 7 rows as a sanity check and a quick look at the data we're importing.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('C:/Users/vlee/PycharmProjects/Jupyter-Notebooks/Kaggle/Titanic/Data/train.csv', index_col=None)
# Originally, I had issues importing the data because the first column was not being recognized
# The older pd.DataFrame.from_csv used the first column of the file as the index by default (index_col=0); it was removed in pandas 1.0
# index_col=None (the default for pd.read_csv) tells pandas that the first column holds actual data

print(train.columns.values)
train.head(n=7)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
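To make the effect of `index_col` concrete, here is a small sketch that parses the same text twice with `pd.read_csv`. The inline CSV is a made-up three-row excerpt modeled on the rows above, not the real file, and the variable names are only for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for train.csv
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# index_col=None: every column is data; pandas adds its own 0..n-1 index
as_data = pd.read_csv(io.StringIO(csv_text), index_col=None)

# index_col=0: the first column (PassengerId) becomes the row index instead
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(as_data.columns))
print(list(as_index.columns))
print(as_index.index.name)
```

In the second frame PassengerId is no longer listed among the columns, which is the "first column not being recognized" symptom.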


It looks like some of the columns may be unsuitable for prediction. Let's see what the columns are actually representing. Here's an explanation of the variables taken from Kaggle.


## Data Dictionary
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
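
The age convention in the notes above can be turned into two boolean flags. This is only a sketch: the ages below are hypothetical values chosen to hit each case, and the flag names are my own.

```python
import pandas as pd

# Hypothetical ages: an infant, a known age, two estimated ages, a missing age
ages = pd.Series([0.42, 22.0, 28.5, 40.5, None], name="Age")

# Infants: fractional ages below 1
is_infant = ages < 1

# Estimated ages: one year or older with a fractional part of exactly .5
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(pd.DataFrame({"Age": ages, "infant": is_infant, "estimated": is_estimated}))
```

Comparisons against a missing age evaluate to False, so the missing row is flagged as neither infant nor estimated.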


## Sanity check of variables

Two variables that stand out are the Name and Ticket columns. Intuitively, a passenger's name should not be a significant determinant of whether they died in the sinking.

Similarly, the ticket number shouldn't really matter either. Looking at the first 7 ticket numbers in the dataset, there is no clear meaning to them: some include letters, and the purely numeric ones range from 17463 to 373450. The ticket number therefore can't correspond to a passenger count or boarding order, as the Titanic certainly did not have room for 300,000 people.
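
The ticket observation can be checked in a few lines. This sketch types in the seven ticket values from the `head` output above rather than reading the file, and the variable names are illustrative.

```python
import pandas as pd

# The seven ticket values shown in the head() output above
tickets = pd.Series([
    "A/5 21171", "PC 17599", "STON/O2. 3101282",
    "113803", "373450", "330877", "17463",
])

# True where the ticket consists of digits only
is_numeric = tickets.str.isdigit()

# Range of the purely numeric tickets
numeric = tickets[is_numeric].astype(int)

print(is_numeric.sum(), "of", len(tickets), "tickets are purely numeric")
print("min:", numeric.min(), "max:", numeric.max())
```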

## Data Janitor Work

Since we've identified the Name and Ticket columns as not particularly useful for predicting whether a given Titanic passenger survived, let's drop those first.

In [2]:
train.drop(['Name','Ticket'], axis = 1, inplace = True)
train.head(n=7)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.25,,S
1,2,1,1,female,38.0,1,0,71.2833,C85,C
2,3,1,3,female,26.0,0,0,7.925,,S
3,4,1,1,female,35.0,1,0,53.1,C123,S
4,5,0,3,male,35.0,0,0,8.05,,S
5,6,0,3,male,,0,0,8.4583,,Q
6,7,0,1,male,54.0,0,0,51.8625,E46,S
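
As an aside, `drop` can also return a new frame instead of modifying the original, which keeps a cell safe to re-run (dropping columns that are already gone raises a `KeyError` by default). A minimal sketch on a hypothetical two-row frame, where `toy` and `trimmed` are illustrative names:

```python
import pandas as pd

# Hypothetical two-row stand-in for the training data
toy = pd.DataFrame({
    "PassengerId": [1, 3],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282"],
    "Survived": [0, 1],
})

# columns=[...] is equivalent to axis=1; without inplace=True, toy is left untouched
trimmed = toy.drop(columns=["Name", "Ticket"])

print(list(trimmed.columns))
print(list(toy.columns))
```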


With Name and Ticket gone, the table looks cleaner. It also looks like we have some missing data (PassengerId 6 has a missing Age value, and the Cabin column has many missing values). Let's check whether any other columns have missing values.

In [3]:
train.isnull().sum()
# age = train['Age'].tolist()

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age (177), Cabin (687), and Embarked (2) columns have missing values. We will need to deal with these.
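
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows the idea on the seven rows printed earlier (typed in by hand, so the numbers describe that sample only, not the full dataset); `isnull().mean()` gives the fraction missing because the mean of a boolean column is the proportion of True values.

```python
import numpy as np
import pandas as pd

# Age, Cabin and Embarked for the seven rows printed earlier
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S"],
})

missing_count = sample.isnull().sum()
missing_share = sample.isnull().mean()  # fraction of rows missing per column

print(pd.DataFrame({"missing": missing_count, "share": missing_share.round(2)}))
```

Running the same two calls on `train` would give the per-column share for the whole training set.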