In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<h3>Import data</h3>

In [44]:
df = pd.read_csv('train.csv', header=0)
df.dtypes
df.info()
df.describe()
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C


**Data Munging Tutorial**

In [45]:
# Obtain first 10 rows of Age column
df['Age'][0:10] 
df.Age[0:10]
# Get mean
df['Age'].mean()

# Obtain subsets 
df[['Sex', 'Pclass', 'Age']]

# Filtering
df[df['Age'] > 60]
df[df['Age'].isnull()]
print('')




**Data Cleaning Tutorial**

In [46]:
# Binarize Sex column
df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

In [47]:
# Encode Embarked column as numbers
df['Embarked_Num'] = df['Embarked'].map({'C':1, 'Q':2, 'S':3}).astype(float)
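
The float cast in the cell above matters because `map` returns NaN for any value that is not a key in the dictionary, including missing values, and NaN cannot be stored in an integer column. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from train.csv):

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical): one Embarked value is missing.
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan]})

# map() yields NaN for any value that is not a key in the dict,
# including NaN itself.
embarked_num = toy['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})

# Count how many rows could not be encoded.
n_missing = int(embarked_num.isnull().sum())
```

If every port were present, `astype(int)` would work as it does for `Gender`; with gaps in the column, float is the simplest dtype that can hold the result.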

In [48]:
# Fill in missing ages with the median age of each (Gender, Pclass) group
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = df[(df['Gender'] == i) &
                              (df['Pclass'] == j+1)]['Age'].dropna().median()
        
df['AgeFill'] = df['Age']
for i in range(0, 2):
    for j in range(0, 3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),
               'AgeFill'] = median_ages[i,j]

df['AgeIsNull'] = pd.isnull(df.Age).astype(int)
df[ df['Age'].isnull() ][['Gender','Pclass','Age','AgeFill','AgeIsNull']].head(2)

Unnamed: 0,Gender,Pclass,Age,AgeFill,AgeIsNull
5,1,3,,25,1
17,1,2,,30,1
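
The nested loops above can also be written with `groupby` and `transform`, which computes one median per (Gender, Pclass) group and aligns it back to every row. This is an alternative sketch, not the notebook's own method, and it runs on a small made-up frame (`toy` is hypothetical) rather than on train.csv:

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical): two groups, each with one missing age.
toy = pd.DataFrame({
    'Gender': [0, 0, 0, 1, 1, 1],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})

# Median age of each (Gender, Pclass) group, repeated for every row of the group.
# median skips NaN by default, so no dropna() is needed.
group_median = toy.groupby(['Gender', 'Pclass'])['Age'].transform('median')

# Keep known ages and fill only the missing ones with the group median.
toy['AgeFill'] = toy['Age'].fillna(group_median)

# Flag rows whose age was originally missing.
toy['AgeIsNull'] = toy['Age'].isnull().astype(int)
```

On the toy frame the two missing ages become 35.0 and 23.0, the medians of their respective groups.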


In [49]:
# List the columns that still hold non-numeric (object) data
df.dtypes[df.dtypes == 'object']

Name        object
Sex         object
Ticket      object
Cabin       object
Embarked    object
dtype: object

In [52]:
df_new = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)
df_new.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,Embarked_Num,AgeFill,AgeIsNull
0,1,0,3,22,1,0,7.25,1,3,22,0
1,2,1,1,38,1,0,71.2833,0,1,38,0
2,3,1,3,26,0,0,7.925,0,3,26,0
3,4,1,1,35,1,0,53.1,0,3,35,0
4,5,0,3,35,0,0,8.05,1,3,35,0


In [53]:
train_data = df_new.values
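
`.values` copies the frame into a NumPy array as it is, so any NaN left in a column is carried along. In the frame above, `Age` is kept next to `AgeFill`, and `Embarked_Num` may be NaN wherever `Embarked` was missing, so a per-column count before the conversion can be worthwhile. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical) mimicking three of the columns in df_new.
toy = pd.DataFrame({
    'Age':          [22.0, np.nan, 26.0],
    'AgeFill':      [22.0, 25.0, 26.0],
    'Embarked_Num': [3.0, 1.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Names of the columns that still contain at least one NaN.
cols_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
```

Columns listed in `cols_with_missing` would need to be dropped or filled before the array is passed to code that does not accept NaN.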

VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, so parch=0 for them.  Some passengers also
travelled with very close friends or neighbors in a village, but
the definitions do not support such relations.