This is my first foray into Kaggle and machine learning. The Titanic dataset is a great place to get started because of the large number of tutorials and kernels written about it, many of which I gratefully referenced to help me complete my own kernel. All references are included at the end of this notebook.

1. Import libraries and data
2. Assess the quality of the data
3. Transform the data to allow for model building
4. Perform EDA to find important variables as well as new variables that could be created
5. Try out different machine learning algorithms
6. Make final predictions

In [None]:
import numpy as np 
import pandas as pd 

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import xgboost as xgb

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')


In [None]:
train_df.head()

In [None]:
train_df.describe(include="all")

In [None]:
test_df.head()

In [None]:
test_df.describe(include="all")

In [None]:
train_df.dtypes

Int: PassengerId, Survived, Pclass (ticket class), SibSp (# of siblings/spouses aboard), Parch (# of parents/children aboard)
Float: Age, Fare
Object: Name, Sex, Ticket (ticket number), Cabin, Embarked

Age and Cabin have the most missing values, so they need the most attention. Embarked (two missing values in train) and Fare (one missing value in test) also have small gaps. Age can likely be filled with the mean or median, but the missing data in Cabin is harder to deal with.
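A quick way to confirm which columns have gaps is to count nulls per column. This is only a sketch on a made-up toy frame (`toy_df` and its values are assumptions standing in for `train_df`), not output from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df; all values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Running the same two lines on `train_df` and `test_df` separately would show whether the two files have gaps in different columns.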

Let's do some exploratory data analysis to see what we can find out about the dataset before building the models.



In [None]:
#first look at age and see how to best deal with missing values

train_df['Age'].hist()

In [None]:
#the data is right skewed. Let's take a look at survival rates by age to see if age is a big factor.
#If it is not, then we can just impute the median. If it is, however, then we will have to find another way.

#let's first group age into 7 quantile-based bins

train_df['AgeBins'] = pd.qcut(train_df['Age'], 7)
sns.barplot(x="AgeBins", y="Survived", data=train_df)

#we find that babies, and surprisingly 20-36 year olds, have a high survival rate
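Since the bins above come from `pd.qcut`, they hold roughly equal numbers of passengers rather than covering equal age ranges. The sketch below contrasts `pd.qcut` with `pd.cut` on a handful of made-up ages (not taken from the dataset) to show the difference.

```python
import pandas as pd

# Made-up ages, not taken from the Titanic data.
ages = pd.Series([1, 4, 18, 21, 22, 25, 28, 30, 36, 40, 55, 70])

# qcut: quantile-based bins, so each bin holds roughly the same number of rows.
quantile_bins = pd.qcut(ages, 4)
# cut: equal-width bins, so bin sizes depend on how the data is distributed.
width_bins = pd.cut(ages, 4)

print(quantile_bins.value_counts().sort_index())
print(width_bins.value_counts().sort_index())
```

With skewed data like Age, quantile bins keep each bar in the survival plot backed by a similar number of rows, at the cost of uneven age ranges.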

In [None]:
#We can also take a look at survival rates based on name title
#raw strings keep the backslash in the regex from being treated as a string escape
train_df['Title'] = train_df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
test_df['Title'] = test_df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
sns.barplot(x="Title", y="Survived", data=train_df)
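To see what the title regex actually captures, here is a small sketch on three names written in the dataset's "Surname, Title. Given names" layout. The pattern grabs the first run of letters that follows a space and ends with a period.

```python
import pandas as pd

# Sample names in the same "Surname, Title. Given names" layout as the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the run of letters that follows a space and ends with a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

Note that "the" is skipped in the third name because it is not followed by a period, so the match lands on "Countess".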

In [None]:
np.unique(train_df['Title'])

#Since survival rate varies quite a bit by Title, let's impute avg age by title
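For reference, imputing age by title could be sketched roughly as below. This is an illustration on made-up data (`toy_df` is hypothetical), with a fallback to the overall median for titles that have no known ages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; ages and titles are made up.
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Master", "Master", "Miss"],
    "Age": [30.0, 40.0, np.nan, 4.0, np.nan, np.nan],
})

# Median age within each title, broadcast back to every row of that title.
title_median = toy_df.groupby("Title")["Age"].transform("median")

# Fill gaps with the title median, then fall back to the overall median
# for titles that have no known ages at all (here: "Miss").
overall_median = toy_df["Age"].median()
toy_df["Age"] = toy_df["Age"].fillna(title_median).fillna(overall_median)
print(toy_df)
```

Whether this beats a single median depends on how different the typical ages are across titles, which is what the next cells look at.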

In [None]:
sns.barplot(x="Title", y="Age", data=train_df)

In [None]:
train_df['Title'].value_counts()

#However, looking at the title counts above, pretty much everyone is either a Mr, Miss, Mrs, or Master.
#So now we know that title is quite predictive, but it doesn't really help us determine age.
#So let's impute the median age for the missing values. As we saw from the summary up top, 28 is the median age.

In [None]:
train_df.loc[(train_df.Age.isnull()),'Age'] = train_df["Age"].median()
test_df.loc[(test_df.Age.isnull()),'Age'] = test_df["Age"].median()

In [None]:
#Let's fill in the rest of the missing values in the train dataset

train_df['Cabin'].value_counts()

#Since cabin is pretty much unique to the person, we can just drop it from the dataset

train_df = train_df.drop(['Cabin'], axis = 1)
test_df = test_df.drop(['Cabin'], axis = 1)

In [None]:
train_df['Embarked'].value_counts()

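#Since S is most common and there are only two missing, let's use S for the two missing values
#Only one missing fare (in test dataset), so replace that with the mean test fare (about 35.6)
#assign the result back instead of using inplace=True on a column, which newer pandas versions may not apply
train_df['Embarked'] = train_df['Embarked'].fillna("S")
test_df['Fare'] = test_df['Fare'].fillna(35.6)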

#Let's also drop the AgeBins variable since it doesn't tell us anything more than what age already does
train_df = train_df.drop(['AgeBins'], axis = 1)
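The fill values above are typed in by hand. A variant that computes them from the data might look like the sketch below, which uses made-up toy frames (`toy_train` and `toy_test` are assumptions, not the real files).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; values are made up.
toy_train = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})
toy_test = pd.DataFrame({"Fare": [10.0, np.nan, 30.0]})

# mode() returns a Series (there can be ties); take the first entry.
most_common_port = toy_train["Embarked"].mode()[0]
toy_train["Embarked"] = toy_train["Embarked"].fillna(most_common_port)

# mean() skips NaN by default, so the gap does not affect the fill value.
toy_test["Fare"] = toy_test["Fare"].fillna(toy_test["Fare"].mean())
```

Computing the values this way keeps the cell correct if the input files ever change.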

In [None]:
train_df.describe(include="all")

In [None]:
test_df.describe(include="all")

Now that both the test and train datasets have the same columns (besides 'Survived') and no missing values, let's do some EDA and build our model.

In [None]:
#Let's see how each variable correlates to survival rates.
#But first, let's see which variables are unique to each passenger
#We can drop Name since it's unique to each passenger and we have already pulled out Title
#Let's see if ticket is unique

test_df.describe(include="all")

In [None]:
train_df[train_df.Ticket == "347082"]

#looks like each family shares the same ticket. So let's keep it in the dataset for now, though it seems like
#SibSp and Parch give us much of the same information.
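Rather than checking one ticket number at a time, the number of passengers sharing each ticket can be counted directly. The sketch below uses a made-up toy frame, and the column name `TicketGroupSize` is a hypothetical one introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy frame; ticket values are only examples.
toy_df = pd.DataFrame({
    "Ticket": ["347082", "347082", "347082", "113803", "113803", "A/5 21171"],
})

# How many passengers travel on each ticket.
ticket_counts = toy_df["Ticket"].value_counts()

# Attach the group size to every row (hypothetical column name).
toy_df["TicketGroupSize"] = toy_df["Ticket"].map(ticket_counts)
print(toy_df)
```

Comparing such a group size against SibSp + Parch would show how much extra information Ticket really carries.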

In [None]:
#So the only variable we'll drop for now is Name

train_df = train_df.drop(['Name'], axis = 1)
test_df = test_df.drop(['Name'], axis = 1)

In [None]:
test_df.describe(include="all")

In [None]:
#features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title']
#categorical = ['Sex','Embarked','Title']

features = ['Pclass','Sex','Age','SibSp','Parch','Fare']
categorical = ['Sex']

y = ['Survived']

In [None]:
#fit the encoder on train only, then reuse it on test so both use the same mapping
le = LabelEncoder()
train_df['Sex'] = le.fit_transform(train_df['Sex'])
test_df['Sex'] = le.transform(test_df['Sex'])
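To check which integer each label received, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on made-up toy frames (`toy_train` and `toy_test` are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for train_df and test_df.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"]})

encoder = LabelEncoder()
toy_train["Sex"] = encoder.fit_transform(toy_train["Sex"])
# Reuse the fitted encoder so test gets the same label-to-integer mapping.
toy_test["Sex"] = encoder.transform(toy_test["Sex"])

# classes_ is sorted, and each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

An encoder fitted separately on each file would only agree by coincidence, namely when both files contain exactly the same set of labels.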

In [None]:
#as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
train_x = train_df[features].to_numpy()
test_x = test_df[features].to_numpy()
train_y = train_df['Survived']

In [None]:
train_x

In [None]:
gbm = xgb.XGBClassifier(max_depth=5, n_estimators=250, learning_rate=0.05).fit(train_x, train_y)
predictions = gbm.predict(test_x)
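The model above is fit on every training row, so there is no local estimate of accuracy before submitting. One common option is to hold out part of the training data with `train_test_split`. The sketch below is only an illustration: it uses synthetic data from `make_blobs` in place of `train_x` and `train_y`, and swaps in a `RandomForestClassifier` instead of XGBoost to keep the example light.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for train_x / train_y (an assumption).
X, y = make_blobs(n_samples=200, centers=2, n_features=6, random_state=0)

# Hold out 25% of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit on the remaining rows and score on the held-out rows.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

The same split could be applied to `train_x` and `train_y` before fitting `gbm`, which would also make it possible to compare different algorithms on equal footing.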

In [None]:
submit_preds = pd.DataFrame({ 'PassengerId': test_df['PassengerId'], 'Survived': predictions })
submit_preds.to_csv("submit_preds.csv", index=False)
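Before uploading, it can be worth checking that the submission frame has the expected shape. A small sketch, with made-up IDs and predictions standing in for `test_df['PassengerId']` and `predictions`:

```python
import pandas as pd

# Hypothetical stand-ins for test_df['PassengerId'] and the model's predictions.
passenger_ids = pd.Series([892, 893, 894])
toy_predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": toy_predictions})

# Basic checks: expected columns, one row per passenger, only 0/1 labels, no gaps.
assert list(submission.columns) == ["PassengerId", "Survived"]
assert submission["PassengerId"].is_unique
assert submission["Survived"].isin([0, 1]).all()
assert not submission.isnull().any().any()
```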