# Getting started with Titanic
## Kaggle Tutorial

Let's begin with the default Kaggle notebook starting commands, which load the `numpy` and `pandas` packages and list the available files in the input folder (`data-in/` here; `/kaggle/input/` on Kaggle).

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing

import os
for dirname, _, filenames in os.walk('data-in'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data-in/test.csv
data-in/train.csv


Now we can read the `train.csv` file into a pandas DataFrame and view its first five rows with `head()`.

In [4]:
train_data = pd.read_csv("data-in/train.csv")
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


And do the same for the `test.csv` file.

In [5]:
test_data = pd.read_csv("data-in/test.csv")
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |


## Improving our score
Our goal is to find patterns in `train.csv` that help us predict whether the passengers in `test.csv` survived. We'll start simple.

### Exploring a pattern
Remember that the sample submission file in `submissions/gender_submission.csv` assumes that all female passengers survived (and all male passengers died). Let's check if this pattern holds true in `train.csv`.

In [6]:
women = train_data.loc[train_data.Sex == 'female', 'Survived']
rate_women = sum(women) / len(women)
print('% of women who survived:', rate_women)

% of women who survived: 0.7420382165605095


Let's see what proportion of male passengers survived.

In [7]:
men = train_data.loc[train_data.Sex == 'male', 'Survived']
rate_men = sum(men) / len(men)
print('% of men who survived:', rate_men)

% of men who survived: 0.18890814558058924


From the data in `train.csv` we can see that about 74% of women survived, whereas only about 19% of men did. Gender appears to be a fairly strong indicator of survival. However, the gender-based submission only considers one column. By considering multiple columns we can uncover more complex patterns that may lead to better predictions. Doing this by hand is difficult, and this is where machine learning can help.
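The per-gender rates above can also be computed in one step with `groupby`, and the same call extends to more than one column. The sketch below uses a small, invented six-row table (`toy`) rather than the real `train.csv`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical six-row stand-in for train.csv (values invented for illustration).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 3, 1, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Grouping by two columns gives one rate per (Sex, Pclass) combination.
rate_by_sex_and_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_sex_and_class)
```

Replacing `toy` with `train_data` should give the same kind of breakdown for the real passengers, though the number of combinations grows quickly as columns are added.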

### Our first machine learning model
We will build what is known as a **random forest model**. This model is constructed of several "trees" (there are three trees in the picture below but we will construct 100) that will individually consider each passenger's data and vote on whether that passenger survived. Then the random forest model makes a democratic decision: the outcome with the most votes wins.

![](https://i.imgur.com/AC9Bq63.png)
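To make the voting idea concrete, here is a minimal sketch of a majority vote in plain Python. The helper `majority_vote` and the three votes are hypothetical and only illustrate the principle. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is a simplification of what the library does.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from three trees for one passenger (1 = survived, 0 = died).
tree_votes = [1, 0, 1]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # prints 1
```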

The code cell below looks for patterns in four different columns (`Pclass`, `Sex`, `SibSp`, and `Parch`) of the data. It constructs the trees in the random forest model based on patterns in the `train.csv` file, before generating predictions for the passengers in `test.csv`. The code also saves these new predictions in a CSV file `my_submission.csv`.

In [8]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
output.head()

Your submission was successfully saved!


| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
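The model cannot work with text values such as `male` and `female` directly, which is why the cell above passes the feature columns through `pd.get_dummies`. The sketch below shows what that call does on a tiny, invented table (`toy`): numeric columns are left alone and each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical three-row table with one numeric and one text column.
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})

# Text columns are expanded into indicator columns; numeric columns pass through.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # ['Pclass', 'Sex_female', 'Sex_male']
```

Because the training and test sets are encoded separately, both must contain the same categories to end up with the same columns. That holds for `Sex` here; if it did not, something like `X_test.reindex(columns=X.columns, fill_value=0)` would be one way to align them.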
