## How to complete a Kaggle Competition with Machine Learning

In this code-along session, you'll build several algorithms of increasing complexity that predict whether a given passenger on the Titanic survived, using data such as the fare they paid, where they embarked and their age.

<img src="img/nytimes.jpg" width="500">

In particular, you'll build _supervised learning_ models. _Supervised learning_ is the branch of machine learning (ML) that involves predicting labels, such as 'Survived' or 'Not'. A supervised learning model:

* learns from labelled data, e.g. data that includes whether a passenger survived (this step is called model training);
* then makes predictions on unlabelled data.

On Kaggle, a platform for predictive modelling and analytics competitions, the labelled and unlabelled data are called the _training_ and _test_ sets because:

* You want to build a model that learns patterns in the training set
* You _then_ use the model to make predictions on the test set!

Kaggle then tells you the **percentage that you got correct**: this is known as the _accuracy_ of your model.
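
The train, predict and score loop described above can be sketched in a few lines of plain Python. This is a toy illustration rather than competition code: `MajorityClassModel`, the `accuracy` helper and the small label lists are all made up for the example, and on Kaggle the scoring happens on the platform's side using test labels you never see.

```python
from collections import Counter


class MajorityClassModel:
    """Hypothetical toy model: always predicts the most common training label."""

    def fit(self, labels):
        # "Training" here just means remembering the most frequent label
        self.prediction_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_rows):
        # Predict the remembered label for every unlabelled row
        return [self.prediction_] * n_rows


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive
train_labels = [0, 0, 1, 0, 1, 0]
hidden_test_labels = [0, 1, 0, 0]  # on Kaggle, only the platform sees these

model = MajorityClassModel().fit(train_labels)
test_predictions = model.predict(len(hidden_test_labels))
score = accuracy(hidden_test_labels, test_predictions)
print(score)
```

The model never looks at `hidden_test_labels`; they are used only to score the predictions afterwards, which mirrors what happens when you submit a file to Kaggle.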

## Approach

A good way to approach supervised learning:

* Exploratory Data Analysis (EDA);
* Build a quick and dirty model (baseline);
* Iterate;
* Engineer features;
* Get a model that performs better.



## Import the data and check it out

In [None]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import accuracy_score

# Figures inline and set visualization style
%matplotlib inline
sns.set()

In [None]:
# Import test and train datasets
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# View first lines of training data
df_train.head(15)

* What are all these features? Check out the Kaggle data documentation [here](https://www.kaggle.com/c/titanic/data).

**Important note on terminology:** 
* The _target variable_ is the one you are trying to predict;
* Other variables are known as _features_ (or _predictor variables_).

In [None]:
# View first lines of test data
df_test.head(10)

* Use the DataFrame `.info()` method to check out datatypes, missing values and more (of `df_train`).

In [None]:
df_train.info()
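
If you want the missing-value counts as numbers rather than reading them off the `.info()` output, one option is `isnull().sum()`. The sketch below uses a small made-up DataFrame (`toy`) in place of `df_train` so that it runs on its own; the column names mirror the Titanic data but the values are invented.

```python
import numpy as np
import pandas as pd

# Invented stand-in for df_train, with a few gaps
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

On the real data you would call the same method on `df_train` instead of `toy`.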

* Use the DataFrame `.describe()` method to check out summary statistics of numeric columns (of `df_train`).

In [None]:
df_train.describe()


**Recap:**
* you've loaded your data and had a look at it.

**Up next:** Explore your data visually and build a first model!

For more on `pandas`, check out our [Data Manipulation with Python track](https://www.datacamp.com/tracks/data-manipulation-with-python). 


## Visual exploratory data analysis and your first model

* Use `seaborn` to build a bar plot of Titanic survival (your _target variable_).

In [None]:
sns.countplot(x='Survived', data=df_train)

**Take-away:** In the training set, fewer people survived than didn't. Let's then build a first model that **predicts that nobody survived**.

This is a bad model as we know that people survived. But it gives us a _baseline_: any model that we build later needs to do better than this one.

* Create a column 'Survived' for `df_test` that encodes 'did not survive' for all rows;
* Save 'PassengerId' and 'Survived' columns of `df_test` to a .csv and submit to Kaggle.

In [None]:
df_test['Survived'] = 0
df_test[['PassengerId', 'Survived']].to_csv('predictions/no_survivors.csv', index=False)

* What accuracy did this give you?

Accuracy on Kaggle = 62.7%

**Essential note!** There are metrics other than accuracy that you may want to use.
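
To see why accuracy alone can mislead, consider the 'nobody survived' baseline on a small set of made-up labels (not the real Titanic data). It gets most rows right, yet it finds none of the survivors, which is what _recall_ measures. The helper functions here are hypothetical and written out by hand for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of the actual positives that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Made-up labels: 3 survivors (1) among 10 passengers
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# The baseline predicts 'did not survive' for everyone
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(baseline_accuracy, baseline_recall)
```

Which metric matters depends on the problem; the Titanic competition itself is scored on accuracy.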

**Recap:**
* you've loaded your data and had a look at it.
* you've explored your target variable visually and made your first predictions.

**Up next:** More EDA and you'll build another model.

## EDA on feature variables

* Use `seaborn` to build a bar plot of the Titanic dataset feature 'Sex' (of `df_train`).

In [None]:
sns.countplot(x='Sex', data=df_train)

* Use `seaborn` to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Sex'.

In [None]:
sns.catplot(x='Survived', col='Sex', kind='count', data=df_train)

**Take-away:** Women were more likely to survive than men.

* Use `pandas` to figure out how many women and how many men survived.

In [None]:
df_train.groupby(['Sex']).Survived.sum()

* Use `pandas` to figure out the proportion of women that survived, along with the proportion of men:

In [None]:
print(df_train[df_train.Sex == 'female'].Survived.sum()/df_train[df_train.Sex == 'female'].Survived.count())
print(df_train[df_train.Sex == 'male'].Survived.sum()/df_train[df_train.Sex == 'male'].Survived.count())

74% of women survived, while 19% of men survived.
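
Because 'Survived' is coded as 0 or 1, its mean within a group is the proportion that survived, so the two filters above can be collapsed into a single `groupby`. The sketch uses an invented miniature DataFrame in place of `df_train`, so the numbers it prints are not the real survival rates.

```python
import pandas as pd

# Invented stand-in for df_train
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = proportion of 1s in each group
survival_rate = toy.groupby('Sex').Survived.mean()
print(survival_rate)
```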

Let's now build a second model and predict that all women survived and all men didn't. Once again, this is an unrealistic model, but it will provide a baseline against which to compare future models.

* Create a column 'Survived' for `df_test` that encodes the above prediction.
* Save 'PassengerId' and 'Survived' columns of `df_test` to a .csv and submit to Kaggle.

In [None]:
# Predict 1 (survived) for women and 0 (did not survive) for men
df_test['Survived'] = (df_test.Sex == 'female').astype(int)
df_test.head()

In [None]:
df_test[['PassengerId', 'Survived']].to_csv('predictions/women_survive.csv', index=False)

* What accuracy did this give you?

Accuracy on Kaggle = 76.5%
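
You don't have to submit to Kaggle to get a rough idea of how a rule like this performs: the training set has labels, so you can score the rule against them. The sketch below does this on an invented stand-in for `df_train` (`toy_train`), so its result says nothing about the real data, and a score measured on training data will not necessarily match the Kaggle test score.

```python
import pandas as pd

# Invented stand-in for df_train
toy_train = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female', 'male', 'female'],
    'Survived': [1, 0, 1, 1, 0, 0],
})

# Apply the rule: women survive (1), men do not (0)
rule_prediction = (toy_train.Sex == 'female').astype(int)

# Proportion of rows where the rule agrees with the known label
train_accuracy = (rule_prediction == toy_train.Survived).mean()
print(train_accuracy)
```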

**Recap:**
* you've loaded your data and had a look at it.
* you've explored your target variable visually and made your first predictions.
* you've explored some of your feature variables visually and made more predictions that did better based on your EDA.

**Up next:** EDA of other feature variables, categorical and numeric.

For more on `pandas`, check out our [Data Manipulation with Python track](https://www.datacamp.com/tracks/data-manipulation-with-python). 

For more on `seaborn`, check out Chapter 3 of our [Intro. to Datavis with Python course](https://www.datacamp.com/courses/introduction-to-data-visualization-with-python).

## Explore your data more!

* Use `seaborn` to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Pclass'.

In [None]:
sns.catplot(x='Survived', col='Pclass', kind='count', data=df_train);
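
The faceted counts can also be summarised as a table of proportions, which makes the classes easier to compare when they contain different numbers of passengers. One way to do this is `pd.crosstab` with `normalize='index'`. The DataFrame below is an invented stand-in for `df_train`, so the rates are illustrative only.

```python
import pandas as pd

# Invented stand-in for df_train
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Each row (passenger class) sums to 1: share that died (0) and survived (1)
rates = pd.crosstab(toy.Pclass, toy.Survived, normalize='index')
print(rates)
```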

* Use `seaborn` to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Embarked'.

In [None]:
sns.catplot(x='Survived', col='Embarked', kind='count', data=df_train);

**Take-away:** [Include take-away from figure here]

## EDA with numeric variables

* Use `seaborn` to plot a histogram of the 'Fare' column of `df_train`.

In [None]:
# Histogram of fares (sns.histplot requires seaborn 0.11 or later)
sns.histplot(x='Fare', data=df_train);
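
A histogram shows the overall shape of 'Fare'. To compare fares between survivors and non-survivors numerically, one option is a grouped median, which is less sensitive than the mean to a few very expensive tickets. The values below are invented and stand in for `df_train`.

```python
import pandas as pd

# Invented stand-in for df_train
toy = pd.DataFrame({
    'Fare': [7.25, 8.05, 13.0, 26.0, 71.28, 53.1],
    'Survived': [0, 0, 0, 1, 1, 1],
})

# Median fare among non-survivors (0) and survivors (1)
median_fare = toy.groupby('Survived').Fare.median()
print(median_fare)
```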

**Recap:**
* you've loaded your data and had a look at it.
* you've explored your target variable visually and made your first predictions.
* you've explored some of your feature variables visually and made more predictions that did better based on your EDA.
* you've done some serious EDA of feature variables, categorical and numeric.

**Up next:** Time to build some Machine Learning models, based on what you've learnt from your EDA here. Open the notebook `2-titanic_first_ML-model.ipynb`.
