# Sample code for the Titanic competition

This notebook provides sample code for:
* loading packages
* reading in a data set
* creating a simple graph 
* splitting a single column into multiple columns based on a given condition
* training a logistic regression model

## Packages

In [None]:
# data processing
import pandas as pd
import numpy as np

# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression

## Loading data

In [None]:
# Setting directory
project_dir = "/home/jovyan/titanic-data-classification/data/"

In [None]:
train_df = pd.read_csv(project_dir + "train.csv")
test_df = pd.read_csv(project_dir + "test.csv")

In [None]:
# Preview the training data
train_df.head()

In [None]:
# Information on columns
train_df.info()
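`info()` reports the non-null count for each column. A complementary check is to count the missing values directly. The sketch below uses a small made-up dataframe (`toy_df`, a hypothetical stand-in, not the real Titanic data) so that it runs on its own; with the real data, the same call could be made on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (made-up values, not the real Titanic data)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Pclass": [3, 1, 3, 2],
})

# Number of missing values in each column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Columns with a non-zero count need to be filled or dropped before they can be passed to most scikit-learn models.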

## Visualisation

In [None]:
# Simple count plot
sns.countplot(x='Sex', data=train_df)

In [None]:
# box plot of age, split by class
sns.boxplot(x='Pclass', y='Age', data=train_df)

## One-hot encoding

Categorical variables need to be converted to a numeric form before they can be used in models. One-hot encoding does this by creating a separate indicator (dummy) column for each category. We've provided a function to do this below.

In [None]:
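# Creates dummy columns for a given categorical variable and adds them to a copy of the original dataframe
def create_dummies(df, column_name):
    # One indicator column per category, named <column_name>_<category>
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    # pd.concat returns a new dataframe; the input dataframe is left unchanged
    df = pd.concat([df, dummies], axis=1)
    return df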

In [None]:
# Example - remember, whatever processing is carried out on the training set should also
# be carried out on the test set, because the model expects the same columns in both
train_df_ohe = create_dummies(train_df, "Pclass")
test_df_ohe = create_dummies(test_df, "Pclass")

In [None]:
# View the result
train_df_ohe.head()
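One caveat when encoding the two datasets separately: if a category appears in the training set but not in the test set, the dummy columns will differ. A minimal sketch of one way to handle this, using made-up toy frames (`train_toy` and `test_toy` are hypothetical names, not part of the notebook), is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Toy frames (made-up values): category "Q" is absent from the test set
train_toy = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C", "S"]})

train_ohe = pd.get_dummies(train_toy["Embarked"], prefix="Embarked")
test_ohe = pd.get_dummies(test_toy["Embarked"], prefix="Embarked")

# Give the test dummies the same columns, in the same order, as the training dummies.
# Columns missing from the test set are created and filled with 0.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(test_ohe.columns))
```

Reindexing also drops any dummy column that exists only in the test set, which is usually what is wanted, since the model was never trained on it.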

## Machine Learning model

Here is some simple code to prepare training and test datasets, train a logistic regression model and view the accuracy of the model. 

**Note:** This code will not run on the dataset as loaded, because it contains missing values and non-numeric columns. Fill or drop the missing values and one-hot encode the categorical variables first.
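The preparation described in the note could look roughly like the sketch below. It uses a made-up toy dataframe (`toy_df`, hypothetical) and fills missing ages with the median, which is only one of several reasonable choices and not necessarily the best one for this competition.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (made-up values)
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Fill missing ages with the median age (an assumption; other strategies exist)
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())

# One-hot encode the categorical column
sex_dummies = pd.get_dummies(toy_df["Sex"], prefix="Sex")

# Keep only numeric, fully populated feature columns
features = pd.concat([toy_df[["Age"]], sex_dummies], axis=1)
target = toy_df["Survived"]

print(features)
```

When applying the same steps to the test set, the fill value should come from the training data so that both datasets are processed consistently.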

In [None]:
# Preparing the training and test datasets
# Replace train_df and test_df with dataframes with selected features and ensure no missing data. 
# Categorical features will also need to be one-hot encoded

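# Drop PassengerId from both sets so that X_train and X_test have the same columns
X_train = train_df.drop(["Survived", "PassengerId"], axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()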

In [None]:
# Logistic Regression

# Initialise variable
logreg = LogisticRegression()
# Fit the training data
logreg.fit(X_train, Y_train)
# Make predictions using the model on the test set
Y_pred = logreg.predict(X_test)
# Accuracy on the training set (the test set has no Survived column to score against)
logreg.score(X_train, Y_train)
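The score above is computed on the same data the model was fitted on, so it tends to be optimistic. One common way to get a fairer estimate is to hold back part of the training data for validation. The sketch below illustrates the idea on synthetic data (the arrays `X` and `y` are generated for illustration and have nothing to do with the Titanic data); the 25% split and the fixed `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: two numeric features, binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold back 25% of the rows for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_tr, y_tr)

# Accuracy on rows the model has not seen during fitting
val_accuracy = model.score(X_val, y_val)
print(val_accuracy)
```

With the real data, `X` and `y` would be replaced by the prepared training features and the `Survived` column.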