### Summary

This demo walks through a simple classification problem to highlight the main steps of a machine learning workflow.
The task is to classify whether a passenger survived the Titanic disaster, using the data from Kaggle (https://www.kaggle.com/c/titanic/data).
The demo is written in Python with the pandas, NumPy, and scikit-learn libraries.
Other libraries, and other languages such as R, can also be used for machine learning.


### Exploring the data

The first step is to look at the data we have, so let's look at the first few rows.
Each row represents a passenger on the Titanic and each column is a feature.
Some features are numerical while others are strings, and some columns have missing values.
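
Before reading the full summary, it can help to count the gaps per column and list which columns hold text. The sketch below does this on a tiny hand-made frame; the values and the names `sample`, `missing_counts`, and `text_columns` are illustrative assumptions, not the real Kaggle data.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (hypothetical values) with gaps in Age and Cabin.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

# Number of missing entries in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Columns that are not numeric, i.e. the ones that need encoding later.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

The same two calls can be run on the combined frame once it has been loaded.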

In [14]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def print_initial_summary(combined):
    print(combined.head(10))
    print("")
    print(combined.describe())
    print("")
    combined.info()  # info() prints its report directly and returns None

# reading train data
train = pd.read_csv('dataset/titanic/train.csv')

# reading test data
test = pd.read_csv('dataset/titanic/test.csv')

# extract the targets, then remove them from the training data
targets = train.Survived
train = train.drop('Survived', axis=1)

# merge train and test data so feature engineering is applied to both consistently
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
combined = pd.concat([train, test], ignore_index=True)

print_initial_summary(combined)

   PassengerId  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
5            6       3                                   Moran, Mr. James    male   NaN      0      0            330877   8.4583   NaN  
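
The Name values printed above follow a "Surname, Title. Given names" pattern, which is what makes the title usable as a feature. As a minimal sketch of that parsing, using three names copied from the output above (the helper name `extract_title` is an assumption for illustration):

```python
def extract_title(name):
    """Return the text between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()

# Names copied from the printed rows above.
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

titles = [extract_title(n) for n in names]
print(titles)  # ['Mr', 'Miss', 'Mrs']
```

This relies on every name containing a comma; a name without one would raise an `IndexError`.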

In [15]:
# fill in the missing values: a simple approach is to use the median for Age and Fare,
# and 'S' for Embarked
combined['Age'] = combined['Age'].fillna(combined['Age'].median())
combined['Fare'] = combined['Fare'].fillna(combined['Fare'].median())
combined['Embarked'] = combined['Embarked'].fillna('S')

# one-hot encode the non-numerical values (dtype=int keeps the columns as 0/1)
embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked', dtype=int)
combined = pd.concat([combined, embarked_dummies], axis=1)
combined.drop('Embarked', axis=1, inplace=True)

sex_dummies = pd.get_dummies(combined['Sex'], prefix='Sex', dtype=int)
combined = pd.concat([combined, sex_dummies], axis=1)
combined.drop('Sex', axis=1, inplace=True)

def process_title(name):
    title = name.split(',')[1].split('.')[0].strip()
    white_list = ['Mr', 'Mrs', 'Miss', 'Master', 'Lady', 'Ms', 'Sir']
    if title in white_list:
        return title
    else:
        return 'Rare'

combined['Title'] = combined['Name'].map(process_title)
title_dummies = pd.get_dummies(combined['Title'], prefix="Title", dtype=int)
combined = pd.concat([combined, title_dummies], axis=1)
combined.drop('Title', axis=1, inplace=True)

# drop the features that are not used by the model
combined = combined.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

print()
print("===resultant dataset===")
print(combined.head())

# scale the values with mean normalisation: (x - mean) / (max - min)
features = list(combined.columns)
combined[features] = combined[features].apply(lambda x: (x - x.mean()) / (x.max()-x.min()), axis=0)

print()
print("===normalized dataset===")
print(combined.head())


===resultant dataset===
   Pclass   Age  SibSp  Parch     Fare  Embarked_C  Embarked_Q  Embarked_S  Sex_female  Sex_male  Title_Lady  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Title_Ms  Title_Rare  Title_Sir
0       3  22.0      1      0   7.2500           0           0           1           0         1           0             0           0         1          0         0           0          0
1       1  38.0      1      0  71.2833           1           0           0           1         0           0             0           0         0          1         0           0          0
2       3  26.0      0      0   7.9250           0           0           1           1         0           0             0           1         0          0         0           0          0
3       1  35.0      1      0  53.1000           0           0           1           1         0           0             0           0         0          1         0           0          0
4       3  35.0      0      0 
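
The scaling step in this cell is mean normalisation: each column is centred on its mean and divided by its range, so the scaled values have mean zero and span exactly one unit. A minimal sketch on the five Age values printed above (the helper name `mean_normalize` is an assumption for illustration):

```python
import numpy as np

def mean_normalize(values):
    """Centre values on their mean and divide by their range (max - min)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.max() - values.min())

# Age values from the five rows shown above.
ages = [22.0, 38.0, 26.0, 35.0, 35.0]
scaled = mean_normalize(ages)
print(scaled)
```

A column whose values are all identical has a range of zero, so this formula would divide by zero for it.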

In [16]:
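train0 = pd.read_csv('dataset/titanic/train.csv')

# the first len(train0) rows of combined are the training rows, the rest are the test rows
targets = train0.Survived
train = combined.iloc[:len(train0)]
test = combined.iloc[len(train0):]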

# fit a logistic regression model

logreg = LogisticRegression(verbose=1)

logreg.fit(train, targets)

# accuracy measured on the training data itself
print(logreg.score(train, targets))

# get predictions for the test set

Y_pred = logreg.predict(test)

test_df = pd.read_csv('dataset/titanic/test.csv')

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })

submission.to_csv('dataset/titanic/output.csv', index=False)

[LibLinear]0.83164983165
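
The score printed above is measured on the same rows the model was fitted on, so it tends to be optimistic. One common way to get a less biased estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the prepared features; the data, the `max_iter` setting, and the choice of 5 folds are assumptions for illustration, not part of the original demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and survival labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)

# Accuracy on the same rows the model was fitted on.
train_accuracy = model.fit(X, y).score(X, y)

# Accuracy per fold, each fold scored on rows held out from fitting.
cv_scores = cross_val_score(model, X, y, cv=5)

print(train_accuracy, cv_scores.mean())
```

With the real data, `X` and `y` would be replaced by `train` and `targets`.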
