### Titanic Dataset ([Kaggle](https://www.kaggle.com/c/titanic/overview)) - Data Manipulation and SVC

In this notebook I apply SVC (Support Vector Classification) to the Titanic dataset from the Kaggle competition. The Titanic dataset is a good starting point for Kaggle prediction competitions.

#### Data Import
Let's start with the standard imports:

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

Next, the data sets provided with the competition are loaded into pandas DataFrames. The data has already been split into two files:
* training set (train.csv)
* test set (test.csv)

The challenge is to train the model on the provided training set, and then predict survivors on the test set.

__train.csv__ contains a 'Survived' column as the response variable, with the values 0 and 1 for not survived and survived, respectively.

In [2]:
# Importing the files to pandas DataFrame
train_data = pd.read_csv('Data/train.csv')
test_data = pd.read_csv('Data/test.csv')

# Separating 'Survived' column and assigning it to y
y = train_data.pop('Survived')

# Displaying the shapes of new DataFrames
print('Training data set - {} \nY Training - {} \nTest data set - {}'
      .format(train_data.shape, y.shape, test_data.shape))

Training data set - (891, 11) 
Y Training - (891,) 
Test data set - (418, 11)
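As a small illustration of why the training frame ends up with the same number of columns as the test frame: `DataFrame.pop` returns the requested column as a Series and removes it from the frame in place. The sketch below uses a made-up three-row frame (`demo` and `target` are hypothetical names, not part of the notebook).

```python
import pandas as pd

# Made-up miniature frame; the values are illustrative only.
demo = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
})

# pop() returns the column and drops it from the frame in place.
target = demo.pop('Survived')

print(demo.columns.tolist())
print(target.tolist())
```

Because the column is removed in place, running the same cell twice on the real data would raise a `KeyError` on the second run.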


Before applying a machine learning algorithm, we need to consider a few important steps:

* The provided data set needs to be preprocessed to obtain better results with SVC
* SVC performs better on scaled data
* Categorical features with three or more categories will be one-hot encoded
* GridSearch will be used to select the best parameters
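The steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the preprocessing used later in the notebook: the toy frame `X_demo`, the target `y_demo`, the choice of `StandardScaler` as the scaler, and the parameter grid are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy data standing in for Titanic features (values are randomly generated).
rng = np.random.RandomState(0)
n = 60
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Fare': rng.uniform(5, 500, n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
# Artificial target, only so that the classifier has something to learn.
y_demo = (X_demo['Fare'] > 250).astype(int)

# Scale the numeric columns and one-hot encode the three-level categorical one.
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['Age', 'Fare']),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Embarked']),
])

pipeline = Pipeline([('preprocess', preprocess), ('svc', SVC())])

# Hypothetical grid; the values are placeholders, not tuned recommendations.
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the held-out fold does not influence the scaling.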

The two sets are combined for preprocessing and feature engineering, because the combined data set gives us more information to work with. It will be split again later for training the model.

In [3]:
# Combining two datasets into one
df = pd.concat([train_data, test_data], ignore_index=True)

# Checking combined dataset
df.head(10)

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
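Because the training rows come first in the concatenated frame, the combined set can later be split back by position. The sketch below shows one way to do that on two made-up miniature frames; `train_demo`, `test_demo`, and the other names are hypothetical, and the notebook's own split may be done differently.

```python
import pandas as pd

# Made-up stand-ins for the training and test frames.
train_demo = pd.DataFrame({'PassengerId': [1, 2, 3], 'Fare': [7.25, 71.28, 7.93]})
test_demo = pd.DataFrame({'PassengerId': [4, 5], 'Fare': [53.10, 8.05]})

# Same call as for the real data: rows are stacked and the index is renumbered.
combined = pd.concat([train_demo, test_demo], ignore_index=True)

# The first len(train_demo) rows are the training rows; the rest are test rows.
n_train = len(train_demo)
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:].reset_index(drop=True)

print(train_again.shape, test_again.shape)
```

For the real data the split point would be 891, the number of rows in the training set.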


In [4]:
# Checking the info of the combined set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB
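The non-null counts above imply that four columns have missing values. As a quick sketch, the number and share of missing entries can be worked out from the counts reported by `df.info()`; the figures are copied from that output, while the variable names are illustrative.

```python
import pandas as pd

# Row total and non-null counts copied from the df.info() output above.
total_rows = 1309
non_null = pd.Series({'Age': 1046, 'Fare': 1308, 'Cabin': 295, 'Embarked': 1307})

# Missing entries per column and their share of all rows.
missing = total_rows - non_null
summary = pd.DataFrame({
    'missing': missing,
    'share': (missing / total_rows).round(3),
})
print(summary)

# On the real frame, df.isnull().sum() reports the same missing counts directly.
```

Cabin is missing for most passengers, while Fare and Embarked have only a handful of gaps.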
