# Data Exploration & Cleaning

This notebook is divided into:
* **Data Exploration** 
  * Descriptive Statistics
  * Missing Values
  * Outliers
* **Data Cleaning**
  * Features
  * Imputation for missing data
  * Handling of outliers
* **Data Analysis**
  * Standardisation
  * Cross-Validation
  * Training
  * Results submission

#### Packages 

In [7]:
import os
import pandas as pd
import seaborn as sns


#### Loading the dataset 

In [None]:
path='/Users/xxx/Desktop'
df_train = pd.read_csv(os.path.join(path,'train.csv'), header=0, index_col='PassengerId')
df_test = pd.read_csv(os.path.join(path,'test.csv'), header=0, index_col='PassengerId')
df = pd.concat([df_train, df_test], keys=['train', 'test'], sort=False)

## Data Exploration 

### 1) Descriptive Statistics 

In [None]:
df.head()

In [None]:
df.describe()

### 2) Missing values

In [None]:
df.isnull().sum()

In [8]:
sns.heatmap(df.isnull(), cbar=False)


### 3) Outliers

In [None]:
sns.boxplot(x=df['Age'])

In [None]:
sns.boxplot(x=df['SibSp'])

In [None]:
sns.boxplot(x=df['Parch'])

In [None]:
sns.boxplot(x=df['Fare'])

## Data Cleaning 

### Dropping features and creating new ones 
*  Two new features are created: *Title*, which is derived from the *Name* feature, and *FamilySize*, which summarizes the information contained in the *SibSp* and *Parch* features.
*  The feature *Cabin* is dropped because it has too many missing values.
*  The features *Name*, *SibSp* and *Parch* are dropped since they are now redundant.
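
As a quick illustration of how the two new features are derived, the sketch below applies the same slicing logic to two sample names and computes a family size by hand. The helper name `extract_title` and the sample values are assumptions made for illustration; names are assumed to follow the `Surname, Title. Given names` layout.

```python
def extract_title(name: str) -> str:
    """Return the text between the first comma and the first period."""
    start = name.index(',') + 2   # skip the comma and the space after it
    end = name.index('.')         # the title ends at the first period
    return name[start:end]

# Hypothetical sample names in the "Surname, Title. Given names" layout
samples = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]
titles = [extract_title(n) for n in samples]
print(titles)

# Family size counts siblings/spouses, parents/children, and the passenger
sibsp, parch = 1, 0
family_size = sibsp + parch + 1
print(family_size)
```

A name that contains a period before the title would break this slicing, so it is worth checking the resulting value counts.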



In [None]:
df['Title'] = df['Name'].apply(lambda name: name[name.index(',') + 2 : name.index('.')])
df['FamilySize'] = (df['SibSp'] + df['Parch'] + 1)

In [None]:
df['FamilySize'] = df['FamilySize'].astype(float)
df['Pclass'] = df['Pclass'].astype(float)

In [None]:
print(df.Title.value_counts())

In [None]:
ReducedTitles = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Sir",
    "Sir" :       "Sir",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Sir",
    "Lady" :      "Royalty"
}

df.Title = df.Title.map(ReducedTitles)


In [None]:
print(df.Title.value_counts())

### Outliers

All the outliers are kept.
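
Even though no outliers are removed, it can help to know how many points the boxplots flag. The following is a minimal sketch, assuming the usual whisker rule of 1.5 times the interquartile range; the function name `count_iqr_outliers` and the toy values are illustrative and not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(s: pd.Series, whis: float = 1.5) -> int:
    """Count values lying beyond whis * IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - whis * iqr) | (s > q3 + whis * iqr)
    return int(mask.sum())

# Made-up values with one clearly extreme point
toy_fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])
n_out = count_iqr_outliers(toy_fares)
print(n_out)
```

On the real data the same function could be applied to `df['Age']`, `df['SibSp']`, `df['Parch']` and `df['Fare']`.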

### Missing Values Imputation
Once the *Cabin* feature has been dropped, we are left with:
* 263 missing values for the *Age* feature
* 2 missing values for the *Embarked* feature
* 1 missing value for the *Fare* feature
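
The imputation strategy used in the cells that follow can be sketched on a toy frame: the most frequent value for a categorical column, and a per-group median for *Age*. The toy data below is made up for illustration and only mirrors the column names of the dataset.

```python
import pandas as pd

# Toy frame with gaps in Age and Embarked (values are invented)
toy = pd.DataFrame({
    'Title':    ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age':      [30.0, 40.0, None, 20.0, None, 24.0],
    'Embarked': ['S', 'S', 'C', None, 'S', 'Q'],
})

# Categorical column: fill with the most frequent value
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Numeric column: fill each gap with the median age of the same Title
toy['Age'] = toy['Age'].fillna(toy.groupby('Title')['Age'].transform('median'))

print(toy)
```

Assigning the filled column back, rather than calling `fillna(..., inplace=True)` on a column accessed as an attribute, avoids relying on in-place modification of a possibly copied column.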


In [None]:
df.drop(columns=['Cabin'], inplace=True)  # Cabin has too many missing values

In [None]:
df.loc['train'].Embarked.mode()

In [None]:
# fill the missing ports with the most frequent value from the cell above
df['Embarked'] = df['Embarked'].fillna("S")

In [None]:
groupby_Pclass = df.loc['train'].Fare.groupby(df.loc['train'].Pclass)
groupby_Pclass.mean()

In [None]:
# fill the missing fare with a hard-coded mean fare
df['Fare'] = df['Fare'].fillna(13.302889)

In [None]:
# Alternative: df['Age'] = df['Age'].fillna(df.loc['train'].Age.median())
# Fill missing ages with the median age of passengers sharing the same Title
median_age_by_title = df.groupby('Title')['Age'].transform('median')
df['Age'] = df['Age'].fillna(median_age_by_title)

In [None]:
df.isnull().sum()


## Data Analysis


* **Standardisation** 
* **Cross-Validation**
* **Training**
* **Results submission**


### Packages

In [None]:
import numpy as np
from sklearn_pandas import DataFrameMapper as DFM
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import GridSearchCV as KCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn import svm


### Dropping features, dividing train and test

In [None]:
df.drop(columns=['Parch', 'SibSp', 'Name','Ticket'], inplace=True)

In [None]:
# split train and test, data and target
# (copies, so that pop() below does not act on a view of df)
train_df, test_df = df.loc['train'].copy(), df.loc['test'].copy()

In [None]:
#adjust train set
train_predvalues = train_df.pop('Survived')
train_data = train_df

#adjust test set
test_data = test_df.drop(columns=['Survived'])
test_IDs = test_df.index.values


### Standardisation

In [None]:
mapper = DFM([(['Age', 'Fare', 'Pclass'], StandardScaler()),
              ('Sex'                , LabelBinarizer()), 
              ('Embarked'           , LabelBinarizer()),
              ('Title'              , LabelBinarizer())],
             default=None,
             df_out=True)

train_data = mapper.fit_transform(train_data)
test_data = mapper.transform(test_data)
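
The mapper is fitted on the training set only and then applied unchanged to the test set. A minimal sketch of what that means for a single numeric column, using plain pandas and made-up values (the names `mu` and `sigma` are illustrative):

```python
import pandas as pd

# Invented ages for a tiny train and test split
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
test = pd.DataFrame({'Age': [30.0, 50.0]})

# Statistics come from the training data only
mu = train['Age'].mean()
sigma = train['Age'].std(ddof=0)  # population std, as StandardScaler uses

# The same shift and scale are applied to both sets
train_scaled = (train['Age'] - mu) / sigma
test_scaled = (test['Age'] - mu) / sigma

print(train_scaled.tolist())
print(test_scaled.tolist())
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.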

### Cross-Validation and Training

In [None]:
param_grid = {
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma' : [0.001, 0.01, 0.1, 1]
}

grid_svc = KCV(svm.SVC(), param_grid, cv=10, scoring='accuracy')
grid_svc.fit(train_data, train_predvalues)

print('Best score: {}'.format(grid_svc.best_score_))
print('Best parameters: {}'.format(grid_svc.best_params_))

In [None]:
svc = svm.SVC(**grid_svc.best_params_).fit(train_data, train_predvalues)

### Saving Results 

In [None]:
#prep submission
res = pd.DataFrame({'PassengerId': test_IDs,
                    'Survived'   : svc.predict(test_data).astype(int)})

res.to_csv(os.path.join(path,'predictions.csv'), index=False)

