# Engineering Features

**Methods:**
>1. Load data
>2. Create feature for which deck they're on
>3. Create feature for family size
>4. Create features for familial status
>5. Add procedure to src file

In [1]:
import sys
sys.path.append('./../../src/')
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import Data_Loader

## 1. Load data

In [5]:
X, y = Data_Loader.load_training_data()

## 2. Create feature for which deck they're on

In [4]:
X.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck,FamilySize
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,UNK,2
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C,2
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,UNK,1
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C,2
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,UNK,1


The deck is taken from the first character of the Cabin value. A missing cabin is stored as NaN, and `str(NaN)` is `'nan'`, so its first character is `'n'`; that value is replaced with "UNK" for unknown:

In [4]:
X['Deck'] = X['Cabin'].map(lambda x: str(x)[0]).replace('n', 'UNK')
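
The cell above relies on `str(NaN)` beginning with `'n'`, so it would also relabel any cabin whose name starts with a lowercase "n". A possible alternative, sketched here on a toy Series rather than the real data, uses the pandas `.str` accessor, which leaves missing values as NaN so they can be filled explicitly:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cabin column (values taken from the rows shown in this notebook)
cabin = pd.Series(['C85', np.nan, 'C123', np.nan], name='Cabin')

# .str[0] takes the first character and keeps NaN for missing cabins,
# so fillna can label the unknowns without touching real deck letters
deck = cabin.str[0].fillna('UNK')
print(deck.tolist())
```

On the training data this would be `X['Deck'] = X['Cabin'].str[0].fillna('UNK')`, assuming `Cabin` holds only strings and NaN.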

In [5]:
X.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,UNK
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,UNK
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,UNK


## 3. Create feature for family size

The family size is the passenger's SibSp + Parch + 1, where the 1 counts the passenger themselves.

In [6]:
X['FamilySize'] = X['SibSp'] + X['Parch'] + 1

In [7]:
X.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck,FamilySize
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,UNK,2
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C,2
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,UNK,1
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C,2
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,UNK,1


## 4. Create features for familial status

For this, I want to determine whether the person is traveling alone, with a spouse, or with family.

* Alone is simple: these are the passengers whose family size == 1.

* With a spouse is a little more complex: the passenger needs to be roughly 18 or older, with Parch == 0 and SibSp == 1. Because SibSp counts siblings and spouses together, this is only an approximation.

* With family is everything else.

In [45]:
X.loc[(X['SibSp'] == 1)
     & (X['Parch'] == 0)].sort_values(['Name'])

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck,FamilySize
309,2,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0000,,C,UNK,2
875,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C,UNK,2
41,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.4750,,S,UNK,2
193,3,"Andersen-Jensen, Miss. Carla Christine Nielsine",female,19.0,1,0,350046,7.8542,,S,UNK,2
276,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,D,2
519,2,"Angle, Mrs. William A (Florence ""Mary"" Agnes H...",female,36.0,1,0,226875,26.0000,,S,UNK,2
354,3,"Arnold-Franchi, Mr. Josef",male,25.0,1,0,349237,17.8000,,S,UNK,2
50,3,"Arnold-Franchi, Mrs. Josef (Josefine Franchi)",female,18.0,1,0,349237,17.8000,,S,UNK,2
701,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.5250,C62 C64,C,C,2
207,3,"Backstrom, Mr. Karl Alfred",male,32.0,1,0,3101278,15.8500,,S,UNK,2


In [36]:
X.loc[(X['SibSp'] == 1)
     & (X['Parch'] == 0), 'Age'].dropna().map(lambda x: x >= 18).mean()

0.92156862745098034

About 92% of passengers with SibSp == 1 and Parch == 0 (and a known age) are 18 or older.

In [32]:
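def get_familial_status(passenger_data):
    # NOTE: as written, this condition raises the TypeError shown below
    if (passenger_data['SibSp'] == 1 
        & passenger_data['Parch'] == 0
        & passenger_data['Age'] >= 18):
        return 'with_spouse'
    # FamilySize counts the passenger, so a lone traveler has FamilySize == 1
    elif passenger_data['FamilySize'] == 1:
        return 'single'
    else:
        return 'with_family'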

In [33]:
X.apply(get_familial_status, axis=1)

TypeError: ("unsupported operand type(s) for &: 'int' and 'float'", u'occurred at index 3')

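This is more complex than anticipated. The error comes from operator precedence: `&` binds more tightly than `==` and `>=`, so Python ends up evaluating `0 & Age` on a float. I believe the feature is possible, but it will take more time.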
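
One way to get past the TypeError above is to build the labels from whole-column boolean masks instead of a row-wise `apply`. The following is a sketch under assumptions, not a finished feature: the `adult_age` parameter and the `FamilialStatus` name are hypothetical, the toy rows are made up for illustration, and passengers with a missing age fall into 'with_family' because `NaN >= 18` is False.

```python
import numpy as np
import pandas as pd

def get_familial_status(X, adult_age=18):
    """Label each passenger 'single', 'with_spouse', or 'with_family'."""
    alone = X['FamilySize'] == 1
    # Each comparison is parenthesized because & binds tighter than == and >=
    with_spouse = (X['SibSp'] == 1) & (X['Parch'] == 0) & (X['Age'] >= adult_age)
    # np.select picks the label of the first mask that is True for each row
    status = np.select([alone, with_spouse], ['single', 'with_spouse'],
                       default='with_family')
    return pd.Series(status, index=X.index, name='FamilialStatus')

# Toy rows (hypothetical values) covering each branch, plus a missing age
toy = pd.DataFrame(
    {'SibSp': [1, 0, 1, 1], 'Parch': [0, 0, 2, 0], 'Age': [22.0, 26.0, 35.0, np.nan]},
    index=pd.Index([1, 3, 101, 102], name='PassengerId'))
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1
print(get_familial_status(toy).tolist())
```

Whether missing ages should default to 'with_family' or get their own label is a modeling choice this sketch does not settle.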

### Exporting data

In [9]:
X['Survived'] = y

In [11]:
X.to_csv('./../../data/feature_data/train_with_fam_deck.csv')

## 5. Add procedure to src file
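
A minimal sketch of what this procedure could look like, assuming it simply bundles the two features built above into one function. The name `add_engineered_features` is hypothetical, and where it would live under `src/` is not specified in this notebook. Familial status is left out because it was not finished above.

```python
import numpy as np
import pandas as pd

def add_engineered_features(X):
    """Return a copy of X with the Deck and FamilySize columns added."""
    X = X.copy()  # leave the caller's DataFrame untouched
    # First character of Cabin; str(NaN) starts with 'n', which becomes 'UNK'
    X['Deck'] = X['Cabin'].map(lambda x: str(x)[0]).replace('n', 'UNK')
    # Siblings/spouses + parents/children + the passenger themselves
    X['FamilySize'] = X['SibSp'] + X['Parch'] + 1
    return X

# Toy input (hypothetical rows) with the three columns the function reads
toy = pd.DataFrame({'Cabin': ['C85', np.nan], 'SibSp': [1, 0], 'Parch': [0, 0]})
out = add_engineered_features(toy)
print(out[['Deck', 'FamilySize']])
```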