## Feature Selection

I want a reasonable hypothesis for why each feature is useful for predicting who lived or died on the Titanic.  Selected features will be added to a logistic regression model, and its predictive power will be compared to a benchmark.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", 20)

In [2]:
df = pd.read_csv(r"C:\titanic\train.csv")
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
867,868,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S
377,378,0,1,"Widener, Mr. Harry Elkins",male,27.0,0,2,113503,211.5,C82,C
157,158,0,3,"Corn, Mr. Harry",male,30.0,0,0,SOTON/OQ 392090,8.05,,S
641,642,1,1,"Sagesser, Mlle. Emma",female,24.0,0,0,PC 17477,69.3,B35,C
354,355,0,3,"Yousif, Mr. Wazli",male,,0,0,2647,7.225,,C


### Feature 1
"Sex" will be the first feature for inclusion given the assumption that priority is given to women in life or death situations and therefore is a useful feature for predicting alive/dead.  It is also the lone feature used to generate my benchmark to beat which states that all women survive and all men die.  

In [3]:
# Binarize the male/female labels in the original dataframe.
# LabelEncoder sorts its classes alphabetically, so female -> 0 and male -> 1.
def binarize_male_female(df, col, male_female):

    le = LabelEncoder()
    le.fit(male_female)
    binary_labels = le.transform(df[col].values)
    
    return binary_labels, le

In [4]:
binary_labels, le = binarize_male_female(df, 'Sex', ["male", "female"])

In [5]:
df['Sex_binary'] = binary_labels
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_binary
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1
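
The 0/1 assignment above follows from how `LabelEncoder` orders its classes. A minimal sketch of how the mapping could be checked, using a fresh encoder rather than the notebook's own `le` object:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the two known labels. The order passed to fit() does not matter,
# because LabelEncoder stores its classes in sorted order.
encoder = LabelEncoder()
encoder.fit(["male", "female"])

# Build an explicit label -> code mapping from the fitted classes.
codes = encoder.transform(encoder.classes_)
mapping = {str(label): int(code) for label, code in zip(encoder.classes_, codes)}
print(mapping)  # {'female': 0, 'male': 1}

# inverse_transform recovers the original labels from the integer codes.
print(encoder.inverse_transform([1, 0, 0]).tolist())  # ['male', 'female', 'female']
```

Knowing the direction of the encoding matters later, since the benchmark depends on which sex is coded as 0.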


### Feature 2
I hypothesize that a passenger's socioeconomic status will likely influence whether they survive, on the assumption that lower-class passengers were given lower priority for lifeboats than first-class passengers.  
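
One way to sanity-check a hypothesis like this before modelling is to compare survival rates across classes. The sketch below uses a small hypothetical frame (`toy`, invented for illustration and not drawn from the Titanic data) with the same column names:

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns used in this notebook.
# These values are made up for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate.
survival_by_class = toy.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

Running the same `groupby` on `df` instead of `toy` would show whether the real survival rate actually falls as `Pclass` increases.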

In [9]:
X = df.loc[:, ['Sex_binary', 'Pclass']].values
y = df.loc[:, 'Survived'].values
clf = LogisticRegression(random_state=1).fit(X, y)
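predictions = clf.predict(X)          # hard 0/1 predictions for each passenger
probabilities = clf.predict_proba(X)  # per-class probabilities for each passenger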
accuracy = clf.score(X, y)
print("Our two feature model predicts alive/dead on the training set with " + str(accuracy) + " success")

Our two feature model predicts alive/dead on the training set with 0.7867564534231201 success
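
The score above is measured on the same rows the model was fitted on, so it may be optimistic. A rough sketch of how cross-validation could give a held-out estimate is shown below. The data is a synthetic stand-in generated with NumPy (an assumption made so the example is self-contained); `X_toy` and `y_toy` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, NOT the Titanic set.
rng = np.random.default_rng(0)
n = 200
sex_binary = rng.integers(0, 2, size=n)  # 0 = female, 1 = male
pclass = rng.integers(1, 4, size=n)      # 1, 2 or 3

# Make survival more likely when sex_binary == 0 so there is signal to learn.
survive_prob = np.where(sex_binary == 0, 0.75, 0.2)
y_toy = (rng.random(n) < survive_prob).astype(int)
X_toy = np.column_stack([sex_binary, pclass])

# 5-fold cross-validation: each fold is scored on rows the model did not see.
scores = cross_val_score(LogisticRegression(random_state=1), X_toy, y_toy, cv=5)
print(scores.mean(), scores.std())
```

On the real data, passing the notebook's `X` and `y` in place of `X_toy` and `y_toy` would give a comparable held-out accuracy.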


### Benchmark 

In [7]:
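# Benchmark accuracy: predict survival (1) for women (Sex_binary == 0) and
# death (0) for men (Sex_binary == 1). The prediction is correct exactly
# when Sex_binary != Survived.
benchmark = (df['Sex_binary'] != df['Survived']).sum()/df.shape[0]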

In [10]:
np.abs(np.subtract(benchmark, accuracy))

0.0
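
A difference of exactly 0.0 suggests, but does not by itself prove, that the two-feature model makes the same predictions as the sex-only rule. A minimal sketch of an element-wise check follows. The helper names `rule_predictions` and `agreement_rate` are hypothetical, the `sex_binary` and `survived` arrays copy the five rows shown by `df.head()` above, and `model_pred` is an invented stand-in for model output.

```python
import numpy as np

def rule_predictions(sex_binary):
    """Sex-only benchmark: predict 1 (survived) for female (0) and 0 for male (1)."""
    return 1 - np.asarray(sex_binary)

def agreement_rate(pred_a, pred_b):
    """Fraction of positions where two prediction arrays are equal."""
    return float(np.mean(np.asarray(pred_a) == np.asarray(pred_b)))

# Sex_binary and Survived values from the five df.head() rows.
sex_binary = np.array([1, 0, 0, 0, 1])
survived = np.array([0, 1, 1, 1, 0])

rule_pred = rule_predictions(sex_binary)
print(agreement_rate(rule_pred, survived))    # benchmark accuracy on these five rows

# Hypothetical model output that disagrees with the rule on one passenger.
model_pred = np.array([0, 1, 1, 0, 0])
print(agreement_rate(rule_pred, model_pred))  # share of identical predictions
```

On the full data, `agreement_rate(rule_predictions(df['Sex_binary']), clf.predict(X))` would show whether adding `Pclass` changed any individual prediction.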