## MLearn 210: Titanic Kaggle Challenge - Kei and Aditi

The Titanic challenge is a binary classification problem: the model must assign each passenger to one of two classes, survived or did not survive. For this assignment, we will develop a logistic regression model that takes each observation and predicts which class it belongs to.
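
As a rough illustration of how a logistic regression turns one observation into a class, the sketch below applies a sigmoid to a weighted sum of features and thresholds the result at 0.5. The helper names, the two features, and the weights are hypothetical and chosen only for illustration; they are not fitted values from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return (label, probability); label is 1 (survived) when the
    modelled probability reaches the threshold, otherwise 0."""
    probability = sigmoid(np.dot(features, weights) + bias)
    return int(probability >= threshold), float(probability)

# Hypothetical weights for two illustrative features: Sex (1 = male) and Pclass.
# These are made-up numbers, not coefficients learned from the Titanic data.
example_weights = np.array([-2.5, -1.0])
example_bias = 3.0

# A male passenger in 3rd class: score = -2.5 - 3.0 + 3.0 = -2.5
label_male_third, prob_male_third = predict_class(
    np.array([1, 3]), example_weights, example_bias
)

# A female passenger in 1st class: score = 0.0 - 1.0 + 3.0 = 2.0
label_female_first, prob_female_first = predict_class(
    np.array([0, 1]), example_weights, example_bias
)

print(label_male_third, round(prob_male_third, 3))
print(label_female_first, round(prob_female_first, 3))
```

In scikit-learn, `LogisticRegression.predict` performs the equivalent thresholding once the weights have been learned from the training data.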

In [1]:
# Load the training and testing data
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Data Cleaning:
Before analyzing the data with a pairwise plot and a correlation matrix, we will clean it so that more columns can be included in both.
- Missing values for Age will be replaced with the mean age.
- Embarked will be encoded as numeric values; the two missing Embarked values in the training set will be replaced with the most commonly occurring value.
- Sex will be encoded as numeric values.
- The Cabin attribute will be removed from the dataset because it has too many missing values (687 in the training set).
- The code below also drops the PassengerId, Ticket, and Name columns.
   

In [2]:
import numpy as np
from sklearn import preprocessing

def printLabelEncoding(title, labels):
    print(title)
    for index, item in enumerate(labels):
        print(str(index) + ": " + item)

# Cleans data and prints a little info on what it does
def cleanData(data):
    featuresPlot = data.copy()
    print("Missing values\n", featuresPlot.isnull().sum())

    # drop cabin, passenger ID, Ticket, and Name columns
    featuresPlot = featuresPlot.drop(["Cabin", "PassengerId", "Ticket", "Name"], axis=1)

    # replace missing Age values with the mean age
    featuresPlot["Age"] = featuresPlot["Age"].fillna(featuresPlot["Age"].mean())

    # replace missing Embarked values with the most common Embarked value
    mostCommonEmbarkedVal = featuresPlot["Embarked"].mode()[0]
    featuresPlot["Embarked"] = featuresPlot["Embarked"].fillna(mostCommonEmbarkedVal)

    # encode string data - Sex and Embarked
    # (LabelEncoder assigns codes to the sorted class labels)
    le = preprocessing.LabelEncoder()
    featuresPlot["Sex"] = le.fit_transform(featuresPlot["Sex"])
    printLabelEncoding("\nEncoding for Sex", list(le.classes_))

    featuresPlot["Embarked"] = le.fit_transform(featuresPlot["Embarked"])
    printLabelEncoding("\nEncoding for Embarked", list(le.classes_))
    
    return featuresPlot

print("Training data")
trainClean = cleanData(train)
trainClean.head(20)

print("\nTest data")
testClean = cleanData(test)
testClean.head(20)


Training data
Missing values
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Encoding for Sex
0: female
1: male

Encoding for Embarked
0: C
1: Q
2: S

Test data
Missing values
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Encoding for Sex
0: female
1: male

Encoding for Embarked
0: C
1: Q
2: S


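| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 34.5 | 0 | 0 | 7.8292 | 1 |
| 1 | 3 | 0 | 47.0 | 1 | 0 | 7.0 | 2 |
| 2 | 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 1 |
| 3 | 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 2 |
| 4 | 3 | 0 | 22.0 | 1 | 1 | 12.2875 | 2 |
| 5 | 3 | 1 | 14.0 | 0 | 0 | 9.225 | 2 |
| 6 | 3 | 0 | 30.0 | 0 | 0 | 7.6292 | 1 |
| 7 | 2 | 1 | 26.0 | 1 | 1 | 29.0 | 2 |
| 8 | 3 | 0 | 18.0 | 0 | 0 | 7.2292 | 0 |
| 9 | 3 | 1 | 21.0 | 2 | 0 | 24.15 | 2 |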
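
Because `cleanData` is called separately on each frame, the fill values and label codes are learned independently for the training and test sets, and the single missing Fare in the test set is left unfilled. One possible alternative, sketched here on toy data, is to learn the cleaning parameters from the training frame only and reuse them on the test frame. The helper names `fit_cleaning_params` and `apply_cleaning`, the toy frames, and the choice of the training median for Fare are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def fit_cleaning_params(train_df):
    """Learn fill values and category codes from the training frame only."""
    return {
        "age_mean": train_df["Age"].mean(),
        "fare_median": train_df["Fare"].median(),  # assumed fill strategy for Fare
        "embarked_mode": train_df["Embarked"].mode()[0],
        "sex_codes": {
            v: i for i, v in enumerate(sorted(train_df["Sex"].dropna().unique()))
        },
        "embarked_codes": {
            v: i for i, v in enumerate(sorted(train_df["Embarked"].dropna().unique()))
        },
    }

def apply_cleaning(df, params):
    """Apply previously learned fill values and codes to any frame."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(params["age_mean"])
    out["Fare"] = out["Fare"].fillna(params["fare_median"])
    out["Embarked"] = (
        out["Embarked"].fillna(params["embarked_mode"]).map(params["embarked_codes"])
    )
    out["Sex"] = out["Sex"].map(params["sex_codes"])
    return out

# Toy stand-ins for the real train/test frames
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, float("nan"), 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "Q", "S", None],
})
toy_test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [float("nan"), 47.0],
    "Fare": [float("nan"), 7.0],
    "Embarked": ["Q", "S"],
})

params = fit_cleaning_params(toy_train)
clean_train = apply_cleaning(toy_train, params)
clean_test = apply_cleaning(toy_test, params)
print(clean_test)
```

With this arrangement a category code always means the same thing in both frames, even if a port happens to be absent from one of them.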


In [3]:
# Initial data exploration
import seaborn as sns

# Pair plot
sns.pairplot(trainClean)

# Correlation matrix
trainClean.corr(method='pearson')

| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| Survived | 1.0 | -0.338481 | -0.543351 | -0.069809 | -0.035322 | 0.081629 | 0.257307 | -0.167675 |
| Pclass | -0.338481 | 1.0 | 0.1319 | -0.331339 | 0.083081 | 0.018443 | -0.5495 | 0.162098 |
| Sex | -0.543351 | 0.1319 | 1.0 | 0.084153 | -0.114631 | -0.245489 | -0.182333 | 0.108262 |
| Age | -0.069809 | -0.331339 | 0.084153 | 1.0 | -0.232625 | -0.179191 | 0.091566 | -0.026749 |
| SibSp | -0.035322 | 0.083081 | -0.114631 | -0.232625 | 1.0 | 0.414838 | 0.159651 | 0.06823 |
| Parch | 0.081629 | 0.018443 | -0.245489 | -0.179191 | 0.414838 | 1.0 | 0.216225 | 0.039798 |
| Fare | 0.257307 | -0.5495 | -0.182333 | 0.091566 | 0.159651 | 0.216225 | 1.0 | -0.224719 |
| Embarked | -0.167675 | 0.162098 | 0.108262 | -0.026749 | 0.06823 | 0.039798 | -0.224719 | 1.0 |
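
To read a feature ordering off the matrix, the Survived column can be ranked by absolute value, since a strong negative correlation (such as Sex) is as informative as a strong positive one. The sketch below hard-codes the values from the matrix above so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Correlation of each feature with Survived, copied from the matrix above
corr_with_survived = pd.Series({
    "Pclass": -0.338481,
    "Sex": -0.543351,
    "Age": -0.069809,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
    "Embarked": -0.167675,
})

# Rank by magnitude, ignoring the sign of the relationship
ranked = corr_with_survived.abs().sort_values(ascending=False)
feature_order = list(ranked.index)
print(ranked)
```

On the cleaned frame itself, the same series could be obtained with `trainClean.corr()["Survived"].drop("Survived")`.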


We will split our data so that 80% is used for training and 20% for testing.
We plan to use forward feature selection to build our model. We will add the feature with the highest absolute correlation with Survived to the model, check its accuracy, then try the feature with the next highest correlation. If adding that feature yields higher accuracy, we will keep it in the model; otherwise we will move on to the next feature. Note that the cell below uses scikit-learn's RFECV (recursive feature elimination with cross-validation) rather than this forward procedure.
- This does not take into account how relationships among features may affect information gain, but we can optimize for that after an initial pass with the model.
- We intend to evaluate the model with ROC and overall accuracy; the cell below reports accuracy only.
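
The forward procedure described above could be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's actual pipeline: the function name `forward_select`, the use of cross-validated accuracy as the acceptance criterion, and `max_iter=1000` are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, cv=4):
    """Greedy forward selection: try features in order of absolute
    correlation with y, keeping each one only if it raises the mean
    cross-validated accuracy of a logistic regression."""
    order = X.corrwith(y).abs().sort_values(ascending=False).index
    selected, best_score = [], 0.0
    for feature in order:
        candidate = selected + [feature]
        model = LogisticRegression(solver="lbfgs", max_iter=1000)
        score = cross_val_score(
            model, X[candidate], y, cv=cv, scoring="accuracy"
        ).mean()
        if score > best_score:
            selected, best_score = candidate, score
    return selected, best_score

# Synthetic stand-in for the cleaned Titanic features
X_arr, y_arr = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
y_demo = pd.Series(y_arr, name="target")

selected, score = forward_select(X_demo, y_demo)
print("selected features:", selected)
print("mean CV accuracy:", round(score, 3))
```

With the real data, `X_demo` and `y_demo` would be replaced by the training features and the Survived column. As the first bullet notes, a correlation-ordered greedy pass can miss features that are only useful in combination.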


In [4]:
# Feature selection
# Recursive feature elimination with cross-validation (RFECV) around a logistic regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split the training data into training and test data
data = trainClean.copy()
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(["Survived"], axis=1),
    data.loc[:, "Survived"],
    test_size=0.2,
)


estimator = LogisticRegression(solver="lbfgs")
selector = RFECV(estimator, cv=4)
selector = selector.fit(X_train, y_train)

print("support", selector.support_ )
print("ranking", selector.ranking_)

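# Predict through the selector so X_test is reduced to the selected features first;
# calling selector.estimator_ directly fails whenever RFECV drops a feature
y_pred = selector.predict(X_test)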

print(accuracy_score(y_test, y_pred))




support [ True  True  True  True  True  True  True]
ranking [1 1 1 1 1 1 1]
0.8491620111731844
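
The cell above reports accuracy only. Since ROC was also mentioned as a metric, here is a minimal sketch of computing ROC AUC alongside accuracy. It uses synthetic data and a plain `LogisticRegression` so that it runs on its own; the variable names, `random_state`, and `max_iter` values are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)

# Accuracy uses hard class labels; ROC needs a continuous score,
# here the predicted probability of the positive class
y_pred = model.predict(X_te)
y_score = model.predict_proba(X_te)[:, 1]

accuracy = accuracy_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_score)
fpr, tpr, thresholds = roc_curve(y_te, y_score)

print("accuracy:", round(accuracy, 3))
print("ROC AUC:", round(auc, 3))
```

Accuracy depends on the 0.5 decision threshold, while ROC AUC summarizes ranking quality across all thresholds, so the two can disagree when the classes are imbalanced.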
