# Titanic Dataset Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Number of Objects = 1309 
Number of Features = 10
Train/test split is 2/3 train to 1/3 test

### Variable Definitions

| Variable | Definition | Key |
| --- | --- | --- |
| Survived | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Gender | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

In [1]:
from sklearn import tree
import random
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
import numpy as np
from sklearn import metrics

# Load Dataset

In [2]:
data = pd.read_csv('train.csv', index_col='PassengerId')

# Binarize features 

In [3]:
# drop the Name and Ticket columns, whose values are confusing as features
data.drop(['Name', 'Ticket'], axis=1, inplace=True)
# change the Cabin feature to a binary feature: 1 if a cabin is recorded, 0 if it is missing (NaN)
data['hasCabin'] = data['Cabin'].notna().astype(int)
data.drop(['Cabin'], axis=1, inplace=True)
# binarize the Sex feature: female = 0, male = 1
data['Sex'] = data['Sex'].map({'female': 0, 'male': 1}).astype(int)
# binarize the Pclass feature: 1 if 1st class, 0 if 2nd or 3rd class
data['Pclass'] = data['Pclass'].map({1: 1, 2: 0, 3: 0}).astype(int)
# binarize the Age feature: 0 if below the mean, 1 otherwise
# (missing ages also become 1, because NaN < mean is False)
age_mean = data['Age'].mean()
data['Age'] = data['Age'].apply(lambda x: 0 if x < age_mean else 1)
# binarize the Embarked feature: 1 if S, 0 otherwise
# I chose to binarize this way because of the unequal distribution of values
data['Embarked'] = data['Embarked'].fillna('S')
data['EmbarkedS'] = data['Embarked'].map({'S': 1, 'C': 0, 'Q': 0}).astype(int)
data.drop(['Embarked'], axis=1, inplace=True)
# binarize the Fare feature: 0 if below the mean, 1 otherwise
fare_mean = data['Fare'].mean()
data['Fare'] = data['Fare'].apply(lambda x: 0 if x < fare_mean else 1)
# combine Parch and SibSp into a single binary feature: 1 if travelling alone, 0 otherwise
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
data['isAlone'] = (data['FamilySize'] == 1).astype(int)
data.drop(['FamilySize', 'Parch', 'SibSp'], inplace=True, axis=1)
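
The mean-threshold rule used for `Age` has a side effect worth seeing on a small case: any comparison with `NaN` is false, so a missing age lands in the `1` bucket. The sketch below uses a made-up four-value series (not taken from `train.csv`) and the hypothetical names `ages`, `mean_age`, and `binarized`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value (not from the Titanic data).
ages = pd.Series([22.0, 38.0, np.nan, 4.0])

# Series.mean() skips NaN by default, so this is (22 + 38 + 4) / 3.
mean_age = ages.mean()

# Same rule as the notebook: 0 if below the mean, 1 otherwise.
# NaN < mean_age evaluates to False, so the missing age becomes 1.
binarized = ages.apply(lambda x: 0 if x < mean_age else 1)

print(binarized.tolist())
print("missing ages:", int(ages.isna().sum()))
```

If treating missing ages as "above the mean" is not intended, one option is to impute them before applying the threshold, which would change the fitted probabilities reported in this notebook.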

In [4]:
# split the data into training and testing sets
train_x, test_x, train_y, test_y = train_test_split(
    data.drop('Survived', axis=1), data['Survived'],
    test_size=0.33, random_state=42)

# Training BernoulliNB classifier

In [5]:
clf = BernoulliNB()
clf.fit(train_x, train_y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [6]:
print("train accuracy is ", clf.score(train_x,train_y))
print("test accuracy is ", clf.score(test_x,test_y))

train accuracy is  0.743288590604
test accuracy is  0.769491525424


In [7]:
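# log prior odds of class 0 (Survived = 0) over class 1 (Survived = 1);
# "positive" evidence below therefore means evidence for class 0
log_class_prior = clf.class_log_prior_[0]-clf.class_log_prior_[1]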

In [8]:
# feature_log_prob_[c, i] is log P(feature i = 1 | class c)
features_1_yes = [math.exp(x) for x in clf.feature_log_prob_[0, :]]  # P(x_i = 1 | class 0)
features_1_no = [math.exp(x) for x in clf.feature_log_prob_[1, :]]   # P(x_i = 1 | class 1)
features_0_yes = [1 - x for x in features_1_yes]                      # P(x_i = 0 | class 0)
features_0_no = [1 - x for x in features_1_no]                        # P(x_i = 0 | class 1)
# log-likelihood ratio (class 0 over class 1) contributed by each feature value
ln_features_0 = [np.log(features_0_yes[i] / features_0_no[i]) for i in range(len(features_0_no))]
ln_features_1 = [np.log(features_1_yes[i] / features_1_no[i]) for i in range(len(features_1_no))]
# ln_features[value][i] is the log-evidence of feature i taking the given value (0 or 1)
ln_features = [ln_features_0, ln_features_1]
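
As a sanity check on the decomposition above, the class prior log-odds plus the per-feature log-ratios should add up to the log-odds implied by `predict_proba`. The following is a minimal sketch on made-up binary data; the arrays `X` and `y` and the names `toy_clf`, `total_log_odds`, and `model_log_odds` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 8 objects, 3 binary features (not the Titanic set).
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_clf = BernoulliNB().fit(X, y)

# Prior log-odds of class 0 over class 1, as in the notebook.
log_prior_odds = toy_clf.class_log_prior_[0] - toy_clf.class_log_prior_[1]

# P(x_i = 1 | class) and P(x_i = 0 | class), one row per class.
p1 = np.exp(toy_clf.feature_log_prob_)
p0 = 1.0 - p1
log_ratio_if_1 = np.log(p1[0] / p1[1])
log_ratio_if_0 = np.log(p0[0] / p0[1])

# Pick the log-ratio that matches each observed feature value and sum.
x = X[0]
evidence = np.where(x == 1, log_ratio_if_1, log_ratio_if_0)
total_log_odds = log_prior_odds + evidence.sum()

# Log-odds implied by the model's own posterior for the same object.
proba = toy_clf.predict_proba(x.reshape(1, -1))[0]
model_log_odds = np.log(proba[0] / proba[1])

print(np.isclose(total_log_odds, model_log_odds))
```

The two quantities agree because Bernoulli naive Bayes scores each class as its log prior plus a sum of independent per-feature log-likelihoods, so the difference between the two class scores splits feature by feature.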

# Section 5:

In [9]:
def log_evidence_calc(obj):
    """Split an object's per-feature log-evidence into positive and negative parts.

    Each entry is a (log_ratio, feature_index) tuple. Positive values favour
    class 0 (Survived = 0); zero and negative values go to the negative list.
    """
    positive_log_evidence = []
    negative_log_evidence = []
    for i, value in enumerate(obj):
        if ln_features[value][i] > 0:
            positive_log_evidence.append((ln_features[value][i], i))
        else:
            negative_log_evidence.append((ln_features[value][i], i))
    return (positive_log_evidence, negative_log_evidence)

def top_features(ranked, obj):
    """Return (name, value, ...) for up to two ranked entries, or None if empty."""
    if not ranked:
        return None
    result = ()
    for _, i in ranked[:2]:
        name = train_x.columns[i]
        result += (name, obj[name])
    return result

def features_pos(pos, obj):
    # largest positive log-evidence first
    return top_features(sorted(pos, key=lambda tup: tup[0])[::-1], obj)

def features_neg(neg, obj):
    # most negative log-evidence first
    return top_features(sorted(neg, key=lambda tup: tup[0]), obj)

### 1. The most positive object with respect to the probabilities.

In [10]:
predicted_prob = clf.predict_proba(test_x)
index = np.argmax(predicted_prob[:,0])
obj = test_x.iloc[index,]
print(obj)
print("total positive log-evidence= ",log_class_prior+sum(i for i, _ in log_evidence_calc(obj)[0]))
print("total negative log-evidence= ",sum(i for i, _ in log_evidence_calc(obj)[1]))
print("top 2 features values that contribute most to the positive evidence ", features_pos(log_evidence_calc(obj)[0],obj))
print("top 2 features values that contribute most to the negative evidence ", features_neg(log_evidence_calc(obj)[1],obj))
print("probability distribution = ", predicted_prob[index])

Pclass       0
Sex          1
Age          1
Fare         0
hasCabin     0
EmbarkedS    1
isAlone      1
Name: 440, dtype: int64
total positive log-evidence=  2.96272381509
total negative log-evidence=  0
top 2 features values that contribute most to the positive evidence  ('Sex', 1, 'hasCabin', 0)
top 2 features values that contribute most to the negative evidence  None
probability distribution =  [ 0.95086142  0.04913858]


### 2. The most negative object with respect to the probabilities.

In [11]:
index = np.argmax(predicted_prob[:,1])
obj = test_x.iloc[index]
print(obj)
print("total positive log-evidence= ",log_class_prior+sum(i for i, _ in log_evidence_calc(obj)[0]))
print("total negative log-evidence= ",sum(i for i, _ in log_evidence_calc(obj)[1]))
print("top 2 features values that contribute most to the positive evidence ", features_pos(log_evidence_calc(obj)[0],obj))
print("top 2 features values that contribute most to the negative evidence ", features_neg(log_evidence_calc(obj)[1],obj))
print("probability distribution = ", predicted_prob[index])

Pclass       1
Sex          0
Age          0
Fare         1
hasCabin     1
EmbarkedS    0
isAlone      0
Name: 312, dtype: int64
total positive log-evidence=  0.521578415542
total negative log-evidence=  -5.27727683209
top 2 features values that contribute most to the positive evidence  None
top 2 features values that contribute most to the negative evidence  ('Sex', 0, 'hasCabin', 1)
probability distribution =  [ 0.00852916  0.99147084]


In [12]:
pos_evidence_objects_list = []
neg_evidence_objects_list = []
for row in range(len(test_x)):
    # compute the evidence once per object and reuse it for both totals
    pos, neg = log_evidence_calc(test_x.iloc[row])
    pos_evidence_objects_list.append(log_class_prior + sum(e for e, _ in pos))
    neg_evidence_objects_list.append(sum(e for e, _ in neg))
pos_evidence_objects_list = np.array(pos_evidence_objects_list)
neg_evidence_objects_list = np.array(neg_evidence_objects_list)
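
The per-object loop can also be written as array operations. This is only a sketch under assumptions: `ln_table` holds made-up log-ratios for three features (row 0 for feature value 0, row 1 for value 1), and `X` is a made-up binary matrix. It leaves out the class prior, which the notebook adds to the positive total.

```python
import numpy as np

# Hypothetical log-ratio table: row 0 applies when a feature is 0, row 1 when it is 1.
ln_table = np.array([[0.4, -0.2, 0.1],
                     [-0.9, 0.7, -0.3]])

# Hypothetical binary objects, one per row.
X = np.array([[1, 0, 1],
              [0, 1, 0]])

# For every object and feature, select the log-ratio of the observed value.
contrib = np.where(X == 1, ln_table[1], ln_table[0])

# Sum the positive and the negative contributions separately per object.
pos_total = np.clip(contrib, 0, None).sum(axis=1)
neg_total = np.clip(contrib, None, 0).sum(axis=1)

print(pos_total, neg_total)
```

`np.argmax(pos_total)` and `np.argmin(neg_total)` then pick out the objects with the largest positive and the largest (in magnitude) negative evidence, mirroring the selections made in the next cells.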

### 3. The object that has the largest positive evidence.

In [13]:
index = np.argmax(pos_evidence_objects_list)
obj = test_x.iloc[index]
print(obj)
print("total positive log-evidence= ",log_class_prior+sum(i for i, _ in log_evidence_calc(obj)[0]))
print("total negative log-evidence= ",sum(i for i, _ in log_evidence_calc(obj)[1]))
print("top 2 features values that contribute most to the positive evidence ", features_pos(log_evidence_calc(obj)[0],obj))
print("top 2 features values that contribute most to the negative evidence ", features_neg(log_evidence_calc(obj)[1],obj))
print("probability distribution = ", predicted_prob[index,:])

Pclass       0
Sex          1
Age          1
Fare         0
hasCabin     0
EmbarkedS    1
isAlone      1
Name: 440, dtype: int64
total positive log-evidence=  2.96272381509
total negative log-evidence=  0
top 2 features values that contribute most to the positive evidence  ('Sex', 1, 'hasCabin', 0)
top 2 features values that contribute most to the negative evidence  None
probability distribution =  [ 0.95086142  0.04913858]


### 4. The object that has the largest (in magnitude) negative evidence.

In [14]:
index = np.argmin(neg_evidence_objects_list)
obj = test_x.iloc[index]
print(obj)
print("total positive log-evidence= ",log_class_prior+sum(i for i, _ in log_evidence_calc(obj)[0]))
print("total negative log-evidence= ",sum(i for i, _ in log_evidence_calc(obj)[1]))
print("top 2 features values that contribute most to the positive evidence ", features_pos(log_evidence_calc(obj)[0],obj))
print("top 2 features values that contribute most to the negative evidence ", features_neg(log_evidence_calc(obj)[1],obj))
print("probability distribution = ", predicted_prob[index,:])

Pclass       1
Sex          0
Age          0
Fare         1
hasCabin     1
EmbarkedS    0
isAlone      0
Name: 312, dtype: int64
total positive log-evidence=  0.521578415542
total negative log-evidence=  -5.27727683209
top 2 features values that contribute most to the positive evidence  None
top 2 features values that contribute most to the negative evidence  ('Sex', 0, 'hasCabin', 1)
probability distribution =  [ 0.00852916  0.99147084]


#### The total positive evidence is non-zero even though no feature contributes positive evidence; that value comes entirely from the class prior.

### 5. The most uncertain object (the probabilities are closest to 0.5)

In [15]:
clearPRE = np.abs(np.add(predicted_prob[:,1],-0.5))
index = np.argmin(clearPRE)
obj = test_x.iloc[index]
print(obj)
print("total positive log-evidence= ",log_class_prior+sum(i for i, _ in log_evidence_calc(obj)[0]))
print("total negative log-evidence= ",sum(i for i, _ in log_evidence_calc(obj)[1]))
print("top 2 features values that contribute most to the positive evidence ", features_pos(log_evidence_calc(obj)[0],obj))
print("top 2 features values that contribute most to the negative evidence ", features_neg(log_evidence_calc(obj)[1],obj))
print("probability distribution = ", predicted_prob[index])

Pclass       1
Sex          1
Age          1
Fare         1
hasCabin     0
EmbarkedS    0
isAlone      1
Name: 494, dtype: int64
total positive log-evidence=  2.21020960302
total negative log-evidence=  -2.15556740553
top 2 features values that contribute most to the positive evidence  ('Sex', 1, 'hasCabin', 0)
top 2 features values that contribute most to the negative evidence  ('Pclass', 1, 'Fare', 1)
probability distribution =  [ 0.51365715  0.48634285]
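
The totals printed above can be cross-checked against the printed probabilities: the sum of positive and negative log-evidence is the log-odds of class 0, and passing it through the logistic function should recover the first entry of each probability distribution. The helper name `evidence_to_prob` and the `reported` dictionary are assumptions for illustration; the numbers are copied from the outputs for objects 440, 312, and 494.

```python
import numpy as np

def evidence_to_prob(pos_total, neg_total):
    """Hypothetical helper: logistic function of the summed log-evidence."""
    log_odds = pos_total + neg_total
    return 1.0 / (1.0 + np.exp(-log_odds))

# (total positive log-evidence, total negative log-evidence) as printed above.
reported = {
    440: (2.96272381509, 0.0),
    312: (0.521578415542, -5.27727683209),
    494: (2.21020960302, -2.15556740553),
}

# Probability of class 0 implied by each pair of totals.
probs = {name: evidence_to_prob(p, n) for name, (p, n) in reported.items()}

for name, p in probs.items():
    print(name, round(float(p), 6))
```

For object 494 the two totals nearly cancel, which is why its probability sits so close to 0.5.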
