# Titanic: Machine Learning from Disaster
[Kaggle page for this problem](https://www.kaggle.com/c/titanic/data)

*Amir Hossein Binesh*, Amir Kabir University of Tehran

---

## Part 1 : Read data, visualize and preprocess

**Reading the data**

In [None]:
import pandas as pd

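# Read the training data
train_df = pd.read_csv('train.csv', index_col='PassengerId')
# Read the test features and the labels used here as "answers".
# Note: gender_submission.csv is Kaggle's sample submission (it assumes that
# all and only the female passengers survive), not the true test labels.
test_features = pd.read_csv('test.csv', index_col='PassengerId')
test_classes = pd.read_csv('gender_submission.csv', index_col='PassengerId')

# Join the test features and labels column-wise, aligned on PassengerId
test_df = pd.concat([test_features, test_classes], axis=1)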
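The column-wise `pd.concat` relies on both frames sharing the `PassengerId` index. A minimal sketch of that alignment, using made-up ids and values (the `toy_*` names are hypothetical and not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frames; the ids and values are invented for illustration.
toy_features = pd.DataFrame(
    {"Age": [22.0, 38.0]},
    index=pd.Index([892, 893], name="PassengerId"),
)
# The labels are deliberately listed in a different row order.
toy_labels = pd.DataFrame(
    {"Survived": [1, 0]},
    index=pd.Index([893, 892], name="PassengerId"),
)

# axis=1 places the columns side by side and matches rows by index label,
# not by position.
toy_joined = pd.concat([toy_features, toy_labels], axis=1)
print(toy_joined)
```

Because rows are matched by label, a different row order in the two files does not mix up passengers.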

In [None]:
train_df

In [None]:
# Number of columns and rows in the training data
# (the column count includes the Survived label)
print("Number of features : ", len(train_df.columns))
print("Number of rows : ", len(train_df.index))

In [None]:
# Number of columns and rows in the test data
# (the column count includes the Survived label)
print("Number of features : ", len(test_df.columns))
print("Number of rows : ", len(test_df.index))

----

**Visualize the data**
        
A little common sense, plus what I vaguely remember of Titanic the movie, is enough to examine the features and get a grasp of the data.
This helps with choosing features in the next steps.

"Women and children first" gives a fairly easy clue: sex and age.
The other thing is money. The Pclass, Cabin, Fare and Ticket features all point to wealth, which also lines up with the movie, so we have to keep an eye on them.

By the way, I couldn't find Jack in the data set, so maybe it's an alternative reality, who knows?
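The "women and children first" hunch can also be checked with numbers rather than plots. A minimal sketch, assuming a frame with the same `Sex`, `Age` and `Survived` columns; the rows below are invented toy data, and the under-16 cutoff for "child" is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical toy rows that only mimic the Kaggle columns; not real passengers.
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [29, 8, 40, 35, 5, 52],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()

# Label each row as child or adult (cutoff of 16 is an assumption).
toy_df["AgeGroup"] = toy_df["Age"].lt(16).map({True: "child", False: "adult"})
rate_by_age = toy_df.groupby("AgeGroup")["Survived"].mean()

print(rate_by_sex)
print(rate_by_age)
```

Running the same two `groupby` calls on `train_df` before it is re-encoded would give the actual rates.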

In [None]:
import matplotlib.pyplot as plt

# Survival and death counts for men and women
dead_df = train_df[train_df['Survived'] == 0]
alive_df = train_df[train_df['Survived'] == 1]
# All passengers are drawn in green and the dead are overlaid in red,
# so the green that stays visible corresponds to the survivors
plt.hist(train_df['Sex'].values, histtype='bar', bins=4, color = "green")
plt.hist(dead_df['Sex'].values, histtype='bar', bins=4, color = "red")

plt.ylabel('Count')
plt.legend(('Survived', 'Dead'))

plt.show()

In [None]:
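# Same overlay trick for age: all passengers in green, the dead in red on top,
# so the visible green part of each bar is the number of survivors
plt.hist(train_df['Age'].values, bins=5, color = "green", range = (0, 80))
plt.hist(dead_df['Age'].values, bins=5, color = "red", range = (0, 80))

plt.ylabel('Count')
plt.xlabel('Age')
plt.legend(('Survived', 'Dead'))

plt.show()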

In [None]:
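# Set the figure size before plotting, otherwise it only affects later figures
plt.rcParams["figure.figsize"] = [10,5]

# Density estimate of the fare for survivors (green) and the dead (red)
ax = pd.DataFrame(alive_df['Fare']).plot(kind='density', color = "green")
ax.set_xlim(-100, 500)
pd.DataFrame(dead_df['Fare']).plot(ax=ax, kind='density', color = "red")

plt.ylabel('Density')
plt.xlabel('Fare')
plt.legend(('Survived', 'Dead'))
plt.show()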


---

**Data Preprocessing**

As mentioned before, the Pclass, Cabin, Fare and Ticket features can all be summed up as one underlying feature: wealth. A NaN in Cabin is probably a worker; we can check this with a correlation against Fare, which is done below.

So Pclass and Fare can represent wealth.

The other features seem irrelevant to me. We could use Viktor Frankl's teachings to justify adding SibSp and Parch, but I don't think that's going to work here.
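The guess about missing Cabin values can be probed directly by comparing the fare and class of rows with and without a recorded cabin. A minimal sketch on invented toy rows (the `toy_df` frame and the `CabinStatus` column are hypothetical names, and the numbers are not taken from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that only mimic the Kaggle columns; not real passengers.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.5, 7.0],
    "Cabin": ["C85", "B20", np.nan, np.nan, np.nan, np.nan],
})

# Label each row by whether its Cabin entry is missing.
toy_df["CabinStatus"] = np.where(toy_df["Cabin"].isna(), "missing", "recorded")

# Median fare, mean class and row count for each group.
summary = toy_df.groupby("CabinStatus").agg(
    median_fare=("Fare", "median"),
    mean_pclass=("Pclass", "mean"),
    count=("Fare", "size"),
)
print(summary)
```

If the same summary on `train_df` shows clearly lower fares and higher class numbers for the missing group, a missing cabin is acting as another proxy for wealth.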

In [None]:
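# Kendall rank correlation between the numeric columns.
# numeric_only skips text columns such as Name and Sex, which recent
# pandas versions would otherwise refuse to correlate.
train_df.corr(method ='kendall', numeric_only=True)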

Correlation(Pclass, Fare) = -0.57, which is a fairly strong negative correlation, but not strong enough to drop one of the two in favor of the other.

In [None]:
# Count null data
train_df.isnull().sum()

In [None]:
# Count null data for test data
test_df.isnull().sum()

In [None]:
# Fill NaN ages with the rounded mean age of each data set
# (Series.mean skips NaN by default)
mean_age = round(train_df['Age'].mean())
train_df['Age'] = train_df['Age'].fillna(mean_age)
mean_age = round(test_df['Age'].mean())
test_df['Age'] = test_df['Age'].fillna(mean_age)

# Fill the other NaNs with placeholder values
train_df['Cabin'] = train_df['Cabin'].fillna("NoRoom")
train_df['Embarked'] = train_df['Embarked'].fillna("U")
test_df['Cabin'] = test_df['Cabin'].fillna("NoRoom")
test_df['Embarked'] = test_df['Embarked'].fillna("U")

# The test data has one NaN in Fare; replace it with the rounded mean fare
mean_fare = round(test_df['Fare'].mean())
test_df['Fare'] = test_df['Fare'].fillna(mean_fare)
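A common alternative to computing a separate mean for each data set is to compute the fill value once on the training data and reuse it for the test data, so that no test statistics enter preprocessing. This is a swapped-in variant, not what the cell above does; a minimal sketch on invented toy frames (`toy_train` and `toy_test` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the ages are invented for illustration.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 30.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the fill value from the training rows only.
fill_age = round(toy_train["Age"].mean())

# Apply the same value to both frames.
toy_train["Age"] = toy_train["Age"].fillna(fill_age)
toy_test["Age"] = toy_test["Age"].fillna(fill_age)

print(toy_train)
print(toy_test)
```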

In [None]:
test_df.isnull().sum()

In [None]:
# The commented-out code below turns a column into a numeric column.
# Since we decided not to use Cabin, it is disabled.

# from sklearn.preprocessing import LabelEncoder
# enc = LabelEncoder()
# enc.fit(train_df['Cabin'])
# train_df['Cabin'] = enc.transform(train_df['Cabin'])

# Encode the categorical columns as numbers. Train and test must use the
# same codes, and the result is assigned back instead of using inplace=True
# on a column, which is not reliable in recent pandas versions.
# Note: run this cell only once, since map() turns already-encoded values into NaN.
train_df['Sex'] = train_df['Sex'].map({'female': -1, 'male': 1})
train_df['Embarked'] = train_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2, 'U': 1})
train_df['Survived'] = train_df['Survived'].map({0: -1, 1: 1})

test_df['Sex'] = test_df['Sex'].map({'female': -1, 'male': 1})
test_df['Embarked'] = test_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2, 'U': 1})
test_df['Survived'] = test_df['Survived'].map({0: -1, 1: 1})
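A tree trained with one set of codes and tested with another would quietly give misleading results, so it helps to define the codes once and apply them to both frames. A minimal sketch, assuming the same code values as the notebook; the `SEX_CODES`, `EMBARKED_CODES` and `encode` names are hypothetical helpers, and the toy frames are invented:

```python
import pandas as pd

# One shared definition of the codes (values follow the notebook's train encoding).
SEX_CODES = {"female": -1, "male": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2, "U": 1}


def encode(df):
    """Return a copy of df with Sex and Embarked mapped to numeric codes."""
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    out["Embarked"] = out["Embarked"].map(EMBARKED_CODES)
    return out


# Hypothetical toy frames standing in for the train and test data.
toy_train = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "U"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["Q", "C"]})

enc_train = encode(toy_train)
enc_test = encode(toy_test)
print(enc_train)
print(enc_test)
```

Since both frames go through the same function, the encodings cannot drift apart when one of the codes is changed later.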

In [None]:
train_df

---

## Part 2 : Training and testing the decision tree

**Training**

In [None]:
from sklearn import tree

# == SETTINGS ==
MAX_DEPTH = 3
CRITERION = "entropy"

features = ['Sex', 'Age', 'Fare', 'Pclass']
X = train_df[features]
y = train_df['Survived']

clf = tree.DecisionTreeClassifier(random_state=0, max_depth=MAX_DEPTH, criterion=CRITERION)
clf = clf.fit(X, y)

**Accuracy Calculation**

In [None]:
from sklearn.metrics import accuracy_score

# Accuracy on the training data itself (an optimistic estimate)
accuracy_score(y, clf.predict(train_df[features]))
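Accuracy measured on the rows the tree was trained on tends to be optimistic. One common way to get a more honest estimate is to hold out part of the training data for validation. A minimal sketch on synthetic data (the `toy_*` names, the 20% label noise and the 25% hold-out size are all assumptions made for illustration), with the same tree settings as above:

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training data; not real passengers.
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "Sex": rng.choice([-1, 1], size=n_rows),
    "Age": rng.uniform(1, 80, size=n_rows),
})
# Label rule: women (-1) survive (1), men die (-1), with about 20% of labels flipped.
clean_y = np.where(toy_X["Sex"] == -1, 1, -1)
flip = rng.random(n_rows) < 0.2
toy_y = np.where(flip, -clean_y, clean_y)

# Keep 25% of the rows aside; the tree never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

toy_clf = tree.DecisionTreeClassifier(random_state=0, max_depth=3, criterion="entropy")
toy_clf.fit(X_fit, y_fit)

fit_acc = accuracy_score(y_fit, toy_clf.predict(X_fit))
val_acc = accuracy_score(y_val, toy_clf.predict(X_val))
print("accuracy on fitted rows :", fit_acc)
print("accuracy on held-out rows:", val_acc)
```

Applied to `train_df`, the held-out score would be an estimate that depends only on the true labels in `train.csv`.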

**Tree Visualization**

In [None]:
import graphviz

file_name = "Result-" + str(MAX_DEPTH) + "-" + CRITERION

# class_names must follow the order of clf.classes_, which is sorted
# ascending: -1 (dead) comes first, then 1 (survived)
dot_data = tree.export_graphviz(clf, out_file=None,
                     feature_names=features,
                     class_names=['Dead', 'Survived'],
                     filled=True, rounded=True,
                     special_characters=True)
graph = graphviz.Source(dot_data)
# Save the rendered tree to disk, then display it inline
graph.render(file_name)
graph
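When Graphviz is not installed, scikit-learn's `tree.export_text` gives a plain-text view of the same kind of tree. A minimal sketch on a tiny invented data set (the `toy_*` names and values are hypothetical); it also prints `classes_`, the sorted label order that `class_names` has to follow:

```python
from sklearn import tree

# Hypothetical toy data: columns are Sex (-1 female, 1 male) and Age.
toy_X = [[-1, 30.0], [-1, 50.0], [1, 40.0], [1, 20.0]]
toy_y = [1, 1, -1, -1]  # 1 = survived, -1 = dead

toy_clf = tree.DecisionTreeClassifier(random_state=0, max_depth=3, criterion="entropy")
toy_clf.fit(toy_X, toy_y)

# Labels are stored in ascending order, so -1 (dead) comes before 1 (survived).
print(toy_clf.classes_)

# Text rendering of the fitted tree, one line per node.
rules = tree.export_text(toy_clf, feature_names=["Sex", "Age"])
print(rules)
```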

**Testing**

In [None]:
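# Predict on the test set. The labels here come from gender_submission.csv,
# Kaggle's sample submission, so the score below measures agreement with
# that file rather than true accuracy.
X = test_df[features]
y = test_df['Survived']
predicc = clf.predict(X)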

In [None]:
accuracy_score(y, predicc)