Join the competition: https://www.kaggle.com/c/titanic/data 

### Let's make our predictions! 

Predictions are about factors that passengers share, not about individuals. We don't want to predict whether someone is called "John" or has any other identifying property. We want to predict whether a middle-class, married father in his thirties had a better chance of surviving than an elderly lower-class passenger travelling alone. We can therefore drop identifiers and variables that can be inferred from other variables: 'PassengerId', 'Name', 'Ticket', 'Cabin'. However, 'PassengerId' is kept in the test set, because our solution is a file that maps each 'PassengerId' to a 'Survived' prediction.

### Simplicity

Let's build a simple and efficient model. We will skip some extra feature engineering that could make the models more accurate. For example, to capture not only a person's social class but also the influence they had, we could build a variable from the title that appears in the 'Name' variable, the fare they paid and/or the port where they embarked. For the same reason, 'Fare' and 'Embarked' are dropped in the code below.
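
As an illustration of the title idea mentioned above (it is not used in the rest of this notebook), here is a minimal sketch that pulls the title out of names written in the format of the 'Name' column. The regular expression and the new 'Title' column are assumptions made for this example.

```python
import pandas as pd

# Three names in the "Surname, Title. Given names" format of the 'Name' column.
names = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Capture the text between the comma and the first period that follows it.
# 'Title' is a hypothetical column name chosen for this sketch.
names["Title"] = (
    names["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(names["Title"].tolist())
```

Rare titles would probably need to be grouped before being fed to a model, which is part of the extra work this notebook deliberately skips.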

### Playing with the Kaggle filters

We can start discovering the variables on the dataset by playing with the filters that the Kaggle website provides. 

Characteristics:
- More likely to have survived: Women (Sex=female), children (Age<?), the upper-class passengers (Pclass=1).
- The training set has 891 samples, about 40% of the actual number of passengers on board the Titanic (2,224).
- Around 38% of the samples survived, compared with the actual survival rate of 32%.
- Most passengers (> 75%) did not travel with parents or children.
- Nearly 30% of the passengers had siblings and/or a spouse aboard.
- Fares varied significantly, with few passengers (<1%) paying as much as $512.
- Few elderly passengers (<1%) were in the 65-80 age range.
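
The same kind of characteristics can also be checked with pandas instead of the website filters. The sketch below uses a tiny made-up frame (not rows from train.csv) so that it runs on its own; with the real data, the same calls could be applied to the training frame.

```python
import pandas as pd

# Made-up rows, NOT taken from train.csv; they only make the sketch runnable.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Pclass": [1, 3, 2, 3, 1, 3],
    "Parch": [0, 0, 1, 0, 0, 2],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Share of passengers travelling without parents or children.
share_no_parch = (toy["Parch"] == 0).mean()

print(rate_by_sex)
print(rate_by_class)
print(share_no_parch)
```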

## Importing tools and data

In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
train_df.describe()
# get column names
train_df.columns.values
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
train2 = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Fare', 'Embarked'], axis=1)

test2 = test_df.drop(['Name', 'Ticket', 'Cabin', 'Fare', 'Embarked'], axis= 1)

df2 = [train2, test2]

In [5]:
# Making Sex a binary dummy variable
for dataset in df2:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

In [6]:
# We need to fill the NaN values for age
for dataset in df2:
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
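
The loop above fills each frame with its own mean age. A common alternative, sketched here under the assumption that the test set should not influence preprocessing, is to compute the mean on the training data only and reuse it for the test data. The frames `toy_train` and `toy_test` and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for the real 'Age' columns.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 10.0]})

# Learn the fill value from the training data only...
train_age_mean = toy_train["Age"].mean()

# ...and reuse that same value for both frames.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)

print(toy_train["Age"].tolist(), toy_test["Age"].tolist())
```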

In [7]:
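# Check that no missing ages remain ('dataset' still refers to test2, the last frame looped over)
dataset['Age'].isna().sum()

# Convert ages to integers
for dataset in df2:
    dataset['Age'] = dataset['Age'].astype(int)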

In [8]:
#train = train.dropna()

train2.isna().sum()
train2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int32
Age         891 non-null int32
SibSp       891 non-null int64
Parch       891 non-null int64
dtypes: int32(2), int64(4)
memory usage: 34.9 KB


In [9]:
#Model variables 

X_train2 = train2.drop(["Survived"], axis=1)
Y_train2 = train2['Survived']
X_test2  = test2.drop(["PassengerId"], axis=1)
# Note: Y_test2 holds the test passenger ids, not labels (the test set has no 'Survived' column)
Y_test2 = test2["PassengerId"]

X_train2.shape, Y_train2.shape, X_test2.shape, Y_test2.shape

((891, 5), (891,), (418, 5), (418,))

In [10]:
X_train2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
Pclass    891 non-null int64
Sex       891 non-null int32
Age       891 non-null int32
SibSp     891 non-null int64
Parch     891 non-null int64
dtypes: int32(2), int64(3)
memory usage: 27.9 KB


In [11]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train2, Y_train2)
Y_pred = decision_tree.predict(X_test2)
# Accuracy measured on the training data itself, not on unseen passengers
acc_decision_tree = round(decision_tree.score(X_train2, Y_train2) * 100, 2)
acc_decision_tree

91.58

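## Am I overfitting? 
Hmm. How can we know? Decision trees don't handle NaN values well. Let's see what happens if, instead of filling the missing 'Age' values with the mean, we simply delete the rows that contain NaN. If we obtain a higher score, we could be more confident that our previous model is not overfitting. Keep in mind, though, that both scores are computed on the training data, so the comparison is only a rough indication and does not measure performance on unseen passengers.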


In [12]:
train3 = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Fare', 'Embarked'], axis=1)

test3 = test_df.drop(['Name', 'Ticket', 'Cabin', 'Fare', 'Embarked'], axis= 1)

df3 = [train3, test3]

df3

[     Survived  Pclass     Sex   Age  SibSp  Parch
 0           0       3    male  22.0      1      0
 1           1       1  female  38.0      1      0
 2           1       3  female  26.0      0      0
 3           1       1  female  35.0      1      0
 4           0       3    male  35.0      0      0
 5           0       3    male   NaN      0      0
 6           0       1    male  54.0      0      0
 7           0       3    male   2.0      3      1
 8           1       3  female  27.0      0      2
 9           1       2  female  14.0      1      0
 10          1       3  female   4.0      1      1
 11          1       1  female  58.0      0      0
 12          0       3    male  20.0      0      0
 13          0       3    male  39.0      1      5
 14          0       3  female  14.0      0      0
 15          1       2  female  55.0      0      0
 16          0       3    male   2.0      4      1
 17          1       2    male   NaN      0      0
 18          0       3  female 

In [13]:
# Making Sex a binary dummy variable
for dataset in df3:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

In [14]:
train3 = train3.dropna()
    
train3.isna().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
dtype: int64

In [15]:
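# Reassigning the loop variable would leave the frames in df3 unchanged,
# so rebuild the list from the filtered copies instead
df3 = [dataset.dropna(axis=0, how='any') for dataset in df3]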

In [16]:
test3 = test3.dropna()
    
test3.isna().sum()

PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
dtype: int64

In [17]:
#Model variables 

X_train3 = train3.drop(["Survived"], axis=1)
Y_train3 = train3['Survived']
X_test3  = test3.drop(["PassengerId"], axis=1)
# As before, Y_test3 holds passenger ids, not labels
Y_test3 = test3["PassengerId"]

X_train3.shape, Y_train3.shape, X_test3.shape, Y_test3.shape

((714, 5), (714,), (332, 5), (332,))

In [18]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train3, Y_train3)
Y_pred1 = decision_tree.predict(X_test3)
# Again, accuracy on the training rows (now only the 714 rows with a known age)
acc_decision_tree1 = round(decision_tree.score(X_train3, Y_train3) * 100, 2)
acc_decision_tree1

93.28
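
Both scores above are computed on the same rows the tree was fitted on, so they cannot reveal overfitting by themselves. One common way to get a more honest estimate is k-fold cross-validation. The sketch below is not part of the original notebook; it uses random stand-in data (`X_demo`, `y_demo`) so that it runs on its own, and with the real data those two would be replaced by `X_train2` and `Y_train2`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train2 / Y_train2 (random numbers, not Titanic data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(200, 5))
y_demo = (X_demo[:, 0] + rng.integers(0, 3, size=200) > 3).astype(int)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the rows the tree was fitted on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 folds, each scored on rows held out from fitting.
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy").mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

A large gap between the two numbers would suggest that the tree is memorising the training rows. Limiting the tree's depth (for example with the `max_depth` parameter) is one option to try in that case.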

## Submission 

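We will submit the requested dataframe: 418 entries plus a header row. The predictions come from the first model (Y_pred), because dropping the rows with missing ages leaves only 332 test passengers.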

In [21]:
submission = pd.DataFrame({
        "PassengerId": test2["PassengerId"],
        "Survived": Y_pred
    })


#submission

In [22]:
submission.to_csv('submission.csv', index=False)
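
Before uploading, it can be worth checking that the table has the expected shape. The helper below, `check_submission`, is a hypothetical function written for this sketch (it is not a Kaggle or pandas API), and the `demo` frame is made up so that the example runs on its own; with the real data the helper would be called on `submission`.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return True if df looks like a valid submission table."""
    return bool(
        list(df.columns) == ["PassengerId", "Survived"]
        and len(df) == expected_rows
        and df["Survived"].isin([0, 1]).all()
        and not df.isna().any().any()
    )

# Made-up frame with the right shape, used only to exercise the helper.
demo = pd.DataFrame({
    "PassengerId": range(892, 892 + 418),
    "Survived": [0, 1] * 209,
})

ok = check_submission(demo)
print(ok)
```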