Given a training set of passengers labeled as having survived or not survived the Titanic disaster, can our model determine whether the passengers in a test dataset, which does not contain the survival information, survived?

We may also want to develop some early understanding about the domain of our problem. This is described on the Kaggle competition description page.

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This translates to a 32% survival rate.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
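
As a quick check, the 32% figure can be reproduced from the two counts quoted above. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Counts quoted in the competition description above.
total_on_board = 2224
deaths = 1502

# Everyone who did not die is counted as a survivor.
survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"{survivors} survivors out of {total_on_board}: {survival_rate:.1%}")
```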

* Question or problem definition.
* Acquire training and testing data.
* Wrangle, prepare, cleanse the data.
* Analyze, identify patterns, and explore the data.
* Model, predict and solve the problem.
* Visualize, report, and present the problem solving steps and final solution.
* Supply or submit the results.

In [None]:
import pandas as pd
import numpy as np

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [None]:
test_df = pd.read_csv("../input/titanic/test.csv")
train_df = pd.read_csv("../input/titanic/train.csv")
combine = [train_df,test_df]
train_df.info()
print('_'*40)
test_df.info()
train_df.sample(10)

* **Categorical:** Survived, Sex, and Embarked. Ordinal: Pclass.
* **Continuous:** Age, Fare. Discrete: SibSp, Parch.
* Seven features are integers or floats (six in the test dataset).
* Five features are strings (object).
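
The split between numeric and string features can also be read off programmatically. The sketch below uses a small hand-made frame (`toy`, an assumption for illustration, not the real data) that only mimics part of the column layout.

```python
import pandas as pd

# Hypothetical rows that mimic a few of the dataset's columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Embarked": ["S", "C", "S"],
})

# Numeric columns (integers and floats) versus everything else (strings).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("strings:", string_cols)
```

The same two calls should work unchanged on `train_df` and `test_df`.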

In [None]:
#Check Missing values
print('Train columns with null values:\n', train_df.isnull().sum())
print("-"*40)
print('Test columns with null values:\n', test_df.isnull().sum())

* **Train columns with null values:** Cabin > Age > Embarked 
* **Test columns with null values:** Cabin > Age > Fare
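
The ordering above can be produced directly by sorting the null counts. A minimal sketch on a hand-made frame (the values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a different number of missing values per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

null_counts = toy.isnull().sum()
# Keep only columns that actually have nulls, most affected first.
ranked = null_counts[null_counts > 0].sort_values(ascending=False)
print(ranked)
```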


In [None]:
train_df.describe()

**The distribution of numerical feature values across the samples:**
* Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
* The sample survival rate is around 38%.
* Fares varied significantly with few passengers (<1%) paying as high as $512.
* Few elderly passengers (<1%) within age range 65-80.
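
Rates and shares like these fall out of simple column arithmetic: the mean of a 0/1 column is a rate, and the mean of a boolean condition is the share of rows that satisfy it. A sketch on made-up values (not the real data):

```python
import pandas as pd

# Hypothetical values chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 512.33],
})

# Mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
survival_rate = toy["Survived"].mean()

# Mean of a boolean Series is the share of rows where the condition holds.
high_fare_share = (toy["Fare"] > 500).mean()

print(survival_rate, high_fare_share)
```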

In [None]:
train_df.describe(include='O')

**The distribution of categorical features:**
* Names are unique across the dataset (count=unique=891)
* The Sex variable has two possible values, with 65% male (top=male, freq=577/count=891).
* The Ticket feature has a high ratio (about 24%) of duplicate values (unique=681, count=891).
* Cabin values have several duplicates across samples, meaning several passengers shared a cabin (unique=147, count=204).
* Embarked takes three possible values. S port used by most passengers (top=S)
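
A duplicate ratio can be derived from the `count` and `unique` rows of the summary as `1 - unique / count`. The sketch below applies this to a made-up ticket Series and to the Ticket numbers reported above; the names `tickets` and `duplicate_ratio` are illustrative.

```python
import pandas as pd

# Hypothetical ticket codes with some repeats.
tickets = pd.Series(["A1", "A1", "B2", "C3", "C3", "C3", "D4", "E5", "F6", "G7"])

count = tickets.count()      # non-null entries
unique = tickets.nunique()   # distinct values
duplicate_ratio = 1 - unique / count

# The same formula applied to the Ticket summary (unique=681, count=891).
ticket_ratio = 1 - 681 / 891

print(f"toy: {duplicate_ratio:.1%}, Ticket summary: {ticket_ratio:.1%}")
```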


Next, we consider and explore several assumption factors.

In [None]:
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

* Most passengers are in the 15-35 age range.
* A large number of passengers in the 15-30 age range did not survive.
* Infants (Age <=4) had high survival rate.
* Oldest passengers (Age = 80) survived.

In [None]:
g = sns.catplot(x="Pclass", y="Survived", hue="Sex", data=train_df,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("survival probability")

* In all classes, most surviving passengers are female.
* The survival rate of females is much higher than that of males.
* The survival rate decreased from class 1 to class 3.
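
The numbers behind a bar plot like this can be tabulated with a pivot table. A minimal sketch, assuming a hand-made frame with the same three columns (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical passengers; only the column names match the dataset.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived per (Pclass, Sex) cell is the survival rate of that group.
rates = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")
print(rates)
```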

In [None]:
grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

* Pclass=3 had the most passengers; however, most did not survive.
* Most passengers in Pclass=1 survived.

# Data Processing and Exploration

This section handles missing data, derives new features, and converts categorical values to numbers.

In [None]:
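# Impute missing data; drop columns.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not modify the DataFrame in recent pandas versions.
for dataset in combine:
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())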
    
drop_column = ['PassengerId','Cabin', 'Ticket']
train_df.drop(drop_column, axis=1, inplace = True)
test_id=test_df['PassengerId']
test_df.drop(drop_column, axis=1, inplace = True)
print('Train columns with null values:\n', train_df.isnull().sum())
print("-"*40)
print('Test columns with null values:\n', test_df.isnull().sum())

**Analyze by pivoting features**

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other.

In [None]:
for x in train_df.columns[1:9]:
    if train_df[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(train_df[[x,"Survived"]].groupby(x, as_index=False).mean().sort_values(by='Survived', ascending=False))
        print('-'*40)

* **Pclass:** Pclass=1 passengers had a survival rate above 0.5. We decide to include this feature in our model.
* **Name:** Name values are free text in mixed formats; we can extract a new feature, "Title", from them.
* **Sex:** Sex=female had a very high survival rate of 74%.
* **SibSp and Parch:** These features had a zero survival rate for certain values. We can derive new features from these individual features.
* **Embarked:** Embarked=C had a higher survival rate of 55%.


In [None]:
# Name --> Title
# Extract the title from each name; titles that occur fewer than 10 times are grouped as "Rare".
for dataset in combine:  
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    title_names = (dataset['Title'].value_counts() < 10)
    dataset['Title'] = dataset['Title'].apply(lambda x: 'Rare' if title_names.loc[x] == True else x)
    dataset.drop(['Name'], axis=1, inplace = True)

print('Train Count of Titles:\n',train_df['Title'].value_counts())
print('-'*40)
print('Test Count of Titles:\n',test_df['Title'].value_counts())
print('-'*40)
print('Train title with null values:\n', train_df["Title"].isnull().sum())
print("-"*40)
print('Test title with null values:\n', test_df["Title"].isnull().sum())
print("-"*40)
print(train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
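
The same title extraction can be sketched with a single regular expression instead of two splits. This is an alternative formulation, not the cell's own method; the example names follow the "Surname, Title. Given names" layout, and the threshold `min_count` is a hypothetical parameter set to 2 so the small example shows the grouping.

```python
import pandas as pd

# Example names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Group infrequent titles under "Rare" (threshold is an assumption).
min_count = 2
counts = titles.value_counts()
rare_titles = counts[counts < min_count].index
titles_grouped = titles.where(~titles.isin(rare_titles), "Rare")

print(titles.tolist())
print(titles_grouped.tolist())
```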

In [None]:
#Create new feature (Family Size/ IsAlone) combining existing features (SibSp/ Parch) 
for dataset in combine:
    dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1   #Discrete variables
    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 # now update to no/0 if family size is greater than 1
    drop_column = ['SibSp','Parch','FamilySize']
    dataset.drop(drop_column, axis=1, inplace = True)
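
The derivation of `IsAlone` can also be written as one vectorized comparison. A minimal sketch on made-up SibSp/Parch values (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sibling/spouse and parent/child counts.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Family size counts the passenger too, hence the +1.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# A passenger is alone exactly when the family size is 1.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```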

In [None]:
# Create Fare and Age bands (reduce the effects of minor observation errors.)
for dataset in combine:
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
print(train_df[['FareBin', 'Survived']].groupby(['FareBin'], as_index=False).mean().sort_values(by='FareBin', ascending=True))
print("-"*40)
print(train_df[['AgeBin', 'Survived']].groupby(['AgeBin'], as_index=False).mean().sort_values(by='AgeBin', ascending=True))

In [None]:
# Replace Fare and Age with ordinals based on these bands.
for dataset in combine:    
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
 #--------------------------------------------------------------------------------------   
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    dataset['Age'] = dataset['Age'].astype(int)
 #--------------------------------------------------------------------------------------      
    drop_column = ['FareBin','AgeBin']
    dataset.drop(drop_column, axis=1, inplace = True)

train_df.sample(10)
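
As an alternative to hard-coding each band with `.loc`, `pd.cut` and `pd.qcut` can return the ordinal codes directly when called with `labels=False`. The sketch below reuses the age edges from the cell above and made-up ages and fares; it is an illustration of the technique, not a drop-in replacement.

```python
import pandas as pd

# Hypothetical ages and fares.
ages = pd.Series([2, 20, 30, 40, 50, 70, 80])
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 20.0, 40.0, 100.0])

# Fixed-width age bands with the same edges as above: (0,16], (16,32], ...
age_band = pd.cut(ages, bins=[0, 16, 32, 48, 64, 80],
                  labels=False, include_lowest=True)

# Quartile-based fare bands; the edges are computed from the data itself.
fare_band = pd.qcut(fares, 4, labels=False)

print(age_band.tolist())
print(fare_band.tolist())
```

One design point to keep in mind: quantile edges computed separately on the training and test sets will generally differ, which is a reason to fix the edges once, as the cell above does.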

In [None]:
# Convert the categorical values (Title/ Sex/ Embarked) to integer codes.
# Categorical data are variables with a finite set of label values.
# Most machine learning algorithms require numerical input and output variables,
# so integer encoding or one-hot encoding is used to convert categorical data to numbers.
# Integer encoding is used here.
for dataset in combine:
    dataset['Title'] = dataset['Title'].map({"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}).astype(int)
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
train_df.sample(10)
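
Integer codes imply an ordering (for example S < C < Q) that has no real meaning for a feature like Embarked. One-hot encoding avoids this by giving each category its own 0/1 column. A minimal sketch with `pd.get_dummies` on a made-up frame:

```python
import pandas as pd

# Hypothetical embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per category, named <column>_<value>.
one_hot = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(one_hot)
```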

In [None]:
# Correlation
for x in train_df.columns[1:8]:
    if train_df[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(train_df[[x,"Survived"]].groupby(x, as_index=False).mean().sort_values(by='Survived', ascending=False))
        print('-'*40)

In [None]:
# A positive correlation means the feature tends to be higher for survivors;
# a negative correlation means it tends to be lower for survivors.
correlation = train_df.corr()
plt.figure(figsize=(10,8))
# Boolean mask that hides the upper triangle, including the diagonal
# (see https://seaborn.pydata.org/generated/seaborn.heatmap.html)
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation,linewidths=.3,annot=True,mask=mask,cmap="YlGnBu",cbar=False)

* Sex had highest correlation with Survived.
* Title had the second highest positive correlation with Survived, and it is also related to Sex and Fare.
* Pclass had negative correlation with survived.
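
A ranking like this can be read straight from the target's column of the correlation matrix rather than from the heatmap. A sketch on a made-up, already-encoded frame (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical encoded features; Sex is 1 for female, 0 for male.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
})

# Correlation of every feature with the target, strongest positive first.
target_corr = (
    toy.corr()["Survived"]
    .drop("Survived")
    .sort_values(ascending=False)
)
print(target_corr)
```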

In [None]:
# Graph individual features by survival
fig, saxis = plt.subplots(2, 3, figsize=(14,10))
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'IsAlone']
# saxis.flat walks the 2x3 grid of axes row by row, one axis per feature
for feature, ax in zip(features, saxis.flat):
    sns.barplot(x=feature, y='Survived', data=train_df, ax=ax)

The graphs above show, for each feature, which group of passengers had the higher survival rate.

# Model, predict and solve

The purpose of machine learning is to solve human problems. Machine learning can be categorized as supervised learning, unsupervised learning, and reinforcement learning.

* Supervised learning is where you train the model by presenting it a training dataset that includes the correct answer.
* Unsupervised learning is where you train the model using a training dataset that does not include the correct answer.

We are doing supervised machine learning, because we are training our algorithm by presenting it with a set of features and their corresponding target. There are many machine learning algorithms; however, they can be reduced to four categories: classification, regression, clustering, or dimensionality reduction, depending on your target variable and data modeling goals.

We want to identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port...). As a general rule, a continuous target variable requires a regression algorithm and a discrete target variable requires a classification algorithm. Survived is discrete, so our problem is a classification problem. We can narrow down our choice of models to a few. These include:

* Logistic Regression
* KNN or k-Nearest Neighbors
* Support Vector Machines
* Naive Bayes classifier
* Decision Tree
* Random Forest


In [None]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.copy()
X_train.shape, Y_train.shape, X_test.shape

In [None]:
# Note: each score below is accuracy on the training data, not on held-out data.
# Logistic Regression
logreg = LogisticRegression()
Y_pred1 = logreg.fit(X_train, Y_train).predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
# KNN
knn = KNeighborsClassifier(n_neighbors = 4)
Y_pred2 = knn.fit(X_train, Y_train).predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
# Support Vector Machines
svm = SVC()
Y_pred3 = svm.fit(X_train, Y_train).predict(X_test)
acc_svm = round(svm.score(X_train, Y_train) * 100, 2)
# Naive Bayes classifier
nb = GaussianNB()
Y_pred4 = nb.fit(X_train, Y_train).predict(X_test)
acc_nb = round(nb.score(X_train, Y_train) * 100, 2)
# Decision Tree
decision_tree = DecisionTreeClassifier()
Y_pred5 = decision_tree.fit(X_train, Y_train).predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
Y_pred6 = random_forest.fit(X_train, Y_train).predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
#--------------------------------------------------------------------------
models = pd.DataFrame({
    'Model': ['Logistic Regression','KNN','Support Vector Machines','Naive Bayes','Decision Tree', 'Random Forest'],
    'Score': [acc_log, acc_knn, acc_svm,  acc_nb,acc_decision_tree,acc_random_forest]})
models=models.sort_values(by='Score', ascending=False)
models
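
Because the scores above are computed on the same rows the models were fitted on, flexible models such as decision trees can look better than they would on unseen passengers. One common way to get a fairer estimate is k-fold cross-validation. The sketch below uses synthetic data (an assumption, generated with a fixed seed) in place of `X_train` and `Y_train`, and compares training accuracy with 5-fold cross-validated accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and the 0/1 target.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 5))
y = (X[:, 0] + rng.integers(0, 3, size=200) > 3).astype(int)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    # Accuracy on the rows used for fitting (what the cell above reports).
    train_acc = model.fit(X, y).score(X, y)
    # Mean accuracy over 5 held-out folds.
    cv_acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    results[name] = (train_acc, cv_acc)
    print(f"{name}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

On the real data, the same `cross_val_score` call could be applied to `X_train` and `Y_train` for each of the six models.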

In [None]:
sns.barplot(x='Score', y = 'Model', data = models)
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')

In [None]:
submission = pd.DataFrame({
        "PassengerId": test_id,
        "Survived": Y_pred1
    })

submission.to_csv('Submission.csv', index=False)
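
Before uploading, it can help to sanity-check the submission frame. The helper below, `check_submission`, is hypothetical (not part of pandas or Kaggle) and encodes only the two-column layout used in the cell above; the IDs in the example are made up.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Hypothetical helper: raise AssertionError if the frame looks malformed."""
    # Exactly the two columns written by the cell above, in that order.
    assert list(submission.columns) == ["PassengerId", "Survived"]
    # One prediction per test passenger, with no repeated IDs.
    assert len(submission) == len(expected_ids)
    assert submission["PassengerId"].is_unique
    # Predictions must be binary.
    assert submission["Survived"].isin([0, 1]).all()

# Made-up IDs and predictions for illustration.
toy_ids = pd.Series([892, 893, 894])
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
check_submission(toy_submission, toy_ids)
print("submission looks well-formed")
```

With the notebook's variables, the call would be `check_submission(submission, test_id)`.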