## Titanic: Machine Learning from Disaster

**Competition objective: build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).**

Predicting which passengers survived the Titanic shipwreck means predicting a discrete variable: yes or no, 1 or 0. This makes it a binary classification problem.

In [1]:
# Import the Python and machine learning modules
# for data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# for visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning algorithms for classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

**Load the dataset**

In [None]:
# Previewing the first 5 rows of the training dataset
titanicTrainingData = pd.read_csv('train.csv')
titanicTrainingData.head()

In [None]:
# Previewing the last 5 rows of the training dataset
titanicTrainingData.tail()

In [None]:
# Previewing the first 5 rows of the test dataset
titanicTestData = pd.read_csv('test.csv')
titanicTestData.head()

In [None]:
# Previewing the last 5 rows of test dataset
titanicTestData.tail()

In [None]:
# combine both train and test
#combinedDataset = [titanicTrainingData, titanicTestData]

**Features available in the dataset**

In [None]:
print(titanicTrainingData.columns)

**Which features are categorical and which are numerical?**

In [None]:
titanicTrainingData.dtypes

**The pandas `DataFrame.info()` method prints a concise summary of a DataFrame (column names, non-null counts and dtypes), which makes it handy for a quick overview of the dataset during exploratory analysis.**

In [None]:
titanicTrainingData.info()

In [None]:
titanicTestData.info()

In [None]:
titanicTrainingData.describe()

**I decided to look at the target variable**

**I want to know the distribution of the target variable**

**Before analyzing the categorical and numerical variables, I think there's a need to understand how the target variable is distributed with respect to the sex of the passengers**

In [None]:
titanicSurvived = titanicTrainingData[['Survived', 'Sex']]
titanicSurvived.shape

**I plotted a horizontal bar chart to get a better understanding** 

In [None]:
fig, ax = plt.subplots(2, 1, sharex=True)

dist_target = titanicSurvived.shape[0]

(titanicTrainingData['Survived']
    .value_counts()
    .div(dist_target)
    .plot.barh(title="Distribution of Survivors", ax=ax[0])
)
ax[0].set_ylabel("Survived")

(titanicTrainingData['Sex']
    .value_counts()
    .div(dist_target)
    .plot.barh(title="Distribution of Males and Females", ax=ax[1])
)
ax[1].set_ylabel("Sex")


fig.tight_layout()
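
The cell above turns raw counts into proportions by dividing by the number of rows. As a small aside, the sketch below shows that `value_counts(normalize=True)` gives the same proportions in one step. The `survived` series is a made-up toy column, not the real data.

```python
import pandas as pd

# Toy stand-in for the 'Survived' column (hypothetical values, not from train.csv)
survived = pd.Series([0, 1, 0, 0, 1], name="Survived")

# Dividing raw counts by the number of rows, as in the plotting cell ...
manual_share = survived.value_counts().div(len(survived))

# ... gives the same proportions as asking pandas to normalize the counts
normalized_share = survived.value_counts(normalize=True)

print(manual_share)
print(normalized_share)
```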

**My observations from the chart**

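*1. In the training dataset, roughly 40% of passengers survived the mishap*

*2. The second panel shows the sex split of all passengers, not of the survivors. Among the roughly 40% who survived, I estimate that females make up more than 60% and males less than 40%, which needs a breakdown of Survived by Sex to confirm*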

*3. Why do we have more female survivors than males? Is that a relevant question?*

*4. Should this be taken into consideration as I proceed?*
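
One way to check the questions above is to separate two different numbers: the share of each sex that survived, and the share of survivors that belongs to each sex. The sketch below does both on a small made-up table (`toy` is hypothetical, not the competition data); the same two lines should work on `titanicTrainingData`.

```python
import pandas as pd

# Toy passenger table (hypothetical rows, not taken from train.csv)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Share of each sex that survived: the mean of a 0/1 column is a rate
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Composition of the survivors: what fraction of survivors is female or male
survivor_mix = toy.loc[toy["Survived"] == 1, "Sex"].value_counts(normalize=True)

print(survival_rate_by_sex)
print(survivor_mix)
```

The two results answer different questions, so it helps to keep them apart when writing up observations.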

**I think I'll plot more charts to understand the relationships between the target variable(Survived) and other independent variables.**

**By doing this, I think I'll definitely hit some snags that will inform data processing**

In [None]:
plt.figure(figsize=(18,9)) 

sns.lineplot(x = 'Age', y = 'Survived', hue = 'Sex', data = titanicTrainingData)
plt.title("Distribution of Survivors/Victims by Age")
plt.show()

**Observations**

*1. I think age is a major factor in the survival numbers. A lot of the female survivors are within the age group 3 - 48 yrs*

*2. I think the motto for helping people to safety was "women and children"*

*3. A good number of the men who survived were elderly*
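
A line plot over every single age is noisy, so observations like these may be easier to verify with age groups. The sketch below bins ages with `pd.cut` and computes the survival rate per sex and age group. The bin edges, the labels and the `toy` rows are all arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical rows); one Age is missing on purpose
toy = pd.DataFrame({
    "Age": [4, 30, 45, 70, 8, 25, 62, None],
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Bin edges and labels are an arbitrary choice for this sketch
age_bins = [0, 12, 18, 40, 60, 100]
age_labels = ["child", "teen", "adult", "middle-aged", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=age_bins, labels=age_labels)

# Rows with a missing Age get a missing AgeGroup and drop out of the groupby
rate_by_group = toy.groupby(["Sex", "AgeGroup"], observed=True)["Survived"].mean()

print(rate_by_group)
```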

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10,8))
# Pass Age as y so the box is drawn vertically
ax = sns.boxplot(y='Age', data=titanicTrainingData)

In [None]:
ax = sns.boxplot(x='Sex', y='Survived', data=titanicTrainingData, orient="v")

In [None]:
ax = sns.boxplot(x='Pclass', y='Survived', data=titanicTrainingData, orient="v")

In [None]:
ax = sns.boxplot(x='SibSp', y='Survived', data=titanicTrainingData, orient="v")

In [None]:
ax = sns.boxplot(x='Embarked', y='Survived', data=titanicTrainingData, orient="v")

In [None]:
plt.figure(figsize=(18,9)) # ah.. the sweet 18 by 9 ratio

sns.lineplot(x = 'Pclass', y = 'Survived', hue = 'Sex', data = titanicTrainingData)
plt.title("Distribution of Survivors/Victims by Ticket class (1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class)")
plt.show()
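
The plot above can be backed by a small table of survival rates per ticket class and sex. This is a minimal sketch using `pivot_table` on a made-up `toy` table (hypothetical rows, not the real data), so the numbers it prints say nothing about the actual passengers.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from train.csv)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Rows are ticket classes, columns are sexes, cells are mean survival (a rate)
rate_table = toy.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)

print(rate_table)
```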

In [None]:
display(titanicTrainingData['Age'].value_counts())

**Observations**

**What are categorical variables?**

*A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.*

**Which features are categorical?**

*Pclass (ordinal), Survived, Sex and Embarked*

In [None]:
titanicTrainingData['Pclass'].head(4)

In [None]:
titanicTrainingData['Survived'].head(4)

In [None]:
titanicTrainingData['Sex'].head(4)

In [None]:
titanicTrainingData['Embarked'].head(4)
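
Pandas can record the difference between ordinal and nominal categories explicitly. The sketch below is one possible way to do that, assuming 1st class should rank highest; that ordering, and the `toy` rows, are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# Pclass is ordinal: declare the order from lowest (3rd) to highest (1st) class
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[3, 2, 1], ordered=True)

# Sex and Embarked are nominal: categories without any order
toy["Sex"] = toy["Sex"].astype("category")
toy["Embarked"] = toy["Embarked"].astype("category")

print(toy.dtypes)
print(toy["Pclass"].max())  # the highest class under the declared order
```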

**What are numerical variables?**

*A numerical variable is a variable where the measurement or number has a numerical meaning*

**Which features are numerical?**

*Continuous: Age, Fare. Discrete: SibSp, Parch*

In [None]:
titanicTrainingData['Age'].head(4)

In [None]:
titanicTrainingData['Fare'].head(4)

In [None]:
titanicTrainingData['SibSp'].head(4)

In [None]:
titanicTrainingData['Parch'].head(4)

In [None]:
# preview the training dataset
titanicTrainingData.head()

In [None]:
titanicTrainingData.tail()

**Which features are mixed data types?**

*Ticket holds a mix of purely numeric and alphanumeric values*

*Cabin is alphanumeric*

In [None]:
# Preview the two mixed-type columns discussed above
titanicTrainingData[['Ticket', 'Cabin']].head(4)
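
To see how mixed a column like Ticket is, one option is to flag which values consist of digits only. This is a minimal sketch on a few example strings written in the style of the Ticket column; the `tickets` series is an assumption for illustration.

```python
import pandas as pd

# Example strings in the style of the Ticket column (illustrative only)
tickets = pd.Series(
    ["A/5 21171", "PC 17599", "113803", "373450", "STON/O2. 3101282"],
    name="Ticket",
)

# True where the ticket consists of digits only, False where it has a prefix
is_numeric = tickets.str.isdigit()

print(pd.DataFrame({"Ticket": tickets, "is_numeric": is_numeric}))
print(is_numeric.value_counts())
```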

**Which features may contain errors or typos?**

*The Name column may contain errors or typos because it packs several things into one string: commas, brackets, titles and alternative names all sit side by side*

In [None]:
titanicTrainingData.tail()
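
The punctuation in Name looks more like a fixed layout than random noise, so the pieces can probably be pulled apart with a regular expression. The sketch below assumes the layout "Surname, Title. Given names"; that pattern and the column names `Surname`, `Title` and `Rest` are my own assumptions, and real rows may not all follow it.

```python
import pandas as pd

# Sample strings in the format of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
], name="Name")

# Assumed layout: "Surname, Title. Given names (optional alternative name)"
pattern = r"^(?P<Surname>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<Rest>.*)$"
parts = names.str.extract(pattern)

# Rows that do not match the assumed layout come back as all-NaN
unmatched = parts["Title"].isna().sum()

print(parts)
print("rows not matching the pattern:", unmatched)
```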

**Which features contain null, blank or empty values?**

*Why is this question relevant?*

**The existence of missing values may make or mar the prediction one is about to undertake**

**The existence of missing values may or may not be important depending on the problem to be solved**

In [None]:
titanicTrainingData.describe()

In [None]:
titanicTrainingData.isnull().sum()

**The train data has null values in the following features: Age, Cabin and Embarked**

In [None]:
titanicTestData.isnull().sum()

**The test data has null values in the following features: Age, Cabin and Fare (a single missing value)**
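
Raw null counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a made-up `toy` table (hypothetical rows); the name `missing_summary` is my own.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical rows, not taken from the CSV files)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.93, 53.1],
})

# Count of nulls per column and the fraction of rows they represent
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
}).sort_values("share", ascending=False)

print(missing_summary)
```

A column missing in most rows may call for a different treatment than one with only a few gaps, which is why the share is worth looking at.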

**What are the data types for various features?**

**Why do we check for data types?**

*It is a crucial prerequisite for doing Exploratory Data Analysis (EDA) and Feature Engineering for Machine Learning models*

*Depending on the type of data, this might have some repercussions for the type of algorithms that you can use for feature engineering and modelling, or the type of questions that you can ask of it*

In [None]:
titanicTrainingData.info()

**The train data has 7 numerical features and 5 non-numerical features**

In [None]:
titanicTestData.info()

**The test data has 6 numerical features and 5 non-numerical features**
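
Rather than counting dtypes by eye from the `info()` output, the split can be computed with `select_dtypes`. This is a minimal sketch on a made-up four-column `toy` frame (hypothetical rows); the same calls should work on the train and test DataFrames.

```python
import pandas as pd

# Toy frame with two numeric and two text columns (hypothetical rows)
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 38.0],
    "Name": ["a", "b"],
    "Sex": ["male", "female"],
})

# Integer and float columns count as "number"; everything else is excluded
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical:", numeric_cols)
print(len(other_cols), "non-numerical:", other_cols)
```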