# Titanic Dataset Practice

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv('train_titanic.csv')
train.head()

## Exploratory Data Analysis

### Missing Data

In [None]:
train.isnull()

- Scanning the raw boolean table above makes it hard to tell which columns contain null values, so we use a seaborn heatmap to visualize where the nulls fall in each column.

In [None]:
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Roughly 20 percent of the Age data is missing, a proportion small enough to fill in reasonably with some form of imputation. The Cabin column is missing too much data to be useful at a basic level. We'll probably drop it later, or convert it to another feature such as "Cabin Known: 1 or 0".
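
The missing share of each column can also be read off numerically rather than estimated from the heatmap. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions for illustration, not the Titanic data) to show the idea:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() gives a boolean frame; the column mean of booleans is the
# fraction of missing values in that column.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

On the real data the same expression would be `train.isnull().mean()`.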

Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code just serves as a reference.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train['Age'].dropna(),kde=False,color='darkred',bins=40)

Alternatively, use the pandas `hist()` method, which plots through matplotlib:

In [None]:
train['Age'].hist(bins=30,color='darkred',alpha=0.3)

In [None]:
sns.countplot(x='Parch',data=train)

In [None]:
sns.countplot(x='SibSp',data=train)

In [None]:
sns.countplot(x='Cabin',data=train)

In [None]:
train.Cabin.unique()

In [None]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))

Alternatively, use cufflinks for an interactive view:

In [None]:
import cufflinks as cf
cf.go_offline()

In [None]:
train['Fare'].iplot(kind='hist',bins=30,color='green')

## Data Cleaning
We want to fill in the missing age data instead of dropping the rows where age is missing. One way to do this is to fill in the mean age of all passengers (imputation). However, we can be smarter about this and check the average age by passenger class. For example:

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

In [None]:
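# Select 'Age' before aggregating: calling mean() on the whole frame
# fails on non-numeric columns such as Name in recent pandas versions.
mean_ages = train.groupby('Pclass')['Age'].mean()
df = pd.DataFrame(mean_ages)
df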

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these average ages to impute missing Age values based on Pclass.

In [None]:
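def impute_age(cols):
    # Select by label: positional indexing such as cols[0] on a
    # label-indexed Series is deprecated in recent pandas versions.
    age = cols['Age']
    pclass = cols['Pclass']

    # Fill a missing age with the approximate mean age of the passenger's class.
    if pd.isnull(age):
        if pclass == 1:
            return 38
        elif pclass == 2:
            return 29
        else:
            return 25

    return age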

Now apply that function to each row:

In [None]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
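
The class means do not have to be typed in by hand. As a hedged alternative to the row-wise function above (a different technique, not the one this notebook uses), `groupby(...).transform('mean')` can compute each class mean and fill the gaps in one step. The `toy` frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: one missing age in each passenger class (values are made up).
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age per class, broadcast back so every row carries its class mean.
class_mean_age = toy.groupby('Pclass')['Age'].transform('mean')

# Keep observed ages; fill only the gaps with the class mean.
toy['Age'] = toy['Age'].fillna(class_mean_age)
print(toy)
```

This keeps the imputed values in sync with the data and avoids a Python-level loop over rows.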

Now let's check that heat map again!

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

The Embarked feature still shows a few null values, which we take care of next. First, we find the rows with missing Embarked values:

In [None]:
train[train['Embarked'].isnull()]

To fill the null values, we find the Embarked category with the highest count and use it to fill the missing Embarked values in the dataset.

In [None]:
sns.countplot(x='Embarked',data=train)

Here 'S' has the highest count, so we use it to fill the null values.

In [None]:
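# Assign the result back: an in-place fill on a selected column may not
# update the original frame in recent pandas versions.
train['Embarked'] = train['Embarked'].fillna('S')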
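
The most frequent category can also be looked up in code instead of being read off the count plot. A minimal sketch, assuming a made-up `toy` frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy column with two missing ports (values are made up).
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', np.nan]})

# mode() ignores NaN and returns a Series (ties are possible),
# so take the first entry.
most_common = toy['Embarked'].mode()[0]

# Fill the gaps and assign the result back to the column.
toy['Embarked'] = toy['Embarked'].fillna(most_common)
print(most_common)
```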

In [None]:
train[train['Embarked'].isnull()]

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Great! Let's go ahead and drop the Cabin column

In [None]:
train.drop('Cabin',axis=1,inplace=True)
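
If you would rather keep some signal from Cabin, the "Cabin Known: 1 or 0" idea mentioned earlier could be sketched as below. This is an optional variation, not what this notebook does; the `CabinKnown` column name and the `toy` frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy column where most cabin values are missing (values are made up).
toy = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'E46', np.nan]})

# 1 where a cabin value is recorded, 0 where it is missing.
toy['CabinKnown'] = toy['Cabin'].notnull().astype(int)

# The raw column can then be dropped, as in the cell above.
toy = toy.drop('Cabin', axis=1)
print(toy)
```

The indicator would have to be created before the Cabin column is dropped.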

In [None]:
train.head()

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

## Converting Categorical Features
We'll need to convert categorical features to dummy variables (One-Hot Encoding) using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

In [None]:
train.info()

In [None]:
pd.get_dummies(train['Embarked']).head()

Above we can see that three columns are created, but we don't need the first one: Q and S are sufficient to distinguish C, Q, and S. That is, 0 1 means it's S, 1 0 means it's Q, and 0 0 means it's C. We drop 'C' mainly to avoid <b>redundancy</b> and the <b>dummy variable trap</b>. Likewise, we do the same for the 'Sex' column.

In [None]:
pd.get_dummies(train['Embarked'],drop_first=True).head()

In [None]:
pd.get_dummies(train['Sex'],drop_first=True).head()

In [None]:
sex = pd.get_dummies(train['Sex'],drop_first=True, prefix='sex')
embark = pd.get_dummies(train['Embarked'],drop_first=True, prefix='embarked')
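
To see why the dropped column carries no extra information, here is a small sketch on made-up values (the `toy` series is an assumption, not the real column) that rebuilds the C indicator from Q and S alone:

```python
import pandas as pd

# Toy Embarked values covering all three ports (made up for illustration).
toy = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# dtype=int gives 0/1 columns instead of booleans.
full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# The dropped column is recoverable: C is 1 exactly when Q and S are both 0.
recovered_c = 1 - reduced['Q'] - reduced['S']
print(full)
print(reduced)
```

Because the three full columns always sum to 1, keeping all of them alongside a model intercept would make the features perfectly collinear, which is the dummy variable trap.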

In [None]:
train.drop(['PassengerId','Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [None]:
train.head()

In [None]:
train = pd.concat([train,sex,embark],axis=1)

In [None]:
train.head()

Great! Our data is ready for our model!

# Building a Logistic Regression model
Let's start by splitting our data into a training set and a test set (a separate test file, `test_titanic.csv`, is also available in case you want to use all of this data for training).

## Train Test Split

In [None]:
test = pd.read_csv('test_titanic.csv')
test.head()

In [None]:
train.drop('Survived',axis=1).head()


In [None]:
train['Survived'].head()


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)
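
As a quick check of what `test_size` and `random_state` do, the sketch below splits a made-up ten-row frame (the `toy` data and variable names are assumptions for illustration) and repeats the split with the same seed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with ten rows (values are made up).
toy = pd.DataFrame({
    'Fare': np.arange(10, dtype=float),
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = toy.drop('Survived', axis=1)
y = toy['Survived']

# test_size=0.30 holds out 30 percent of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)

# The same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.30, random_state=101)
print(len(X_tr), len(X_te))
```

Features and labels stay aligned because both are shuffled with the same row order.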

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# A higher max_iter helps the solver converge on unscaled features
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Named cm because this is the confusion matrix, not an accuracy value
cm = confusion_matrix(y_test,predictions)
cm

In [None]:
from sklearn.metrics import accuracy_score


In [None]:
accuracy=accuracy_score(y_test,predictions)
accuracy
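
The confusion matrix and the accuracy score are directly related: accuracy is the share of predictions that land on the matrix diagonal. A minimal sketch on made-up labels (`y_true` and `y_pred` are assumptions for illustration, not this model's output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for a binary problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the fraction of correct predictions (the diagonal).
manual_accuracy = (tp + tn) / cm.sum()
print(cm)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```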

In [None]:
predictions

Let's move on to evaluate our model!

## Evaluation
We can check precision, recall, and F1-score using the classification report!

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test,predictions))
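
The numbers in the report can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below does this for the positive class on made-up labels (`y_true` and `y_pred` are assumptions for illustration) and compares the result with scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions for a binary problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Counts for the positive class (label 1).
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```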