In this notebook, let's explore the data and do only the pre-processing and feature engineering. <br>
Please check out my other notebook, [Modelling Titanic](https://www.kaggle.com/aakashveera/modelling-titanic). It contains the modelling section, with several machine learning algorithms.

# Importing and Reading the Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("/kaggle/input/titanic/train.csv")

In [None]:
df.info()

In [None]:
df.head()

PassengerId is just a unique ID given to each passenger, so it has no use in modelling and can be dropped.<br>
SibSp is the number of siblings and spouses travelling with the person on the Titanic. <br>
Parch is the number of parents/children aboard the Titanic.

In [None]:
df.drop("PassengerId",axis=1,inplace=True)

# EDA & Visualizations

In [None]:
sns.countplot(x='Pclass',data=df)

In [None]:
sns.countplot(x='Survived',data=df)

In [None]:
sns.countplot(x='Survived',hue='Sex',data=df,palette='twilight_shifted_r')

In [None]:
sns.countplot(x='Survived',hue='Pclass',data=df,palette='viridis')

In [None]:
sns.countplot(x='SibSp',data=df)

In [None]:
sns.countplot(x='Parch',data=df)

Most passengers had no siblings, spouses, parents or children with them on the Titanic.<br>
The plot below shows the Age distribution. **The Titanic had more people aged 18-35 than any other age group.**

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(x='Age',data=df,bins=40)
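The two observations above (most passengers travelling without family, and the 18-35 band being the most common) can also be checked numerically instead of by eye. The following is a minimal sketch; the helper names `share_alone` and `share_in_age_range` are made up for illustration, and the toy frame only stands in for `df`.

```python
import pandas as pd

def share_alone(frame: pd.DataFrame) -> float:
    """Fraction of passengers with SibSp == 0 and Parch == 0."""
    alone = (frame["SibSp"] == 0) & (frame["Parch"] == 0)
    return alone.mean()

def share_in_age_range(frame: pd.DataFrame, low: float = 18, high: float = 35) -> float:
    """Fraction of passengers with a known Age between low and high (inclusive)."""
    ages = frame["Age"].dropna()
    return ages.between(low, high).mean()

# Toy data standing in for the training frame (not real Titanic rows)
toy = pd.DataFrame({
    "SibSp": [0, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Age": [22.0, 38.0, None, 30.0],
})
alone_share = share_alone(toy)
age_share = share_in_age_range(toy)
```

On the real data, `share_alone(df)` and `share_in_age_range(df)` would give the corresponding fractions for the training set.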

The black dots on the right of the box plot below represent outliers in Age. <br>


In [None]:
sns.boxplot(x=df['Age'])

# Filling out missing values

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

* Most of the values in Cabin are null.
* Age is partially missing.
* Only a few values are missing in Embarked.
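The heatmap gives a visual impression; the same information can be put into a small table of counts and shares. This is a rough sketch, with `missing_summary` as a hypothetical helper name and a toy frame in place of `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest share first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("share", ascending=False)

# Toy data standing in for the training frame (not real Titanic rows)
toy = pd.DataFrame({
    "Cabin": [None, None, "C85", None],
    "Age": [22.0, None, 38.0, 30.0],
    "Embarked": ["S", "C", "S", "Q"],
})
summary = missing_summary(toy)
```

Calling `missing_summary(df)` on the training frame would list Cabin, Age and Embarked with their respective shares.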

## Age

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=df,palette='winter')

Here we can see that 1st class passengers were mostly around 30-50 years old. <br>
2nd class passengers were mostly around 28-38, and 3rd class passengers were younger than both. <br>
So it is a good idea to fill the null values based on their class.

In [None]:
def age_fill(cols):
    # Access by label; positional indexing like cols[0] is deprecated on a labelled Series
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):

        if Pclass == 1:
            return 37

        elif Pclass == 2:
            return 29

        else:
            return 24

    else:
        return Age

In [None]:
df['Age'] = df[['Age','Pclass']].apply(age_fill,axis=1)
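As an alternative to hard-coding one age per class, the fill values could be computed from the data itself. The sketch below assumes the per-class median is an acceptable fill value, which is a choice made here for illustration and not necessarily how the constants above were picked. `fill_age_by_class` is a hypothetical helper, and the frames are toy data.

```python
import pandas as pd

def fill_age_by_class(frame: pd.DataFrame, class_medians: pd.Series = None):
    """Fill missing Age with the median Age of the passenger's Pclass.

    If class_medians is not given, it is computed from `frame`.
    Returns the filled Age column and the medians that were used.
    """
    if class_medians is None:
        class_medians = frame.groupby("Pclass")["Age"].median()
    filled = frame["Age"].fillna(frame["Pclass"].map(class_medians))
    return filled, class_medians

# Toy training data (not real Titanic rows)
toy_train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, None, 20.0, 26.0, None],
})
toy_train["Age"], medians = fill_age_by_class(toy_train)

# Reuse the training medians on the test data so both sets are filled consistently
toy_test = pd.DataFrame({"Pclass": [1, 3], "Age": [None, 50.0]})
toy_test["Age"], _ = fill_age_by_class(toy_test, medians)
```

Passing the training medians to the test set mirrors what the notebook does by reusing `age_fill` on both frames.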

## Cabin and Embarked

Let's convert Cabin to 1 where a cabin is recorded and 0 where it is null, and fill the missing Embarked values with the most frequent value.

In [None]:
df['Cabin'] = df['Cabin'].apply(lambda x: 0 if pd.isnull(x) else 1)

In [None]:
df['Embarked'].value_counts()

In [None]:
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
df['Embarked'] = df['Embarked'].fillna('S')

In [None]:
df.head()

# Feature Engineering

Let's extract the titles from the Name column.

In [None]:
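def extract_title(arg):
    # Names look like "Surname, Title. Given names", so the second
    # space-separated token is usually the title (e.g. "Mr.").
    # For multi-word surnames it is part of the surname instead.
    return arg.split(' ')[1]

df['Title'] = df['Name'].apply(extract_title)
# Equivalent lambda: df['Title'] = df['Name'].apply(lambda x: x.split(' ')[1])
df.drop('Name',axis=1,inplace=True)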
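Splitting on spaces picks up a piece of the surname whenever the surname has more than one word. One possible alternative, shown here only as a sketch and not what this notebook uses, is to take the text between the first comma and the following period. `TITLE_PATTERN` and `extract_title_regex` are names invented for the example, and the sample strings simply follow the "Surname, Title. Given names" layout.

```python
import re
import pandas as pd

# Text between the first comma and the next period, e.g. ", Mr." -> "Mr"
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title_regex(name: str) -> str:
    """Return the title found after the surname, or 'Unknown' if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])
titles = names.apply(extract_title_regex).tolist()

# The space-based split returns part of the surname for the second name
split_title = names[1].split(" ")[1]
```

Note that this variant returns titles without the trailing period ("Mr" instead of "Mr."), so any later cell that compares against "Mr." would have to be adjusted if it were swapped in.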

Ticket and Fare mostly reflect the passenger's class. Since class is already available as a separate feature (Pclass), both can be dropped.

In [None]:
df.drop(['Ticket','Fare'],axis=1,inplace=True)

Now let's convert the categorical values into numerical ones.

In [None]:
# You could also write your own function to convert the data, but let's use LabelEncoder
from sklearn.preprocessing import LabelEncoder
encoder_sex = LabelEncoder()
encoder_embarked = LabelEncoder()
encoder_title = LabelEncoder()

In [None]:
df['Sex'] = encoder_sex.fit_transform(df['Sex'])
df['Embarked'] = encoder_embarked.fit_transform(df['Embarked'])
df['Title'] = encoder_title.fit_transform(df['Title'])
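The "function approach" mentioned in the comment above could look roughly like the sketch below: an explicit dictionary per column, applied with `Series.map`. The dictionaries and the helper `encode_with_mapping` are assumptions made for this example. The codes are chosen in alphabetical order, which is also the order LabelEncoder assigns.

```python
import pandas as pd

# Explicit codes, in alphabetical order of the labels
SEX_CODES = {"female": 0, "male": 1}
EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}

def encode_with_mapping(column: pd.Series, codes: dict) -> pd.Series:
    """Map labels to integer codes, failing loudly on labels missing from `codes`."""
    encoded = column.map(codes)
    if encoded.isnull().any():
        unknown = column[encoded.isnull()].unique().tolist()
        raise ValueError(f"Unmapped values: {unknown}")
    return encoded.astype(int)

# Toy data standing in for the training frame
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]})
toy["Sex"] = encode_with_mapping(toy["Sex"], SEX_CODES)
toy["Embarked"] = encode_with_mapping(toy["Embarked"], EMBARKED_CODES)
```

An explicit mapping keeps the codes visible in the notebook, at the cost of having to list every label by hand, which is why it is less convenient for a column with many values such as Title.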

In [None]:
# Combine class and (already encoded) sex into one categorical feature, then encode it
df['Class_sex'] = df['Pclass'].astype(str) + df['Sex'].astype(str)
encoder_Class_sex = LabelEncoder()
df['Class_sex'] = encoder_Class_sex.fit_transform(df['Class_sex'])

In [None]:
df

Now let's do the same feature engineering on the test set.

In [None]:
test = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
# Don't drop PassengerId; it is necessary for the submission
test.drop(['Ticket','Fare'],axis=1,inplace=True)

In [None]:
test['Age'] = test[['Age','Pclass']].apply(age_fill,axis=1)

In [None]:
test['Title'] = test['Name'].apply(extract_title)
test.drop('Name',axis=1,inplace=True)

In [None]:
test['Cabin'] = test['Cabin'].apply(lambda x: 0 if pd.isnull(x) else 1)

In [None]:
test.head()

In [None]:
test['Sex'] = encoder_sex.transform(test['Sex'])
test['Embarked'] = encoder_embarked.transform(test['Embarked'])

Some of the titles in the test set are not in the training set, and LabelEncoder throws an error when transforming unseen labels.<br>
Khalil, Palmquist and Brito are the three; all three were men, so let's convert them to Mr.

In [None]:
test['Title'].value_counts()

In [None]:
# Replace the three unseen "titles" (really surname fragments) with 'Mr.'
test.loc[test['Title'].isin(['Khalil,','Palmquist,','Brito,']),'Title'] = 'Mr.'
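The same fix can be written without listing the unseen values by hand: anything that was not present in the training titles is replaced by a fallback label. This is a sketch under assumptions. `replace_unseen` is a hypothetical helper and the series are toy data; with a fitted LabelEncoder, the known labels are available as `encoder_title.classes_`.

```python
import pandas as pd

def replace_unseen(values: pd.Series, known, fallback: str) -> pd.Series:
    """Replace every label not in `known` with `fallback`."""
    known = set(known)
    if fallback not in known:
        raise ValueError("fallback must itself be a known label")
    # Keep values that are known, substitute the fallback everywhere else
    return values.where(values.isin(known), fallback)

# Toy data standing in for the Title columns of the two frames
train_titles = pd.Series(["Mr.", "Mrs.", "Miss.", "Mr."])
test_titles = pd.Series(["Mr.", "Khalil,", "Miss.", "Brito,"])
cleaned = replace_unseen(test_titles, train_titles.unique(), fallback="Mr.")
```

A blanket fallback of "Mr." is only appropriate when the affected passengers are men, so it is worth checking the Sex column of the replaced rows before relying on it.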

In [None]:
test['Title'] = encoder_title.transform(test['Title'])

In [None]:
test['Class_sex'] = test['Pclass'].astype(str) + test['Sex'].astype(str)
test['Class_sex'] = encoder_Class_sex.transform(test['Class_sex'])

Let's save our pre-processed data and publish it as a new dataset, so that next time, when modelling, we can use this cleaned data instead of repeating the same steps from the beginning.<br>
<br>
**Note for Beginners:** To publish your own data, hit the Save Version button and then commit. Once your notebook has executed, you can create a new dataset from the output at the bottom of the notebook.
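Before writing the files, it can be worth confirming that both frames ended up with the same feature columns, since the train and test sets were processed in separate cells. The check below is only a sketch; `feature_mismatch` is a hypothetical helper, and it assumes the only intended differences are the target column in train and the ID column in test.

```python
import pandas as pd

def feature_mismatch(train: pd.DataFrame, test: pd.DataFrame,
                     target: str = "Survived", id_col: str = "PassengerId") -> set:
    """Columns present in only one frame, ignoring the target and the ID column."""
    train_features = set(train.columns) - {target}
    test_features = set(test.columns) - {id_col}
    return train_features ^ test_features

# Toy frames with the column layout expected at this point in the notebook
toy_train = pd.DataFrame(columns=["Survived", "Pclass", "Sex", "Age", "Title"])
toy_test = pd.DataFrame(columns=["PassengerId", "Pclass", "Sex", "Age", "Title"])
ok = feature_mismatch(toy_train, toy_test)

# A forgotten step (here: Name not dropped from test) shows up in the result
toy_test_bad = toy_test.assign(Name=None)
bad = feature_mismatch(toy_train, toy_test_bad)
```

An empty set from `feature_mismatch(df, test)` would indicate that the two frames line up.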

In [None]:
test.to_csv("test.csv",index=False)
df.to_csv("train.csv",index=False)

Continue with my other notebook, [Modelling Titanic](https://www.kaggle.com/aakashveera/modelling-titanic). It contains the modelling section, with several machine learning algorithms, and the submission on Kaggle.