**Author:** Manish KC

# Introduction to Data Pre-processing
Data pre-processing is one of the most important steps of the Data Science pipeline. The quality of the data, and of the useful information that can be derived from it, directly affects how well our model can learn. It is also necessary to convert categorical data to numerical form, as most machine learning algorithms accept only numerical input. In this tutorial, you will learn the basic data pre-processing steps.

### Agenda
*  Loading Libraries
*  Loading Data
*  Data Overview and Summary
*  Data Pre-processing
    *  Dropping Irrelevant Features
    *  Dropping Rows with Missing Values
    *  Problems with dropping rows
    *  Taking care of missing data
    *  Handling Categorical Variables - Creating Dummy Variables
    * Separating input variables and output variable
    * Splitting the data into train set and test set

We are going to use the Titanic dataset.

## About Titanic
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

In the Hollywood blockbuster that was modelled on this tragedy, it seemed to be the case that upper-class people, women and children were more likely to survive than others. But did these properties (socio-economic status, sex and age) really influence one's survival chances?

#### **Problem Identification: The goal is to predict who survived the Titanic shipwreck.**

#### Dataset download link
https://docs.google.com/spreadsheets/d/1hFOPnxVT9fyT4TFlwuGGbDLfclY43P48UV24PNfAW2M/edit#gid=1297342310

## Loading Libraries
You can load all the libraries that you think you will need up front, or you can import them as you go along.

**Alias for libraries:** numpy --> np, pandas --> pd

In [0]:
import numpy as np        # A fundamental package for linear algebra and multidimensional arrays
import pandas as pd       # Data analysis and manipulation tool.

## Loading Data
The data is in CSV format. Let's load it using the pandas read_csv() function.

In [0]:
# I have provided the path of data which I have saved in my drive as 'titanic_train_data'.
# You provide the path where you have saved the data.
titanic_data = pd.read_csv("/content/drive/My Drive/Datasets/titanic_train_data.csv")
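If you do not have the file at hand yet, the behaviour of read_csv() can be tried on a few lines of text. This is only a sketch: the `csv_text` string and the `sample` name below are made up for illustration and are not part of the Titanic file. It shows that an empty field is read as NaN.

```python
import io
import pandas as pd

# Hypothetical mini CSV, not the real Titanic file
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                 # (rows, columns)
print(sample["Age"].isna().sum())   # the empty Age field is read as NaN
```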

## Data Overview and Summary
Let's look at what the data looks like and at a concise summary of it.

In [0]:
# Displaying 5 random records
titanic_data.sample(5)     # You can pass the number of random records that you want to be displayed in 'sample()'

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
535,536,1,2,"Hart, Miss. Eva Miriam",female,7.0,0,2,F.C.C. 13529,26.25,,S
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S
552,553,0,3,"O'Brien, Mr. Timothy",male,,0,0,330979,7.8292,,Q
489,490,1,3,"Coutts, Master. Eden Leslie ""Neville""",male,9.0,1,1,C.A. 37671,15.9,,S
570,571,1,2,"Harris, Mr. George",male,62.0,0,0,S.W./PP 752,10.5,,S


We can observe that there are some null values (i.e. NaN) in 'Age' and 'Cabin' attributes.

In [0]:
# A concise summary of the data
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The concise summary of data tells us:
*  There are 891 observations / records in total in the dataset.
*  The Age, Cabin and Embarked features have missing values. Cabin has a lot of missing values. Embarked has only two missing values.
*  There are some categorical variables, like Embarked and Sex, which need to be converted to numerical form.

Similarly, you can observe some other information from the concise summary above.
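The missing values can also be counted directly instead of reading them off the summary. The sketch below uses a small made-up frame (`toy`) in place of the real data; on the tutorial's dataframe the same two calls should work unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_counts = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()    # fraction of missing values per column

print(missing_counts)
print(missing_share)
```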



#### Exploring the attributes:
*  The 'Pclass' column contains a number which indicates the class of the passenger's ticket: 1 for first class, 2 for second class and 3 for third class. This could function as a proxy for the socio-economic status of the passenger ('upper', 'middle', 'low').


*  The 'SibSp' column contains the number of siblings + spouses of the passenger also aboard the Titanic.

*  The 'Parch' column indicates the number of parents + children of the passenger also aboard the Titanic.

*  The 'Ticket' column contains the ticket numbers of passengers (which are not likely to have any predictive power regarding survival).

*  The 'Cabin' column contains the cabin number of the passenger, if he/she had a cabin.

*  The 'Embarked' column indicates the port of embarkation of the passenger: **C**herbourg, **Q**ueenstown or **S**outhampton.

The meaning of the other columns should be clear from their names.

# Data Pre-processing
Now we come to the main agenda of this tutorial i.e. data pre-processing.

### Dropping Irrelevant Features / Columns
Here the goal is to predict the survival of the passengers. Common sense tells us that a person does not survive because of his / her name, ticket number or cabin number. So, we can say that these are irrelevant features. Let's drop them from the dataset, as they do not contribute much to predicting the survival of a passenger.

This is very subjective and solely depends on the nature of the dataset and underlying context. We cannot generalize this procedure to all the datasets.

In [0]:
cols_to_drop = ['Name', 'Ticket', 'Cabin']     # columns / features to be dropped

# We can use the .drop() function of pandas to drop the columns. If you remember from the pandas session, columns are dropped either with axis = 1 or with the 'columns' argument; passing both is redundant.
titanic_data.drop(columns = cols_to_drop, inplace = True)     # we are making changes in the dataframe itself, so, inplace = True

In [0]:
# Let's check what columns we have using the 'columns' attribute
titanic_data.columns                   # returns an Index of the column labels in the dataframe

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare', 'Embarked'],
      dtype='object')

Notice that the 'Name', 'Ticket' and 'Cabin' features are no longer in the dataframe.
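If you would rather keep the original dataframe untouched, drop() can also return a new dataframe. The sketch below assumes a made-up frame (`toy`) and a made-up result name (`trimmed`); it only illustrates the non-inplace form of the same call.

```python
import pandas as pd

# Hypothetical toy frame with the same kind of columns
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Ticket": ["T1", "T2"],
    "Cabin": [None, "C85"],
    "Survived": [0, 1],
})

cols_to_drop = ["Name", "Ticket", "Cabin"]

# Without inplace = True, drop() returns a new dataframe and leaves 'toy' as it was
trimmed = toy.drop(columns=cols_to_drop)

print(list(trimmed.columns))
print(list(toy.columns))
```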

### Dropping Rows with Missing Values
We can also drop some rows with missing values.

In [0]:
# First let's make a copy of the dataset
data = titanic_data.copy()        

Now, we will drop rows with missing values from 'data'.

In [0]:
# The dropna() function of pandas drops rows with missing values. Here axis = 0 (rows) is used instead of axis = 1 (columns).
data.dropna(axis = 0, inplace = True)

In [0]:
# looking at the info
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    object 
 4   Age          712 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Fare         712 non-null    float64
 8   Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB


### Problems with dropping rows

In [0]:
# Number of rows dropped
891- 712

179

Notice that we have dropped 179 of the 891 rows, which is about 20% of the data. That is a substantial amount of information to lose.

In general, a machine learning model tends to perform better when it has more data to learn from. So, we try to preserve as much data as possible.
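The loss can be measured in code before deciding whether dropping rows is acceptable. This is a minimal sketch on a made-up frame (`toy`); the variable names are illustrative. It also shows the `subset` argument of dropna(), which drops rows only when the listed columns are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

rows_before = len(toy)
rows_after = len(toy.dropna())             # drop a row if any value is missing
rows_lost = rows_before - rows_after
share_lost = rows_lost / rows_before

# Drop rows only where 'Embarked' is missing, keeping rows with a missing 'Age'
only_embarked = toy.dropna(subset=["Embarked"])

print(rows_lost, share_lost, len(only_embarked))

# The same ratio for the numbers in this tutorial
titanic_share_lost = 179 / 891
print(round(titanic_share_lost, 3))
```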

In [0]:
# We have our titanic_data where we have not removed any rows.
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB


### Taking Care of Missing Data
There are some missing values in 'Age' and 'Embarked' features.

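We can fill the missing values of the 'Age' feature with its median, or use interpolate(), which by default fills each gap linearly from the neighbouring non-missing values in row order.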

In [0]:
titanic_data['Age'] = titanic_data['Age'].interpolate()
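To see how the two options differ, here is a small sketch on a made-up Series (`ages`), not on the real 'Age' column. The median puts the same value in every gap, while interpolate() spreads values between the known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three gaps
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 54.0])

# Option 1: every gap gets the median of the known values
median_filled = ages.fillna(ages.median())

# Option 2: each gap is filled linearly between its neighbours
interpolated = ages.interpolate()

print(median_filled.tolist())
print(interpolated.round(2).tolist())
```

Which option is better depends on the data. Interpolation relies on the row order carrying some meaning, which may not hold for a passenger list.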

'Embarked' is a categorical variable. Let's take a look at its unique values.

In [0]:
titanic_data.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

There are three unique values, plus NaN for the two missing entries.

In [0]:
# Looking at frequency of each values in 'Embarked'
titanic_data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

'S' is the most frequently occurring value in the 'Embarked' column. So, we can fill the two missing values with 'S', assuming that the two passengers whose port of embarkation is missing might have embarked from 'Southampton' i.e. 'S'.

Let's fill the null values of Embarked.

In [0]:
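# We can use the fillna() function to fill the missing values as discussed in the pandas session / notebook.
# Assigning the result back is more reliable than calling fillna(inplace = True) on a selected column, which newer pandas versions warn about.
titanic_data['Embarked'] = titanic_data['Embarked'].fillna('S')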
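The most frequent value does not have to be typed by hand. As a sketch, assuming a made-up Series (`embarked`) in place of the real column, mode() can find it:

```python
import numpy as np
import pandas as pd

# Hypothetical 'Embarked' values with two gaps
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores missing values by default and returns the most frequent value(s)
most_common = embarked.mode()[0]

filled = embarked.fillna(most_common)

print(most_common)
print(filled.isna().sum())
```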

In [0]:
# We can check if there still exist any missing value using the info() method
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB


All the features have 891 non-null values. Now we don't have any missing values.

### Handling Categorical Variables - Creating Dummy Variables
There are many methods of dealing with categorical data. One of the well-known methods is creating dummy variables.

**Creating Dummy Variables:**
Let's understand using Embarked feature. This feature has 3 unique values. 
In this case 3 new features will be created - 'C', 'Q' and 'S'. If the passenger had embarked from 'Cherbourg' (i.e. C), the newly generated column 'C' will have its value as 1 else 0 and similarly for other dummy variables.

There are three categorical features in this dataset - Pclass, Sex and Embarked. Let's create dummy variables for them.

In [0]:
titanic_data = pd.get_dummies(titanic_data, columns=['Pclass', 'Sex', 'Embarked'])

In [0]:
titanic_data.head(1)

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,22.0,1,0,7.25,0,0,1,0,1,0,0,1


The columns Pclass_1, Pclass_2, Pclass_3, Sex_female, Sex_male, Embarked_C, Embarked_Q and Embarked_S are our dummy variables. The original columns are dropped from the dataframe automatically. If we had created the dummy variables separately for each column and then concatenated them to the dataframe, we would have needed to drop the parent columns ourselves.
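Both routes can be compared on a small made-up frame (`toy`). This is a sketch, not the tutorial's dataframe. The `dtype=int` argument is added here because recent pandas versions return True/False dummy columns by default, while the output above shows 0/1.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", "C", "Q"],
})

# Route 1: one call, the parent column is removed automatically
one_step = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# Route 2: build the dummies separately, concatenate, then drop the parent column ourselves
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
two_step = pd.concat([toy, dummies], axis=1).drop(columns=["Embarked"])

print(one_step)
print(two_step)
```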

### Separating Target and Input Features
We feed the data to the algorithms as the input features (i.e. independent variables) and the target feature (i.e. dependent variable). So, we separate these features from the dataset.

In [0]:
X = titanic_data.drop(columns=['Survived'])     # these are independent features. The change of dropping the column 'Survived' is not inplace.

Y = titanic_data.Survived               # The target feature or dependent feature.

### Splitting data into Train set and Test set
The test set is used to check how the built model performs on data it has not seen during training.
Generally, people use 70% of the data for training the model and 30% for testing it. Some people also use 80% or 90% for training and 20% or 10% for testing.

The splitting is done using the train_test_split function from sklearn.model_selection.

In [0]:
# import the function
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

# X_train: independent feature data for training the model
# Y_train: dependent feature data for training the model
# X_test: independent feature data for testing the model; will be used to predict the target values
# Y_test: original target values of X_test; we will compare these values with our predicted values.
 
# test_size = 0.30: 30% of the data will go for test set and 70% of the data will go for train set
# random_state = 42: this will fix the split i.e. there will be same split for each time you run the code
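
The effect of test_size and random_state can be checked on a small made-up frame (`toy`). The sketch below is only an illustration with invented values: it confirms the 70/30 sizes, shows that running the split twice with the same random_state selects the same rows, and checks that no row ends up in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with ten rows
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13, 30.07, 16.7],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

X = toy.drop(columns=["Survived"])
y = toy["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same arguments, same random_state: the split is repeated exactly
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))
print(X_test.index.equals(X_test_2.index))
print(set(X_train.index) & set(X_test.index))   # rows shared by both sets
```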

# Conclusion
That's it for this tutorial on data pre-processing. Here we discussed some of the basics. Later we will learn about topics like standardization, normalization, One Hot Encoding, etc.
Thanks!

#### **References:**
1. [Implementation of Data Preprocessing on Titanic Dataset by Afroz Chakure](https://towardsdatascience.com/implementation-of-data-preprocessing-on-titanic-dataset-6c553bef0bc6)
2. [Data Pre-Processing By Ayon Roy at DPhi](https://www.youtube.com/watch?v=ni5BO0mO1x8)
3. [Introduction to Data Preprocessing in Machine Learning by Dhairya Kumar](https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d#:~:text=Data%20preprocessing%20is%20an%20integral,feeding%20it%20into%20our%20model.)