# Data Preprocessing

This notebook is part of the **Data Preprocessing/Cleaning** node of the Klee project.
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. *Raw data* often arrives with unusual distributions and non-uniform formats, which makes analysis and model building difficult. In this notebook we use several libraries to perform common data cleaning steps on a tabular dataset.


#### Usage:
This notebook performs common data cleaning operations on any tabular dataset read in CSV or Datasets format. Refer to the **Load the dataset** part of the notebook to try it with your own dataset file.

#### Methods used in this notebook:
- Removing unwanted features
- Imputing missing values
- Scaling
- Normalization
- Encoding for categorical columns
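
The first item in this list, removing unwanted features, usually comes down to dropping columns. The following is a minimal sketch, assuming (hypothetically) that identifier-like columns such as `Name` and `Ticket` are not needed for modeling. The helper name `dropFeatures` and the two-row toy frame are illustrative and not part of the notebook's code.

```python
import pandas as pd

def dropFeatures(inputData, columns):
    """Return a copy of inputData without the listed columns (hypothetical helper)."""
    # errors="ignore" skips names that are not present instead of raising
    return inputData.drop(columns=columns, errors="ignore")

# Toy frame built from the first two rows of the Titanic sample
toy = pd.DataFrame({
    "PassengerId": [892, 893],
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"],
    "Ticket": ["330911", "363272"],
    "Fare": [7.8292, 7.0],
})

reduced = dropFeatures(toy, ["Name", "Ticket", "NotAColumn"])
print(list(reduced.columns))
```

Because `DataFrame.drop` returns a new frame, the original `toy` frame keeps all of its columns.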

#### Libraries used:
- We use ``scikit-learn`` to investigate and clean the dataset. Cleaning the input data is an important step before the modeling part of the data science pipeline, because unclean data can harm the model's predictions.

#### Input:
The input to this notebook is a tabular dataset.

#### Output:
The output of this notebook is a cleaned version of the same dataset.

In [18]:
# Imports

import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction import DictVectorizer

In [22]:
# loading dataset
url = "https://github.com/nikbearbrown/Visual_Analytics/raw/main/CSV/titanic.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### Impute missing values
Real-world datasets usually contain many missing values, often encoded as blanks, NaNs or other placeholders. Such datasets are incompatible with many functions and models, which assume that all values in an array are numerical and meaningful. A basic strategy for using incomplete datasets is to discard entire rows and/or columns that contain missing values; the alternative is to impute the missing values, i.e., to infer them from the known part of the data. In this notebook:
- **Categorical values**, represented as strings or pandas categoricals, can be imputed with the 'most_frequent' or 'constant' strategy; this notebook uses 'most_frequent'
- **Numerical values** are imputed using the "mean" of the feature
\
\
> Methods to be included:
- Nearest neighbour imputation
- A model trained on existing data to impute missing values
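
To make the strategies above concrete, here is a small sketch on a toy frame. The column names and values are made up for illustration. It fills numeric gaps with the column mean and string gaps with the most frequent value, then shows scikit-learn's `KNNImputer` as one possible way to do the nearest neighbour imputation listed above. The choice of `n_neighbors=2` is an assumption made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy frame (hypothetical values) with gaps in numeric and string columns
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "fare": [10.0, 12.0, np.nan, 11.0],
    "port": pd.Series(["S", "S", np.nan, "Q"], dtype=object),
})
num_cols = ["age", "fare"]
cat_cols = ["port"]

simple_filled = toy.copy()
# Numeric columns: replace NaN with the mean of the observed values
simple_filled[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
# String columns: replace NaN with the most frequent observed value
simple_filled[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])
print(simple_filled)

# Nearest neighbour imputation: each gap is filled from the rows
# that are closest on the features both rows have observed
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy[num_cols])
print(knn_filled)
```

`KNNImputer` works on numeric data only, so string columns would still need a separate strategy such as `most_frequent`.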

In [23]:
def imputeMissingValues(inputData, columns=None):
    """
    Fill missing values in place: the column mean for numeric columns,
    the most frequent value for object (string) columns.
    If `columns` is empty, every column is considered.
    """
    subset = inputData[columns] if columns else inputData
    ContinuousColumns = subset.select_dtypes(exclude='object').columns
    CategoricalColumns = subset.select_dtypes(include='object').columns

    # for all continuous columns (SimpleImputer defaults to strategy="mean")
    if len(ContinuousColumns) > 0:
        imp = SimpleImputer()
        inputData[ContinuousColumns] = imp.fit_transform(inputData[ContinuousColumns])

    # for all categorical columns
    if len(CategoricalColumns) > 0:
        imp = SimpleImputer(strategy="most_frequent")
        inputData[CategoricalColumns] = imp.fit_transform(inputData[CategoricalColumns])

    return inputData

In [24]:
df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [25]:
imputeMissingValues(df,["Cabin","Age","Fare"])

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.50000,0,0,330911,7.8292,B57 B59 B63 B66,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.00000,1,0,363272,7.0000,B57 B59 B63 B66,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.00000,0,0,240276,9.6875,B57 B59 B63 B66,Q
3,895,3,"Wirz, Mr. Albert",male,27.00000,0,0,315154,8.6625,B57 B59 B63 B66,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.00000,1,1,3101298,12.2875,B57 B59 B63 B66,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,30.27259,0,0,A.5. 3236,8.0500,B57 B59 B63 B66,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.00000,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.50000,0,0,SOTON/O.Q. 3101262,7.2500,B57 B59 B63 B66,S
416,1308,3,"Ware, Mr. Frederick",male,30.27259,0,0,359309,8.0500,B57 B59 B63 B66,S


In [26]:
df.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

### Scaling Input
Input variables may have different units (e.g. feet, kilometers, and hours), which in turn may mean that the variables have different scales. The two most popular techniques for scaling numerical data prior to modeling are **normalization** and **standardization**. Normalization rescales each input variable separately to the range 0-1, a range in which floating-point values are represented with high precision. Standardization rescales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, so that the resulting distribution has a mean of zero and a standard deviation of one.
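
The two definitions above can be checked on a toy frame. This is a sketch with made-up values, using scikit-learn's `MinMaxScaler` for the 0-1 normalization described here and `StandardScaler` for standardization. Note that `preprocessing.Normalizer` is a different operation: it rescales each row to unit norm rather than each column to the range 0-1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame (hypothetical values) with two columns on different scales
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0], "fare": [5.0, 10.0, 30.0]})

# Normalization: (x - min) / (max - min), computed per column
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Standardization: (x - mean) / std, computed per column (population std, ddof=0)
standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(normalized)
print(standardized.mean().round(6))
print(standardized.std(ddof=0).round(6))
```

After normalization every column spans exactly 0 to 1; after standardization every column has mean 0 and standard deviation 1.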

In [27]:
def scalingInput(inputData):
    """
    Standardize every numeric column in place to zero mean and unit variance

    """
    columnNames = inputData.select_dtypes(include='number').columns
    inputData[columnNames] = preprocessing.scale(inputData[columnNames])
    
    return inputData

In [28]:
scalingInput(df)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,-1.727912,0.873482,"Kelly, Mr. James",male,0.334993,-0.499470,-0.400248,330911,-0.498407,B57 B59 B63 B66,Q
1,-1.719625,0.873482,"Wilkes, Mrs. James (Ellen Needs)",female,1.325530,0.616992,-0.400248,363272,-0.513274,B57 B59 B63 B66,S
2,-1.711337,-0.315819,"Myles, Mr. Thomas Francis",male,2.514175,-0.499470,-0.400248,240276,-0.465088,B57 B59 B63 B66,Q
3,-1.703050,0.873482,"Wirz, Mr. Albert",male,-0.259330,-0.499470,-0.400248,315154,-0.483466,B57 B59 B63 B66,S
4,-1.694763,0.873482,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,-0.655545,0.616992,0.619896,3101298,-0.418471,B57 B59 B63 B66,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1.694763,0.873482,"Spector, Mr. Woolf",male,0.000000,-0.499470,-0.400248,A.5. 3236,-0.494448,B57 B59 B63 B66,S
414,1.703050,-1.505120,"Oliva y Ocana, Dona. Fermina",female,0.691586,-0.499470,-0.400248,PC 17758,1.313753,C105,C
415,1.711337,0.873482,"Saether, Mr. Simon Sivertsen",male,0.651965,-0.499470,-0.400248,SOTON/O.Q. 3101262,-0.508792,B57 B59 B63 B66,S
416,1.719625,0.873482,"Ware, Mr. Frederick",male,0.000000,-0.499470,-0.400248,359309,-0.494448,B57 B59 B63 B66,S


In [29]:
def normalizeData(inputData, columnNames=None):
    """
    Rescale each row (sample) to unit L2 norm across the given columns.
    Note: Normalizer works row-wise, so with a single column every
    non-zero value becomes +1 or -1.
    """

    if not columnNames:
        columnNames = list(inputData.select_dtypes(include='number').columns)

    transformer = preprocessing.Normalizer().fit(inputData[columnNames])
    inputData[columnNames] = transformer.transform(inputData[columnNames])
    
    return inputData

In [30]:
normalizeData(df,["Pclass"])

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,-1.727912,1.0,"Kelly, Mr. James",male,0.334993,-0.499470,-0.400248,330911,-0.498407,B57 B59 B63 B66,Q
1,-1.719625,1.0,"Wilkes, Mrs. James (Ellen Needs)",female,1.325530,0.616992,-0.400248,363272,-0.513274,B57 B59 B63 B66,S
2,-1.711337,-1.0,"Myles, Mr. Thomas Francis",male,2.514175,-0.499470,-0.400248,240276,-0.465088,B57 B59 B63 B66,Q
3,-1.703050,1.0,"Wirz, Mr. Albert",male,-0.259330,-0.499470,-0.400248,315154,-0.483466,B57 B59 B63 B66,S
4,-1.694763,1.0,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,-0.655545,0.616992,0.619896,3101298,-0.418471,B57 B59 B63 B66,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1.694763,1.0,"Spector, Mr. Woolf",male,0.000000,-0.499470,-0.400248,A.5. 3236,-0.494448,B57 B59 B63 B66,S
414,1.703050,-1.0,"Oliva y Ocana, Dona. Fermina",female,0.691586,-0.499470,-0.400248,PC 17758,1.313753,C105,C
415,1.711337,1.0,"Saether, Mr. Simon Sivertsen",male,0.651965,-0.499470,-0.400248,SOTON/O.Q. 3101262,-0.508792,B57 B59 B63 B66,S
416,1.719625,1.0,"Ware, Mr. Frederick",male,0.000000,-0.499470,-0.400248,359309,-0.494448,B57 B59 B63 B66,S


### One Hot Encoding
A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.
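
As a small illustration of this mapping, the sketch below encodes a toy `Embarked` column holding three port codes. The four-row frame is made up for the example, and the `Embarked_` prefix for the new column names is an illustrative choice, not something the encoder produces in this code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical rows) with one categorical column
toy = pd.DataFrame({"Embarked": ["Q", "S", "C", "S"]}, dtype=object)

encoder = OneHotEncoder()  # returns a sparse matrix by default
onehot = encoder.fit_transform(toy[["Embarked"]]).toarray()

# categories_ holds the sorted category labels found for each input column
names = [f"Embarked_{c}" for c in encoder.categories_[0]]
encoded = pd.DataFrame(onehot, columns=names, index=toy.index)
print(encoded)
```

Each row contains a single 1 in the column of its category. Passing `drop='first'` to `OneHotEncoder` removes the first indicator column of each feature, since it can be inferred from the remaining ones.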

In [31]:
            
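def encodeCategory(inputData):
    """
    One-hot encode every object (string) column and return a new frame with
    the indicator columns appended. The first category of each column is dropped.
    """

    columnNames = inputData.select_dtypes(include=['object']).columns
    # show which columns are encoded and their current values
    print(columnNames)
    print(inputData[columnNames])
    drop_enc = preprocessing.OneHotEncoder(drop='first')
    # reuse the input index so that join aligns the rows correctly
    encoded = pd.DataFrame(drop_enc.fit_transform(inputData[columnNames]).toarray(),
                           index=inputData.index)
    inputData = inputData.join(encoded)
    
    return inputData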

In [32]:
encodeCategory(df)

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
                                             Name     Sex              Ticket  \
0                                Kelly, Mr. James    male              330911   
1                Wilkes, Mrs. James (Ellen Needs)  female              363272   
2                       Myles, Mr. Thomas Francis    male              240276   
3                                Wirz, Mr. Albert    male              315154   
4    Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female             3101298   
..                                            ...     ...                 ...   
413                            Spector, Mr. Woolf    male           A.5. 3236   
414                  Oliva y Ocana, Dona. Fermina  female            PC 17758   
415                  Saether, Mr. Simon Sivertsen    male  SOTON/O.Q. 3101262   
416                           Ware, Mr. Frederick    male              359309   
417                      Peter, Master.

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,847,848,849,850,851,852,853,854,855,856
0,-1.727912,1.0,"Kelly, Mr. James",male,0.334993,-0.499470,-0.400248,330911,-0.498407,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.719625,1.0,"Wilkes, Mrs. James (Ellen Needs)",female,1.325530,0.616992,-0.400248,363272,-0.513274,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,-1.711337,-1.0,"Myles, Mr. Thomas Francis",male,2.514175,-0.499470,-0.400248,240276,-0.465088,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-1.703050,1.0,"Wirz, Mr. Albert",male,-0.259330,-0.499470,-0.400248,315154,-0.483466,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-1.694763,1.0,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,-0.655545,0.616992,0.619896,3101298,-0.418471,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1.694763,1.0,"Spector, Mr. Woolf",male,0.000000,-0.499470,-0.400248,A.5. 3236,-0.494448,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
414,1.703050,-1.0,"Oliva y Ocana, Dona. Fermina",female,0.691586,-0.499470,-0.400248,PC 17758,1.313753,C105,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
415,1.711337,1.0,"Saether, Mr. Simon Sivertsen",male,0.651965,-0.499470,-0.400248,SOTON/O.Q. 3101262,-0.508792,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
416,1.719625,1.0,"Ware, Mr. Frederick",male,0.000000,-0.499470,-0.400248,359309,-0.494448,B57 B59 B63 B66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [33]:
df.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')