# Data Preprocessing Example

**Data preprocessing** is a step in the data mining and data analysis process that transforms raw data into a format that computers and machine learning algorithms can understand and analyze.

Data preprocessing improves overall data quality.

Why is data preprocessing important?

- Duplicate or missing values may give an incorrect view of the overall statistics of the data.
- Outliers and inconsistent data points tend to disturb the model’s learning, leading to false predictions.

Data preprocessing handles these problems before analysis or modeling begins.

There are four main steps of data preprocessing:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction

**1. Data Cleaning:**

Data cleaning is the process of filling in missing data and correcting, repairing, or removing incorrect or irrelevant data from a dataset.


- **Missing data:** Some values are missing from the dataset. They can be handled in several ways:

    - Ignore the tuples: suitable when the dataset is large and a tuple has multiple missing values.

    - Fill the missing values: fill them in manually, with the attribute mean, or with the most probable value.
    

- **Noisy data:** Data that carries no meaningful information. It can arise from faulty data collection or data entry errors, and it can be handled in the following ways:

    - Binning: It works on sorted data values. The sorted values are divided into equal-sized bins, and each bin is smoothed independently, for example by replacing its values with the bin mean.

    - Regression: It smooths noise by fitting the data points to a regression function. The regression may be simple linear (one independent variable) or multiple (several independent variables).

    - Clustering: Data with similar values is grouped into clusters. Values that fall outside every cluster can be treated as noise and removed.
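The binning idea above can be sketched in a few lines. This is a minimal illustration, assuming equal-frequency bins and smoothing by bin means; the helper name `smooth_by_bin_means` and the sample values are made up for the example and do not come from the notebook's dataset.

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into (nearly) equal-sized bins,
    and replace every value in a bin with that bin's mean."""
    ordered = np.sort(np.asarray(values, dtype=float))
    bins = np.array_split(ordered, n_bins)
    return [np.full(len(b), b.mean()) for b in bins]

# Illustrative values only (hypothetical, not from data.csv)
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, n_bins=3)

for bin_values in smoothed:
    print(bin_values)
```

Each bin is handled independently, so an extreme value only affects the mean of its own bin.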

**2. Data Integration:**

Data integration merges data from multiple sources into a single, larger data store such as a data warehouse.
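As a small-scale sketch of integration, two tables from different sources can be combined on a shared key with pandas. The tables, the column names, and the key `customer_id` below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical source 1: customer master data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["France", "Spain", "Germany"],
})

# Hypothetical source 2: order records
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [120.0, 80.0, 45.5],
})

# A left join keeps every customer, even those without orders
merged = customers.merge(orders, on="customer_id", how="left")
print(merged)
```

Customers with no matching order get a missing `amount`, which is one way integration can introduce values that data cleaning then has to handle.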

**3. Data Transformation:**

Data transformation converts the data into forms suitable for the mining process. It involves the following techniques:

- **Aggregation:** Data aggregation combines all of your data in a uniform format.


- **Normalization:** It scales the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0). Common methods include:

    - Min-max normalization
    - Z-Score normalization
    - Decimal scaling normalization
    
    
- **Feature selection:** Feature selection is the process of deciding which variables are most important to your analysis. In the related step of feature construction, new attributes are created from existing ones to help the data mining process.


- **Discretization:** This replaces the raw values of a numeric attribute with interval labels or conceptual labels.


- **Concept hierarchy generation:** Concept hierarchy generation can add a hierarchy within and between your features that wasn’t present in the original data. If your analysis contains wolves and coyotes, for example, you could add the hierarchy for their genus: Canis.
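The three normalization methods listed above can be written directly with NumPy. This is a sketch under stated assumptions: the sample values are illustrative, the z-score uses the population standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1.

```python
import numpy as np

# Illustrative salary values
salary = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# Min-max normalization: rescale to the range [0, 1]
min_max = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score normalization: zero mean, unit (population) standard deviation
z_score = (salary - salary.mean()) / salary.std()

# Decimal scaling: divide by 10**j, with j the smallest integer
# such that the largest absolute scaled value is below 1
j = int(np.floor(np.log10(np.abs(salary).max()))) + 1
decimal_scaled = salary / 10 ** j

print(min_max)
print(z_score)
print(decimal_scaled)
```

Only min-max and decimal scaling guarantee a bounded range; z-scores are centered on zero but can fall outside -1.0 to 1.0.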

**4. Data Reduction:**

The size of the dataset can be too large to be handled by data analysis and data mining algorithms.

One possible solution is to obtain a reduced representation of the dataset that is much smaller in volume but produces the same quality of analytical results. For this, we use data reduction techniques.

- **Data Cube Aggregation:** A form of data reduction in which the gathered data is expressed in summary form.


- **Attribute Subset Selection:** Only the highly relevant attributes are kept; the rest can be discarded. Essentially, it removes irrelevant or redundant features.


- **Numerosity Reduction:** The data can be represented by a model or an equation, such as a regression model. Storing the model instead of the full dataset saves a great deal of space.


- **Dimensionality Reduction:** This technique aims to reduce the number of redundant features we consider in machine learning algorithms. It can be done with techniques such as Principal Component Analysis (PCA).
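To make dimensionality reduction concrete, here is a minimal PCA sketch that uses the singular value decomposition of the centered data. The synthetic data (three features, one of them almost a linear combination of the other two) is an assumption made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))

# Third feature is nearly redundant: a linear mix of the first two plus tiny noise
redundant = base @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=100)
X_demo = np.column_stack([base, redundant])

# PCA via SVD: center the data, then project onto the top principal directions
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Keep the first two components
X_reduced = X_centered @ Vt[:2].T

print(explained_ratio.round(4))
print(X_reduced.shape)
```

Because the third feature adds almost no new information, the first two components retain nearly all of the variance.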

## The Dataset

In this notebook, we will apply several of the data preprocessing steps.

The **dataset** that we will use contains 4 columns and describes customers who did or did not buy a product:
- **Country:** Country of the customer
- **Age:** Age of the customer
- **Salary:** Salary of the customer
- **Purchased:** Whether the customer bought the product

## Importing Libraries

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Importing Data

In [2]:
df = pd.read_csv("data.csv")

In [3]:
df.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes


In [9]:
df.isna()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | True | False |
| 5 | False | False | False | False |
| 6 | False | True | False | False |
| 7 | False | False | False | False |
| 8 | False | False | False | False |
| 9 | False | False | False | False |


As seen above, there are two null values, one in Age and one in Salary. We will handle them soon.

Now we will extract the dependent and independent variables. The three independent variables are **Country**, **Age**, and **Salary**; the dependent variable is the **Purchased** column.
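On a larger dataset, scanning the full boolean table is impractical, so a per-column count is usually easier to read. The sketch below rebuilds the same ten rows inline (the name `df_demo` is an assumption so that the example runs without data.csv) and counts the missing values.

```python
import numpy as np
import pandas as pd

# Same ten rows as the notebook's dataset, rebuilt inline for a self-contained example
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_missing)
```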

In [4]:
# Features: every column except the last; target: the last column
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [5]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [6]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## Data Preprocessing

### Missing Data

We can handle missing data in two ways:
- Ignore the tuples: suitable when the dataset is large and a tuple has multiple missing values.
- Fill the missing values: fill them in manually, with the attribute mean, or with the most probable value.


To handle missing values, we will use the SimpleImputer class, which fills in missing values using simple strategies. We will fill them with the mean of each column.

In [13]:
# Replace NaN entries in the numeric columns (Age, Salary) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [14]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

The null values are filled with the mean of the column.
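The imputed numbers can be checked by hand: each is the mean of the nine known values in its column. The sketch below recomputes both means with pandas, which skips NaN by default, and also shows the alternative of dropping incomplete rows. The series names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

age = pd.Series([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = pd.Series([72000, 48000, 54000, 61000, np.nan,
                    58000, 52000, 79000, 83000, 67000], dtype=float)

# Series.mean() ignores NaN, so these are means over the nine known values
print(age.mean())      # 349 / 9
print(salary.mean())   # 574000 / 9

# Alternative strategy: ignore the tuples (drop rows with any missing value)
demo = pd.DataFrame({"Age": age, "Salary": salary})
complete_rows = demo.dropna()
print(len(complete_rows))
```

Dropping rows would discard two of the ten customers here, which is why filling is the more attractive option for a dataset this small.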

### Encoding Categorical Data

To encode the categorical feature, we will use **one-hot encoding**, which converts categorical information into a numeric format that can be fed into machine learning algorithms. One-hot encoding is a common method for dealing with categorical data in machine learning.

It is useful for categories that have no ordinal relationship to each other, and it represents each category as its own binary (0/1) column.

In [18]:
# One-hot encode column 0 (Country); pass Age and Salary through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [19]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
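To see which of the three new columns stands for which country, the encoder's learned categories can be inspected. This is a small standalone sketch on a few country values; the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

onehot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display
encoded = onehot.fit_transform(countries).toarray()

# Categories are sorted, and the output columns follow this order
print(onehot.categories_[0])
print(encoded)
```

The columns therefore correspond to France, Germany, and Spain, in that order, which matches the first three columns of the array above.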

For the **dependent variable**, we will use **label encoding**, which maps each category to an integer. It is a popular technique for encoding target labels and is especially convenient **when a categorical feature has only two possible values**.

In [22]:
le = LabelEncoder()
y = le.fit_transform(y)

In [23]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
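The mapping between labels and integers is stored on the fitted encoder, so it can be inspected and reversed. A brief sketch on a handful of labels (the variable names are assumptions made for the example):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No", "Yes"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(labels)

# classes_ lists the labels in sorted order; a label's position is its integer code
print(label_encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
restored = label_encoder.inverse_transform(codes)
print(restored)
```

Here "No" becomes 0 and "Yes" becomes 1 because the classes are sorted alphabetically.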

### Splitting The Dataset 

In data preprocessing, we divide the dataset into a training set and a test set.

The train_test_split function below assigns 30% of the data to the test set and the remaining 70% to the training set.

In [25]:
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [27]:
X_train.shape

(7, 5)

In [28]:
X_test.shape

(3, 5)

In [29]:
y_train.shape

(7,)

In [30]:
y_test.shape

(3,)
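With only ten rows, a purely random split can leave the test set with a single class. One optional variation, not used in this notebook, is the `stratify` argument of train_test_split, which keeps the class proportions similar in both parts. The arrays below are placeholders made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and the same 0/1 labels as the encoded target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_te)
```

The split sizes stay at 7 and 3 rows, and the stratified test set contains both classes.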

### Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data.

If we look at our dataset, we can see that the values in the Age and Salary columns are on very different scales. With feature scaling, we put the variables on the same scale so that no variable dominates the others.

In [38]:
df

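| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |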


In [32]:
sc = StandardScaler()

In [34]:
# Fit the scaler on the training set only, then apply the same scaling to the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [35]:
X_train

array([[-0.8660254 ,  1.58113883, -0.63245553, -0.03891021, -0.22960023],
       [ 1.15470054, -0.63245553, -0.63245553,  0.50583275,  0.49120535],
       [-0.8660254 , -0.63245553,  1.58113883, -0.31128169, -0.47311563],
       [-0.8660254 , -0.63245553,  1.58113883, -1.80932482, -1.6127677 ],
       [ 1.15470054, -0.63245553, -0.63245553,  1.0505757 ,  1.10486416],
       [-0.8660254 ,  1.58113883, -0.63245553,  1.32294718,  1.45552633],
       [ 1.15470054, -0.63245553, -0.63245553, -0.71983891, -0.73611226]])

In [36]:
X_test

array([[-3.17206578e-17,  1.00000000e+00, -3.17206578e-17,
         3.00000000e+01,  5.40000000e+04],
       [ 1.00000000e+00, -3.17206578e-17, -3.17206578e-17,
         3.70000000e+01,  6.70000000e+04],
       [-3.17206578e-17, -3.17206578e-17,  1.00000000e+00,
         3.87777778e+01,  5.20000000e+04]])
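A quick sanity check helps confirm that scaling behaved as expected. The sketch below uses a few Age and Salary rows from the dataset (the train/test grouping here is an arbitrary assumption for the example) and verifies two things: the scaled training columns have mean 0 and standard deviation 1, and the test rows are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary rows; the grouping into train and test is arbitrary for this sketch
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [38.0, 61000.0], [48.0, 79000.0]])
test = np.array([[30.0, 54000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Training columns should have mean ~0 and standard deviation ~1
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))

# The test rows are standardized with the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Standardized values normally lie within a few units of zero. An array that still shows magnitudes such as 30 or 54000 is on the original scale, which suggests it was displayed before transformation or after an inverse transformation.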