
# Importing the Dataset

In [0]:
#Import the libraries used to read and handle the dataset
import pandas as pd
import numpy as np

In [0]:
#Import the dataset and store it in a variable named dataset
dataset = pd.read_csv('Data.csv')

In [3]:
#Now we peek at the top of the dataset. Notice that missing values are stored as NaN.
dataset.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [4]:
# or else we can print the whole dataset
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


Now we check whether any column contains missing values, and then count the missing values in each column.

In [5]:
dataset.isnull().any()

Country      False
Age           True
Salary        True
Purchased    False
dtype: bool

In [6]:
#Now this will count the number of missing values in each column
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

# Handling Missing Data

Here we have various options:
1. Ignore (drop) the data row
2. Use a global constant for the missing values (e.g. a sentinel such as minus infinity)
3. Use the attribute median
4. Use the attribute mode
5. Use the attribute mean

We shall use scikit-learn's SimpleImputer class to handle the missing data.
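
As a rough illustration of the options listed above, the sketch below applies each of them with plain pandas. The toy frame `df` and the sentinel value `-1` are assumptions made up for this example; they are not part of Data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age and one missing Salary
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Option 1: ignore (drop) every row that has a missing value
dropped = df.dropna()

# Option 2: fill with a global constant (-1 is an arbitrary sentinel)
constant = df.fillna(-1)

# Options 3-5: fill each column with its own median, mode, or mean
median_filled = df.fillna(df.median())
mode_filled = df.fillna(df.mode().iloc[0])  # first mode per column when there are ties
mean_filled = df.fillna(df.mean())
```

Dropping rows is the simplest choice but discards data, which matters on a dataset as small as this one.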

In [0]:
from sklearn.impute import SimpleImputer

We create the imputer object with the following arguments:
1. missing_values = np.nan (this is how missing values are stored)
2. strategy = 'mean' (here we use the mean of each attribute to fill in the missing data; other possibilities are 'median', 'most_frequent' and 'constant')

Unlike the older Imputer class, SimpleImputer takes no axis argument: the statistic is always calculated per column.

In [0]:
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

Now we use the fit function to learn the mean of each column, and then the transform function to replace the NaN values in the respective columns. Here the Age and Salary columns are changed.
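
To make the two steps concrete, here is a minimal sketch on a made-up array (the values in `data` are hypothetical). It suggests that `fit` only stores the column means in `statistics_`, while `transform` returns a filled copy and leaves the input untouched.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-column array (think Age, Salary) with one gap per column
data = np.array([[40.0, np.nan],
                 [np.nan, 50000.0],
                 [30.0, 70000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(data)                 # learns the column means, changes nothing
learned_means = imputer.statistics_
filled = imputer.transform(data)  # new array with the gaps filled by those means
```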

In [0]:
imputer = imputer.fit(dataset[['Age','Salary']])

In [0]:
dataset[['Age','Salary']] = imputer.transform(dataset[['Age','Salary']])

In [11]:
#Now we check the dataset for the missing values.
dataset.isnull().any()

Country      False
Age          False
Salary       False
Purchased    False
dtype: bool

In [12]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


# Separating the independent and dependent (i.e. target) features

In [13]:
#Independent Features
X = dataset.iloc[:,0:3].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [14]:
# target
Y = dataset.iloc[:,[3]].values
Y

array([['No'],
       ['Yes'],
       ['No'],
       ['No'],
       ['Yes'],
       ['Yes'],
       ['No'],
       ['Yes'],
       ['No'],
       ['Yes']], dtype=object)

# Categorical Encoding

Categorical data:
1. Is usually non-numeric
2. Represents types of data which may be divided into groups

The different types of categorical data are:
1. Nominal: these have no order among them, e.g. gender
2. Ordinal: these have some order associated with them, e.g. a rating in [1, 2, 3, 4, 5]
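
The distinction affects how each kind is encoded. The sketch below uses pandas only, and the `sizes` and `countries` series are hypothetical examples: the ordinal values get integer codes that follow their order, while the nominal values get one indicator column each.

```python
import pandas as pd

# Hypothetical ordinal feature: the order low < medium < high matters
sizes = pd.Series(["medium", "low", "high", "low"])
ordered = pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)
ordinal_codes = ordered.codes   # integer codes that respect the stated order

# Hypothetical nominal feature: no order, so use one indicator column per value
countries = pd.Series(["France", "Spain", "Germany", "Spain"])
indicators = pd.get_dummies(countries)
```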

In [0]:
from sklearn.preprocessing import LabelEncoder

In [16]:
#The 0th column of X has nominal (non-ordinal) data
X[:,0]

array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France',
       'Spain', 'France', 'Germany', 'France'], dtype=object)

In [17]:
#first label encode 0th column in X
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
X[:,0]

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0], dtype=object)

In [18]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

In [0]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [20]:
# Label encoding implies an order (0 < 1 < 2) that nominal data does not have,
# so reload the original X and one-hot encode it instead
X = dataset.iloc[:,0:3].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [0]:
#Now we one-hot encode 0th column of X
onehotencoder = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(), [0])],    # The column numbers to be transformed (here is [0] but can be [0, 1, 3])
    remainder='passthrough'                         # Leave the rest of the columns untouched
)

In [22]:
# transform the data to get the encoding done
X = onehotencoder.fit_transform(X.tolist())
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [23]:
# Before encoding the target, inspect the shape of Y
Y.shape

(10, 1)

In [24]:
# Flatten the two-dimensional array into a one-dimensional array
Y.ravel().shape

(10,)

In [25]:
# Now label encode the target variable
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y.ravel())
Y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
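
If the original labels are needed again later, a fitted LabelEncoder can map the codes back. The following is a small sketch on a hypothetical `labels` array, not the notebook's `Y`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])  # hypothetical target values
encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the original labels; a label's position is its integer code
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
```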

# Splitting the data into test set and train set

For this purpose we shall use the train_test_split function.
The arguments to this function are:
1. X: the independent features
2. Y: the dependent or target feature
3. test_size: the fraction of the data placed in the test set (0.2 here, i.e. 2 of the 10 rows)
4. random_state: the seed for the shuffle; any fixed integer (0 here) selects the same rows each time you run the program, while None gives a different random split on each run
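
The effect of the seed and of test_size can be checked on a small made-up array. In this sketch `X_demo`, `y_demo` and the seed 42 are arbitrary, hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # hypothetical 10-row feature matrix
y_demo = np.arange(10)                  # hypothetical targets

# The same integer seed always yields the same split, whatever the integer is
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(a_test, b_test)

# test_size=0.2 on 10 rows leaves 8 rows for training and 2 for testing
sizes = (len(a_train), len(a_test))
```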

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
# split the dataset into testing and training data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 0)

In [28]:
X_train

array([[0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [29]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0]], dtype=object)

In [30]:
Y_train

array([1, 1, 1, 0, 1, 0, 0, 1])

In [31]:
Y_test

array([0, 0])

In [32]:
len(X_train), len(X_test), len(Y_train), len(Y_test)

(8, 2, 8, 2)

# Feature Scaling

Feature scaling is a method used to standardize the range of the independent variables or features of the data.

StandardScaler standardizes features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples and s is the standard deviation of the training samples.
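
The formula can be worked through by hand with NumPy. This is only a sketch on hypothetical `train` and `test` values; it uses the population standard deviation (`ddof=0`), which is also what StandardScaler uses.

```python
import numpy as np

train = np.array([27.0, 35.0, 44.0, 48.0])   # hypothetical training ages
test = np.array([30.0, 50.0])                # hypothetical test ages

u = train.mean()   # mean of the training samples
s = train.std()    # population standard deviation (ddof=0)

z_train = (train - u) / s
z_test = (test - u) / s   # reuse the training u and s for the test samples
```

After scaling, `z_train` has mean 0 and standard deviation 1. `z_test` generally does not, because it is scaled with statistics from the training samples.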

In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
# initializing the standard scaler
sc_X = StandardScaler()

In [35]:
# Learn the mean and standard deviation from the training set, then scale it
X_train = sc_X.fit_transform(X_train)
X_train

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])

In [36]:
# Scale the test set with the mean and standard deviation learned from the training set
X_test = sc_X.transform(X_test)
X_test

array([[-1.        ,  2.64575131, -0.77459667, -1.45882927, -0.90166297],
       [-1.        ,  2.64575131, -0.77459667,  1.98496442,  2.13981082]])

In [37]:
#Inverse of scaling to get the original data
X_train = sc_X.inverse_transform(X_train)
X_train

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04]])
