<a href="https://colab.research.google.com/github/Utkarsh472/Data-Preprocessing-/blob/main/DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset=pd.read_csv('Data.csv')
# Key rule in machine learning: separate the features from the dependent variable vector before modelling
# Here Country, Age and Salary are the features, and the Purchased column is the dependent variable
X = dataset.iloc[:, :-1].values  # matrix of features: all rows, every column except the last
Y = dataset.iloc[:, -1].values   # dependent variable vector: all rows, last column only
# iloc is a pandas indexer that selects rows and columns by integer position
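
The `iloc` slicing used above can be tried on a small frame. This is a minimal sketch: the three-row `toy` DataFrame is hypothetical and only mirrors the column layout described for `Data.csv`.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as Data.csv
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # every row, every column except the last
target = toy.iloc[:, -1].values     # every row, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```

The first slice in `iloc[rows, columns]` picks rows and the second picks columns, so `:-1` means "all columns up to, but not including, the last one".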


In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [None]:
from sklearn.impute import SimpleImputer  # SimpleImputer is a class in the impute module of scikit-learn (sklearn), a machine learning library
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # imputer object that replaces NaN values with the column mean
# Apply it to the numeric columns (Age and Salary) of the matrix of features
# fit computes the mean of each selected column, ignoring the missing values
imputer.fit(X[:, 1:3])
# transform returns the columns with the missing values filled in; assign the result back into X
X[:, 1:3] = imputer.transform(X[:, 1:3])
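
To see what `fit` and `transform` each contribute, the following sketch runs the same imputer on a small hypothetical array (`toy` and `demo_imputer` are illustrative names, not part of the notebook). After `fit`, the learned column means are available in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one gap in each
toy = np.array([[44.0, 72000.0],
                [27.0, np.nan],
                [np.nan, 54000.0],
                [38.0, 61000.0]])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo_imputer.fit(toy)                 # learns one mean per column, ignoring NaN
print(demo_imputer.statistics_)       # the learned column means

filled = demo_imputer.transform(toy)  # returns a new array with NaN replaced
print(filled)
```

Because `transform` returns a new array by default, the result has to be assigned somewhere, which is why the notebook writes it back into `X[:, 1:3]`.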


In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

In [None]:
# One-hot encoding creates one binary column per country
# We use two classes for this:
# the first is ColumnTransformer, from the compose module of sklearn,
# and the second is OneHotEncoder, from the preprocessing module of sklearn
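
As a small illustration of the binary vectors, the sketch below applies `OneHotEncoder` on its own to a hypothetical single-column array of countries. The encoder orders its categories alphabetically, which determines which output column belongs to which country.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

demo_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```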


### Encoding the Independent Variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
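ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')  # one-hot encode column 0, keep the other columns unchanged
# Fit the transformer on the matrix of features X and transform it in one step
X = np.array(ct.fit_transform(X))
# The result is converted to a NumPy array, the format expected when training models later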

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [None]:
# Import LabelEncoder from the preprocessing module of sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)  # fit_transform already returns a NumPy array, so no extra conversion is needed for the dependent variable vector
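
The mapping that `LabelEncoder` learns can be inspected through its `classes_` attribute. A minimal sketch on hypothetical labels (`labels` and `le_demo` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])  # hypothetical labels

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

print(le_demo.classes_)  # ['No' 'Yes'], sorted; the position is the integer code
print(codes)             # [0 1 0 1]
print(le_demo.inverse_transform(codes))  # back to the original strings
```

`inverse_transform` is useful later for turning integer predictions back into readable labels.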


In [None]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

In [None]:
# train_test_split comes from the model_selection module of sklearn
# It returns four sets: X_train (matrix of features of the training set), X_test (matrix of features of the test set),
# Y_train (dependent variable of the training set) and Y_test (dependent variable of the test set)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)  # 80% of the rows for training, 20% for testing
# The split is deliberately unequal: the model needs most of the data to learn the correlations in the dataset
# random_state=1 fixes the random seed so that the same split is produced on every run
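
The effect of `test_size` and `random_state` can be checked on synthetic data. This sketch uses hypothetical arrays (`X_demo`, `y_demo`) with ten rows, the same number as the dataset above, and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features; row i is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)
c_train, c_test, d_train, d_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
print(np.array_equal(a_test, c_test))  # True: same seed, same split
print(np.array_equal(a_train[:, 0], 2 * b_train))  # True: rows and labels stay aligned
```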

In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(Y_test)

[0 1]


## Feature Scaling

In [None]:
# Feature scaling puts all the features on the same scale, so that a feature with a large range
# does not dominate the others and cause them to be neglected
# by the machine learning model.
# Scaling relies on statistics of each feature, such as its mean and standard deviation.
# If we applied scaling before splitting, those statistics would be computed from all values, including the values in
# the test set.
# The test set stands in for future data seen in production, so applying feature scaling to the original dataset
# before the split would cause information leakage from the test set.

# Feature scaling methods
# 1) Standardisation 2) Normalisation
# 1) Standardisation: x_stand = (x - mean(x)) / std(x)
# 2) Normalisation:   x_norm  = (x - min(x)) / (max(x) - min(x))

# Normalisation is recommended when most of the features follow a normal distribution.
# Standardisation works well in general, so it is the safer default and
# improves the training process.
# We apply feature scaling not to X but to X_train and X_test separately, and the scaler is fitted on X_train only.

# The StandardScaler class is used to standardise the matrices of features X_train and X_test.
# Apply feature scaling to the numerical columns only, not to the dummy variables; otherwise we would lose track of which dummy variable corresponds to which country.
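
The two formulas above can be written directly in NumPy. This is a minimal sketch on a hypothetical `ages` array. `np.std` uses the population standard deviation (`ddof=0`) by default, which is also what `StandardScaler` uses.

```python
import numpy as np

ages = np.array([27.0, 35.0, 38.0, 44.0, 48.0])  # hypothetical feature values

# Standardisation: subtract the mean, divide by the standard deviation
standardized = (ages - ages.mean()) / ages.std()

# Normalisation: subtract the minimum, divide by the range (max - min)
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)  # mean 0, standard deviation 1
print(normalized)    # values between 0 and 1
```

Note the parentheses around `max - min` in the normalisation formula: the whole range is the denominator.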

In [None]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train[:,3:]=sc.fit_transform(X_train[:,3:])
X_test[:,3:]=sc.transform(X_test[:,3:])

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


We need to apply to X_test the same scaler that was fitted on X_train, because X_test will be the input to the predict function of the machine learning model.
For the predictions to be consistent with the way the model was trained, the test features must be scaled with the mean and standard deviation computed from X_train.
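
As a rough check of this point, the sketch below fits a `StandardScaler` on a hypothetical training column and confirms that `transform` on the test column uses the training mean and standard deviation rather than the test set's own statistics. The names `train`, `test`, and `demo_scaler` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[30.0], [40.0], [50.0]])  # hypothetical training feature
test = np.array([[60.0]])                   # hypothetical test feature

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# Manual computation with the training mean and (population) standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(demo_scaler.mean_)                 # [40.]
print(np.allclose(test_scaled, manual))  # True
```

Calling `fit_transform` on the test set instead would recompute the statistics from the test data, which is exactly the leakage described in the feature scaling notes.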

In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
