# Data Preprocessing Tools

## Importing the libraries

In [18]:
import numpy as np  # NumPy: arrays and numerical operations
import matplotlib.pyplot as plt  # Matplotlib: plots and charts
import pandas as pd  # pandas: loading and handling datasets

## Importing the dataset

In [19]:
dataset = pd.read_csv("Data.csv")  # read the CSV file into a DataFrame
# A dataset consists of features and a dependent variable.
# Features are the columns used to predict the dependent variable.
# Usually the features come first and the dependent variable is the last column.
# iloc selects by integer position as [rows, columns]: ":" means every row,
# ":-1" means every column except the last (like string slicing),
# and .values returns the selection as a NumPy array.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values  # last column only
# By convention, uppercase X denotes the feature matrix and lowercase y the target vector.
# Identify missing data: count the missing entries in each column.
missing_data = dataset.isnull().sum()
print("Missing Data:")
print(missing_data)

Missing Data:
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
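
As a small illustration of the slicing above, the sketch below rebuilds the first three rows of the dataset inline (taken from the outputs printed in this notebook) instead of reading Data.csv, so it can run on its own. The names `sample_csv`, `sample`, `X_sample` and `y_sample` are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical inline stand-in for Data.csv: the first three rows shown in this notebook.
sample_csv = io.StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,54000,No\n"
)
sample = pd.read_csv(sample_csv)

X_sample = sample.iloc[:, :-1].values  # every row, every column except the last
y_sample = sample.iloc[:, -1].values   # every row, last column only

print(X_sample.shape)  # (3, 3)
print(y_sample)
```

The same two slices work for any table where the target sits in the last column, whatever the number of feature columns.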


In [20]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [21]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [22]:
# Replace each missing value with the mean of its column,
# using the SimpleImputer class from scikit-learn (sklearn).
from sklearn.impute import SimpleImputer
# Alternatives include strategy='median', or dropping the rows that contain missing values.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the numeric columns (Age and Salary), then fill in the missing values.
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])  # transform returns the columns with the missing values filled




In [23]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
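
The imputed Salary value can be checked by hand. The sketch below recomputes the column mean with plain NumPy while ignoring the missing entry. It only illustrates what `strategy='mean'` amounts to for one column and is not SimpleImputer's actual implementation; the names `salary`, `col_mean` and `filled` are introduced for this example.

```python
import numpy as np

# Salary column as printed above, with the missing entry as NaN.
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

col_mean = np.nanmean(salary)  # mean over the nine known salaries
filled = np.where(np.isnan(salary), col_mean, salary)  # NaN replaced by the mean

print(col_mean)
print(filled[4])
```

Both printed values should agree with the 63777.78 that appears in the imputed matrix.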


## Encoding categorical data

This dataset contains a categorical column (Country), so its categories have to be turned into numbers. One option is to label them 0, 1, 2, but a model could read that as an ordering between the countries, which may hurt accuracy. The alternative is one-hot encoding: because Country has three categories, the column is replaced by three columns, and each country becomes a binary vector.
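
To make the difference between the two encodings concrete, here is a minimal NumPy sketch on a few country values. It is not how OneHotEncoder is implemented; it assumes the categories are ordered alphabetically, and the names `countries`, `categories`, `codes` and `one_hot` are chosen for this example.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, codes = np.unique(countries, return_inverse=True)

# Integer labels (0, 1, 2) suggest an ordering that does not exist between countries.
print(codes)

# One-hot: take row `code` of the identity matrix, giving one binary column per category.
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no country is treated as larger or smaller than another.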

### Encoding the Independent Variable

In [24]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],remainder='passthrough')
X = np.array(ct.fit_transform(X))


In [25]:
print(X)  # Country is now three one-hot columns: France = 1 0 0, Germany = 0 1 0, Spain = 0 0 1

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [26]:
# Encode "Yes" as 1 and "No" as 0 using the LabelEncoder class.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [27]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

Some people say feature scaling should be applied before splitting the dataset.
It should be applied after the split: if scaling came first, the mean and standard deviation would be computed over the test rows as well, leaking information from the test set into training.
Splitting divides the dataset into two sets: a training set used to fit the model, and a smaller test set used to evaluate its performance on new observations.
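
The idea behind the split can be sketched with NumPy alone: shuffle the row indices, then hold out a fraction of them. This is an illustration only. It does not reproduce the exact rows that scikit-learn's `train_test_split` selects for a given `random_state`, and the names `n_rows`, `rng`, `indices`, `train_idx` and `test_idx` are assumptions made for this example.

```python
import numpy as np

n_rows = 10      # number of rows in this dataset
test_size = 0.2  # fraction of rows held out for testing

rng = np.random.default_rng(1)     # seeded for repeatability; not sklearn's shuffle
indices = rng.permutation(n_rows)  # shuffled row indices

n_test = int(round(n_rows * test_size))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))  # 8 2
```

Indexing `X` and `y` with the same `train_idx` and `test_idx` keeps every feature row paired with its label.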


In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [29]:
print(X_train)  # the Country columns are now dummy variables, made of 0s and 1s

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [30]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [31]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [32]:
print(y_test)

[0 1]


## Feature Scaling

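Feature scaling puts all features on the same scale, so that features with large values do not dominate the others in the ML model.

Two common techniques are standardisation and normalisation:

$$x_{\text{stand}} = \frac{x - \text{mean}(x)}{\text{std}(x)} \qquad\qquad x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Standardisation gives values mostly between -3 and 3; normalisation gives values between 0 and 1.
Recommendation: standardisation.

We won't fit the scaler on the whole dataset. The scaler is fitted on X_train only, and X_test is then scaled with the mean and standard deviation of X_train.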
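
As a hedged illustration of the two formulas and of reusing the training statistics, the sketch below scales the Age column by hand with NumPy. The values are the training and test ages printed earlier in this notebook; the variable names are assumptions for this example, and the code is not StandardScaler's implementation.

```python
import numpy as np

# Age values as they appear in the printed training and test sets.
age_train = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# Statistics come from the training set only.
mean, std = age_train.mean(), age_train.std()  # population standard deviation (ddof=0)

age_train_std = (age_train - mean) / std
age_test_std = (age_test - mean) / std  # reuses the training mean and std

# Min-max normalisation of the training column, for comparison.
age_train_norm = (age_train - age_train.min()) / (age_train.max() - age_train.min())

print(age_train_std)
print(age_test_std)
print(age_train_norm)
```

The standardised training column has mean 0 and standard deviation 1, while the test column generally does not, because it is scaled with statistics it did not contribute to.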


In [33]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Feature scaling is not applied to the dummy variables: the one-hot columns stand in
# for the categorical Country data and only hold 0 or 1, so they are left as they are.
X_train[:, 3:] = scaler.fit_transform(X_train[:, 3:])
# Apply the scaler fitted on the training set to the test set (transform only, no refit).
X_test[:, 3:] = scaler.transform(X_test[:, 3:])



In [34]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [35]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
