# Data Preprocessing Tools

**Handling Missing Data Using Mean Imputation: <br>`sklearn.impute -> SimpleImputer(missing_values,strategy)` <br><br>
Encoding Independent Variables: <br>`sklearn.compose-> ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')`,<br>`sklearn.preprocessing -> OneHotEncoder()`<br><br>
Encoding Dependent Variables: <br>`sklearn.preprocessing -> LabelEncoder()`<br><br>
Train-Test Data Split: <br>`sklearn.model_selection -> train_test_split(X,y,test_size=0.2,random_state=1)`<br><br>
Feature Scaling: <br>`sklearn.preprocessing -> StandardScaler()`**

## Importing the libraries

In [23]:
import numpy as np
import pandas as pd

## Importing the dataset

In [24]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
dataset= pd.read_csv(r"C:\Users\gurun\Desktop\VAC Predictive Analysis Python\Datasets\Data.csv")
dataset

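| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | No |
| 5 | France | 35.0 | 58000.0 | No |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | No |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |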


## Splitting Data into Dependent and Independent Variables

In [25]:
#Split dataset into Dependent and Independent Variables
X= dataset.iloc[:,:-1].values
y= dataset.iloc[:,-1].values

This cell splits the dataset into independent variables (features) and a dependent variable (target).

The data is held in a pandas DataFrame named dataset.

The iloc indexer selects rows and columns by integer position. dataset.iloc[:,:-1] selects all rows and every column except the last; these are the independent variables. The values attribute returns them as a NumPy array, which is stored in X. Because the columns mix strings and numbers, X has dtype object.

dataset.iloc[:,-1] selects all rows of the last column only, which is the dependent variable. Its values are stored in a NumPy array named y.

This split is a common early step in machine learning workflows, where the independent variables are used to predict the dependent variable.
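The slicing can be checked on a tiny frame. The sketch below is illustrative only: the three-row DataFrame and the names demo, X_demo, and y_demo are made up here and merely mirror the column layout of Data.csv.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as Data.csv
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
print(X_demo.dtype)  # object, because strings and floats are mixed
```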

## Taking care of missing data

In [26]:
from sklearn.impute import SimpleImputer

In [27]:
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(X[:,1:3])
X[:,1:3]=imputer.transform(X[:,1:3])

In [28]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

The SimpleImputer class replaces missing values according to a chosen strategy. Here, missing values are represented as np.nan (NumPy's not-a-number), and each one is replaced with the mean of the non-missing values in its column.

The fit method computes those means for the second and third columns (X[:,1:3], that is, Age and Salary) of the 2-dimensional array X.

The transform method then replaces the missing values in these columns with the corresponding means, and the result is assigned back to X[:,1:3].

In the output above, the missing Salary has become 63777.78 and the missing Age has become 38.78, the means of the nine known values in each column.
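To see what fit learns and what transform fills in, here is a minimal sketch on a made-up array. The array demo_missing and its values are assumptions chosen so the column means are easy to verify by hand; statistics_ is the fitted attribute in which SimpleImputer stores the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical (age, salary) array with one gap in each column
demo_missing = np.array([
    [40.0, np.nan],
    [np.nan, 50000.0],
    [30.0, 70000.0],
])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo_imputer.fit(demo_missing)            # learns one mean per column
print(demo_imputer.statistics_)           # column means: 35.0 and 60000.0

filled = demo_imputer.transform(demo_missing)
print(filled)                             # gaps replaced by the column means
```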

## Encoding categorical data

### Encoding the Independent Variable

In [29]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [30]:
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
X=np.array(ct.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

This cell one-hot encodes the categorical column among the independent variables stored in the NumPy array X.

The ColumnTransformer class from Scikit-Learn is used to transform specific columns of the array X. The transformers parameter takes a list of tuples, where each tuple specifies a transformer to be applied to a specific column or set of columns. In this case, the tuple contains a string identifier 'encoder', an instance of the OneHotEncoder class, and the index of the column to be transformed [0]. This means that the one-hot encoder will be applied to the first column of the array X.

The remainder parameter is set to 'passthrough', which means that any columns not specified in the transformers parameter are left untouched. They are appended after the encoded columns in the output, which is why Age and Salary appear last.

The fit_transform method of the ColumnTransformer object is then used to fit the transformer on the specified column(s) and transform the array X. The resulting transformed array is then converted back to a NumPy array using the np.array function and stored in the variable X.

In summary, this code applies one-hot encoding to the first column of X and stores the result back in X. The Country column is replaced by three binary columns, one per unique value (France, Germany, Spain, in sorted order), so the array grows from three columns to five. This is a common preprocessing step for nominal categorical variables, because it avoids implying an order among the categories.
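The change in column count is easy to confirm on a small example. This is a sketch under assumptions: the array demo_cat is invented for illustration, and named_transformers_ is used only to look up the fitted encoder and print the category order it chose.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical array: one categorical column followed by one numeric column
demo_cat = np.array([
    ["France", 44.0],
    ["Spain", 27.0],
    ["Germany", 30.0],
    ["Spain", 38.0],
], dtype=object)

demo_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
demo_encoded = np.array(demo_ct.fit_transform(demo_cat))

print(demo_cat.shape, "->", demo_encoded.shape)  # (4, 2) -> (4, 4)
# Categories are sorted, so the binary columns are France, Germany, Spain
print(demo_ct.named_transformers_["encoder"].categories_)
print(demo_encoded[0])  # France row: 1, 0, 0 followed by the age 44.0
```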

### Encoding the Dependent Variable

In [31]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y= le.fit_transform(y)
y

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

This cell label-encodes the dependent variable (target) stored in the NumPy array y.

The LabelEncoder class from Scikit-Learn is used to encode the unique values of the target variable y as integers. The fit_transform method of the LabelEncoder object is used to fit the encoder on the unique values of y and transform y by replacing each unique value with its corresponding encoded integer value. The resulting transformed array is then stored back in the variable y.

Label encoding converts categorical values to integers so that machine learning algorithms can process them. The integers are assigned in sorted order of the labels, so here 'No' becomes 0 and 'Yes' becomes 1. That ordering is arbitrary, which is harmless for a target variable but can mislead algorithms that treat feature values as ordered quantities. For nominal categorical features with no inherent ordering, one-hot encoding as shown in the previous code snippet is usually the more appropriate strategy.
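The mapping between labels and integers can be inspected directly. A minimal sketch, assuming a made-up list demo_labels; classes_ holds the sorted labels and inverse_transform maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values
demo_labels = ["No", "Yes", "No", "Yes", "No"]

demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(demo_labels)

print(demo_le.classes_)                       # sorted labels: 'No', 'Yes'
print(demo_codes)                             # [0 1 0 1 0]
print(demo_le.inverse_transform(demo_codes))  # back to the original labels
```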

## Splitting the dataset into the Training set and Test set

In [32]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2,random_state=1)

This cell uses the train_test_split function from Scikit-Learn to split the independent and dependent variables into training and testing sets.

The call passes four arguments: X, y, test_size, and random_state.

X and y are the independent and dependent variables of the dataset, respectively.

test_size is a float or an integer that specifies the proportion or absolute number of samples to include in the test split. In this case, test_size is set to 0.2, which means that 20% of the data will be used for testing, and the remaining 80% will be used for training. With the 10 rows in this dataset, that gives 2 test rows and 8 training rows.

random_state is an optional parameter that sets the random seed for reproducibility. If random_state is not specified, a different random split will be generated each time the function is called. In this case, random_state is set to 1 for reproducibility.

The train_test_split function returns four NumPy arrays: X_train, X_test, y_train, and y_test. X_train and y_train are the training set for the independent and dependent variables, respectively. X_test and y_test are the testing set for the independent and dependent variables, respectively.

In summary, this code is splitting a dataset into training and testing sets for the independent and dependent variables using the train_test_split function. The resulting training and testing sets are stored in the variables X_train, X_test, y_train, and y_test. This is a common step in machine learning workflows to evaluate the performance of a machine learning model on unseen data.
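The split sizes and the effect of random_state can be verified on synthetic data. This is an illustrative sketch: X_demo and y_demo are made-up arrays built so that each feature row can be matched to its target by value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features; row i is [2*i, 2*i + 1]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Rows of X and entries of y stay paired after shuffling
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True

# Repeating the call with the same seed reproduces the same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(y_te, y_te_again))  # True
```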

## Feature Scaling

In [33]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train[:,3:]=sc.fit_transform(X_train[:,3:])
X_test[:,3:]=sc.transform(X_test[:,3:])

In [34]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

This cell applies feature scaling with the StandardScaler class from Scikit-Learn to the training and testing sets of independent variables generated in the previous step.

The StandardScaler class is used to standardize the independent variables by scaling them to have zero mean and unit variance. This is a common preprocessing step in machine learning workflows to improve the performance and convergence of many machine learning algorithms, particularly those that are sensitive to the scale of the input features.

The fit_transform method of the StandardScaler object is used to fit the scaler on the training set and transform the selected columns [3:] of the training set by subtracting their mean and dividing by their standard deviation. The resulting transformed values are then stored back in the corresponding columns of X_train.

The transform method of the StandardScaler object is then used to transform the selected columns [3:] of the testing set using the same scaling parameters learned from the training set. The resulting transformed values are then stored back in the corresponding columns of X_test.

Note that the first three columns of X_train and X_test are not scaled because they are one-hot encoded categorical variables. They already take only the values 0 and 1, and standardizing them would make the dummy columns harder to interpret without a clear benefit.

In summary, this code standardizes the numerical columns of the training and testing sets with StandardScaler. The scaler is fitted on the training set only, so no information from the test set influences the scaling parameters.
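The fit-on-train, transform-on-test pattern can be sketched on made-up numbers. The arrays demo_train and demo_test are assumptions for illustration; mean_ and scale_ are the fitted attributes where StandardScaler stores the per-column mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows
demo_train = np.array([[30.0, 50000.0], [40.0, 60000.0], [50.0, 70000.0]])
demo_test = np.array([[40.0, 80000.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(demo_train)  # learns mean and std from train
test_scaled = demo_sc.transform(demo_test)        # reuses the train statistics

print(demo_sc.mean_)              # per-column training means: 40.0 and 60000.0
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)                # test row expressed in training-set units
```

The scaled test row is not guaranteed to have zero mean or unit variance, because it is measured against the training statistics rather than its own.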