# Project Phase 3: Data Preprocessing
In this stage, the dataset is formatted so that it can be used by machine learning models. That is, all data must be encoded numerically. One-hot encoding and ordinal encoding are used to encode the non-numeric data. This notebook also covers the feature selection phase.

## Loading the Cleaned Dataset
Let's start by loading our cleaned dataset into a `pandas.DataFrame` object.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Import the feature names, convert them to a NumPy array, and then flatten them into one dimension.
ATTRS_NUM = pd.read_csv("dataset/constants/ATTRS_NUM.csv", index_col=0).to_numpy().flatten()

DATASET = pd.read_csv("dataset/cleaned/Dataset.csv", index_col="EmployeeNumber")
DATASET.head()

EmployeeNumber,Age,DailyRate,DistanceFromHome,EmployeeCount,HourlyRate,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,...,WorkLifeBalance,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime,Attrition
1,41.0,1102.0,1.0,1.0,94.0,5993.0,19479.0,8.0,11.0,3.0,...,1,Travel_Rarely,Sales,Life Sciences,Female,Sales Executive,Single,Y,Yes,1
2,49.0,279.0,8.0,1.0,61.0,5130.0,24907.0,1.0,23.0,4.0,...,3,Travel_Frequently,Research & Development,Life Sciences,Male,Research Scientist,Married,Y,No,0
4,37.0,1373.0,2.0,1.0,92.0,2090.0,2396.0,6.0,15.0,3.0,...,3,Travel_Rarely,Research & Development,Other,Male,Laboratory Technician,Single,Y,Yes,1
5,33.0,1392.0,3.0,1.0,56.0,2909.0,23159.0,1.0,11.0,3.0,...,3,Travel_Frequently,Research & Development,Life Sciences,Female,Research Scientist,Married,Y,Yes,0
7,27.0,591.0,2.0,1.0,40.0,3468.0,16632.0,9.0,12.0,3.0,...,3,Travel_Rarely,Research & Development,Medical,Male,Laboratory Technician,Married,Y,No,0


## Splitting the Dataset into Features and Targets
Next, we split the columns of our dataset into two parts: the *features* (the matrix of independent variables `X`) and the *target* (the vector of the dependent variable `y`). This is where the column re-arrangement step from the [data cleaning phase](./01%20-%20Data%20Cleaning.ipynb) pays off: because the target is the last column, the `pandas.DataFrame.iloc` indexer can split the dataset into two slices by position.

In [3]:
# Define the feature matrix and target vector that will be fed to the machine learning models.
X = DATASET.iloc[:, 0:-1]
y = DATASET.iloc[:, -1]

# Conduct One-Hot Encoding on the Nominal Data of the feature matrix
X = pd.get_dummies(data=X, drop_first=True)

# Save the column names of the newly encoded dataset
ATTRS_ENCODED = X.columns
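
To make the effect of `iloc` slicing and `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values and the `toy_*` names are hypothetical and are not taken from the real dataset; only the column layout (features first, target last) mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column, one nominal column,
# and the target in the last position.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "Attrition": [1, 0, 1],
})

toy_X = toy.iloc[:, 0:-1]  # every column except the last
toy_y = toy.iloc[:, -1]    # the last column only

# drop_first=True drops the first category in sorted order ("Divorced"),
# so a row with zeros in both remaining dummy columns encodes that category.
toy_encoded = pd.get_dummies(data=toy_X, drop_first=True)
encoded_columns = list(toy_encoded.columns)
print(encoded_columns)
```

Dropping one level per nominal attribute avoids carrying a redundant column, since the dropped category is implied when all of its sibling dummy columns are zero.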

In [4]:
# Preview the Features
X.head()

EmployeeNumber,Age,DailyRate,DistanceFromHome,EmployeeCount,HourlyRate,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,...,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes
1,41.0,1102.0,1.0,1.0,94.0,5993.0,19479.0,8.0,11.0,3.0,...,0,0,0,0,0,1,0,0,1,1
2,49.0,279.0,8.0,1.0,61.0,5130.0,24907.0,1.0,23.0,4.0,...,0,0,0,0,1,0,0,1,0,0
4,37.0,1373.0,2.0,1.0,92.0,2090.0,2396.0,6.0,15.0,3.0,...,1,0,0,0,0,0,0,0,1,1
5,33.0,1392.0,3.0,1.0,56.0,2909.0,23159.0,1.0,11.0,3.0,...,0,0,0,0,1,0,0,1,0,1
7,27.0,591.0,2.0,1.0,40.0,3468.0,16632.0,9.0,12.0,3.0,...,1,0,0,0,0,0,0,1,0,0


In [5]:
# Preview the Target
y.head()

EmployeeNumber
1    1
2    0
4    1
5    0
7    0
Name: Attrition, dtype: int64

## Scaling the Features
If you revisit the histograms from the [data visualization phase](./02%20-%20Data%20Visualization.ipynb), you can observe that our attributes span very different ranges. For instance, the `Age` attribute ranges from roughly `20` to `60`, but `DailyRate` ranges from roughly `200` to `1400`. Such differences in scale can affect the training speed and accuracy of some machine learning models. We can deal with this using `sklearn.preprocessing.StandardScaler` or another feature scaler.

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Convert the X DataFrame and the y Series into NumPy arrays
X = X.values
y = y.values

# Indices of the columns to be scaled (assumes the numeric attributes are the leading columns)
COLUMNS_TO_BE_SCALED = list(range(len(ATTRS_NUM)))

# Define the column transformer with a standard scaler targeted at the columns in COLUMNS_TO_BE_SCALED.
column_transformer = ColumnTransformer(
  [("Standard Scaler", StandardScaler(), COLUMNS_TO_BE_SCALED)],
  remainder="passthrough"
)

# Scale the whole feature set
X_scaled = column_transformer.fit_transform(X)
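
As a rough illustration of what the transformer does, the sketch below applies the same `ColumnTransformer` setup to a hypothetical three-row matrix (the values and the `toy_*` names are made up for this example). The scaled columns end up with zero mean and unit standard deviation, while the passthrough column is left untouched.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two numeric columns followed by one dummy column.
toy_X = np.array([
    [20.0,  200.0, 1.0],
    [40.0,  800.0, 0.0],
    [60.0, 1400.0, 1.0],
])

toy_transformer = ColumnTransformer(
    [("scaler", StandardScaler(), [0, 1])],
    remainder="passthrough",
)
toy_scaled = toy_transformer.fit_transform(toy_X)

# The scaled columns now have mean 0 and (population) standard deviation 1.
print(toy_scaled[:, :2].mean(axis=0))
print(toy_scaled[:, :2].std(axis=0))
# The dummy column passes through unchanged.
print(toy_scaled[:, 2])
```

One design point worth keeping in mind: `ColumnTransformer` emits the transformed columns first and appends the passthrough columns after them, so the original column order is preserved only when the scaled columns are the leading ones, as they are here.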

## Selecting the Features
Our dataset is now scaled, so the numeric attributes share a comparable range. Let's now deal with the problem of selecting features. As with unscaled features, unnecessary features can slow down training and hurt a machine learning model's performance. To deal with this, let's determine the important features for our modelling objectively. We can do this with the help of `sklearn.feature_selection.SelectFromModel`, using `sklearn.ensemble.ExtraTreesClassifier` as the model that ranks the features.

In [7]:
# Feature Selection
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier

# Fit the classifier to all of the dataset
classifier = ExtraTreesClassifier(n_estimators=50)
classifier.fit(X_scaled, y)

# Select only the important features
model = SelectFromModel(classifier, prefit=True)


X_scaled_feature_selected = model.transform(X_scaled)

# Save the names of the selected features using the selector's boolean support mask
ATTRS_SELECTED = ATTRS_ENCODED[model.get_support()]

# Preview the selected features
pd.DataFrame({
  "Selected Features": ATTRS_SELECTED
})

,Selected Features
0,Age
1,DailyRate
2,DistanceFromHome
3,HourlyRate
4,MonthlyIncome
5,MonthlyRate
6,NumCompaniesWorked
7,PercentSalaryHike
8,TotalWorkingYears
9,TrainingTimesLastYear


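Using an ensemble of randomized decision trees, only 23 of the 46 features we initially had were found to be useful for modelling. Because the classifier is randomized and no `random_state` is set, the exact selection can vary between runs. With fewer features, our models should train faster and may also perform better.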
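
To see how the selection is decided, the sketch below runs the same two estimators on synthetic data from `make_classification`. The data, the `toy_*` names, and the fixed `random_state` values are assumptions made for reproducibility of the example, not part of the notebook. For a tree ensemble with no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 8 features, only 3 of which are informative.
toy_X, toy_y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
toy_names = np.array([f"feature_{i}" for i in range(toy_X.shape[1])])

toy_classifier = ExtraTreesClassifier(n_estimators=50, random_state=0)
toy_classifier.fit(toy_X, toy_y)
toy_selector = SelectFromModel(toy_classifier, prefit=True)

# Boolean mask with one entry per input feature.
mask = toy_selector.get_support()

# Reproduce the default rule by hand: keep importances >= mean importance.
importances = toy_classifier.feature_importances_
expected_mask = importances >= importances.mean()

print(toy_names[mask])
```

Passing a fixed `random_state` to the classifier, as done here, is one way to make the selected feature set repeatable across runs.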

## Balancing the Dataset
Going back to the [data visualization phase](./02%20-%20Data%20Visualization.ipynb), it can be observed that there is much more data for employees who stayed at the company than for those who left. This imbalance in the target variable might result in a high *accuracy score* but a poor *recall score*. To deal with this, we can balance the data with an oversampling technique such as SMOTE, which automatically generates synthetic minority-class samples until their number matches that of the majority class.

In [8]:
# Balancing the Dataset using SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=22)

X_scaled_resampled, y_resampled = smote.fit_resample(X_scaled_feature_selected, y)
# X_scaled_resampled, y_resampled = smote.fit_resample(X_scaled, y)
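
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration and not the `imblearn` implementation: real SMOTE interpolates towards one of the k nearest minority neighbours, whereas this sketch simply picks another minority sample at random. The sample values, `majority_count`, and the `interpolate` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(22)

# Hypothetical minority-class samples in a 2-feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])
majority_count = 5  # hypothetical size of the majority class


def interpolate(sample, neighbor, rng):
    """Return a random point on the segment between sample and neighbor."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbor - sample)


# Generate synthetic points until the minority class matches the majority.
synthetic = []
for _ in range(majority_count - len(minority)):
    i, j = rng.choice(len(minority), size=2, replace=False)
    synthetic.append(interpolate(minority[i], minority[j], rng))

balanced_minority = np.vstack([minority, np.array(synthetic)])
print(balanced_minority.shape)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region already occupied by the minority class instead of duplicating rows exactly.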

## Splitting the Dataset Into Training Set and Testing Set
To evaluate the performance of our machine learning models, we need to test them with data they have not seen so far. To do this, we will train them on a training set and then evaluate their performance on a testing set. To split the data, we can use the `sklearn.model_selection.train_test_split()` function.

In [9]:
# Split the features and targets into a training set and test set.
X_train, X_test, y_train, y_test = train_test_split(X_scaled_resampled, y_resampled, test_size=0.20)
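
The call above leaves `random_state` and `stratify` at their defaults, so the split changes on every run. The sketch below, on hypothetical toy data, shows how those two optional parameters make the split repeatable and keep the class proportions equal in both parts; the seed value `22` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30% of them in the positive class.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([1] * 30 + [0] * 70)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.20, stratify=toy_y, random_state=22
)

print(toy_X_train.shape, toy_X_test.shape)
# Both parts keep the 30% positive rate of the full toy data.
print(toy_y_train.mean(), toy_y_test.mean())
```

A common alternative ordering is to split first and apply oversampling only to the training portion, so that the test set contains only real observations; whether that is preferable here depends on how the models are meant to be evaluated.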

## Exporting the Preprocessed Data
Finally! We are now ready to proceed to the exciting part -- modelling. Our data is now prepared. We will conclude this phase by exporting our training and testing sets to CSV files so that they can be readily used by the machine learning models.

In [10]:
# Export the training and testing set.
pd.DataFrame(X_train).to_csv("dataset/preprocessed/Features_Training_Set.csv")
pd.DataFrame(X_test).to_csv("dataset/preprocessed/Features_Testing_Set.csv")
pd.Series(y_train).to_csv("dataset/preprocessed/Target_Training_Set.csv")
pd.Series(y_test).to_csv("dataset/preprocessed/Target_Testing_Set.csv")

# Export the names of the selected features
pd.Series(ATTRS_SELECTED).to_csv("./dataset/constants/ATTRS_SELECTED.csv")
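
When these files are read back in a later notebook, the row index written by `to_csv` needs to be accounted for. The following sketch round-trips a hypothetical two-row array through a temporary file (the array values and temporary path are made up for illustration) and shows the difference that `index_col=0` makes.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for a small training matrix.
toy_train = np.array([[0.1, -1.2], [0.7, 0.3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Features_Training_Set.csv")
    pd.DataFrame(toy_train).to_csv(path)

    # index_col=0 restores the saved row index instead of treating it as a feature.
    restored = pd.read_csv(path, index_col=0).to_numpy()

    # Without index_col, the saved index shows up as an extra first column.
    naive = pd.read_csv(path).to_numpy()

print(restored.shape, naive.shape)
```

Since the arrays are exported without column names, the separately saved `ATTRS_SELECTED.csv` is what ties each column position back to a feature name.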