# Preprocessing the Data

This notebook both runs and demonstrates the preprocessing of the raw mock data. The goal is to one-hot encode all of the categorical attributes, expand the list attributes into indicator columns, and apply min-max normalization to the numerical features. It then creates a train/test split based on the specified ratio (default is 80% train, 20% test).

In [129]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Raw Data

In [130]:
df = pd.read_csv("raw/HackathonMockDataDraft_v2.csv")

df

| Unnamed: 0 | Employee ID | Role | Hourly/Salary | # of Badges | Badges | Years at Company | Visited page | Recommended page |
|---|---|---|---|---|---|---|---|---|
| 0 | D123462 | CBEX Associate | Salary | 4 | Banking Foundations, Risk Management, Jira Mas... | 6 | 34 | 3 |
| 1 | D123469 | CBEX Associate | Salary | 4 | Banking Foundations, Risk Management, Jira Mas... | 4 | 34 | 3 |
| 2 | D123476 | CBEX Associate | Salary | 4 | Banking Foundations, Risk Management, Jira Mas... | 6 | 34 | 3 |
| 3 | D123483 | CBEX Associate | Salary | 4 | Banking Foundations, Risk Management, Jira Mas... | 4 | 31 | 3 |
| 4 | D123490 | CBEX Associate | Salary | 4 | Banking Foundations, Risk Management, Jira Mas... | 6 | 31 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | D123532 | Software Engineer | Salary | 3 | Microservices, Secure Dev, Jira Master | 4 | 58 | 5 |
| 102 | D123539 | Software Engineer | Salary | 4 | Microservices, Secure Dev, Jira Master, Risk M... | 4 | 54 | 5 |
| 103 | D123546 | Software Engineer | Hourly | 4 | Microservices, Secure Dev, Jira Master, Risk M... | 4 | 52 | 5 |
| 104 | D123553 | Software Engineer | Salary | 3 | Microservices, Secure Dev, Risk Management | 4 | 54 | 5 |
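
The `Unnamed: 0` column in the output above is the name pandas gives to a column whose header cell is blank, which commonly happens when a CSV was written with its index. A minimal sketch, assuming that is the cause here, of reading such a column as the index instead of as a feature. The inline CSV is a hypothetical stand-in, not the raw file.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the raw file; the first header cell is
# blank, which pandas labels "Unnamed: 0" by default.
csv_text = (
    ",Employee ID,Role,Years at Company\n"
    "0,D123462,CBEX Associate,6\n"
    "1,D123469,CBEX Associate,4\n"
)

# Default read: the blank-header column becomes a regular data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the column does carry meaning in the real data, it should of course be kept; this only matters if it would otherwise end up in the saved train/test files as an unintended feature.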


## Normalize Numeric Attributes

In [131]:
# Min-max scale each numeric column to the [0, 1] range
for col in ["# of Badges", "Years at Company"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df

| Unnamed: 0 | Employee ID | Role | Hourly/Salary | # of Badges | Badges | Years at Company | Visited page | Recommended page |
|---|---|---|---|---|---|---|---|---|
| 0 | D123462 | CBEX Associate | Salary | 1.000000 | Banking Foundations, Risk Management, Jira Mas... | 0.238095 | 34 | 3 |
| 1 | D123469 | CBEX Associate | Salary | 1.000000 | Banking Foundations, Risk Management, Jira Mas... | 0.142857 | 34 | 3 |
| 2 | D123476 | CBEX Associate | Salary | 1.000000 | Banking Foundations, Risk Management, Jira Mas... | 0.238095 | 34 | 3 |
| 3 | D123483 | CBEX Associate | Salary | 1.000000 | Banking Foundations, Risk Management, Jira Mas... | 0.142857 | 31 | 3 |
| 4 | D123490 | CBEX Associate | Salary | 1.000000 | Banking Foundations, Risk Management, Jira Mas... | 0.238095 | 31 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | D123532 | Software Engineer | Salary | 0.666667 | Microservices, Secure Dev, Jira Master | 0.142857 | 58 | 5 |
| 102 | D123539 | Software Engineer | Salary | 1.000000 | Microservices, Secure Dev, Jira Master, Risk M... | 0.142857 | 54 | 5 |
| 103 | D123546 | Software Engineer | Hourly | 1.000000 | Microservices, Secure Dev, Jira Master, Risk M... | 0.142857 | 52 | 5 |
| 104 | D123553 | Software Engineer | Salary | 0.666667 | Microservices, Secure Dev, Risk Management | 0.142857 | 54 | 5 |
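
The scaling above is min-max normalization, `(x - min) / (max - min)`. The sketch below wraps it in a hypothetical helper, `min_max_scale`, which also guards against a constant column, where `max - min` is zero and the plain formula would yield `NaN`. The sample values are illustrative, not taken from the raw file, though a minimum of 1 and a maximum of 4 is consistent with three badges mapping to 0.666667 in the output above.

```python
import pandas as pd


def min_max_scale(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant column maps to 0.0."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Avoid 0/0, which would otherwise fill the column with NaN.
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - lo) / (hi - lo)


# Hypothetical badge counts with min 1 and max 4.
badges = pd.Series([1, 3, 4, 4], name="# of Badges")
scaled = min_max_scale(badges)
print(scaled.tolist())

# A constant column stays finite instead of becoming NaN.
constant = min_max_scale(pd.Series([5, 5, 5]))
print(constant.tolist())
```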


## One Hot Encoding

In [132]:
# List-valued columns: split on the separator and emit one 0/1 column per value
badges_dummies = df["Badges"].str.get_dummies(sep=", ")
df = df.drop("Badges", axis=1)
df = df.join(badges_dummies)

visited_dummies = df["Visited page"].str.get_dummies(sep=",")
df = df.drop("Visited page", axis=1)
df = df.join(visited_dummies)

# Single-valued categorical columns: one indicator column per category
role_dummies = pd.get_dummies(df["Role"])
df = df.drop("Role", axis=1)
df = df.join(role_dummies)

hourly_dummies = pd.get_dummies(df["Hourly/Salary"])
df = df.drop("Hourly/Salary", axis=1)
df = df.join(hourly_dummies)

df

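| Unnamed: 0 | Employee ID | # of Badges | Years at Company | Recommended page | AWS Apprentice | Agile Master | Banking Foundations | Data Visualization | Jira Master | Microservices | ... | 8 | CBEX Associate | Cloud Engineer | Data Analyst | HR Associate | Innovation Manager | Scrum Master | Software Engineer | Hourly | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D123462 | 1.000000 | 0.238095 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | D123469 | 1.000000 | 0.142857 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | D123476 | 1.000000 | 0.238095 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | D123483 | 1.000000 | 0.142857 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | D123490 | 1.000000 | 0.238095 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | D123532 | 0.666667 | 0.142857 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 102 | D123539 | 1.000000 | 0.142857 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 103 | D123546 | 1.000000 | 0.142857 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 104 | D123553 | 0.666667 | 0.142857 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |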
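
To make the two encoding calls used above easier to compare, here is a small sketch on a hypothetical three-row frame that reuses this notebook's column names. `Series.str.get_dummies` handles list-valued cells by splitting on a separator, while `pd.get_dummies` encodes a single-valued categorical column. The `dtype=int` argument is an addition for illustration: newer pandas versions return boolean columns from `pd.get_dummies` by default, and `dtype=int` keeps the 0/1 form shown in the output above.

```python
import pandas as pd

# Hypothetical three-row frame; values are illustrative only.
demo = pd.DataFrame({
    "Role": ["Data Analyst", "Scrum Master", "Data Analyst"],
    "Badges": ["Jira Master, Agile Master", "Agile Master", "Jira Master"],
})

# List-valued column: split on ", " and emit one 0/1 column per distinct badge.
badge_cols = demo["Badges"].str.get_dummies(sep=", ")

# Single-valued column: one indicator column per category, as integers.
role_cols = pd.get_dummies(demo["Role"], dtype=int)

# Combine the indicator columns side by side, aligned on the row index.
encoded = pd.concat([badge_cols, role_cols], axis=1)
print(encoded)
```

Note that the separator must match the data exactly: splitting `"Jira Master, Agile Master"` on `","` rather than `", "` would produce a column named `" Agile Master"` with a leading space.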


## Train/Test Split

In [133]:
# Shuffle the rows and split them: 80% train, the remainder test
train, test = train_test_split(df, train_size=0.8)

print(f"Number of training samples: {len(train)}")
print(f"Number of testing samples: {len(test)}")

train.to_csv("processed/train.csv", index=False)
test.to_csv("processed/test.csv", index=False)

print("Saved to data/processed!")

Number of training samples: 84
Number of testing samples: 22
Saved to data/processed!
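
The split above is shuffled without a seed, so the rows that land in each file can differ from run to run. A minimal sketch of a repeatable split, assuming reproducibility is wanted: passing `random_state` to `train_test_split` fixes the shuffle. The 10-row frame and the seed value 42 are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the processed data.
demo = pd.DataFrame({"feature": range(10), "Recommended page": [3, 5] * 5})

# A fixed random_state makes the shuffle repeatable across runs.
train_a, test_a = train_test_split(demo, train_size=0.8, random_state=42)
train_b, test_b = train_test_split(demo, train_size=0.8, random_state=42)

print(len(train_a), len(test_a))
# Both calls select exactly the same rows in the same order.
print(train_a.index.equals(train_b.index))
```

The same rounding explains the counts printed above: with only `train_size=0.8` given, the training set takes the floor of 80% of the rows and the test set takes the rest, which yields 84 and 22 for 106 rows.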
