This is project 1 of the Udacity Data Scientist Nanodegree.


All the required packages can be found in the file pyproject.toml. The code should run with no issues using Python versions 3.*.

Project Motivation

For this project, we were interested in using a Kaggle dataset on HR analytics to better understand who is looking for a new job. In particular:

  • Do experienced candidates tend to look for a new job?
  • Does the educational background matter?
  • Did candidates work before joining the training?
  • What is the City Development Index? Does it matter here?

These questions were analysed in the notebook hr_analytics.ipynb.


A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. A large number of candidates sign up for the training. To reduce cost and time, and to improve the quality of the training, the company wants to know which of these candidates really want to work for the company and which are most likely to look for a new job after completing the training.

Information on demographics, education, and experience, as well as training-related features, is available from the candidates' signup and enrollment. The dataset, which is divided into train and test sets, can be found on Kaggle.


The main findings of the code can be found at the post available here.


Classification is one of the most common machine learning problems. A common issue found in datasets that are used for classification is imbalance. Data imbalance usually reflects an unequal distribution of classes within a dataset.

The challenge of working with an imbalanced dataset is that most machine learning techniques will likely ignore the minority class, and in turn perform poorly on it, although it is typically performance on the minority class that matters most.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
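
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the notebook or in any library: the function name `smote_like_oversample`, its parameters, and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration only; requires at least two points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        base = rng.choice(minority)
        # Find its k nearest minority neighbours (excluding the sample itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two numerical features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, the new examples stay inside the region the minority class already occupies. Oversampling is normally applied to the training split only, so that the test set keeps the original class distribution.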

Here we will apply some common machine learning models to an imbalanced dataset both with and without the SMOTE technique and compare the results.

The goal is to predict the probability that a candidate will look for a new job rather than work for the company.


There are 19,158 rows in total, one per enrollee who participated in the training, and 14 features. Of these, 10 are categorical, some with high cardinality, and 4 are numerical.

| Feature | dtype | Unique values | Missing (count) | Missing (%) |
|---|---|---|---|---|
| enrollee_id | int64 | 19158 | 0 | 0.00 |
| city | object | 123 | 0 | 0.00 |
| city_development_index | float64 | 93 | 0 | 0.00 |
| gender | object | 4 | 4508 | 23 |
| relevent_experience | object | 2 | 0 | 0.00 |
| enrolled_university | object | 4 | 386 | 2.01 |
| education_level | object | 6 | 460 | 2.40 |
| major_discipline | object | 7 | 2813 | 14.68 |
| experience | object | 23 | 65 | 0.34 |
| company_size | object | 9 | 5938 | 30.99 |
| company_type | object | 7 | 6140 | 32.05 |
| last_new_job | object | 7 | 423 | 2.21 |
| training_hours | int64 | 241 | 0 | 0.00 |
| target | float64 | 2 | 0 | 0.00 |
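
A per-column summary like the one above can be computed with a short helper. The sketch below uses plain Python on a list of dictionaries. The function name `summarize` and the sample rows are invented for illustration, and the sketch does not count a missing entry as a distinct value, which may differ from how the counts in the table were produced.

```python
def summarize(rows):
    """Return {column: (n_unique, n_missing, pct_missing)} for a list of dict rows.

    Hypothetical helper; missing entries are represented as None.
    """
    summary = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        missing = sum(v is None for v in values)
        # Distinct observed values, ignoring missing entries.
        unique = len({v for v in values if v is not None})
        summary[col] = (unique, missing, round(100 * missing / len(rows), 2))
    return summary


# Made-up sample rows, not taken from the dataset.
rows = [
    {"gender": "Male", "experience": "5"},
    {"gender": None, "experience": "3"},
    {"gender": "Female", "experience": "5"},
    {"gender": None, "experience": None},
]
summary = summarize(rows)
print(summary["gender"])      # (2, 2, 50.0)
print(summary["experience"])  # (2, 1, 25.0)
```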

We have an unequal class distribution:

| Class 0 (not looking for a job) | Class 1 (looking for a job) |
|---|---|
| 75% | 25% |
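
To see why this split matters when reading accuracy scores, consider a trivial classifier that always predicts class 0. The sketch below uses synthetic labels in the reported 75/25 proportion, not the actual dataset.

```python
# Synthetic labels in the 75/25 proportion reported above (100 examples).
y_true = [0] * 75 + [1] * 25
# Trivial "model": always predict the majority class 0.
y_pred = [0] * len(y_true)

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of actual class-1 examples that were predicted as class 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_class_1 = true_positives / sum(t == 1 for t in y_true)

print(accuracy)        # 0.75
print(recall_class_1)  # 0.0
```

Such a baseline reaches 75% accuracy without identifying a single candidate who is looking for a job, so accuracy values close to 0.75 should be read with this in mind.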

If we simply dropped all rows with missing values, we would lose 47% of the data, that is, almost half. Since data is essential, we apply the following strategy instead:

  • All features with less than 3% missing values (enrolled_university, education_level, experience, and last_new_job) will be filled with the most frequent value.
  • All other missing values will be labeled as unknown.
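
The two rules above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the notebook's code: the helper name `impute`, its `threshold` and `unknown_label` parameters, and the sample rows are all hypothetical, and missing entries are represented as `None`.

```python
from collections import Counter


def impute(rows, threshold=0.03, unknown_label="unknown"):
    """Fill None values column by column, following the two rules above."""
    filled = [dict(row) for row in rows]  # copy so the input stays untouched
    for col in rows[0]:
        values = [row[col] for row in rows]
        missing_share = sum(v is None for v in values) / len(rows)
        if missing_share == 0:
            continue
        if missing_share < threshold:
            # Rare gaps: use the most frequent observed value.
            fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
        else:
            # Frequent gaps: keep "missing" as its own category.
            fill = unknown_label
        for row in filled:
            if row[col] is None:
                row[col] = fill
    return filled


# Made-up rows: education_level is missing in 2% of rows, company_size in 40%.
rows = (
    [{"education_level": "Graduate", "company_size": "50-99"} for _ in range(30)]
    + [{"education_level": "Masters", "company_size": None} for _ in range(19)]
    + [{"education_level": None, "company_size": None}]
)
filled = impute(rows)
print(filled[-1])
```

Labeling frequent gaps as their own category lets a model learn from the fact that a value was not provided, whereas filling a third of a column with its mode would distort that column's distribution.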

We use the LabelEncoder() from sklearn to encode the categorical features.
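
What label encoding does can be shown without the library. The sketch below mimics the basic behavior of mapping each distinct category to an integer in sorted order. It is a hand-written stand-in rather than sklearn's implementation, and the function name and category values are made up for the example.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Illustrative category values (hypothetical, not read from the dataset).
company_type = ["Pvt Ltd", "unknown", "NGO", "Pvt Ltd"]
mapping = fit_label_encoder(company_type)
encoded = [mapping[v] for v in company_type]

print(mapping)  # {'NGO': 0, 'Pvt Ltd': 1, 'unknown': 2}
print(encoded)  # [1, 2, 0, 1]
```

Note that integer codes introduce an ordering between categories that has no real meaning, which can matter for models such as logistic regression that treat the codes as magnitudes.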

ML Models Result

We apply three common machine learning models: LogisticRegression, RandomForestClassifier, and SVC. We compare the accuracy score of the predictions made on a test set after normalizing the data, both without and with the SMOTE technique:

| ML model | Score with normalized data | Score with normalized data and SMOTE |
|---|---|---|
| SVC | 0.77 | 0.79 |
| LogisticRegression | 0.77 | 0.74 |
| RandomForestClassifier | 0.78 | 0.83 |

Further Thoughts

Further thoughts on how to improve the accuracy score:

  • Preprocessing:
    • Use a different encoder for the categorical features
    • Impute the most frequent value for all missing values instead of adding the label "unknown"
  • Hyperparameter tuning
  • A different approach to handling the imbalanced dataset
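
As one example of a different encoder, one-hot encoding gives each category its own 0/1 column. This is only one possible alternative, chosen here for illustration. The helper name `one_hot` and the sample values are assumptions.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


# Illustrative values for a categorical feature (hypothetical).
major_discipline = ["STEM", "Arts", "STEM"]
categories, rows = one_hot(major_discipline)

print(categories)  # ['Arts', 'STEM']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```

The trade-off is dimensionality: a feature with many distinct values, such as city with 123, would expand into as many columns.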

