# 9. Classification Exercise (40 points + 3 point extra) ✔
There are 2 files: training and test.

This dataset is designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on credentials, demographics, and experience data, you will predict the probability that a candidate is looking for a new job rather than staying with the company, and interpret the factors that influence that decision.

### Note:
- The dataset is imbalanced.
- Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality.
- Missing-value imputation can be part of your pipeline as well.
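
The notes above suggest handling missing values and categorical features inside the modelling pipeline itself. A minimal sketch of that idea with scikit-learn follows. The toy DataFrame is made up for illustration (only the column names come from the feature table), and the choice of `SimpleImputer` plus `OneHotEncoder` is an assumption, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy data; only the column names match the exercise
toy = pd.DataFrame({
    "city_development_index": [0.62, 0.92, 0.92, 0.62, 0.90, 0.75],
    "training_hours": [21, 12, np.nan, 30, 46, 8],
    "gender": ["Male", np.nan, "Male", "Female", np.nan, "Other"],
    "company_type": [np.nan, "Pvt Ltd", "Public Sector", np.nan, "Pvt Ltd", "NGO"],
    "target": [0, 0, 0, 1, 0, 1],
})
num_cols = ["city_development_index", "training_hours"]
cat_cols = ["gender", "company_type"]

# median for numerical columns, mode + one-hot encoding for categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# the imputation statistics and category lists are learned during fit only
model.fit(toy.drop(columns="target"), toy["target"])
toy_proba = model.predict_proba(toy.drop(columns="target"))[:, 1]
```

Because the preprocessing lives inside the pipeline, calling `model.predict` on new data reuses the statistics learned from the training data instead of recomputing them.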

### Features

| Feature                   | Description                                            |
|---------------------------|--------------------------------------------------------|
| city_development_index    | Development index of the city (scaled)                 |
| gender                    | Gender of candidate                                    |
| relevent_experience       | Relevant experience of candidate                      |
| enrolled_university       | Type of University course enrolled if any             |
| education_level           | Education level of candidate                          |
| major_discipline          | Education major discipline of candidate               |
| experience                | Candidate's total experience in years                 |
| company_type              | Type of current employer                              |
| last_new_job              | Difference in years between previous job and current job |
| training_hours            | Training hours completed                              |
| target                    | 0 – Not looking for job change, 1 – Looking for a job change |

## Task 1: Data Cleaning and Imputation

1. In `experience`, replace `>20` with `21`; `<1` with `1`, and convert this column to numerical.
2. In `last_new_job`, replace `>4` with `5`; `never` with `0`, and convert this column to numerical.
3. If the column is categorical, impute the missing values with its mode. If the column is numerical, impute the missing values with its median.

In [70]:
# imports and load data
import pandas as pd
import numpy as np

df_train = pd.read_csv('data/aug_train.csv')
df_test = pd.read_csv('data/aug_test.csv')

In [71]:
# check the data
df_train.head()

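|   | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_type | last_new_job | training_hours | target |
|---|------------------------|--------|---------------------|---------------------|-----------------|------------------|------------|--------------|--------------|----------------|--------|
| 0 | 0.624 | Male | No relevent experience | no_enrollment | High School | NaN | 5 | NaN | never | 21 | 0 |
| 1 | 0.926 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | >4 | 12 | 0 |
| 2 | 0.92 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | Public Sector | >4 | 26 | 0 |
| 3 | 0.624 | Male | No relevent experience | Full time course | High School | NaN | 1 | NaN | never | 30 | 1 |
| 4 | 0.92 | Female | Has relevent experience | no_enrollment | Masters | STEM | >20 | NaN | >4 | 46 | 0 |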


In [72]:
# check the data
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2100 entries, 0 to 2099
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city_development_index  2100 non-null   float64
 1   gender                  1585 non-null   object 
 2   relevent_experience     2100 non-null   object 
 3   enrolled_university     2051 non-null   object 
 4   education_level         2049 non-null   object 
 5   major_discipline        1768 non-null   object 
 6   experience              2090 non-null   object 
 7   company_type            1415 non-null   object 
 8   last_new_job            2048 non-null   object 
 9   training_hours          2100 non-null   int64  
 10  target                  2100 non-null   int64  
dtypes: float64(1), int64(2), object(8)
memory usage: 180.6+ KB


In [73]:
df_train.describe()

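|       | city_development_index | training_hours | target   |
|-------|------------------------|----------------|----------|
| count | 2100.0                 | 2100.0         | 2100.0   |
| mean  | 0.826898               | 65.89619       | 0.254762 |
| std   | 0.124464               | 58.432483      | 0.435831 |
| min   | 0.448                  | 1.0            | 0.0      |
| 25%   | 0.72925                | 24.0           | 0.0      |
| 50%   | 0.899                  | 49.0           | 0.0      |
| 75%   | 0.92                   | 89.25          | 1.0      |
| max   | 0.949                  | 336.0          | 1.0      |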


In [74]:
# number of unique values for each column
df_train.nunique()

city_development_index     79
gender                      3
relevent_experience         2
enrolled_university         3
education_level             5
major_discipline            6
experience                 22
company_type                6
last_new_job                6
training_hours            220
target                      2
dtype: int64

In [75]:
# check null values for each column
df_train.isnull().sum()


city_development_index      0
gender                    515
relevent_experience         0
enrolled_university        49
education_level            51
major_discipline          332
experience                 10
company_type              685
last_new_job               52
training_hours              0
target                      0
dtype: int64

In [76]:
df_train['experience'].value_counts()

experience
>20    369
5      170
2      145
3      134
6      130
4      124
7      123
9      109
10     103
8       86
11      79
1       73
15      72
12      68
14      55
16      49
<1      46
13      41
17      40
18      30
19      28
20      16
Name: count, dtype: int64

1. In `experience`, replace `>20` with `21`; `<1` with `1`, and convert this column to numerical.

In [77]:
df_train['experience'] = df_train['experience'].replace({'>20': '21', '<1': '1'}).astype(float)
df_test['experience'] = df_test['experience'].replace({'>20': '21', '<1': '1'}).astype(float)

# check
print(df_train['experience'].value_counts())
print(df_test['experience'].value_counts())
print("NaN values:", df_train['experience'].isnull().sum() + df_test['experience'].isnull().sum())

experience
21.0    369
5.0     170
2.0     145
3.0     134
6.0     130
4.0     124
7.0     123
1.0     119
9.0     109
10.0    103
8.0      86
11.0     79
15.0     72
12.0     68
14.0     55
16.0     49
13.0     41
17.0     40
18.0     30
19.0     28
20.0     16
Name: count, dtype: int64
experience
21.0    17
10.0    10
4.0      9
7.0      7
15.0     7
3.0      6
2.0      6
5.0      6
1.0      5
8.0      5
6.0      4
11.0     3
12.0     3
18.0     3
14.0     2
16.0     2
9.0      2
13.0     2
20.0     1
Name: count, dtype: int64
NaN values: 10


2. In `last_new_job`, replace `>4` with `5`; `never` with `0`, and convert this column to numerical.

In [78]:
df_train['last_new_job'] = df_train['last_new_job'].replace({'>4': '5', 'never': '0'}).astype(float)
df_test['last_new_job'] = df_test['last_new_job'].replace({'>4': '5', 'never': '0'}).astype(float)

# check
print(df_train['last_new_job'].value_counts())
print(df_test['last_new_job'].value_counts())
print("NaN values:", df_train['last_new_job'].isnull().sum() + df_test['last_new_job'].isnull().sum())

last_new_job
1.0    857
5.0    357
2.0    322
0.0    284
3.0    115
4.0    113
Name: count, dtype: int64
last_new_job
1.0    40
5.0    23
2.0    19
0.0    12
4.0     5
3.0     1
Name: count, dtype: int64
NaN values: 52


3. If the column is categorical, impute the missing values with its mode. If the column is numerical, impute the missing values with its median.
Note: imputing with the mode or median can alter the data heavily, but I will do it for this exercise.

In [79]:
# impute missing values
for col in df_train.columns:
    if df_train[col].dtype == 'object':
        mode_value_train = df_train[col].mode()[0]
        mode_value_test = df_test[col].mode()[0]
        df_train[col] = df_train[col].fillna(mode_value_train)
        df_test[col] = df_test[col].fillna(mode_value_test)
    else:
        median_value_train = df_train[col].median()
        median_value_test = df_test[col].median()
        df_train[col] = df_train[col].fillna(median_value_train)
        df_test[col] = df_test[col].fillna(median_value_test)

# check
print(df_train.isnull().sum())
print(df_test.isnull().sum())

# check data types again
df_train.info()


city_development_index    0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_type              0
last_new_job              0
training_hours            0
target                    0
dtype: int64
city_development_index    0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_type              0
last_new_job              0
training_hours            0
target                    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2100 entries, 0 to 2099
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city_development_index  2100 non-null   float64
 1   gender                  2100 non-null   object 
 2   relevent_experience     2100 non-n
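
The imputation rule from step 3 (mode for categorical columns, median for numerical ones) can also be applied with fill values learned from the training set only, so that the test set is filled with the same values rather than with its own statistics. The following is a minimal sketch under that assumption; the two miniature frames are hypothetical and only reuse column names from the exercise.

```python
import numpy as np
import pandas as pd

# hypothetical miniature train/test frames, for illustration only
train = pd.DataFrame({
    "gender": ["Male", "Male", np.nan, "Female"],
    "experience": [5.0, 21.0, np.nan, 1.0],
})
test = pd.DataFrame({
    "gender": [np.nan, "Female"],
    "experience": [np.nan, 3.0],
})

# learn one fill value per column from the training set only
fill_values = {}
for col in train.columns:
    if pd.api.types.is_numeric_dtype(train[col]):
        fill_values[col] = train[col].median()   # numerical column -> median
    else:
        fill_values[col] = train[col].mode()[0]  # categorical column -> mode

# apply the same fill values to both sets
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)
print(fill_values)
```

With a test set of only 100 rows, its own mode or median can differ from the training values by chance, which is one reason to reuse the training statistics.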

## Task 2: Classification

1. Build a classification model from the training set (you can use any algorithms).
I will use a Random Forest, because it performs well on non-linear classification problems.
2. Generate the confusion matrix and calculate the accuracy, precision, recall, and F1-score on the training set.


In [80]:
# use df_train for training and df_test for testing. so no need to split the data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# convert categorical to numerical. we have to concat the two dataframes first to avoid mismatch in columns after get_dummies().
# concatenate df_train and df_test
df_combined = pd.concat([df_train, df_test])

# apply get_dummies on the combined dataframe
df_combined = pd.get_dummies(df_combined, drop_first=True)

# split back into train and test
df_train = df_combined[:df_train.shape[0]]
df_test = df_combined[df_train.shape[0]:]

# separate features and target
X_train = df_train.drop('target', axis=1)
y_train = df_train['target']

# train model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# predict on training set
y_train_pred = rf.predict(X_train)

# evaluate on training set
print("Training Accuracy:", accuracy_score(y_train, y_train_pred))
print("\nTraining Classification Report:\n", classification_report(y_train, y_train_pred))
print("\nTraining Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred))


Training Accuracy: 0.9990476190476191

Training Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      1565
           1       1.00      1.00      1.00       535

    accuracy                           1.00      2100
   macro avg       1.00      1.00      1.00      2100
weighted avg       1.00      1.00      1.00      2100


Training Confusion Matrix:
 [[1565    0]
 [   2  533]]


3. Apply the model to the test set and generate the predictions.
4. Generate the confusion matrix from the test set and calculate the accuracy, precision, recall, and F1-score.


In [81]:
# separate features and target
X_test = df_test.drop('target', axis=1)
y_test = df_test['target']

# predict
y_pred = rf.predict(X_test)

# evaluate on test set
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nTest Classification Report:\n", classification_report(y_test, y_pred))
print("\nTest Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Test Accuracy: 0.82

Test Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.95      0.89        78
           1       0.67      0.36      0.47        22

    accuracy                           0.82       100
   macro avg       0.75      0.66      0.68       100
weighted avg       0.80      0.82      0.80       100


Test Confusion Matrix:
 [[74  4]
 [14  8]]
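
To make the link between the confusion matrix and the report explicit, the class 1 metrics can be recomputed by hand from the four cells. This is a small worked example using the test confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# test confusion matrix as printed above: rows = true class, columns = predicted class
cm = np.array([[74, 4],
               [14, 8]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()           # (8 + 74) / 100
precision_1 = tp / (tp + fp)              # 8 / 12
recall_1 = tp / (tp + fn)                 # 8 / 22
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.82 precision=0.67 recall=0.36 f1=0.47
```

These values agree with the class 1 row of the classification report.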


In [82]:
# feature importance (extra)
feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)

# also print the top 16 features
print("Top 16 Important Features according to Random Forest:")
# print rounded to 3 decimal places
print(np.round(feat_importances.nlargest(16), 3))

Top 16 Important Features according to Random Forest:
training_hours                                0.289
city_development_index                        0.256
experience                                    0.169
last_new_job                                  0.080
enrolled_university_no_enrollment             0.028
relevent_experience_No relevent experience    0.026
education_level_Masters                       0.023
education_level_High School                   0.019
gender_Male                                   0.016
company_type_Pvt Ltd                          0.016
enrolled_university_Part time course          0.012
major_discipline_STEM                         0.011
company_type_Public Sector                    0.010
major_discipline_Humanities                   0.007
company_type_Funded Startup                   0.007
education_level_Primary School                0.005
dtype: float64
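
One caveat when reading this ranking: the impurity-based `feature_importances_` of a Random Forest tend to favour features with many distinct values (here `training_hours` has 220 unique values), and they are computed on the training data. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data, not the HR dataset, and all names in it are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.integers(0, 2, size=n)   # binary feature that drives the label
noise = rng.normal(size=n)            # continuous feature unrelated to the label
X = np.column_stack([signal, noise])
# label equals the signal, flipped for roughly 10% of the rows
y = np.where(rng.random(n) < 0.9, signal, 1 - signal)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# shuffle one column at a time on the held-out set and measure the score drop
result = permutation_importance(forest, X_ho, y_ho, n_repeats=10, random_state=0)
for name, mean in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

A feature whose shuffling barely changes the held-out score contributes little to generalisation, even if its impurity-based importance looks large.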


5. Compare the results between the training and test sets.

These results are expected: the model was trained on the training data, so it fits that data almost perfectly (accuracy 0.999), while on the unseen test data the performance drops noticeably, which indicates overfitting. It still performs reasonably well, though.

I will focus on the test results from here on, because they are the real evaluation of the model:

Test Accuracy: 0.82

## Test Classification Report

| Class            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| 0                | 0.84      | 0.95   | 0.89     | 78      |
| 1                | 0.67      | 0.36   | 0.47     | 22      |
|                  |           |        |          |         |
| **Accuracy**     |           |        | 0.82     | 100     |
| **Macro Avg**    | 0.75      | 0.66   | 0.68     | 100     |
| **Weighted Avg** | 0.80      | 0.82   | 0.80     | 100     |

Test Confusion Matrix:
 [[74  4]
 [14  8]]

Overall accuracy is 0.82, which means the target (looking for a job change or not) is predicted correctly 82% of the time.

The precision for class 0 (not looking for a job change) is 0.84, while the precision for class 1 (looking for a job change) is only 0.67, so the model is better at predicting class 0 than class 1. The gap in recall is even larger (0.95 vs. 0.36): the model misses 14 of the 22 candidates who are actually looking for a job change. Why? Because the data is imbalanced. There are far more samples for class 0 than for class 1, so the model is biased towards predicting class 0. This is a common issue with imbalanced datasets. The F1-score and the confusion matrix show the same trend.

Also, imputing missing values with the mode or median can affect model performance, especially when many values are missing, as is the case here.
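
Because of the imbalance, accuracy on its own is a weak yardstick. A quick way to see this is to compare against a trivial baseline that always predicts class 0. The worked example below only uses the class supports of the test set (78 and 22); the variable names are illustrative.

```python
import numpy as np

# class supports of the test set: 78 samples of class 0, 22 of class 1
y_true = np.array([0] * 78 + [1] * 22)
y_majority = np.zeros_like(y_true)   # always predict "not looking for a job change"

baseline_accuracy = (y_true == y_majority).mean()
baseline_recall_1 = ((y_true == 1) & (y_majority == 1)).sum() / (y_true == 1).sum()

print(f"baseline accuracy={baseline_accuracy:.2f}, class 1 recall={baseline_recall_1:.2f}")
# prints: baseline accuracy=0.78, class 1 recall=0.00
```

The baseline already reaches 0.78 accuracy without finding a single job seeker, so recall and F1 for class 1 are the more informative numbers here.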


### Extra Point:
Think about what kind of methods can increase performance (does not need to be run).

1. Remove the imbalance of the dataset by oversampling the minority class or undersampling the majority class.
2. Use better imputation techniques for missing values, such as KNN imputation or MICE.
3. Try different algorithms and tune hyperparameters for better performance.
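
As a rough sketch of the third idea, hyperparameters can be tuned with a stratified cross-validated grid search that optimises F1 for the minority class instead of accuracy. The grid also includes `class_weight="balanced"`, which is swapped in here as a lightweight alternative to resampling for the first idea. The data is synthetic (generated with `make_classification`), and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# synthetic stand-in with roughly a 75/25 class split (not the HR data)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)

# small, arbitrary grid for illustration
param_grid = {
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="f1",   # F1 of the positive (minority) class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Limiting `max_depth` or raising `min_samples_leaf` also counters the overfitting visible in the near-perfect training scores.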

In [83]:
# just for demonstration, i will show some missing data and what it is (wrongly) imputed to.
# let's take gender as an example

# reload the data from the original csv files
df_train = pd.read_csv('data/aug_train.csv')
df_test = pd.read_csv('data/aug_test.csv')

# distinct gender values in dataset (assumption: the test set contains the same values):
print("gender values in dataset: ", df_train['gender'].unique())

# print how many of each gender value there are in the training set
print("count num of each gender value: \n", df_train['gender'].value_counts(dropna=False))


gender values in dataset:  ['Male' 'Female' nan 'Other']
count num of each gender value: 
 gender
Male      1422
NaN        515
Female     133
Other       30
Name: count, dtype: int64


So what we see is 1422 rows with Male, only 133 with Female, and 30 with Other, but 515 rows with NaN (missing values).
If we impute these missing values with the mode (even though candidates may have deliberately withheld them), we are simply assuming they are all Male.
What I would do instead is create a new category for these missing values, such as "Not Provided" or "Unknown", which can also give the model additional information.

In [84]:
# impute missing categorical values with a new category "Not Provided" and numerical values with -999
for col in df_train.select_dtypes(include=['object']).columns:
    df_train[col] = df_train[col].fillna('Not Provided')
    df_test[col] = df_test[col].fillna('Not Provided')

for col in df_train.select_dtypes(include=['number']).columns:
    df_train[col] = df_train[col].fillna(-999)
    df_test[col] = df_test[col].fillna(-999)


In [86]:
from imblearn.over_sampling import SMOTE

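# separate features and target
X_train = df_train.drop('target', axis=1)
y_train = df_train['target']

# SMOTE interpolates between samples, so it only works on numeric features.
# assumption: the object columns still need one-hot encoding at this point,
# because the data was reloaded from the csv files above
X_train = pd.get_dummies(X_train, drop_first=True, dtype=float)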

# apply SMOTE for oversampling the minority class
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# check new distribution of target variable
print("Distribution of target after SMOTE:\n", y_train_res.value_counts())



Distribution of target after SMOTE:
 target
0    1565
1    1565
Name: count, dtype: int64
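
For comparison, plain random oversampling reaches the same class balance without synthesising new points, and it works directly on categorical columns. The sketch below uses `sklearn.utils.resample` on a hypothetical frame that only mimics the 1565 / 535 class counts of the training set. It is an alternative technique, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# hypothetical training frame with the same 1565 / 535 class counts
train = pd.DataFrame({
    "training_hours": np.arange(2100),
    "target": [0] * 1565 + [1] * 535,
})
majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# draw minority rows with replacement until both classes have the same size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# recombine and shuffle the rows
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts().to_dict())
```

Whichever resampling method is used, it should be applied to the training data only. The test set keeps its original class distribution so that the evaluation reflects real conditions.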
