Use the cleaned data from Assignment 1. Make sure all attributes, including the target, are converted to numerical values.

In [None]:
import pandas as pd

df = pd.read_csv('/content/cleaneddata.csv')
df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,work_accident,left,promotion_last_5years,department,salary
0,0.38,0.53,2,157.0,3.0,0,1,0,sales,low
1,0.8,0.86,5,262.0,6.0,0,1,0,sales,medium
2,0.11,0.88,7,272.0,4.0,0,1,0,sales,medium
3,0.72,0.87,5,223.0,5.0,0,1,0,sales,low
4,0.37,0.52,2,200.511732,3.380048,0,1,0,sales,low


We can see that only the `department` and `salary` columns have not been converted to numerical values. Let's look at the categories in these two columns.

In [None]:
for col in ['department', 'salary']:
    display(df[col].value_counts())

department,count
sales,3321
technical,2282
support,1861
IT,998
product_mng,704
RandD,698
marketing,690
accounting,629
hr,611
management,465


salary,count
low,5872
medium,5360
high,1027


## Preprocess the data

Now we need to convert the `department` and `salary` columns to numerical format using one-hot encoding.


In [None]:
df = pd.get_dummies(df, columns=['department', 'salary'], drop_first=True)
display(df.head())

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,work_accident,left,promotion_last_5years,department_RandD,department_accounting,department_hr,department_management,department_marketing,department_product_mng,department_sales,department_support,department_technical,salary_low,salary_medium
0,0.38,0.53,2,157.0,3.0,0,1,0,False,False,False,False,False,False,True,False,False,True,False
1,0.8,0.86,5,262.0,6.0,0,1,0,False,False,False,False,False,False,True,False,False,False,True
2,0.11,0.88,7,272.0,4.0,0,1,0,False,False,False,False,False,False,True,False,False,False,True
3,0.72,0.87,5,223.0,5.0,0,1,0,False,False,False,False,False,False,True,False,False,True,False
4,0.37,0.52,2,200.511732,3.380048,0,1,0,False,False,False,False,False,False,True,False,False,True,False


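In the code above, I applied one-hot encoding to the `department` and `salary` columns using the `get_dummies` function, which converts each categorical variable into multiple binary columns and drops the original column. Because `drop_first=True`, the first category of each variable (`IT` for `department`, `high` for `salary`) gets no column of its own and serves as the baseline.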
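As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical four-row `toy` frame (not the assignment data). It also shows the optional `dtype=int` argument, which yields 0/1 integers instead of `True`/`False` if strictly numeric columns are wanted.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# Categories are sorted (high, low, medium); drop_first removes the first one
encoded = pd.get_dummies(toy, columns=["salary"], drop_first=True)
print(encoded.columns.tolist())  # ['salary_low', 'salary_medium']

# dtype=int stores the indicators as 0/1 integers rather than booleans
encoded_int = pd.get_dummies(toy, columns=["salary"], drop_first=True, dtype=int)
print(encoded_int)
```

A row with 0 in both indicator columns corresponds to the dropped category, `high`.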



## Define evaluation function

Now I will create a function to calculate and display the confusion matrix, precision, recall, and F1 score for a given set of true and predicted labels.


In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

def evaluate_model(y_true, y_pred):
    """
    Calculates and displays evaluation metrics for a binary classification model.

    Args:
        y_true: The true labels.
        y_pred: The predicted labels.
    """
    cm = confusion_matrix(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='binary')
    recall = recall_score(y_true, y_pred, average='binary')
    f1 = f1_score(y_true, y_pred, average='binary')

    print("Confusion Matrix:")
    print(cm)
    print(f"\nPrecision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

In the three split sections below, I will separate the features from the target, split the data into training and testing sets with the required proportions (85/15, 75/25, and 65/35), and shuffle the training set. For each split I instantiate a `RandomForestClassifier` and an `SGDClassifier`, run 5-fold cross-validation on the training set (`cross_val_score` reports accuracy by default for classifiers), refit each model on the full training set, and evaluate it on the test set with the function defined above: confusion matrix, precision, recall, and F1 score.

## Rationale for Choosing Random Forest over SGD Classifier Initially

When approaching this binary classification task (predicting employee turnover), I initially preferred `RandomForestClassifier` over `SGDClassifier` for several reasons:

1.  **Default Performance and Ease of Use:** Random Forests are generally known for providing good performance out-of-the-box with less need for extensive hyperparameter tuning compared to models like `SGDClassifier`. They are robust and often handle various data characteristics well without requiring extensive preprocessing like feature scaling.
2.  **Handling Non-Linearity:** Random Forests are ensemble methods based on decision trees, which can inherently capture complex, non-linear relationships within the data. `SGDClassifier`, being a linear model, assumes a linear relationship between features and the target variable, which might not be the case in this dataset.
3.  **Feature Importance:** Random Forests provide a measure of feature importance, which can be valuable for understanding which features contribute most to the model's predictions. This can offer insights into the factors influencing employee turnover.
4.  **Robustness to Outliers:** Decision tree-based methods like Random Forests are less sensitive to outliers in the data than linear models like `SGDClassifier`, which can be significantly affected by extreme values.
5.  **Handling Different Data Types:** After one-hot encoding, Random Forests work directly with a mix of continuous, count, and binary indicator features without further transformation, making them a versatile choice. Note that scikit-learn's implementation still requires numeric input, which is why the encoding step was needed.
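To make the feature-importance point concrete, here is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the turnover dataset). It reads the fitted forest's `feature_importances_` attribute, whose values sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, only 2 of which carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature
importances = rf_demo.feature_importances_
for i, value in enumerate(importances):
    print(f"feature_{i}: {value:.3f}")
```

On the real data, the same attribute could be paired with `X.columns` to rank the turnover features.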

`SGDClassifier`, on the other hand, is a linear model trained with stochastic gradient descent. While it can be very efficient for large datasets, its performance depends heavily on feature scaling and on careful tuning of hyperparameters such as the learning rate and regularization. Without scaling and tuning, it may fail to converge or may perform poorly. In this dataset, `average_montly_hours` takes values in the hundreds while most other features lie between 0 and 1, so the features are on very different scales.
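One common way to address the scaling sensitivity described above is to standardize the features inside a pipeline. The sketch below is not part of the assignment runs: it uses synthetic data, and the factor of 200 applied to one column is a hypothetical stand-in for a feature measured on a much larger scale.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; inflate one column to mimic a large-scale feature
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 200  # hypothetical scale mismatch

unscaled_model = SGDClassifier(random_state=42)
# The scaler is fit on each training fold only, so no test-fold information leaks
scaled_model = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))

unscaled_scores = cross_val_score(unscaled_model, X_demo, y_demo, cv=5, scoring="f1")
scaled_scores = cross_val_score(scaled_model, X_demo, y_demo, cv=5, scoring="f1")

print(f"Unscaled mean F1: {unscaled_scores.mean():.4f}")
print(f"Scaled mean F1:   {scaled_scores.mean():.4f}")
```

Putting the scaler in the pipeline, rather than scaling the whole dataset up front, keeps the cross-validation estimate honest.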

Given the initial goal was to quickly establish a baseline model and evaluate performance across different splits, the `RandomForestClassifier` was a more pragmatic and likely-to-perform-well choice without immediate extensive tuning.

I will test this expectation by comparing the performance of the two classifiers on each split.

## a) Split and evaluate for 85/15



In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
import numpy as np

X = df.drop('left', axis=1)
y = df['left']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

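# Explicitly shuffle the training data (train_test_split has already shuffled once).
# Note: this permutation is unseeded, so the CV folds and scores can differ between runs.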
shuffle_index = np.random.permutation(X_train.index)
X_train_shuffled, y_train_shuffled = X_train.loc[shuffle_index], y_train.loc[shuffle_index]

# Random Forest Classifier
print("--- Random Forest Classifier (85/15 Split) ---")
rf_model = RandomForestClassifier(random_state=42)

rf_cv_scores = cross_val_score(rf_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {rf_cv_scores}")
print(f"Average cross-validation score: {rf_cv_scores.mean():.4f}")

rf_model.fit(X_train_shuffled, y_train_shuffled)
rf_y_pred = rf_model.predict(X_test)

evaluate_model(y_test, rf_y_pred)

# SGD Classifier
print("\n--- SGD Classifier (85/15 Split) ---")
sgd_model = SGDClassifier(random_state=42)

sgd_cv_scores = cross_val_score(sgd_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {sgd_cv_scores}")
print(f"Average cross-validation score: {sgd_cv_scores.mean():.4f}")

sgd_model.fit(X_train_shuffled, y_train_shuffled)
sgd_y_pred = sgd_model.predict(X_test)

evaluate_model(y_test, sgd_y_pred)

--- Random Forest Classifier (85/15 Split) ---
Cross-validation scores: [0.98704415 0.97504798 0.98128599 0.98272553 0.98128599]
Average cross-validation score: 0.9815
Confusion Matrix:
[[1519    3]
 [  24  293]]

Precision: 0.9899
Recall: 0.9243
F1 Score: 0.9560

--- SGD Classifier (85/15 Split) ---
Cross-validation scores: [0.76871401 0.82677543 0.8171785  0.82149712 0.82053743]
Average cross-validation score: 0.8109
Confusion Matrix:
[[1520    2]
 [ 317    0]]

Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000


## b) Split and evaluate for 75/25



In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
import numpy as np

X = df.drop('left', axis=1)
y = df['left']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

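# Explicitly shuffle the training data (train_test_split has already shuffled once).
# Note: this permutation is unseeded, so the CV folds and scores can differ between runs.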
shuffle_index = np.random.permutation(X_train.index)
X_train_shuffled, y_train_shuffled = X_train.loc[shuffle_index], y_train.loc[shuffle_index]

# Random Forest Classifier
print("--- Random Forest Classifier (75/25 Split) ---")
rf_model = RandomForestClassifier(random_state=42)

rf_cv_scores = cross_val_score(rf_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {rf_cv_scores}")
print(f"Average cross-validation score: {rf_cv_scores.mean():.4f}")

rf_model.fit(X_train_shuffled, y_train_shuffled)
rf_y_pred = rf_model.predict(X_test)

evaluate_model(y_test, rf_y_pred)

# SGD Classifier
print("\n--- SGD Classifier (75/25 Split) ---")
sgd_model = SGDClassifier(random_state=42)

sgd_cv_scores = cross_val_score(sgd_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {sgd_cv_scores}")
print(f"Average cross-validation score: {sgd_cv_scores.mean():.4f}")

sgd_model.fit(X_train_shuffled, y_train_shuffled)
sgd_y_pred = sgd_model.predict(X_test)

evaluate_model(y_test, sgd_y_pred)

--- Random Forest Classifier (75/25 Split) ---
Cross-validation scores: [0.98151169 0.98151169 0.98096792 0.98096792 0.97823721]
Average cross-validation score: 0.9806
Confusion Matrix:
[[2531    5]
 [  42  487]]

Precision: 0.9898
Recall: 0.9206
F1 Score: 0.9540

--- SGD Classifier (75/25 Split) ---
Cross-validation scores: [0.77705275 0.23926047 0.81783578 0.2262099  0.32100109]
Average cross-validation score: 0.4763
Confusion Matrix:
[[ 745 1791]
 [  37  492]]

Precision: 0.2155
Recall: 0.9301
F1 Score: 0.3499


## c) Split and evaluate for 65/35



In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
import numpy as np

X = df.drop('left', axis=1)
y = df['left']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42, stratify=y)

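# Explicitly shuffle the training data (train_test_split has already shuffled once).
# Note: this permutation is unseeded, so the CV folds and scores can differ between runs.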
shuffle_index = np.random.permutation(X_train.index)
X_train_shuffled, y_train_shuffled = X_train.loc[shuffle_index], y_train.loc[shuffle_index]

# Random Forest Classifier
print("--- Random Forest Classifier (65/35 Split) ---")
rf_model = RandomForestClassifier(random_state=42)

rf_cv_scores = cross_val_score(rf_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {rf_cv_scores}")
print(f"Average cross-validation score: {rf_cv_scores.mean():.4f}")

rf_model.fit(X_train_shuffled, y_train_shuffled)
rf_y_pred = rf_model.predict(X_test)

evaluate_model(y_test, rf_y_pred)

# SGD Classifier
print("\n--- SGD Classifier (65/35 Split) ---")
sgd_model = SGDClassifier(random_state=42)

sgd_cv_scores = cross_val_score(sgd_model, X_train_shuffled, y_train_shuffled, cv=5)
print(f"Cross-validation scores: {sgd_cv_scores}")
print(f"Average cross-validation score: {sgd_cv_scores.mean():.4f}")

sgd_model.fit(X_train_shuffled, y_train_shuffled)
sgd_y_pred = sgd_model.predict(X_test)

evaluate_model(y_test, sgd_y_pred)

--- Random Forest Classifier (65/35 Split) ---
Cross-validation scores: [0.98431619 0.97929737 0.98494354 0.97489014 0.97363465]
Average cross-validation score: 0.9794
Confusion Matrix:
[[3542    9]
 [  64  676]]

Precision: 0.9869
Recall: 0.9135
F1 Score: 0.9488

--- SGD Classifier (65/35 Split) ---
Cross-validation scores: [0.65370138 0.82747804 0.82747804 0.81858129 0.82046453]
Average cross-validation score: 0.7895
Confusion Matrix:
[[3015  536]
 [ 422  318]]

Precision: 0.3724
Recall: 0.4297
F1 Score: 0.3990
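The three split sections above repeat the same steps with a different `test_size`. As a rough sketch of how that could be condensed, the hypothetical helper `run_split` below wraps the split, cross-validation, and test evaluation, and is looped over the three proportions. It runs on synthetic data from `make_classification` with a smaller forest to stay fast, so its numbers are not comparable to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

def run_split(model, X, y, test_size, seed=42):
    """Hypothetical helper: stratified split, 5-fold CV accuracy, test-set F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    return cv_scores.mean(), f1_score(y_test, model.predict(X_test))

# Synthetic, imbalanced stand-in data (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

results = {}
for test_size in (0.15, 0.25, 0.35):
    model = RandomForestClassifier(n_estimators=25, random_state=42)
    results[test_size] = run_split(model, X_demo, y_demo, test_size)

for test_size, (cv_mean, test_f1) in results.items():
    print(f"test_size={test_size}: CV accuracy={cv_mean:.4f}, test F1={test_f1:.4f}")
```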


## Random Forest Classifier Results Summary Across Splits

*   **For the 85/15 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.9815
    *   Test set metrics:
        *   `Confusion Matrix`: \[\[1519 3], \[24 293]]
        *   `Precision`: 0.9899
        *   `Recall`: 0.9243
        *   `F1 Score`: 0.9560

*   **For the 75/25 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.9806
    *   Test set metrics:
        *   `Confusion Matrix`: \[\[2531 5], \[42 487]]
        *   `Precision`: 0.9898
        *   `Recall`: 0.9206
        *   `F1 Score`: 0.9540

*   **For the 65/35 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.9794
    *   Test set metrics:
        *   `Confusion Matrix`: \[\[3542 9], \[64 676]]
        *   `Precision`: 0.9869
        *   `Recall`: 0.9135
        *   `F1 Score`: 0.9488

Across all splits, the Random Forest classifier kept precision at about 0.99, meaning very few employees who stayed were flagged as leaving. Recall stayed between 0.91 and 0.92 and F1 between 0.95 and 0.96, so the model identified most employees who left, and its performance changed little as the training set shrank.
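The reported metrics follow directly from the confusion matrices. As a check, the sketch below recomputes precision, recall, and F1 by hand from the 85/15 Random Forest matrix, using scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# 85/15 Random Forest confusion matrix from the results above
cm = np.array([[1519, 3],
               [24, 293]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted leavers, how many actually left
recall = tp / (tp + fn)      # of actual leavers, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")  # 0.9899
print(f"Recall: {recall:.4f}")        # 0.9243
print(f"F1 Score: {f1:.4f}")          # 0.9560
```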

## SGD Classifier Results Summary Across Splits

*   **For the 85/15 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.8109
    *   Test set metrics:
        *   `Confusion Matrix`: [[1520 2], [317 0]]
        *   `Precision`: 0.0000
        *   `Recall`: 0.0000
        *   `F1 Score`: 0.0000

*   **For the 75/25 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.4763
    *   Test set metrics:
        *   `Confusion Matrix`: [[745 1791], [37 492]]
        *   `Precision`: 0.2155
        *   `Recall`: 0.9301
        *   `F1 Score`: 0.3499

*   **For the 65/35 train/test split:**
    *   Average 5-fold cross-validation score on training set: 0.7895
    *   Test set metrics:
        *   `Confusion Matrix`: [[3015 536], [ 422 318]]
        *   `Precision`: 0.3724
        *   `Recall`: 0.4297
        *   `F1 Score`: 0.3990

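The SGD classifier's performance varied significantly across the splits and was much lower than the Random Forest classifier's: its F1 score ranged from 0.00 to 0.40, compared with roughly 0.95 for Random Forest. On the 85/15 split it identified none of the employees who left (zero true positives), and on the 75/25 split it reached 0.93 recall only by flagging 1,791 staying employees as leavers, which pushed precision down to 0.22. This instability is consistent with the rationale given earlier, since the features were not scaled before training the linear model.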
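The 85/15 SGD result also shows why accuracy alone can mislead on an imbalanced target. A quick check using that confusion matrix: accuracy looks acceptable even though no leaver is detected, and it is no better than always predicting "stayed".

```python
import numpy as np

# 85/15 SGD confusion matrix from the results above
cm = np.array([[1520, 2],
               [317, 0]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
# Accuracy of a trivial model that predicts class 0 (stayed) for everyone
majority_baseline = (tn + fp) / total

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

This is why the evaluation here relies on precision, recall, and F1 rather than on the cross-validation accuracy alone.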