<a href="https://colab.research.google.com/github/cynthiacc512/DicodingJayaJayaMaju/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project: Solving the Problems of an Edutech Company

- Name: Cynthia Caroline
- Email: Cynthiacc512@gmail.com
- Dicoding ID: meecyn

# Business Understanding

## What Is the Problem?
Jaya Jaya Maju, a large company with more than 1,000 employees, has a headache. Every month someone resigns, and unfortunately it is never just one or two people. The overall attrition rate has already passed 10%, and that is a red light for the HR division. Lots of resignations is a red flag for any office :)

If this goes on, the business will take a hit. Constant recruitment costs money and time, and on top of that new hires need time to get up to speed.

## What Does HR Want?
The HR division wants to know in detail why people are resigning. Is it the job role? Salary? Age? Or maybe some other "random" factor we have not thought of.

On top of that, they need an **interactive dashboard** to monitor the state of the workforce regularly. If possible, they would also like a prediction model so resignations can be prevented before they happen.

## What Is the Goal of This Project?
As Data Scientists, our job is to:
1. **Dig into the HR data** to find the main factors that drive employees to leave.
2. **Build a dashboard** that helps HR see trends and insights quickly.
3. **Build a prediction model** to detect employees at risk of resigning.

## Preparation

### Setting up the required libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

### Loading the data

In [2]:
df = pd.read_csv("employee_data.csv")

## Generate Requirements.txt

In [3]:
!pip freeze > requirements.txt

## Data Understanding

In [4]:
df.isnull().sum()

Unnamed: 0,0
EmployeeId,0
Age,0
Attrition,412
BusinessTravel,0
DailyRate,0
Department,0
DistanceFromHome,0
Education,0
EducationField,0
EmployeeCount,0


In [5]:
# Drop rows where the target (Attrition) is missing
df_clean = df.dropna(subset=["Attrition"]).copy()

The missing values are in Attrition, which is the prediction target, so I do not impute them. Imputing the target would mean inventing labels and would bias the model, since we do not know what actually happened to those employees. I treat these rows as defective samples that cannot be used.
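As a small illustration of the step above, the sketch below uses a made-up four-row frame (the values are hypothetical, not taken from `employee_data.csv`) to show that `dropna(subset=["Attrition"])` removes only rows whose target is missing and leaves gaps in other columns alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [37, 51, np.nan, 40],
    "Attrition": [1.0, np.nan, 0.0, 0.0],
})

# Keep rows that have a label; a missing Age does not cause a drop.
labeled = toy.dropna(subset=["Attrition"]).copy()

# Rows without a label cannot be used for supervised training.
unlabeled = toy[toy["Attrition"].isna()]

print(len(toy), len(labeled), len(unlabeled))
```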

In [6]:
drop_cols = ["EmployeeId", "StandardHours", "Over18", "EmployeeCount"]
df_drop = df_clean.drop(columns=drop_cols)

In [7]:
df_encoded = df_drop.copy()
categorical_cols = df_encoded.select_dtypes(include="object").columns

le = LabelEncoder()
for col in categorical_cols:
    df_encoded[col] = le.fit_transform(df_encoded[col])
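One side effect of reusing a single `LabelEncoder` in the loop above is that only the last column's mapping survives, which makes encoded values such as `Department` 0/1/2 hard to read later. A possible alternative, sketched here on hypothetical toy values (the category strings are assumptions, not confirmed from the dataset), keeps one encoder per column so the mapping can be looked up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration only.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "Human Resources", "Sales", "Research & Development"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()              # a fresh encoder for every column
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc               # keep it so codes can be decoded later

# classes_ is sorted, and the position of a label is its integer code.
mappings = {
    col: {label: int(code) for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

With the mapping stored, a grouped result indexed by code can be relabeled before it goes into a report or dashboard.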

In [8]:
print("Attrition distribution:")
print(df_encoded["Attrition"].value_counts(normalize=True) * 100)

Attrition distribution:
Attrition
0.0    83.081285
1.0    16.918715
Name: proportion, dtype: float64


In [9]:
print("Mean age, resigned vs. stayed:")
print(df_encoded.groupby("Attrition")["Age"].mean())

print("Mean monthly income, resigned vs. stayed:")
print(df_encoded.groupby("Attrition")["MonthlyIncome"].mean())

print("Attrition rate by OverTime:")
print(df_encoded.groupby("OverTime")["Attrition"].mean())

print("Attrition rate by Department:")
print(df_encoded.groupby("Department")["Attrition"].mean())

Mean age, resigned vs. stayed:
Attrition
0.0    37.786121
1.0    33.469274
Name: Age, dtype: float64
Mean monthly income, resigned vs. stayed:
Attrition
0.0    6982.928328
1.0    4872.938547
Name: MonthlyIncome, dtype: float64
Attrition rate by OverTime:
OverTime
0    0.107856
1    0.319218
Name: Attrition, dtype: float64
Attrition rate by Department:
Department
0    0.157895
1    0.152639
2    0.206897
Name: Attrition, dtype: float64
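To put the group means above on a common scale, the sketch below reruns the arithmetic on the printed numbers: a ratio of attrition rates for the two `OverTime` codes and the gap in mean monthly income. The variable names are illustrative, and which code stands for working overtime depends on the encoder's label order, which is assumed here rather than shown in the output.

```python
# Values copied from the output above.
attrition_rate_by_overtime = {0: 0.107856, 1: 0.319218}
mean_income_by_attrition = {0.0: 6982.928328, 1.0: 4872.938547}

# How many times higher the attrition rate is for OverTime code 1 than code 0.
rate_ratio = attrition_rate_by_overtime[1] / attrition_rate_by_overtime[0]

# Difference in mean monthly income between those who stayed and those who resigned.
income_gap = mean_income_by_attrition[0.0] - mean_income_by_attrition[1.0]

print(f"OverTime code 1 vs 0 attrition rate ratio: {rate_ratio:.2f}")
print(f"Mean monthly income gap (stayed - resigned): {income_gap:.2f}")
```

On these numbers the attrition rate for code 1 is roughly three times the rate for code 0.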


## Modeling

In [10]:
df_encoded

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1,37,1.0,2,1141,1,11,2,3,1,0,...,3,1,0,15,2,1,1,0,0,0
2,51,1.0,2,1323,1,4,4,1,1,1,...,3,3,3,18,2,4,10,0,2,7
3,42,0.0,1,555,2,26,3,2,3,0,...,3,4,1,23,2,4,20,4,4,8
6,40,0.0,2,1124,2,1,2,3,2,1,...,4,3,3,6,2,2,4,3,0,2
7,55,1.0,2,725,1,2,3,3,4,1,...,3,4,1,24,2,3,5,2,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1464,28,1.0,0,1366,1,24,2,5,2,1,...,3,1,0,10,2,2,10,7,1,9
1465,38,0.0,2,168,1,1,3,1,3,0,...,3,4,0,10,4,4,1,0,0,0
1467,28,1.0,2,1485,1,12,1,1,3,0,...,3,4,0,1,4,2,1,1,0,0
1468,40,0.0,0,458,1,16,2,1,3,1,...,3,2,1,6,0,3,4,2,0,0


## Data Preprocessing

In [11]:
X = df_encoded.drop(columns=["Attrition"])
y = df_encoded["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [12]:
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
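SMOTE balances the training set by creating synthetic minority-class rows rather than duplicating existing ones. The sketch below shows only the core interpolation idea on two hypothetical points; it is not imblearn's implementation, and the helper name `interpolate` is made up for illustration.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a point at a random position on the segment sample -> neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(42)

# Two hypothetical minority-class rows: (Age, MonthlyIncome).
sample = [30.0, 4000.0]
neighbor = [34.0, 5000.0]

# The synthetic row lies somewhere between the two real rows.
synthetic = interpolate(sample, neighbor, rng)
print(synthetic)
```

Because the split happens before resampling in the cells above, synthetic rows never end up in `X_test`. One thing to keep in mind is that interpolating label-encoded columns can yield in-between values that match no real category.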

In [13]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

## Evaluation

In [14]:
y_pred = rf_model.predict(X_test)

In [15]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         0.0       0.88      0.93      0.91       176
         1.0       0.54      0.39      0.45        36

    accuracy                           0.84       212
   macro avg       0.71      0.66      0.68       212
weighted avg       0.82      0.84      0.83       212
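The report can be checked by hand. The confusion counts below are not printed in the notebook; they are inferred as whole numbers consistent with the rounded report (support 176 and 36), so treat them as an assumption. The sketch recomputes precision, recall and F1 for class 1.0, plus the overall accuracy.

```python
# Inferred counts for the test set (assumption, not notebook output).
tn, fp = 164, 12   # actual class 0.0: 176 rows
fn, tp = 22, 14    # actual class 1.0: 36 rows

precision = tp / (tp + fp)                           # of predicted resignations, how many were real
recall = tp / (tp + fn)                              # of real resignations, how many were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these counts, a recall of 0.39 means 22 of the 36 employees who actually resigned in the test set were missed, which the 0.84 accuracy hides because the classes are imbalanced.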



In [16]:
y_pred_all = rf_model.predict(X)

In [17]:
# Report on the full labeled dataset. Most of these rows were used for training,
# so the scores are optimistic compared with the test-set report.
report_all = classification_report(y, y_pred_all)
print(report_all)



In [18]:
df_encoded["Attrition_Predicted"] = y_pred_all
df_encoded.to_csv("employee_data_cleaned.csv", index=False)

## Feature Importance

In [19]:
importances = rf_model.feature_importances_
features = X.columns

In [20]:
importances_df = pd.DataFrame({
    "Feature": features,
    "Importance": importances
}).sort_values(by="Importance", ascending=False).reset_index(drop=True)

print(importances_df.to_string(index=False))

                 Feature  Importance
        StockOptionLevel    0.088638
           MonthlyIncome    0.065165
         JobSatisfaction    0.057325
          JobInvolvement    0.054691
             MonthlyRate    0.050443
       TotalWorkingYears    0.048568
 EnvironmentSatisfaction    0.047012
                     Age    0.043619
              HourlyRate    0.038596
               DailyRate    0.037511
                JobLevel    0.037391
         WorkLifeBalance    0.034850
        DistanceFromHome    0.033296
RelationshipSatisfaction    0.031838
    YearsWithCurrManager    0.030206
      YearsInCurrentRole    0.029426
          BusinessTravel    0.029197
   TrainingTimesLastYear    0.028964
          YearsAtCompany    0.026667
      NumCompaniesWorked    0.023451
       PercentSalaryHike    0.022919
                 JobRole    0.022493
               Education    0.021390
          EducationField    0.019973
 YearsSinceLastPromotion    0.017471
           MaritalStatus    0.014915
 

## Model

In [21]:
joblib.dump(rf_model, 'rf_model.pkl')

['rf_model.pkl']
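A saved model is only useful if it can be reloaded and fed the same feature columns in the same order. The sketch below is a self-contained round trip on hypothetical toy data (the feature values and the temporary file path are assumptions); in the notebook the file would be `rf_model.pkl`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy training data with two features.
X_toy = pd.DataFrame({"Age": [25, 30, 45, 50], "OverTime": [1, 1, 0, 0]})
y_toy = [1, 1, 0, 0]

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Save and reload through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# New rows must use the same columns, in the same order, as at training time.
new_rows = pd.DataFrame({"Age": [28, 48], "OverTime": [1, 0]})
risk = loaded.predict_proba(new_rows)[:, 1]  # probability of class 1
print(risk)
```

Using `predict_proba` instead of `predict` gives a score per employee, which may suit a dashboard better than a hard yes/no label.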