# Employee Performance Prediction - Training Notebook

This notebook recreates the pipeline used in the demo project:
- Load the dataset
- Preprocess (OneHotEncoder for categorical columns, StandardScaler for numeric columns)
- Train a RandomForestClassifier pipeline
- Save the fitted pipeline (model.pkl) and evaluate it on the held-out test set


In [1]:
import pandas as pd

# Path is specific to the author's machine; adjust it before running.
df = pd.read_csv("C:/Users/ASUS/Desktop/Project_Demo/Dataset/employee_performance.csv")
df.head()

Unnamed: 0,employee_id,age,education_level,department,job_role,years_experience,training_hours,avg_monthly_hours,satisfaction_level,last_performance_score,absenteeism_days,promotion_last_5yrs,performance_label
0,E10000,25,High School,IT,Engineer,7,42,188,0.56,3,2,1,Medium
1,E10001,51,Bachelors,Finance,Operator,34,49,243,0.553,3,4,0,Medium
2,E10002,46,High School,IT,Executive,25,46,184,0.706,1,1,0,Low
3,E10003,38,High School,IT,Analyst,16,38,181,0.707,4,2,0,Medium
4,E10004,38,Bachelors,R&D,Technician,21,37,170,0.559,2,3,0,Low


In [2]:
from sklearn.model_selection import train_test_split
# employee_id and the 'Unnamed: 0' index column are left out of the features.
feature_cols = [
    'age', 'education_level', 'department', 'job_role', 'years_experience',
    'training_hours', 'avg_monthly_hours', 'satisfaction_level',
    'last_performance_score', 'absenteeism_days', 'promotion_last_5yrs',
]
X = df[feature_cols]
y = df["performance_label"]
# 80/20 split; stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train.shape, X_test.shape

((1600, 11), (400, 11))
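
The effect of `stratify=y` in the split above can be seen on a small example. The sketch below uses a made-up 10/30/60 label mix (not the real dataset) and checks that the 20% test part keeps the same proportions; the variable names are illustrative only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: 10 High, 30 Low, 60 Medium (100 rows in total).
labels = ["High"] * 10 + ["Low"] * 30 + ["Medium"] * 60
rows = list(range(len(labels)))

# Same call shape as in the notebook: 20% test, fixed seed, stratified on the labels.
rows_train, rows_test, y_tr, y_te = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, a rare class such as High could by chance end up under-represented in the test part, which makes its per-class metrics less reliable.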

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
import joblib

# Categorical columns are one-hot encoded; every other feature is treated as numeric and scaled.
cat_cols = ['education_level', 'department', 'job_role']
num_cols = [c for c in feature_cols if c not in cat_cols]

# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising.
preprocessor = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols),
                                  ('num', StandardScaler(), num_cols)],
                                 remainder='drop')

# Preprocessing and classifier are fitted together, so the saved file contains both.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=150, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Persist the fitted pipeline into the project's Flask folder.
joblib.dump(pipeline, "C:/Users/ASUS/Desktop/Project_Demo/Flask/model.pkl")
print("Model saved successfully!")

Model saved successfully!
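
Because the saved file holds the whole pipeline, a consumer only needs `joblib.load` and a DataFrame with the training column names. The sketch below illustrates that round trip on a toy stand-in: the three-column dataset, its values, and the temporary file location are assumptions made for the example, not the project's data or paths.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "department": ["IT", "Finance", "IT", "R&D", "Finance", "R&D"],
    "age": [25, 51, 46, 38, 29, 44],
    "training_hours": [42, 49, 46, 38, 20, 55],
    "performance_label": ["Medium", "Medium", "Low", "Medium", "Low", "Low"],
})
cat_cols = ["department"]
num_cols = ["age", "training_hours"]

preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
     ("num", StandardScaler(), num_cols)],
    remainder="drop",
)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(toy[cat_cols + num_cols], toy["performance_label"])

# Round trip through joblib, as a serving app might do at startup.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipeline, model_path)
    loaded = joblib.load(model_path)

# A new record is passed as a one-row DataFrame with the training column names.
# "Sales" was never seen during fit; handle_unknown="ignore" lets it through.
new_row = pd.DataFrame([{"department": "Sales", "age": 33, "training_hours": 40}])
prediction = loaded.predict(new_row)[0]
print("Predicted label:", prediction)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so the serving environment should pin the versions used for training.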


In [4]:
from sklearn.metrics import classification_report, accuracy_score
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8025
              precision    recall  f1-score   support

        High       0.85      0.28      0.42        39
         Low       0.87      0.71      0.78       123
      Medium       0.78      0.94      0.85       238

    accuracy                           0.80       400
   macro avg       0.83      0.64      0.68       400
weighted avg       0.81      0.80      0.79       400
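
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows of the report above. A minimal sketch follows; `macro_avg` and `weighted_avg` are illustrative helper names, not scikit-learn functions, and the inputs are the rounded figures printed in the report.

```python
# Per-class figures copied from the classification report above.
report = {
    "High":   {"precision": 0.85, "recall": 0.28, "f1": 0.42, "support": 39},
    "Low":    {"precision": 0.87, "recall": 0.71, "f1": 0.78, "support": 123},
    "Medium": {"precision": 0.78, "recall": 0.94, "f1": 0.85, "support": 238},
}


def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes weighted by support (number of test rows per class)."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(f"{metric:<9} macro={macro_avg(metric):.2f} weighted={weighted_avg(metric):.2f}")
```

The gap between macro recall (0.64) and weighted recall (0.80) comes from the High class: its recall of 0.28 counts for a third of the macro average but only 39 of 400 rows in the weighted one, so the overall accuracy hides how often High performers are missed.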

