TF-IDF Job Requirement Features

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset
df = pd.read_excel("XGBoost-model-training/resumes.xlsx")

# Fill NaN values in Job Requirement
df["Job Requirement"] = df["Job Requirement"].fillna("")

# Cap the TF-IDF vocabulary at 8 terms, or fewer if the corpus has fewer
# unique whitespace-separated tokens (a rough estimate of vocabulary size)
max_features = min(8, len(set(" ".join(df["Job Requirement"]).split())))
vectorizer = TfidfVectorizer(max_features=max_features)

# Transform job requirements into numerical features
job_requirement_features = vectorizer.fit_transform(df["Job Requirement"]).toarray()
num_features = job_requirement_features.shape[1]  # Get actual feature count

# Convert to DataFrame
job_req_df = pd.DataFrame(
    job_requirement_features,
    columns=[f"job_feature_{i}" for i in range(num_features)],
    index=df.index,  # Keep rows aligned with df for the concat below
)

# Merge with original dataset (excluding original text column)
df = df.drop(columns=["Job Requirement"])
df = pd.concat([df, job_req_df], axis=1)

# Save the processed dataset
df.to_csv("processed_resumes.csv", index=False)

print(f"Successfully processed {df.shape[0]} resumes with {num_features} job requirement features!")

1️⃣ Load & Preprocess the Data
First, load processed_resumes.csv, which contains both the structured features (Age, Gender, etc.) and the TF-IDF job requirement features (job_feature_0 through job_feature_n).

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
df = pd.read_csv("C:/Users/Acer/Desktop/Talaba,Ephraim/ARSwithPredictiveAnalytics/XGBoost-model-training/processed_resumes.csv")

# Convert categorical columns to integer codes (one LabelEncoder per column)
categorical_cols = ["Gender", "Address", "Skills", "Education", "Work Experience", "Certificates", "Course"]

for col in categorical_cols:
    df[col] = df[col].astype(str)  # Cast to string; missing values become the literal "nan"
    df[col] = LabelEncoder().fit_transform(df[col])

# Separate features and labels
X = df.drop(columns=["Hired"])  # Features (resume attributes + job requirement features)
y = df["Hired"]  # Target (1 = Suitable, 0 = Not Suitable)

# Split into Train & Test Set (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Samples: {X_train.shape[0]}, Testing Samples: {X_test.shape[0]}")


Training Samples: 560, Testing Samples: 140
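The classification report further down shows 112 negatives and 28 positives in the test set, so the classes are imbalanced. A plain random split can shift that ratio between train and test. The sketch below, on hypothetical toy data with a 20% positive rate, shows how the stratify argument of train_test_split keeps the ratio the same in both splits. The toy DataFrame and its "feature" column are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 700 rows, 20% positives (not the real resume dataset).
toy = pd.DataFrame({"feature": range(700), "Hired": [1] * 140 + [0] * 560})
X_toy = toy.drop(columns=["Hired"])
y_toy = toy["Hired"]

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```

In the notebook's own split, the equivalent change would be passing stratify=y alongside the existing arguments.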


2️⃣ Train the XGBoost Model
Next, train an XGBoost binary classifier on the training split to predict the Hired label.

In [10]:
import xgboost as xgb

# Create the XGBoost classifier
model = xgb.XGBClassifier(
    objective="binary:logistic",  # Binary classification
    eval_metric="logloss",
)

# Train the model
model.fit(X_train, y_train)

# Save the trained model
model.save_model("trained-XGBoost-model/xgboost_model.json")

print("✅ XGBoost Model Trained and Saved!")


✅ XGBoost Model Trained and Saved!



3️⃣ Evaluate Model Performance
After training, check how well the model predicts candidate suitability on the held-out test set, looking at per-class precision and recall as well as overall accuracy.

In [11]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Model Accuracy: {accuracy * 100:.2f}%")

# Print classification report
print(classification_report(y_test, y_pred))


🎯 Model Accuracy: 78.57%
              precision    recall  f1-score   support

           0       0.85      0.88      0.87       112
           1       0.46      0.39      0.42        28

    accuracy                           0.79       140
   macro avg       0.66      0.64      0.65       140
weighted avg       0.77      0.79      0.78       140
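To put the accuracy figure in context, the sketch below rebuilds a confusion matrix that is consistent with the report and compares the model against a baseline that always predicts class 0. The per-cell counts (99, 13, 17, 11) are inferred from the rounded precision, recall, and support values above. They are an assumption, not output captured from the notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Counts inferred from the rounded report (assumption, not notebook output):
# class 0: 99 of 112 predicted correctly; class 1: 11 of 28 predicted correctly.
y_true = [0] * 112 + [1] * 28
y_hat = [0] * 99 + [1] * 13 + [1] * 11 + [0] * 17

cm = confusion_matrix(y_true, y_hat)
model_acc = accuracy_score(y_true, y_hat)

# Majority-class baseline: always predict 0 ("not hired").
baseline_acc = accuracy_score(y_true, [0] * len(y_true))

print(cm)
print(f"Model accuracy: {model_acc:.4f}, majority baseline: {baseline_acc:.4f}")
```

Under these inferred counts, always predicting class 0 scores 112/140 = 80.00%, slightly above the model's 78.57%, so the class 1 recall of 0.39 is the more informative number to track when tuning.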

