✅ TASK 1 – Dataset Understanding

This dataset represents a resume screening system used in hiring. Each row corresponds to a candidate who applied for a job. The dataset includes information about their skills, experience, education, certifications, job role applied for, salary expectations, and number of projects completed.

The dataset contains different types of data:
• Text data – Skills, Certifications, and Job Role
• Numerical data – Experience (Years), Salary Expectation, Projects Count, and AI Score
• Categorical data – Education and Recruiter Decision
• Identifier columns – Resume_ID and Name

The main goal of this project is to build a machine learning model that predicts whether a candidate will be hired or rejected.
The target variable in this dataset is: Recruiter Decision

This column contains two categories:
• Hire
• Reject

For model training, these values will be converted into numeric format:
Hire = 1
Reject = 0

One important observation is the presence of the AI Score column. Since this score might already influence the recruiter's decision, using it as a feature could cause data leakage and lead to unrealistic model performance. Therefore, it should not be used while training the model if the goal is to simulate an independent hiring prediction system.
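One way to gauge how strongly the AI score determines the label is to check how well a single score threshold reproduces the recruiter decision. The sketch below is illustrative only: it uses the ten rows displayed in the head and tail outputs of this notebook, and the helper name `best_threshold_accuracy` is hypothetical. An accuracy close to 1.0 on the full dataset would suggest that the score nearly encodes the target.

```python
import pandas as pd

# Ten rows copied from the head and tail of the dataset shown in this notebook.
sample = pd.DataFrame({
    "ai_score": [100, 100, 70, 95, 100, 60, 45, 65, 100, 100],
    "decision": ["Hire", "Hire", "Hire", "Hire", "Hire",
                 "Reject", "Reject", "Hire", "Hire", "Hire"],
})

def best_threshold_accuracy(scores, labels):
    """Return (threshold, accuracy) of the best rule 'score >= threshold means Hire'."""
    is_hire = labels == "Hire"
    best = (None, 0.0)
    for threshold in sorted(scores.unique()):
        accuracy = ((scores >= threshold) == is_hire).mean()
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best

threshold, accuracy = best_threshold_accuracy(sample["ai_score"], sample["decision"])
print(f"Best rule: score >= {threshold} -> Hire (accuracy {accuracy:.2f} on this sample)")
```

Ten rows prove nothing about the other 990, so the same check should be repeated on the full DataFrame before drawing conclusions.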

1️⃣ Import Required Libraries

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Sklearn base imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report


2️⃣ Load the Dataset

In [2]:
from google.colab import files
uploaded = files.upload()

Saving AI-Based Hiring Prediction System.csv to AI-Based Hiring Prediction System.csv


In [3]:
df = pd.read_csv("AI-Based Hiring Prediction System.csv")


3️⃣ View the Data

In [4]:
print("First 5 rows:")
df.head()


First 5 rows:


Unnamed: 0,Resume_ID,Name,Skills,Experience (Years),Education,Certifications,Job Role,Recruiter Decision,Salary Expectation ($),Projects Count,AI Score (0-100)
0,1,Ashley Ali,"TensorFlow, NLP, Pytorch",10,B.Sc,,AI Researcher,Hire,104895,8,100
1,2,Wesley Roman,"Deep Learning, Machine Learning, Python, SQL",10,MBA,Google ML,Data Scientist,Hire,113002,1,100
2,3,Corey Sanchez,"Ethical Hacking, Cybersecurity, Linux",1,MBA,Deep Learning Specialization,Cybersecurity Analyst,Hire,71766,7,70
3,4,Elizabeth Carney,"Python, Pytorch, TensorFlow",7,B.Tech,AWS Certified,AI Researcher,Hire,46848,0,95
4,5,Julie Hill,"SQL, React, Java",4,PhD,,Software Engineer,Hire,87441,9,100


In [5]:
print("Last 5 rows:")
df.tail()


Last 5 rows:


Unnamed: 0,Resume_ID,Name,Skills,Experience (Years),Education,Certifications,Job Role,Recruiter Decision,Salary Expectation ($),Projects Count,AI Score (0-100)
995,996,Brenda Williams,"Cybersecurity, Linux, Ethical Hacking",0,B.Sc,,Cybersecurity Analyst,Reject,114364,9,60
996,997,Colleen Hicks,"Deep Learning, Machine Learning",0,MBA,Deep Learning Specialization,Data Scientist,Reject,103294,5,45
997,998,Michelle Molina,"TensorFlow, NLP",0,B.Tech,Google ML,AI Researcher,Hire,113855,9,65
998,999,Danielle Horn,"Linux, Networking, Cybersecurity, Ethical Hacking",8,PhD,AWS Certified,Cybersecurity Analyst,Hire,83146,10,100
999,1000,Chad Collins,"SQL, Machine Learning, Python, Deep Learning",7,M.Tech,Deep Learning Specialization,Data Scientist,Hire,119474,3,100


In [6]:
print("Random Sample:")
df.sample(5)

Random Sample:


Unnamed: 0,Resume_ID,Name,Skills,Experience (Years),Education,Certifications,Job Role,Recruiter Decision,Salary Expectation ($),Projects Count,AI Score (0-100)
980,981,Juan Le,"Python, SQL, Machine Learning",3,B.Tech,AWS Certified,Data Scientist,Hire,71142,4,75
418,419,Valerie Clark,"SQL, Machine Learning, Python, Deep Learning",2,MBA,AWS Certified,Data Scientist,Hire,58438,6,80
608,609,Nathan Davis,"NLP, TensorFlow, Python, Pytorch",8,MBA,,AI Researcher,Hire,50831,3,100
246,247,Daniel Alvarado,"Cybersecurity, Ethical Hacking, Linux",7,MBA,AWS Certified,Cybersecurity Analyst,Hire,63157,4,100
686,687,Sarah Adams,"SQL, Java, C++, React",6,B.Sc,Deep Learning Specialization,Software Engineer,Hire,73268,8,100


✅ TASK 2: Basic Data Inspection

Data inspection is necessary before model training because it helps identify data structure, detect missing values, check class imbalance, verify correct data types, and find anomalies or outliers. Without proper inspection, the model may produce inaccurate or misleading results.

1️⃣ Check Dataset Shape

In [7]:
df.shape

(1000, 11)

2️⃣ View Column Names

In [8]:
df.columns

Index(['Resume_ID', 'Name', 'Skills', 'Experience (Years)', 'Education',
       'Certifications', 'Job Role', 'Recruiter Decision',
       'Salary Expectation ($)', 'Projects Count', 'AI Score (0-100)'],
      dtype='object')

🔹 Step 2.1 - Clean Column Names

In [9]:
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("$", "", regex=False)
    .str.replace("-", "_")
)

df.columns


Index(['resume_id', 'name', 'skills', 'experience_years', 'education',
       'certifications', 'job_role', 'recruiter_decision',
       'salary_expectation_', 'projects_count', 'ai_score_0_100'],
      dtype='object')
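
The chained replacements above leave a trailing underscore in `salary_expectation_`. A regex-based alternative, sketched below, collapses every run of non-alphanumeric characters into one underscore and trims the ends. The helper `clean_column_name` is hypothetical and is not used by the rest of this notebook, which keeps the column names shown above.

```python
import re

def clean_column_name(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return name.strip("_")

raw_columns = ["Resume_ID", "Experience (Years)", "Salary Expectation ($)", "AI Score (0-100)"]
cleaned = [clean_column_name(c) for c in raw_columns]
print(cleaned)
```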

3️⃣ Check Data Types

In [10]:
df.dtypes

Unnamed: 0,0
resume_id,int64
name,object
skills,object
experience_years,int64
education,object
certifications,object
job_role,object
recruiter_decision,object
salary_expectation_,int64
projects_count,int64


4️⃣ Check Class Distribution

In [12]:
df["recruiter_decision"].value_counts()


recruiter_decision,count
Hire,812
Reject,188
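
The counts above imply a useful reference point: a model that always predicts Hire would already be right for every Hire row. The short calculation below, a sketch using only the counts printed above, makes that baseline explicit so later accuracy figures can be read against it.

```python
# Class counts taken from the value_counts() output above.
class_counts = {"Hire": 812, "Reject": 188}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total

print(proportions)
print(f"Always predicting {majority_label!r} gives accuracy {baseline_accuracy:.3f}")
```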


5️⃣ Summary Statistics (Numerical Columns)

In [13]:
df.describe()

Unnamed: 0,resume_id,experience_years,salary_expectation_,projects_count,ai_score_0_100
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,4.896,79994.486,5.133,83.95
std,288.819436,3.112695,23048.472549,3.23137,20.983036
min,1.0,0.0,40085.0,0.0,15.0
25%,250.75,2.0,60415.75,2.0,70.0
50%,500.5,5.0,79834.5,5.0,100.0
75%,750.25,8.0,99583.25,8.0,100.0
max,1000.0,10.0,119901.0,10.0,100.0


6️⃣ Check Missing Values

In [14]:
df.isnull().sum()

Unnamed: 0,0
resume_id,0
name,0
skills,0
experience_years,0
education,0
certifications,274
job_role,0
recruiter_decision,0
salary_expectation_,0
projects_count,0


✅ TASK 3: Data Cleaning & Preprocessing

In this step, identifier columns were removed as they do not contribute to prediction. The AI score column was also dropped to prevent data leakage. The target variable was converted into numeric format for model training. Missing values were checked and handled appropriately to ensure data consistency.

1️⃣ Drop Unnecessary Columns

In [15]:
df = df.drop(columns=["resume_id", "name", "ai_score_0_100"])


2️⃣ Convert Target Variable

In [16]:
df["recruiter_decision"] = df["recruiter_decision"].map({
    "Hire": 1,
    "Reject": 0
})


In [17]:
df["recruiter_decision"].value_counts()


recruiter_decision,count
1,812
0,188


3️⃣ Check Missing Values

In [18]:
df.isnull().sum()

Unnamed: 0,0
skills,0
experience_years,0
education,0
certifications,274
job_role,0
recruiter_decision,0
salary_expectation_,0
projects_count,0


Note: The certifications column contained 274 missing values. Since absence of certification is meaningful information, missing values were replaced with "none" instead of dropping rows. This preserves dataset size and maintains class distribution.

In [19]:
df["certifications"] = df["certifications"].fillna("none")


In [20]:
df.isnull().sum()

Unnamed: 0,0
skills,0
experience_years,0
education,0
certifications,0
job_role,0
recruiter_decision,0
salary_expectation_,0
projects_count,0


✅ TASK 4: Text Feature Engineering

Text cleaning is useful before vectorization because a vectorizer treats every distinct string as a separate token. Cleaning enforces uniform formatting, removes noise such as punctuation and special characters, and prevents the same word from being counted as two different tokens because of case differences.

1️⃣ Combine Text Columns

In [21]:
df["combined_text"] = (
    df["skills"] + " " +
    df["certifications"] + " " +
    df["job_role"]
)


In [22]:
df[["combined_text"]].head()


Unnamed: 0,combined_text
0,"TensorFlow, NLP, Pytorch none AI Researcher"
1,"Deep Learning, Machine Learning, Python, SQL G..."
2,"Ethical Hacking, Cybersecurity, Linux Deep Lea..."
3,"Python, Pytorch, TensorFlow AWS Certified AI R..."
4,"SQL, React, Java none Software Engineer"


2️⃣ Clean the Text

In [23]:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

df["combined_text"] = df["combined_text"].apply(clean_text)


In [24]:
df[["combined_text"]].head()

Unnamed: 0,combined_text
0,tensorflow nlp pytorch none ai researcher
1,deep learning machine learning python sql goog...
2,ethical hacking cybersecurity linux deep learn...
3,python pytorch tensorflow aws certified ai res...
4,sql react java none software engineer


✅ TASK 5: Convert Text to Numerical Features (TF-IDF)

1️⃣ Apply TF-IDF

In [25]:
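from sklearn.feature_extraction.text import TfidfVectorizer

# max_features caps the vocabulary at 500 terms; this corpus has fewer, so nothing is dropped
tfidf = TfidfVectorizer(max_features=500)

# Note: fitting on the full dataset lets test-set rows influence the IDF weights.
# The Pipeline in Task 12 avoids this by fitting TF-IDF on the training split only.
X_text = tfidf.fit_transform(df["combined_text"])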


In [26]:
X_text.shape


(1000, 28)

In [27]:
tfidf.get_feature_names_out()


array(['ai', 'analyst', 'aws', 'certified', 'cybersecurity', 'data',
       'deep', 'engineer', 'ethical', 'google', 'hacking', 'java',
       'learning', 'linux', 'machine', 'ml', 'networking', 'nlp', 'none',
       'python', 'pytorch', 'react', 'researcher', 'scientist',
       'software', 'specialization', 'sql', 'tensorflow'], dtype=object)

✅ TASK 6: Encode Education

1️⃣ Apply Label Encoding

In [28]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["education_encoded"] = le.fit_transform(df["education"])


In [29]:
dict(zip(le.classes_, le.transform(le.classes_)))


{'B.Sc': np.int64(0),
 'B.Tech': np.int64(1),
 'M.Tech': np.int64(2),
 'MBA': np.int64(3),
 'PhD': np.int64(4)}
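
The integer codes above follow alphabetical order, so a linear model reads MBA (3) as "more" than M.Tech (2), which is not a meaningful ranking. One alternative, sketched below rather than taken from the notebook, is one-hot encoding with `handle_unknown="ignore"`, which also maps an unseen label to an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The five education levels that appear in the encoder mapping above.
education = pd.DataFrame({"education": ["B.Sc", "B.Tech", "M.Tech", "MBA", "PhD"]})

# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(education).toarray()

# "Masters" is not one of the fitted categories.
unseen = pd.DataFrame({"education": ["Masters"]})
unseen_encoded = encoder.transform(unseen).toarray()

print(encoded.shape)
print(unseen_encoded)
```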

Drop Original Column

In [30]:
df = df.drop(columns=["education"])


✅ TASK 7: Feature and Target Separation

The target variable recruiter_decision was separated from the feature set to prevent data leakage. Text-based features were converted using TF-IDF, and numerical features were selected separately. Both were combined into a single feature matrix to be used for model training.

1️⃣ Define Target (y)

In [31]:
y = df["recruiter_decision"]


2️⃣ Select Numeric Features

In [32]:
X_numeric = df[[
    "experience_years",
    "salary_expectation_",
    "projects_count",
    "education_encoded"
]]


3️⃣ Combine Text + Numeric Features

In [33]:
from scipy.sparse import hstack

X = hstack([X_text, X_numeric])


In [34]:
X.shape


(1000, 32)

✅ TASK 8: Train–Test Split

The dataset was split into 80% training data and 20% testing data using stratified sampling. Stratification keeps the class distribution (about 81% Hire, 19% Reject) consistent in both sets, so neither split is accidentally dominated by one class and the evaluation stays representative.

Holding out a test set also makes overfitting visible. Overfitting occurs when a model memorizes training data instead of learning general patterns, resulting in poor performance on unseen data.



1️⃣ Split the Data

In [35]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


2️⃣ Check Shapes

In [36]:
print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)


Training shape: (800, 32)
Testing shape: (200, 32)


3️⃣ Verify Class Distribution

In [37]:
print("Train class distribution:")
print(y_train.value_counts())

print("Test class distribution:")
print(y_test.value_counts())


Train class distribution:
recruiter_decision
1    650
0    150
Name: count, dtype: int64
Test class distribution:
recruiter_decision
1    162
0     38
Name: count, dtype: int64
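
As a quick check on the output above, the share of Hire rows can be compared across the full dataset and the two splits. The figures below are copied from the printed counts; the calculation is only a sanity check.

```python
# (hire_count, total_rows) copied from the outputs above.
splits = {
    "full":  (812, 1000),
    "train": (650, 800),
    "test":  (162, 200),
}

hire_share = {name: hires / rows for name, (hires, rows) in splits.items()}
for name, share in hire_share.items():
    print(f"{name:>5}: {share:.4f}")

# Stratification should keep every split within a fraction of a percent of the full share.
max_gap = max(abs(share - hire_share["full"]) for share in hire_share.values())
print(f"largest gap from full dataset: {max_gap:.4f}")
```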


✅ TASK 9: Feature Scaling

Feature scaling was applied to numerical features using StandardScaler. Scaling ensures that all numeric features contribute equally to model training. The scaler was fitted only on the training data to prevent data leakage and then applied to the test data. Tree-based models do not require scaling, but distance-based models such as SVM and KNN benefit from it.
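
The rule of fitting on training data only can be illustrated with plain NumPy. The sketch below uses made-up salary values (not taken from the dataset) and standardizes them as z = (x - mean) / std, with the mean and standard deviation computed from the training values alone. This is the same computation `StandardScaler` performs with its default settings.

```python
import numpy as np

# Made-up salary values for illustration only.
train_salary = np.array([40000.0, 60000.0, 80000.0])
test_salary = np.array([100000.0])

# Statistics come from the training data only.
mean = train_salary.mean()
std = train_salary.std()  # population standard deviation (ddof=0)

train_scaled = (train_salary - mean) / std
test_scaled = (test_salary - mean) / std  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1, while the test value falls outside that range because it is measured against statistics it did not contribute to.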

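1️⃣ Identify Numeric Column Indices

In [38]:
# The numeric features were stacked last, so they occupy the final 4 columns of X
numeric_idx = np.arange(X.shape[1] - X_numeric.shape[1], X.shape[1])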

2️⃣ Apply StandardScaler

In [48]:
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack
import numpy as np

scaler = StandardScaler()

# Convert the full sparse matrices to dense arrays so they can be sliced by column
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()

# Separate text and numeric parts
X_train_text = X_train_dense[:, :-4]
X_train_num = X_train_dense[:, -4:]

X_test_text = X_test_dense[:, :-4]
X_test_num = X_test_dense[:, -4:]

# Fit scaler on training numeric features
X_train_num_scaled = scaler.fit_transform(X_train_num)

# Transform test numeric features
X_test_num_scaled = scaler.transform(X_test_num)

# Combine back
X_train_final = np.hstack([X_train_text, X_train_num_scaled])
X_test_final = np.hstack([X_test_text, X_test_num_scaled])


In [40]:
X_train_final.shape


(800, 32)

✅ TASK 10: Model Training

1️⃣ Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)

lr.fit(X_train_final, y_train)

y_pred_lr = lr.predict(X_test_final)


2️⃣ Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)

rf.fit(X_train_dense, y_train)

y_pred_rf = rf.predict(X_test_dense)


3️⃣ Support Vector Machine

In [43]:
from sklearn.svm import SVC

svm = SVC(kernel="rbf")

svm.fit(X_train_final, y_train)

y_pred_svm = svm.predict(X_test_final)


4️⃣ K-Nearest Neighbors

In [44]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train_final, y_train)

y_pred_knn = knn.predict(X_test_final)


✅ TASK 11: Model Evaluation

In [45]:
from sklearn.metrics import accuracy_score

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))


Logistic Regression Accuracy: 0.985
Random Forest Accuracy: 0.95
SVM Accuracy: 0.97
KNN Accuracy: 0.95


In [46]:
from sklearn.metrics import classification_report

print("Logistic Regression Report:")
print(classification_report(y_test, y_pred_lr))

print("Random Forest Report:")
print(classification_report(y_test, y_pred_rf))


Logistic Regression Report:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96        38
           1       0.99      0.99      0.99       162

    accuracy                           0.98       200
   macro avg       0.97      0.98      0.98       200
weighted avg       0.99      0.98      0.99       200

Random Forest Report:
              precision    recall  f1-score   support

           0       0.97      0.76      0.85        38
           1       0.95      0.99      0.97       162

    accuracy                           0.95       200
   macro avg       0.96      0.88      0.91       200
weighted avg       0.95      0.95      0.95       200
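
Because only 38 of the 200 test rows are rejections, accuracy alone hides how each model treats the minority class. The sketch below computes per-class recall and balanced accuracy from confusion counts. The counts are not printed in the notebook; they are inferred here from the rounded reports and the accuracy figures above, so treat them as an assumption.

```python
# Confusion counts inferred from the classification reports above (an assumption):
# (correct rejects, total rejects, correct hires, total hires)
inferred_counts = {
    "Logistic Regression": (37, 38, 160, 162),
    "Random Forest": (29, 38, 161, 162),
}

def balanced_accuracy(correct_0, total_0, correct_1, total_1):
    """Mean of the per-class recalls, so both classes weigh equally."""
    return (correct_0 / total_0 + correct_1 / total_1) / 2

for model, counts in inferred_counts.items():
    correct_0, total_0, correct_1, total_1 = counts
    accuracy = (correct_0 + correct_1) / (total_0 + total_1)
    print(f"{model}: accuracy={accuracy:.3f}, "
          f"reject recall={correct_0 / total_0:.2f}, "
          f"balanced accuracy={balanced_accuracy(*counts):.3f}")
```

Under these inferred counts, Random Forest misses roughly a quarter of the rejections despite its 95% accuracy, which is why the minority-class recall deserves attention alongside the headline number.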



In [47]:
results = {
    "Model": ["Logistic Regression", "Random Forest", "SVM", "KNN"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_svm),
        accuracy_score(y_test, y_pred_knn)
    ]
}

comparison_df = pd.DataFrame(results)
comparison_df


Unnamed: 0,Model,Accuracy
0,Logistic Regression,0.985
1,Random Forest,0.95
2,SVM,0.97
3,KNN,0.95


✅ TASK 12: Build Pipeline + GridSearch

1️⃣ Separate Raw Features (Before TF-IDF)

In [58]:
X_raw = df[[
    "combined_text",
    "experience_years",
    "salary_expectation_",
    "projects_count",
    "education_encoded"
]]

y = df["recruiter_decision"]


2️⃣ Define Column Groups

In [59]:
text_feature = "combined_text"

numeric_features = [
    "experience_years",
    "salary_expectation_",
    "projects_count",
    "education_encoded"
]


3️⃣ Build ColumnTransformer

In [60]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(max_features=500), text_feature),
        ("num", StandardScaler(), numeric_features)
    ]
)


4️⃣ Create Pipeline (Logistic Regression Example)

In [61]:
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])


5️⃣ Train-Test Split Again (Raw Data)

In [62]:
from sklearn.model_selection import train_test_split

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


6️⃣ Fit Pipeline

In [63]:
pipeline.fit(X_train_raw, y_train)

y_pred = pipeline.predict(X_test_raw)

from sklearn.metrics import accuracy_score
print("Pipeline Accuracy:", accuracy_score(y_test, y_pred))


Pipeline Accuracy: 0.985


✅ GridSearchCV (Hyperparameter Tuning)

1️⃣ Define Parameter Grid

In [64]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__C": [0.01, 0.1, 1, 10, 100]
}


2️⃣ Setup GridSearch

In [65]:
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1
)


3️⃣ Fit GridSearch

In [66]:
grid.fit(X_train_raw, y_train)


4️⃣ Results

In [67]:
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Score:", grid.best_score_)

best_model = grid.best_estimator_

y_pred_best = best_model.predict(X_test_raw)

from sklearn.metrics import accuracy_score
print("Final Test Accuracy:", accuracy_score(y_test, y_pred_best))


Best Parameters: {'classifier__C': 10}
Best Cross-Validation Score: 0.9884937336779476
Final Test Accuracy: 0.985
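
Only the best parameter is printed above. To see how close the other candidates were, the full cross-validation table can be read from `cv_results_`. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; in the notebook the same lines would apply to the fitted `grid` object, with the parameter column named `param_classifier__C`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the hiring dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=42
)

demo_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate value of C, ordered by rank.
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)
    [["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table)
```

If the top candidates differ by less than their standard deviations, the choice of C is not strongly supported by the data and a simpler (smaller) value may be just as reasonable.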


✅ TASK 13: Build Hiring Prediction Function

1️⃣ Define Function

In [68]:
import pandas as pd

def predict_candidate(skills, experience, education, certifications, projects, salary, job_role=""):

    # Combine and clean text the same way as in training
    # (skills + certifications + job role; job_role is optional here)
    combined_text = clean_text(f"{skills} {certifications} {job_role}")

    # Encode education using the existing encoder (raises ValueError for unseen labels)
    education_encoded = le.transform([education])[0]

    # Create a one-row DataFrame in the same format as the training data
    input_data = pd.DataFrame({
        "combined_text": [combined_text],
        "experience_years": [experience],
        "salary_expectation_": [salary],
        "projects_count": [projects],
        "education_encoded": [education_encoded]
    })

    # Predict using best_model (Pipeline); column 1 is the probability of "Hire"
    prediction = best_model.predict(input_data)[0]
    probability = best_model.predict_proba(input_data)[0][1]

    decision = "Hire" if prediction == 1 else "Reject"

    return decision, round(probability, 3)


2️⃣ Test It

In [71]:
def predict_candidate(skills, experience, education, certifications, projects, salary, job_role=""):

    # Combine and clean text the same way as in training
    # (skills + certifications + job role; job_role is optional here)
    combined_text = clean_text(f"{skills} {certifications} {job_role}")

    # Handle unseen education labels without raising an error
    if education not in le.classes_:
        education_encoded = 0  # falls back to code 0, which is B.Sc (first class alphabetically)
    else:
        education_encoded = le.transform([education])[0]

    input_data = pd.DataFrame({
        "combined_text": [combined_text],
        "experience_years": [experience],
        "salary_expectation_": [salary],
        "projects_count": [projects],
        "education_encoded": [education_encoded]
    })

    # Column 1 of predict_proba is the probability of "Hire"
    prediction = best_model.predict(input_data)[0]
    probability = best_model.predict_proba(input_data)[0][1]

    decision = "Hire" if prediction == 1 else "Reject"

    return decision, round(probability, 3)


In [72]:
predict_candidate(
    skills="Python Machine Learning SQL",
    experience=3,
    education="Masters",
    certifications="AWS Certification",
    projects=5,
    salary=70000
)


('Hire', np.float64(0.997))

In [73]:
predict_candidate(
    skills="HTML CSS",
    experience=0,
    education="Bachelors",
    certifications="none",
    projects=1,
    salary=120000
)


('Reject', np.float64(0.0))
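
Both test calls above pass education labels ("Masters", "Bachelors") that do not occur in the training data, so they are silently encoded as 0, the same code as B.Sc. A stricter alternative is to reject unknown labels so the caller can correct them. The sketch below is one possible approach; `validate_education` is a hypothetical helper, and the list of known labels is copied from the encoder mapping in Task 6.

```python
# Education labels seen during training (from the LabelEncoder mapping in Task 6).
KNOWN_EDUCATION = ["B.Sc", "B.Tech", "M.Tech", "MBA", "PhD"]

def validate_education(education, known=KNOWN_EDUCATION):
    """Return the label unchanged if it was seen in training, else raise ValueError."""
    if education not in known:
        raise ValueError(
            f"Unknown education level {education!r}; expected one of {known}"
        )
    return education

print(validate_education("M.Tech"))

try:
    validate_education("Masters")
except ValueError as error:
    print(error)
```

Failing loudly trades convenience for correctness: a silent default keeps the function running but can distort the prediction without the caller noticing.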

✅ Final Step: TASK 14 – Conclusion

Final Conclusion

In this project, an AI-based Hiring Prediction System was developed using a structured resume screening dataset. The dataset contained textual, numerical, and categorical features representing candidate qualifications, experience, and job-related information.

During preprocessing, identifier columns were removed and the AI score column was excluded to prevent data leakage. Text features were cleaned and transformed using TF-IDF vectorization, while numerical features were scaled using StandardScaler. Education was encoded for model compatibility. Stratified train-test splitting was applied due to class imbalance.

Four models were trained and compared: Logistic Regression, Random Forest, SVM, and KNN. Logistic Regression achieved the best test accuracy (98.5%) and a mean cross-validated F1 score of about 0.988. Among the five values searched with GridSearchCV, C=10 gave the best cross-validation score, and test accuracy remained 98.5% after tuning.

A complete prediction function was implemented using a Pipeline to simulate a real-world AI hiring system. The model can now accept candidate inputs and return both hiring decisions and probability scores.

This project demonstrates practical knowledge of:

1. Data preprocessing
2. Text feature engineering
3. Accounting for class imbalance with stratified splitting
4. Preventing data leakage
5. Model comparison
6. Hyperparameter tuning
7. Pipeline construction
8. A reusable prediction function built on the trained pipeline

The system reflects how AI can support HR automation by assisting recruiters in resume screening while maintaining structured and explainable decision-making.