# Resume Outcome Predictor for Data Science - EDA

## 1. Initial Overview of the Dataset

- 1000 instances
- 11 different features
- About half of the features are object (string) types and half are integers
- The target variable is `Recruiter Decision`

[See the Dataset](https://www.kaggle.com/datasets/mdtalhask/ai-powered-resume-screening-dataset-2025?resource=download)

In [None]:
import sys
sys.path.append('../')

import seaborn as sns
import matplotlib.pyplot as plt

from src.data_load import extract_zip, load_csv

extract_zip()
df = load_csv()

print("Head of dataset")
print(df.head(), "\n")
print("Dataset info")
df.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print()
print("Dataset shape")
print(df.shape, "\n")
print("Dataset columns")
print(df.columns, "\n")

## 2. Check for Missing Values
- Only the `Certifications` feature has missing values
- A missing value here means the applicant holds no certification, so these rows are kept rather than removed
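
Because the missing entries carry meaning, one option is to turn them into an explicit category before encoding. The following is a minimal sketch on a toy frame, assuming the column is named `Certifications` as in the cells below. The label `"No Certification"` and the sample values are hypothetical choices for illustration, not something the dataset defines.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({"Certifications": ["Cert A", np.nan, "Cert B", np.nan]})

# Replace NaN with an explicit label so "no certificate" becomes its own category.
# The label text is an assumption; any consistent string would work.
toy["Certifications"] = toy["Certifications"].fillna("No Certification")

print(toy["Certifications"].value_counts())
```

Filling with a label instead of dropping rows keeps every applicant in the data and lets a later encoder treat "no certificate" like any other category.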


In [None]:
# printing total null of each feature
print("Amount of Null for each feature")
print(df.isnull().sum())

# printing percentage of null for each feature
print("\nPercentage of Null for each feature")
df.isnull().mean() * 100

## 3. Examine Categorical Features
- Category counts are roughly even for almost all categorical features
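
One way to put a number on "even" is to compare the largest and smallest category shares. This is a small sketch on a toy series, not the real data; the name `imbalance_ratio` is a hypothetical helper variable introduced here.

```python
import pandas as pd

# Toy column standing in for a categorical feature; values are illustrative only.
toy = pd.Series(["A", "B", "C", "A", "B", "C", "A", "B"], name="Education")

# Fraction of rows that fall in each category.
shares = toy.value_counts(normalize=True)

# Ratio of the most common to the least common category; 1.0 means perfectly even.
imbalance_ratio = shares.max() / shares.min()

print(shares.round(3))
print(f"max/min ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 supports the "evenly distributed" reading, while a large ratio flags a feature worth a closer look.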

In [None]:
# printing out each categorical feature and their amounts
print(df['Education'].value_counts(), "\n")
print(df['Certifications'].value_counts(), "\n")
print(df['Job Role'].value_counts(), "\n")
print(df['Recruiter Decision'].value_counts(), "\n")
print(df['Skills'].str.split(', ').explode().value_counts())

# plotting the counts of each of the categorical features
for col in ['Education', 'Certifications', 'Job Role']:

    value_counts = df[col].value_counts()
    min_count = value_counts.min()
    max_count = value_counts.max()

    padding_range = max_count - min_count
    lower_limit = max(min_count - 0.3 * padding_range, 0)
    upper_limit = max_count + 0.3 * padding_range

    plt.figure()
    sns.countplot(x=col, hue=col, data=df, palette='Set1', legend=False)
    plt.tight_layout()
    plt.ylim(lower_limit, upper_limit)

# Graph for each of the individual skills
plt.figure()
ml_skills = df['Skills'].str.split(', ').explode()
sns.countplot(y=ml_skills, hue=ml_skills, palette='Set1', legend=False)


## 4. Examine Numeric Features
- All of these features are also very evenly distributed

In [None]:
# printing basic data for features
print(df[['Experience (Years)', 'Salary Expectation ($)', 'Projects Count']].describe())

# creating histogram plot for each feature
for col in ['Experience (Years)', 'Projects Count']:
    plt.figure()
    # no hue here: hue=col would put each distinct value in its own group,
    # so the per-group KDE would have no spread to estimate
    sns.histplot(x=col, data=df, kde=True).set(title=f"{col} Distribution")

# boxplot for salary expectation
plt.figure()
sns.boxplot(x=df['Salary Expectation ($)']).set(title="Salary Expectation Distribution")

## 5. Examine Target Variable
- The target variable *Recruiter Decision* is heavily imbalanced
- This is unexpected, since rejected applicants would normally outnumber accepted ones
- Plan: address the imbalance with SMOTE before training a model
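
To make the SMOTE plan concrete, here is a simplified NumPy-only illustration of the interpolation idea behind it: each synthetic row lies on the segment between a minority sample and one of its nearest minority neighbours. It is a sketch of the concept, not the imbalanced-learn implementation, and the function name `smote_like_oversample` and the toy values are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours.
    Requires k < len(X_min)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position along the segment between base and neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two numeric features (made-up values).
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

In practice the `SMOTE` class from imbalanced-learn (with `fit_resample`) would likely replace this sketch. Oversampling is normally applied to the training split only, so that synthetic rows do not leak into evaluation data. Plain interpolation also assumes numeric features, so the categorical columns in this dataset would need encoding or a categorical-aware variant first.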

In [None]:
print(df['Recruiter Decision'].value_counts(normalize=True) * 100)

sns.countplot(x='Recruiter Decision', data=df).set(title="Target Variable Distribution")