# Data exploration

**Main objective**

This notebook has three main objectives:
* Explore the distribution of parsed skills and job information.
* Analyze the distribution of the number of skills per candidate and identify outliers.
* Examine the distributions of hard and soft skills.

In [None]:
%load_ext autoreload 
%autoreload 2
import matplotlib.pyplot as plt
import polars as pl

from hiring_cv_bias.config import CLEANED_SKILLS, HARD_SOFT_SKIILS
from hiring_cv_bias.exploration.utils import plot_boxplot, plot_distribution_bar
from hiring_cv_bias.exploration.visualize import (
    plot_skills_frequency,
    plot_skills_per_category,
    plot_top_skills_for_job_title,
)
from hiring_cv_bias.utils import load_data

In [None]:
cv_skills = load_data(CLEANED_SKILLS)
cv_skills.head(10)

### Skill Extraction by Category

The table and bar chart below show the total number of skills extracted for each `Skill_Type`:

- **Professional_Skill**: ~68,000 occurrences, the most frequently identified category.
- **Job_title**: ~23,000 occurrences.
- **IT_Skill**: ~22,000 occurrences.
- **Language_Skill**: ~13,000 occurrences.
- **DRIVERSLIC**: ~2,500 occurrences, the least common category.

In [None]:
skill_counts = (
    cv_skills.group_by("Skill_Type")
    .agg(pl.count("Skill_Type").alias("count"))
    .sort("count", descending=True)
)
skill_counts

In [None]:
plot_skills_frequency(cv_skills)

### Top N Skills by Category

The `plot_skills_per_category` function can be used to visualize the most frequent skills within any given skill category. It:

1. Filters the `cv_skills` DataFrame by the chosen `skill_type`.  
2. Computes the frequency of each individual skill in that category.  
3. Plots the top _n_ skills by their occurrence count.  

With the parameters below, we are displaying the **top 10** most common skills for the **Job_title** category.

In [None]:
skill_pd = plot_skills_per_category(cv_skills, "Job_title", top_n=10)

#### Top N Skills for a Given Job Title

`plot_top_skills_for_job_title` shows the most common skills of a chosen category among candidates who hold a specific job title. It:

1. Picks out all candidates with the chosen job title.  
2. Collects their skills of **the specified category**.  
3. Counts how often each skill appears.  
4. Plots the top _n_ skills by frequency.

The cells below apply it to **“Commis Chef (m/f)”** candidates, showing the **top 20** `IT_Skill` and `Professional_Skill` entries and the **top 10** `Language_Skill` and `DRIVERSLIC` entries.

In [None]:
plot_top_skills_for_job_title(cv_skills, "Commis Chef (m/f)", "IT_Skill", top_n=20)

In [None]:
plot_top_skills_for_job_title(
    cv_skills, "Commis Chef (m/f)", "Professional_Skill", top_n=20
)

In [None]:
plot_top_skills_for_job_title(
    cv_skills, "Commis Chef (m/f)", "Language_Skill", top_n=10
)

In [None]:
plot_top_skills_for_job_title(cv_skills, "Commis Chef (m/f)", "DRIVERSLIC", top_n=10)

### Counting Skills per Candidate

In this step, we aim to:

- **Visualize the distribution** of the number of skills extracted per candidate.
- **Spot and investigate outliers**, i.e. candidates who list an unusually high number of skills.

In [None]:
skill_counts = cv_skills.group_by("CANDIDATE_ID").len()

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(skill_counts["len"], bins=40, edgecolor="black")
ax.set_title("Distribution of Number of Skills per Candidate", fontsize=14, pad=10)
ax.set_xlabel("Number of Skills", fontsize=12)
ax.set_ylabel("Number of Candidates", fontsize=12)
ax.grid(axis="y", linestyle="--", linewidth=0.7, alpha=0.7)

In [None]:
plot_boxplot(
    data=skill_counts["len"],
    labels=None,
    title="Boxplot of Skills per Candidate",
    xlabel="Number of Skills",
    colors=["orange"],
    figsize=(10, 2),
)

### Hard vs Soft Skills Analysis


In this section, we investigate the distribution of **hard** and **soft** skills extracted from the candidate CVs (skill type `Professional_Skill`).

> The logic used to label each skill as *hard* or *soft* is documented in the `hard_soft_labelling.ipynb` notebook.

Here we present:
- **Total distribution** of hard vs soft skills across all candidates.
- **Distribution per candidate**: number of hard vs soft skills per individual, to highlight representation gaps.

In [None]:
hard_soft_skills = pl.read_csv(HARD_SOFT_SKIILS)
hard_soft_skills

In [None]:
cv_skills_with_label = cv_skills.join(hard_soft_skills, on="Skill")
cv_skills_with_label

* **Hard skills** dominate the dataset, accounting for roughly 80% of all skills extracted from the CVs.

* **Soft skills** are markedly under-represented, at about one skill in six.

> A small remainder is classified as “Unknown”: terms that did not match either taxonomy, which highlights the presence of noise in the parsed skills data (see `hard_soft_labelling.ipynb`).

In [None]:
counts = cv_skills_with_label["label"].value_counts()
plot_distribution_bar(
    counts,
    "label",
    "count",
    "Label",
    "Frequency",
    "Absolute frequency for Hard/Soft skills",
)

As we can see from the analysis below:

* **Most applicants (about two thirds)** mention at least one hard **and** one soft skill, suggesting reasonably balanced self-presentation.

* **One third list exclusively hard skills**; they highlight technical competence but leave behavioural strengths implicit.

* **Soft-skill-only CVs** are extremely rare; almost nobody relies on soft skills without pairing them with technical ones.

A negligible fraction have neither hard nor soft skills (the "No skills" category; given how the per-candidate table is built, these CVs contain only "Unknown" items), indicating either very short resumes or parsing errors that merit inspection.

These proportions flag a potential representational gap: hard skills dominate, while soft skills appear almost exclusively in combination with hard ones rather than standing alone.

In [None]:
per_cand = (
    cv_skills_with_label.group_by(["CANDIDATE_ID", "label"])
    .agg(pl.len())
    .pivot(
        index="CANDIDATE_ID",
        on="label",
        values="len",
    )
    .fill_null(0)
    .with_columns(
        # Nulls were already replaced by 0 above, so the counts can be summed directly.
        (pl.col("Hard") + pl.col("Soft") + pl.col("Unknown")).alias("total"),
    )
)

In [None]:
cats = per_cand.with_columns(
    [
        pl.when((pl.col("Hard") == 0) & (pl.col("Soft") == 0))
        .then(pl.lit("No skills"))
        .when((pl.col("Hard") > 0) & (pl.col("Soft") == 0))
        .then(pl.lit("Only hard"))
        .when((pl.col("Hard") == 0) & (pl.col("Soft") > 0))
        .then(pl.lit("Only soft"))
        .otherwise(pl.lit("Both"))
        .alias("category")
    ]
)

counts = cats.group_by("category").len().sort("len", descending=True)

plot_distribution_bar(
    counts, "category", "len", "Label", "Frequency", "Frequency for Hard/Soft skills"
)

In [None]:
per_cand = per_cand.with_columns(
    (pl.col("Hard") / pl.col("total")).alias("hard_share"),
)
per_cand

The histogram below shows how the share of technical competences (**hard skills**) varies across candidates.

For each candidate we compute the **hard skill share**, i.e. the number of hard skills divided by the total number of skills listed (hard + soft + any unknown items), and plot these shares, which range from 0 to 1, in a histogram.
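
Written out, with $n_{\text{hard}}$, $n_{\text{soft}}$ and $n_{\text{unknown}}$ denoting a candidate's counts per label (notation introduced here only for illustration), the share is:

$$
\text{hard\_share} = \frac{n_{\text{hard}}}{n_{\text{hard}} + n_{\text{soft}} + n_{\text{unknown}}}
$$

For instance, a hypothetical CV with 6 hard skills, 2 soft skills and no unknown items has a hard skill share of $6 / (6 + 2 + 0) = 0.75$.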

In [None]:
data = per_cand["hard_share"].to_numpy()

plt.figure(figsize=(8, 4))
plt.hist(data, bins=20, edgecolor="black")
plt.xlabel("Share of hard skills per CV")
plt.ylabel("Number of candidates")
plt.title("Distribution of Hard-Skill Share Across CVs")
plt.tight_layout()
plt.show()

The box plot below highlights several outliers: candidates reporting 40+ hard skills or 15+ soft skills.

Such counts are well beyond the typical range and **may** indicate parsing errors (e.g. bullet points misclassified as skills, or the same skill split into multiple tokens).
These outliers should be reviewed manually to confirm whether they reflect genuine, unusually rich profiles or artefacts produced by the parsing pipeline.

In [None]:
plot_boxplot(
    data=[per_cand["Hard"], per_cand["Soft"]],
    labels=["Hard", "Soft"],
    title="Hard vs Soft Skill Distribution",
    xlabel="Number of Skills",
    colors=["#1f77b4", "#ff7f0e"],
    figsize=(6, 4),
)