# Uncover distributional imbalances 

**Main Objective:**  
This notebook aims to uncover distributional imbalances by combining **demographic data** (age, gender, location) from the **Candidates** sheet of the reverse matching dataset with **parsed skill data** extracted from raw CVs.

## Steps:

- **Analyze distributional skews**  
    - **Gender** distribution 
    - **Location** distribution  
    - **Age** distribution  
    - **Hard vs. soft skills** distribution  

- **Visualize imbalances**  
  - Use bar charts to highlight **over** or **under** representation.

- **Surface parser-induced bias**  
  - Identify patterns where the parser **may** systematically favor or overlook certain groups or skill types.  


**Why This Matters** 

> Detecting these imbalances is **critical** to designing a robust, fair pipeline that flags and mitigates biases introduced by the CV parser, relying only on raw CV inputs and their parsed outputs.  

## Merge & prepare data 
  - Bring together candidate metadata and their extracted skills.  

In [None]:
%load_ext autoreload 
%autoreload 2
import polars as pl

from hiring_cv_bias.config import (
    CLEANED_REVERSE_MATCHING_PATH,
    CLEANED_SKILLS,
    HARD_SOFT_SKILLS,
)
from hiring_cv_bias.exploration.gender_analysis import (
    add_zippia_columns,
    compute_bias_strenght,
    get_category_distribution,
    get_skill_target_share,
    plot_bias_skills_bar,
)
from hiring_cv_bias.exploration.utils import (
    plot_distribution_bar,
    split_df_per_attribute,
)
from hiring_cv_bias.exploration.visualize import (
    compute_and_plot_disparity,
    plot_histogram,
    plot_target_distribution,
)
from hiring_cv_bias.utils import load_data

In [None]:
df_skills = load_data(CLEANED_SKILLS)
df_info_candidates = load_data(CLEANED_REVERSE_MATCHING_PATH)
df_skills.head()

In [None]:
df_info_candidates.sample(5)

## Gender Analysis

In this section, we analyze the distribution of extracted skills across candidates by incorporating **gender** information.

We focus on:

* Exploring the distribution of skills by gender.
* Identifying job roles where the skill sets parsed from CVs exhibit significant gender-based disparities.
* Uncovering potential biases in how skills are emphasized for different genders.

In [None]:
df_skills_with_gender = df_skills.join(
    df_info_candidates.select(["CANDIDATE_ID", "Gender"]), on="CANDIDATE_ID"
)
df_skills_with_gender.head()

In [None]:
gender_counts_df = get_category_distribution(df_info_candidates, "Gender")
gender_counts_df

We begin by examining the overall gender composition of the candidate pool: 53.0% of candidates are male and 45.5% are female, so the pool is fairly balanced, with a small fraction identifying as "Other" or "Unknown".

In [None]:
plot_distribution_bar(
    gender_counts_df,
    x_col="Gender",
    y_col="count",
    x_label="Gender",
    y_label="Number of Candidates",
    title="Candidate Distribution by Gender",
)

Compute for each skill type the counts and percentages of male vs. female candidates and their differences.

**Steps:**  
1. Count males and females per skill type. These counts are **normalized** with respect to the prior distribution. 
2. Calculate total count, percent female/male (rounded to 1 decimal), absolute and percentage differences.  
3. Sort by descending total count.
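
The steps above can be sketched in plain Python on toy data. This is only an illustration, not the implementation of `get_skill_target_share`: the rows, the `prior` shares, and the helper name `skill_type_share` are hypothetical, and we assume that "normalized" means dividing each raw count by that gender's share of the candidate pool.

```python
from collections import Counter

# Toy data: one (skill_type, gender) pair per parsed skill occurrence.
rows = [
    ("IT_Skill", "Male"), ("IT_Skill", "Male"), ("IT_Skill", "Female"),
    ("Language_Skill", "Male"), ("Language_Skill", "Female"),
    ("Language_Skill", "Female"),
]
# Hypothetical prior: share of each gender in the candidate pool.
prior = {"Male": 0.6, "Female": 0.4}

raw = Counter(rows)


def skill_type_share(skill_type):
    # Assumption: normalize by dividing each raw count by the group's prior share.
    male = raw[(skill_type, "Male")] / prior["Male"]
    female = raw[(skill_type, "Female")] / prior["Female"]
    total = male + female
    perc_male = round(100 * male / total, 1)
    perc_female = round(100 * female / total, 1)
    return {
        "count_total": total,
        "perc_male": perc_male,
        "perc_female": perc_female,
        "perc_diff": round(perc_male - perc_female, 1),
    }


shares = {st: skill_type_share(st) for st in ("IT_Skill", "Language_Skill")}
# Step 3: sort skill types by descending (normalized) total count.
ranked = sorted(shares, key=lambda st: shares[st]["count_total"], reverse=True)
```

With this normalization, a group that is under-represented in the pool is not automatically under-represented in every skill type.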

In [None]:
df_category_with_gender = get_skill_target_share(
    df_skills_with_gender,
    gender_counts_df,
    target_col="Gender",
    target_values=["Male", "Female"],
)
df_category_with_gender

We identify the **skill categories** with the largest gender imbalance, highlighting those that are disproportionately associated with either male or female candidates. 

This analysis aims to uncover general categories (e.g. IT skills, professional skills, job_title) that are heavily skewed toward one gender.

In [None]:
gender_percs_dict = {"Male": "perc_male", "Female": "perc_female"}
gender_colors = {"Male": "skyblue", "Female": "lightcoral"}

plot_bias_skills_bar(
    df_category_with_gender,
    "Skill_Type",
    gender_percs_dict,
    "perc_diff",
    "Skill Categories gender imbalance",
    colors=gender_colors,
)

Now we analyze gender representation across parsed skills by computing both **absolute counts** (normalized by prior distribution) and **relative percentages** for male and female candidates. The objective is to identify skills that show a significant **gender imbalance**.

We group the data by each unique combination of `Skill` and `Skill_Type` and compute the following:

- `count_male`: number of male candidates who have the skill 
- `count_female`: number of female candidates who have the skill  
- `count_total = count_male + count_female`  
- `perc_male = (count_male / count_total) × 100`  
- `perc_female = (count_female / count_total) × 100`  
- `perc_diff = perc_male - perc_female`  
- `count_diff = count_male - count_female`  

To quantify the **strength of gender bias** for each skill, we define the following metric:



$$
\text{bias\_strength} = \left| \frac{\text{count\_diff}}{\text{count\_total}} \cdot \log(1 + \text{count\_total}) \right|
$$


This formula combines:
- the **normalized difference** between male and female counts (relative to the total),
- a **logarithmic weighting** that increases confidence in imbalances occurring in larger samples.

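The ratio term is **scale invariant**, while the logarithmic term grows with the sample size, so the resulting score emphasizes disparities that are both large and supported by many observations.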

A higher `bias_strength` indicates a stronger imbalance between male and female representations for that particular skill.
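
As a minimal sketch of the formula (not the code behind `compute_bias_strenght`), the function below evaluates the score for two toy skills. The function name and the counts are hypothetical, and we assume the logarithm is the natural one.

```python
import math


def bias_strength(count_male, count_female):
    """Evaluate |count_diff / count_total * log(1 + count_total)|."""
    count_total = count_male + count_female
    if count_total == 0:
        return 0.0  # no observations, no measurable imbalance
    count_diff = count_male - count_female
    # Assumption: natural logarithm.
    return abs(count_diff / count_total * math.log(1 + count_total))


rare = bias_strength(1, 0)      # extreme ratio, but a single observation
common = bias_strength(80, 20)  # milder ratio, but one hundred observations
balanced = bias_strength(5, 5)  # equal counts give a score of zero
```

Here `common` scores higher than `rare` even though its ratio is less extreme, which is the effect the logarithmic weighting is meant to produce.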

In [None]:
df_gender_bias = compute_bias_strenght(df_skills_with_gender, gender_counts_df)
df_gender_bias

In [None]:
df_gender_bias.sort(pl.col("bias_strength"), descending=True).head(20)

> Note: From these results, we can see that certain skills known to be heavily "gender skewed" in society have been identified. <br>
Although a high perc_diff highlights strong imbalances, it alone would also surface **rare** skills with extreme ratios (for example, 1 occurrence versus 0). By adding a logarithmic term based on counts, we ensure that only skills with both a large percentage difference and a sufficiently high frequency rise to the top. <br>
This `bias_strength` metric therefore uncovers the most widespread, gender-biased skills in our dataset.

In [None]:
plot_bias_skills_bar(
    df_gender_bias,
    "Skill",
    gender_percs_dict,
    "bias_strength",
    "Top Skills with Highest Gender Imbalance",
    top_n=20,
    colors=gender_colors,
)

Next, we examine gender bias across **Job Titles** to understand how work experiences differ for male and female candidates.

* Determine whether observed disparities could reflect **parser errors** or **real world biases** already present in our CV dataset.

* Append two new columns `perc_female_zippia` and `perc_male_zippia` by scraping Zippia (USA) for the percentage of men and women in each role.

This lets us check whether the male/female proportions that we observe in our parsed skills and roles align with the real-world distribution of those occupations. 

Example: If our parsed CV dataset shows that 10% of “Software Engineer” CVs are female, but Zippia reports that 30% of software engineers are female, this gap **may** indicate a parser bias. Conversely, if both sources match closely, it could suggest that any skew is likely a reflection of broader societal patterns rather than a flaw in our extraction process.
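
The comparison in the example can be written as a small check. The helper name `share_gap` and the 10-point `THRESHOLD` are hypothetical choices made for illustration; the notebook does not define a cutoff.

```python
def share_gap(perc_female_parsed, perc_female_reference):
    """Gap, in percentage points, between the parsed share and the reference share."""
    return perc_female_parsed - perc_female_reference


# Hypothetical cutoff: gaps larger than this are flagged for manual review.
THRESHOLD = 10.0

gap = share_gap(10.0, 30.0)  # the "Software Engineer" example from the text
needs_review = abs(gap) > THRESHOLD
```

A flagged role is only a candidate for inspection: the reference figures describe the US labor market, so a gap can also come from differences between the two populations.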

In [None]:
df = df_gender_bias.sort(pl.col("bias_strength"), descending=True).head(30)
job_df = df.filter(pl.col("Skill_Type") == "Job_title")
job_df

In [None]:
job_df = add_zippia_columns(job_df)
job_df

## Geographical Analysis

In this section, we analyze the distribution of extracted skills across candidates by incorporating **geographical position** information.

We focus on:

* Exploring the distribution of skills by geographical position.
* Identifying job roles where the skill sets parsed from CVs exhibit significant location-based disparities.
* Uncovering potential biases in how skills are emphasized for different locations.

In [None]:
df_skill_candidates = df_info_candidates.join(
    df_skills,
    on="CANDIDATE_ID",
).select("CANDIDATE_ID", "LATITUDE", "Skill", "Skill_Type")
display(df_skill_candidates)

This distribution reveals a pronounced geographic imbalance: about **71%** of candidates come from the **North**, while only about 14% come from the Center and 15% from the South, so the pool is heavily skewed toward Northern regions.

This pronounced skew must be taken into account in all subsequent analyses.

In [None]:
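# Bucket candidates into three macro-regions by latitude:
# above 44.5 is NORTH, below 42 is SOUTH, everything else is CENTER.
df_skill_candidates_localized = df_skill_candidates.with_columns(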
    pl.when(pl.col("LATITUDE") > 44.5)
    .then(pl.lit("NORTH"))
    .when(pl.col("LATITUDE") < 42)
    .then(pl.lit("SOUTH"))
    .otherwise(pl.lit("CENTER"))
    .alias("Location")
)
df_location_per_candidate = df_skill_candidates_localized.select(
    "CANDIDATE_ID", "Location"
).unique()
plot_histogram(
    df_location_per_candidate["Location"],
    title="Candidates Geographical Distribution",
    normalize=True,
)

The charts below display the **percentage distribution** of each `Skill_Type` within three geographic regions (North, Center, South). Because each histogram is normalized, differences in absolute CV counts (e.g., 71% of candidates coming from the North) do not affect the shape of the distribution **within** each region. In other words, the y-axis values represent the relative share of each skill type among CVs from that specific area, regardless of the total volume of CVs.

Each bar represents the percentage of occurrences of a given `Skill_Type` among the CVs collected in that area.
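
To illustrate why per-region normalization removes the effect of group size, here is a small sketch on made-up lists of skill types. The helper `skill_type_shares` and the toy regions are hypothetical and are not part of `plot_target_distribution`.

```python
from collections import Counter


def skill_type_shares(skill_types):
    """Percentage of occurrences of each skill type within one region."""
    counts = Counter(skill_types)
    total = sum(counts.values())
    return {skill_type: 100 * n / total for skill_type, n in counts.items()}


# Two toy regions with the same mix of skill types but very different sizes.
north = ["Professional_Skill"] * 60 + ["IT_Skill"] * 20 + ["Job_title"] * 20
south = ["Professional_Skill"] * 6 + ["IT_Skill"] * 2 + ["Job_title"] * 2

north_shares = skill_type_shares(north)
south_shares = skill_type_shares(south)
```

Both regions end up with identical shares, even though the first one has ten times as many occurrences.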

> Note: The only noticeable difference is that in the **South**, the `Job_title` category is slightly more prevalent than `IT_Skill`. In the North and Center, these two categories remain roughly similar. All other proportions (e.g., the dominance of `Professional_Skill` and the marginal share of `DRIVERSLIC`) are nearly identical across regions.  

In [None]:
skills_per_location = split_df_per_attribute(df_skill_candidates_localized, "Location")
plot_target_distribution(skills_per_location, "Geographical Skill Type Distribution")

When comparing how specific skills are distributed across multiple geographic regions, it is crucial to identify which skills exhibit the most pronounced imbalance. 
The technique employed here involves:

1. **Cutting out low-frequency skills**: first applying a *log transformation* to the distribution of total counts, then computing the *z-score* of each count, and finally filtering out the skills whose z-score falls below a certain threshold.  

2. **Gathering frequency counts** of each skill within each group, already scaled by the groups' prior distribution and restricted to the skills kept in step 1.  

3. **Quantifying inequality** for each skill across these groups using a statistical measure.  

4. **Selecting the top skills with the maximum inequality** and visualizing their breakdown to facilitate interpretation.
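
Step 1 can be sketched as follows. This is an assumption-laden illustration rather than the logic inside `compute_and_plot_disparity`: the helper `filter_low_frequency` and the toy counts are hypothetical, we assume the population standard deviation is used, and we assume the `min_threshold` argument seen in the notebook's calls is the z-score cutoff.

```python
import math
import statistics


def filter_low_frequency(total_counts, min_threshold=0.0):
    """Keep skills whose log-count z-score is at least min_threshold."""
    logs = {skill: math.log(count) for skill, count in total_counts.items()}
    mean = statistics.mean(logs.values())
    std = statistics.pstdev(logs.values())  # assumption: population std
    if std == 0:
        return dict(total_counts)  # all counts equal, nothing to filter
    return {
        skill: total_counts[skill]
        for skill, value in logs.items()
        if (value - mean) / std >= min_threshold
    }


# Toy total counts per skill (all counts must be positive for the log).
toy_counts = {"excel": 400, "welding": 90, "forklift": 25, "origami": 1}
kept_strict = filter_low_frequency(toy_counts, min_threshold=0.0)
kept_loose = filter_low_frequency(toy_counts, min_threshold=-1.0)
```

The log transformation compresses the long tail of very frequent skills, so the z-score cutoff is less dominated by a handful of extremely common entries.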

**The Gini Index as an Inequality Metric**

The chosen disparity metric is the **Gini index**, a widely used measure of statistical dispersion. For a given skill, let $n$ be the number of groups (in our case $n=3$) and let $x_i$ denote the frequency of that skill in group $i$. We first sort these values in non‐decreasing order and denote them by $x_{(1)}$, $x_{(2)}$, $\dots$, $x_{(n)}$. The Gini index $G$ is then computed as:

$$
G \;=\; \frac{\displaystyle\sum_{1 \,\le i < j \,\le n} \bigl|x_i - x_j\bigr|}{\,n \,\sum_{i=1}^{n} x_i\,}\,.
$$

**How this works:**  
1. **Intuition**:  
   - It measures the **average absolute difference** between every pair of group values, scaled by the total.  
   - If all $x_i$ are identical, each $\lvert x_i - x_j\rvert = 0$, so $G=0$ (perfect equality).  
   - If one group has **all** of the mass and the others have zero, then the numerator is maximized, driving $G$ toward 1 (maximal inequality).  

2. **Normalization**:  
   - Dividing by $n \sum_{i=1}^{n} x_i$ ensures $G$ ranges between 0 and (just under) 1 regardless of absolute scale or number of groups.  
   - In practice, $G$ approaches 1 when one group’s share dominates and the rest contribute negligibly.

3. **Interpretation**:  
   - A **low Gini** (near 0) indicates the attribute is nearly equally represented across all groups.  
   - A **high Gini** signals that the attribute is concentrated in one or a few groups, revealing a strong disparity.

By sorting each skill’s group counts and computing this Gini formula, we rank skills by how unequal their distributions are. The top‐disparity skill is the one whose frequency differs most sharply between groups.  


> **Note:** The Gini index’s maximum value is $(n-1)/n$. For $n=3$, this gives a range from 0 up to $2/3$ (approximately 0.667).
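
The pairwise formula can be checked numerically with a few lines of Python. The function name `gini` and the toy frequencies are hypothetical; this is a direct transcription of the formula above, not necessarily how the project computes it.

```python
from itertools import combinations


def gini(values):
    """Sum of |x_i - x_j| over all pairs i < j, divided by n * sum(x)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # undefined in theory; treated here as no disparity
    pairwise = sum(abs(a - b) for a, b in combinations(values, 2))
    return pairwise / (n * total)


equal = gini([5, 5, 5])         # identical frequencies in all three groups
concentrated = gini([9, 0, 0])  # one group holds all of the mass
mixed = gini([4, 2, 0])         # an intermediate case
```

For three groups, `concentrated` reaches the upper bound $(n-1)/n = 2/3$, while `equal` is 0.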

In [None]:
prof_skills_per_location = {
    attr: df.filter(pl.col("Skill_Type") == "Professional_Skill")["Skill"]
    for attr, df in skills_per_location.items()
}


location_colors = {"NORTH": "#2d8659", "CENTER": "#dddddd", "SOUTH": "#b03a2e"}

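# Weight each location by the inverse of its number of unique candidates,
# so that skill counts are scaled by group size.
location_weights = {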
    key: 1 / len(df["CANDIDATE_ID"].unique()) for key, df in skills_per_location.items()
}

compute_and_plot_disparity(
    prof_skills_per_location,
    colors=location_colors,
    attribute_name="Professional_Skills",
    weights_dict=location_weights,
)

In [None]:
it_skills_per_location = {
    attr: df.filter(pl.col("Skill_Type") == "IT_Skill")["Skill"]
    for attr, df in skills_per_location.items()
}


compute_and_plot_disparity(
    it_skills_per_location,
    colors=location_colors,
    attribute_name="IT_Skills",
    weights_dict=location_weights,
)

In [None]:
job_title_per_location = {
    attr: df.filter(pl.col("Skill_Type") == "Job_title")["Skill"]
    for attr, df in skills_per_location.items()
}


compute_and_plot_disparity(
    job_title_per_location,
    colors=location_colors,
    attribute_name="Job_titles",
    weights_dict=location_weights,
)

In [None]:
lang_skills_per_location = {
    attr: df.filter(pl.col("Skill_Type") == "Language_Skill")["Skill"]
    for attr, df in skills_per_location.items()
}


compute_and_plot_disparity(
    lang_skills_per_location,
    min_threshold=0.5,
    colors=location_colors,
    attribute_name="Language_Skill",
    weights_dict=location_weights,
)

In [None]:
driverslic_per_location = {
    attr: df.filter(pl.col("Skill_Type") == "DRIVERSLIC")["Skill"]
    for attr, df in skills_per_location.items()
}


compute_and_plot_disparity(
    driverslic_per_location,
    min_threshold=0.0,
    colors=location_colors,
    attribute_name="DRIVERSLIC",
    weights_dict=location_weights,
)

## Age Analysis

In this section, we analyze the distribution of extracted skills across candidates by incorporating **age** information.

We focus on:

* Exploring the distribution of skills by age.
* Identifying job roles where the skill sets parsed from CVs exhibit significant age disparities.
* Uncovering potential biases in how skills are emphasized for different ages.

In [None]:
df_age_candidates = df_info_candidates.join(df_skills, on="CANDIDATE_ID").select(
    "CANDIDATE_ID", "Age_bucket", "Skill", "Skill_Type"
)
df_age_candidates

In [None]:
df_age_per_candidate = df_age_candidates.select("CANDIDATE_ID", "Age_bucket").unique(
    maintain_order=True
)
plot_histogram(
    df_age_per_candidate["Age_bucket"],
    normalize=True,
    title="Candidates Age Distribution",
)

In [None]:
df_age_candidates = df_age_candidates.filter(pl.col("Age_bucket") != "Unknown")
skills_per_age = dict(
    sorted(split_df_per_attribute(df_age_candidates, "Age_bucket").items())
)
plot_target_distribution(skills_per_age, "Age Skill Type Distribution")

In [None]:
prof_skills_per_age = {
    attr: df.filter(pl.col("Skill_Type") == "Professional_Skill")["Skill"]
    for attr, df in skills_per_age.items()
}

age_weights = {
    key: 1 / len(df["CANDIDATE_ID"].unique()) for key, df in skills_per_age.items()
}

age_colors = {"25-34": "#99bdd4", "45-54": "#5499c7", "55-74": "#1e5579"}

compute_and_plot_disparity(
    prof_skills_per_age,
    attribute_name="Professional_Skills",
    weights_dict=age_weights,
    colors=age_colors,
)

In [None]:
it_skills_per_age = {
    attr: df.filter(pl.col("Skill_Type") == "IT_Skill")["Skill"]
    for attr, df in skills_per_age.items()
}


compute_and_plot_disparity(
    it_skills_per_age,
    attribute_name="IT_Skills",
    weights_dict=age_weights,
    colors=age_colors,
)

In [None]:
job_titles_per_age = {
    attr: df.filter(pl.col("Skill_Type") == "Job_title")["Skill"]
    for attr, df in skills_per_age.items()
}


compute_and_plot_disparity(
    job_titles_per_age,
    attribute_name="Job_titles",
    weights_dict=age_weights,
    colors=age_colors,
)

In [None]:
lang_skills_per_age = {
    attr: df.filter(pl.col("Skill_Type") == "Language_Skill")["Skill"]
    for attr, df in skills_per_age.items()
}


compute_and_plot_disparity(
    lang_skills_per_age,
    attribute_name="Language_Skills",
    weights_dict=age_weights,
    colors=age_colors,
)

In [None]:
driverslic_per_age = {
    attr: df.filter(pl.col("Skill_Type") == "DRIVERSLIC")["Skill"]
    for attr, df in skills_per_age.items()
}


compute_and_plot_disparity(
    driverslic_per_age,
    min_threshold=0.0,
    attribute_name="DRIVERSLIC",
    weights_dict=age_weights,
    colors=age_colors,
)

## Hard-Soft Skills Analysis

In this section, we analyze the distribution of extracted skills across candidates by incorporating the **hard/soft skill label** (column `Professional_Skill`).

We examine how this label relates to the three attributes already explored (gender, location, and age), with the aim of investigating possible biases in more depth. 

In [None]:
hard_soft_skills = load_data(HARD_SOFT_SKILLS)
df_skills_with_label = df_skills.join(hard_soft_skills, on="Skill")
df_skills_with_gender = df_skills_with_label.join(
    df_info_candidates.select(["CANDIDATE_ID", "Gender"]), on="CANDIDATE_ID"
)
df_skills_with_gender

Let's now see the gender distribution on this section of candidates. 

In [None]:
gender_counts_df = get_category_distribution(df_info_candidates, "Gender")
gender_counts_df

In [None]:
df_gender_bias = get_skill_target_share(
    df_skills_with_gender,
    gender_counts_df,
    target_col="Gender",
    target_values=["Male", "Female"],
    skill_col=["label"],
)
df_gender_bias

As the chart below shows, soft skills are more prevalent among female candidates (counts are normalized, as before, by the prior distribution).

In [None]:
plot_bias_skills_bar(
    df_gender_bias,
    "label",
    gender_percs_dict,
    "perc_diff",
    "Top Skills with Highest Gender Imbalance",
    colors=gender_colors,
    figsize=(10, 6),
)

### Hard/Soft Skills: Geographical Analysis

In [None]:
df_skills_with_location = df_skills_with_label.join(
    df_skill_candidates_localized.select("CANDIDATE_ID", "Location", "Skill"),
    on=["CANDIDATE_ID", "Skill"],
    coalesce=True,
)
df_skills_with_location

In [None]:
location_counts_df = get_category_distribution(
    df_skill_candidates_localized.unique("CANDIDATE_ID"), "Location"
)
location_counts_df

In [None]:
df_location_bias = get_skill_target_share(
    df_skills_with_location,
    location_counts_df,
    target_col="Location",
    target_values=["NORTH", "CENTER", "SOUTH"],
    skill_col=["label"],
)
df_location_bias

As shown in the chart below, the Northern bars are consistently the tallest, indicating that candidates from the **North** have, on average, **more skills**.

In [None]:
location_percs_dict = {
    "NORTH": "perc_north",
    "CENTER": "perc_center",
    "SOUTH": "perc_south",
}

plot_bias_skills_bar(
    df_location_bias,
    "label",
    location_percs_dict,
    "perc_diff",
    "Top Skills with Highest Location Imbalance",
    colors=location_colors,
    figsize=(10, 6),
)

### Hard/Soft Skills: Age Analysis

In [None]:
df_skills_with_age = df_skills_with_label.join(
    df_info_candidates.select("CANDIDATE_ID", "Age_bucket"),
    on=["CANDIDATE_ID"],
    coalesce=True,
)
df_skills_with_age

In [None]:
age_counts_df = get_category_distribution(df_info_candidates, "Age_bucket")
age_counts_df

In [None]:
df_age_bias = get_skill_target_share(
    df_skills_with_age,
    age_counts_df,
    target_col="Age_bucket",
    target_values=["25-34", "45-54", "55-74"],
    skill_col=["label"],
)
df_age_bias

In [None]:
age_percs_dict = {
    "25-34": "perc_25-34",
    "45-54": "perc_45-54",
    "55-74": "perc_55-74",
}

plot_bias_skills_bar(
    df_age_bias,
    "label",
    age_percs_dict,
    "perc_diff",
    "Top Skills with Highest Age Imbalance",
    colors=age_colors,
    figsize=(10, 6),
)