# Hard/Soft Skills Labelling

**Main Objective:**  

This notebook automates the labeling of extracted `Professional_Skill` entries as **Hard**, **Soft**, or **Unknown**. 

We do this with a **4-bit quantized Llama-3.1 8B Instruct model** running on a P100 GPU. 

**Steps**:
1. **Load** the list of unique skills.
2. **Classify** each skill in batches via the language model.
3. **Evaluate** the model's accuracy with the **LLM-as-a-judge** technique, using the ChatGPT o3 model as the judge.
4. **Write** the results to a CSV.

Additionally, we include the overall breakdown, the model’s Chain of Thought and each skill’s individual classification in the `CoT.log` file.

In [None]:
import numpy as np
import polars as pl
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from hiring_cv_bias.config import CLEANED_SKILLS, HARD_SOFT_SKILLS
from hiring_cv_bias.hard_soft_skills_labelling.utils import (
    batch_classify_skills,
    clean_results,
)
from hiring_cv_bias.utils import load_data

SEED = 42
login(token="[YOUR_TOKEN]")

In [None]:
cv_skills = load_data(CLEANED_SKILLS)
skills = (
    cv_skills.filter(pl.col("Skill_Type") == "Professional_Skill")["Skill"]
    .unique()
    .to_list()
)

Here we load the model (4-bit NF4 quantization with float16 compute) and its tokenizer.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,  # match bnb_4bit_compute_dtype above
    trust_remote_code=True,
    quantization_config=bnb_config,
)
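
As a rough sanity check on why 4-bit loading matters on a single GPU, the sketch below estimates the memory taken by the weights alone. The rounded parameter count of 8 billion is an assumption, and the estimate ignores quantization constants, any layers kept in higher precision, activations and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: "8B" rounded to 8e9 parameters for a back-of-the-envelope figure.
N_PARAMS = 8e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized half precision
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

Under these assumptions, half-precision weights would need about four times the memory of the 4-bit weights, which leaves far more headroom for batched generation.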

The function `batch_classify_skills` in `utils.py` does the work. Let's break it down:

**Inputs:**  
- `model`: A HuggingFace causal LM model instance (e.g. quantized Llama-3.1-8B Instruct).  
- `tokenizer`: Corresponding tokenizer for the model.  
- `skills`: List of skill strings to classify.  
- `batch_size`: Number of skills to send to the model at once.

**Process:**  
1. Iterate over the `skills` list in chunks of size `batch_size`.  
2. For each batch, construct a prompting template that:  
   - Instructs the model to think step by step about the skill.  
   - Provides four concrete examples (e.g. Data Analysis -> Hard; Communication -> Soft).  

3. Tokenize all prompts simultaneously with padding/truncation and move tensors to the model’s device.  
4. Call `model.generate(...)` to produce completions (up to 150 new tokens) for each prompt.  
5. Decode each generated output, extract the final token as the predicted label (`Hard`, `Soft`, or `Unknown`) and append it to `labels`.  
6. Return the full list of labels in the same order as the input skills.
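
The model-independent parts of these steps (chunking, prompt construction and label extraction) can be sketched in plain Python. This is a minimal illustration, not the code in `utils.py`: the helper names `chunked`, `build_prompt` and `extract_label` are hypothetical, the prompt wording is invented, and "final token" is approximated here by the last whitespace-separated word with trailing punctuation removed.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_prompt(skill: str) -> str:
    """Hypothetical prompt; the real template lives in utils.py."""
    return (
        "Classify the skill as Hard, Soft, or Unknown.\n"
        "Think step by step, then end with the label alone.\n"
        "Examples: Data Analysis -> Hard; Communication -> Soft\n"
        f"Skill: {skill}\n"
        "Reasoning:"
    )


def extract_label(completion: str) -> str:
    """Return the last word of the completion, stripped of punctuation."""
    words = completion.split()
    if not words:
        return "Unknown"  # empty completion: nothing to read a label from
    return words[-1].strip(".,:;!?*\"'")


# Stand-in completions, used here instead of calling model.generate(...).
fake_completions = [
    "This needs tooling and training, so the label is Hard.",
    "This is about dealing with people. Soft",
]
labels = [extract_label(text) for text in fake_completions]
print(labels)
```

Because the label is read from the end of the completion, a response cut off at the token limit yields an arbitrary word rather than a valid label, so the raw labels need a validation pass before use.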

In [None]:
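# Pass the model and tokenizer explicitly, matching the inputs listed above.
hard_soft_labels = batch_classify_skills(
    model=model, tokenizer=tokenizer, skills=skills, batch_size=64
)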

In [None]:
(
    hard_soft_labels.count("Hard"),
    hard_soft_labels.count("Soft"),
    hard_soft_labels.count("Unknown"),
)

In [None]:
output_df = pl.DataFrame({"Skill": skills, "label": hard_soft_labels})
output_df.write_csv("hard_soft_skills.csv")

### Estimating Accuracy

We estimated the model's accuracy using **a sample of 100 skills**: 50 predicted as hard skills and 50 predicted as soft skills.

We then compared the model's predictions with evaluations provided by the ChatGPT o3 model.

From this comparison, we derived separate accuracy estimates for hard and soft skills, as well as an overall accuracy score.

Accuracy:
- **Hard**     --> 49/50
- **Soft**     --> 23/50
- **Overall**  --> 72/100
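
The arithmetic behind these figures is shown below as a small worked example. The counts come from the list above; the dictionary layout and the `accuracy` helper are illustrative only. Since the sample is drawn by predicted label, each per-class figure is the share of predictions of that class that the judge confirmed.

```python
def accuracy(agreed: int, total: int) -> float:
    """Fraction of sampled predictions that the judge agreed with."""
    return agreed / total


# (agreed, sampled) per predicted label, taken from the figures reported above.
judged = {"Hard": (49, 50), "Soft": (23, 50)}

per_class = {label: accuracy(a, n) for label, (a, n) in judged.items()}
overall_agreed = sum(a for a, _ in judged.values())
overall_total = sum(n for _, n in judged.values())
overall = accuracy(overall_agreed, overall_total)

for label, value in per_class.items():
    print(f"{label}: {value:.0%}")
print(f"Overall: {overall:.0%}")
```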

In [None]:
hard_soft_df = load_data(HARD_SOFT_SKILLS)
print(
    hard_soft_df["label"]
    .value_counts()
    .filter(pl.col("label").is_in(["Hard", "Soft", "Unknown"]))
    .sort(pl.col("count"), descending=True)
)

Now any label not equal to `Hard`, `Soft` or `Unknown` is replaced with `Unknown`. This step helps correct misclassifications arising from the model’s reasoning (e.g. truncated responses or unexpected formats) since we use the last token of its output as the predicted label.
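
The replacement rule itself is simple. The implementation of `clean_results` is not shown in this notebook, so the snippet below is only a plain-Python stand-in for the rule described above, with a hypothetical helper name and made-up raw labels.

```python
from typing import List

VALID_LABELS = ("Hard", "Soft", "Unknown")


def normalize_labels(labels: List[str]) -> List[str]:
    """Keep valid labels as they are and map everything else to 'Unknown'."""
    return [label if label in VALID_LABELS else "Unknown" for label in labels]


# Made-up raw outputs: two valid labels, a truncated answer and a stray word.
raw = ["Hard", "Soft", "", "Therefore"]
print(normalize_labels(raw))
```

The match is exact and case-sensitive in this sketch, so a variant such as `soft` would also fall back to `Unknown`.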

In [None]:
hard_soft_df = clean_results(hard_soft_df)
print(
    hard_soft_df["label"]
    .value_counts()
    .filter(pl.col("label").is_in(["Hard", "Soft", "Unknown"]))
    .sort(pl.col("count"), descending=True)
)

Accuracy:
- **Hard**     --> 46/50
- **Soft**     --> 16/50
- **Overall**  --> 62/100

In [None]:
hard_skills = (
    hard_soft_df.filter(pl.col("label") == "Hard")
    .sample(50, shuffle=True, seed=SEED)
    .to_numpy()
)
soft_skills = (
    hard_soft_df.filter(pl.col("label") == "Soft")
    .sample(50, shuffle=True, seed=SEED)
    .to_numpy()
)

skills_sample = np.concatenate((hard_skills, soft_skills))
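# Shuffle rows in place with a seeded generator so the sample order is reproducible.
np.random.default_rng(SEED).shuffle(skills_sample)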
skills_sample