# Generative Pseudo Labeling with google/gemma-2-9b-it & Classification of "Work Behaviors and Skills"
* Notebook by Adam Lang
* Date: 2/12/2025

# Overview
* The goal of this notebook is to extract "Work Behaviors" and "Skills" from a dataset. 


# Workflow
* We will use chain-of-thought prompting with the Gemma-2-9b-it LLM to extract "behaviors" and "skills" from a pre-processed CSV file. The file includes an "english_message" column, which holds the messages after translation from non-English languages into English.
* Originally I tried Claude-3.5-Sonnet via AWS Bedrock, but the number of API calls was too high for my limited Bedrock access. So we will instead use Gemma-2-9b-it, an open model available from Hugging Face.
* I will then use the LLM for Generative Pseudo Labeling: it generates a label for each set of behaviors and skills, so we can build a pseudo-classification of the data.
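
The prompts used later in this notebook wrap each instruction in Gemma's turn markers (`<start_of_turn>user` ... `<end_of_turn>`, followed by `<start_of_turn>model`). A minimal sketch of that wrapping is shown here; the helper name `build_gemma_prompt` is hypothetical and is not part of the notebook or of any library. With a loaded tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` from `transformers` is the usual way to produce a similar string.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma-style turn markers (hypothetical helper)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message.strip()}\n"
        "<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# The trailing "<start_of_turn>model" line cues the model to begin its answer
prompt = build_gemma_prompt('Text: "Thank you for your hard work this quarter."')
print(prompt)
```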

# Install Dependencies

In [1]:
%%capture 
!pip install transformers torch accelerate bitsandbytes tqdm

# Code Needed for SageMaker

In [2]:
%%capture 
!pip install einops

In [3]:
%%capture 
!pip install --upgrade pandas fsspec # sagemaker dependency

In [4]:
%%capture 
!pip install seaborn
!pip install s3fs #sagemaker dependency

In [5]:
%%capture  
## upgrade accelerate to use device_map 
!pip install --upgrade accelerate ## this is for compatibility with `bitsandbytes` 

In [6]:
## check accelerate version after upgrade
import accelerate
print(f"Accelerate version: {accelerate.__version__}") 

Accelerate version: 1.3.0


In [7]:
%%capture 
## upgrade torchvision
!pip install --upgrade torchvision # if you need to upgrade torchvision run this line
!pip install --upgrade torch #upgrade torch version


**Note: Restart kernel before running next cell**

In [2]:
# check versions of torch available
import torch
import torchvision 

# print versions
print(f"PyTorch version: {torch.__version__}") 
print(f"Torchvision version: {torchvision.__version__}")

PyTorch version: 2.6.0+cu124
Torchvision version: 0.21.0+cu124


## Check if GPU is Available

In [3]:
# check if GPU is available 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
print(f"Using device: {device}")

# set device for PyTorch operations
if device.type == "cuda":
    torch.cuda.set_device(0) # you can use a different device ID if you have multiple GPUs running

Using device: cuda


# Import Dependencies

In [4]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator
from tqdm.auto import tqdm

# Load Data from S3 Bucket on AWS -- if using AWS

In [None]:
# import boto3
# import pandas as pd
# from sagemaker import get_execution_role

# # Create S3 client
# conn = boto3.client('s3')

# # S3 bucket name
# bucket = '<your bucket here>'

# # Correct data_key (remove the leading slash)
# data_key = '<file source>/inputs/df_triplets_experiment.csv'

# # Construct the full S3 URI
# data_location = f's3://{bucket}/{data_key}'

# # Load the DataFrame -->  variable is `df_qlik`
# try:
#     df_qlik = pd.read_csv(data_location)
#     print(df_qlik.head())
# except Exception as e:
#     print(f"An error occurred: {e}")
    
#     # List objects in the bucket to check if the file exists
#     response = conn.list_objects_v2(Bucket=bucket, Prefix='<file source>/inputs/')
    
#     if 'Contents' in response:
#         print("Files in the specified S3 location:")
#         for obj in response['Contents']:
#             print(obj['Key'])
#     else:
#         print("No files found in the specified S3 location.")



# Load Data from local files if not using AWS

In [92]:
## LOAD DATA
df = pd.read_csv('df_triplets_experiment.csv')

In [93]:
## check df head
df.head()

Unnamed: 0,award_date,award_type,english_message,department
0,2025-01-01,Applaud 2,Kam Wei has been instrumental in helping to cl...,6600 Solution Consultant
1,2025-01-01,Applaud 2,Sid is a rare Player-Coach that inspires his c...,6630 1st Line Mgr - SC
2,2025-01-01,Applaud 2,"Myung Soo rejoined Qlik in August 2024, but qu...",6600 Solution Consultant
3,2025-01-01,Applaud 2,"Congrats, Sean!",6600 Solution Consultant
4,2025-01-01,Applaud 2,Thank you Jason for being persistent and consi...,6120 QC Sales Enterprise


# Filter data for sample testing
* We will do this 2 ways:

1. Random sample or just filter based on number of rows.
2. Filter based on timeframe of months to a year.
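
The first option is not demonstrated in the cells that follow. A minimal sketch of both variants is given here, using only the standard library on a toy list of rows (the messages are invented placeholders, not rows from the dataset). On the real DataFrame, `df.head(n)` and `df.sample(n=n, random_state=seed)` are the pandas equivalents.

```python
import random

# Toy stand-in for the DataFrame rows (invented placeholders)
rows = [{"row_id": i, "english_message": f"message {i}"} for i in range(100)]


def first_n(rows, n):
    """Filter by number of rows: keep the first n."""
    return rows[:n]


def random_sample(rows, n, seed=42):
    """Random sample of n rows without replacement; the seed makes it repeatable."""
    return random.Random(seed).sample(rows, n)


head_rows = first_n(rows, 5)
sample_a = random_sample(rows, 5)
sample_b = random_sample(rows, 5)  # same seed, so the same rows are drawn

print([r["row_id"] for r in head_rows])
print([r["row_id"] for r in sample_a])
```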

## 2. Change `award_date` to datetime object

In [39]:
## dtypes checks
df.dtypes

award_date         object
award_type         object
english_message    object
department         object
dtype: object

In [40]:
from datetime import datetime, timedelta

# convert `award_date` to datetime format for filtering
df['award_date'] = pd.to_datetime(df['award_date'])


In [41]:
## check dtypes again
df.dtypes

award_date         datetime64[ns]
award_type                 object
english_message            object
department                 object
dtype: object

In [42]:
## view sample of award_date
df['award_date'].sample(1)

15908   2024-02-06
Name: award_date, dtype: datetime64[ns]

## 3. Filter based on specific date range

In [43]:
df['award_date'].min()

Timestamp('2018-01-02 00:00:00')

In [44]:
df['award_date'].max()

Timestamp('2025-01-01 00:00:00')

In [58]:
from datetime import datetime
import pandas as pd

def filter_date_range(df, start_date=None, end_date=None, n_months=None):
    """
    Filter the DataFrame based on a date range.
    
    Parameters:
    df (pandas.DataFrame): The DataFrame to filter
    start_date (str or datetime): The start date of the range (inclusive)
    end_date (str or datetime): The end date of the range (inclusive)
    n_months (int): Number of months to look back from end_date (if start_date is not provided)
    
    Returns:
    pandas.DataFrame: Filtered DataFrame
    """
    # Create a copy of the DataFrame to avoid SettingWithCopyWarning
    df = df.copy()
    
    # If end_date is not provided, use the current date
    if end_date is None:
        end_date = datetime.now()
    elif isinstance(end_date, str):
        end_date = pd.to_datetime(end_date)
    
    # If start_date is not provided, calculate it based on n_months
    if start_date is None:
        if n_months is None:
            raise ValueError("Either start_date or n_months must be provided")
        start_date = end_date - pd.DateOffset(months=n_months)
    elif isinstance(start_date, str):
        start_date = pd.to_datetime(start_date)
    
    # Ensure the `award_date` column is in datetime format
    df['award_date'] = pd.to_datetime(df['award_date'])
    
    # Filter the DataFrame (both endpoints are inclusive)
    filtered_df = df[(df['award_date'] >= start_date) & (df['award_date'] <= end_date)]
    
    return filtered_df


In [59]:
# Example usage:
# Filter data for the last 6 months from a specific end date
df_last_6_months = filter_date_range(df, n_months=6, end_date='2018-06-12')

# Filter data between two specific dates
df_specific_range = filter_date_range(df, start_date='2018-01-01', end_date='2018-06-12')

# Filter data for the last 6 months from today
df_recent_6_months = filter_date_range(df, n_months=6)

print(f"Number of records in last 6 months from 2018-06-12: {len(df_last_6_months)}")
print(f"Number of records between 2018-01-01 and 2018-06-12: {len(df_specific_range)}")
print(f"Number of records in last 6 months from today: {len(df_recent_6_months)}")

Number of records in last 6 months from 2018-06-12: 2260
Number of records between 2018-01-01 and 2018-06-12: 2260
Number of records in last 6 months from today: 8086
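
The first two counts match because both windows contain the same rows: six months before 2018-06-12 is 2017-12-12, and both that date and 2018-01-01 fall before the earliest record in the data (2018-01-02, from `min()` above). The sketch below reproduces the month arithmetic with the standard library only; `months_before` is a hypothetical helper that mimics `end_date - pd.DateOffset(months=n)`, including clamping the day when the target month is shorter.

```python
import calendar
from datetime import date


def months_before(end: date, n_months: int) -> date:
    """Return the date n_months before `end`, clamping the day to the month's length."""
    total = end.year * 12 + (end.month - 1) - n_months
    year, month = divmod(total, 12)
    month += 1
    day = min(end.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


end = date(2018, 6, 12)
start_from_offset = months_before(end, 6)  # start used by n_months=6
start_explicit = date(2018, 1, 1)          # start used by start_date='2018-01-01'
earliest_record = date(2018, 1, 2)         # minimum award date reported above

print(start_from_offset)
# Both window starts precede the earliest record, so both filters keep the same rows
print(start_from_offset < earliest_record and start_explicit < earliest_record)
```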


In [87]:
# Specifying both start and end dates
df_2_weeks = filter_date_range(df, start_date='2024-10-15', end_date='2024-10-20')


In [88]:
print(f"Number of records in 2 weeks : {len(df_2_weeks)}")

Number of records in 2 weeks : 356
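
Despite the variable name, the range 2024-10-15 to 2024-10-20 spans six calendar days, because both endpoints are inclusive in `filter_date_range`. A quick check with the standard library, which also shows where a full 14-day window ending on the same day would start:

```python
from datetime import date, timedelta

start, end = date(2024, 10, 15), date(2024, 10, 20)
inclusive_days = (end - start).days + 1    # +1 because both endpoints are kept
two_week_start = end - timedelta(days=13)  # start of a 14-day window ending on `end`

print(inclusive_days)   # 6
print(two_week_start)   # 2024-10-07
```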


In [None]:
# # Using n_months and end_date (similar to your original function)
# df_range1 = filter_date_range(df, n_months=6, end_date='2018-06-12')


In [None]:
# # Using n_months from today
# df_range3 = filter_date_range(df, n_months=6)


In [None]:
# # Using start_date and n_months (note: n_months is ignored when start_date is given; end_date defaults to today)
# df_range4 = filter_date_range(df, start_date='2018-01-01', n_months=6)

# Hugging Face Login

In [14]:
## hf hub login
from huggingface_hub import notebook_login
notebook_login()


# Skills & Behavior Processing Script using Gemma-2-9b-It LLM
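
As a rough guide to why the class below loads the model in 4-bit precision, the weight memory can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: it assumes roughly 9 billion parameters (taken from the model name) and ignores quantization constants, any layers kept in higher precision, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 9e9  # assumption: "9b" in the model name, not an exact count

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```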

In [62]:
class SkillBehaviorProcessor:
    def __init__(self, model_name="google/gemma-2-9b-it"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        
        # Quantization configuration
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
        
        # Load model
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=bnb_config,
            device_map="auto",
            torch_dtype=torch.bfloat16
        )

        # Initialize Accelerator
        self.accelerator = Accelerator()
        self.model = self.accelerator.prepare(self.model)

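    def generate_text(self, prompt, max_new_tokens=256):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode only the newly generated tokens; decoding outputs[0] in full would
        # also return the prompt, including its format template and examples
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)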

    def generate_pseudo_label(self, text):
        prompt = f"""<start_of_turn>user
Given the following text, identify and extract key phrases related to work behaviors, hard skills, and soft skills demonstrated. Use a step-by-step approach:

Text: "{text}"

Step 1: Identify potential work behaviors, hard skills, and soft skills mentioned in the text.
Step 2: For each identified item, determine if it's a work behavior, hard skill, or soft skill.
Step 3: Formulate concise phrases for each work behavior, hard skill, and soft skill.
Step 4: List the work behaviors, hard skills, and soft skills separately.

Output the results in the following format:

Work Behaviors:
[List work behaviors here, one per line. If none found, write "No behaviors found."]

Hard Skills:
[List hard skills here, one per line. If none found, write "No hard skills found."]

Soft Skills:
[List soft skills here, one per line. If none found, write "No soft skills found."]

If no behaviors or skills are found at all, output "None found." for each category.

Here are two examples:

Example 1:
Text: "John consistently meets deadlines and delivers high-quality work. He's proficient in Python and SQL, and always communicates clearly with team members."

Work Behaviors:
Meets deadlines consistently
Delivers high-quality work

Hard Skills:
Python proficiency
SQL proficiency

Soft Skills:
Clear communication

Example 2:
Text: "Thank you for your hard work this quarter."

Work Behaviors:
No behaviors found.

Hard Skills:
No hard skills found.

Soft Skills:
No soft skills found.

Now, please analyze the provided text and output the results in the same format.
<end_of_turn>
<start_of_turn>model
"""

        return self.generate_text(prompt)
    

    def classify_behaviors_and_skills(self, behaviors, hard_skills, soft_skills):
        prompt = f"""<start_of_turn>user
Given the following work behaviors, hard skills, and soft skills, classify them into general categories:

Work Behaviors: {behaviors}
Hard Skills: {hard_skills}
Soft Skills: {soft_skills}

Provide a general classification for the behaviors, hard skills, and soft skills. Output the results in the following format:
Behavior Class: [General category for behaviors, or "None" if no behaviors]
Hard Skill Class: [General category for hard skills, or "None" if no hard skills]
Soft Skill Class: [General category for soft skills, or "None" if no soft skills]

If there are neither skills nor behaviors found, output "None" for all categories.

Here are two examples:

Example 1:
Work Behaviors: Meets deadlines consistently, Delivers high-quality work
Hard Skills: Python proficiency, SQL proficiency
Soft Skills: Clear communication

Output:
Behavior Class: Time Management and Quality Assurance
Hard Skill Class: Programming and Database Management
Soft Skill Class: Communication

Example 2:
Work Behaviors: No behaviors found.
Hard Skills: No hard skills found.
Soft Skills: No soft skills found.

Output:
Behavior Class: None
Hard Skill Class: None
Soft Skill Class: None

Now, please classify the provided behaviors and skills in the same format.
<end_of_turn>
<start_of_turn>model
"""

        return self.generate_text(prompt)
    
    

    def process_batch(self, batch):
        with tqdm(total=3, desc="Batch processing steps", leave=False) as pbar:
            pseudo_labels = [self.generate_pseudo_label(text) for text in batch['english_message']]
            pbar.update(1)
        
            work_behaviors, hard_skills, soft_skills = zip(*[self.extract_skills_and_behaviors(label) for label in pseudo_labels])
            pbar.update(1)
        
            classifications = [self.classify_behaviors_and_skills(b, h, s) for b, h, s in zip(work_behaviors, hard_skills, soft_skills)]
            behavior_class, hard_skill_class, soft_skill_class = zip(*[self.extract_classifications(c) for c in classifications])
            pbar.update(1)
        return pd.DataFrame({
            'work_behaviors': work_behaviors,
            'hard_skills': hard_skills,
            'soft_skills': soft_skills,
            'behavior_class': behavior_class,
            'hard_skill_class': hard_skill_class,
            'soft_skill_class': soft_skill_class
        })
    
    
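    @staticmethod
    def extract_skills_and_behaviors(text):
        import re  # local import keeps this cell self-contained

        headers = {'work behaviors': 'behaviors',
                   'hard skills': 'hard_skills',
                   'soft skills': 'soft_skills'}
        items = {'behaviors': [], 'hard_skills': [], 'soft_skills': []}
        current_category, section_started = None, False
        for raw_line in text.split('\n'):
            line = raw_line.strip()
            # Tolerate markdown emphasis around headers, e.g. "**Work Behaviors:**"
            key = line.strip('*#: ').lower()
            if key in headers:
                # Keep only the last block per category, so template or example
                # text that precedes the final answer is discarded
                current_category, section_started = headers[key], False
                items[current_category] = []
            elif not line:
                # A blank line closes a section once it has content
                if section_started:
                    current_category = None
            elif current_category:
                section_started = True
                # The prompt asks for one item per line, so accept plain lines and
                # drop optional list markers ("1.", "12)", "-", "*", "•")
                item = re.sub(r'^(?:\d+[.)]|[-*•])\s+', '', line).replace('**', '').strip()
                low = item.lower().rstrip('.')
                # Skip placeholders such as "No behaviors found." or "None found."
                if item and not (low.startswith('no') and low.endswith('found')):
                    items[current_category].append(item)
        return (', '.join(items['behaviors']),
                ', '.join(items['hard_skills']),
                ', '.join(items['soft_skills']))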

    @staticmethod
    def extract_classifications(classifications):
        behavior_class = hard_skill_class = soft_skill_class = ''
        for line in classifications.split('\n'):
            # Tolerate markdown emphasis, e.g. "**Behavior Class:** Communication"
            line = line.strip().lstrip('*# ')
            if ':' not in line:
                continue
            value = line.split(':', 1)[1].replace('**', '').strip()
            # If a label appears more than once, the last occurrence wins
            if line.startswith('Behavior Class'):
                behavior_class = value
            elif line.startswith('Hard Skill Class'):
                hard_skill_class = value
            elif line.startswith('Soft Skill Class'):
                soft_skill_class = value
        return behavior_class, hard_skill_class, soft_skill_class



    


    def process_dataframe(self, df, max_batch_size=32):
        total_batches = (len(df) + max_batch_size - 1) // max_batch_size
    
        with tqdm(total=total_batches, desc="Processing batches") as pbar:
            for i in range(0, len(df), max_batch_size):
                batch = df.iloc[i:i+max_batch_size]
                batch_results = self.process_batch(batch)
            
                # Update the original dataframe with new columns
                for col in batch_results.columns:
                    df.loc[df.index[i:i+max_batch_size], col] = batch_results[col].values
            
                pbar.update(1)
    
        return df  # Return the updated original dataframe
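
`process_dataframe` walks the DataFrame in slices of `max_batch_size` rows, and the number of slices is a ceiling division. The sketch below reproduces that arithmetic on plain integers for the 356-row sample used in this notebook; `batch_bounds` is a hypothetical helper, not part of the class. Note that rows within a batch are still sent to the model one at a time, so the batch size controls progress reporting and how often results are written back rather than GPU throughput.

```python
def batch_bounds(n_rows: int, max_batch_size: int):
    """Yield (start, stop) row positions for each batch, like df.iloc[start:stop]."""
    for start in range(0, n_rows, max_batch_size):
        yield start, min(start + max_batch_size, n_rows)


n_rows, max_batch_size = 356, 32
total_batches = (n_rows + max_batch_size - 1) // max_batch_size  # ceiling division
bounds = list(batch_bounds(n_rows, max_batch_size))

print(total_batches)  # 12
print(bounds[-1])     # (352, 356): the last batch holds the remaining 4 rows
```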

# Run `SkillBehaviorProcessor` Script

In [77]:
## preview the DataFrame being processed (note: `df_1_month` is not created in the cells shown above)
df_1_month.head()

Unnamed: 0,award_date,award_type,english_message,department
0,2025-01-01,Applaud 2,Kam Wei has been instrumental in helping to cl...,6600 Solution Consultant
1,2025-01-01,Applaud 2,Sid is a rare Player-Coach that inspires his c...,6630 1st Line Mgr - SC
2,2025-01-01,Applaud 2,"Myung Soo rejoined Qlik in August 2024, but qu...",6600 Solution Consultant
3,2025-01-01,Applaud 2,"Congrats, Sean!",6600 Solution Consultant
4,2025-01-01,Applaud 2,Thank you Jason for being persistent and consi...,6120 QC Sales Enterprise


In [89]:
len(df_2_weeks)

356
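
For planning the run time, it helps to count the model calls this sample implies. `process_batch` makes one extraction call and one classification call per message, and `generate_text` allows up to 256 new tokens per call. The figures below are upper bounds derived from those defaults, not measured values.

```python
n_rows = 356           # len(df_2_weeks) reported above
calls_per_row = 2      # one extraction call + one classification call
max_new_tokens = 256   # default in generate_text

total_calls = n_rows * calls_per_row
max_generated_tokens = total_calls * max_new_tokens  # upper bound, not a measurement

print(total_calls)           # 712
print(max_generated_tokens)  # 182272
```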

In [None]:
# Initialize the processor
processor = SkillBehaviorProcessor()

# Process the DataFrame with overall progress bar
total_rows = len(df_2_weeks)
with tqdm(total=total_rows, desc="Overall progress") as pbar:
    df_2_weeks = processor.process_dataframe(df_2_weeks, max_batch_size=32)
    pbar.update(total_rows)

# df_2_weeks now contains the original columns plus the new columns

# Display sample results
print(df_2_weeks[['english_message', 'work_behaviors', 'hard_skills', 'soft_skills', 'behavior_class', 'hard_skill_class', 'soft_skill_class']].head())

# Save the results
df_2_weeks.to_csv('2_weeks_Qlik_processed_results.csv', index=False)
print("Results saved to 2_weeks_Qlik_processed_results.csv")
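
Because the class labels are free-form model output, the same category can appear with different casing or spacing. One possible follow-up, sketched here on a toy CSV string rather than the real results file, is to normalize the `behavior_class` column and count how often each label occurs. The sample rows are invented for illustration only.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the saved results file (invented rows)
toy_csv = io.StringIO(
    "english_message,behavior_class\n"
    "msg a,Time Management\n"
    "msg b,time management \n"
    "msg c,Collaboration\n"
    "msg d,None\n"
)


def normalize(label: str) -> str:
    """Collapse whitespace and casing so near-identical labels are counted together."""
    return " ".join(label.split()).title()


labels = [normalize(row["behavior_class"]) for row in csv.DictReader(toy_csv)]
counts = Counter(label for label in labels if label != "None")  # drop empty classes

print(counts.most_common())
```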