# AI-Assisted Labeling

This notebook contains the code for the 'AI-Assisted Labeling' portion of the project.

In [None]:
#| default_exp AI_Assist_Labeling

The first step is to define a Python class named 'BertBase', which serves as a base class for using a BERT model for zero-shot classification. Here are the steps:

1. **Imports & Class Declaration**
    - Import the necessary libraries
    - Declare the BertBase class

2. **Initialization Method**
    - `__init__` is the class constructor, called automatically when a new instance of BertBase is created
    - Takes 3 parameters: model_name, load_model_path, and dataframe

3. **Label Mapping**
    - self.label_map defined to map descriptive labels to their acronyms

4. **DataFrame Initialization**
    - If the dataframe parameter (a path to a CSV file) is provided, the file is read into a pandas DataFrame and assigned to self.df
    - If not, self.df is set to None

5. **Model & Tokenizer Loading**
    - The code checks whether load_model_path is provided
    - If so, the BERT model & tokenizer are loaded from that path
    - If not, the default pre-trained model specified by model_name & its associated tokenizer are loaded
    - The model is initialized for sequence classification, with the number of labels equal to the length of label_map

6. **Zero-Shot Classification Pipeline**
    - Initialize a zero-shot classification pipeline using the loaded model & tokenizer
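
To make the label mapping concrete, here is a minimal sketch of how a zero-shot pipeline result can be turned into an acronym. It assumes the result is a dictionary with 'labels' and 'scores' lists, which is the shape the Hugging Face zero-shot pipeline returns for a single input. The helper name top_label and the scores in example_result are hypothetical and only serve as an illustration; no model is loaded.

```python
# Same mapping as BertBase.label_map
LABEL_MAP = {
    "An opportunity to respond": "OTR",
    "Praise": "PRS",
    "Reprimand": "REP",
    "None of the above": "NEU",
}


def top_label(result, label_map=LABEL_MAP):
    """Return (acronym, score) for the highest-scoring candidate label.

    Hypothetical helper: `result` is assumed to hold parallel
    'labels' and 'scores' lists.
    """
    scores = result["scores"]
    # Pick the maximum explicitly instead of relying on the list order
    best = max(range(len(scores)), key=scores.__getitem__)
    return label_map[result["labels"][best]], scores[best]


# Hand-written stand-in for a pipeline result; the scores are made up
example_result = {
    "sequence": "Can anyone give me an example of a noun?",
    "labels": ["An opportunity to respond", "None of the above", "Praise", "Reprimand"],
    "scores": [0.70, 0.15, 0.10, 0.05],
}

acronym, score = top_label(example_result)
print(acronym, score)
```

Storing the acronym rather than the descriptive label keeps the CSV compact, while the descriptive phrases remain the candidate labels the classifier actually sees.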

In [None]:
#| export
# Importing necessary libraries
import pandas as pd
from transformers import pipeline, BertTokenizer, BertForSequenceClassification

In [None]:
#| export
# Define class
class BertBase:
    # Takes 3 parameters: model_name, load_model_path, and dataframe
    # model_name defaults to bert-base-uncased, a pre-trained BERT model provided by Hugging Face
    def __init__(self, model_name="bert-base-uncased", load_model_path=None, dataframe=None):
        print('BertBase is being initialized')
        
        # Mapping from descriptive labels to acronyms
        self.label_map = {
            "An opportunity to respond": "OTR",
            "Praise": "PRS",
            "Reprimand": "REP",
            "None of the above": "NEU"
        }
        # If a CSV path is passed as dataframe, read it into a pandas DataFrame;
        # otherwise set self.df to None
        self.df = pd.read_csv(dataframe) if dataframe else None

        # Load the model from the specified path if provided, otherwise load a pretrained model
        if load_model_path:
            self.model = BertForSequenceClassification.from_pretrained(load_model_path, num_labels=len(self.label_map))
            self.tokenizer = BertTokenizer.from_pretrained(load_model_path)
        else:
            self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(self.label_map))
            self.tokenizer = BertTokenizer.from_pretrained(model_name)

        # Initializing a zero-shot classification pipeline with the model and tokenizer
        self.classifier = pipeline("zero-shot-classification", model=self.model, tokenizer=self.tokenizer)


Next, we define a second class, CsvLabeler, which inherits from BertBase. It labels the text in a CSV file using the BERT zero-shot classifier and then saves the results. Here are the steps:

1. **Class Declaration**
    - CsvLabeler inherits all functionality of BertBase, including the BERT model, tokenizer, & classifier

2. **Method: label_csv**
    - Reads CSV file
    - Applies text classification
    - Saves results

3. **Method: colorize_confidence**
    - Applies color formatting to the DataFrame based on the confidence score of each classification
    - Returns a styled DataFrame
        - Styling is for display only and is not saved to the CSV file

In [None]:
#| export
class CsvLabeler(BertBase):
    def label_csv(self, file_name, output_filename='../labeled_data/labeled_classroom_transcripts.csv'):
        # Load data from the CSV file
        df = pd.read_csv(file_name)
        
        # Ensure 'Label' and 'Confidence' columns exist and are of type 'object' and 'float' respectively
        if 'Label' not in df.columns:
            df['Label'] = pd.Series(dtype='object')
        else:
            df['Label'] = df['Label'].astype('object')
        if 'Confidence' not in df.columns:
            df['Confidence'] = pd.Series(dtype='float')

        # Prepare descriptive labels for classification
        descriptive_labels = list(self.label_map.keys())
        
        # Classify each row in the dataframe and assign labels
        for index, row in df.iterrows():
            result = self.classifier(row['Text'], descriptive_labels)
            # Convert descriptive label to acronym and store it along with confidence
            df.at[index, 'Label'] = self.label_map[result['labels'][0]]
            df.at[index, 'Confidence'] = result['scores'][0]

        # Save the labeled data to a CSV file
        df.to_csv(output_filename, index=False)
        return output_filename

    def colorize_confidence(self, dataframe):
        """
        Apply color formatting to the dataframe based on confidence scores.
        High confidence (>= 0.75): Green, Medium: Yellow, Low (<= 0.25): Red.
        """
        def apply_color(val):
            color = 'yellow'
            if val >= 0.75:
                color = 'green'
            elif val <= 0.25:
                color = 'red'
            return f'background-color: {color}'

        # Note: Styler.applymap is deprecated as of pandas 2.1 in favor of Styler.map
        return dataframe.style.applymap(apply_color, subset=['Confidence'])

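The threshold logic in colorize_confidence can also be read on its own. The sketch below restates it as a plain function so the boundaries are easy to check; the names confidence_color and confidence_css and the high/low parameters are hypothetical and not part of the exported module.

```python
def confidence_color(val, high=0.75, low=0.25):
    """Map a confidence score to a color name.

    Uses the same cut-offs as colorize_confidence: scores at or above
    `high` are green, scores at or below `low` are red, and everything
    in between is yellow.
    """
    if val >= high:
        return "green"
    if val <= low:
        return "red"
    return "yellow"


def confidence_css(val):
    """Build the CSS string that a pandas Styler expects for one cell."""
    return f"background-color: {confidence_color(val)}"


for score in (0.90, 0.75, 0.50, 0.25, 0.10):
    print(score, confidence_color(score))
```

Both boundaries are inclusive, so a score of exactly 0.75 is green and a score of exactly 0.25 is red.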

# Demonstration

Now we can demonstrate the labeler on ReTeach_Data.csv, which has 2 columns: Text and Label (plus an unnamed index column, visible in the output below). The 'Text' column holds one transcript utterance per row, whereas the 'Label' column starts out completely empty because the model fills in these values.

In [None]:
# Sample CSV file path - Replace this with the path to your actual sample file
sample_csv_path = '../data/ReTeach_Data.csv'

# Initialize the CsvLabeler
csv_labeler = CsvLabeler()

# Step 1: Read the sample CSV file
original_df = pd.read_csv(sample_csv_path)
print("Original Data:")
display(original_df.head())  # Display the first few rows of the original data

# Step 2: Apply labeling and confidence scoring
output_filename = csv_labeler.label_csv(sample_csv_path)

# Step 3: Read the labeled data
labeled_df = pd.read_csv(output_filename)
print("\nLabeled Data with Confidence Scores:")
display(labeled_df.head())  # Display the first few rows of the labeled data

# Step 4: Apply color coding based on confidence scores
styled_df = csv_labeler.colorize_confidence(labeled_df)
print("\nLabeled Data with Color Coding:")
display(styled_df)  # Display the styled DataFrame

# Step 5: Save the labeled DataFrame to a new CSV file (the color styling is display-only and is not written)
styled_output_filename = '../labeled_data/ai_assist_labeled_data.csv'
labeled_df.to_csv(styled_output_filename, index=False)
print(f"\nLabeled data saved to: {styled_output_filename}")


BertBase is being initialized


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


Original Data:


Unnamed: 0,Text,Label
0,"Good morning class, today we are going to lear...",
1,"A noun is a word that represents a person, pla...",
2,Can anyone give me an example of a noun?,
3,"That's right, 'dog' is a noun because it is a ...",
4,Let's write down some nouns in our notebooks.,



Labeled Data with Confidence Scores:


Unnamed: 0,Text,Label,Confidence
0,"Good morning class, today we are going to lear...",REP,0.257249
1,"A noun is a word that represents a person, pla...",PRS,0.260126
2,Can anyone give me an example of a noun?,OTR,0.251545
3,"That's right, 'dog' is a noun because it is a ...",PRS,0.253711
4,Let's write down some nouns in our notebooks.,REP,0.254356



Labeled Data with Color Coding:




Unnamed: 0,Text,Label,Confidence
0,"Good morning class, today we are going to learn about nouns.",REP,0.257249
1,"A noun is a word that represents a person, place, thing, or idea.",PRS,0.260126
2,Can anyone give me an example of a noun?,OTR,0.251545
3,"That's right, 'dog' is a noun because it is a thing.",PRS,0.253711
4,Let's write down some nouns in our notebooks.,REP,0.254356
5,"Now, let's talk about verbs. Does anyone know what a verb is?",PRS,0.252375
6,"A verb is a word that describes an action, occurrence, or state of being.",PRS,0.255162
7,Can someone give me an example of a verb?,OTR,0.252993
8,"Great example, 'run' is a verb because it is an action.",OTR,0.251204
9,"Now, let's write down some verbs in our notebooks.",REP,0.256105



Labeled data saved to: ../labeled_data/ai_assist_labeled_data.csv
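
Every confidence score in the output above falls between 0.25 and 0.26. With four candidate labels, a uniform guess would score 0.25, so these values appear consistent with the warning that the classification head of bert-base-uncased was newly initialized rather than trained. A minimal sketch of a sanity check for this situation follows; the function name near_chance and the tolerance value are assumptions chosen for illustration.

```python
def near_chance(scores, num_labels, tolerance=0.05):
    """Return True if every score lies within `tolerance` of 1 / num_labels.

    Hypothetical helper: a True result suggests the classifier is not
    separating the candidate labels much better than a uniform guess.
    """
    chance = 1.0 / num_labels
    return bool(scores) and all(abs(s - chance) <= tolerance for s in scores)


# Confidence values copied from the labeled output above
demo_scores = [
    0.257249, 0.260126, 0.251545, 0.253711, 0.254356,
    0.252375, 0.255162, 0.252993, 0.251204, 0.256105,
]

print(near_chance(demo_scores, num_labels=4))
```

If a check like this returns True, the labels are probably best treated as placeholders to be reviewed by hand rather than as reliable predictions.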
