# In-Loop AI Assisted Labeling

### Overview
This Jupyter notebook is a crucial component of the ReTeach project, focusing on the in-the-loop AI-assisted labeling task. It is designed to enhance the process of labeling educational transcripts by integrating machine learning models and human expertise. The notebook allows users to interactively label classroom session transcripts, thereby training the model to improve its labeling accuracy over time.

### Features
**Data Loading and Preprocessing**:

This section imports classroom session transcripts from CSV files and slices the dataset as needed by the Streamlit UI.
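
As a rough illustration of loading and slicing, the sketch below reads transcript rows from CSV text and returns a positional slice. It uses only the standard library `csv` module as a stand-in for the DataFrame used in the notebook; the column names `speaker` and `text`, the sample rows, and the helper `load_transcript` are assumptions made for the example, not the project's actual schema.

```python
import csv
import io

# Hypothetical transcript data; the real column names are not shown in this notebook.
SAMPLE_CSV = """speaker,text
teacher,Good morning class.
teacher,You are all doing a great job.
student,I don't know.
teacher,Please sit down.
"""


def load_transcript(csv_file):
    """Read transcript rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(csv_file))


def get_slice(rows, start, end):
    """Return rows start..end-1, mirroring a positional slice of the dataset."""
    return rows[start:end]


rows = load_transcript(io.StringIO(SAMPLE_CSV))
batch = get_slice(rows, 1, 3)
print([row["text"] for row in batch])
```

An end index past the last row simply returns the remaining rows, which is convenient when the final batch is shorter than the requested size.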

**AI Labeling Initialization**:

This section loads a BERT encoder model for initial label predictions. The model predicts a label and a confidence score for each line of transcript text.

**Model Training and Adaptation**:

The notebook includes functionality to retrain the model with the corrected labels, enhancing its prediction accuracy over time.
Model adaptation is repeated for each batch of corrected labels, with the aim of improving performance as more labeled data accumulates.
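
The predict, correct, retrain cycle can be sketched as a small loop. This is a minimal sketch, not the project's implementation: `labeling_loop` and its `predict`, `review`, and `retrain` callables are hypothetical names, and the "model" here is a toy dictionary that memorizes corrected sentences so the example runs without BERT or a UI.

```python
def labeling_loop(sentences, predict, review, retrain, batch_size=2):
    """Predict each batch, collect corrected labels, retrain, and repeat."""
    all_labels = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        predicted = [predict(sentence) for sentence in batch]
        corrected = review(batch, predicted)  # human review step
        retrain(batch, corrected)             # adapt the model to the corrections
        all_labels.extend(corrected)
    return all_labels


# Toy stand-ins so the sketch runs without a real model or UI.
memory = {}  # "model": remembers sentences it has been corrected on
truth = {"Great job!": "PRS", "Sit down now.": "REP", "Who can answer?": "OTR"}


def predict(sentence):
    return memory.get(sentence, "NEU")  # default guess before any training


def review(batch, predicted):
    return [truth[sentence] for sentence in batch]  # reviewer supplies the right label


def retrain(batch, corrected):
    memory.update(zip(batch, corrected))


final_labels = labeling_loop(list(truth), predict, review, retrain)
print(final_labels, predict("Great job!"))
```

In the real workflow, `predict` would correspond to `label_sentence`, `review` to the corrections made in the Streamlit UI, and `retrain` to `train_with_sentences`.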

**Batch Processing and Accuracy Tracking**:

Users label transcript lines in batches whose size depends on the accuracy of the previous batch.
The notebook tracks and displays the accuracy of the model's predictions: when accuracy exceeds 50%, the batch size grows by 5; otherwise it shrinks by 5, provided it is currently above 10.
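
The batch-size heuristic can be tried in isolation. The sketch below mirrors the thresholds used by `accuracy_batch_calculation` (50% accuracy, steps of 5, shrink only above 10); the standalone function name `next_batch_size`, its keyword parameters, and the sample predictions are assumptions made for illustration.

```python
def next_batch_size(predicted, actual, batch_size, step=5, threshold=50, floor=10):
    """Return (accuracy in percent, adjusted batch size) for one reviewed batch."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("Lists must be non-empty and of the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    accuracy = matches / len(predicted) * 100
    if accuracy > threshold:
        batch_size += step  # predictions were mostly right: review more lines next time
    elif batch_size > floor:
        batch_size -= step  # predictions were poor: shrink while above the floor
    return accuracy, batch_size


batch_size = 10
history = []
rounds = [
    (["PRS", "NEU", "OTR", "NEU"], ["PRS", "NEU", "OTR", "REP"]),  # 3 of 4 correct
    (["NEU", "NEU", "NEU", "NEU"], ["PRS", "OTR", "NEU", "REP"]),  # 1 of 4 correct
    (["NEU", "NEU"], ["PRS", "REP"]),                              # 0 of 2 correct
]
for predicted, actual in rounds:
    accuracy, batch_size = next_batch_size(predicted, actual, batch_size)
    history.append((accuracy, batch_size))
print(history)
```

Note that the size is only reduced while it is above 10, so a batch of 10 stays at 10 after a poor round.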

### More on the model used (BERT)
BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking model in the field of natural language processing (NLP) developed by Google. It represents a significant leap forward because of its deep understanding of language context and nuance. Unlike previous models that analyzed text in one direction, either left-to-right or right-to-left, BERT is bidirectional, meaning it considers the full context of a word by looking at the words that come before and after it. This feature enables BERT to capture a more comprehensive understanding of language structure and meaning.

For the task of labeling educational transcripts, BERT is particularly suitable due to its exceptional ability to understand the context and nuances of human language. Educational transcripts often contain complex sentences, specialized terminology, and varied expressions that require a deep understanding of context to accurately interpret and label. BERT's proficiency in understanding context allows it to accurately classify sentences into categories such as praise, reprimand, or neutral remarks, which are typical in educational settings. Additionally, BERT's versatility and adaptability make it ideal for custom tasks like this, where it can be fine-tuned with specific data (like educational transcripts) to enhance its performance in a specialized domain. This ability to adapt to the nuances of educational dialogue makes BERT an excellent choice for the automated labeling of classroom transcripts.

### Disclaimer

1. This notebook does NOT contain all components for this task; some project requirements (such as the user interface) are implemented with Streamlit in the app.py file.

2. Due to the complexity of the BERT model and the lack of an extensive labeled training dataset, the improvement from training might not be significant. This could be addressed once we have access to a larger labeled dataset and can fine-tune more extensively.

In [None]:
#| default_exp In_Loop_AI_Assist_Labeling

In [None]:
#| export
import sys
sys.path.append('../')
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import Dataset
from ai_assisted_coding_v2 import AI_Assist_Labeling

# Define the BertLabeler class, inheriting from AI_Assist_Labeling.BertBase
class BertLabeler(AI_Assist_Labeling.BertBase):

    # Retrieve the current DataFrame
    def get_df(self):
        return self.df

    # Set the DataFrame with new data
    def set_df(self, dataframe):
        self.df = dataframe

    # Method to get a specific slice of the dataframe based on start and end row indices
    def get_slice(self, start, end):
        return self.df[start:end]

    # Method to label a single sentence using the trained classifier
    def label_sentence(self, sentence):
        descriptive_labels = list(self.label_map.keys())
        result = self.classifier(sentence, descriptive_labels)
        # Convert descriptive label to acronym and return it along with the confidence score
        return self.label_map[result['labels'][0]], result['scores'][0]
    
    # Method to label a list of sentences
    def label_list_sentences(self, sentences):
        # Iterate over each sentence, labeling it and collecting the results
        return [self.label_sentence(sentence) for sentence in sentences]
    
    # Method to calculate the accuracy of predictions in a batch and adjust the batch size accordingly
    def accuracy_batch_calculation(self, predicted, actual, batch_size):
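        if len(predicted) != len(actual):
            raise ValueError("Lists must be of the same length")
        # Guard against division by zero in the accuracy calculation below
        if not predicted:
            raise ValueError("Lists must not be empty")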

        # Count the number of correct predictions
        matches = sum(1 for x, y in zip(predicted, actual) if x == y)

        # Calculate the accuracy as a percentage
        accuracy = (matches / len(predicted)) * 100

        # Adjust batch size based on accuracy using a simple heuristic
        if accuracy > 50:
            batch_size += 5
        elif batch_size > 10:
            batch_size -= 5

        return accuracy, batch_size

    # Method for training the model with new data
    def train_with_sentences(self, sentences, labels, model_save_path='../trained_model', save_model=False):
        # Convert text labels to numeric IDs; sort so the mapping is the same on every run
        label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
        labels = [label_to_id[label] for label in labels]

        # Tokenize the input sentences
        train_encodings = self.tokenizer(sentences, truncation=True, padding=True, max_length=512)

        # Create a dataset from the tokenized sentences and labels
        train_dataset = Dataset.from_dict({
            'input_ids': train_encodings['input_ids'],
            'attention_mask': train_encodings['attention_mask'],
            'labels': labels
        })

        # Define training arguments for the model
        training_args = TrainingArguments(
            output_dir=model_save_path,
            num_train_epochs=3,  # Number of training epochs
            per_device_train_batch_size=1,  # Small batch size for detailed updates
            warmup_steps=500,  # Number of warmup steps; exceeds the total step count on tiny datasets, so the learning rate stays low
            weight_decay=0.01,  # Weight decay for regularization
            logging_dir='./logs',  # Directory for logs
            save_steps=5,  # Frequency of model saving
            save_total_limit=3,  # Maximum number of saved models
            load_best_model_at_end=False  # Flag to control model loading behavior
        )

        # Initialize the Trainer with the model, training arguments, dataset, and collator
        trainer = Trainer(
            model=self.model,                  
            args=training_args,
            train_dataset=train_dataset,
            data_collator=DataCollatorWithPadding(tokenizer=self.tokenizer),
            tokenizer=self.tokenizer
        )

        # Start the training process
        trainer.train()

        # Save the model and tokenizer if required
        if save_model:
            self.model.save_pretrained(model_save_path)
            self.tokenizer.save_pretrained(model_save_path)
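
Numeric IDs built from a `set` of strings can differ between interpreter runs because string hashing is randomized, so sorting the distinct labels first keeps the mapping stable across training rounds. A minimal sketch of that idea, using a hypothetical helper `build_label_ids` that is not part of `BertLabeler`:

```python
def build_label_ids(labels):
    """Map each distinct label to an integer ID, assigned in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


labels = ["PRS", "OTR", "PRS", "REP"]
label_to_id = build_label_ids(labels)
encoded = [label_to_id[label] for label in labels]

# The inverse mapping is useful for turning predicted IDs back into labels.
id_to_label = {idx: label for label, idx in label_to_id.items()}
print(label_to_id, encoded)
```

A mapping built this way still depends on which labels appear in a given batch; keeping one fixed mapping for the whole project would avoid IDs shifting when a batch happens to omit a label.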


In [None]:
# demo single sentence labeling
bert = BertLabeler(model_name="bert-base-uncased")
bert.label_sentence("I don't know")

BertBase is being initialized


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


('NEU', 0.26569560170173645)
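
To show how `label_sentence` arrives at a tuple like this, the sketch below post-processes a stand-in result with the same shape as a Hugging Face zero-shot classification result, where `labels` and `scores` are sorted by descending score. The descriptive keys in `label_map` and the score values are assumptions; the real map is defined in `AI_Assist_Labeling.BertBase`, which is not shown here.

```python
# Hypothetical label map: descriptive label -> acronym.
label_map = {
    "opportunity to respond": "OTR",
    "praise": "PRS",
    "reprimand": "REP",
    "neutral": "NEU",
}

# Stand-in for the classifier output; the values are made up for illustration.
result = {
    "sequence": "I don't know",
    "labels": ["neutral", "praise", "opportunity to respond", "reprimand"],
    "scores": [0.27, 0.25, 0.24, 0.24],
}


def top_label(result, label_map):
    """Return the acronym and score of the highest-ranked label."""
    return label_map[result["labels"][0]], result["scores"][0]


label, score = top_label(result, label_map)
uniform = 1 / len(label_map)  # score each label would get from a uniform guess
print(label, score, uniform)
```

Assuming four candidate labels whose scores sum to 1, a top score near 0.25 is close to a uniform guess, which is consistent with the warning above that the classification head is newly initialized.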

In [None]:
# Training example, using a list of sentences and labels
sentences = ["Good morning class, today we are going to learn about nouns.",
             "I'm going to give you a chance to answer a question.",
             "You are all doing a great job.",
             "You are all doing a terrible job."]

labels = ["PRS", "OTR", "PRS", "REP"]

bert.train_with_sentences(sentences, labels, save_model=True)



Checkpoint destination directory ../trained_model/checkpoint-10 already exists and is non-empty.Saving will proceed but saved results may be invalid.


In [None]:
# demo single sentence labeling, after training
# The prediction might not be accurate; training on a larger dataset should improve it
bert.label_sentence("I don't know")

('PRS', 0.2611815929412842)

### Conclusion

In this Jupyter notebook, we implemented a BERT-based AI-assisted labeling system for educational transcripts with dynamic batching and in-loop learning. The system is intended to streamline the classification of classroom dialogue into categories such as opportunities to respond, praise, reprimands, and neutral comments. BERT supplies context-aware initial predictions, which users then correct through an interactive user interface. As the disclaimer notes, the gains from retraining are limited with the small labeled dataset available so far; with more labeled data, this collaboration between AI and human input is expected to improve labeling accuracy and efficiency over time.

The notebook's design emphasizes user-friendly interaction, adaptability, and scalability, making it a valuable tool for educational data analysis. With the ability to export the labeled data for further use and the system's continuous learning from user feedback, this project demonstrates the potential of AI in transforming educational data processing, paving the way for more insightful and data-driven educational practices.