# Introduction to the Project

This project focuses on **Domain-Specific Named Entity Recognition (NER)** for academic course catalog data. The objective is to extract structured information such as course codes, titles, credits, instructors, and scheduling details from unstructured course descriptions.

## Objective

The primary goal is to build a custom NER model that can:
- Identify relevant entities (e.g., `CourseCode`, `Credit`, `Instructor`)
- Enable structured search and classification of academic content
- Support downstream tasks like course recommendation, comparison, or catalog organization



## Experiment Overview

The experiment is designed to:
1. **Annotate Data**: Manually label course descriptions using Label Studio.
2. **Train a Transformer Model**: Fine-tune a pre-trained transformer (e.g., DistilBERT or BERT) on the annotated dataset.
3. **Evaluate Performance**: Use metrics such as F1-score to evaluate entity extraction performance.
4. **Deploy and Test**: Package the model in a Streamlit app for interactive testing.

### Environment Setup

The first step is to mount Google Drive, which holds the training data and receives all outputs. The following cells then install the Python libraries used throughout the NER project.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install pandas label-studio label-studio-sdk


In [None]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16251 sha256=7c810774b54c97cea30bbfd961e64093e5c09b4095cbc99c1611d3cecb31068d
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
!pip install transformers datasets




### Clean Up Annotated Files

This cell deletes previously generated or uploaded chunk files from Google Drive to avoid redundancy or conflicts during reprocessing.

**Key Operations**:
- **Folder Path**: Specifies the target directory where annotated files are stored.
- **Pattern Matching**: Uses a glob pattern (`uiuc_chunk*`) to identify all relevant files.
- **Safe Deletion**: Iterates through each matching file and attempts deletion with error handling.

In [None]:
import os
import glob

# Specify the folder path
folder_path = "/content/drive/MyDrive/41043/Project"  # <-- Replace this with your actual path

# Pattern to match (all files starting with uiuc_chunk)
pattern = os.path.join(folder_path, "uiuc_chunk*")

# Delete matching files
deleted = 0
for file_path in glob.glob(pattern):
    try:
        os.remove(file_path)
        print(f"Deleted: {file_path}")
        deleted += 1
    except Exception as e:
        print(f"Error deleting {file_path}: {e}")

print(f"\n Deleted {deleted} files matching 'uiuc_chunk*' from {folder_path}")


## Create Project Directory

This cell sets up the target directory in Google Drive to store chunked datasets and annotation files for the NER project.

Since Label Studio cannot handle large datasets efficiently in a single upload, we split the course catalog into smaller chunks and store them here for smooth annotation and processing.


In [None]:
!mkdir -p /content/drive/MyDrive/41043/Project/uiuc_chunk

## Load and Prepare Course Catalog Data

This cell loads the original course catalog dataset and prepares it for annotation by creating a unified `text` column. Key transformations include:

- Generating a `CourseCode` by combining subject and number (e.g., "AAS 100").
- Formatting `TimeSlot` and `Location` details.
- Assembling all relevant course information into a single `text` field suitable for Named Entity Recognition (NER) annotation.
- Shuffling the dataset to ensure a random distribution of records.

This preprocessed `text` column will be used as input for chunking and uploading to Label Studio.


In [None]:
import pandas as pd

In [None]:
# 2) Load Data & Build Text Column

df = pd.read_csv('/content/drive/MyDrive/41043/Project/course-catalog dataset.csv')

# Generate CourseCode like "AAS 100"
df["CourseCode"] = df["Subject"].astype(str) + " " + df["Number"].astype(str)

# Construct Time and Location fields
df["TimeSlot"] = df["Days of Week"].fillna('') + " " + df["Start Time"].fillna('') + " - " + df["End Time"].fillna('')
df["Location"] = df["Room"].fillna('') + " " + df["Building"].fillna('')

# Generate a complete 'text' field
df["text"] = (
    df["CourseCode"].fillna('') + ": " +
    df["Name"].fillna('') + ". " +
    df["Description"].fillna('') + " " +
    "Credit: " + df["Credit Hours"].fillna('') + ". " +
    "Instructor(s): " + df["Instructors"].fillna('') + ". " +
    "Scheduled at: " + df["TimeSlot"].fillna('') + " in " + df["Location"].fillna('') + "."
)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Preview
print(df["text"].head())

0    CEE 380: Geotechnical Engineering. Classificat...
1    ACE 499: Contemporary Topics in ACE. Group dis...
2    ECON 103: Macroeconomic Principles. Introducti...
3    RST 350: Tourism and Culture. Studies the rela...
4    CW 106: Poetry Workshop I. Practice in the wri...
Name: text, dtype: object


## Span Finder Helper

This utility function helps locate the character span of a given phrase within a text block.

It returns the start and end character indices along with the matched phrase. This is essential for generating span-based annotations (e.g., Label Studio JSON format) required for training NER models.

If the phrase is missing or not found in the text, it returns `None`.


In [None]:
#  Span Finder Helper
def find_span(text, phrase):
    phrase = str(phrase).strip()
    if not phrase or phrase == "nan":
        return None
    start = text.find(phrase)
    if start == -1:
        return None
    return {
        "start": start,
        "end": start + len(phrase),
        "text": phrase
    }
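
Because `str.find` returns only the first match, a short value such as a credit count can be matched earlier in the text than intended, for example inside the course code. The following is a minimal sketch of one way around this, assuming the text template built in the loading cell (`"Credit: ..."`, `"Instructor(s): ..."`). The helper `find_span_after` and its `anchor` parameter are hypothetical names and are not part of the original notebook.

```python
def find_span_after(text, phrase, anchor=""):
    """Like find_span, but begin the search right after `anchor` (hypothetical helper)."""
    phrase = str(phrase).strip()
    if not phrase or phrase == "nan":
        return None
    offset = 0
    if anchor:
        anchor_pos = text.find(anchor)
        if anchor_pos != -1:
            # Skip everything up to and including the anchor text
            offset = anchor_pos + len(anchor)
    start = text.find(phrase, offset)
    if start == -1:
        return None
    return {"start": start, "end": start + len(phrase), "text": phrase}


# Illustrative text only; it mimics the template used for the `text` column.
text = "MATH 3: Calculus. Limits. Credit: 3 hours. Instructor(s): Doe, J."

first_match = text.find("3")                          # lands inside "MATH 3"
span = find_span_after(text, "3", anchor="Credit: ")  # lands after "Credit: "
print(first_match, span)
```

If the anchor is absent, the sketch falls back to searching from the start of the text, which matches the behavior of `find_span`.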


## Build Pre-Annotated Tasks

This function constructs structured annotations for each course entry to be used in Label Studio. It maps known fields (like `CourseCode`, `Instructor`, `Credit`, etc.) to corresponding spans in the text.

**Key Steps:**
- Define a mapping between column names and entity labels.
- Use the `find_span` function to locate the start and end positions of each entity.
- Handle multiple instructors by splitting on `;`.
- Generate a list of span-based annotations following Label Studio's expected schema.

The output is a dictionary containing the full `text` and its corresponding `annotations`, ready for export and upload to Label Studio.


In [None]:
# Build Pre-Annotated Tasks
def create_annotation(row):
    labels = {
        "CourseCode": row["CourseCode"],
        "CourseTitle": row["Name"],
        "Credit": row["Credit Hours"],
        "Instructor": row["Instructors"],
        "Department": row["Building"],
        "Location": row["Location"],
        "TimeSlot": row["TimeSlot"]
    }

    results = []

    for label, value in labels.items():
        if pd.isna(value): continue
        # Split if multiple instructors; cast to str so .strip() also works on numeric values
        spans = str(value).split(";") if label == "Instructor" else [str(value)]
        for span in spans:
            match = find_span(row["text"], span.strip())
            if match:
                results.append({
                    "value": {
                        **match,
                        "labels": [label]
                    },
                    "from_name": "ner",
                    "to_name": "text",
                    "type": "labels"
                })
    return {
        "text": row["text"],
        "annotations": [{
            "result": results
        }]
    }


## Export Pre-Annotated Data for Label Studio

This section generates a Label Studio-compatible JSON file containing span-based annotations for each course entry.

- The annotations are written to a single JSON file named `uiuc_labelstudio_preannotated.json`.
- To accommodate Label Studio’s size limitations, the full dataset is split into smaller chunks of 200 records each.
- Each chunk is saved as a separate JSON file inside the `uiuc_chunk` folder (e.g., `uiuc_chunk_1.json`, `uiuc_chunk_2.json`, etc.).

These chunked files are ready for upload into Label Studio for annotation review and further refinement.


In [None]:
# Create JSON for Label Studio
output = [create_annotation(row) for _, row in df.iterrows()]

import json
with open("/content/drive/MyDrive/41043/Project/uiuc_labelstudio_preannotated.json", "w") as f:
    json.dump(output, f, indent=2)


In [None]:
import json

# Load your full dataset
with open("/content/drive/MyDrive/41043/Project/uiuc_labelstudio_preannotated.json", "r") as f:
    data = json.load(f)

# Split size
chunk_size = 200

# Split and save
for i in range(0, len(data), chunk_size):
    chunk = data[i:i+chunk_size]
    with open(f"/content/drive/MyDrive/41043/Project/uiuc_chunk/uiuc_chunk_{i//chunk_size + 1}.json", "w") as out:
        json.dump(chunk, out, indent=2)
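
The chunk arithmetic can be checked without touching Drive. This is a small sketch under stated assumptions: `iter_chunks` is a hypothetical helper, and the 450 toy records stand in for the real task list.

```python
import math


def iter_chunks(records, chunk_size):
    """Yield (1-based chunk number, slice of records) pairs."""
    for start in range(0, len(records), chunk_size):
        yield start // chunk_size + 1, records[start:start + chunk_size]


# Toy stand-in for the pre-annotated task list (illustrative data only).
records = [{"text": f"course {i}"} for i in range(450)]
chunks = list(iter_chunks(records, chunk_size=200))

# The number of files is ceil(len(records) / chunk_size); the last file holds the remainder.
expected_files = math.ceil(len(records) / 200)
sizes = [len(chunk) for _, chunk in chunks]
print(expected_files, sizes)
```

The chunk number follows the same `i // chunk_size + 1` rule used for the file names above, so numbering starts at 1.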


## Count Annotated Entities by Class

This cell loads the finalized annotated JSON file exported from Label Studio and counts the number of labeled entities for each class (e.g., `CourseCode`, `Credit`, `Instructor`).

- It iterates through all annotation entries and tallies label occurrences.
- The result is presented as a tabular DataFrame for easier interpretation and reporting.

This step helps assess label distribution and detect any imbalance in entity classes, which is important for model training.



In [None]:
file_path = "/content/drive/MyDrive/41043/Project/project-7-at-2025-05-16-18-37-eec27312.json"

with open(file_path, 'r') as f:
    label_data = json.load(f)

class_counts = {}

for item in label_data:
    if 'annotations' in item and item['annotations']:
        for annotation in item['annotations']:
            if 'result' in annotation and annotation['result']:
                for result in annotation['result']:
                    if 'value' in result and 'labels' in result['value']:
                        for label in result['value']['labels']:
                            class_counts[label] = class_counts.get(label, 0) + 1

# Convert counts to a pandas DataFrame for tabular format
class_counts_df = pd.DataFrame(list(class_counts.items()), columns=['Class', 'Count'])

# Display the table
class_counts_df

|   | Class | Count |
|---|---------------|-------|
| 0 | CourseCode | 201 |
| 1 | CourseTitle | 196 |
| 2 | Prerequisites | 106 |
| 3 | Credit | 198 |
| 4 | Instructor | 232 |
| 5 | TimeSlot | 304 |
| 6 | Location | 141 |
| 7 | Department | 7 |
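
The same tally can be written more compactly with `collections.Counter`. This is a sketch only: `count_labels` is a hypothetical helper, and the toy export below is made up to mirror the structure the cell above reads (`annotations` → `result` → `value` → `labels`).

```python
from collections import Counter


def count_labels(tasks):
    """Count label occurrences across all annotation results."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations") or []:
            for result in annotation.get("result") or []:
                counts.update(result.get("value", {}).get("labels", []))
    return counts


# Made-up records for illustration; not taken from the real export.
toy_export = [
    {"annotations": [{"result": [
        {"value": {"start": 0, "end": 6, "labels": ["CourseCode"]}},
        {"value": {"start": 8, "end": 20, "labels": ["CourseTitle"]}},
    ]}]},
    {"annotations": [{"result": [
        {"value": {"start": 0, "end": 7, "labels": ["CourseCode"]}},
    ]}]},
    {"annotations": []},  # a task without annotations is skipped
]

label_counts = count_labels(toy_export)
print(label_counts.most_common())
```

`Counter.most_common()` returns the classes sorted by frequency, which makes imbalance easier to spot than an unsorted dictionary.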


## Tokenization and Label Encoding for NER Model Training

These cells prepare course catalog data for training a BERT-based token classification model using the BIO tagging scheme.

### Step 1: Define Label Mappings
- Loads the `bert-base-cased` tokenizer from Hugging Face.
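- Specifies the target entity labels used for Named Entity Recognition. `Prerequisites` appears in the annotated export but is not in this list, so the `label2id.get(label, 0)` fallback in Step 2 maps those spans to `O`.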
- Constructs `label2id` and `id2label` dictionaries using the BIO format:
  - `B-<LABEL>` marks the beginning of an entity.
  - `I-<LABEL>` marks subsequent tokens within the same entity.
  - `O` denotes tokens outside any labeled span.

### Step 2: Convert Span-Based Annotations to Token-Level Labels
- Tokenizes the input text and captures character-level offset mappings.
- Aligns Label Studio’s span annotations with token boundaries.
- Assigns appropriate BIO tags to each token using the label mappings.
- Converts tags to numeric `label_ids` for model training.

Together, these steps generate token-label pairs essential for fine-tuning transformer models on the custom NER dataset.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

LABELS = ['CourseCode', 'CourseTitle', 'Credit', 'Instructor', 'Department', 'Location', 'TimeSlot']

label2id = {"O": 0}
for label in LABELS:
    label2id[f"B-{label}"] = len(label2id)
    label2id[f"I-{label}"] = len(label2id)

id2label = {v: k for k, v in label2id.items()}
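
As a quick sanity check, the mapping can be rebuilt on its own and inspected. The snippet below repeats the construction from the cell above so that it runs standalone; the lookups at the end are illustrative.

```python
LABELS = ["CourseCode", "CourseTitle", "Credit", "Instructor",
          "Department", "Location", "TimeSlot"]

# "O" takes id 0; each entity type then gets consecutive B- and I- ids.
label2id = {"O": 0}
for label in LABELS:
    label2id[f"B-{label}"] = len(label2id)
    label2id[f"I-{label}"] = len(label2id)
id2label = {idx: tag for tag, idx in label2id.items()}

print(len(label2id))                                       # 1 + 2 * len(LABELS) tags
print(label2id["B-CourseCode"], label2id["I-CourseCode"])  # first entity pair
print(label2id.get("B-Prerequisites", 0))                  # unknown tags fall back to "O"
```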




In [None]:
import json

def label_studio_to_token_labels(example):
    text = example["data"]["text"]
    entities = example["annotations"][0]["result"]

    encoding = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=128)
    offset_mapping = encoding["offset_mapping"]
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])

    labels = ["O"] * len(tokens)

    for ent in entities:
        label_name = ent["value"]["labels"][0]
        start_char = ent["value"]["start"]
        end_char = ent["value"]["end"]

        is_first = True
        for idx, (start, end) in enumerate(offset_mapping):
            if start is None or end is None or start == end:
                continue
            if start >= end_char or end <= start_char:
                continue
            if start >= start_char and end <= end_char:
                if is_first:
                    labels[idx] = f"B-{label_name}"
                    is_first = False
                else:
                    labels[idx] = f"I-{label_name}"

    label_ids = [label2id.get(label, 0) for label in labels]
    return {
        "tokens": tokens,
        "ner_tags": label_ids
    }
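
The span-to-BIO rule can be illustrated without downloading a tokenizer. The sketch below swaps the BERT tokenizer for a simple regular expression that splits words and punctuation, purely for illustration; `spans_to_bio` is a hypothetical helper, and the entity dictionaries use a flattened `label` key rather than the Label Studio layout.

```python
import re


def spans_to_bio(text, entities):
    """Tag regex tokens with BIO labels derived from character spans."""
    # Each token keeps its character offsets, like an offset mapping.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for ent in entities:
        first = True
        for idx, (_, start, end) in enumerate(tokens):
            # Same containment rule as the cell above: the token must lie
            # fully inside the annotated span to receive a tag.
            if start >= ent["start"] and end <= ent["end"]:
                tags[idx] = f"B-{ent['label']}" if first else f"I-{ent['label']}"
                first = False
    return [tok for tok, _, _ in tokens], tags


text = "CS 101: Intro to Computing"
entities = [
    {"start": 0, "end": 6, "label": "CourseCode"},    # "CS 101"
    {"start": 8, "end": 26, "label": "CourseTitle"},  # "Intro to Computing"
]

tokens, tags = spans_to_bio(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:<12} {tag}")
```

The colon sits outside both spans and therefore keeps the `O` tag. A token that straddles a span boundary would also stay `O` under this containment rule.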




## Prepare Token-Level Dataset for BERT NER Training

These steps convert annotated course descriptions into a format suitable for fine-tuning a BERT-based Named Entity Recognition (NER) model using the Hugging Face `transformers` and `datasets` libraries.

### Step 1: Convert Span Annotations to Token-Level BIO Tags
- Loads the exported JSON file from Label Studio.
- Applies the `label_studio_to_token_labels()` function to extract `tokens` and corresponding BIO-formatted `ner_tags` for each record.
- Constructs a Hugging Face `Dataset` from the processed records and splits it into training and validation sets.

### Step 2: Tokenize and Align Labels with Subword Tokens
- Uses the BERT tokenizer with `is_split_into_words=True`, treating each entry of `tokens` from Step 1 as one word.
- Maps word-level labels (`ner_tags`) to subword tokens using `word_ids`; every subword inherits the label of its word.
- Assigns label IDs to the appropriate tokens and uses `-100` for special tokens and padding to be ignored during loss calculation.

This end-to-end preprocessing ensures that the dataset is properly formatted and aligned for input to a transformer-based token classification model.


In [None]:
from datasets import Dataset

# Load your Label Studio exported file
with open("/content/drive/MyDrive/41043/Project/project-7-at-2025-05-16-18-37-eec27312.json") as f:
    raw_data = json.load(f)

dataset = Dataset.from_list([label_studio_to_token_labels(row) for row in raw_data])


# Split into train/validation
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset["train"]
val_dataset = dataset["test"]


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        padding="max_length",
        truncation=True,
        max_length=128,
        return_offsets_mapping=False
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens and padding are ignored by the loss
                label_ids.append(-100)
            else:
                # Every subword, first or continuation, inherits its word's label
                label_ids.append(label[word_idx])

        label_ids += [-100] * (len(tokenized_inputs["input_ids"][i]) - len(label_ids))
        label_ids = label_ids[:len(tokenized_inputs["input_ids"][i])]
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
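
The alignment logic can be sketched on a hand-written `word_ids` list. `align_labels` and its `label_all_subwords` flag are hypothetical names. With the flag set to `True` the sketch mirrors the cell above; with `False` it shows a common alternative in which continuation subwords are masked with `-100`, which the notebook does not use.

```python
def align_labels(word_ids, word_labels, label_all_subwords=True):
    """Expand word-level label ids to subword positions."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)                    # special token or padding
        elif word_idx != previous or label_all_subwords:
            aligned.append(word_labels[word_idx])   # first subword, or copy to all
        else:
            aligned.append(-100)                    # masked continuation subword
        previous = word_idx
    return aligned


# Hand-written example: word 1 was split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 3]

copied = align_labels(word_ids, word_labels, label_all_subwords=True)
masked = align_labels(word_ids, word_labels, label_all_subwords=False)
print(copied)
print(masked)
```

Copying labels to every subword gives the loss more labeled positions, while masking continuations scores each word once. Which choice works better here is not something the notebook evaluates.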



## Tokenize and Encode Train/Validation Sets

This step applies the `tokenize_and_align_labels()` function to both the training and validation datasets using Hugging Face’s `map()` method.

- Converts word-level tokens and labels into subword-tokenized inputs with aligned label IDs.
- Ensures that the format matches what the BERT model expects for token classification.
- Uses `batched=True` for efficient batch processing during mapping.

The result is two fully preprocessed datasets (`tokenized_train` and `tokenized_val`) ready for model training.


In [None]:
tokenized_train = train_dataset.map(tokenize_and_align_labels, batched=True)
tokenized_val = val_dataset.map(tokenize_and_align_labels, batched=True)


Map:   0%|          | 0/160 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

## Define Evaluation Metrics for NER

This function calculates performance metrics for token classification using the `seqeval` library.

### Key Steps:
- Converts model logits (`predictions`) to label indices using `argmax`.
- Filters out special tokens with label `-100` (ignored during loss calculation).
- Maps predicted and true label indices back to their string form using `id2label`.
- Computes precision, recall, and F1-score for each entity type using `seqeval.metrics.classification_report`.

This function is used by Hugging Face’s `Trainer` to evaluate model performance on the validation set after each epoch.


In [None]:
from seqeval.metrics import classification_report

def compute_metrics(p):
    predictions, labels = p
    predictions = predictions.argmax(axis=-1)

    true_predictions = [
        [id2label[p] for (p, l) in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]

    return classification_report(true_labels, true_predictions, output_dict=True)
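
Returning nested dictionaries from `compute_metrics` means the `Trainer` cannot log those entries as scalars. One possible workaround, sketched here, is to flatten the report into scalar entries before returning it. `flatten_report` is a hypothetical helper; the sample values are copied from the epoch 5 results reported later in this notebook.

```python
def flatten_report(report):
    """Turn {'CourseCode': {'f1-score': 0.9, ...}} into {'CourseCode_f1-score': 0.9, ...}."""
    flat = {}
    for name, scores in report.items():
        key = name.replace(" ", "_")
        if isinstance(scores, dict):
            for metric, value in scores.items():
                flat[f"{key}_{metric}"] = float(value)
        else:
            # Already a scalar; keep it under its own key
            flat[key] = float(scores)
    return flat


# Two entries in the shape produced with output_dict=True.
sample_report = {
    "CourseCode": {"precision": 0.9, "recall": 0.9, "f1-score": 0.9, "support": 40},
    "micro avg": {"precision": 0.7643312101910829, "recall": 0.7894736842105263,
                  "f1-score": 0.7766990291262136, "support": 152},
}

flat_metrics = flatten_report(sample_report)
print(flat_metrics)
```

In `compute_metrics`, the last line could then return `flatten_report(classification_report(..., output_dict=True))`, so each value is logged as a separate scalar.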

## Model and Trainer Setup

Initializes a BERT model for token classification and configures training using Hugging Face’s `Trainer`.

- Loads `bert-base-cased` with custom label mappings.
- Sets training arguments (epochs, batch size, logging, etc.).
- Prepares `Trainer` with model, data, tokenizer, and evaluation metrics.

Ready to train and evaluate the NER model.


In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

training_args = TrainingArguments(
    output_dir="./ner-course-model",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)




Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Train the NER Model

Starts fine-tuning with the configured `Trainer`.

- `tokenized_dataset`: The full train/validation `DatasetDict` with tokenization and label alignment applied. The `Trainer` was already given `tokenized_train` and `tokenized_val`, so this variable is not used during training.
- `trainer.train()`: Begins fine-tuning the BERT model on the NER task.


In [None]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
trainer.train()




Each metric cell lists precision / recall / F1, rounded to three decimals. The support (number of gold entities in the validation set) is given as `n` in the column header.

| Epoch | Training Loss | Validation Loss | CourseCode (n=40) | CourseTitle (n=40) | Credit (n=22) | Instructor (n=20) | Location (n=7) | TimeSlot (n=23) | Micro avg (n=152) | Macro avg (n=152) | Weighted avg (n=152) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3871 | 0.287637 | 0.810 / 0.850 / 0.829 | 0.309 / 0.425 / 0.358 | 0.333 / 0.227 / 0.270 | 0.000 / 0.000 / 0.000 | 0.000 / 0.000 / 0.000 | 0.958 / 1.000 / 0.979 | 0.581 / 0.520 / 0.549 | 0.402 / 0.417 / 0.406 | 0.488 / 0.520 / 0.500 |
| 2 | 0.1420 | 0.108922 | 0.900 / 0.900 / 0.900 | 0.683 / 0.700 / 0.691 | 0.538 / 0.636 / 0.583 | 0.526 / 0.500 / 0.513 | 0.125 / 0.143 / 0.133 | 0.958 / 1.000 / 0.979 | 0.709 / 0.737 / 0.723 | 0.622 / 0.647 / 0.633 | 0.715 / 0.737 / 0.725 |
| 3 | 0.0708 | 0.064594 | 0.878 / 0.900 / 0.889 | 0.614 / 0.675 / 0.643 | 0.762 / 0.727 / 0.744 | 0.682 / 0.750 / 0.714 | 0.200 / 0.286 / 0.235 | 0.920 / 1.000 / 0.958 | 0.730 / 0.783 / 0.756 | 0.676 / 0.723 / 0.697 | 0.741 / 0.783 / 0.761 |
| 4 | 0.0465 | 0.070617 | 0.900 / 0.900 / 0.900 | 0.568 / 0.625 / 0.595 | 0.762 / 0.727 / 0.744 | 0.762 / 0.800 / 0.780 | 0.333 / 0.429 / 0.375 | 1.000 / 1.000 / 1.000 | 0.753 / 0.783 / 0.768 | 0.721 / 0.747 / 0.732 | 0.764 / 0.783 / 0.772 |
| 5 | 0.0263 | 0.055943 | 0.900 / 0.900 / 0.900 | 0.619 / 0.650 / 0.634 | 0.762 / 0.727 / 0.744 | 0.727 / 0.800 / 0.762 | 0.333 / 0.429 / 0.375 | 1.000 / 1.000 / 1.000 | 0.764 / 0.789 / 0.777 | 0.724 / 0.751 / 0.736 | 0.772 / 0.789 / 0.780 |



TrainOutput(global_step=100, training_loss=0.23293721139431, metrics={'train_runtime': 1505.8965, 'train_samples_per_second': 0.531, 'train_steps_per_second': 0.066, 'total_flos': 52265493504000.0, 'train_loss': 0.23293721139431, 'epoch': 5.0})

## Evaluate the Trained Model

Runs evaluation on the validation set and prints key metrics such as precision, recall, and F1-score for each entity class.

Useful for assessing model performance after training.


In [None]:
metrics = trainer.evaluate()
print(metrics)



{'eval_loss': 0.05594261735677719, 'eval_CourseCode': {'precision': 0.9, 'recall': 0.9, 'f1-score': 0.9, 'support': 40}, 'eval_CourseTitle': {'precision': 0.6190476190476191, 'recall': 0.65, 'f1-score': 0.6341463414634146, 'support': 40}, 'eval_Credit': {'precision': 0.7619047619047619, 'recall': 0.7272727272727273, 'f1-score': 0.7441860465116279, 'support': 22}, 'eval_Instructor': {'precision': 0.7272727272727273, 'recall': 0.8, 'f1-score': 0.761904761904762, 'support': 20}, 'eval_Location': {'precision': 0.3333333333333333, 'recall': 0.42857142857142855, 'f1-score': 0.375, 'support': 7}, 'eval_TimeSlot': {'precision': 1.0, 'recall': 1.0, 'f1-score': 1.0, 'support': 23}, 'eval_micro avg': {'precision': 0.7643312101910829, 'recall': 0.7894736842105263, 'f1-score': 0.7766990291262136, 'support': 152}, 'eval_macro avg': {'precision': 0.7235930735930737, 'recall': 0.7509740259740258, 'f1-score': 0.7358728583133006, 'support': 152}, 'eval_weighted avg': {'precision': 0.7723855092276144, 'r

## Visualize Token-Label Alignment

This utility function prints tokens and their corresponding entity labels from a dataset sample.

- Helps verify if BIO labels align correctly with tokens.
- Useful for debugging and qualitative inspection of training data or predictions.

Displays output in a clear tabular format for easy readability.


In [None]:
def visualize_example(dataset, index, id2label):
    example = dataset[index]
    tokens = example["tokens"]
    label_ids = example["ner_tags"]

    print(f"\n📄 Sample #{index} — Token Label Alignment\n")
    print(f"{'Token':<20} | Label")
    print(f"{'-'*30}")

    for token, label_id in zip(tokens, label_ids):
        label = id2label.get(label_id, "O")
        print(f"{token:<20} | {label}")


In [None]:
visualize_example(dataset['train'], 0, id2label)




## Save Trained Model and Tokenizer

Saves the fine-tuned NER model and tokenizer to Google Drive for future use or deployment.

This allows reloading the model without retraining.


In [None]:
trainer.save_model("/content/drive/MyDrive/41043/Project/ner-course-model-4")
tokenizer.save_pretrained("/content/drive/MyDrive/41043/Project/ner-course-model-4")


('/content/drive/MyDrive/41043/Project/ner-course-model-4/tokenizer_config.json',
 '/content/drive/MyDrive/41043/Project/ner-course-model-4/special_tokens_map.json',
 '/content/drive/MyDrive/41043/Project/ner-course-model-4/vocab.txt',
 '/content/drive/MyDrive/41043/Project/ner-course-model-4/added_tokens.json',
 '/content/drive/MyDrive/41043/Project/ner-course-model-4/tokenizer.json')

## Batch Annotate Text Chunks and Merge with Manual Labels

This workflow automates the generation of NER-labeled data from raw text and combines it with manually annotated examples for training.

### Key Steps:
- **Load Trained Model**: Restores the fine-tuned BERT model and tokenizer for inference.
- **Generate Predictions**: Applies `generate_ner_samples()` to the lines of multiple chunked `.txt` or `.json` files, producing one predicted label per word.
- **Merge Datasets**: Combines the model-generated annotations with the manually labeled dataset to form a unified training set.

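This approach scales dataset creation and supports incremental fine-tuning with little manual effort. The generated labels are model predictions rather than verified annotations, so any errors the model makes carry over into the merged training set.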
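
The workflow in this notebook accepts every prediction as a label. One possible refinement, which is not part of the original notebook, is to keep only samples the model is confident about. The sketch below uses plain Python and hypothetical logits; the helper names, the `logits` field, and the 0.9 threshold are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_token_confidence(sample_logits):
    """Lowest top-class probability across the tokens of one sample."""
    return min(max(softmax(token_logits)) for token_logits in sample_logits)

def keep_confident(samples, threshold=0.9):
    """Keep samples whose least confident token still meets the threshold."""
    return [s for s in samples if min_token_confidence(s["logits"]) >= threshold]

# Hypothetical logits for two short samples over three labels.
samples = [
    {"tokens": ["CS", "101"], "logits": [[6.0, 0.0, 0.0], [0.0, 7.0, 0.0]]},
    {"tokens": ["Room", "TBA"], "logits": [[1.0, 0.9, 0.8], [5.0, 0.0, 0.0]]},
]
confident = keep_confident(samples, threshold=0.9)
print([s["tokens"] for s in confident])
```

Filtering on the least confident token is a conservative choice: a single uncertain token discards the whole sample, which trades dataset size for label quality.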


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from datasets import Dataset

# Load the fine-tuned model and tokenizer saved earlier
model_path = "/content/drive/MyDrive/41043/Project/ner-course-model-4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)
model.eval()  # inference mode: disables dropout

# Label mapping: reuse the ids stored in the model config so that generated
# ner_tags use the same ids the model was trained with
id2label = model.config.id2label
label2id = {label: idx for idx, label in id2label.items()}
label_list = [id2label[idx] for idx in sorted(id2label)]

# Generate NER samples in Hugging Face format: {"tokens": [...], "ner_tags": [...]}
def generate_ner_samples(text_lines, batch_size=32):
    """Label raw text lines with the fine-tuned model, one label per word."""
    all_samples = []

    for i in range(0, len(text_lines), batch_size):
        batch = text_lines[i:i + batch_size]
        encodings = tokenizer(
            batch,
            return_tensors='pt',
            truncation=True,
            padding=True,
            is_split_into_words=False
        )

        with torch.no_grad():
            outputs = model(**encodings)

        predictions = torch.argmax(outputs.logits, dim=-1)

        for b_idx, text in enumerate(batch):
            # Maps each sub-token to its word index (None for special and padding tokens)
            word_ids = encodings.word_ids(batch_index=b_idx)
            pred_labels = predictions[b_idx].tolist()

            token_list = []
            label_ids = []

            seen_word_idx = set()
            for t_idx, word_idx in enumerate(word_ids):
                if word_idx is None or word_idx in seen_word_idx:
                    continue
                # A word takes the label predicted for its first sub-token
                label_str = id2label[pred_labels[t_idx]]

                # Store the whole word from the original text rather than
                # only its first sub-token
                span = encodings.word_to_chars(b_idx, word_idx)
                token_list.append(text[span.start:span.end])
                label_ids.append(label2id[label_str])
                seen_word_idx.add(word_idx)

            if token_list:
                all_samples.append({
                    "tokens": token_list,
                    "ner_tags": label_ids
                })

    return Dataset.from_list(all_samples)
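
The inner loop above keeps one label per word by taking the prediction at the first sub-token of each word and skipping special tokens. That step can be isolated and tested without a model. In the sketch below, `first_subword_labels` is a hypothetical helper and the `word_ids` layout is an invented example, not real tokenizer output.

```python
def first_subword_labels(word_ids, token_label_ids):
    """Return (word_index, label_id) pairs using each word's first sub-token."""
    seen = set()
    kept = []
    for word_idx, label_id in zip(word_ids, token_label_ids):
        # None marks special or padding tokens; repeated indices are later sub-tokens
        if word_idx is None or word_idx in seen:
            continue
        seen.add(word_idx)
        kept.append((word_idx, label_id))
    return kept

# Invented layout: special token, three pieces of word 0, one piece of word 1, special token
word_ids = [None, 0, 0, 0, 1, None]
token_label_ids = [0, 1, 2, 2, 3, 0]

word_labels = first_subword_labels(word_ids, token_label_ids)
print(word_labels)
```

Taking the first sub-token is one common convention; predictions on later sub-tokens of the same word are ignored, even when they disagree with the first.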


In [None]:
# Load & Annotate Multiple Chunks
import os
from datasets import concatenate_datasets

def generate_from_folder(chunk_folder, chunk_prefix="uiuc_chunk_", max_chunks=10):
    all_datasets = []
    for i in range(1, max_chunks + 1):
        file_path = os.path.join(chunk_folder, f"{chunk_prefix}{i}.txt")
        if os.path.exists(file_path):
            print(f" Annotating: {file_path}")
            with open(file_path, "r") as f:
                lines = [line.strip() for line in f if line.strip()]
            ds = generate_ner_samples(lines)
            all_datasets.append(ds)
        else:
            print(f" Skipping missing file: {file_path}")
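    # concatenate_datasets raises on an empty list, so return None when no chunk was found
    return concatenate_datasets(all_datasets) if all_datasets else None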


In [None]:
# Merge With Your Manually Labeled Dataset
import os
import json
from datasets import Dataset, concatenate_datasets

#  Path settings
manual_file = "/content/drive/MyDrive/41043/Project/project-7-at-2025-05-16-18-37-eec27312.json"
chunk_folder = "/content/drive/MyDrive/41043/Project"
chunk_prefix = "uiuc_chunk_3"
max_chunks = 5  # Adjust if you have more

# Step 1: Load manually labeled dataset
with open(manual_file) as f:
    manual_data = json.load(f)

# Dataset.from_list expects a list of {"tokens": [...], "ner_tags": [...]} records;
# the format is assumed here, not checked
existing_ds = Dataset.from_list(manual_data)
print(f"Loaded manually labeled dataset: {len(existing_ds)} examples")

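# Step 2: Generate and merge chunk datasets
# This redefines generate_from_folder from the previous cell to look for .json chunks;
# each non-empty line of a chunk file is still treated as one raw text sample
def generate_from_folder(chunk_folder, chunk_prefix="uiuc_chunk_2", max_chunks=10):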
    all_datasets = []
    for i in range(1, max_chunks + 1):
        file_path = os.path.join(chunk_folder, f"{chunk_prefix}{i}.json")
        if os.path.exists(file_path):
            print(f"🔍 Annotating: {file_path}")
            with open(file_path, "r") as f:
                lines = [line.strip() for line in f if line.strip()]
            ds = generate_ner_samples(lines)
            all_datasets.append(ds)
        else:
            print(f" Skipping missing file: {file_path}")
    return concatenate_datasets(all_datasets) if all_datasets else None

# Run chunk annotation and collect datasets
generated_ds = generate_from_folder(chunk_folder, chunk_prefix, max_chunks)

#  Step 3: Merge manual + generated datasets
if generated_ds:
    full_training_ds = concatenate_datasets([existing_ds, generated_ds])
    print(f" Final training dataset: {len(full_training_ds)} examples")
else:
    full_training_ds = existing_ds
    print(" No chunk datasets found. Using only manually labeled data.")
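
The merge only works cleanly if the manual records follow the `tokens`/`ner_tags` layout noted in the code comment. A small pre-check on the loaded JSON can surface bad records before they reach `Dataset.from_list`. This is a sketch under stated assumptions: `find_bad_records`, the toy records, and `num_labels=3` are hypothetical and would need to match the real label set.

```python
def find_bad_records(records, num_labels):
    """Return (index, reason) pairs for records that break the expected schema."""
    problems = []
    for i, record in enumerate(records):
        tokens = record.get("tokens")
        tags = record.get("ner_tags")
        if not isinstance(tokens, list) or not isinstance(tags, list):
            problems.append((i, "tokens and ner_tags must both be lists"))
        elif len(tokens) != len(tags):
            problems.append((i, "tokens and ner_tags differ in length"))
        elif any(not isinstance(t, int) or not 0 <= t < num_labels for t in tags):
            problems.append((i, "ner_tags contains an id outside the label range"))
    return problems

# Hypothetical records: the first is valid, the other two are not.
toy_records = [
    {"tokens": ["CS", "101"], "ner_tags": [1, 2]},
    {"tokens": ["3", "credits"], "ner_tags": [1]},
    {"tokens": ["Room", "TBA"], "ner_tags": [0, 7]},
]
for index, reason in find_bad_records(toy_records, num_labels=3):
    print(f"record {index}: {reason}")
```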



Loaded manually labeled dataset: 200 examples
🔍 Annotating: /content/drive/MyDrive/41043/Project/uiuc_chunk_31.json
🔍 Annotating: /content/drive/MyDrive/41043/Project/uiuc_chunk_32.json
🔍 Annotating: /content/drive/MyDrive/41043/Project/uiuc_chunk_33.json
🔍 Annotating: /content/drive/MyDrive/41043/Project/uiuc_chunk_34.json
🔍 Annotating: /content/drive/MyDrive/41043/Project/uiuc_chunk_35.json
 Final training dataset: 95868 examples
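
The log above implies that model-generated samples far outnumber the manual ones. A quick calculation with the two reported counts (200 manual, 95868 total) makes the ratio explicit; the variable names are illustrative.

```python
manual_examples = 200      # "Loaded manually labeled dataset: 200 examples"
total_examples = 95868     # "Final training dataset: 95868 examples"

generated_examples = total_examples - manual_examples
manual_share = manual_examples / total_examples

print(f"generated examples: {generated_examples}")
print(f"manual share of merged set: {manual_share:.2%}")
```

With manual annotations at roughly 0.2% of the merged set, the model-labeled samples will dominate any further fine-tuning on it, which is worth keeping in mind when interpreting later evaluation results.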


