# Part 1: Fine-Tuning BERT
Task: Fine-tune a pre-trained BERT model for a specific NLP task using Hugging Face.

Choose an NLP task:

Examples: Sentiment analysis, text classification, question answering, or named entity recognition.
Prepare your dataset:

Use a public dataset (e.g., IMDb for sentiment analysis, SQuAD for question answering).
Ensure the dataset is preprocessed appropriately (e.g., tokenization using Hugging Face's tokenizer).
Fine-tune BERT:

Load a pre-trained BERT model from Hugging Face (e.g., bert-base-uncased).
Set up a training loop with Hugging Face's Trainer API.
Specify hyperparameters such as batch size, learning rate, and number of epochs.
Monitor training:

Track loss and accuracy during training.
Save the fine-tuned model.
Deliverable: Submit the code for fine-tuning, training logs, and a short analysis of the results.

In [28]:
# DEPENDENCIES
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import evaluate
import os

device='cuda' if torch.cuda.is_available() else 'cpu'

In [17]:
# TOKENIZE TRAINING DATA
# Load the dataset
dataset = load_dataset('squad')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenize the data
def tokenize_function(example):
    tokenized = tokenizer(
        example['question'],
        example['context'],
        truncation="only_second",
        padding="max_length",
        max_length=384,
        stride=128,
        return_overflowing_tokens=False,
        return_offsets_mapping=True
    )

    # Get the start and end character positions of the answer
    start_char = example['answers']['answer_start'][0]
    end_char = start_char + len(example['answers']['text'][0])

    # Determine the token positions
    offsets = tokenized["offset_mapping"]
    sequence_ids = tokenized.sequence_ids()

    # Find context start index
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    start_token = end_token = 0
    for i in range(context_start, context_end + 1):
        if offsets[i][0] <= start_char < offsets[i][1]:
            start_token = i
        if offsets[i][0] < end_char <= offsets[i][1]:
            end_token = i
            break

    tokenized['start_positions'] = start_token
    tokenized['end_positions'] = end_token

    return tokenized


tokenized_datasets = dataset.map(tokenize_function)

In [18]:
# PREPARE DATA FOR PYTORCH
tokenized_datasets.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

train_dataset = tokenized_datasets["train"].shuffle(seed=123).select(range(2000)) # Use a subset for quick training
test_dataset = tokenized_datasets["validation"].shuffle(seed=123).select(range(500))

# Load the pre-trained BERT model
m1 = BertForQuestionAnswering.from_pretrained('bert-base-uncased').to(device)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./m1results",
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=10,
)

# Define a Trainer instance
trainer = Trainer(
    model=m1,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,3.622,3.618852
2,2.9364,2.823883
3,2.1941,2.618894


TrainOutput(global_step=375, training_loss=3.2709960072835287, metrics={'train_runtime': 162.2887, 'train_samples_per_second': 36.971, 'train_steps_per_second': 2.311, 'total_flos': 1175835405312000.0, 'train_loss': 3.2709960072835287, 'epoch': 3.0})

In [20]:
m1.save_pretrained("models/m1")

# Part 2: Debugging Issues
Task: Identify and resolve issues during BERT fine-tuning or prediction.

Introduce or encounter common issues:

Examples:
Poor performance on validation data.
Overfitting or underfitting.
Long training times or memory errors.
Analyze the problem:

Review training logs and validation metrics.
Inspect the tokenization or dataset preprocessing.
Debug the issues:

Adjust hyperparameters (e.g., learning rate, number of epochs).
Use data augmentation or regularization techniques to address overfitting.
Optimize memory usage by reducing batch size or gradient accumulation.
Test the refined model:

Re-run training or predictions after debugging.
Compare results before and after debugging.
Deliverable: Submit the initial issue, debugging steps, and improved results, with a brief explanation of your process.

In [21]:
# PREPARE DATA FOR PYTORCH
tokenized_datasets.set_format("torch", columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

train_dataset = tokenized_datasets["train"].shuffle(seed=123).select(range(2000)) # Use a subset for quick training
test_dataset = tokenized_datasets["validation"].shuffle(seed=123).select(range(500))

# Load the pre-trained BERT model
m2 = BertForQuestionAnswering.from_pretrained('bert-base-uncased').to('cuda')

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./m2results",
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    num_train_epochs=10,
    weight_decay=0.05,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=10,
    fp16=True,
    no_cuda=False
)

# Define a Trainer instance
trainer = Trainer(
    model=m2,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,4.3805,4.104551
2,3.7424,3.676965
3,3.2701,3.401626
4,2.8325,2.936658
5,2.0705,2.598091
6,1.9442,2.432039
7,1.6609,2.353867
8,1.4638,2.339232
9,1.4132,2.328587
10,1.1542,2.337282


TrainOutput(global_step=320, training_loss=2.4482653081417083, metrics={'train_runtime': 200.7604, 'train_samples_per_second': 99.621, 'train_steps_per_second': 1.594, 'total_flos': 3919451351040000.0, 'train_loss': 2.4482653081417083, 'epoch': 10.0})

Tokenizing procedure remained unchanged.
Increasing epochs shows positive growth that stagnates around 10 epochs in, kept learning rate as it was but slower learning rates of 1 or 1.5 benefit more around 12 epochs in. Batch size is halved for memory optimization and gradient accumulation is introduced. Decay rate increased to 0.05 to combat overfitting.

In [23]:
m2.save_pretrained("models/m2")

# Part 3: Evaluating the Model
Task: Use evaluation metrics to assess the fine-tuned BERT model.

Generate predictions on a test set:

Use the fine-tuned model to make predictions on unseen data.
Evaluate performance using these metrics:

Accuracy: For classification tasks.
F1-Score: Balance of precision and recall.
Exact Match (EM): For question answering tasks.
Mean Squared Error (MSE): For regression tasks.
Log Loss: For probabilistic outputs.
Refine the model:

Based on evaluation results, adjust the model (e.g., by refining prompts, hyperparameters, or preprocessing).
Deliverable: Submit evaluation metrics, a comparison of results before and after refinement, and a reflection on the improvements.

In [37]:
# TOKENIZE TEST DATA
# Load the dataset, sampled size not present in training validation set
dataset = load_dataset('squad')['validation'].shuffle(seed=123).select(range(500, 1012))

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize_test_set(examples):
    # Tokenize using the question and context
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    # Needed to align overflowed tokens back to examples
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples["offset_mapping"]

    # Prepare extra fields
    tokenized_examples["id"] = []
    tokenized_examples["context"] = []
    tokenized_examples["answers"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        sample_index = sample_mapping[i]
        tokenized_examples["id"].append(examples["id"][sample_index])
        tokenized_examples["context"].append(examples["context"][sample_index])
        tokenized_examples["answers"].append(examples["answers"][sample_index])

        # Set to None for question and special tokens
        sequence_ids = tokenized_examples.sequence_ids(i)
        tokenized_examples["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else (-1, 1) for k, o in enumerate(offset_mapping[i])
        ]


    return tokenized_examples

# Apply to entire validation set
from datasets import Dataset
tokenized_test_set = dataset.map(
    tokenize_test_set,
    batched=True,
    remove_columns=dataset.column_names,
)
tokenized_test_set.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "token_type_ids"],
    output_all_columns=True
)

In [38]:
# FUNCTIONS FOR PREDICTION AND INTERPRETATION
def get_predictions(model, tokenized_test_set, batch_size=16):
    model.to(device)
    model.eval()
    
    from torch.utils.data import DataLoader

    dataloader = DataLoader(
        tokenized_test_set.remove_columns(["offset_mapping", "context", "answers", "id"]),
        batch_size=batch_size
    )
    print("Dataloader created successfully.")

    start_logits_all = []
    end_logits_all = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Predicting"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            start_logits_all.append(outputs.start_logits.cpu())
            end_logits_all.append(outputs.end_logits.cpu())
    
    start_logits = torch.cat(start_logits_all, dim=0)
    end_logits = torch.cat(end_logits_all, dim=0)

    return start_logits, end_logits
    
def decode_predictions(start_logits, end_logits, tokenized_dataset, tokenizer, n_best_size=1, max_answer_length=30):
    print("Starting decode_predictions...")
    
    predictions = []
    for i in range(len(start_logits)):

        if i % 128 == 0:
            print(f"Decoding sample {i}/{len(start_logits)}")
            
        input_ids = tokenized_dataset[i]['input_ids']
        offset_mapping = tokenized_dataset[i]['offset_mapping']
        context = tokenized_dataset[i]['context']
        qas_id = tokenized_dataset[i]['id']
        
        start_logit = start_logits[i]
        end_logit = end_logits[i]

        # Get the best start-end span
        max_score = float('-inf')
        best_span = (0, 0)
        for start_index in range(len(start_logit)):
            for end_index in range(start_index, min(len(end_logit), start_index + max_answer_length)):
                score = start_logit[start_index] + end_logit[end_index]
                if score > max_score and offset_mapping[start_index] is not None and offset_mapping[end_index] is not None:
                    max_score = score
                    best_span = (start_index, end_index)

        # Decode answer
        start_char = offset_mapping[best_span[0]][0]
        end_char = offset_mapping[best_span[1]][1]
        answer = context[start_char:end_char]
        predictions.append({
            "id": qas_id,
            "prediction_text": answer
        })

    print("Finished decode_predictions.")
    return predictions

In [39]:
# EVALUATION
for file in os.listdir('models/'):
    model = BertForQuestionAnswering.from_pretrained('models/' + str(file))
    start_logits, end_logits = get_predictions(model, tokenized_test_set)
    predictions = decode_predictions(start_logits, end_logits, tokenized_test_set, tokenizer)
    
    # Convert predictions and references to dictionaries keyed by ID
    pred_dict = {p['id']: p for p in predictions}
    ref_dict = {r['id']: r for r in references}
    
    # Keep only IDs present in BOTH
    common_ids = set(pred_dict.keys()) & set(ref_dict.keys())
    
    # Filter to aligned lists
    aligned_preds = [pred_dict[i] for i in common_ids]
    aligned_refs = [ref_dict[i] for i in common_ids]
    
    metric = evaluate.load("squad")
    references = [
        {
            "id": ex["id"],
            "answers": ex["answers"]
        } for ex in dataset
    ]
    
    results = metric.compute(predictions=aligned_preds, references=aligned_refs)
    print(str(file), 'results')
    print(f"EM: {results['exact_match']:.2f}")
    print(f"F1: {results['f1']:.2f}\n")

Dataloader created successfully.


Predicting: 100%|██████████████████████████████████████████████████████████████████████| 33/33 [00:03<00:00, 10.39it/s]


Starting decode_predictions...
Decoding sample 0/524
Decoding sample 128/524
Decoding sample 256/524
Decoding sample 384/524
Decoding sample 512/524
Finished decode_predictions.
m1 results
EM: 32.23
F1: 45.18

Dataloader created successfully.


Predicting: 100%|██████████████████████████████████████████████████████████████████████| 33/33 [00:03<00:00, 10.28it/s]


Starting decode_predictions...
Decoding sample 0/524
Decoding sample 128/524
Decoding sample 256/524
Decoding sample 384/524
Decoding sample 512/524
Finished decode_predictions.
m2 results
EM: 42.19
F1: 54.08



# Part 4: Creative Application
Task: Apply BERT to solve a real-world NLP problem.

Choose a creative NLP task:

Examples:
Classify customer reviews as positive or negative.
Extract key entities (e.g., names, dates) from legal documents.
Answer questions based on a given passage of text.
Build and fine-tune your BERT model:

Use Hugging Face's model hub to experiment with different BERT variants (e.g., distilbert-base-uncased, bert-large-cased).
Use advanced techniques like data augmentation, early stopping, or mixed precision training.
Debug and evaluate the model:

Troubleshoot issues and ensure the model performs well on the chosen task.
Deliverable: Submit the final fine-tuned BERT model, evaluation metrics, and a summary of the techniques you used to achieve the best results.

In [56]:
model = BertForQuestionAnswering.from_pretrained('models/m2')

question = "What month is my birthday?"
context = "My birthday is in October. It is not in August. My sister's birthday is in May."

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits

start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits) + 1

answer_tokens = inputs['input_ids'][0][start_index:end_index]
answer = tokenizer.decode(answer_tokens)
print("Answer:", answer)


Answer: october
