# BERT

### LOAD DATA

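The dataset is stored as paired text files: each line of a `.source` file holds a crossword clue, and the same line of the matching `.target` file holds its answer. `load_data` reads both files and normalizes every answer by removing spaces and converting it to uppercase, so a multi-word answer becomes a single string as it would appear in a grid. We then load the training, validation, and test splits, print the size of each split, and show a few clue/answer pairs.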

In [None]:
import pandas as pd

# Function to load data from source and target files, preprocess targets
def load_data(source_file, target_file):
    # Read all lines from the source file
    with open(source_file, 'r') as f:
        sources = f.read().splitlines()
    
    # Read all lines from the target file and preprocess them
    with open(target_file, 'r') as f:
        targets = f.read().splitlines()
        # Remove spaces and convert to uppercase
        targets = [target.replace(" ", "").upper() for target in targets]
    
    return sources, targets

# Load training, validation, and testing data
train_sources, train_targets = load_data('dataset/train.source', 'dataset/train.target')
val_sources, val_targets = load_data('dataset/val.source', 'dataset/val.target')
test_sources, test_targets = load_data('dataset/test.source', 'dataset/test.target')

data_summary = pd.DataFrame({
    'Dataset': ['Training', 'Validation', 'Testing'],
    'Number of Clues': [len(train_sources), len(val_sources), len(test_sources)]
})

print(data_summary)

data_sample = pd.DataFrame({
    'Clue': train_sources,
    'Answer': train_targets
})

print(data_sample.head(5))
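To make the answer normalization concrete, here is a minimal sketch that applies the same rule as `load_data` to in-memory strings instead of files. The helper names `normalize_answer` and `pair_clues`, the length check, and the second sample pair are illustrative assumptions and are not part of the notebook.

```python
def normalize_answer(answer):
    """Apply the same rule as load_data: drop spaces, then uppercase."""
    return answer.replace(" ", "").upper()


def pair_clues(sources, targets):
    """Pair each clue with its normalized answer (hypothetical helper)."""
    # Assumption: a line-count mismatch means the files are misaligned.
    if len(sources) != len(targets):
        raise ValueError("source and target must have the same number of lines")
    return [(clue, normalize_answer(ans)) for clue, ans in zip(sources, targets)]


# Made-up sample lines standing in for the contents of the two files.
sample_sources = ["Line at an airport", "Meringue ingredient"]
sample_targets = ["limos", "egg white"]

pairs = pair_clues(sample_sources, sample_targets)
for clue, answer in pairs:
    print(f"{clue!r} -> {answer!r}")
```

Checking that both files have the same number of lines is a cheap guard: a single missing line would otherwise silently shift every following clue onto the wrong answer.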

In [2]:
train_targets

['LIMOS',
 'EGGWHITE',
 'ELAM',
 'OUS',
 'ENNEA',
 'MAIDEN',
 'FLARES',
 'GENL',
 'EASTLA',
 'REVENGE',
 'NINO',
 'DSCS',
 'MENDS',
 'AREAMAP',
 'AMA',
 'INAPT',
 'STARMAN',
 'IMMESH',
 'ALEC',
 'SST',
 'THERESNOIINTEAM',
 'ACHE',
 'SNEAKER',
 'LUTIST',
 'WISHI',
 'DIKES',
 'PADDY',
 'AVID',
 'BUT',
 'AQUA',
 'CIGNA',
 'LAREDO',
 'CRATION',
 'MEADOWLOCK',
 'MAIDEN',
 'REICH',
 'SADA',
 'TREASURE',
 'SHEEP',
 'RAMBO',
 'PERSONALPRONOUN',
 'SMITE',
 'ASS',
 'SIEVE',
 'ELIAS',
 'WHENIMSIXTYFOUR',
 'CARDSHARKS',
 'INARREARS',
 'JOB',
 'IPSA',
 'BARE',
 'EASIER',
 'HAVEN',
 'TVSPOT',
 'BABYSIT',
 'MOC',
 'VOTED',
 'STOL',
 'ALGIERS',
 'TKO',
 'OCALA',
 'ERE',
 'FLORA',
 'OAFISH',
 'MADMEN',
 'ODIOUS',
 'CHOREOGRAPH',
 'GENIE',
 'GOBITWEEN',
 'GANGES',
 'ATONAL',
 'ACDC',
 'ONEALS',
 'PUSSY',
 'EDIT',
 'ARTY',
 'LUCAS',
 'SEEKASYLUM',
 'SATAY',
 'EELS',
 'SEAM',
 'SPARE',
 'IPO',
 'STEELS',
 'SOIREE',
 'COLERIDGE',
 'STOCK',
 'ABUSHELANDAPECK',
 'GOVERNMENTBILLS',
 'UMPTEEN',
 'SPORADIC',
 '

### LOADING MODEL

In this section, we use `BertForMaskedLM`, the BERT variant with a masked language modeling (MLM) head. The model was pre-trained on a large corpus of unlabeled text and predicts missing tokens in a sentence. Our goal is to fine-tune it to predict a `[MASK]` token placed at the end of a clue. This maps directly onto crossword solving: the masked token stands for the answer to the clue.

In [17]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's see how the model does without fine-tuning.

In [18]:
# Prepare the text with a masked token
text = "capital of France [MASK]."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Run the model without gradients to get logits for every position
with torch.no_grad():
    outputs = model(input_ids)
    predictions = outputs.logits

# Get the predicted token (we take the first [MASK] token found in the input)
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
predicted_token_id = predictions[0, mask_token_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)

print(predicted_token)


france


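The pretrained model returns `france`, which repeats a word from the clue rather than giving the expected answer, `paris`. It produces a plausible token for the sentence but does not yet treat the input as a clue to solve, which is what fine-tuning is meant to address.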

### Building the BERTDataset

Data Preparation: Each crossword clue is transformed into a sentence with a [MASK] token at the end. For instance, the clue "Capital of France" is reformatted to "Capital of France [MASK]." This prepares our data for the specific task of predicting the word that logically concludes the clue.

In [20]:
import torch
from torch.utils.data import Dataset

class BERTDataset(Dataset):
    def __init__(self, tokenizer, clues, answers=None, max_length=128):
        self.tokenizer = tokenizer
        self.clues = clues
        self.answers = answers  # The correct answers for each clue
        self.max_length = max_length

    def __len__(self):
        return len(self.clues)

    def __getitem__(self, idx):
        clue = self.clues[idx]
        answer = self.answers[idx] if self.answers else None
        
        # Formulate the text with a [MASK] token at the end
        text_with_mask = clue + " [MASK]."

        # Tokenize the text
        encoding = self.tokenizer(
            text_with_mask,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        input_ids = encoding['input_ids'].squeeze(0)  # Remove the batch dimension
        attention_mask = encoding['attention_mask'].squeeze(0)

        # Start with every position ignored by the loss (-100)
        labels = torch.full_like(input_ids, -100)

        if answer is not None:
            # Get the index of the [MASK] token (empty if truncation removed it)
            mask_index = (input_ids == self.tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

            # Convert the answer to token IDs; only the first word piece is used as the target
            answer_token_ids = self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(answer))
            target_id = answer_token_ids[0] if answer_token_ids else self.tokenizer.unk_token_id

            # Label only the [MASK] position, so a clue token that shares the
            # answer's ID is never scored by the loss
            labels[mask_index] = target_id

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

train_dataset = BERTDataset(tokenizer, train_sources, train_targets, max_length=128)
val_dataset = BERTDataset(tokenizer, val_sources, val_targets, max_length=128)
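The intended labeling rule can be illustrated without a tokenizer: every position is ignored by the loss (label `-100`) except the `[MASK]` position, which carries the answer's token ID. The sketch below uses plain lists; `build_labels` is a hypothetical helper and the token IDs are illustrative values, not real vocabulary entries.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_labels(input_ids, mask_token_id, answer_token_id):
    """Return labels that score only the [MASK] position (hypothetical helper)."""
    return [
        answer_token_id if token_id == mask_token_id else IGNORE_INDEX
        for token_id in input_ids
    ]


# Illustrative IDs only; real values come from the tokenizer's vocabulary.
CLS, SEP, PAD, MASK = 1, 2, 0, 3
clue_ids = [11, 12, 13, 14]
answer_id = 12  # deliberately equal to one of the clue tokens

# [CLS] clue tokens [MASK] "." [SEP] padding
input_ids = [CLS] + clue_ids + [MASK, 15, SEP, PAD, PAD]
labels = build_labels(input_ids, MASK, answer_id)

print(input_ids)
print(labels)
```

Selecting the labeled position by the mask ID, rather than by comparing label values against the answer ID, keeps a clue token that happens to share the answer's ID from being scored.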

Let's check that the data is organized as planned:

In [11]:
for i in range(3):
    sample = train_dataset[i]
    print("Tokens:", tokenizer.convert_ids_to_tokens(sample['input_ids']))
    print("Input IDs:", sample['input_ids'])
    print("Labels:", sample['labels'])
    print("Attention Mask:", sample['attention_mask'])
    print("\n")

Tokens: ['[CLS]', 'line', 'at', 'an', 'airport', '[MASK]', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...]
[output truncated; the remaining tokens are '[PAD]' up to max_length=128]

In this setup, we use Hugging Face's `Trainer` interface with parameters chosen to balance **training efficiency**, **model performance**, and **resource usage**: **3 epochs**; a **training batch size of 8** per device to limit GPU memory use; a **larger evaluation batch size of 16** for faster validation, which is possible because evaluation stores no gradients; **500 warmup steps**, during which the learning rate ramps up before it decays; a **light weight decay of 0.005** for regularization; and **logging every 10 steps** for close monitoring and quicker debugging during training.
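To show what the 500 warmup steps do to the learning rate, here is a small sketch of a linear warmup followed by a linear decay, which is the shape of the default `Trainer` schedule. It assumes the `Trainer` default base learning rate of `5e-5` (the notebook does not set one) and takes the total step count from the `global_step` reported at the end of training. `linear_warmup_lr` is a hypothetical helper that mirrors the shape of the schedule; it is not the library's implementation.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=81195):
    """Learning rate at `step`: linear ramp to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 40000, 81195):
    print(f"step {step:>6}: lr = {linear_warmup_lr(step):.3e}")
```

Under these assumptions, the warmup covers well under 1% of the run, so it mainly protects the first few hundred updates from large steps while the pretrained weights adapt to the new input format.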

In [12]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.005,              # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Start training
trainer.train()

2024-06-18 12:29:07.302538: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33meliotullmo5[0m ([33mnlp_crossword[0m). Use [1m`wandb login --relogin`[0m to force relogin




Step,Training Loss
10,11.4449
20,11.0204
30,10.1498
40,10.2654
50,9.7392
60,9.0179
70,8.6933
80,8.6008
90,8.3441
100,8.0404




OSError: [Errno 122] Disk quota exceeded

--- Logging error ---
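One plausible cause of the `Disk quota exceeded` error, stated here as an assumption since the traceback is not shown, is checkpoint accumulation: the training arguments above do not set a save policy, so `Trainer` falls back to its default of writing a checkpoint every 500 steps and keeping all of them. The sketch below estimates how many checkpoints that produces over the run, using the total step count reported at the end of this notebook, and collects the `TrainingArguments` options that would cap the number kept. `checkpoints_on_disk` is a hypothetical helper.

```python
def checkpoints_on_disk(total_steps, save_steps=500, save_total_limit=None):
    """Estimate how many checkpoint folders remain after training."""
    written = total_steps // save_steps
    if save_total_limit is None:
        # No limit: every checkpoint written stays on disk
        return written
    # With a limit, older checkpoints are deleted as new ones are written
    return min(written, save_total_limit)


# Keyword arguments that could be passed to TrainingArguments (a suggestion,
# not the configuration used in this notebook).
checkpoint_kwargs = {
    "save_strategy": "steps",
    "save_steps": 500,
    "save_total_limit": 2,
}

print("no limit:", checkpoints_on_disk(81195))
print("limit of 2:", checkpoints_on_disk(81195, save_total_limit=2))
```

These options would be added with `TrainingArguments(..., **checkpoint_kwargs)`. Keeping at least the latest checkpoint still allows an interrupted run to be resumed.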


In [23]:
# Resume the interrupted run from the last saved checkpoint
trainer.train(resume_from_checkpoint="results/checkpoint-63000")

There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


Step,Training Loss
63010,3.1906
63020,3.4416
63030,2.8699
63040,2.8855
63050,3.2432
63060,3.3826
63070,3.2644
63080,3.2579
63090,2.7806
63100,3.035




TrainOutput(global_step=81195, training_loss=0.6836696220946228, metrics={'train_runtime': 3969.1607, 'train_samples_per_second': 327.299, 'train_steps_per_second': 20.456, 'total_flos': 8.5496161329792e+16, 'train_loss': 0.6836696220946228, 'epoch': 3.0})
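The reported `training_loss` of about 0.68 is much lower than the values near 3 logged just before, so a quick sanity check on the `TrainOutput` numbers is useful. The sketch below rests on an assumption about how `Trainer` reports the figure after a resume: that the loss accumulated in the resumed session is divided by the total `global_step`, including the 63,000 steps that were skipped. That is a reading of the library's behavior that the output above does not confirm.

```python
# Values copied from the TrainOutput and the resume call above
global_step = 81195
epochs = 3.0
resumed_from = 63000
reported_train_loss = 0.6836696220946228
samples_per_second = 327.299
steps_per_second = 20.456

# Optimizer steps in one pass over the training set
steps_per_epoch = global_step / epochs

# Effective number of samples consumed per optimizer step
samples_per_step = samples_per_second / steps_per_second

# Assumption: the reported loss was averaged over all steps, although only
# the resumed steps contributed, so rescale it to the resumed session.
steps_this_session = global_step - resumed_from
rescaled_loss = reported_train_loss * global_step / steps_this_session

print(f"steps per epoch:          {steps_per_epoch:.0f}")
print(f"samples per step:         {samples_per_step:.1f}")
print(f"steps in resumed session: {steps_this_session}")
print(f"rescaled average loss:    {rescaled_loss:.2f}")
```

If the assumption holds, the average loss over the resumed steps is close to 3, in line with the logged values, and 0.68 should not be read as the final training loss. The samples-per-step figure comes out at roughly twice `per_device_train_batch_size`, which would be consistent with two devices or with gradient accumulation; the notebook does not say which.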