# Training and Tuning a Pre-trained Large Language Model in Google Colab
Charlotte Roscoe


**This project is an attempt at training a large language model for our specific use case: the input is a word that could appear in a column of a database, and the output is an integer sensitivity score on a scale of 0 to 100.**

## Step 1: Installing Necessary Libraries

For this project, we need to install three libraries:

**transformers**
- A library developed by Hugging Face that provides pre-trained machine learning models, particularly for natural language processing. It includes models such as BERT and GPT, which can be used for tasks such as text classification, translation, and summarization.

**datasets**
- Another library developed by Hugging Face, which gives access to datasets for a wide range of machine learning tasks. It lets users load, preprocess, and manipulate data easily.

**torch**
- PyTorch, an open-source machine learning library originally developed by Facebook's AI Research lab (now part of Meta AI). It is widely used for developing and training deep learning models.

In [1]:
!pip install transformers
!pip install datasets
!pip install torch

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1
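
Once the installs finish, it can be useful to confirm which versions ended up in the runtime, since Colab images change over time. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three libraries installed in Step 1.
for name, ver in installed_versions(["transformers", "datasets", "torch"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Recording these versions alongside the results makes it easier to reproduce a run later.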

## Step 2: Training our Model

In [14]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the training data; data.csv is expected to contain 'word' and 'sensitivity_score' columns
df = pd.read_csv('data.csv')

class SensitivityDataset(Dataset):
    """Custom dataset that pairs each word with its sensitivity score.

    __init__ stores the texts, labels, tokenizer, and maximum length for tokenization.
    __len__ returns the number of samples in the dataset.
    __getitem__ retrieves the text and label at a given index, tokenizes the text,
    and returns a dictionary with the tokenized inputs and the label.
    """
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        # Drop only the batch dimension that return_tensors='pt' adds
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item

'''
Initializes the tokenizer and model from the pre-trained checkpoint distilbert-base-uncased.
-> A tokenizer converts human-readable text into token IDs that the model can process.
The model is a sequence classification model with a single output; with num_labels=1,
Transformers treats the task as regression and trains with mean squared error loss.
'''
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# prep the dataset
max_len = 128
dataset = SensitivityDataset(df['word'].tolist(), df['sensitivity_score'].tolist(), tokenizer, max_len)

'''
Defines the training arguments:
output_dir specifies where to save model checkpoints.
num_train_epochs sets the number of training epochs to 18.
-> An epoch is one complete pass through the entire training dataset.
per_device_train_batch_size and per_device_eval_batch_size set the per-device batch sizes for training and evaluation
(no evaluation dataset is passed to the Trainer below, so the evaluation batch size is not used here).
warmup_steps specifies the number of steps over which the learning rate ramps up from zero.
weight_decay sets the weight decay used for regularization.
logging_dir specifies where to save logs.
logging_steps sets how many steps pass between log entries.
'''
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=18,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10
)

# Initialize the Trainer (training data only; no evaluation dataset is provided)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer so they can be reloaded in Step 3
model.save_pretrained('./sensitivity_model')
tokenizer.save_pretrained('./sensitivity_model')


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|------|---------------|
| 10   | 1708.7734     |
| 20   | 1721.4004     |
| 30   | 1808.6834     |
| 40   | 1827.393      |
| 50   | 1773.7072     |
| 60   | 1797.5129     |
| 70   | 1772.7496     |
| 80   | 1693.5805     |
| 90   | 1964.0676     |
| 100  | 1340.7682     |
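
With `num_labels=1`, Transformers computes a mean squared error loss, so the values in the training log above are in squared score units. Taking the square root gives a rough sense of the typical error in points on the 0 to 100 scale. The sketch below does that arithmetic for the first and last logged values. The function name `rmse_from_mse` is illustrative and not part of the notebook.

```python
import math


def rmse_from_mse(mse_loss):
    """Convert a mean-squared-error loss into a root-mean-squared error."""
    return math.sqrt(mse_loss)


# Values copied from the training log above.
logged = {10: 1708.7734, 100: 1340.7682}
for step, loss in logged.items():
    print(f"step {step}: MSE {loss:.1f} -> RMSE about {rmse_from_mse(loss):.1f} points")
```

This works out to roughly 41 points at step 10 and 37 points at step 100, which suggests that predictions were still far from the labels during the logged steps.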


('./sensitivity_model/tokenizer_config.json',
 './sensitivity_model/special_tokens_map.json',
 './sensitivity_model/vocab.txt',
 './sensitivity_model/added_tokens.json',
 './sensitivity_model/tokenizer.json')
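
The training log above ends at step 100, while `warmup_steps` is set to 500. If the whole run has only about a hundred optimizer steps, the learning rate never leaves the warmup ramp. The dataset size is not stated, so this is an assumption. The sketch below works through the arithmetic for a hypothetical dataset of 96 rows, a single device with no gradient accumulation, the default `TrainingArguments` learning rate of 5e-5, and linear warmup. Both helper functions are illustrative and ignore the decay phase that follows warmup.

```python
import math


def total_training_steps(num_rows, batch_size, epochs):
    """Optimizer steps for a single device with no gradient accumulation."""
    return math.ceil(num_rows / batch_size) * epochs


def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during linear warmup, capped at base_lr once warmup ends."""
    return base_lr * min(1.0, step / max(1, warmup_steps))


num_rows = 96  # hypothetical; the notebook does not state the size of data.csv
steps = total_training_steps(num_rows, batch_size=16, epochs=18)
final_lr = warmup_lr(steps, base_lr=5e-5, warmup_steps=500)
print(f"total steps: {steps}, learning rate at the last step: {final_lr:.2e}")
```

Under these assumptions the learning rate peaks at about a fifth of its configured value, which may contribute to the loss staying high. One common adjustment is to set `warmup_steps` to a small fraction of the total step count.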

## Step 3: Querying our Model

In [20]:
# Import libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# load saved model
tokenizer = AutoTokenizer.from_pretrained('./sensitivity_model')
model = AutoModelForSequenceClassification.from_pretrained('./sensitivity_model')

# Tokenizes the input text and returns the model's single regression output as the sensitivity score.
def get_sensitivity_score(text):
    encoding = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        padding='max_length',
        max_length=128
    )
    with torch.no_grad():
        outputs = model(**encoding)
    score = outputs.logits.item()
    return score

# Test words (in the real system, these would be the column names retrieved by querying the OpenSearch index)
words = ["Classified", "Address", "Apple", "Cheese"]

highest_number = float('-inf')  # Start with the lowest possible value
highest_word = None

# Loop through each word in the list, getting the sensitivity score for each and storing the highest
for word in words:
    number = get_sensitivity_score(word)
    if number > highest_number:
        highest_number = number
        highest_word = word

# Printing the highest score
print(f"Highest Sensitivity Score: {round(highest_number)}")

Highest Sensitivity Score: 92
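
The regression head returns an unbounded float, whereas the stated goal is an integer from 0 to 100. A minimal post-processing sketch follows. `to_sensitivity_score` is a hypothetical helper, not part of the notebook above, and clamping is only one possible way to handle out-of-range outputs.

```python
def to_sensitivity_score(raw, low=0, high=100):
    """Round a raw model output and clamp it to the integer range [low, high]."""
    return max(low, min(high, round(raw)))


# Hypothetical raw outputs, for illustration only.
for raw in (91.6, -3.2, 140.0):
    print(raw, "->", to_sensitivity_score(raw))
```

Applying such a helper to the return value of `get_sensitivity_score` would keep every reported score within the intended range.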
