<a href="https://colab.research.google.com/github/cld0033/Tone_It_Down/blob/main/allison_transformers_custom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom fine-tuneable model for Tone it Down

This is the Python notebook used to generate a custom Transformers model for the Tone it Down app, based on a Hugging Face dataset.

## Install relevant packages:

In [1]:
#install if not installed; -q hides most of the output
#fsspec is pinned because gcsfs 2024.10.0 requires fsspec==2024.10.0
!pip install fsspec==2024.10.0 -q
!pip install datasets -q
!pip install transformers -q
!pip install torch -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.

In [15]:
#import relevant libraries
import torch
import datasets
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import numpy as np

## Load the T5 tokenizer and model

In [3]:
#load a tokenizer and training model
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



## Load and pre-process a dataset from Hugging Face

The preprocess function tokenizes the "text" column and attaches the "label" values to the tokenized output as "labels".

The labels are also mapped from strings to integers, because training raises an error when they are strings.
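
The string-to-integer mapping described above can be sketched in plain Python. This is a minimal illustration rather than the notebook's exact code: the sample labels are illustrative values, and `build_label_mapping` is a hypothetical helper. Sorting the labels first keeps the ids reproducible between runs, which iterating over a `set` of strings does not guarantee.

```python
def build_label_mapping(labels):
    """Map each distinct label string to an integer id, starting at 1."""
    # sorted() gives a stable order; plain set iteration order can change between runs
    return {label: idx for idx, label in enumerate(sorted(set(labels)), start=1)}

sample_labels = ["joy", "anger", "sadness", "joy", "worry"]  # illustrative values only
label_to_id = build_label_mapping(sample_labels)
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in sample_labels]
print(label_to_id)
print(encoded)
```

The inverse dictionary (`id_to_label`) is handy later for turning predicted ids back into tone names.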

In [4]:
#load a dataset
dataset = datasets.load_dataset("uhoui/text-tone-classifier")
#https://huggingface.co/datasets/uhoui/text-tone-classifier/viewer/default/train?f%5Bidx%5D%5Bmin%5D=80&f%5Bidx%5D%5Bmax%5D=90 <-- I used this one bc of the dataset() function

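#tokenize dataset and prepare it for training
def preprocess_function(examples):
    # update_label() has already replaced the string labels with integers,
    # so they are used directly; looking them up in label_mapping again
    # would silently add new entries to the defaultdict
    labels = [int(label) for label in examples["label"]]
    inputs = examples["text"]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels
    return model_inputs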


In [5]:
from collections import defaultdict

# Create a dictionary to store the mapping from string labels to integers
label_mapping = defaultdict(lambda: len(label_mapping) + 1)

# Extract unique labels from the original dataset
unique_labels = set(dataset['train']['label'])  # Using 'dataset' instead of 'filter_dataset'

# Ensure all unique labels are in label_mapping
for label in unique_labels:
    _ = label_mapping[label]  # Accessing the label adds it to the mapping

# Update the 'label' column in the original dataset
def update_label(example):
    example['label'] = label_mapping[example['label']]
    return example

filter_dataset = dataset.map(update_label)


In [6]:
print(filter_dataset)
print("header: \n", dataset['train'].take(10).to_pandas())
print("header: \n", filter_dataset['train'].take(10).to_pandas())

print(unique_labels)

DatasetDict({
    train: Dataset({
        features: ['idx', 'text', 'label'],
        num_rows: 491
    })
})
header: 
    idx                                               text           label
0    0  I am absolutely thrilled with the service I re...             joy
1    1     It's frustrating when the meeting starts late!           anger
2    2  The news about the community event has left me...         sadness
3    3  Wow, I didn't expect to see my friends here to...        surprise
4    4  I'm really worried about the upcoming exams. I...           worry
5    5  The concert last night was the best experience...             joy
6    6  The constant noise from the construction site ...       annoyance
7    7  I can't believe how beautifully the sunset loo...             awe
8    8  Such a shame that the project was canceled, I ...  disappointment
9    9  The chocolate cake was a delightful surprise a...        pleasure
header: 
    idx                                               te

In [7]:
#run preprocess function on dataset
tokenized_dataset = filter_dataset.map(preprocess_function, batched=True)



## Split the dataset into a training and validation set

In [8]:
#manually created splits
num_samples = len(tokenized_dataset['train'])
train_indices, val_indices = train_test_split(range(num_samples), test_size=0.2, random_state=42)

# Create train and validation datasets using select
train_dataset = tokenized_dataset['train'].select(train_indices)
val_dataset = tokenized_dataset['train'].select(val_indices)

# Create a DatasetDict with separate splits
split_dataset = datasets.DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})

In [9]:
#verify that split happened
print("unsplit dataset: \n", tokenized_dataset)
print("split dataset: \n", split_dataset)


unsplit dataset: 
 DatasetDict({
    train: Dataset({
        features: ['idx', 'text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 491
    })
})
split dataset: 
 DatasetDict({
    train: Dataset({
        features: ['idx', 'text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 392
    })
    validation: Dataset({
        features: ['idx', 'text', 'label', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 99
    })
})
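
The 392/99 split follows from how the sizes are rounded. The sketch below reproduces the arithmetic in plain Python, assuming (as scikit-learn's `train_test_split` does, to my understanding) that the test share is rounded up and the remainder goes to training. `split_sizes` is a hypothetical helper, and the shuffle only illustrates the idea; it will not reproduce scikit-learn's exact indices.

```python
import math
import random

def split_sizes(num_samples, test_size):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(num_samples * test_size)
    return num_samples - n_test, n_test

n_train, n_test = split_sizes(491, 0.2)
print(n_train, n_test)

# Shuffle the row indices with a fixed seed, then cut them into two disjoint groups
indices = list(range(491))
random.Random(42).shuffle(indices)
val_idx, train_idx = indices[:n_test], indices[n_test:]
```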


## Train the model
1. Define the training arguments. These can be modified for fine-tuning, although I don't fully understand their effect yet.
2. Collate the data using torch. There is a lot of shape manipulation because the train function kept throwing datatype errors.
3. Write a custom compute-metrics function so the Trainer reports accuracy whenever it evaluates the model.
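
Steps 2 and 3 can be illustrated without torch. The sketch below uses plain Python lists and made-up values: each integer label becomes the first position of a row that is otherwise filled with -100, and accuracy is computed only over positions that are not -100. `pad_labels` and `masked_accuracy` are hypothetical helpers, not part of the Trainer API.

```python
IGNORE_INDEX = -100  # positions with this value are skipped by the loss and the metric

def pad_labels(label_ids, seq_len):
    """Turn each label id into a row of length seq_len: [label, -100, -100, ...]."""
    return [[label] + [IGNORE_INDEX] * (seq_len - 1) for label in label_ids]

def masked_accuracy(predictions, labels):
    """Share of correct predictions, counting only positions whose label is not -100."""
    pairs = [
        (pred, lab)
        for pred_row, label_row in zip(predictions, labels)
        for pred, lab in zip(pred_row, label_row)
        if lab != IGNORE_INDEX
    ]
    return sum(pred == lab for pred, lab in pairs) / len(pairs)

labels = pad_labels([3, 7], seq_len=4)
predictions = [[3, 0, 0, 0], [5, 9, 9, 9]]  # toy per-position argmax results
print(labels)
print(masked_accuracy(predictions, labels))
```

Only the first position of each row counts here, so one correct and one wrong prediction give an accuracy of 0.5 regardless of what the padded positions contain.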

In [10]:
#fine tune model using trainer API via hugging face
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)


In [35]:
# Define a custom data collator
def custom_data_collator(data):
    # Convert input_ids and labels to tensors
    input_ids = torch.stack([torch.tensor(x['input_ids'], dtype=torch.long) for x in data])
    attention_mask = torch.stack([torch.tensor(x['attention_mask'], dtype=torch.long) for x in data])
    # Get the labels
    labels = torch.tensor([x['labels'] for x in data], dtype=torch.long)
    # Instead of creating decoder_input_ids, pad the labels to match the input_ids shape
    # We pad with -100, which is the ignore_index for the cross-entropy loss in Hugging Face
    labels = labels.unsqueeze(1)  # Add a dimension for sequence length
    labels = torch.nn.functional.pad(labels, (0, input_ids.shape[1] - 1), value=-100)

    # Return dictionary of tensors
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels,
    }

In [51]:
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # For this model the predictions arrive as a tuple of outputs;
    # the first element holds the logits
    logits = eval_pred.predictions[0]
    labels = eval_pred.label_ids

    # Get the predicted token id for each position (argmax over the vocabulary)
    predictions = np.argmax(logits, axis=-1)

    # Ignore positions where the label is -100 (padding added by the collator)
    valid = labels != -100

    # Calculate and return the metric
    accuracy = accuracy_score(labels[valid], predictions[valid])
    return {"accuracy": accuracy}


In [52]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["validation"],
    data_collator=custom_data_collator,
    compute_metrics=compute_metrics # Pass the function here
)

In [53]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 7.086746 | 0.0 |
| 2 | No log | 6.968014 | 0.0 |
| 3 | No log | 6.933857 | 0.0 |


TrainOutput(global_step=75, training_loss=6.903486328125, metrics={'train_runtime': 1398.3882, 'train_samples_per_second': 0.841, 'train_steps_per_second': 0.054, 'total_flos': 39790489632768.0, 'train_loss': 6.903486328125, 'epoch': 3.0})
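
The step count and the `No log` entries can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, and it assumes `logging_steps` was left at its default, which I believe is 500 for `TrainingArguments`. With 392 training rows and a batch size of 16 there are 25 steps per epoch and 75 in total, which matches `global_step=75`; that is fewer than 500, so the training loss is never logged.

```python
import math

train_rows = 392
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 500    # assumed default, not set in this notebook

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
print("training loss logged:", total_steps >= logging_steps)
```

Setting `logging_steps` to a smaller value (for example 10) in `TrainingArguments` should make the Training Loss column show numbers.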

## Understanding the results:
(pasted from Gemini)
### Accuracy
An accuracy of 0 means your model is not making any correct predictions on your evaluation data. This is a strong indicator that something is not right with your training process. It could be related to various factors, including:

- Incorrect Data Preparation: Make sure your data is correctly preprocessed and formatted for the model's input. Check for issues like incorrect labels, data type mismatches, or improper tokenization.
- Model Initialization: If your model's weights are initialized poorly, it might struggle to learn effectively.
- Hyperparameter Tuning: Certain hyperparameters, such as the learning rate, can significantly impact training. You might need to experiment with different values.
- Model Architecture: The model architecture might not be suitable for your task. You might need to consider using a different model or modifying the existing one.
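
For this notebook, the data-preparation point looks like the most likely culprit, though I have not verified it: T5 is a text-to-text model, so its `labels` are expected to be token ids of a target string, while the collator above feeds it integer class ids. A minimal sketch of the text-to-text framing is shown below in plain Python. The `classify tone: ` prefix and the `to_text_pair` helper are hypothetical; the target strings would still need to be tokenized (for example with the tokenizer's `text_target` argument) before training.

```python
TASK_PREFIX = "classify tone: "  # hypothetical prompt prefix

def to_text_pair(example):
    """Build the (source, target) strings for one row of the dataset."""
    return {
        "source": TASK_PREFIX + example["text"],
        "target": example["label"],  # the label stays a string, e.g. "anger"
    }

row = {"idx": 1, "text": "It's frustrating when the meeting starts late!", "label": "anger"}
pair = to_text_pair(row)
print(pair["source"])
print(pair["target"])
```

With this framing the model learns to generate the tone name itself, so the string-to-integer mapping would no longer be needed for training.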

### Validation Loss
The validation loss is a measure of how well your model is performing on unseen data (the validation set). A high validation loss indicates that the model is not generalizing well to new data. In your case, a consistent loss of around 7 suggests that the model is not learning effectively.
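
To put a loss of about 7 in context, it can be compared with the cross-entropy of guessing uniformly over the vocabulary. The sketch below assumes a vocabulary of 32,128 tokens, which is what I understand the `t5-small` config to use; the loss value is the final validation loss reported by the training run above.

```python
import math

vocab_size = 32128           # assumed t5-small vocabulary size
validation_loss = 6.933857   # epoch 3 validation loss reported by the Trainer

# Cross-entropy of a model that assigns equal probability to every token
uniform_loss = math.log(vocab_size)

# Perplexity: roughly how many tokens the model is still "choosing between"
perplexity = math.exp(validation_loss)

print(round(uniform_loss, 2))
print(round(perplexity))
```

Under these assumptions the model does somewhat better than uniform guessing, but it still spreads its probability over roughly a thousand tokens at the label position.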

## Export the model and convert to ONNX

In [54]:
#Export fine tune model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/spiece.model',
 './fine_tuned_model/added_tokens.json')

Below is the code to convert the exported model to ONNX.

In [None]:
!pip install onnx onnxruntime

In [None]:
import onnxruntime as rt
import onnx

In [None]:
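# Assumption: the model was saved to this directory in the export cell above
model_path = "./fine_tuned_model"

text = "This is a test sentence."
inputs = tokenizer(text, return_tensors="pt")

# T5 is an encoder-decoder model, so the forward pass also needs decoder inputs.
# Start the decoder with the model's start token (untested sketch).
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

# Export on CPU in eval mode; disable the key/value cache so it is not traced as an output
model.to("cpu").eval()
model.config.use_cache = False

# Dynamic axes for variable input lengths
dynamic_axes = {
    'input_ids': {0: 'batch_size', 1: 'sequence_length'},
    'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
    'decoder_input_ids': {0: 'batch_size', 1: 'decoder_sequence_length'},
    'output': {0: 'batch_size', 1: 'decoder_sequence_length'}
}

# Export the model
torch.onnx.export(
    model,
    args=(inputs["input_ids"], inputs["attention_mask"], decoder_input_ids),
    f=model_path + "/model.onnx",  # Output ONNX file path
    input_names=['input_ids', 'attention_mask', 'decoder_input_ids'],  # Input tensor names
    output_names=['output'],  # Output tensor name
    dynamic_axes=dynamic_axes,  # Enable dynamic axes
    opset_version=13,  # Choose an appropriate opset version
)

In [None]:
# Load the ONNX model
ort_session = rt.InferenceSession(model_path + "/model.onnx")

# Build the ONNX Runtime inputs, keyed by the input names stored in the exported graph
feed = {
    'input_ids': inputs["input_ids"].cpu().numpy(),
    'attention_mask': inputs["attention_mask"].cpu().numpy(),
    'decoder_input_ids': decoder_input_ids.cpu().numpy(),
}
ort_inputs = {i.name: feed[i.name] for i in ort_session.get_inputs()}

# Run inference with the ONNX Runtime
ort_outs = ort_session.run(None, ort_inputs)

# Compare ONNX Runtime output with PyTorch output (optional)
# ...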