<a href="https://colab.research.google.com/github/cld0033/Tone_It_Down/blob/main/transformers_custom_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Custom fine-tuneable model for Tone it Down

This is the python notebook used to generate a custom Transformers model for the Tone it Down app based on a Hugging Face dataset.

## Install relevant packages:

In [480]:
#install if not installed; hide output
!pip install fsspec==2024.10.0 #gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.
!pip install datasets -q
!pip install transformers -q
!pip install torch -q
!pip install onnx onnxruntime -q
!pip install optimum -q
!pip install Cmake -q

Collecting fsspec==2024.10.0
  Using cached fsspec-2024.10.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.10.0-py3-none-any.whl (179 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.9.0
    Uninstalling fsspec-2024.9.0:
      Successfully uninstalled fsspec-2024.9.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.1.0 requires fsspec[http]<=2024.9.0,>=2023.1.0, but you have fsspec 2024.10.0 which is incompatible.[0m[31m
[0mSuccessfully installed fsspec-2024.10.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.[0m[31m
[0m

In [481]:
#import relevant libraries
import torch
import datasets
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments, EncoderDecoderCache, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
import numpy as np

## Load the T5 tokenizer and model

Used bert-base-uncased model for classification. Since there were 67 unique labels in the dataset, loaded the model with num_labels=67

In [482]:
#load a tokenizer and training model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model with 67 labels
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=67)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Load and pre-process a dataset from Huggingface

1. Cleaned the dataset by removing the idx column and splitting into a training and validation set
2. Mapped each unique label to a unique integer, which is compatible with machine learning models (note - mapping is stored in label_mapping)
3. Applied a preprocess function that should tokenize what's in the "text"

In [483]:
#load a dataset
dataset = datasets.load_dataset("uhoui/text-tone-classifier")
#https://huggingface.co/datasets/uhoui/text-tone-classifier/viewer/default/train?f%5Bidx%5D%5Bmin%5D=80&f%5Bidx%5D%5Bmax%5D=90 <-- I used this one bc of the dataset() function

In [484]:
#remove idx column
dataset = dataset.remove_columns("idx")

#manually created splits
num_samples = len(dataset['train'])
train_indices, val_indices = train_test_split(range(num_samples), test_size=0.2, random_state=42)

# Create train and validation datasets using select
train_dataset = dataset['train'].select(train_indices)
val_dataset = dataset['train'].select(val_indices)

# Create a DatasetDict with separate splits
split_dataset = datasets.DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})

In [485]:
#verify that split happened
print("split dataset: \n", split_dataset)

split dataset: 
 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 392
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 99
    })
})


In [486]:
# Verify unique labels across both train and validation sets
unique_labels_train = set(split_dataset['train']['label'])
unique_labels_val = set(split_dataset['validation']['label'])

print(f"Unique labels in train set: {unique_labels_train}")
print(f"Unique labels in validation set: {unique_labels_val}")

# Combine the labels from both sets to see the full range
unique_labels_all = unique_labels_train.union(unique_labels_val)
print(f"Unique labels in the entire dataset: {unique_labels_all}")
print(len(unique_labels_all))

Unique labels in train set: {'awe', 'adventure', 'joy', 'chill', 'bliss', 'nostalgia', 'comfort', 'compassion', 'concern', 'adrenaline', 'connection', 'Neutral', 'worry', 'pleasure', 'bittersweetness', 'parting', 'despair', 'positive', 'Serious', 'accomplishment', 'sorrow', 'disgust', 'courage', 'love', 'determination', 'peace', 'tension', 'longing', 'Negative', 'admiration', 'Positive', 'surprise', 'amusement', 'empathy', 'restlessness', 'relief', 'stress', 'anticipation', 'negative', 'depression', 'excitement', 'fear', 'tranquility', 'annoyance', 'inspiration', 'grief', 'pride', 'disappointment', 'pain', 'sadness', 'Sarcastic', 'gratitude', 'anxiety', 'happiness', 'neutral', 'anger', 'loneliness', 'reflection', 'contentment', 'curiosity'}
Unique labels in validation set: {'awe', 'joy', 'nostalgia', 'comfort', 'concern', 'Neutral', 'worry', 'pleasure', 'Serious', 'positive', 'disgust', 'hope', 'appreciation', 'love', 'heat', 'Negative', 'Positive', 'surprise', 'empathy', 'relief', 'st

In [488]:
from collections import defaultdict

# Initialize label_mapping as a defaultdict to avoid key errors
label_mapping = defaultdict(lambda: None)  # Default value is None
current_label_index = 0

# Function to update the labels with integer mapping
def update_label(examples):
    global current_label_index

    # Iterate through the labels and assign integer values
    for i, label in enumerate(examples['label']):
        if label not in label_mapping:
            print(f"Adding label: {label}")
            label_mapping[label] = current_label_index
            current_label_index += 1  # Increment index for the next new label
        examples['label'][i] = label_mapping[label]

    return examples

# Apply label update to both the train and validation sets
tokenized_dataset = split_dataset.map(update_label, batched=True)

# Print the outputs to inspect the label mapping
print("preview of dataset: ", tokenized_dataset["validation"][1:10])
print("labels: ", unique_labels_all)
print("label mapping: ", dict(label_mapping))  # Convert defaultdict to dict for printing


Map:   0%|          | 0/392 [00:00<?, ? examples/s]

Adding label: fear
Adding label: Neutral
Adding label: Positive
Adding label: Sarcastic
Adding label: annoyance
Adding label: restlessness
Adding label: Serious
Adding label: empathy
Adding label: sadness
Adding label: positive
Adding label: neutral
Adding label: disgust
Adding label: negative
Adding label: joy
Adding label: awe
Adding label: anger
Adding label: peace
Adding label: longing
Adding label: disappointment
Adding label: worry
Adding label: reflection
Adding label: relief
Adding label: surprise
Adding label: stress
Adding label: grief
Adding label: Negative
Adding label: concern
Adding label: happiness
Adding label: connection
Adding label: tranquility
Adding label: tension
Adding label: determination
Adding label: inspiration
Adding label: bittersweetness
Adding label: excitement
Adding label: sorrow
Adding label: pleasure
Adding label: adventure
Adding label: amusement
Adding label: anticipation
Adding label: compassion
Adding label: anxiety
Adding label: pride
Adding labe

Map:   0%|          | 0/99 [00:00<?, ? examples/s]

Adding label: appreciation
Adding label: hope
Adding label: serenity
Adding label: rejection
Adding label: heat
Adding label: elation
preview of dataset:  {'text': ["I can't stop thinking about the negative consequences of climate change on future generations. It's truly worrisome.", 'Absolutely thrilled to be stuck in traffic.', "I'm positively overjoyed to spend hours waiting for a friend who's always late.", 'Review the latest financial report for our meeting next week.', 'Seeing a snake in the grass on a dark path gives me chills and I freeze up for a moment.', 'Just won the lottery! Best day ever!', 'Grateful for the overwhelming support during tough times.', 'Oh joy, my computer crashed again just as I was about to save my work.', 'I was utterly captivated by the enchanting performance.'], 'label': [19, 3, 3, 1, 0, 13, 9, 3, 9]}
labels:  {'adventure', 'joy', 'chill', 'bliss', 'nostalgia', 'comfort', 'concern', 'adrenaline', 'Neutral', 'worry', 'bittersweetness', 'parting', 'despa

In [422]:
print("preview of dataset: ", filter_dataset["validation"][1:10])
print("labels: ", unique_labels_all)
print("label mapping: ", label_mapping)

preview of dataset:  {'text': ["I can't stop thinking about the negative consequences of climate change on future generations. It's truly worrisome.", 'Absolutely thrilled to be stuck in traffic.', "I'm positively overjoyed to spend hours waiting for a friend who's always late.", 'Review the latest financial report for our meeting next week.', 'Seeing a snake in the grass on a dark path gives me chills and I freeze up for a moment.', 'Just won the lottery! Best day ever!', 'Grateful for the overwhelming support during tough times.', 'Oh joy, my computer crashed again just as I was about to save my work.', 'I was utterly captivated by the enchanting performance.'], 'label': [10, 36, 36, 9, 55, 2, 44, 36, 44]}
labels:  {'adventure', 'joy', 'chill', 'bliss', 'nostalgia', 'comfort', 'concern', 'adrenaline', 'Neutral', 'worry', 'bittersweetness', 'parting', 'despair', 'Serious', 'accomplishment', 'disgust', 'love', 'determination', 'peace', 'tension', 'longing', 'admiration', 'heat', 'surpr

In [423]:
print("pre-filter header: \n", split_dataset['train'].take(10).to_pandas())
print("post-filter header: \n", filter_dataset['train'].take(10).to_pandas())

pre-filter header: 
                                                 text         label
0  My chest tightens with the fear of confrontati...          fear
1  Today's menu includes salad, soup, and sandwic...       Neutral
2  I'm thrilled to announce that my application w...      Positive
3  Looking forward to the annual 'Employee of the...     Sarcastic
4  Why does this always happen to me? I can't see...     annoyance
5  My child's school performance has improved sig...      Positive
6  The relentless drumming of rain against the wi...  restlessness
7  The CEO's decision to lay off a significant po...       Serious
8  Seeing the volunteers helping to clean up the ...       empathy
9  The silence in the room is deafening, and it f...       sadness
post-filter header: 
                                                 text  label
0  My chest tightens with the fear of confrontati...     55
1  Today's menu includes salad, soup, and sandwic...      9
2  I'm thrilled to announce that my appl

In [424]:
# Assuming filter_dataset is a dictionary with 'train' and 'validation' subsets
# Apply preprocess function to each subset (train and validation)

def preprocess_function(examples):
    try:
        # Ensure consistent tokenizer usage
        global tokenizer

        # Input extraction
        inputs = examples["text"]

        # Tokenization with special token handling (if needed)
        model_inputs = tokenizer(
            inputs,
            max_length=128,
            truncation=True,
            padding="max_length",
            # add_special_tokens=True,  # Add if needed
        )

        # Label assignment
        model_inputs["labels"] = examples["label"]

        return model_inputs
    except Exception as e:
        print(f"Error during preprocessing: {e}")
        return None  # Or handle the error appropriately

# Apply the preprocess function to 'train' and 'validation' separately
tokenized_train = filter_dataset['train'].map(preprocess_function, batched=True)
tokenized_validation = filter_dataset['validation'].map(preprocess_function, batched=True)

# If needed, you can combine the processed datasets back into a dictionary
tokenized_dataset = {
    'train': tokenized_train,
    'validation': tokenized_validation
}


Some print statements to make sure things look ok

In [425]:
print(tokenized_dataset)

{'train': Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 392
}), 'validation': Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 99
})}


In [426]:
print(tokenized_dataset.keys())
print(tokenized_dataset["train"].features)
print(tokenized_dataset["train"][:5])

dict_keys(['train', 'validation'])
{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Value(dtype='int64', id=None)}
{'text': ["My chest tightens with the fear of confrontation, it's a dreaded emotion.", "Today's menu includes salad, soup, and sandwiches.", "I'm thrilled to announce that my application was accepted. Thanks for all the support!", "Looking forward to the annual 'Employee of the Month' award. It's truly a prestigious title.", "Why does this always happen to me? I can't seem to catch a break."], 'label': [55, 9, 50, 36, 32], 'input_ids': [[101, 2026, 3108, 21245, 2015, 2007, 1996, 3571, 1997, 13111, 1010, 2009, 1005, 1055, 1037, 14436, 2098, 7603, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [427]:
# Iterate through the first 5 examples
for i in range(5):
    print(tokenized_dataset["train"][i])

{'text': "My chest tightens with the fear of confrontation, it's a dreaded emotion.", 'label': 55, 'input_ids': [101, 2026, 3108, 21245, 2015, 2007, 1996, 3571, 1997, 13111, 1010, 2009, 1005, 1055, 1037, 14436, 2098, 7603, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 

In [428]:
vocabulary = tokenizer.get_vocab()
token = vocabulary.get(2026)
print(token)
vocabulary_size = len(tokenizer.get_vocab())
print(f"Vocabulary size: {vocabulary_size}")

None
Vocabulary size: 30522


In [429]:
for input_id in tokenized_dataset["train"][1]["input_ids"]:
    token = tokenizer.convert_ids_to_tokens(input_id)  # Built-in method
    if input_id in tokenizer.all_special_ids:  # Handle special tokens
        print(f"Special token: {token}")
    else:
        print(f"Token: {token}")

Special token: [CLS]
Token: today
Token: '
Token: s
Token: menu
Token: includes
Token: salad
Token: ,
Token: soup
Token: ,
Token: and
Token: sandwiches
Token: .
Special token: [SEP]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]
Special token: [PAD]

In [430]:
# Print a sample of the vocabulary
print(tokenizer.vocab_size)  # Check vocabulary size
print(tokenizer.decode([2026]))  # Decode a specific token ID to see its mapping
print(tokenizer.convert_ids_to_tokens([2651]))  # Convert to token

30522
my
['today']


## this is the training part.
1. include training arguments. These arguments can be modified for fine tuning. I don't really get it though?
2. Collate the data using torch. There's a lot of shape manipulation because the train function kept throwing an error with datatypes.
3. Wrote a custom compute metrics function so the model will report back on metrics after running the training.

In [439]:
#fine tune model using trainer API via hugging face
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,
    lr_scheduler_type="linear",  # Or "cosine", "polynomial", etc.
    warmup_steps=100,  # Number of warmup steps
    logging_steps=10
)


Make a data collator. This ensures that the input_ids match the labels, which for some reason is neaded when passing all the information to tensorflow

In [440]:
import torch

def custom_data_collator(data):
    # Convert input_ids and attention_mask to tensors
    input_ids = torch.stack([torch.tensor(x['input_ids'], dtype=torch.long) for x in data])
    attention_mask = torch.stack([torch.tensor(x['attention_mask'], dtype=torch.long) for x in data])

    # Convert labels to a tensor
    labels = torch.tensor([x['labels'] for x in data], dtype=torch.long)

    # Debugging statements
    print(f"input_ids shape: {input_ids.shape}")
    print(f"attention_mask shape: {attention_mask.shape}")
    print(f"labels shape: {labels.shape}")

    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels,
    }

make a compute_metrics function, that customizes what metrics show up while running the trainer function

In [441]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
import torch

def compute_metrics(eval_pred):
    # Extract logits and labels
    logits, labels = eval_pred.predictions, eval_pred.label_ids

    # Ensure logits are numpy arrays (if they're in tensor format)
    if isinstance(logits, torch.Tensor):
        logits = logits.detach().cpu().numpy()

    # Ensure labels are numpy arrays (if they're in tensor format)
    if isinstance(labels, torch.Tensor):
        labels = labels.detach().cpu().numpy()

    # Get predicted class by taking argmax along the last dimension
    predictions = np.argmax(logits, axis=-1)

    # Remove padding (ignore -100 labels)
    valid_mask = labels != -100
    predictions = predictions[valid_mask]
    labels = labels[valid_mask]

    # Calculate accuracy
    accuracy = accuracy_score(labels, predictions)

    # Calculate precision, recall, and F1 score
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


In [442]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=custom_data_collator,
    compute_metrics=compute_metrics # Pass the function here,
)

In [443]:
trainer.train()
# bf19dff74bb3083b13faf628d07195823fc10890

input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,4.1491,4.141126,0.0,0.0,0.0,0.0
2,4.1201,4.049891,0.030303,0.012397,0.030303,0.017595
3,3.8802,3.939443,0.050505,0.011962,0.050505,0.019342


input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
i

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
i

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
i

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
i

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,4.1491,4.141126,0.0,0.0,0.0,0.0
2,4.1201,4.049891,0.030303,0.012397,0.030303,0.017595
3,3.8802,3.939443,0.050505,0.011962,0.050505,0.019342
4,3.903,3.917088,0.060606,0.013085,0.060606,0.02108


input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
input_ids shape: torch.Size([8, 128])
attention_mask shape: torch.Size([8, 128])
labels shape: torch.Size([8])
i

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=196, training_loss=4.0239134321407395, metrics={'train_runtime': 2220.8375, 'train_samples_per_second': 0.706, 'train_steps_per_second': 0.088, 'total_flos': 103199726837760.0, 'train_loss': 4.0239134321407395, 'epoch': 4.0})

In [444]:
print(f"Model output configuration: {model.config.num_labels}")


Model output configuration: 67


## Understanding the results:
(pasted from gemini)
The "good" values for accuracy, precision, and F1 score depend on the nature of your task, the class imbalance, and what you prioritize in the performance metrics. Here's a breakdown of each metric and general guidelines:

1. Accuracy:
Definition: Accuracy is the proportion of correct predictions out of all predictions. It is calculated as:
Accuracy
=
True Positives
+
True Negatives
Total Samples
Accuracy=
Total Samples
True Positives+True Negatives
​

Good values:
Generally, higher accuracy is better. However, accuracy can be misleading in imbalanced datasets. For example, if 95% of your data is class A and 5% is class B, a model predicting class A all the time will have 95% accuracy but will fail to identify any instances of class B.
A "good" accuracy value depends on the dataset, but over 90% is usually strong in most tasks.
2. Precision:
Definition: Precision measures how many of the predicted positive cases are actually positive. It is calculated as:
Precision
=
True Positives
True Positives
+
False Positives
Precision=
True Positives+False Positives
True Positives
​

Good values:
High precision means fewer false positives. It's important when you want to minimize the cost of false alarms (e.g., in medical diagnostics or fraud detection).
Precision close to 1.0 is ideal (i.e., low number of false positives), but depending on the application, a precision of 0.7 or higher could be acceptable.
3. Recall (Sensitivity):
Definition: Recall measures how many of the actual positive cases were correctly identified. It is calculated as:
Recall
=
True Positives
True Positives
+
False Negatives
Recall=
True Positives+False Negatives
True Positives
​

Good values:
High recall means fewer false negatives, which is important when you don't want to miss any positive cases (e.g., in detecting diseases or rare events).
Ideally, you want recall close to 1.0, but depending on the use case, a recall of 0.7 or higher can be acceptable.
4. F1 Score:
Definition: F1 score is the harmonic mean of precision and recall, providing a balance between the two. It is calculated as:
F1
=
2
×
Precision
×
Recall
Precision
+
Recall
F1=2×
Precision+Recall
Precision×Recall
​

Good values:
The F1 score balances the trade-off between precision and recall. A high F1 score suggests that both precision and recall are high.
An F1 score close to 1.0 is ideal, but 0.7 or higher is typically good for many tasks, especially if the precision and recall are balanced. If precision and recall differ significantly, the F1 score will reflect that imbalance.
What is considered a "good" score?
For imbalanced data: Accuracy can be misleading in imbalanced datasets, so precision, recall, and F1 score become more important. For example, in fraud detection, you might prioritize high recall (identifying as many fraudulent cases as possible) over accuracy.

General guidelines:

Precision and recall should both be as high as possible, but they may need to be balanced based on the application (e.g., medical diagnosis vs. general classification).
F1 score is often considered the best metric when you need to balance precision and recall, especially if you're unsure which one to prioritize.
Example in practice:
High precision, lower recall: This is acceptable when false positives are more costly than false negatives (e.g., detecting rare diseases where you don't want to misclassify a healthy person as sick).
High recall, lower precision: This is better when missing positives is more costly than false alarms (e.g., detecting a rare condition and avoiding missing any positive cases).
Benchmarks:
Good scores in balanced datasets:

Accuracy: > 85%
Precision: > 0.75
Recall: > 0.75
F1 Score: > 0.75
Good scores in imbalanced datasets (depending on class imbalance):

Precision and Recall values might vary based on how much you care about false positives or false negatives. Aim for high F1 score in these cases.

#export the model and convert to ONNX

In [445]:
#Export fine tune model
print("model files: \n")
model.save_pretrained("./fine_tuned_model", safe_serialization=False)
print("tokenizer files: \n")
tokenizer.save_pretrained("./fine_tuned_model")

model files: 

tokenizer files: 



('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.txt',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

Explanation of the files:
- added_tokens.json: This file contains additional tokens added to the model's vocabulary.
- config.json: Contains model-specific configuration settings.
- generation_config.json: Model-specific settings related to generation tasks (not directly relevant for tokenization).
- special_tokens_map.json: Maps special tokens such as [CLS], [SEP], etc.
- spiece.model: The SentencePiece model file containing the vocabulary and subword segmentation rules.
- tokenizer_config.json: Configuration for the tokenizer itself.

Below is the code to convert the exported model to ONNX.

In [None]:
!pip install optimum[onnxruntime]


Collecting evaluate (from optimum[onnxruntime])
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [446]:
import onnxruntime as rt
import onnx

In [452]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your trained model and tokenizer
model_path = "./fine_tuned_model"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define dummy input for the model, include special tokens and attention mask
text = "This is a test sentence."  # Example text
inputs = tokenizer(text, return_tensors="pt")

# Export to ONNX
onnx_path = "./fine_tuned_model/onnx_model/model.onnx"
torch.onnx.export(
    model,                       # The trained model
    inputs.data,                  # Provide the tokenized input as a dictionary
    onnx_path,                   # Path where ONNX model will be saved
    input_names=["input_ids", "attention_mask"],   # Name of input layers
    output_names=["logits"],     # Name of output layers
    dynamic_axes={"input_ids": {0: "batch_size", 1: "sequence_length"},
                  "attention_mask": {0: "batch_size", 1: "sequence_length"},
                  "logits": {0: "batch_size"}},  # Dynamic axes
    opset_version=14             # Changed opset version to 14
)

# Save the tokenizer
tokenizer.save_pretrained("./fine_tuned_model/onnx_model")

Some weights of the model checkpoint at ./fine_tuned_model were not used when initializing BertForSequenceClassification: ['decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder

('./fine_tuned_model/onnx_model/tokenizer_config.json',
 './fine_tuned_model/onnx_model/special_tokens_map.json',
 './fine_tuned_model/onnx_model/vocab.txt',
 './fine_tuned_model/onnx_model/added_tokens.json',
 './fine_tuned_model/onnx_model/tokenizer.json')

In [448]:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize ONNX model
onnx_model_path = "./fine_tuned_model/onnx_model/model.onnx"
quantized_model_path = "./fine_tuned_model/onnx_model/model-quantized.onnx"

quantize_dynamic(
    model_input=onnx_model_path,
    model_output=quantized_model_path,
    weight_type=QuantType.QUInt8  # Use INT8 quantization
)



In [449]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")

# Save as tokenizer.json
tokenizer.save_pretrained("./fine_tuned_model", legacy_format=False)

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/tokenizer.json')

#Test the model

In [497]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")

# Example manual input
input_text = "I am very sad."

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Get model predictions
with torch.no_grad():
    logits = model(**inputs).logits

# Get predicted class by taking argmax of logits
predicted_class = logits.argmax(dim=-1).item()  # Convert to a Python scalar

# Get the predicted label
# Reversing the label_mapping to create a mapping from integers to labels
inverted_label_mapping = {v: k for k, v in label_mapping.items()}
print("predicted class: ", predicted_class)
predicted_label = inverted_label_mapping.get(predicted_class, "Unknown")
print(f"Predicted label: {predicted_label}")

Some weights of the model checkpoint at ./fine_tuned_model were not used when initializing BertForSequenceClassification: ['decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder

predicted class:  18
Predicted label: disappointment


In [490]:
print(label_mapping)

defaultdict(<function <lambda> at 0x7c3d8913f910>, {'fear': 0, 'Neutral': 1, 'Positive': 2, 'Sarcastic': 3, 'annoyance': 4, 'restlessness': 5, 'Serious': 6, 'empathy': 7, 'sadness': 8, 'positive': 9, 'neutral': 10, 'disgust': 11, 'negative': 12, 'joy': 13, 'awe': 14, 'anger': 15, 'peace': 16, 'longing': 17, 'disappointment': 18, 'worry': 19, 'reflection': 20, 'relief': 21, 'surprise': 22, 'stress': 23, 'grief': 24, 'Negative': 25, 'concern': 26, 'happiness': 27, 'connection': 28, 'tranquility': 29, 'tension': 30, 'determination': 31, 'inspiration': 32, 'bittersweetness': 33, 'excitement': 34, 'sorrow': 35, 'pleasure': 36, 'adventure': 37, 'amusement': 38, 'anticipation': 39, 'compassion': 40, 'anxiety': 41, 'pride': 42, 'adrenaline': 43, 'nostalgia': 44, 'courage': 45, 'loneliness': 46, 'pain': 47, 'love': 48, 'bliss': 49, 'curiosity': 50, 'depression': 51, 'despair': 52, 'chill': 53, 'parting': 54, 'admiration': 55, 'gratitude': 56, 'accomplishment': 57, 'comfort': 58, 'contentment': 