# 1. Activate GPU and Install Dependencies

---



In [1]:
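# On Colab, enable the GPU via 'Runtime' > 'Change runtime type' and select GPU as the hardware accelerator.
# On an Apple Silicon Mac, check that this PyTorch build includes MPS (Metal GPU) support.
# Note: is_built() only reports build support; torch.backends.mps.is_available() reports whether the device is usable.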
import torch
torch.backends.mps.is_built()

True
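
The check above only covers Apple's MPS backend. A minimal sketch of a more general device check, assuming the notebook may run on an NVIDIA GPU (for example a Colab GPU runtime), on Apple Silicon, or on CPU only. `pick_device` is a hypothetical helper name, not part of PyTorch:

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this environment offers."""
    # Fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():  # NVIDIA GPU, e.g. a Colab GPU runtime
        return "cuda"
    # torch.backends.mps exists in PyTorch 1.12+; guard for older builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():  # Apple Silicon GPU
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed wherever a device name is accepted, for example `model.to(device)`.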

# 2. Preprocess data

In [2]:
# Load data
from datasets import load_dataset
imdb = load_dataset("imdb", trust_remote_code=True)

In [3]:
imdb


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [4]:
imdb['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [5]:
imdb['train'].unique('label')


[0, 1]

In [6]:
# Create a smaller training dataset for faster training times
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1}
{'text': "<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, 
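
Shuffling before `select` matters when the source split is sorted by label, because the first N rows would then come from a single class. A toy illustration in plain Python, assuming a label-sorted list of 1,000 made-up labels. It mimics the idea only and does not reproduce the row order that `datasets` produces with `seed=42`:

```python
import random

# Hypothetical label-sorted split: 500 negatives (0) followed by 500 positives (1).
labels = [0] * 500 + [1] * 500

# Taking the first 100 rows without shuffling yields one class only.
head = labels[:100]

# Shuffle a copy with a fixed seed, then take the first 100 rows.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
subset = shuffled[:100]

print(set(head))    # classes present without shuffling
print(set(subset))  # classes present after shuffling
```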

In [7]:
# Set DistilBERT tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [8]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)
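
`truncation=True` cuts each review to the model's maximum input length. A minimal sketch of the effect for a single sequence, assuming DistilBERT's 512-token limit and the `[CLS]`/`[SEP]` ids 101 and 102 from the uncased BERT vocabulary. `truncate_single` is a hypothetical helper, not the tokenizer's implementation:

```python
MAX_LEN = 512              # assumed model_max_length for distilbert-base-uncased
CLS_ID, SEP_ID = 101, 102  # assumed special-token ids


def truncate_single(token_ids: list[int], max_len: int = MAX_LEN) -> list[int]:
    """Keep the leading tokens so that [CLS] + tokens + [SEP] fits in max_len."""
    budget = max_len - 2  # reserve two slots for the special tokens
    return [CLS_ID] + token_ids[:budget] + [SEP_ID]


long_review = list(range(1000, 1700))    # 700 made-up token ids
short_review = [2023, 3185, 2003, 2307]  # 4 made-up token ids

print(len(truncate_single(long_review)))   # 512
print(len(truncate_single(short_review)))  # 6
```

Anything past the limit is dropped, so for very long reviews the model only sees the opening part of the text.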

In [9]:
# Use data_collator to convert the samples to PyTorch tensors and pad each batch to the length of its longest sample
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
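
Because the tokenizer call above does not pad, `DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to a global maximum. A simplified pure-Python sketch of that idea, assuming pad token id 0. `pad_batch` is a hypothetical helper and omits the tensor conversion the real collator performs:

```python
PAD_ID = 0  # assumed id of the [PAD] token


def pad_batch(batch: list[list[int]], pad_id: int = PAD_ID) -> dict[str, list[list[int]]]:
    """Right-pad every sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short reviews from being padded to the length of the longest review in the whole dataset, which saves compute.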

# 3. Training the model

In [10]:
# Define DistilBERT as our base model:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
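# Define the evaluation metrics
# Note: datasets.load_metric is deprecated (and removed in datasets 3.0); the `evaluate` package's evaluate.load is its replacement.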
import numpy as np
from datasets import load_metric

def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
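
For reference, the two metrics can be computed by hand. A minimal sketch in plain Python with made-up logits, assuming label 1 is the positive class for F1 (the usual default for binary F1). `accuracy_and_f1` is a hypothetical helper:

```python
def accuracy_and_f1(logits: list[list[float]], labels: list[int]) -> dict[str, float]:
    """Argmax the logits, then compute accuracy and binary F1 (positive class = 1)."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for four reviews: columns are [negative, positive].
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.5], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
result = accuracy_and_f1(toy_logits, toy_labels)
print(result)
```

Here three of four predictions are correct (accuracy 0.75), and with two true positives, no false positives, and one false negative, F1 is 2·2 / (2·2 + 0 + 1) = 0.8.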

In [12]:
# Log in to your Hugging Face account
# Get your API token here https://huggingface.co/settings/token
from huggingface_hub import notebook_login

# On my machine notebook_login() did not work, so I ran `huggingface-cli login` in a terminal instead, which is fine when running locally.

#notebook_login()

In [13]:
# Define a new Trainer with all the objects we constructed so far
from transformers import TrainingArguments, Trainer

repo_name = "finetuning-sentiment-model-3000-samples"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [14]:
# Train the model
trainer.train()

{'train_runtime': 250.0889, 'train_samples_per_second': 23.991, 'train_steps_per_second': 1.503, 'train_loss': 0.29356463412021067, 'epoch': 2.0}


TrainOutput(global_step=376, training_loss=0.29356463412021067, metrics={'train_runtime': 250.0889, 'train_samples_per_second': 23.991, 'train_steps_per_second': 1.503, 'total_flos': 782725021021056.0, 'train_loss': 0.29356463412021067, 'epoch': 2.0})
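
The step count follows from the dataset and batch sizes. A quick arithmetic check using only the values shown above (3,000 training samples, batch size 16, 2 epochs), assuming a single device and no gradient accumulation:

```python
import math

train_samples, batch_size, epochs = 3000, 16, 2
train_runtime = 250.0889  # seconds, from the training output

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)                                   # 188
print(total_steps)                                       # 376
print(round(train_samples * epochs / train_runtime, 3))  # 23.991
print(round(total_steps / train_runtime, 3))             # 1.503
```

These match `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output.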

In [15]:
# Compute the evaluation metrics
trainer.evaluate()

{'eval_loss': 0.3658331632614136,
 'eval_accuracy': 0.8533333333333334,
 'eval_f1': 0.8562091503267973,
 'eval_runtime': 193.4876,
 'eval_samples_per_second': 1.55,
 'eval_steps_per_second': 0.098,
 'epoch': 2.0}

# 4. Analyzing new data with the model

In [16]:
# Upload the model to the Hub
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/bloodyaca/finetuning-sentiment-model-3000-samples/commit/a3421846c84bc36934ff46590df8790d3ee2e432', commit_message='End of training', commit_description='', oid='a3421846c84bc36934ff46590df8790d3ee2e432', pr_url=None, pr_revision=None, pr_num=None)

In [17]:
# Run inference with your new model using a pipeline
from transformers import pipeline

sentiment_model = pipeline(task="sentiment-analysis", model="bloodyaca/finetuning-sentiment-model-3000-samples")

sentiment_model(["I love this move", "This movie sucks!"])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'LABEL_1', 'score': 0.9485695958137512},
 {'label': 'LABEL_0', 'score': 0.9614585041999817}]
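
The pipeline reports generic names because no label names were stored in the model config. In the IMDB dataset, label 0 is negative and label 1 is positive, so the output can be mapped after the fact. A minimal sketch, where `ID2LABEL` and `readable` are hypothetical names. Alternatively, passing `id2label` and `label2id` to `from_pretrained` before training should store the names in the config:

```python
# Assumed mapping from the pipeline's generic names to IMDB's classes.
ID2LABEL = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions: list[dict]) -> list[dict]:
    """Replace generic label names and round scores for display."""
    return [
        {"label": ID2LABEL[p["label"]], "score": round(p["score"], 3)}
        for p in predictions
    ]


# The two predictions printed above.
raw = [
    {"label": "LABEL_1", "score": 0.9485695958137512},
    {"label": "LABEL_0", "score": 0.9614585041999817},
]
print(readable(raw))
```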

In [18]:
sentiment_model(["neither good nor bad"])

[{'label': 'LABEL_0', 'score': 0.7222694754600525}]

In [19]:
sentiment_model(["neither bad nor good"])

[{'label': 'LABEL_0', 'score': 0.5521848797798157}]
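
The `score` is the softmax probability of the predicted class, so for two classes a value near 0.5 means the model is close to undecided. A small sketch with made-up logits (the pipeline output does not show the real ones), where `softmax` is a local helper:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the order [LABEL_0, LABEL_1].
confident = softmax([2.0, -1.0])
borderline = softmax([0.1, -0.1])

print(round(max(confident), 3))
print(round(max(borderline), 3))
```

A large gap between the two logits gives a score close to 1, while nearly equal logits give a score just above 0.5, as with the two "neither ... nor ..." inputs above.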

In [20]:
sentiment_model(["its a must-see", "long and without plot", "fell asleep at the middle", "kept me on my toes"])

[{'label': 'LABEL_1', 'score': 0.917130172252655},
 {'label': 'LABEL_0', 'score': 0.672274112701416},
 {'label': 'LABEL_0', 'score': 0.7770716547966003},
 {'label': 'LABEL_1', 'score': 0.7797839045524597}]

In [21]:
sentiment_model(["couldnt belive the end", "couldnt belive the end, it was amazing", "couldnt belive the end, it made no sense"])

[{'label': 'LABEL_0', 'score': 0.6549532413482666},
 {'label': 'LABEL_1', 'score': 0.9567467570304871},
 {'label': 'LABEL_0', 'score': 0.9153274297714233}]

In [22]:
sentiment_model(["I almost liked it"])

[{'label': 'LABEL_1', 'score': 0.8092709183692932}]