<a href="https://colab.research.google.com/github/eneskaya20/finetuning-sentiment-model-3000-samples/blob/main/trial_distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
#necessary installations
!pip install datasets transformers huggingface_hub
!apt-get install git-lfs
!pip install accelerate -U
!pip install transformers[torch]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.


In [7]:
# import of necessary libraries
import torch
import numpy as np
from datasets import load_metric
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from huggingface_hub import notebook_login
from transformers import TrainingArguments, Trainer
from transformers import pipeline

#checking if cuda is available
torch.cuda.is_available()


True

In [8]:
#loading the imdb dataset from dataset library
imdb = load_dataset("imdb")




  0%|          | 0/3 [00:00<?, ?it/s]

In [9]:
# Take a shuffled 3,000-example training subset and a 300-example test subset
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))




In [10]:
# Initialize the DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [11]:
# Tokenize the train and test subsets with preprocess_function
def preprocess_function(examples):
    # truncation=True cuts reviews that exceed the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

# batched=True passes examples to the function in batches, which speeds up tokenization
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

In [12]:
# Initialize the data collator. It groups examples into batches and pads each batch
# to the length of its longest sequence, so every row in a batch has the same number of tokens.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
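
To make the padding step concrete, here is a minimal plain-Python sketch of padding a batch to its longest sequence. It is an illustration only, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption about the padding token id.

```python
# Illustrative sketch of dynamic padding; not how DataCollatorWithPadding is implemented.
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two reviews of different lengths
batch = [[101, 2023, 3185, 102], [101, 2307, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to one global length keeps short batches short, which is why the collator is applied at batching time instead of during tokenization.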


In [13]:
# Load the pretrained DistilBERT model; num_labels=2 because this is binary classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Compute accuracy and F1 from the evaluation predictions
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")

    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
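
The metric computation can be traced by hand on a toy batch. The sketch below reimplements the same steps in plain Python (argmax over logits, then accuracy and binary F1) so each step is visible. The logits and labels are made-up values, `accuracy_and_f1` is a hypothetical helper, and treating label 1 as the positive class is an assumption.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(logits, labels):
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(labels),
        "f1": 2 * tp / denom if denom else 0.0,
    }

# Made-up logits for four examples, one [negative, positive] pair per example
toy_logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, 0.1], [1.5, 0.2]]
toy_labels = [0, 1, 0, 1]
result = accuracy_and_f1(toy_logits, toy_labels)
print(result)
```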


In [18]:
# Log in to the Hugging Face Hub (needed to push the model later)
notebook_login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [19]:
repo_name = "finetuning-sentiment-model-3000-samples"

training_args = TrainingArguments(
   output_dir=repo_name, # output directory, also used as the Hub repository name
   learning_rate=2e-5, # initial learning rate of the optimizer
   per_device_train_batch_size=16, # training batch size per device
   per_device_eval_batch_size=16, # evaluation batch size per device
   num_train_epochs=2, # number of passes over the training set
   weight_decay=0.01, # regularization that helps prevent overfitting
   save_strategy="epoch", # save a checkpoint at the end of each epoch
   push_to_hub=True, # push the model to the Hugging Face Hub
)
# Pass the model, arguments, datasets, and helpers defined above to the Trainer
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)


/content/finetuning-sentiment-model-3000-samples is already a clone of https://huggingface.co/eneskaya/finetuning-sentiment-model-3000-samples. Make sure you pull the latest changes with `repo.git_pull()`.


In [20]:
trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


TrainOutput(global_step=376, training_loss=0.29876207798085314, metrics={'train_runtime': 360.0011, 'train_samples_per_second': 16.667, 'train_steps_per_second': 1.044, 'total_flos': 785643443397696.0, 'train_loss': 0.29876207798085314, 'epoch': 2.0})
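
The numbers in `TrainOutput` can be reproduced from the settings used above. This is a small arithmetic sketch, assuming a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; the variable names are illustrative.

```python
import math

train_examples = 3000
batch_size = 16
epochs = 2
train_runtime = 360.0011  # seconds, copied from the TrainOutput above

# 3000 / 16 = 187.5, so the last batch of each epoch is only half full
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = round(train_examples * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)
print(steps_per_epoch, total_steps, samples_per_second, steps_per_second)
```

Under those assumptions the result matches the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.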

In [21]:
trainer.evaluate()


  load_accuracy = load_metric("accuracy")


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

{'eval_loss': 0.32882195711135864,
 'eval_accuracy': 0.8666666666666667,
 'eval_f1': 0.8684210526315789,
 'eval_runtime': 6.477,
 'eval_samples_per_second': 46.317,
 'eval_steps_per_second': 2.933,
 'epoch': 2.0}
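
As a sanity check, the reported accuracy and F1 can be converted back into counts on the 300 test reviews. The sketch assumes the F1 score is the binary F1 with label 1 as the positive class. It recovers the number of correct predictions and true positives, but not how the errors split between false positives and false negatives.

```python
eval_examples = 300
reported_accuracy = 0.8666666666666667
reported_f1 = 0.8684210526315789

correct = round(reported_accuracy * eval_examples)  # correctly classified reviews
errors = eval_examples - correct                    # false positives + false negatives

# F1 = 2*TP / (2*TP + FP + FN)  =>  TP = F1 * errors / (2 * (1 - F1))
true_positives = round(reported_f1 * errors / (2 * (1 - reported_f1)))
true_negatives = correct - true_positives

# Recompute F1 from the recovered counts to confirm they are consistent
recomputed_f1 = 2 * true_positives / (2 * true_positives + errors)
print(correct, errors, true_positives, true_negatives, recomputed_f1)
```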

In [22]:
trainer.push_to_hub()


To https://huggingface.co/eneskaya/finetuning-sentiment-model-3000-samples
   514498d..c02c251  main -> main




Upload file runs/Jul19_18-02-24_fb7e2c50182a/events.out.tfevents.1689789781.fb7e2c50182a.526.0: 100%|#########…

To https://huggingface.co/eneskaya/finetuning-sentiment-model-3000-samples
   c02c251..145ba72  main -> main




'https://huggingface.co/eneskaya/finetuning-sentiment-model-3000-samples/commit/c02c2511a1de35bf236e5c9f368e6821fc429774'

In [23]:
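# Note: this loads a model published under another account (federicopascual).
# To test the model trained above, use "eneskaya/finetuning-sentiment-model-3000-samples",
# the repository that trainer.push_to_hub() pushed to in the previous cell.
sentiment_model = pipeline(model="federicopascual/finetuning-sentiment-model-3000-samples")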
sentiment_model(["I love this move", "This movie sucks!"])


Downloading (…)lve/main/config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/333 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'LABEL_1', 'score': 0.9558863043785095},
 {'label': 'LABEL_0', 'score': 0.9413502216339111}]
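
The generic names `LABEL_0` and `LABEL_1` appear because no label names were configured on the model. A minimal sketch for making the output readable follows, assuming the IMDB convention that 0 means negative and 1 means positive. `LABEL_NAMES` and `readable` are hypothetical names, and the input is the pipeline output copied from above.

```python
# Assumed mapping, based on the IMDB label convention (0 = negative, 1 = positive)
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable(results):
    """Replace generic pipeline labels with human-readable names."""
    return [
        {"label": LABEL_NAMES.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

# Pipeline output copied from the cell above
raw = [{"label": "LABEL_1", "score": 0.9558863043785095},
       {"label": "LABEL_0", "score": 0.9413502216339111}]
named = readable(raw)
print(named)
```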