Installing Transformers

In [1]:
!pip install transformers
!pip install torch
!pip install sentencepiece # tokenizer dependency of the CamemBERT-based model

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Necessary Imports

In [2]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification  # TF prefix = TensorFlow model class
from transformers import pipeline
import torch
import torch.nn.functional as F

### **First time using Hugging Face!** Sentiment analysis of French text with the tf-allocine model (a French sentiment analysis model based on CamemBERT and fine-tuned on a large-scale dataset of user reviews scraped from Allociné.fr)

Load classifier

In [3]:
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine")
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

All the layers of TFCamembertForSequenceClassification were initialized from the model checkpoint at tblard/tf-allocine.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFCamembertForSequenceClassification for predictions without further training.


Perform sentiment analysis on different texts

In [4]:
positive_text = "Aujourd'hui je me sens super! Je vais sortir avec mes potes"
classifier(positive_text)

[{'label': 'POSITIVE', 'score': 0.9643841981887817}]
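The `score` field is a probability rather than a raw model output. For a two-class model like this one, a score of this kind is typically obtained by applying a softmax to the model's logits. A minimal sketch in plain Python; the logit values and the label order are made up for illustration and are not taken from the model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [NEGATIVE, POSITIVE]; not real model output.
example_logits = [-1.5, 1.8]
labels = ["NEGATIVE", "POSITIVE"]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The reported label is simply the class with the highest probability, and the score is that probability.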

In [5]:
neutral_text = "J'ai regardé le concert d'Ed Sheeran la semaine dernière"
classifier(neutral_text)

[{'label': 'POSITIVE', 'score': 0.5830837488174438}]

In [6]:
negative_text = "Les services de transport ici sont génants"
classifier(negative_text)

[{'label': 'NEGATIVE', 'score': 0.6865605115890503}]

In [7]:
texts_list = ["Aujourd'hui je me sens super! Je vais sortir avec mes potes", "J'ai regardé le concert d'Ed Sheeran la semaine dernière", "Les services de transport ici sont génants"]
classifier(texts_list)

[{'label': 'POSITIVE', 'score': 0.9643841981887817},
 {'label': 'POSITIVE', 'score': 0.5830837488174438},
 {'label': 'NEGATIVE', 'score': 0.6865605115890503}]
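The model only knows two labels, so the neutral sentence is still classified as POSITIVE, just with a low score (about 0.58). One possible way to deal with this is to treat low-confidence predictions as neutral. The sketch below is an assumption, not part of the model: the `with_neutral` helper and the 0.6 threshold are illustrative, and a real threshold would need to be tuned on labeled data. The predictions are copied from the output above.

```python
def with_neutral(predictions, threshold=0.6):
    """Relabel predictions whose score is below `threshold` as NEUTRAL."""
    relabeled = []
    for pred in predictions:
        label = pred["label"] if pred["score"] >= threshold else "NEUTRAL"
        relabeled.append({"label": label, "score": pred["score"]})
    return relabeled

# Scores copied from the classifier output above.
predictions = [
    {"label": "POSITIVE", "score": 0.9643841981887817},
    {"label": "POSITIVE", "score": 0.5830837488174438},
    {"label": "NEGATIVE", "score": 0.6865605115890503},
]

for pred in with_neutral(predictions):
    print(pred["label"], round(pred["score"], 3))
```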

### Fine-tuning

Importing necessary libraries

In [8]:
from pathlib import Path
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

Prepare Dataset

In [9]:
# Read the IMDb dataset from disk and return parallel lists of texts and labels
def read_imdb_dataset(dataset_dir):
  dataset_dir = Path(dataset_dir)
  texts = []
  labels = []
  for label in ["pos", "neg"]:
    for text_file in (dataset_dir/label).iterdir():
      texts.append(text_file.read_text(encoding="utf-8"))
      labels.append(0 if label == "neg" else 1)  # 0 = negative, 1 = positive
  return texts, labels
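The function expects one sub-folder per class (`pos` and `neg`), each holding one review per text file. To check the logic without the full dataset, here is a small sketch that builds a stand-in directory with two made-up reviews. The file names and review texts are invented, and `sorted()` is added only to make the order deterministic.

```python
import tempfile
from pathlib import Path

def read_imdb_dataset(dataset_dir):
    """Same logic as the function above, repeated so this cell runs on its own."""
    dataset_dir = Path(dataset_dir)
    texts, labels = [], []
    for label in ["pos", "neg"]:
        for text_file in sorted((dataset_dir / label).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label == "neg" else 1)
    return texts, labels

# Build a tiny stand-in dataset (file names and reviews are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label, review in [("pos", "Loved it."), ("neg", "Dull and far too long.")]:
        (root / label).mkdir()
        (root / label / "0_1.txt").write_text(review, encoding="utf-8")
    demo_texts, demo_labels = read_imdb_dataset(root)

print(demo_texts, demo_labels)
```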

In [10]:
# Mount drive files

from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [11]:
# Extract the dataset archive downloaded from https://ai.stanford.edu/~amaas/data/sentiment/
import tarfile

# The context manager closes the archive even if extraction fails
with tarfile.open('/content/gdrive/MyDrive/aclImdb_v1.tar.gz') as archive:
    archive.extractall('./aclImdb')

In [12]:
# get data

train_texts, train_labels = read_imdb_dataset('./aclImdb/aclImdb/train')
test_texts, test_labels = read_imdb_dataset('./aclImdb/aclImdb/test')

In [13]:
# Shuffle training data and create validation dataset

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
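The split above is drawn at random on every run, so the validation set changes each time the notebook is executed. A sketch of how a fixed seed and stratification could make the split repeatable and keep the positive/negative ratio equal in both parts. The toy data and `random_state=42` are assumptions chosen for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the IMDb lists: 10 texts, half positive and half negative.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # same 80/20 ratio as above
    random_state=42,      # arbitrary seed, fixed so the split is repeatable
    stratify=toy_labels,  # keep the pos/neg ratio identical in both parts
)
print(len(tr_texts), len(va_texts), Counter(va_labels))
```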

In [14]:
# Wrap the encodings and labels in a PyTorch Dataset (a pattern I saw in a tutorial recently and found very useful)

class IMDbDataset(Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):  # idx rather than id, which shadows a Python builtin
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['label'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)


Load the tokenizer:

In [15]:
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

Tokenize our text

In [16]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
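With `truncation=True`, sequences longer than the model's maximum length are cut off, and with `padding=True` the remaining sequences are padded to the longest one in the call, with an attention mask marking which positions hold real tokens. The sketch below imitates that behaviour on made-up token ids. The `pad_and_truncate` helper, `max_length=5` and `pad_id=0` are illustrative assumptions; the real tokenizer also adds special tokens, which are left out here.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to `max_length`, then pad all to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; a real tokenizer would produce these from text.
batch = pad_and_truncate([[7, 8, 9], [4, 5, 6, 7, 8, 9, 10], [3]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, which is what allows the examples to be stacked into a single tensor per batch.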

Create our Dataset objects

In [17]:
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

Set up training arguments

In [18]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    optim='adamw_torch'
)
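These settings determine how long training runs. As a quick sanity check, the arithmetic below works out the number of optimization steps, assuming a single device and no gradient accumulation. The 20,000 examples are the 80% of the training reviews kept after the validation split, and the result matches the `Total optimization steps` value in the training log.

```python
import math

num_examples = 20_000   # training reviews left after the 80/20 split
batch_size = 16         # per_device_train_batch_size, single device assumed
num_epochs = 2          # num_train_epochs
warmup_steps = 500

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, f"{warmup_fraction:.0%}")
```

So with these values the learning rate warms up over the first fifth of training before it starts to decay.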

Initialize the model and trainer, then fine-tune the pre-trained model on our data

In [19]:
# initializing the model (from pre-trained model)
model = DistilBertForSequenceClassification.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [20]:
# Initializing trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
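As configured, the trainer only reports the loss on the validation set. If an accuracy figure is wanted as well, a metrics function could be supplied. The sketch below is an optional addition, not part of the original setup: `compute_metrics` is a name chosen here, and the logits and labels are made up so the function can be checked on its own.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair from evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # pick the highest-scoring class
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for four reviews, columns ordered [neg, pos].
fake_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.5, 0.7]])
fake_labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

It could then be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`, so that `trainer.evaluate()` reports the accuracy alongside the loss.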

In [None]:
# Train!

trainer.train()

***** Running training *****
  Num examples = 20000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2500
  Number of trainable parameters = 66955010


Step,Training Loss
