Transformers and Hugging Face

What are transformers in NLP?

The transformer is a simple yet powerful neural network architecture introduced by Google Brain in 2017 in the well-known research paper “Attention Is All You Need.” It relies on the attention mechanism rather than the sequential computation found in recurrent networks.

What are the main components of transformers?

Like earlier recurrent sequence-to-sequence models, transformers have two main blocks: an encoder and a decoder, each built around a self-attention mechanism. Earlier encoder-decoder architectures relied on RNNs and LSTMs; the transformer replaces them with self-attention and feed-forward networks.

The following section provides a general overview of the main components of each block of transformers.

Input sentence preprocessing stage

This section contains two main steps: (1) the generation of the embeddings of the input sentence, and (2) the computation of the positional vector of each word in the input sentence. All the computations are performed the same way for both the source sentence (before the encoder block) and the target sentence (before the decoder block).

Embedding of the input data

Before generating the embeddings of the input data, we start by performing the tokenization, then create the embedding of each individual word without paying attention to their relationship in the sentence.
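
The two steps above can be sketched with NumPy. This is only a toy illustration: the whitespace tokenizer, the example sentence, the vocabulary, and the randomly initialised embedding table are all assumptions made for the example, whereas real models use learned subword tokenizers and trained embeddings.

```python
import numpy as np

# Toy whitespace tokenizer (an assumption; real models use subword tokenizers).
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Assign each distinct token an integer id in order of first appearance.
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))
token_ids = [vocab[token] for token in tokens]

# Randomly initialised embedding table: one row of size d_model per vocabulary entry.
d_model = 8
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# Embedding lookup: each token id selects its row, independently of its neighbours.
embeddings = embedding_table[token_ids]

print(token_ids)         # [0, 1, 2, 3, 0, 4]
print(embeddings.shape)  # (6, 8)
```

Both occurrences of "the" receive exactly the same vector, which shows that the embedding step alone carries no information about where a word sits in the sentence.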

Positional encoding

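Treating each token on its own discards the order of the words in the input sentence. Positional encoding restores that order information by generating a positional vector (referred to below as the context vector) for each word; the original paper builds it from sine and cosine functions of the word's position.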
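
As a rough sketch, the sinusoidal scheme from the original paper can be written in a few lines of NumPy. The function name `positional_encoding` and the sizes used below are illustrative choices, and the sketch assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional vectors.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
```

Each row depends only on the position, not on the word, so it can be computed once and added to the embedding of whatever word occupies that position.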

Encoder block

At the end of the previous step, we get for each word two vectors: (1) the embedding and (2) its context vector. These vectors are added to create a single vector for each word, which is then transmitted to the encoder.

Multi-head attention

As mentioned previously, we lost all notion of a relationship. The goal of the attention layer is to capture the contextual relationships existing between different words in the input sentence. This step ends up generating an attention vector for each word.
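
The following NumPy sketch shows one way to compute multi-head self-attention with scaled dot-product attention. It is a simplified illustration rather than a full layer: the projection matrices are random instead of learned, and the names `multi_head_self_attention`, `w_q`, `w_k`, `w_v`, and `w_o` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product scores between every pair of positions, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(seed=0)
seq_len, d_model, num_heads = 6, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)   # (6, 8)
print(weights.shape)  # (2, 6, 6)
```

The `output` array holds one attention vector per word, and each row of `weights` describes how much that word attends to every other word in the sentence.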

Position-wise feed-forward net (FFN)

At this stage, a feed-forward neural network is applied to every attention vector to transform them into a format that is expected by the next multi-head attention layer in the decoder.
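
A minimal sketch of such a network, assuming the two-layer ReLU form used in the original paper, is shown below. The weights are random placeholders and the sizes are illustrative; the point is that the same transformation is applied to every position independently.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied row by row.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(seed=0)
d_model, d_ff = 8, 32  # illustrative sizes
attention_vectors = rng.normal(size=(6, d_model))  # one vector per word
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

output = position_wise_ffn(attention_vectors, w1, b1, w2, b2)
print(output.shape)  # (6, 8)
```

Because no information flows between rows, passing a single attention vector through the function gives the same result as the corresponding row of the full output.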

Decoder block

The decoder block consists of three main layers: masked multi-head attention, multi-head attention, and a position-wise feed-forward network. We already understand the last two layers, which are the same in the encoder.

The decoder comes into the equation during the training of the network, and it receives two main inputs: (1) the attention vectors of the input sentence we want to translate and (2) the translated target sentences in English.

So, what is the masked multi-head attention layer responsible for?

When generating the next English word, the network is allowed to use all the words of the French source sentence. However, when processing a given word in the target sequence (the English translation), the network must only access the preceding words, because making the following ones available would let it “cheat” instead of learning properly. This is where the masked multi-head attention layer comes in: it masks the following words, so that their attention weights become zero and the attention network cannot use them.
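
The masking step can be sketched as follows. A common way to implement it, assumed here, is to set the attention scores of future positions to negative infinity before the softmax, which makes their attention weights exactly zero. The helper names and the all-zero scores are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks future positions that must be hidden.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention_weights(scores):
    # Replace the scores of future positions with -inf, then apply a softmax.
    masked = np.where(causal_mask(scores.shape[-1]), -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) is 0, so masked positions get zero weight
    return e / e.sum(axis=-1, keepdims=True)

# Uniform scores make the effect of the mask easy to read.
scores = np.zeros((4, 4))
weights = masked_attention_weights(scores)
print(weights)
```

With uniform scores, the first row puts all of its weight on the first word, the second row splits it evenly over the first two words, and so on: each position only sees itself and the words before it.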

The result of the masked multi-head attention layer passes through the rest of the layers in order to predict the next word by generating a probability score.
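
As a small illustration of this last step, the snippet below turns a vector of output scores into probabilities with a softmax and picks the most likely word. The four-word vocabulary and the score values are made up for the example.

```python
import numpy as np

vocab = ["<eos>", "the", "cat", "sat"]   # hypothetical toy target vocabulary
logits = np.array([0.5, 1.0, 3.0, 0.2])  # hypothetical decoder scores for one position

# Softmax converts the raw scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy choice: take the word with the highest probability.
next_word = vocab[int(np.argmax(probs))]
print(next_word)  # cat
```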

This architecture was successful because of the following reasons:

The total computational complexity per layer is lower than that of RNNs when the sequence is shorter than the representation dimension, which is the common case.
It removes the need for recurrence entirely and allows the positions of a sequence to be processed in parallel, unlike RNNs, which must process their input one step at a time.
RNNs learn long-range dependencies inefficiently because of the length of the paths that forward and backward signals must travel through the network. Self-attention shortens these paths, which improves the learning process.
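
To make the first point concrete, the original paper gives the per-layer cost as on the order of n² · d for self-attention and n · d² for a recurrent layer, where n is the sequence length and d the representation dimension. The snippet below plugs in illustrative values (n = 50, d = 512); the counts ignore constant factors and are only meant to show the ratio.

```python
def self_attention_cost(n, d):
    # Order of operations per layer for self-attention: n^2 * d
    return n * n * d

def recurrent_cost(n, d):
    # Order of operations per layer for a recurrent layer: n * d^2
    return n * d * d

n, d = 50, 512  # illustrative sentence length and model dimension
print(self_attention_cost(n, d))  # 1280000
print(recurrent_cost(n, d))       # 13107200
```

The ratio between the two is d / n, so self-attention is cheaper per layer whenever the sentence is shorter than the representation dimension.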

Transfer Learning in NLP

Training deep neural networks such as transformers from scratch is not an easy task, and might present the following challenges:

Finding the required amount of data for the target problem can be time-consuming.
Getting the necessary computation resources, such as GPUs, to train such deep networks can be very costly.

Transfer learning offers several benefits, such as reducing the training time of new models and shortening project delivery time.

Imagine building a model from scratch to translate Mandingo into Wolof, both of which are low-resource languages. Gathering data for those languages is costly. Instead of going through all these challenges, one can re-use pre-trained deep neural networks as the starting point for training the new model.

Such models have been trained on a huge corpus of data, made available by someone else (an individual, an organization, etc.), and evaluated to work very well on language translation tasks such as French to English.

If you are new to NLP, this Introduction to Natural Language Processing in Python course can provide you with the fundamental skills to perform and solve real-world problems.

But what do we mean by re-use of deep neural networks?

The re-use of the model involves choosing the pre-trained model that is similar to your use case, refining the input-output pair data of your target task, and retraining the head of the pre-trained model by using your data.
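
The idea of retraining only the head can be sketched without any deep learning library. In the toy example below, a fixed random projection stands in for the pre-trained body (an assumption made purely for illustration; it is not a real pre-trained model), the task data is synthetic, and only a small logistic-regression head is trained on top of the frozen features.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a pre-trained body: fixed weights that are never updated.
w_body = rng.normal(size=(4, 8))

def frozen_body(x):
    return np.tanh(x @ w_body)

# Hypothetical target-task data: 200 examples with binary labels.
x_task = rng.normal(size=(200, 4))
y_task = (x_task[:, 0] + x_task[:, 1] > 0).astype(float)

# The body is frozen, so its features can be computed once.
features = frozen_body(x_task)

# Trainable head: a single logistic-regression layer.
w_head = np.zeros(8)
b_head = 0.0

def head_loss(w, b):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_task * np.log(p) + (1 - y_task) * np.log(1 - p))

initial_loss = head_loss(w_head, b_head)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(features @ w_head + b_head)))
    # Gradients are computed for the head only; w_body is left untouched.
    w_head -= 0.5 * (features.T @ (p - y_task) / len(y_task))
    b_head -= 0.5 * np.mean(p - y_task)
final_loss = head_loss(w_head, b_head)

print(initial_loss, final_loss)
```

The loss on the target task goes down even though the body never changes, which is the essence of re-using a pre-trained model: the expensive part is kept as-is and only a small task-specific part is fitted to the new data.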

The introduction of Transformers has led to the development of state-of-the-art transfer learning models such as:

BERT, short for Bidirectional Encoder Representations from Transformers, was developed by Google researchers in 2018. It helps to solve the most common language tasks such as named entity recognition, sentiment analysis, question answering, text summarization, etc. Read more about BERT in this NLP tutorial.
GPT-3 (Generative Pre-trained Transformer 3) was proposed by OpenAI researchers. It is a multi-layer transformer, mainly used to generate any type of text. GPT models are capable of producing human-like text responses to a given question. This article provides deeper information about what makes GPT-3 unique, the technology powering it, its risks, and its limitations.

An introduction to Hugging Face Transformers

Hugging Face is an AI community and Machine Learning platform created in 2016 by Julien Chaumond, Clément Delangue, and Thomas Wolf. It aims to democratize NLP by providing Data Scientists, AI practitioners, and Engineers immediate access to over 20,000 pre-trained models based on the state-of-the-art transformer architecture. These models can be applied to:

Text in over 100 languages, for tasks such as classification, information extraction, question answering, generation, and translation.
Speech, for tasks such as audio classification and speech recognition.
Vision, for object detection, image classification, and segmentation.
Tabular data, for regression and classification problems.
Reinforcement learning transformers.

Hugging Face Transformers also provides almost 2,000 datasets and layered APIs, allowing programmers to easily interact with those models using almost 31 libraries. Most of them are deep learning libraries, such as PyTorch, TensorFlow, JAX, ONNX, fastai, Stable-Baselines3, etc.

These courses are a great introduction to using PyTorch and TensorFlow for building deep convolutional neural networks. Another component of Hugging Face Transformers is the pipeline.

What are Pipelines in Transformers?

They provide an easy-to-use API through the pipeline() function for performing inference over a variety of tasks.
They encapsulate the overall process of a Natural Language Processing task, such as text cleaning, tokenization, embedding, etc.

The pipeline() function has the following structure:

In [None]:
from transformers import pipeline

# To use a default model & tokenizer for a given task (e.g. question-answering)
pipeline("<task-name>")

# To use an existing model
pipeline("<task-name>", model="<model_name>")

# To use a custom model/tokenizer
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")

In [None]:
import pandas as pd

# Load the data from the path
data_path = "datacamp_workspace_export_2022-08-08 07_56_40.csv"
# Skip malformed rows; on_bad_lines replaces the error_bad_lines argument removed in pandas 2.0
news_data = pd.read_csv(data_path, on_bad_lines="skip")

# Show data information
news_data.info()

In [None]:
# Install the required libraries (notebook magic command)
%pip install transformers sentencepiece

from transformers import MarianTokenizer, MarianMTModel

In [None]:
# Get the name of the model
model_name = 'Helsinki-NLP/opus-mt-en-fr'

# Get the tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)
# Instantiate the model
model = MarianMTModel.from_pretrained(model_name)

In [None]:
def format_batch_texts(language_code, batch_texts):
    # Prefix each text with the target-language token, e.g. ">>fr<< some text"
    formatted_batch = [">>{}<< {}".format(language_code, text) for text in batch_texts]
    return formatted_batch

In [None]:
def perform_translation(batch_texts, model, tokenizer, language="fr"):
    # Prepare the text data into the appropriate format for the model
    formatted_batch_texts = format_batch_texts(language, batch_texts)

    # Tokenize the batch and generate the translations
    inputs = tokenizer(formatted_batch_texts, return_tensors="pt", padding=True)
    translated = model.generate(**inputs)

    # Convert the generated token indices back into text
    translated_texts = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    return translated_texts

In [None]:
# Check the model translation from the original language (English) to French.
# english_texts is assumed to be a list of English strings; its definition is not shown here.
translated_texts = perform_translation(english_texts, model, tokenizer)

# Create wrapper to properly format the text
from textwrap import TextWrapper
# Wrap text to 80 characters.
wrapper = TextWrapper(width=80)

for original, translation in zip(english_texts, translated_texts):
    print("Original text: \n", wrapper.fill(original))
    print("Translation : \n", wrapper.fill(translation))
    print("")

ZERO-SHOT CLASSIFICATION

In [None]:
from transformers import pipeline

In [None]:
candidate_labels = ["tech", "politics", "business", "finance"]

In [None]:
my_classifier = pipeline("zero-shot-classification",
                         model="joeddav/xlm-roberta-large-xnli")

In [None]:
# For the first description
# multi_label=True scores each label independently (older releases named this argument multi_class)
prediction = my_classifier(english_texts[0], candidate_labels, multi_label=True)
pd.DataFrame(prediction).drop(["sequence"], axis=1)

In [None]:
# For the last description
# multi_label=True scores each label independently (older releases named this argument multi_class)
prediction = my_classifier(english_texts[-1], candidate_labels, multi_label=True)
pd.DataFrame(prediction).drop(["sequence"], axis=1)

SENTIMENT ANALYSIS

In [None]:
model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

distil_bert_model = pipeline(task="sentiment-analysis", model=model_checkpoint)

In [None]:
# Run the predictions
distil_bert_model(english_texts[1:])

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

In [None]:
model_checkpoint = "deepset/roberta-base-squad2"

task = 'question-answering'
QA_model = pipeline(task, model=model_checkpoint, tokenizer=model_checkpoint)

In [None]:
QA_input = {
    "question": "When is Apple hosting an event?",
    "context": english_texts[-1],
}

In [None]:
model_response = QA_model(QA_input)
pd.DataFrame([model_response])