# Transformers in NLP

Let's start by briefly explaining recurrent networks.

We mainly use them for tasks where both the inputs and the outputs are sequences in some defined order. Two of the best-known applications of recurrent networks are machine translation and time series modeling.

Let’s consider translating the following French sentence into English. The input transmitted to the encoder is the original French sentence, and the translated output is generated by the decoder.

But they are slow!

Solution: Transformers

- Similar to recurrent networks, transformers also have two main blocks: an encoder and a decoder, each with a self-attention mechanism.

## Preprocessing:
(1) **Embedding of the input data:** the generation of the embeddings of the words in the input sentence (without paying attention to their relationships in the sentence)
(2) **Positional encoding:** the computation of the positional vector of each word in the input sentence. The tokenization task discards any notion of order that existed in the input sentence; the positional encoding tries to restore it by generating a context vector for each word.

## Encoder:
We get for each word two vectors: (1) the embedding and (2) its context vector. These vectors are added to create a single vector for each word, which is then transmitted to the encoder.
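The addition of the two vectors can be sketched as follows. The text does not say how the positional vectors are computed, so this sketch assumes the sinusoidal scheme from the original Transformer paper, which is one common choice; the function names and the toy embedding values are made up for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional vectors: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings):
    """Element-wise sum of each word embedding and its positional vector."""
    d_model = len(embeddings[0])
    pe = positional_encoding(len(embeddings), d_model)
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# Toy 4-dimensional embeddings for a 3-word sentence (made-up values).
toy_embeddings = [[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.1, 0.2, 0.3, 0.4]]  # same word as at position 0
encoder_input = add_position(toy_embeddings)
```

The first and third words share the same embedding, yet their rows in `encoder_input` differ because they sit at different positions.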

(1) **Multi-head attention:** so far, we lost all notion of a relationship. The goal of the attention layer is to capture the contextual relationships existing between different words in the input sentence. This step ends up generating an attention vector for each word.

(2) **Position-wise feed-forward net (FFN):** At this stage, a feed-forward neural network is applied to every attention vector to transform them into a format that is expected by the next multi-head attention layer in the decoder.
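To make the attention step more concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real models use several heads and learned projection matrices for the queries, keys, and values, whereas this sketch uses the input vectors directly; all values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this word's query with every word's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Attention vector: weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three 2-dimensional word vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Simplification: queries, keys, and values are all the raw input vectors.
attention_vectors, attention_weights = self_attention(x, x, x)
```

Each row of `attention_weights` sums to 1 and says how much every word contributes to that word's attention vector.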

## Decoder:

The decoder block consists of three main layers: (1) masked multi-head attention, (2) multi-head attention, and (3) a position-wise feed-forward network. The last two layers are the same as in the encoder.

The decoder comes into the equation during the training of the network, and it receives two main inputs: (1) the attention vectors of the input sentence we want to translate and (2) the translated target sentences in English.

(1) **Masked multi-head attention layer:**
When predicting a word, the network may only access the previous words. The masked multi-head attention layer masks the words that come next by transforming them into zeros so that they can’t be used by the attention network.

The result of the masked multi-head attention layer passes through the rest of the layers in order to predict the next word by generating a probability score.
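One common way to implement this masking, assumed here rather than taken from the text, is to set the attention scores of later positions to negative infinity before the softmax, so that their attention weights come out as exactly zero. The scores below are made up for illustration.

```python
import math

def masked_softmax(scores, position):
    """Softmax over scores where entries after `position` are masked out."""
    # Future positions get -inf, so exp(-inf) = 0 and they receive zero weight.
    masked = [s if j <= position else float("-inf") for j, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for a 4-word target sentence
# (row i holds the scores of word i against every word).
raw_scores = [[2.0, 1.0, 0.5, 0.1],
              [0.3, 1.5, 0.7, 0.2],
              [0.9, 0.4, 1.1, 0.6],
              [0.2, 0.8, 0.5, 1.3]]
masked_weights = [masked_softmax(row, i) for i, row in enumerate(raw_scores)]
```

The resulting weight matrix is lower triangular: the first word attends only to itself, while the last word can attend to all four.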

### Transfer learning
Instead of going through all these challenges, one can re-use pre-trained deep-neural networks as the starting point for training the new model.

The re-use of the model involves choosing the pre-trained model that is similar to your use case, refining the input-output pair data of your target task, and retraining the head of the pre-trained model by using your data.
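The idea of retraining only the head can be illustrated in miniature. In the sketch below, a fixed function stands in for the pre-trained body and only a small logistic-regression head is trained on the new data. The weights, data, and function names are hypothetical; a real workflow would load a pre-trained network from a library instead.

```python
import math

# Stand-in for a pre-trained body: fixed weights that are never updated.
# (Hypothetical values; a real model would have millions of learned weights.)
FROZEN_WEIGHTS = [[0.8, -0.3], [0.1, 0.9]]

def frozen_features(x):
    """Frozen body: a fixed linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def train_head(inputs, labels, epochs=200, lr=0.5):
    """Train only a logistic-regression head on top of the frozen features."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            feats = frozen_features(x)
            z = sum(w * f for w, f in zip(head_w, feats)) + head_b
            pred = 1.0 / (1.0 + math.exp(-z))
            grad = pred - y  # gradient of the log loss w.r.t. the logit
            head_w = [w - lr * grad * f for w, f in zip(head_w, feats)]
            head_b -= lr * grad
    return head_w, head_b

def predict(x, head_w, head_b):
    """Classify x with the frozen body plus the trained head."""
    feats = frozen_features(x)
    z = sum(w * f for w, f in zip(head_w, feats)) + head_b
    return 1 if z > 0 else 0

# Tiny made-up target task: the label is 1 when the first input value is larger.
train_x = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
train_y = [1, 1, 0, 0]
head_w, head_b = train_head(train_x, train_y)
```

Only `head_w` and `head_b` change during training, which is why this approach needs far less data and compute than training the whole network.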


## Hugging Face Transformers
Hugging Face is an AI community and Machine Learning platform created in 2016 by Julien Chaumond, Clément Delangue, and Thomas Wolf. It aims to democratize NLP by providing Data Scientists, AI practitioners, and Engineers immediate access to over 20,000 pre-trained models based on the state-of-the-art transformer architecture. These models can be applied to


In [19]:
# !pip install transformers sentencepiece
# !pip install torch


In [8]:
# !pip install pandas
# Note: `pipeline` is imported from `transformers`; it is not a separate package to install.

In [None]:
# !pip install jupyter
# !pip install ipykernel
# !python -m ipykernel install --user --name=burcuenv --display-name "My Virtual Environment"
# !jupyter notebook

In [13]:
import pandas as pd
import sentencepiece  # required by the Marian tokenizer
from transformers import MarianTokenizer, MarianMTModel, pipeline

This dataset is enriched by Facebook and was created to predict the popularity of an article before its publication. The analysis will be based on the description column. To illustrate our examples, we will be using only three examples from the data.

Below is a brief description of the data. As loaded here, it has 15 columns and 10,437 rows.

In [14]:
data_path = "articles_data.csv"
# news_data = pd.read_csv(data_path, error_bad_lines=False)
news_data = pd.read_csv(data_path)

In [15]:
display(news_data.head())
# Show data information
news_data.info()

Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0
2,2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,
3,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0
4,4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10437 entries, 0 to 10436
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Unnamed: 0                       10437 non-null  int64  
 1   source_id                        10437 non-null  object 
 2   source_name                      10437 non-null  object 
 3   author                           9417 non-null   object 
 4   title                            10435 non-null  object 
 5   description                      10413 non-null  object 
 6   url                              10436 non-null  object 
 7   url_to_image                     9781 non-null   object 
 8   published_at                     10436 non-null  object 
 9   content                          9145 non-null   object 
 10  top_article                      10435 non-null  float64
 11  engagement_reaction_count        10319 non-null  float64
 12  engagement_comment

# Language Translation

MarianMT is an efficient Machine Translation framework.

In [17]:
# Get the name of the model - english to french
from transformers import MarianTokenizer, MarianMTModel
model_name = 'Helsinki-NLP/opus-mt-en-fr'

# Get the tokenizer
tokenizer = MarianTokenizer.from_pretrained(model_name)
# Instantiate the model
model = MarianMTModel.from_pretrained(model_name)



In [18]:
def format_batch_texts(language_code, batch_texts):
    # Prefix each text with the target-language token, e.g. ">>fr<< Life is good."
    formatted_batch = [">>{}<< {}".format(language_code, text) for text in batch_texts]
    return formatted_batch

In [19]:
def perform_translation(batch_texts, model, tokenizer, language="fr"):
    # Prepare the text data into the format expected by the model
    formatted_batch_texts = format_batch_texts(language, batch_texts)

    # Tokenize the batch and generate the translations
    inputs = tokenizer(formatted_batch_texts, return_tensors="pt", padding=True)
    translated = model.generate(**inputs)

    # Convert the generated token indices back into text
    translated_texts = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    return translated_texts

In [53]:
## Inputs of the model 

english_texts = ["Life is good.", "Hi everyone?", "How are you?","Yay, I did it!"]
# Get the name of the model
model_name = 'Helsinki-NLP/opus-mt-en-fr'

# Get the tokenizer
trans_model_tkn = MarianTokenizer.from_pretrained(model_name)
# Instantiate the model
trans_model = MarianMTModel.from_pretrained(model_name)

In [54]:
# Check the model translation from the original language (English) to French
translated_texts = perform_translation(english_texts, trans_model, trans_model_tkn)

In [55]:
# Print the source sentences, then each translation
print(english_texts)
for text in translated_texts:
    print("Translation : \n", text)

['Life is good.', 'Hi everyone?', 'How are you?', 'Yay, I did it!']
Translation : 
 La vie est bonne.
Translation : 
 Salut tout le monde?
Translation : 
 Comment allez-vous?
Translation : 
 Oui, je l'ai fait!


# Zero-shot classification

In most cases, the conventional process of training a Machine Learning model requires prior knowledge of all potential labels or targets. For instance, if your initial training labels encompass subjects like science, politics, or education, predicting the label for healthcare would necessitate retraining the model to incorporate that specific label and its associated input data.

Contrastingly, there exists an alternative approach that enables the prediction of a text's target without prior exposure to any of the potential labels. This model can be readily utilized by loading it from the hub.

The objective here is to categorize the content of previous descriptions, spanning categories such as technology, politics, security, or finance.

Yin et al. proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct the hypothesis "This text is about politics." The probabilities for entailment and contradiction are then converted to label probabilities.
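The conversion from NLI outputs to label probabilities can be sketched without loading a model. The sketch below assumes one common variant: for each label, the neutral class is ignored and a two-way softmax over the contradiction and entailment logits gives the probability that the label applies. The logits are made-up numbers, not real model output, and the helper names are hypothetical.

```python
import math

def build_hypotheses(labels, template="This text is about {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def entailment_probability(contradiction_logit, entailment_logit):
    """Two-way softmax over (contradiction, entailment); neutral is ignored."""
    peak = max(contradiction_logit, entailment_logit)
    exp_c = math.exp(contradiction_logit - peak)
    exp_e = math.exp(entailment_logit - peak)
    return exp_e / (exp_c + exp_e)

candidate_labels = ["politics", "technology", "finance"]
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical (contradiction, entailment) logits an NLI model might return
# for one premise paired with each hypothesis.
toy_logits = {"politics": (-2.0, 3.0),
              "technology": (1.5, -1.0),
              "finance": (0.5, 0.5)}

scores = {label: entailment_probability(*toy_logits[label])
          for label in candidate_labels}
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

In this variant each label is scored independently, so the scores need not sum to 1; normalizing the entailment logits across labels instead would give a single-label distribution.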

In [56]:
from transformers import pipeline

In [61]:
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

In [62]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'one day I will see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9938651919364929, 0.0032738035079091787, 0.002861031796783209]}

In [65]:
sequence_to_classify = "Ankara is the capital of Turkey"
# Candidate labels to score against the sequence
candidate_labels = ["Politics", "Geography", "Science"]
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'Ankara is the capital of Turkey',
 'labels': ['Geography', 'Politics', 'Science'],
 'scores': [0.6661186218261719, 0.22976505756378174, 0.1041162982583046]}

# Sentiment analysis

References:
- Dataset: https://www.kaggle.com/datasets/szymonjanowski/internet-articles-data-with-users-engagement
- Tutorial: https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face