# Introduction to Transformers

### Objective:
Familiarise yourself with the Huggingface transformers components, such as tokenizers and models, and try out some basic applications with pipelines.

1. [Huggingface Models](https://huggingface.co/models): Learn how to select models for specific tasks and languages.
2. [Huggingface Datasets](https://huggingface.co/datasets): Explore the available datasets and observe which datasets suit which tasks.
3. [Huggingface Documentation](https://huggingface.co/docs): Familiarise yourself with the Huggingface documentation.
4. [Huggingface LLM Course](https://huggingface.co/learn/llm-course/chapter1/1) **[Recommended Self-Study]**


#### Models
Models are transformer-based and fall into three categories: encoder-only (for example BERT), decoder-only (for example GPT-2) and encoder-decoder (for example T5).

#### Tokenizers

The tokenizer breaks the input sequence into tokens and maps them to numbers. Calling it returns a dictionary with the keys input_ids, token_type_ids and attention_mask:
1. **input_ids** are token indices: the numerical representations of the tokens that make up the sequence and that the model uses as input.
2. **attention_mask** is a binary mask that tells the model which tokens to attend to (1) and which to ignore (0, used for padding positions).
3. **token_type_ids** are needed when an input consists of more than one sequence, as in sentence-pair classification or question answering. Both sequences are joined into a single “input_ids” entry with the help of special tokens, such as the classifier ([CLS]) and separator ([SEP]) tokens, and token_type_ids mark which sequence each token belongs to.
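
To make the three fields concrete, here is a toy sketch in plain Python. It is not the library's implementation: the vocabulary `TOY_VOCAB` and the helper `encode_pair` are hypothetical names invented for illustration, and the word IDs are made up. Only the special-token IDs ([PAD] = 0, [CLS] = 101, [SEP] = 102) follow bert-base-uncased.

```python
# Toy vocabulary: the word IDs are invented for illustration.
# Only the special-token IDs follow bert-base-uncased.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "where": 1, "do": 2, "i": 3, "work": 4,
    "at": 5, "the": 6, "university": 7,
}


def encode_pair(first, second, max_length):
    """Lay out two whitespace-split sequences the way a BERT-style tokenizer does."""
    first_ids = [TOY_VOCAB[w] for w in first.lower().split()]
    second_ids = [TOY_VOCAB[w] for w in second.lower().split()]
    cls_id, sep_id, pad_id = TOY_VOCAB["[CLS]"], TOY_VOCAB["[SEP]"], TOY_VOCAB["[PAD]"]

    # Layout: [CLS] first [SEP] second [SEP]
    input_ids = [cls_id] + first_ids + [sep_id] + second_ids + [sep_id]
    # Segment 0 covers [CLS], the first sequence and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    # Every real token is attended to ...
    attention_mask = [1] * len(input_ids)

    # ... while padding positions are masked out with 0.
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("Where do I work", "I work at the university", max_length=14)
for name, values in encoded.items():
    print(f"{name:15}", values)
```

The two trailing positions are padding, so they carry the [PAD] ID in input_ids and 0 in attention_mask.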

#### Pipelines
Pipelines are an abstraction that bundles a model with its required preprocessing and postprocessing steps, so that we can pass in raw text and get a suitable answer directly. More information on the types of pipelines is available [here](https://github.com/huggingface/transformers/tree/main/src/transformers/pipelines). Using a pipeline without specifying a model causes a default model to be fetched. For reproducibility, it is recommended to specify a model suitable for the task.



In [1]:
from transformers import BertTokenizer, AutoModelForMaskedLM
import torch
import textwrap


In [2]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

In [3]:
text = "I love the CAS in NLP a lot!!"

In [4]:
tokenizer(text)

{'input_ids': [101, 1045, 2293, 1996, 25222, 1999, 17953, 2361, 1037, 2843, 999, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## To Do
What do the different functions do?
1. tokenizer()
2. tokenizer.tokenize()
3. tokenizer.encode()
4. tokenizer.encode_plus()
5. tokenizer.batch_encode_plus()


## [Padding and Truncation](https://huggingface.co/docs/transformers/pad_truncation) strategies
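Padding or truncation is needed because the sequences in a batch can vary in length, while the model expects inputs of equal length.
There are different strategies, as described in the link above. A common approach is to pad the batch to the length of its longest sequence and to truncate to the maximum length the model can accept. The example below uses a different strategy: it pads the single input all the way to the model's maximum length.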
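
The common strategy described above can be sketched in plain Python. This is a toy illustration on lists of token IDs, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the IDs are made up, and truncation here simply cuts the list, whereas a real tokenizer also keeps the closing special token.

```python
def pad_and_truncate(batch, model_max_length, pad_id=0):
    """Truncate each sequence to model_max_length, then pad to the longest one left."""
    # Step 1: truncation. Simplification: a plain slice, special tokens are not preserved.
    truncated = [seq[:model_max_length] for seq in batch]
    # Step 2: pad every sequence to the longest sequence in the (truncated) batch.
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [
        [1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three sequences of different lengths (illustrative IDs).
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
padded = pad_and_truncate(batch, model_max_length=6)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

With the library itself, the corresponding behaviour is requested with `tokenizer(texts, padding="longest", truncation=True)`.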

In [6]:
tokenizer.batch_encode_plus([text], padding = "max_length", max_length = tokenizer.model_max_length)

{'input_ids': [[101, 1045, 2293, 1996, 25222, 1999, 17953, 2361, 1037, 2843, 999, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

## Huggingface [Pipelines](https://huggingface.co/docs/transformers/main/en/quicktour)
Pipelines are workflows that transform a sequence of input data with the necessary preprocessing and postprocessing steps to get a suitable output as per the chosen task.

### Text Generation Pipeline
The text generation pipeline takes a prompt and continues it by generating new text.
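
Conceptually, continuing a text means repeatedly predicting the next token and appending it to the sequence. The loop below is a toy sketch of that idea only: the lookup table `NEXT_TOKEN` is a hypothetical stand-in for a language model, and `generate` is an invented helper, not the pipeline's code.

```python
# Hypothetical next-token table standing in for a trained language model.
NEXT_TOKEN = {
    "sun": "rises",
    "rises": "in",
    "in": "the",
    "the": "east",
    "east": "<eos>",
}


def generate(prompt, max_new_tokens=10):
    """Append one predicted token at a time until <eos> or the token budget is reached."""
    tokens = prompt.lower().split()
    for _ in range(max_new_tokens):
        # The "model" only looks at the last token; a real model sees the whole sequence.
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return " ".join(tokens)


completion = generate("Sun rises in the")
print(completion)
```

A real model replaces the table lookup with a probability distribution over the whole vocabulary, and the decoding strategy decides which token to pick at each step.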


In [14]:
from transformers import pipeline

text_generation_pipe = pipeline("text-generation", model = "gpt2")
text_list = ["Sun rises in the",
             "Climate change is causing",
             "US and Europe are"]
# GPT-2 has no built-in padding token. To batch prompts of different lengths, they must be padded
# to the same length, so the end-of-sequence token (id 50256) is passed as the padding token.
text_generation_pipe(text_list, pad_token_id=50256)


Device set to use cuda:0


[[{'generated_text': 'Sun rises in the evening, the waters are rushing, and the sun is rising again.\n\nIt is the day that your father told you about. There are many things, but one thing is certain: for this day will not last for long.\n\nThe day will not last long for long.\n\nYou are a child. You will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be the same as before.\n\nYou will not be'}],
 [{'generated_text':

### Question answering Pipeline

Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. Some question answering models can generate answers without context (similar to text generation), while others need context.
The example below is extractive QA where the model requires both context and an input question to answer.

In [8]:
question_answer_pipeline = pipeline("question-answering")
question_answer_pipeline(
    question="Where do I work?", #currently
    context="My name is Sukanya and I work at the University of Bern" # "My name is Sukanya and I have worked at the Universities of Neuchatel and FFHS before arriving at the University of Bern"
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


{'score': 0.8546791672706604,
 'start': 37,
 'end': 55,
 'answer': 'University of Bern'}
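
The `start` and `end` values in the output above are character offsets into the context, which is what makes this QA extractive: the answer is a span copied from the context. The short check below illustrates this. The `result` dictionary is copied from the output above rather than recomputed, so no model is needed to run it.

```python
context = "My name is Sukanya and I work at the University of Bern"

# Copied from the pipeline output above.
result = {
    "score": 0.8546791672706604,
    "start": 37,
    "end": 55,
    "answer": "University of Bern",
}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```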

In [9]:
model_name = "deepset/roberta-base-squad2"
question_answer_pipeline = pipeline("question-answering", model = model_name, tokenizer = model_name)
question_answer_pipeline(
    question="Where do I work?",  #currently
    context="My name is Sukanya and I have worked at the Universities of Neuchatel and FFHs before arriving at the University of Bern"
)

Device set to use cuda:0


{'score': 0.4752790033817291,
 'start': 102,
 'end': 120,
 'answer': 'University of Bern'}

### Masked language modelling

Masked language modelling is the task of masking some of the words in a sentence and predicting which words should replace those masks. The mask token depends on the model: RoBERTa-based models use `<mask>`, while BERT uses `[MASK]`.


In [10]:
mlm_pipeline = pipeline("fill-mask")
mlm_pipeline("This course will teach you all about natural language processing with <mask>", top_k=5)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'score': 0.16351354122161865,
  'token': 31886,
  'token_str': ' Python',
  'sequence': 'This course will teach you all about natural language processing with Python'},
 {'score': 0.04778652638196945,
  'token': 5136,
  'token_str': ' ease',
  'sequence': 'This course will teach you all about natural language processing with ease'},
 {'score': 0.02428208664059639,
  'token': 49430,
  'token_str': ' Clojure',
  'sequence': 'This course will teach you all about natural language processing with Clojure'},
 {'score': 0.02377931773662567,
  'token': 46948,
  'token_str': ' Lua',
  'sequence': 'This course will teach you all about natural language processing with Lua'},
 {'score': 0.02255992963910103,
  'token': 38592,
  'token_str': ' Haskell',
  'sequence': 'This course will teach you all about natural language processing with Haskell'}]
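
The `score` values above are probabilities over the vocabulary for the masked position, and `top_k` keeps only the highest-scoring candidates. The sketch below mimics that selection with a softmax over a handful of invented logits. `top_k_predictions` and `toy_logits` are hypothetical names, and the numbers are not taken from any model.

```python
import math


def top_k_predictions(logits, k):
    """Apply a softmax to a {token: logit} mapping and keep the k most probable tokens."""
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits.values())
    exps = {tok: math.exp(value - max_logit) for tok, value in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [{"token_str": tok, "score": p} for tok, p in ranked[:k]]


# Invented logits for four candidate tokens at the masked position.
toy_logits = {"Python": 4.0, "ease": 2.8, "Lua": 2.1, "bananas": -1.0}
template = "natural language processing with <mask>"

for pred in top_k_predictions(toy_logits, k=2):
    print(round(pred["score"], 3), template.replace("<mask>", pred["token_str"]))
```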

### Translation
The translation pipeline translates the input text from one language into another.

In [11]:
translator_pipeline = pipeline("translation_en_to_de")
translator_pipeline("AI is taking over the world")

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'translation_text': 'Die künstliche Intelligenz übernimmt die Welt'}]

### Sentiment Analysis
The sentiment analysis pipeline returns a sentiment label and a confidence score for an input text.

In [12]:
sentiment_analysis_pipeline = pipeline("sentiment-analysis")
text = "The movie was terrible but the music was good."
sentiment_analysis_pipeline(text)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9996398687362671}]

More information on the NLP tasks supported by Huggingface can be found [here](https://huggingface.co/tasks).