# Introduction to Transformers

### Objective:
Familiarise yourself with the core Huggingface Transformers components, such as tokenizers and models, and try out some basic applications with pipelines.

1. [Huggingface Models](https://huggingface.co/models): Learn how to select models for specific tasks and languages.
2. [Huggingface Datasets](https://huggingface.co/datasets): Explore the available datasets and observe which datasets suit which tasks.
3. [Huggingface Documentation](https://huggingface.co/docs): Familiarise yourself with the Huggingface documentation.
4. [Huggingface LLM Course](https://huggingface.co/learn/llm-course/chapter1/1) **[Recommended Self-Study]**


#### Models
Models are transformer-based and fall into three categories: encoder-only (for example BERT), decoder-only (for example GPT-2) and encoder-decoder (for example T5).
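One practical difference between the categories is which tokens each position is allowed to attend to. The following is a minimal sketch in plain Python, not library code: the helper name `visibility_mask` is hypothetical, and it assumes the usual convention that encoders attend in both directions while decoders attend only to the current and earlier positions.

```python
def visibility_mask(seq_len, causal):
    """Row i marks which positions token i may attend to (1 = visible)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Encoder-style attention: every token sees every other token.
encoder_mask = visibility_mask(3, causal=False)

# Decoder-style attention: token i sees only positions 0..i.
decoder_mask = visibility_mask(3, causal=True)

for row in decoder_mask:
    print(row)
```

An encoder-decoder model combines both: the encoder reads the whole input with the first kind of mask, and the decoder generates the output with the second.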

#### Tokenizers

The tokenizer breaks the input sequence into tokens and maps each token to a numerical id. It returns a dictionary-like object with the keys `input_ids`, `token_type_ids` and `attention_mask`:
1. **input_ids** are token indices: the numerical representations of the tokens that make up the sequence and are used as input by the model.
2. **attention_mask** is a binary mask that tells the model which tokens should be attended to (1) and which should not (0, used for padding).
3. **token_type_ids** mark which sequence each token belongs to. They are useful for tasks that take more than one sequence, such as sequence-pair classification or question answering. These tasks require two different sequences to be joined in a single “input_ids” entry, which is usually done with the help of special tokens such as the classifier ([CLS]) and separator ([SEP]) tokens.
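To make the three fields concrete, here is a minimal sketch in plain Python of how a sequence pair could be laid out as `[CLS] A [SEP] B [SEP]` and padded. It is an illustration, not the library's implementation: the function `encode_pair` is hypothetical, the ids passed in are placeholders, and the special-token ids (101, 102 and 0 for padding) are assumptions based on the `bert-base-uncased` vocabulary.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Join two id sequences as [CLS] A [SEP] B [SEP] and pad to max_length."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every real token is attended to.
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("max_length is too small for this pair")

    # Padding positions get the pad id and are masked out with 0.
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

# Placeholder ids standing in for two already-tokenized sequences.
encoded = encode_pair([7, 8, 9], [5, 6], max_length=10)
for key, value in encoded.items():
    print(key, value)
```

With a real tokenizer the same layout is typically obtained by passing two texts to the tokenizer call together with padding options.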

#### Pipelines
Pipelines are an abstraction that bundles a model with its required preprocessing and postprocessing steps, so that we can pass in raw text and directly get a task-specific answer. The available pipeline types are listed [here](https://github.com/huggingface/transformers/tree/main/src/transformers/pipelines).
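The three stages can be sketched with a toy stand-in written in plain Python. Nothing below is library code: the word lists, the function names and the word-counting "model" are all hypothetical and only illustrate how preprocessing, the model call and postprocessing are chained behind one function. In the library itself, the entry point is `transformers.pipeline(...)`.

```python
import math
import re

# Hypothetical word lists standing in for a trained model's knowledge.
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"hate", "bad", "awful"}

def preprocess(text):
    """Stage 1: turn raw text into tokens (here: lowercase words)."""
    return re.findall(r"[a-z]+", text.lower())

def toy_model(tokens):
    """Stage 2: produce raw scores in the order [negative, positive]."""
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    return [float(neg), float(pos)]

def postprocess(scores):
    """Stage 3: convert raw scores to a label and a softmax probability."""
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three stages so the caller only deals with raw text."""
    return postprocess(toy_model(preprocess(text)))

result = toy_pipeline("I love the CAS in NLP a lot!!")
print(result)
```

The caller never touches tokens or raw scores, which is the convenience a pipeline provides.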



In [1]:
from transformers import BertTokenizer, AutoModelForMaskedLM
import torch
import textwrap


In [2]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)

In [3]:
text = "I love the CAS in NLP a lot!!"

In [4]:
tokenizer(text)

{'input_ids': [101, 1045, 2293, 1996, 25222, 1999, 17953, 2361, 1037, 2843, 999, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
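The output above contains 13 ids although the sentence has only 8 whitespace-separated words. The difference comes from the special tokens added at both ends, from punctuation being split into separate tokens, and from words outside the vocabulary being broken into sub-word pieces. The sketch below imitates this with a greedy longest-match-first split in plain Python. The vocabulary is a toy assumption chosen to reproduce the token count, not the real `bert-base-uncased` vocabulary, and the function names are hypothetical.

```python
import re

# Toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {
    "[CLS]", "[SEP]", "[UNK]",
    "i", "love", "the", "cas", "in", "nl", "##p", "a", "lot", "!",
}

def basic_tokenize(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Wrap the sub-word pieces of every word in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in basic_tokenize(text):
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

tokens = toy_tokenize("I love the CAS in NLP a lot!!")
print(tokens)
print(len(tokens))
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(...)` can be applied to the `input_ids` to inspect the actual pieces.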