# Transformer

A Transformer is a neural network architecture designed to process sequential data such as text. It uses attention mechanisms to model the relationships between different parts of the input, which makes it very effective for tasks like translation, summarization, and text generation.
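
To make the idea of attention concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It is an illustration only: the token vectors are made-up numbers, and a real Transformer additionally applies learned projections to obtain queries, keys, and values and runs several attention heads in parallel.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (made-up numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, and each row sums to 1.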

* Transformers library
The Transformers library by Hugging Face provides easy access to powerful Transformer-based models like BERT and GPT. It allows us to perform complex tasks such as text classification, question answering, and summarization with just a few lines of code.


In [None]:
#Transformers installation
!pip install transformers

* pipeline

The pipeline is a tool in the Hugging Face Transformers library that makes it easy to use pretrained models. These models can be based on various architectures such as BERT, GPT, or T5. The pipeline hides the complexity of the process, so we can call different pretrained models for specific tasks, such as sentiment analysis, text generation, or question answering, with just a few lines of code.

In [2]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


By default, the model downloaded for this pipeline is "distilbert-base-uncased-finetuned-sst-2-english":

* Model Architecture: DistilBERT (a smaller, faster, and lighter version of BERT).
* Pretrained Task: Fine-tuned for sentiment analysis on the SST-2 (Stanford Sentiment Treebank v2) dataset.
* SST-2: A dataset containing movie reviews labeled as either positive or negative sentiment.
* Use Case: Performs binary sentiment classification (e.g., positive or negative sentiment).

When this command is run for the first time, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction: the tokenizer's job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable.
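
As a rough illustration of the post-processing step, the sketch below turns raw model scores (logits) into a label and a probability with a softmax. The logits are hypothetical values rather than the result of a real model run, and the label order in ID2LABEL is an assumption for this example; the pipeline's actual internals may differ.

```python
import math

# Assumed label order for a binary sentiment model (0 = NEGATIVE, 1 = POSITIVE)
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def postprocess(logits):
    """Convert raw model scores (logits) into a readable label and probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Hypothetical logits, not taken from a real model run
result = postprocess([-4.2, 4.5])
print(result)
```

The returned dictionary has the same shape as one entry of the pipeline output: a label plus a score between 0 and 1.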

In [3]:
classifier('I am Happy')

[{'label': 'POSITIVE', 'score': 0.9998801946640015}]

In [4]:
classifier('I am sad')

[{'label': 'NEGATIVE', 'score': 0.9991856217384338}]

In [5]:
classifier('The man is sitting behind the tree. and when i saw here i checked he is very nervous')

[{'label': 'NEGATIVE', 'score': 0.9933850169181824}]

# Use some pretrained model

Let's say we want to use another model; for instance, one that has been trained on French data. We can search the Hugging Face model hub, which hosts models pretrained by research labs as well as community models (usually fine-tuned versions of those big models on a specific dataset).
Applying the tags "French" and "text-classification" gives back a suggestion: "nlptown/bert-base-multilingual-uncased-sentiment".

In [7]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

Device set to use cuda:0


In [8]:
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.3368820548057556}]

This classifier can now deal with texts in English and French, but also in Dutch, German, Italian, and Spanish. You can also replace that name with a local folder where you have saved a pretrained model.

You can also pass a model object and its associated tokenizer to the pipeline. We will need two classes for this:
 * AutoTokenizer (tokenization is a crucial step in NLP: it breaks raw text down into smaller units known as tokens and converts them into numbers so that a model can understand and process the input)
 * TFAutoModelForSequenceClassification

In [11]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [10]:
model_name ="nlptown/bert-base-multilingual-uncased-sentiment"


In [14]:
# This model only exists in PyTorch, so we use the 'from_pt' flag to import it in TensorFlow
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt = True)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [16]:
classifier = pipeline('sentiment-analysis', model = model, tokenizer = tokenizer)

Device set to use 0


In [17]:
classifier("I am a good girl")

[{'label': '4 stars', 'score': 0.4100622534751892}]
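
Unlike the default model, this multilingual model reports a star rating instead of POSITIVE or NEGATIVE. The helper below is a small sketch of how such a label could be converted into a number and a coarse sentiment. The function name and the thresholds (2 or fewer stars is negative, 3 is neutral, 4 or more is positive) are assumptions made for illustration, not part of the library.

```python
def stars_to_sentiment(prediction):
    """Map a label such as '4 stars' to (stars, sentiment); thresholds are an assumption."""
    stars = int(prediction["label"].split()[0])  # '4 stars' -> 4
    if stars <= 2:
        sentiment = "negative"
    elif stars == 3:
        sentiment = "neutral"
    else:
        sentiment = "positive"
    return stars, sentiment

# Predictions copied from the pipeline outputs above
spanish = {'label': '3 stars', 'score': 0.3368820548057556}
english = {'label': '4 stars', 'score': 0.4100622534751892}
print(stars_to_sentiment(spanish))
print(stars_to_sentiment(english))
```

Note that both scores are well below 0.5: the probability is spread over five classes, so it is worth looking at the score as well as the label.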

# Under the hood: pretrained models

As we saw, the model and tokenizer are created using the from_pretrained method.

In [18]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [19]:
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


# Using the tokenizer
As we know, tokenization is a crucial preprocessing step for NLP tasks.
First, the tokenizer splits a given text into words (or parts of words), called ***tokens***. It then converts those tokens into numbers so that a tensor can be built out of them and fed to the model.
To do this, the tokenizer has a ***vocab***, which is the part we download when we instantiate it with the from_pretrained method.


In [21]:
inputs = tokenizer("I am happy with my profession")

This returns a dictionary mapping strings to lists of ints. **It contains the IDs of the tokens**, as well as an attention mask that the model will use.

In [23]:
print(inputs)

{'input_ids': [101, 1045, 2572, 3407, 2007, 2026, 9518, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
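
The output above can be reproduced by hand with a toy tokenizer, which may help clarify what the vocab is for. This is only a sketch: the word-to-ID pairs are inferred from the output above by position, the function name toy_encode is made up, and the real tokenizer splits text into subword pieces and handles unknown words, whereas this version raises a KeyError for any word outside its tiny vocabulary.

```python
# Tiny vocabulary excerpt inferred from the output above; the real vocab.txt
# contains far more entries.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,  # special tokens marking start and end
    "i": 1045, "am": 2572, "happy": 3407,
    "with": 2007, "my": 2026, "profession": 9518,
}

def toy_encode(text):
    """Lowercase, split on whitespace, and look each token up in VOCAB."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [VOCAB[token] for token in tokens]
    attention_mask = [1] * len(input_ids)  # every position is a real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode("I am happy with my profession")
print(encoded)
```

The text is lowercased because the model is "uncased", and the IDs 101 and 102 wrap the sentence as the special start and end tokens.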


You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept,
and get tensors back. You can specify all of that to the tokenizer:

In [25]:
tf_batch = tokenizer(
    ["I love Transformers!", "Tokenizers are very useful."],  # a list, so each sentence is encoded separately
    padding=True,         # pad all sequences to the length of the longest one
    truncation=True,      # cut sequences longer than max_length
    max_length=512,       # maximum input length of models like BERT or DistilBERT
    return_tensors="tf",  # return TensorFlow tensors (tf.Tensor)
)

In [26]:
for key, value in tf_batch.items():
    print(f"{key} : {value.numpy().tolist()}")

input_ids : [[101, 1045, 2293, 19081, 999, 102, 0, 0, 0], [101, 19204, 17629, 2015, 2024, 2200, 6179, 1012, 102]]
attention_mask : [[1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1]]
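
Padding, truncation, and the attention mask can also be sketched by hand, which shows what these tokenizer options do to a batch. The helper name pad_batch, the default pad_id of 0, and the token IDs are assumptions for illustration. Note also that this simple slice drops the final special token when it truncates, which the real tokenizer takes care to keep.

```python
def pad_batch(sequences, max_length=512, pad_id=0):
    """Truncate each ID list to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized sentences of different lengths (IDs are illustrative)
batch = pad_batch([[101, 1045, 2293, 102], [101, 1045, 2572, 3407, 2007, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.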
