### **Transformers, Pipeline, Tokenizer, Models**

#### **Transformers**
**Install Transformers**
* `pip install transformers` installs the Hugging Face Transformers library.
* Transformers runs on top of `PyTorch` or `TensorFlow`, so install one of the two first.
* Use `pip install torch` for PyTorch or `pip install tensorflow` for TensorFlow.

##### **Pipeline**
**`What does a pipeline do?`** A pipeline does three things:
* `preprocess` the text (in this case, by applying a tokenizer)
* `feed the preprocessed text to the model`
* `postprocess` the output (in this case, show the sentiment and its score)
> These steps can differ from task to task. For more about the Transformers pipeline, check [here](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/pipelines)
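
The three stages above can be sketched in plain Python. This is a toy illustration, not how the library implements a pipeline: the `TOY_WEIGHTS` lexicon, the whitespace tokenizer, and all function names are hypothetical stand-ins for a real tokenizer and model.

```python
import math

# Hypothetical word weights standing in for a trained model (illustration only).
TOY_WEIGHTS = {"love": 2.0, "great": 1.5, "learn": 1.0, "hate": -2.0, "boring": -1.5}


def preprocess(text):
    """Stage 1: turn raw text into tokens (a real pipeline uses a tokenizer)."""
    return text.lower().replace(".", " ").split()


def forward(tokens):
    """Stage 2: map tokens to a raw score (a real pipeline runs a neural network)."""
    return sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)


def postprocess(raw_score):
    """Stage 3: convert the raw score into a label and a confidence."""
    prob_positive = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid
    if prob_positive >= 0.5:
        return {"label": "POSITIVE", "score": prob_positive}
    return {"label": "NEGATIVE", "score": 1.0 - prob_positive}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does internally."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("I love to learn."))
print(toy_pipeline("This is boring."))
```

The return value deliberately mirrors the `{'label': ..., 'score': ...}` shape that the real sentiment pipeline prints.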

**Example-1: Sentiment Analysis**

In [3]:
# Import pipeline
# Hugging Face provides pipelines for many tasks; browse the available models at https://huggingface.co/models
from transformers import pipeline

# Create a pipeline object for the task. We do not choose a model here, so the library picks a default one on our behalf.
classifier = pipeline("sentiment-analysis")

# Obtain sentiment using the classifier
sentiment = classifier("I want to learn transformers in-depth")

# See the sentiment
print(sentiment)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9715973138809204}]


**Example-2: Text Generation**

In [8]:
from transformers import pipeline

# Define the generator: here we choose a model along with the task.
generator = pipeline("text-generation", model="distilgpt2")

# Define the generation parameters and generate text
sentence_portion = "The text generator libaray of huggingface consists of"  # The generator will complete this sentence
generated_sentence = generator(sentence_portion,
                               max_length=50,   # default 20
                               num_return_sequences=1)
# We can modify the generation parameters. See here (https://huggingface.co/docs/transformers/main_classes/text_generation) for details.

# See the generated sentence
print(generated_sentence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The text generator libaray of huggingface consists of the following options:\n\n\n\n\n\n\n\n\n\n\nThese are the options you just need to define yourself. Each option you choose creates an interesting dynamic layer:\n\n\n'}]
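
The returned `generated_text` repeats the prompt and, in the run above, contains long runs of newlines. A minimal sketch of tidying such output with the standard library follows, using a shortened copy of the output above as sample data. The `tidy_generation` helper is hypothetical and not part of the library.

```python
import re


def tidy_generation(prompt, generated_text):
    """Return only the newly generated continuation, with newline runs collapsed."""
    # In the run above the generated text starts with the prompt, so drop that prefix.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Collapse runs of two or more newlines into one and trim both ends.
    return re.sub(r"\n{2,}", "\n", continuation).strip()


prompt = "The text generator libaray of huggingface consists of"
sample_output = [
    {"generated_text": prompt + " the following options:\n\n\n\nThese are the options you just need to define yourself."}
]

print(tidy_generation(prompt, sample_output[0]["generated_text"]))
```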


**Example-3: Zero-Shot Classification** Here, we classify a sentence into one of a set of candidate classes that we supply ourselves.

In [10]:
from transformers import pipeline

# Define the classifier
classifier = pipeline("zero-shot-classification")

# Obtain the class of a sentence using the classifier
approx_class = classifier(
    "I am happy because I have been learning.",
    candidate_labels=["Education", "Politics", "Travel"]
)

# See the result
print(approx_class)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'I am happy because I have been learning.', 'labels': ['Education', 'Travel', 'Politics'], 'scores': [0.48755040764808655, 0.3468903601169586, 0.16555924713611603]}
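
Note that the returned `labels` are not in the order we passed them: in this run they come back sorted by descending score. A small sketch of reading the result, using the dictionary printed above as data (the variable names are illustrative only):

```python
# Result copied from the zero-shot run above.
result = {
    "sequence": "I am happy because I have been learning.",
    "labels": ["Education", "Travel", "Politics"],
    "scores": [0.48755040764808655, 0.3468903601169586, 0.16555924713611603],
}

# Pair each label with its score.
ranked = list(zip(result["labels"], result["scores"]))

# Take the maximum explicitly instead of relying on the returned ordering.
best_label, best_score = max(ranked, key=lambda pair: pair[1])

print(f"Best label: {best_label} ({best_score:.3f})")
print(f"Scores sum to {sum(result['scores']):.3f}")
```

Here the scores add up to roughly 1, so each one can be read as the share of probability assigned to that candidate label.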


For more about the Transformers pipeline, check [here](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/pipelines)

#### **Tokenizer**

Let's dive deeper into the pipeline to understand the mechanism inside it. Here, we look at the tokenization and model details.

**Example**

In [11]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification      # Generic (Auto) classes - general purpose
from transformers import BertTokenizer, BertModel       # Model-specific classes - specialized purpose

classifier = pipeline("sentiment-analysis")
# Because we don't specify a model, it uses the 'distilbert-base-uncased-finetuned-sst-2-english' model by default.

sentiment = classifier("A pipeline basically perform three things.")

print(sentiment)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.8353028297424316}]


We can specify the model and tokenizer explicitly.

In [12]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification      # Generic Class - General purpose
from transformers import BertTokenizer, BertModel       # Specific Class - Specialized purpose

# Specify the Tokenizer and Model
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Specify the tokenizer and model to the classifier
classifier = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

sentiment = classifier("A pipeline basically perform three things.")

print(sentiment)

[{'label': 'POSITIVE', 'score': 0.8353028297424316}]


**Tokenizer In-Depth**

In [13]:
from transformers import pipeline
from transformers import AutoTokenizer

# Select the model and tokenizer
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a Text
sentence = "A pipeline basically perform three things."
tokens = tokenizer(sentence)

print(tokens)

{'input_ids': [101, 1037, 13117, 10468, 4685, 2093, 2477, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


Here, we see a list of `input_ids` and an `attention_mask`. The `input_ids` are the numeric ids of the tokens (words or word pieces); the first and last ids, 101 and 102, are the special `[CLS]` and `[SEP]` tokens that the tokenizer adds. The `attention_mask` tells the attention layers which tokens to attend to (1) and which to ignore (0).
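
How padding produces the 0 entries of an attention mask can be sketched in plain Python. This is a toy illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the second id list is made up, and `PAD_ID = 0` assumes the `[PAD]` id of the BERT uncased vocabulary.

```python
PAD_ID = 0  # assumed [PAD] id of the BERT uncased vocabulary


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every id list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1037, 13117, 10468, 4685, 2093, 2477, 1012, 102],  # ids from the output above
    [101, 2093, 2477, 1012, 102],                            # hypothetical shorter sentence
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A single sentence needs no padding, which is why the mask printed above contains only 1s.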
We can look at the tokenizer in more detail.

In [16]:
from transformers import pipeline
from transformers import AutoTokenizer

# Select the model and tokenizer
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A sentence string
sentence = "A pipeline basically perform three things."

# Tokens
tokens = tokenizer.tokenize(sentence)
print(f"Tokens: {tokens}")
# Ids
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Ids   : {ids}")
# Back to String
string = tokenizer.decode(ids)
print(f"String: {string}")

Tokens: ['a', 'pipeline', 'basically', 'perform', 'three', 'things', '.']
Ids   : [1037, 13117, 10468, 4685, 2093, 2477, 1012]
String: a pipeline basically perform three things.
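
The ids printed here lack the leading 101 and trailing 102 seen in the earlier `tokenizer(sentence)` output, because `tokenize` followed by `convert_tokens_to_ids` does not add special tokens. The toy sketch below reproduces both id lists with a tiny vocabulary copied from the outputs above. The helper functions are hypothetical, and the simple split stands in for the real WordPiece tokenization.

```python
# Tiny vocabulary built from the tokens and ids printed above, plus the two
# special tokens that appear as 101 and 102 in the `tokenizer(sentence)` output.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "a": 1037, "pipeline": 13117, "basically": 10468, "perform": 4685,
    "three": 2093, "things": 2477, ".": 1012,
}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def toy_tokenize(text):
    """Lowercase and split off periods (a real tokenizer uses WordPiece)."""
    return text.lower().replace(".", " .").split()


def toy_encode(text):
    """Mimic `tokenizer(text)`: tokenize, add special tokens, map to ids."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [VOCAB[tok] for tok in tokens]


sentence = "A pipeline basically perform three things."
tokens = toy_tokenize(sentence)
ids_without_special = [VOCAB[tok] for tok in tokens]
ids_with_special = toy_encode(sentence)

print(tokens)
print(ids_without_special)
print(ids_with_special)
print(" ".join(ID_TO_TOKEN[i] for i in ids_with_special))
```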


**Combine all the above with PyTorch**

In [25]:
# Import Dependencies
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as fnn

# Specify the Tokenizer and Model
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Specify the tokenizer and model to the classifier
classifier = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# Input texts (used for inference here, not for training)
train_text = ["This model is a fine-tune checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2.", 
              "This model reaches an accuracy of 91.3 on the dev set."]

# Predict Sentiments with Classifier Pipeline (Default Setting)
print('Using Classifier in Default Setting---------')
result = classifier(train_text)
print(f"Result: {result}")


# Predict sentiments step by step with the tokenizer and model (customized setting)
print('\nUsing Classifier in Customized Setting-------------')

# Tokenize the batch: pad to the longest sentence, truncate at 128 tokens, return PyTorch tensors
tokens = tokenizer(train_text, padding=True, truncation=True, max_length=128, return_tensors='pt')
print(f"Tokens: {tokens}")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**tokens)        # Unpack the token dictionary into keyword arguments
    print(f"Output: {outputs}")
    pred_results = fnn.softmax(outputs.logits, dim=1)        # Convert logits to probabilities
    print(f"Pred Results: {pred_results}")
    labels = torch.argmax(pred_results, dim=1)      # Index of the most probable class
    print(f"Labels: {labels}")
Using Classifier in Default Setting---------
Result: [{'label': 'POSITIVE', 'score': 0.9951896667480469}, {'label': 'NEGATIVE', 'score': 0.5991277694702148}]

Using Classifier in Customized Setting-------------
Tokens: {'input_ids': tensor([[  101,  2023,  2944,  2003,  1037,  2986,  1011,  8694, 26520,  1997,
          4487, 16643, 23373,  1011,  2918,  1011,  4895, 28969,  1010,  2986,
          1011, 15757,  2006,  7020,  2102,  1011,  1016,  1012,   102],
        [  101,  2023,  2944,  6561,  2019, 10640,  1997,  6205,  1012,  1017,
          2006,  1996, 16475,  2275,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])}
Output: SequenceClassifierOutput(loss=None, logits=tensor([[-2.6458,  2
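
The last two steps of the cell, softmax and argmax, are what turn raw logits into the label and score that the pipeline reports. A dependency-free sketch of that arithmetic follows. The logits are hypothetical (not the values from the run above), and the `ID2LABEL` mapping is an assumption about this checkpoint.

```python
import math

# Hypothetical logits for a batch of two sentences (not the values from the run above).
logits = [[-2.5, 2.7], [0.2, -0.2]]

# Assumed label mapping for this checkpoint: index 0 is NEGATIVE, index 1 is POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


results = []
for row in logits:
    probs = softmax(row)
    label_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append({"label": ID2LABEL[label_id], "score": probs[label_id]})

print(results)
```

The reported score is the probability of the winning class, so with two classes it cannot fall below 0.5. A value close to 0.5 signals a prediction the model is unsure about.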

**Combine all the above with TensorFlow** `[Will be done later]`

In [26]:
# Import Dependencies
from transformers import pipeline
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# # Specify the Tokenizer and Model
# model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

# # Specify the tokenizer and model to the classifier
# classifier = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# # Input texts (used for inference here, not for training)
# train_text = ["This model is a fine-tune checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2.", 
#               "This model reaches an accuracy of 91.3 on the dev set."]

# # Predict Sentiments with Classifier Pipeline (Default Setting)
# print('Using Classifier in Default Setting---------')
# result = classifier(train_text)
# print(result)


# # Predict Sentiments with Classifier Pipeline (Customized Setting)
# tokens = tokenizer(train_text, padding=True, truncation=True, max_length=128, return_tensors='tf')

# # Run inference (TensorFlow needs no torch.no_grad() context)
# print('\nUsing Classifier in Customized Setting-------------')
# output = model(**tokens)        # Unpack the token dictionary into keyword arguments
# print(f"Output: {output}")
# pred_results = tf.nn.softmax(output.logits, axis=1)    # Convert logits to probabilities
# print(f"Pred Results: {pred_results}")
# labels = tf.argmax(pred_results, axis=1)      # Index of the most probable class
# print(f"Labels: {labels}")

#### **FineTune**