This notebook follows the AssemblyAI video "[Getting Started With Hugging Face in 15 Minutes | Transformers, Pipeline, Tokenizer, Models](https://youtu.be/QEaBAZQCtwE)" and writes out its code step by step. I believe it's very helpful if you're interested in Transformers but have no clue about the code or the common practices. Here you go!

## **Pipeline**

In [1]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)


TensorFlow version: 2.8.2


In [2]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [3]:
from transformers import pipeline

Let's start with the magic of the transformers `pipeline`!

### **First one: sentiment-analysis**

In [4]:
# First one: "sentiment-analysis"
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [5]:
res = classifier("I've been waiting for a HuggingFace class my whole life")
print(res)

[{'label': 'POSITIVE', 'score': 0.8828805088996887}]


Notice something helpful here: the phrase "my whole life" seems to push the positive score pretty high, close to 0.9.

In [6]:
res = classifier("I'm so excited to start to learn the Transformer and it's my dream come true!")
print(res)

[{'label': 'POSITIVE', 'score': 0.999854564666748}]


Very positive, isn't it? 0.9998!

In [7]:
res = classifier("I'm so sorry that I don't know what to do in this model...")
print(res)

[{'label': 'NEGATIVE', 'score': 0.9997251629829407}]


Two negative cues, "sorry" and "don't", likely push the negative score up to 0.9997!

In [8]:
res = classifier("I'm a robot")
print(res)

[{'label': 'POSITIVE', 'score': 0.9941974878311157}]


A very straightforward positive: it scores 0.994!

In [9]:
res = classifier("I'm nothing")
print(res)

[{'label': 'NEGATIVE', 'score': 0.9995001554489136}]


OK, this one is firmly negative: "I'm nothing" scores 0.9995!

In [10]:
res = classifier("I'm ok")
print(res)

[{'label': 'POSITIVE', 'score': 0.9998317956924438}]


It's interesting: in daily life, saying "I'm ok" doesn't always mean you're 100% ok. It depends on the voice, the tone, and other factors. But these are just words on a computer screen, so we can't complain too much. It's 0.9998!

In [11]:
res = classifier("I'm like...no idea what to do")
print(res)

[{'label': 'NEGATIVE', 'score': 0.9997380375862122}]


"No" still seems to provide a clear negative cue. So it's 0.9997!

In [12]:
res = classifier("I'm testing how to make it neutral")
print(res)

[{'label': 'NEGATIVE', 'score': 0.9988194108009338}]


It looks like "I'm testing" carries a hint of doubt. Note that this model only has two labels, POSITIVE and NEGATIVE, so it can never answer "neutral". Here it's NEGATIVE at 0.9988!

### **Second one: text-generation**

In [13]:
generator = pipeline("text-generation", model = "distilgpt2")


In [14]:
res2 = generator(
    "In this course, we will teach you how to",
    max_length = 30, 
    num_return_sequences=2
)
print(res2)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to solve the many different problems you'll face when they're facing them in your field: How to Build"}, {'generated_text': 'In this course, we will teach you how to navigate the world through virtual reality, virtual reality, and how you would like to get closer to the'}]


### **Third one: zero-shot-classification**

In [15]:
classifier = pipeline("zero-shot-classification")

res3 = classifier(
    "This is a course about Python list comprehension",
    candidate_labels = ["education", "politics", "business"]
)
print(res3)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)



{'sequence': 'This is a course about Python list comprehension', 'labels': ['education', 'business', 'politics'], 'scores': [0.962202787399292, 0.02684125490486622, 0.010955986566841602]}


Cool, isn't it? "Education" is highly related to "course" and "comprehension", and "Python" probably helps too. The sentence really has nothing to do with politics or business. "Zero-shot classification" really is zero-shot: the model was never trained on these specific labels.

The scores of the three candidate labels add up to 100%: education 96% + business 3% + politics 1%. The labels come back sorted by score, highest first, even though we passed them in education/politics/business order.
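Both observations can be checked with plain Python on the numbers printed above, without loading any model. This is only a small sketch; the variable names are made up for illustration.

```python
# Values copied from the zero-shot output above.
candidate_labels = ["education", "politics", "business"]  # order we passed in
result = {
    "labels": ["education", "business", "politics"],
    "scores": [0.962202787399292, 0.02684125490486622, 0.010955986566841602],
}

# The scores behave like a probability distribution over the candidate labels.
total = sum(result["scores"])
print(f"sum of scores: {total:.6f}")

# The labels are ordered by descending score, not by the order we passed in.
is_sorted = result["scores"] == sorted(result["scores"], reverse=True)
print("sorted by score:", is_sorted)
print("same as input order:", result["labels"] == candidate_labels)

# To read one label's score regardless of the returned order, build a lookup.
score_by_label = dict(zip(result["labels"], result["scores"]))
print("politics:", score_by_label["politics"])
```

Building a lookup like `score_by_label` avoids relying on the position of a label in the returned list.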

Time to visit the Hugging Face documentation for [Transformers](https://huggingface.co/docs/transformers/index).
It lists 124 models (from ALBERT and BERT all the way to YOSO) and, for each model, five kinds of support: 1. slow tokenizer, 2. fast tokenizer, 3. PyTorch, 4. TensorFlow, 5. Flax.

In [16]:
# AutoTokenizer is a generic class that loads the right tokenizer for a checkpoint
# AutoModelForSequenceClassification is also a generic class, but specific to sequence classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertTokenizer, BertModel

classifier = pipeline("sentiment-analysis")

res4 = classifier("I've been waiting for a HugginFlace Course my whole life.")

print(res4)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.980135440826416}]


In [18]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BertTokenizer, BertModel

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModelForSequenceClassification.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

res5 = classifier("I've been waiting for a HuggingFlace Course my whole life.")

print(res5)

[{'label': 'POSITIVE', 'score': 0.9248871207237244}]


A tokenizer turns text into a numerical representation that the model can understand.

In [25]:
sequence = "Using a transformer network is simple"
res6 = tokenizer(sequence)
print(res6)

{'input_ids': [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


If we call the tokenizer directly, we get back a dictionary.

`attention_mask` is a list of zeros and ones.

A zero means the attention layer should ignore that token.

101 and 102 mark the beginning (`[CLS]`) and the end (`[SEP]`) of the sequence.

Only the IDs in the middle, from 2478 and 1037 all the way to 3722, are the input IDs that refer to the string itself.
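To make the roles of the special tokens and the attention mask concrete, here is a small plain-Python sketch that works on the IDs printed above. The helper `pad_to` is hypothetical (it is not part of transformers); the padding ID 0 is taken from the padded batch output later in this notebook.

```python
# IDs copied from the tokenizer output above.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] for this checkpoint

input_ids = [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102]

# Drop the special tokens to keep only the IDs that stand for the text itself.
content_ids = [i for i in input_ids if i not in (CLS_ID, SEP_ID)]
print(content_ids)


def pad_to(ids, length, pad_id=PAD_ID):
    """Hypothetical helper: right-pad `ids` and build the matching mask.

    Assumes length >= len(ids). Real tokens get mask 1, padding gets mask 0.
    """
    n_pad = length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


padded_ids, attention_mask = pad_to(input_ids, 12)
print(padded_ids)
print(attention_mask)
```

With a single unpadded sentence the mask is all ones, which is why the output above shows no zeros; zeros only appear once shorter sentences are padded to match longer ones.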

In [26]:
tokens = tokenizer.tokenize(sequence)
print(tokens)

['using', 'a', 'transform', '##er', 'network', 'is', 'simple']


`tokenizer.tokenize` gives us the tokens back. Note that "transformer" is split into two subword pieces, `transform` and `##er`.
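BERT-style tokenizers split words with the WordPiece scheme. The toy sketch below shows the greedy longest-match idea behind a split like `transform` + `##er`. It is not the library's implementation: the vocabulary is a hypothetical seven-entry set containing only the pieces seen in the output above, and the real tokenizer also handles punctuation, normalization, and a far larger vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just enough for this sentence.
toy_vocab = {"using", "a", "transform", "##er", "network", "is", "simple"}

sequence = "Using a transformer network is simple"
tokens = [p for w in sequence.lower().split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```

Because "transformer" is not in the toy vocabulary as a whole word, the loop falls back to the longest prefix it can find and continues with `##`-prefixed pieces.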

In [27]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[2478, 1037, 10938, 2121, 2897, 2003, 3722]


Each token has a unique corresponding ID.

In [28]:
# Give us the original string back
decoded_string = tokenizer.decode(ids)
print(decoded_string)

using a transformer network is simple


We get the original string back (in lower case, because the model is uncased).
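The round trip from tokens to IDs and back can be mimicked with two dictionaries. This is only a sketch using the seven token/ID pairs printed above; the `decode` function here is a simplified stand-in, not the tokenizer's own method.

```python
# Token/ID pairs copied from the outputs above.
tokens = ["using", "a", "transform", "##er", "network", "is", "simple"]
ids = [2478, 1037, 10938, 2121, 2897, 2003, 3722]

token_to_id = dict(zip(tokens, ids))
id_to_token = {i: t for t, i in token_to_id.items()}


def decode(id_list):
    """Simplified decode: map IDs to tokens and glue '##' pieces back on."""
    words = []
    for i in id_list:
        tok = id_to_token[i]
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: append to previous word
        else:
            words.append(tok)
    return " ".join(words)


print(decode(ids))
```

The capital "U" of the original sentence is gone for good: lowercasing happens before the IDs are assigned, so decoding cannot restore it.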

## **PyTorch**

In [29]:
import torch
import torch.nn.functional as F

In [43]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModelForSequenceClassification.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

In [44]:
# Use multiple sentences from now
X_train = ["I've been waiting for a HuggingFlace course my whole life.", "Python is great!"]

res7 = classifier(X_train)
print(res7)

[{'label': 'POSITIVE', 'score': 0.9248871207237244}, {'label': 'POSITIVE', 'score': 0.9998615980148315}]


In [46]:
batch = tokenizer(X_train, padding = True,
                  truncation = True, 
                  max_length = 512,
                  return_tensors = 'pt')
print(batch)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 10258,
         10732,  2607,  2026,  2878,  2166,  1012,   102],
        [  101, 18750,  2003,  2307,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [49]:
with torch.no_grad():
  outputs = model(**batch)
  print(outputs)
  predictions = F.softmax(outputs.logits, dim = 1)
  print(predictions)
  labels = torch.argmax(predictions, dim = 1)
  print(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.1968,  1.3139],
        [-4.2745,  4.6111]]), hidden_states=None, attentions=None)
tensor([[7.5113e-02, 9.2489e-01],
        [1.3835e-04, 9.9986e-01]])
tensor([1, 1])


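If we compare the pipeline's earlier prediction, 0.924887..., with this manual PyTorch prediction, 0.92489, they are the same up to rounding. This is how the pipeline works!

These are the same steps (tokenize, forward pass, softmax, argmax) that we would use inside our own PyTorch training loop!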
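The softmax and argmax steps can be reproduced by hand from the logits printed above, using only the standard library. A minimal sketch: the label mapping is inferred from the outputs in this notebook (index 1 was reported as POSITIVE), and because the printed logits are rounded to four decimals the probabilities match only approximately.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (rounded to 4 decimals).
batch_logits = [[-1.1968, 1.3139], [-4.2745, 4.6111]]

# Inferred from the outputs above, not read from the model config.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    print(id2label[best], round(probs[best], 4))
```

Softmax only rescales the logits into probabilities, so the argmax of the probabilities is always the argmax of the logits.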

## **SAVE/LOAD**

In [50]:
save_directory = "saved"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

tok = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)

The reloaded tokenizer and model should give the same results as before!

## **Model Hub**

Visit [Models under HuggingFace.co](https://huggingface.co/models) and you can browse 56,034 models (as of 7:35 PM PDT on 7/4/2022; it was 34,950 on 4/3/2022, so the number of models grew 60% in 3 months. It's amazing!)

On the left side we can filter by TASK (28), LIBRARY (31), DATASET (1099), LANGUAGE (188) and LICENSE (42).

I have a gut feeling that the growth in volume is basically coming from the "language" filter. That part could easily double or triple by the fall, and grow 10X by the end of this year to reach 500K models to choose from.

distilbert-base-uncased-finetuned-sst-2-english is a great example of how to read a model name: it is a DistilBERT-based model, "uncased", and "finetuned" on the "sst-2" dataset in English (the language).

Here I leave an open question, in case anyone has an idea: what's the best model to use here?

In [None]:
# Test the summarization example on a CNN article
# The model from the original example no longer works because it can't be found in the model list anymore
# I tried a different one, but its results don't seem good enough

from transformers import pipeline

summarizer = pipeline("summarization", model = "facebook/mbart-large-50-one-to-many-mmt")

text = """New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. 
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. 
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared “I do” five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her “first and only” marriage.
Barrientos, now 39, is facing two criminal counts of “offering a false instrument for filing in the first degree,” referring to her false statements on the 2010 marriage license application, according to court documents.
"""

print(summarizer(text, max_length = 130, min_length = 30, do_sample = False))

## **Finetune**

Standard process:
1. Prepare the dataset
2. Load a pretrained tokenizer and call it on the dataset to get the encodings
3. Build a PyTorch Dataset from the encodings
4. Load the pretrained model
5. a) Load a Trainer and train it
   b) or use a native PyTorch training loop
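Step 3 is the least obvious one, so here is a rough sketch of its shape. To keep it runnable without any dependencies, the class below is a framework-free stand-in: in practice it would subclass `torch.utils.data.Dataset` and wrap each value in `torch.tensor`. The class name `EncodingsDataset` and the labels are made up for illustration; the IDs are copied from outputs earlier in this notebook and padded by hand.

```python
class EncodingsDataset:
    """Stand-in for a torch.utils.data.Dataset built from tokenizer encodings."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists, one entry per example
        self.labels = labels        # one label per example

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick example `idx` out of every field, then attach its label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# "Python is great!" (padded to length 9) and "Using a transformer network is simple".
encodings = {
    "input_ids": [
        [101, 18750, 2003, 2307, 999, 102, 0, 0, 0],
        [101, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 102],
    ],
    "attention_mask": [
        [1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
    ],
}
labels = [1, 1]  # hypothetical labels, only for illustration

dataset = EncodingsDataset(encodings, labels)
print(len(dataset))
print(dataset[0])
```

Each item is a dictionary whose keys match the model's keyword arguments, which is the shape a training loop or a `Trainer` expects to batch.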

In [None]:
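from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# NOTE: `tokenized_datasets` is not defined in this notebook. It is assumed to be
# the tokenized dataset from steps 1-3, with "train" and "validation" splits.

# Pads each batch to the length of its longest sequence (an assumed choice of collator)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# "test-trainer" is the output directory for checkpoints
training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator,
    tokenizer = tokenizer
)

trainer.train()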