<a href="https://colab.research.google.com/github/codenavy94/DeepLearningStudy/blob/main/STS/Practical_Guides_The_Hugging_Face_Library_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Code originally from [rupert ai](https://www.youtube.com/channel/UCgLgHT0PrS6EcsI37XPWHHw)'s video ["Hugging Face Transformers: the basics. Practical coding guides SE1E1. NLP Models (BERT/RoBERTa)"](https://www.youtube.com/watch?v=DQc2Mi7BcuI).

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Foun

# Pipeline

In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [3]:
sentence_pos = 'I love dogs'
sentence_neg = 'I really hate dogs'

In [5]:
classifier(sentence_pos), classifier(sentence_neg)

([{'label': 'POSITIVE', 'score': 0.999713122844696}],
 [{'label': 'NEGATIVE', 'score': 0.9968041181564331}])

# Manual

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DistilBertForSequenceClassification, DistilBertTokenizer
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Alternatively, use a model-specific tokenizer class instead of AutoTokenizer.
# The Auto classes are convenient: to switch models, you only change model_name and the rest of the code stays the same.

# tokenizer = DistilBertTokenizer.from_pretrained(model_name)

In [10]:
inputs_pos, inputs_neg = tokenizer(sentence_pos), tokenizer(sentence_neg)

In [11]:
# BERT's special token IDs: 101 = [CLS] (start of sequence), 102 = [SEP] (end of sequence), 0 = [PAD] (padding)
# Token IDs, including the special token IDs, differ from model to model

inputs_pos, inputs_neg

({'input_ids': [101, 1045, 2293, 6077, 102], 'attention_mask': [1, 1, 1, 1, 1]},
 {'input_ids': [101, 1045, 2428, 5223, 6077, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]})
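To make the structure of these outputs concrete, here is a minimal pure-Python sketch of how a sentence turns into `input_ids` and `attention_mask`. The `toy_ids` dictionary and the `encode` helper are hypothetical; the IDs are simply copied from the tokenizer outputs above, and the real tokenizer does considerably more (normalization, subword splitting, truncation).

```python
# Hypothetical toy vocabulary: IDs copied from the tokenizer outputs above.
toy_ids = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "love": 2293, "really": 2428, "hate": 5223, "dogs": 6077,
}

def encode(sentence):
    """Lowercase, split on whitespace, and wrap the IDs in [CLS] ... [SEP]."""
    tokens = sentence.lower().split()
    input_ids = [toy_ids["[CLS]"]] + [toy_ids[t] for t in tokens] + [toy_ids["[SEP]"]]
    # Every position holds a real token, so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

print(encode("I love dogs"))
print(encode("I really hate dogs"))
```

Under these assumptions the sketch reproduces the two dictionaries shown above for `sentence_pos` and `sentence_neg`.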

In [12]:
# What if a word is not in the vocabulary, e.g. 'swimmingly'? The tokenizer splits it into known subword pieces.
tokenizer('I swimmingly hate dogs')

{'input_ids': [101, 1045, 5742, 2135, 5223, 6077, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [14]:
tokenizer.convert_ids_to_tokens([5742, 2135])

['swimming', '##ly']

In [15]:
tokenizer('I ajsfdjasod hate dogs')

{'input_ids': [101, 1045, 19128, 22747, 2094, 17386, 7716, 5223, 6077, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [17]:
tokenizer.convert_ids_to_tokens([19128, 22747, 2094, 17386, 7716, 5223])

['aj', '##sf', '##d', '##jas', '##od', 'hate']
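The splits above are consistent with WordPiece-style greedy longest-match-first tokenization, where continuation pieces carry a `##` prefix. The following is a minimal sketch of that idea, not the library's implementation; `wordpiece_split` and `toy_vocab` are hypothetical names, and the toy vocabulary contains only the pieces seen in the outputs above.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary built from the pieces shown in the outputs above.
toy_vocab = {"swimming", "##ly", "aj", "##sf", "##d", "##jas", "##od", "hate"}

print(wordpiece_split("swimmingly", toy_vocab))
print(wordpiece_split("ajsfdjasod", toy_vocab))
```

With the real vocabulary (tens of thousands of entries) the same procedure rarely needs to fall back to the unknown token, because single characters and short pieces are usually available.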

In [18]:
pt_batch = tokenizer(
    [sentence_pos, sentence_neg],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='pt'
)

In [22]:
pt_batch

{'input_ids': tensor([[ 101, 1045, 2293, 6077,  102,    0],
        [ 101, 1045, 2428, 5223, 6077,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1]])}
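The effect of `padding=True` can be sketched in plain Python: shorter sequences are right-padded with the padding ID (0 here) up to the longest sequence in the batch, and the attention mask marks padded positions with 0 so the model can ignore them. The `pad_batch` helper is hypothetical and only illustrates the layout; it does not handle truncation or tensor conversion.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each ID list to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# ID lists copied from the single-sentence tokenizer outputs earlier in the notebook.
batch = pad_batch([
    [101, 1045, 2293, 6077, 102],
    [101, 1045, 2428, 5223, 6077, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```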

In [20]:
pt_outputs = pt_model(**pt_batch)

In [21]:
pt_outputs

SequenceClassifierOutput([('logits', tensor([[-3.9494,  4.2067],
                                   [ 3.0996, -2.6431]], grad_fn=<AddmmBackward0>))])

In [23]:
# softmax converts each row of logits into class probabilities that sum to 1 (dim=-1 applies it across the classes)

from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

In [24]:
pt_predictions

tensor([[2.8688e-04, 9.9971e-01],
        [9.9680e-01, 3.1959e-03]], grad_fn=<SoftmaxBackward0>)
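As a cross-check, the same probabilities can be recomputed by hand from the logits printed earlier. This is a minimal pure-Python sketch; the `softmax` helper is written here for illustration, and the `id2label` mapping (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption inferred from the pipeline results at the top of the notebook rather than read from the model config.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the pt_outputs cell above.
logits = [[-3.9494, 4.2067], [3.0996, -2.6431]]

# Assumed label order, inferred from the pipeline output earlier in the notebook.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    results.append((id2label[best], probs[best]))

print(results)
```

Up to rounding of the printed logits, the top-class probabilities agree with the scores returned by the `pipeline` call for the same two sentences.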