# Hugging Face 

Hugging Face is the go-to toolkit for working with pretrained models.

Almost everyone builds on pretrained models these days, so Hugging Face is a natural default.

In [1]:
import transformers #huggingface
transformers.__version__

#pip install transformers[sentencepiece]
#pip install transformers
#pip install sentencepiece

'4.21.3'

In [2]:
import evaluate ## a tool to help compute metrics
evaluate.__version__

'0.4.0'

In [3]:
import datasets ##huggingface datasets
datasets.__version__

'2.8.0'

In [4]:
import accelerate  ##make your training faster!
accelerate.__version__

'0.15.0'

## 1. Pipeline 

The pipeline is the most basic building block in Hugging Face: you name a pretrained model and use it directly for inference.

In [8]:
import os
# Only needed behind a proxy; the address below is specific to the author's network.
os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [None]:
from transformers import pipeline

clf = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
clf("I love huggingface so much.")

In [12]:
clf(["I love huggingface so much.", "The movie is annoyingly bad"])

[{'label': 'POSITIVE', 'score': 0.9997928738594055},
 {'label': 'NEGATIVE', 'score': 0.9998204112052917}]
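
Each result is a plain dictionary, so ordinary Python is enough to post-process it. The sketch below hard-codes the output shown above instead of calling the model; the 0.9 threshold and the name `confident` are assumptions chosen only for illustration.

```python
# Output copied from the cell above (no model call is made here).
sentences = ["I love huggingface so much.", "The movie is annoyingly bad"]
results = [
    {"label": "POSITIVE", "score": 0.9997928738594055},
    {"label": "NEGATIVE", "score": 0.9998204112052917},
]

THRESHOLD = 0.9  # assumed cut-off, not a library default

# Keep (sentence, label) pairs whose score clears the threshold.
confident = [
    (text, res["label"])
    for text, res in zip(sentences, results)
    if res["score"] >= THRESHOLD
]

for text, label in confident:
    print(f"{label:8s} {text}")
```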

In [13]:
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
clf("This is a NLP course at AIT", candidate_labels=["education", "tech", "finance"])

{'sequence': 'This is a NLP course at AIT',
 'labels': ['tech', 'education', 'finance'],
 'scores': [0.7233497500419617, 0.26346421241760254, 0.01318597886711359]}
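
In the zero-shot output, `labels` and `scores` are parallel lists ordered from most to least likely, and in this run the scores add up to roughly 1. A small sketch of reading that structure, using the values printed above (the names `label_scores`, `best_label`, and `total` are illustrative):

```python
# Output copied from the zero-shot cell above.
result = {
    "sequence": "This is a NLP course at AIT",
    "labels": ["tech", "education", "finance"],
    "scores": [0.7233497500419617, 0.26346421241760254, 0.01318597886711359],
}

# Pair each label with its score; the lists are already sorted best-first.
label_scores = dict(zip(result["labels"], result["scores"]))
best_label = result["labels"][0]
total = sum(result["scores"])

print(best_label)
print(round(total, 4))
```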

In [16]:
gen = pipeline("text-generation", model="distilgpt2")

gen("import numpy as ", max_length=100, num_return_sequences=2, pad_token_id=0, 
    eos_token_id=0)

[{'generated_text': 'import numpy as ссиболном Ф.0.6-5.0.0.5.0.0.6.1.2.1.6.0.1.5.1.1.4.3.2.13.0.1.0.4.1.3.2.12.5.4.4.3.6.3.4.1.4.1.4.'},
 {'generated_text': "import numpy as icedine.zip.The last three major issues that plague this site are not only the issue of the software and software that is available from our blog in this post.\n\n\nFirst of all, it is not possible to use the Raspberry Pi for programming. Therefore, you must first create a project code that does not have python. I used C on Linux, and tried to fix the bug. It's been around since the start of this post.\n1"}]

In [17]:
mlm = pipeline("fill-mask", model="distilroberta-base")
mlm("Chaky loves to teach deep <mask>.", top_k=3)

[{'score': 0.15233060717582703,
  'token': 2239,
  'token_str': ' learning',
  'sequence': 'Chaky loves to teach deep learning.'},
 {'score': 0.10400059074163437,
  'token': 9589,
  'token_str': ' breathing',
  'sequence': 'Chaky loves to teach deep breathing.'},
 {'score': 0.07009269297122955,
  'token': 30079,
  'token_str': ' truths',
  'sequence': 'Chaky loves to teach deep truths.'}]

In [18]:
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
qa(question="Where do Chaky work?", context="My name is Chaky and I love to \
   teach at AIT")

{'score': 0.9753952622413635, 'start': 43, 'end': 46, 'answer': 'AIT'}
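
The `start` and `end` fields are character offsets into the context, with `end` exclusive like a Python slice. The check below rebuilds the context exactly as the cell above does (the backslash continuation keeps the three leading spaces of the second line) and slices it; no model is called.

```python
# The line continuation in the cell above leaves one space after "to"
# plus three leading spaces from the next line.
context = "My name is Chaky and I love to " + " " * 3 + "teach at AIT"
result = {"score": 0.9753952622413635, "start": 43, "end": 46, "answer": "AIT"}

# Slice the context with the reported offsets.
span = context[result["start"]:result["end"]]
print(span)
```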

In [19]:
summ = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6",
                max_length=100)

summ("""Once upon a time, we have a course on NLP at AIT. It usually teaches about
        word2vec, glove, transformers, etc.  The course also has a coding
        component which ask the students to code using spacy, pytorch, huggingface.
        The students really suffer of this course and they had enough.  Legend
        said that Chaky is still teaching this course until now""")


[{'summary_text': ' The students really suffer of this course and they had enough. Chaky is still teaching this course until now . Legend of NLP legend says Chaky still teaches this course . The course also has a coding component which ask the students to code using spacy, pytorch, huggingface .'}]

In [20]:
# Probe for gender bias: compare the occupations predicted for "man" and "woman".
mlm = pipeline("fill-mask", model="distilroberta-base")
result = mlm("This man works as a <mask>.")
print([r["token_str"] for r in result])

[' translator', ' consultant', ' bartender', ' waiter', ' courier']


In [21]:
result = mlm("This woman works as a <mask>.")
print([r["token_str"] for r in result])

[' waitress', ' translator', ' nurse', ' bartender', ' consultant']
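
To make the difference between the two predictions easier to see, the two lists can be compared as sets. This sketch only reuses the token strings printed above; the variable names are illustrative.

```python
# Predictions copied from the two cells above.
man = [" translator", " consultant", " bartender", " waiter", " courier"]
woman = [" waitress", " translator", " nurse", " bartender", " consultant"]

# Drop the leading space carried by each predicted token string.
man_set = {t.strip() for t in man}
woman_set = {t.strip() for t in woman}

shared = sorted(man_set & woman_set)
man_only = sorted(man_set - woman_set)
woman_only = sorted(woman_set - man_set)

print("shared:    ", shared)
print("man only:  ", man_only)
print("woman only:", woman_only)
```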


## 2. Tokenization

The tokenizer is the first component of the pipeline: it converts raw text into the token IDs the model consumes.

In [22]:
from transformers import AutoTokenizer
# AutoTokenizer loads whichever tokenizer class matches the checkpoint, so you only need to name the checkpoint.

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer  = AutoTokenizer.from_pretrained(checkpoint)

In [23]:
raw_inputs = ["Chaky has been waiting for HuggingFace my whole life", 
              "NLP course is interesting but tough"]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101, 15775,  4801,  2038,  2042,  3403,  2005, 17662, 12172,  2026,
          2878,  2166,   102],
        [  101, 17953,  2361,  2607,  2003,  5875,  2021,  7823,   102,     0,
             0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
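
In the output above, the shorter sequence is padded with 0 up to the length of the longer one, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of that step, using the IDs printed above; `pad_batch` is a hypothetical helper, not part of `transformers`.

```python
# Token IDs copied from the output above, without the trailing padding.
sequences = [
    [101, 15775, 4801, 2038, 2042, 3403, 2005, 17662, 12172, 2026, 2878, 2166, 102],
    [101, 17953, 2361, 2607, 2003, 5875, 2021, 7823, 102],
]
PAD_ID = 0  # padding ID seen in the output above


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```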


In [26]:
tokenizer.decode([101, 15775,  4801,  2038,  2042,  3403,  2005, 17662, 12172,  2026,
          2878,  2166,   102])

'[CLS] chaky has been waiting for huggingface my whole life [SEP]'
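
The decoded string shows what the tokenizer did: it lower-cased the text (the checkpoint is uncased), wrapped it in `[CLS]` and `[SEP]`, and split rare words into subword pieces, which is why a 9-word sentence produced 11 IDs between the special tokens. The sketch below illustrates greedy longest-match-first subword splitting in the style of WordPiece. The vocabulary and the resulting pieces are hypothetical; they are not read from the real tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"has", "been", "hugging", "##face", "cha", "##ky"}

tokens = []
for word in "chaky has been huggingface".split():
    tokens.extend(wordpiece(word, toy_vocab))
print(tokens)
```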

## 3. Model

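The model is the second component of the pipeline (after the tokenizer). The base model returns one hidden-state vector per token rather than a prediction.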

In [27]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model      = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
inputs

{'input_ids': tensor([[  101, 15775,  4801,  2038,  2042,  3403,  2005, 17662, 12172,  2026,
          2878,  2166,   102],
        [  101, 17953,  2361,  2607,  2003,  5875,  2021,  7823,   102,     0,
             0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [29]:
outputs = model(**inputs)  # ** unpacks the dict into keyword arguments (input_ids=..., attention_mask=...)
print(outputs.last_hidden_state.shape)

# shape is (batch size, sequence length, hidden size)

torch.Size([2, 13, 768])
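
The `**inputs` call works because the tokenizer output behaves like a dictionary whose keys match the model's argument names. A toy illustration with a stand-in function follows; `forward` and the short ID lists are made up for this example and are not the real model or real tokenizer output.

```python
def forward(input_ids, attention_mask):
    """Stand-in for a model call; returns (batch size, sequence length)."""
    assert len(input_ids) == len(attention_mask)
    return len(input_ids), len(input_ids[0])


# Toy batch with the same keys the tokenizer returns.
inputs = {
    "input_ids": [[101, 2038, 102], [101, 102, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}

# These two calls are equivalent.
explicit = forward(input_ids=inputs["input_ids"],
                   attention_mask=inputs["attention_mask"])
unpacked = forward(**inputs)
print(explicit, unpacked)
```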


## 4. Postprocessing

Postprocessing is the last step of the pipeline (after the model): it turns the raw logits into probabilities and labels.

In [31]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model      = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs    = model(**inputs)

In [32]:
outputs.logits

tensor([[-2.0539,  2.0242],
        [ 0.6914, -0.6340]], grad_fn=<AddmmBackward0>)

In [33]:
import torch
# Softmax over the last dimension turns each row of logits into probabilities.
predictions = torch.nn.functional.softmax(outputs.logits, dim = -1)
print(predictions)

tensor([[0.0167, 0.9833],
        [0.7901, 0.2099]], grad_fn=<SoftmaxBackward0>)
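
To see what the softmax does to each row, it can be recomputed by hand from the logits printed earlier. This is a plain-Python sketch for illustration, not how `torch` implements it; subtracting the row maximum is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(row):
    """Exponentiate each logit and normalise so the row sums to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the cell above.
logits = [[-2.0539, 2.0242], [0.6914, -0.6340]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row])
```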


In [34]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
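
The final step is to take the index of the largest probability in each row and look it up in `id2label`. A sketch using the values printed above, written in plain Python so it runs without the model; the variable names are illustrative.

```python
# Values copied from the cells above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = [[0.0167, 0.9833], [0.7901, 0.2099]]
raw_inputs = [
    "Chaky has been waiting for HuggingFace my whole life",
    "NLP course is interesting but tough",
]

# Argmax per row, then map the winning index to its label.
labels = []
for row in predictions:
    best_index = max(range(len(row)), key=row.__getitem__)
    labels.append(id2label[best_index])

for text, label in zip(raw_inputs, labels):
    print(f"{label:8s} {text}")
```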