# Introduction to Hugging Face

This notebook illustrates how to perform some basic actions using the Hugging Face `transformers` ecosystem. We will not focus on biology but on how to accomplish common AI tasks. While we have chosen small models that you should be able to run on a CPU, having access to a GPU is highly recommended. Additionally, some of these commands will download datasets and models to the environment where you are running the code. Make sure you have write permissions and enough storage space.

In [1]:
import torch
torch.cuda.is_available()

True

Hugging Face has many packages. Most of the model-related functionality is in the `transformers` package; other packages, like `datasets` and `tokenizers`, are there to support training and inference.

In [2]:
from transformers import pipeline

`pipeline` is the generic wrapper for inference. You can specify many NLP and computer vision tasks and their associated models. The model is downloaded automatically into your `~/.cache` directory; you can also specify where the model should be downloaded. The wrapper also downloads the appropriate tokenizer that goes with the model. When a model is uploaded to Hugging Face, it comes with all the files necessary to run it. If a model was trained using a specific framework (torch, tensorflow), it is uploaded in that framework, and cross-framework compatibility is not guaranteed.
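
If the default cache location is short on space, one option is to redirect it before any Hugging Face library is imported. This is a minimal sketch, assuming the `HF_HOME` environment variable is honored by your installed versions; the path below is a hypothetical placeholder.

```python
import os
import tempfile

# Hypothetical location with enough free space; replace with your own path.
cache_root = os.path.join(tempfile.gettempdir(), "hf_cache_example")
os.makedirs(cache_root, exist_ok=True)

# Set this before `transformers` (or `datasets`) is imported,
# because the cache location is read when the library is loaded.
os.environ["HF_HOME"] = cache_root

print("Hugging Face cache root:", os.environ["HF_HOME"])
```

In a notebook this cell would need to run before the first `from transformers import ...` cell, otherwise the default location is already in use.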

In [3]:
classifier=pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

Downloading config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [4]:
text="this is an awesome package"
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9998750686645508}]

The label depends on how the model was trained; it does not have to be binary classification.
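
To make the link between training labels and pipeline output concrete, here is a small sketch of how a classification head's raw scores (logits) become a `label` and a `score`. The logits and the three-class `id2label` mapping are made-up values for illustration, not outputs of the model used above.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical three-class model: the label set comes from how it was trained.
id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
logits = [-1.2, 0.3, 2.5]  # made-up raw scores for one input

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```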

There are many tasks, each with associated models. Some models are bigger than others, and it is generally a good idea to have a GPU available when training or fine-tuning, or when you have thousands of samples to run inference on. If the model is bigger than the available VRAM, you will most likely get an out-of-memory error. On a single GPU there is not much you can do about it short of choosing a smaller model. Hugging Face supports distributed training and inference, that is, you can split the model across multiple GPUs (assuming they can talk to one another).
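
A back-of-the-envelope check can tell you in advance whether a model is likely to fit. The sketch below only counts the memory needed to hold the weights; the parameter count is an illustrative round number, not the exact size of any model in this notebook.

```python
def weight_memory_gb(num_params, bytes_per_param=4):
    """Approximate memory needed just to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

# Illustrative round number (an assumption for this example).
num_params = 400_000_000

fp32_gb = weight_memory_gb(num_params, bytes_per_param=4)  # 32-bit floats
fp16_gb = weight_memory_gb(num_params, bytes_per_param=2)  # 16-bit floats

print(f"float32: {fp32_gb:.2f} GB, float16: {fp16_gb:.2f} GB")
```

Activations during inference, and gradients and optimizer state during training, come on top of this figure, so treat it as a lower bound.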

In [5]:
summarizer = pipeline('summarization', model="sshleifer/distilbart-cnn-12-6")

text="""The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an 
encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. 
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. 
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly 
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, 
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art 
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."""

Downloading config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [6]:
summarizer(text)

[{'summary_text': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration . The best performing models also connect the encoder and decoder through an attention mechanism . We propose a new simple network architecture, the Transformer, based solely on attention mechanisms .'}]

Zero-shot classification does not rely on a classifier trained for your labels. The pipeline below uses a natural language inference (NLI) model: each candidate label is turned into a short hypothesis, and the model scores how strongly the text entails it. A related approach compares embeddings of the labels with the embedding of the text and picks the closest one. Here, embeddings refer to the model's "understanding" of the input, represented as a vector of numbers in a `torch.Tensor`. Once we calculate these embeddings for our query *and* labels, we can find the most similar label.

In [7]:
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")
text="hospital for sick children is a major research centre"
candidate_labels = ["politics", "economy", "entertainment", "science", "medicine"]
classifier(text, candidate_labels, multi_label=False)


Downloading config.json:   0%|          | 0.00/1.09k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.66M [00:00<?, ?B/s]

Downloading added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'sequence': 'hospital for sick children is a major research centre',
 'labels': ['science', 'medicine', 'economy', 'politics', 'entertainment'],
 'scores': [0.5817810297012329,
  0.4062599837779999,
  0.005329915322363377,
  0.003325793659314513,
  0.003303299890831113]}

In [8]:
# multi label
classifier(text, candidate_labels, multi_label=True)


{'sequence': 'hospital for sick children is a major research centre',
 'labels': ['science', 'medicine', 'economy', 'politics', 'entertainment'],
 'scores': [0.9735254645347595,
  0.9540853500366211,
  0.00026499555679038167,
  0.0001278855197597295,
  8.470423199469224e-05]}
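
The two outputs above differ because the per-label scores are normalized differently. With `multi_label=False` the labels compete and the scores sum to 1; with `multi_label=True` each label is judged on its own. The sketch below mimics that step with made-up entailment and contradiction logits; it illustrates the idea and is not the pipeline's actual code.

```python
import math

def softmax(values):
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["science", "medicine", "politics"]
# Made-up logits per label: how strongly the text entails or contradicts
# the hypothesis built from that label.
entailment = [3.1, 2.8, -2.0]
contradiction = [-2.5, -2.2, 3.0]

# multi_label=False: softmax across labels, so scores sum to 1.
single = softmax(entailment)

# multi_label=True: softmax over (contradiction, entailment) for each label
# separately, so several labels can score close to 1 at once.
multi = [softmax([c, e])[1] for c, e in zip(contradiction, entailment)]

for name, s, m in zip(labels, single, multi):
    print(f"{name:10s} single={s:.3f} multi={m:.3f}")
```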

Since the writing of this article, the Sentence Transformers library has come under Hugging Face's maintenance. It is still installed as a separate package and its functionality remains mostly unchanged, although that may change in the future. The Sentence Transformers package focuses on generating the embeddings we mentioned above. They can be generated in bulk for diverse texts. You can treat these embeddings like any other feature of your dataset and perform operations like classification, regression or clustering. An individual element of the vector is not a meaningful feature on its own, but the whole vector is the model's understanding of where the piece of text sits within the embedding space.
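
As a small illustration of treating embeddings as ordinary features, the sketch below classifies a new vector by its nearest class centroid. The 3-dimensional vectors and the class names are invented stand-ins; real sentence embeddings have hundreds of dimensions.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented toy "embeddings" grouped by a known class.
train = {
    "food": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "science": [[0.1, 0.9, 0.8], [0.0, 0.8, 0.9]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}

def predict(vector):
    """Return the label whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))

print(predict([0.85, 0.15, 0.05]))
print(predict([0.05, 0.85, 0.85]))
```

With real data you would replace the toy vectors with the output of an embedding model and would likely reach for a library classifier instead of this hand-written one.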

In [None]:
!pip install sentence_transformers  # if not installed; the leading "!" runs a shell command

In [10]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

sent1="I like eating hamburgers"
sent2="I like eating burgers and fries"
sent3="I do research on DNA sequences"

em1=model.encode(sent1)
em2=model.encode(sent2)
em3=model.encode(sent3)

In [12]:
em1.shape, em2.shape

((384,), (384,))

The Sentence Transformers package contains many models and utility functions for semantic similarity and other comparison methods. The cosine similarity is the cosine of the angle between the two embedding vectors. The cosine of 0 is 1, so if the angle between the vectors is 0 (they point in the same direction), the similarity is 1 and the texts are treated as having the same meaning.
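
Written out by hand, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below uses short made-up vectors in plain Python to show the three landmark cases.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # angle of 0 degrees
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # angle of 90 degrees
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # angle of 180 degrees

print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction: `[1.0, 2.0]` and `[2.0, 4.0]` are treated as identical.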

In [10]:
util.cos_sim(em1, em2)

tensor([[0.8372]])

In [11]:
util.cos_sim(em1, em3)

tensor([[0.1328]])

It is possible to do the same with models that are not in Sentence Transformers with a little more code. We can use the `feature-extraction` pipeline to get the embeddings of the text we are interested in and compare them to another set of embeddings.

In [13]:
from transformers import AutoModel, AutoTokenizer

# It is also possible to load a model and a tokenizer separately and combine them in a pipeline. This is useful if you have several models of different sizes, trained on different data, that all use the same tokenizer.
model=AutoModel.from_pretrained("distilbert-base-uncased")

# The model's maximum length is fixed by the position embeddings it was trained with (512 tokens for this model); there are methods that can stretch this a little.
# model_max_length records that limit. Truncation and padding are options of the tokenizer call itself, e.g. tokenizer(text, truncation=True, padding="max_length").
tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased", model_max_length=512)

# build the pipeline piece by piece
# use a PyTorch model and return a PyTorch tensor
extractor=pipeline("feature-extraction", framework="pt", model=model, tokenizer=tokenizer, return_tensors=True)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
short_sentence="This is a short sentence"
long_sentence="This sentence is considerable longer than the first one, while being still shorter than the model max length that is 512 words, it is also similar to the first sentence"
other_long_sentence="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. And then I met Gatsby!"

short_tokenized=tokenizer(short_sentence)
long_tokenized=tokenizer(long_sentence)
other_long_tokenized=tokenizer(other_long_sentence)

short_tokenized
short_tokenized["attention_mask"]

[1, 1, 1, 1, 1, 1, 1]

The attention mask tells the model which positions to pay attention to. It is 1 for real tokens and 0 for padding tokens, which are added on the left or right side (depending on the tokenizer's settings, which you can change; see the tokenizer documentation) when texts are padded to a common length. The model ignores positions marked 0. Texts longer than the model's context window are truncated rather than masked, so the extra tokens never reach the model. The `input_ids` are the text converted to numbers. LLMs do not work on words but on numbers, so we need a way to convert text into numbers; this is the job of the tokenizer. For the most part this is just a dictionary lookup, especially for smaller models with smaller vocabularies.
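
To make the dictionary lookup and the attention mask concrete, here is a toy tokenizer in plain Python. The vocabulary, the whitespace splitting and the `toy_tokenize` helper are all invented for illustration; real tokenizers split text into subword pieces and have tens of thousands of entries.

```python
# Invented miniature vocabulary.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "is": 5, "a": 6, "short": 7, "sentence": 8}

def toy_tokenize(text, max_length):
    """Lowercase, split on whitespace, look up ids, then truncate and pad."""
    words = text.lower().split()
    words = words[: max_length - 2]  # truncation: keep room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]] + [vocab.get(w, vocab["[UNK]"]) for w in words] + [vocab["[SEP]"]]
    mask = [1] * len(ids)              # real tokens are attended to
    padding = max_length - len(ids)
    ids += [vocab["[PAD]"]] * padding  # pad on the right
    mask += [0] * padding              # padding positions are ignored
    return {"input_ids": ids, "attention_mask": mask}

encoded = toy_tokenize("This is a short sentence", max_length=10)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The zeros in the mask line up with the `[PAD]` ids at the end; a text that is too long is cut down to fit and produces a mask of all ones.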

In [16]:
short_tokenized["input_ids"] # the integers corresponding to each token; 101 and 102 are the special [CLS] and [SEP] tokens

[101, 2023, 2003, 1037, 2460, 6251, 102]

In [18]:
tokenizer.decode(short_tokenized["input_ids"]) # the model is uncased, so everything is lowercased and the words Alper and alper are the same; cased models keep the difference, so pick whichever suits your needs

'[CLS] this is a short sentence [SEP]'

In [20]:
short_features=extractor(short_sentence)
long_features=extractor(long_sentence)

short_features[0].shape, long_features[0].shape

(torch.Size([7, 768]), torch.Size([34, 768]))

We cannot directly compare tensors of different shapes, so we need to convert them to the same shape. One way to do that is to take the mean across the token dimension, which gives vectors of the same length.
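
The averaging step can be seen on a small scale with plain lists. In this sketch the token embeddings are invented and have only 3 dimensions; the point is that texts with different numbers of tokens end up as vectors of the same length.

```python
def mean_pool(token_vectors):
    """Average over the token dimension: (num_tokens, dim) -> (dim,)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Invented token embeddings with dim=3; the two texts have different lengths.
short_text = [[1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]]
long_text = [[0.0, 1.0, 1.0],
             [2.0, 1.0, 3.0],
             [4.0, 1.0, 2.0],
             [2.0, 1.0, 2.0]]

short_vec = mean_pool(short_text)
long_vec = mean_pool(long_text)
print(len(short_vec), len(long_vec))
print(short_vec, long_vec)
```

A plain mean like this counts every position equally, including special tokens; if the inputs were padded, you would want to leave the padding positions out of the average.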

In [23]:
# average over the token dimension so both texts become 768-dimensional vectors
short_mean=short_features[0].mean(dim=0)
long_mean=long_features[0].mean(dim=0)

from torch.nn import CosineSimilarity

cos_sim=CosineSimilarity(dim=0)

In [24]:
cos_sim(short_mean, long_mean)

tensor(0.7598)

In [25]:
cos_sim(torch.from_numpy(em1), torch.from_numpy(em2))

tensor(0.8372)

In [26]:
cos_sim(torch.from_numpy(em1), torch.from_numpy(em3))

tensor(0.1328)