# Hugging Face
- Provides a set of [tools](https://huggingface.co/docs) for using pretrained models and for deploying models (Hugging Face Spaces);
- Best known for its `transformers` library, an open-source library that provides a wide range of pretrained models for NLP tasks such as text classification, question answering, and translation, as well as for vision and audio tasks;
- The `transformers` library is built on top of PyTorch and TensorFlow and is designed to be easy to use and to integrate into existing applications;
- Allows developers to fine-tune pre-trained models on their own data for specific tasks and domains, which can help improve model performance;
- Offers a hosting platform called Hugging Face Spaces, which lets developers deploy and share machine learning demos and apps without managing their own infrastructure.

## Installation
Example installation on macOS. The process is similar on Windows: create a dedicated environment, install PyTorch, then install the Hugging Face libraries.
```
conda create -n huggingface python=3.9
conda activate huggingface
conda install pytorch torchvision torchaudio -c pytorch  #check pytorch installation for your system
pip install transformers datasets evaluate
pip install ipykernel
pip install scikit-learn
```
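After installing, it can help to confirm which versions ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`. The helper name `installed_versions` is our own, and we assume PyTorch is registered under the distribution name `torch`; a package that cannot be found is reported as `None` instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names matching the install commands above
# (assumption: conda registers PyTorch as "torch").
REQUIRED = ["torch", "transformers", "datasets", "evaluate", "scikit-learn"]

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Run it inside the activated `huggingface` environment; any line reading `NOT INSTALLED` points at a step worth repeating.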

In [2]:
from transformers import pipeline



# Pipeline
The `pipeline()` function is the simplest way to run inference with pretrained models. It supports many tasks across modalities, including:


| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |


In [3]:
# The simplest way to use a pipeline for inference: pass only a task identifier.
classifier = pipeline('sentiment-analysis')

classifier('My heart is in the work')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9997960925102234}]

In [4]:
# the pipeline also accepts a list of inputs
texts = ['My heart is in the work', 'My heart is in the holiday']
results = classifier(texts)
for result in results:
    print(result['label'], result['score'])

POSITIVE 0.9997960925102234
POSITIVE 0.9993996620178223
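The classifier returns one dictionary per input, in the same order as the inputs, so results can be paired back with their texts. The sketch below shows one way to do that and to keep only confident predictions. The `results` list is copied from the output above rather than produced by a live model, and the helper `confident_predictions` and its `threshold` parameter are our own illustration, not part of `transformers`.

```python
# Copied from the output above; in practice this comes from classifier(texts).
texts = ["My heart is in the work", "My heart is in the holiday"]
results = [
    {"label": "POSITIVE", "score": 0.9997960925102234},
    {"label": "POSITIVE", "score": 0.9993996620178223},
]


def confident_predictions(texts, results, threshold=0.9):
    """Pair each text with its prediction and keep those scoring >= threshold."""
    return [
        (text, res["label"], res["score"])
        for text, res in zip(texts, results)
        if res["score"] >= threshold
    ]


for text, label, score in confident_predictions(texts, results):
    print(f"{label:>8}  {score:.4f}  {text}")
```

A threshold like this is a simple way to route low-confidence inputs to a fallback, such as manual review.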


In [5]:
# use a customized model and tokenizer in the pipeline function
from transformers import AutoTokenizer, AutoModelForSequenceClassification


#create the model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

#create a pipeline with customized model and tokenizer
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [6]:
french_texts = ['Mon coeur est dans le travail', 'Mon coeur est dans les vacances']
results = classifier(french_texts)
for result in results:
    print(result['label'], result['score'])

5 stars 0.40999341011047363
5 stars 0.4059724509716034
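This model labels each text with a star rating instead of POSITIVE/NEGATIVE. The score of about 0.41 is the share of probability given to the top class out of five, so it is a much less decisive prediction than the 0.99 scores seen earlier. If a numeric rating is more convenient downstream, the label can be parsed. This is a minimal sketch that assumes labels follow the `N star` / `N stars` pattern seen in the output; `stars_from_label` is a hypothetical helper.

```python
def stars_from_label(label):
    """Extract the integer rating from a label such as '5 stars' or '1 star'."""
    count, _, unit = label.partition(" ")
    if not count.isdigit() or unit not in ("star", "stars"):
        raise ValueError(f"unexpected label format: {label!r}")
    return int(count)


# Copied from the output above.
results = [
    {"label": "5 stars", "score": 0.40999341011047363},
    {"label": "5 stars", "score": 0.4059724509716034},
]

for res in results:
    print(stars_from_label(res["label"]), round(res["score"], 3))
```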


# AutoClass
An [AutoClass](https://huggingface.co/docs/transformers/main/en/./model_doc/auto) is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate `AutoClass` for your task and its associated preprocessing class.

## AutoTokenizer
A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model. There are multiple rules that govern the tokenization process, including how to split a word and at what level words should be split (learn more about tokenization in the [tokenizer summary](https://huggingface.co/docs/transformers/main/en/./tokenizer_summary)). The most important thing to remember is you need to instantiate a tokenizer with the same model name to ensure you're using the same tokenization rules a model was pretrained with.

Load a tokenizer with [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer):

In [7]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [8]:
# convert text into tokens
encoding = tokenizer('My heart is in the work. Walk to sky')
print(encoding)

{'input_ids': [101, 11153, 13645, 10127, 10104, 10103, 11497, 119, 19864, 10114, 13724, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The tokenizer returns a dictionary containing:

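* [input_ids](https://huggingface.co/docs/transformers/main/en/./glossary#input-ids): numerical representations of your tokens.
* [token_type_ids](https://huggingface.co/docs/transformers/main/en/./glossary#token-type-ids): identifies which sequence each token belongs to when a pair of sequences is encoded; all zeros for a single sequence.
* [attention_mask](https://huggingface.co/docs/transformers/main/en/./glossary#attention-mask): indicates which tokens should be attended to (1) and which are padding to be ignored (0).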

A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:

In [9]:
# Tokenize a batch, padding or truncating every sequence to exactly max_length tokens

pt_batch = tokenizer(
    ['My heart is in the work.', 'Walk to sky'],
    padding='max_length',
    truncation=True,
    max_length=10,
    return_tensors='pt',
)
print(pt_batch.input_ids.shape)

torch.Size([2, 10])
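The shape `[2, 10]` means both sequences were brought to exactly 10 token IDs. To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a pure-Python sketch of the idea. It is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is assumed to be the padding ID, and on truncation it simply keeps the closing special token.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Simplification: when truncating, the last ID (the closing special
    token) is kept and the tokens just before it are dropped.
    """
    ids = list(input_ids)
    if len(ids) > max_length:
        ids = ids[: max_length - 1] + ids[-1:]
    mask = [1] * len(ids)          # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


# The 12 IDs printed by the earlier tokenizer call.
long_ids = [101, 11153, 13645, 10127, 10104, 10103, 11497, 119, 19864, 10114, 13724, 102]
# A shorter sequence, for illustration only.
short_ids = [101, 19864, 10114, 13724, 102]

for ids in (long_ids, short_ids):
    padded, mask = pad_or_truncate(ids, max_length=10)
    print(len(padded), padded, mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed along with `input_ids`.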


## AutoModel

`AutoModel` provides a simple and unified way to load pretrained instances. The only caveat is selecting the correct [AutoModel](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModel) for the task. For text (or sequence) classification, you should load [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification):

In [10]:
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [11]:
# The model returns raw, unnormalized scores in the `logits` attribute.
pt_outputs = pt_model(**pt_batch)
print(pt_outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-2.0182, -1.6824,  0.0462,  1.3213,  1.7566],
        [-0.6048, -0.7314, -0.1605,  0.3970,  0.6901]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [12]:
"""
All Hugging Face Transformers models output the tensors *before* the final activation
function (like softmax) because the final activation function is often fused with the loss.
"""
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)

tensor([[0.0122, 0.0170, 0.0960, 0.3436, 0.5311],
        [0.1019, 0.0898, 0.1589, 0.2775, 0.3720]], grad_fn=<SoftmaxBackward0>)
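To see what the softmax step does, the same calculation can be reproduced without PyTorch. The sketch below applies a numerically stable softmax to the logits printed earlier. Because those logits are rounded to four decimals in the printout, the results should agree with the tensor above only to about three decimal places.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                      # subtracting the max avoids overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (rounded as printed).
logits = [
    [-2.0182, -1.6824, 0.0462, 1.3213, 1.7566],
    [-0.6048, -0.7314, -0.1605, 0.3970, 0.6901],
]

for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print([round(p, 4) for p in probs], "-> class index", best)
```

Each row sums to 1, and the highest probability falls on class index 4 for both inputs.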


Getting to know the key concepts of `pipeline`, `AutoTokenizer` and `AutoModel` prepares you to do inference with many pretrained models. For more, see these [examples](https://colab.research.google.com/github/huggingface/education-toolkit/blob/main/03_getting-started-with-transformers.ipynb).

# Fine-tuning a pretrained model

## Prepare a dataset

In [13]:
from datasets import load_dataset

dataset = load_dataset('yelp_review_full')
dataset['train'][100]

Found cached dataset yelp_review_full (/home/jack/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|██████████| 2/2 [00:00<00:00, 308.62it/s]


{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [14]:
train_subset = dataset['train'].select(range(500))
test_subset = dataset['test'].select(range(50))

In [15]:
from transformers import AutoTokenizer
# tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [16]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)


small_train_dataset = train_subset.map(tokenize_function)
small_eval_dataset = test_subset.map(tokenize_function)

Loading cached processed dataset at /home/jack/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-15419acbd7912a29.arrow
Loading cached processed dataset at /home/jack/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-a2d2435d50605ccd.arrow


## Define the model

You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.
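For a sense of scale, the randomly initialized part is small. Assuming the classification head is a single linear layer from the encoder's hidden size to the number of labels (as in `BertForSequenceClassification`), its parameter count is simple arithmetic. The helper name below is our own.

```python
def linear_head_parameters(hidden_size, num_labels):
    """Parameters in a dense layer mapping hidden_size features to num_labels logits."""
    weights = hidden_size * num_labels
    biases = num_labels
    return weights + biases


# 768 is the hidden size of BERT-base; 5 matches the five Yelp star ratings.
print(linear_head_parameters(hidden_size=768, num_labels=5))  # 3845
```

Those few thousand values start from random noise, while the rest of the network starts from pretrained weights.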

In [17]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

## Train with Hugging Face Trainer

In [18]:
import os

# Restrict training to the first GPU; set this before CUDA is initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import evaluate
import numpy as np
import torch
from transformers import TrainingArguments, Trainer

Next, create a [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training [hyperparameters](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments), but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from your training. If you'd like to monitor your evaluation metrics during fine-tuning, specify the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch:

In [19]:
training_args = TrainingArguments(output_dir='test_trainer', evaluation_strategy="epoch")

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) does not automatically evaluate model performance during training. You'll need to pass [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) a function to compute and report metrics. The [🤗 Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) function you can load with the [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) function:

In [20]:
metric = evaluate.load("accuracy")

Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits):

In [21]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
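To make the two steps inside `compute_metrics` explicit, here is a dependency-free sketch of the same logic: take the argmax of each row of logits, then compare with the reference labels. The first two rows of logits are taken from the model output shown earlier; the third row and all the labels are made up for illustration, and `accuracy_from_logits` is a hypothetical stand-in for the `np.argmax` plus `metric.compute` combination.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Convert logits to class predictions, then score them against labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [
    [-2.0182, -1.6824, 0.0462, 1.3213, 1.7566],  # predicts class 4
    [-0.6048, -0.7314, -0.1605, 0.3970, 0.6901],  # predicts class 4
    [1.2, 0.3, -0.5, -1.0, -2.0],                 # hypothetical row, predicts class 0
]
labels = [4, 3, 0]  # hypothetical reference labels

print(accuracy_from_logits(logits, labels))
```

Two of the three predictions match their labels here, so the reported accuracy is two thirds.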

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)

In [23]:
trainer.train()





TrainOutput(global_step=189, training_loss=1.1642992938006367, metrics={'train_runtime': 37.6157, 'train_samples_per_second': 39.877, 'train_steps_per_second': 5.024, 'total_flos': 394677213696000.0, 'train_loss': 1.1642992938006367, 'epoch': 3.0})
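The `global_step=189` in this output can be checked by hand. The sketch below assumes the settings that appear to apply here: 500 training examples, a per-device batch size of 8 and 3 epochs (to our knowledge the `TrainingArguments` defaults), a single device, and no gradient accumulation. The function name is our own.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=500, batch_size=8, num_epochs=3)
print(steps)  # 189

# Cross-check the reported throughput: examples seen divided by runtime in seconds.
train_runtime = 37.6157  # from the TrainOutput above
print(round(500 * 3 / train_runtime, 3))
```

The second figure should land close to the reported `train_samples_per_second`, which is a quick way to confirm that the run covered the whole subset in every epoch.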

## Save a model

Once your model is fine-tuned, you can save it with its tokenizer using [PreTrainedModel.save_pretrained()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.save_pretrained):

In [24]:
pt_save_directory = 'pt_saved_pretrained'
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

When you are ready to use the model again, reload it with [PreTrainedModel.from_pretrained()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained):

In [25]:
model = AutoModelForSequenceClassification.from_pretrained(pt_save_directory, num_labels=5)

In [26]:
isinstance(model, torch.nn.Module)

True

# Reference

- [Quick tour](https://huggingface.co/docs/transformers/main/en/quicktour)
- [Installation](https://huggingface.co/docs/transformers/main/en/installation)
- [Pipeline](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial)
- [AutoClass](https://huggingface.co/docs/transformers/main/en/autoclass_tutorial)
- [Fine-tuning](https://huggingface.co/docs/transformers/main/en/training)
- [Education toolkit](https://github.com/huggingface/education-toolkit)