## Hugging Face

Transfer learning has made it much easier to adapt models to new tasks, but training a model from scratch is still time-consuming and resource-intensive. This is where Hugging Face comes in. Hugging Face is a company that hosts a hub of pre-trained models, which can be used as is or fine-tuned for a specific task, along with datasets for training and evaluating models. With the rise of LLMs and generative models, it has become a popular choice for researchers and practitioners alike. Its ease of use and large community make Hugging Face an essential tool for anyone working with LLMs.

The Hugging Face stack consists of multiple libraries. Those most relevant to LLMs include `transformers`, `PEFT`, `datasets`, and `accelerate`. We will use some of these libraries in this hands-on session.

### Using open datasets and pre-trained models

Hugging Face hosts models and datasets for a wide range of tasks. To get started, head over to the Hugging Face website and browse the models and datasets available for each task. <br /><br />
<img src="assets/hf_tasks.png" width="700" />

Clicking on a task takes you to a page with a list of models and datasets for that task. For example, the page for the task of text classification is shown below. You can also find helpful documentation, tutorials and videos on the same page. <br /><br />
<img src="assets/hf_tasks_text_classif.png" width="700" />


Loading a dataset is easy with the `datasets` library. It provides a unified API for loading datasets from Hugging Face and other sources, including your own data, and offers useful features such as automatic caching, shuffling, and batching.

In [20]:
from datasets import load_dataset  # For loading datasets
from transformers import pipeline  # For running inference with a pretrained model
import textwrap  # For wrapping text nicely

In [21]:
xsum_dataset = load_dataset(  # Load the XSum dataset: A set of BBC articles and summaries.
    "xsum", version="1.2.0"  # You can use different datasets by changing the name and version (check the Hugging Face docs)
)
xsum_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [22]:
sample = xsum_dataset['train'][0]
print(textwrap.fill(sample['document'][:400] + "...", width=80))  # Print the first 400 characters of the article.
print()
print(textwrap.fill(sample['summary'], width=80))  # Print the summary.

The full cost of damage in Newton Stewart, one of the areas worst affected, is
still being assessed. Repair work is ongoing in Hawick and many roads in
Peeblesshire remain badly affected by standing water. Trains on the west coast
mainline face disruption due to damage at the Lamington Viaduct. Many businesses
and householders were affected by flooding in Newton Stewart after the River
Cree overfl...

Clean-up operations are continuing across the Scottish Borders and Dumfries and
Galloway after flooding caused by Storm Frank.


Loading and using a model for inference.

We will use the `t5-small` model. The arguments used in the code below are:
- `task`: The task for which the model was trained. A list of tasks can be found [here](https://huggingface.co/tasks).
- `model`: The model to be used. In this case, it is t5-small, which is a good starter LLM with only 60 million parameters. A list of models available for summarization can be found [here](https://huggingface.co/models?pipeline_tag=summarization). You can also choose a different task from the sidebar and see the models available for it.
- `min_length`, `max_length`: The minimum and maximum length of the output sequence, in tokens. Generation stops once the output reaches `max_length`, and the model is prevented from ending the sequence before it reaches `min_length`. A token is usually a word, a piece of a word, or a punctuation mark.
- `truncation`: Most LLMs have a maximum input length. If the input sequence is longer than the maximum length, it is truncated. Setting this argument to `True` ensures that the input sequence is truncated to the maximum length.
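
To make the length arguments more concrete, here is a minimal toy sketch of how they constrain input and output. It is not how `transformers` implements generation: the `EOS` marker, `truncate_input`, and `toy_generate` are hypothetical names, and "skipping" an early end-of-sequence token only approximates what a real decoder does when it blocks that token.

```python
EOS = "</s>"  # Hypothetical end-of-sequence marker for this toy example


def truncate_input(tokens, model_max_length):
    """Mimic `truncation=True`: drop input tokens beyond the model's limit."""
    return tokens[:model_max_length]


def toy_generate(candidate_stream, min_length, max_length):
    """Consume proposed tokens the way length-constrained decoding would.

    - An EOS proposed before `min_length` tokens exist is skipped.
    - Generation stops at the first allowed EOS, or once `max_length` is reached.
    """
    output = []
    for token in candidate_stream:
        if len(output) >= max_length:
            break  # Hard stop: nothing is padded, the output just ends here
        if token == EOS:
            if len(output) < min_length:
                continue  # Too short to stop: ignore EOS and keep generating
            break
        output.append(token)
    return output


proposed = ["floods", "hit", EOS, "the", "borders", EOS, "again"]
print(toy_generate(proposed, min_length=3, max_length=10))  # ['floods', 'hit', 'the', 'borders']
print(toy_generate(proposed, min_length=0, max_length=10))  # ['floods', 'hit']
print(toy_generate(proposed, min_length=0, max_length=1))   # ['floods']
print(truncate_input(list("abcdef"), 4))                    # ['a', 'b', 'c', 'd']
```

With `min_length=3` the first end-of-sequence marker arrives too early and is ignored, so the output runs on to the next one; with `max_length=1` the output is simply cut short.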

In [40]:
# A summarization pipeline
summarizer = pipeline(
    task="summarization",  # The task we want to perform
    model="t5-small",  # The model checkpoint we want to use (t5-small is a small model, ~60M parameters)
    min_length=30,  # The minimum length of the summary in # of tokens
    max_length=100,  # The maximum length of the summary in # of tokens
    truncation=True,  # Truncate inputs that exceed the model's maximum input length
)

# A translation pipeline
translator = pipeline(
    task='translation_en_to_de',   # Follows the format: translation_{source language}_to_{target language}
    model='t5-small',
    min_length=30,
    max_length=100,
    truncation=True,
)

In [36]:
# Summarize
print(textwrap.fill(summarizer(sample['document'])[0]['summary_text'], width=80))   # Summarize the article using the summarizer pre-trained model.
print()

# Original summary
print(textwrap.fill(sample['summary'], width=80))  # Print the summary.

the full cost of damage in Newton Stewart is still being assessed . many roads
in peeblesshire remain badly affected by standing water . the water breached a
retaining wall, flooding many commercial properties .

Clean-up operations are continuing across the Scottish Borders and Dumfries and
Galloway after flooding caused by Storm Frank.


In [41]:
# Translate the summary to German
print(textwrap.fill(translator(sample['summary'])[0]['translation_text'], width=80))
print()
# Original English summary
print(textwrap.fill(sample['summary'], width=80))

Nach Überschwemmungen durch Sturm Frank laufen die Säuberungsmaßnahmen über die
schottischen Grenzen und Dumfries und Galloway weiter.

Clean-up operations are continuing across the Scottish Borders and Dumfries and
Galloway after flooding caused by Storm Frank.


### Zero-shot classification
Zero-shot learning means using a model for a task it was not explicitly trained on. It is typically used for classification when the categories are not known at training time. This ability seems to be an emergent property of large language models (100M+ parameters, according to Hugging Face). Some models achieve this simply by being prompted to classify a given text into one of a list of categories, while other, more advanced approaches use Natural Language Inference (NLI) models.

NLI models are trained on (premise, hypothesis) pairs to predict whether the premise entails the hypothesis. For zero-shot classification, the text to classify serves as the premise and each candidate label is turned into a hypothesis. This approach has been observed to produce good results (for more details, see [this blog post](https://joeddav.github.io/blog/2020/05/29/ZSL.html), [paper 1](https://arxiv.org/abs/2005.14165), [paper 2](https://arxiv.org/abs/1909.00161)). `facebook/bart-large-mnli` is one such NLI model available on Hugging Face.
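
The NLI-based approach can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `toy_entailment_logit` is a made-up word-overlap heuristic standing in for a real NLI model, and the hypothesis template is chosen for this example. The part worth noting is the structure: one hypothesis per label, one entailment score per hypothesis, and a softmax across labels.

```python
import math


def toy_entailment_logit(premise, hypothesis):
    """Stand-in for an NLI model: counts words shared by premise and hypothesis.

    A real NLI model returns learned entailment logits; this heuristic only
    makes the example runnable without downloading a model.
    """
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = set(hypothesis.lower().replace(".", "").split())
    return float(len(premise_words & hypothesis_words))


def zero_shot_classify(text, candidate_labels, template="This example is about {}."):
    # One hypothesis per candidate label
    hypotheses = [template.format(label) for label in candidate_labels]
    logits = [toy_entailment_logit(text, h) for h in hypotheses]
    # Softmax over the entailment logits so the label scores sum to 1
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": text,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


result = zero_shot_classify(
    "The match ended after extra time and the football fans cheered.",
    ["football", "chemistry", "cooking"],
)
print(result["labels"][0])  # football
```

Because the scores are normalised across the candidate labels, adding or removing a label changes every score, which is worth keeping in mind when comparing runs with different label sets.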

In [56]:
pipe = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")  # Name the task explicitly; no generation arguments are needed



In [59]:
# Zero-shot classification example
pipe(
    "This is a short demo session on zero-shot text classification with Hugging Face, conducted by Suyog.",
    candidate_labels=["computer science", "sports", "physics", "biology", "chemistry", "mathematics"],
)

{'sequence': 'This is a short demo session on zero-shot text classification with Hugging Face, conducted by Suyog.',
 'labels': ['computer science',
  'mathematics',
  'sports',
  'physics',
  'chemistry',
  'biology'],
 'scores': [0.4396549165248871,
  0.13937702775001526,
  0.12538164854049683,
  0.11936939507722855,
  0.10196133702993393,
  0.07425558567047119]}

### Few-shot classification
Few-shot learning is related to zero-shot learning but slightly different: the model is still used for a task it was not explicitly trained on, but a few worked examples of the task are included in the prompt as a primer. This lets models perform well on tasks beyond classification. You will typically find models suited to few-shot learning in the `text-generation` category on Hugging Face.

In [73]:
few_shot_pipeline = pipeline(task='text-generation', model="EleutherAI/gpt-neo-1.3B", max_new_tokens=10)

In [82]:
# Wrong output when no samples are provided
out = few_shot_pipeline("""For this sentence, predict the object of the preposition:
    [sentence]: I am giving a demo to people.
    [object]:"""
)
print(textwrap.fill(out[0]['generated_text'], width=200, replace_whitespace=False))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


For this sentence, predict the object of the preposition:
    [sentence]: I am giving a demo to people.
    [object]: a video I am giving a demo to people.


In [83]:
# Examples are separated by a marker; passing its token ID as `eos_token_id` makes generation stop once the model emits it
eos_token_id = few_shot_pipeline.tokenizer.encode("*****")[0]

# Few-shot example with sample outputs provided
out = few_shot_pipeline(
    """For each sentence, predict the object of the preposition:

    [sentence]: The quick brown fox jumped over the lazy dog.
    [object]: dog
    ****
    [sentence]: The lazy dog jumped over the quick brown fox.
    [object]: fox
    ****
    [sentence]: I like to eat bananas.
    [object]: bananas
    ****
    [sentence]: I am giving a demo to people.
    [object]:""",
    eos_token_id=eos_token_id,
)

print(textwrap.fill(out[0]['generated_text'], width=200, replace_whitespace=False))

Setting `pad_token_id` to `eos_token_id`:35625 for open-end generation.


For each sentence, predict the object of the preposition:

    [sentence]: The quick brown fox jumped over the lazy dog.
    [object]: dog
    ****
    [sentence]: The lazy dog jumped over the quick
brown fox.
    [object]: fox
    ****
    [sentence]: I like to eat bananas.
    [object]: bananas
    ****
    [sentence]: I am giving a demo to people.
    [object]: people
    ****
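
A prompt like the one above does not have to be written by hand. The sketch below shows one possible way to assemble it from a list of solved examples; `build_few_shot_prompt` and its parameters are hypothetical names introduced for illustration, and the layout simply mirrors the `[sentence]` / `[object]` format used in this section.

```python
def build_few_shot_prompt(instruction, examples, query, separator="****"):
    """Assemble an instruction, solved examples, and an unsolved query into one prompt."""
    parts = [instruction, ""]
    for sentence, answer in examples:
        parts.append(f"[sentence]: {sentence}")
        parts.append(f"[object]: {answer}")
        parts.append(separator)
    parts.append(f"[sentence]: {query}")
    parts.append("[object]:")  # Left open for the model to complete
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "For each sentence, predict the object of the preposition:",
    [
        ("The quick brown fox jumped over the lazy dog.", "dog"),
        ("The lazy dog jumped over the quick brown fox.", "fox"),
    ],
    "I am giving a demo to people.",
)
print(prompt)
```

The resulting string can be passed to a text-generation pipeline in place of a hand-written prompt, which makes it easier to vary the number of examples and see how the output changes.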


## References
- [LLMs with Hugging Face - DataBricks Academy | Kaggle](https://www.kaggle.com/code/aliabdin1/llm-01-how-to-use-llms-with-hugging-face?scriptVersionId=140351055)
- [HuggingFace Transformers](https://huggingface.co/transformers/)
- [HuggingFace Datasets](https://huggingface.co/docs/datasets/)