<a target="_blank" href="https://colab.research.google.com/github/avakanski/Fall-2022-Python-Programming-for-Data-Science/blob/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2019%20-%20Transformer%20Networks/Lecture%2019%20-%20Transformer%20Networks.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a name='section0'></a>
# Lecture 20 Language Models with Hugging Face

- [20.1 Introduction to Hugging Face](#section1)
- [20.2 Hugging Face Pipelines](#section2)
- [20.3 Pipelines for NLP Tasks](#section3)
    - [20.3.1 Text Generation](#section3-1)
    - [20.3.2 Question Answering](#section3-2)
    - [20.3.3 Sentiment Analysis](#section3-3)
    - [20.3.4 Text Summarization](#section3-4)
    - [20.3.5 Machine Translation](#section3-5)
    - [20.3.6 Named Entity Recognition](#section3-6)
    - [20.3.7 Zero-shot Classification](#section3-7)
    - [20.3.8 Mask Filling](#section3-8)
- [20.4 Tokenizers](#section4)
- [References](#section10)





<a name='section1'></a>

# 20.1 Introduction to Hugging Face

[***Hugging Face***](https://huggingface.co/) is a platform for machine learning and AI founded in 2016. Its stated aim is to "build, train and deploy state of the art models powered by the reference open source in machine learning". Since its founding, Hugging Face has established itself as a main resource for NLP, providing open-source access to over 20,000 pre-trained models, along with datasets and other tools and resources. Hugging Face focuses on community-building around open-source machine learning tools and data, and it has also developed a [course](https://huggingface.co/course/chapter1/1) on how to use its resources for various tasks. Note that while the core NLP libraries are open access, Hugging Face also offers tiered pricing for access to AutoNLP capabilities and an accelerated inference API.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2020%20-%20Language%20Models%20with%20Hugging%20Face/imges/hf_icon.png' width=400px/>


Hugging Face initially focused on Transformer models and NLP, and has more recently expanded its libraries and tools to cover machine learning models in general and many other tasks. Training large state-of-the-art Transformer models from scratch is very expensive and unaffordable for many organizations, so access to pre-trained models that can be fine-tuned for specific tasks through transfer learning has been significant.

The core Hugging Face libraries include Transformers, Tokenizers, Datasets, and Accelerate. The Accelerate library enables distributed training on hardware accelerators, such as multiple GPUs or cloud TPUs. In addition to these core libraries, Hugging Face provides various community resources, e.g., tools for sharing models and versioning code, and Spaces, which allows users to share apps developed with Hugging Face libraries and to browse apps created by others.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2022-Python-Programming-for-Data-Science/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2020%20-%20Language%20Models%20with%20Hugging%20Face/imges/hf_libraries.png' width=400px/>




<a name='section2'></a>

# 20.2 Hugging Face Pipelines

Hugging Face provides ***Pipelines*** as an easy-to-use API: the `pipeline()` function performs inference for a variety of tasks.

The `pipeline()` function has the following syntax:

```
from transformers import pipeline

# Pipeline to use a default model & tokenizer for a given task
pipeline("<task-name>")

# Pipeline to use an existing model
pipeline("<task-name>", model="<model_name>")

# Pipeline to use a custom model/tokenizer
pipeline("<task-name>", model="<model_name>", tokenizer="<tokenizer_name>")
```

Among the currently available task pipelines are:

- text-generation
- question-answering
- sentiment-analysis
- summarization
- translation
- ner (named entity recognition)
- zero-shot-classification
- fill-mask
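
The warnings printed in the outputs later in this lecture recommend naming the model explicitly instead of relying on the task default. As a minimal sketch, the helper below builds keyword arguments for `pipeline()`. The names `PINNED_MODELS` and `pipeline_kwargs` are hypothetical, and the model names are the defaults reported in this lecture's notebook outputs, which may differ for other library versions.

```python
# Task name -> model name, copied from the "defaulted to ..." messages
# in this lecture's outputs (assumed valid only for that library version).
PINNED_MODELS = {
    "text-generation": "gpt2",
    "question-answering": "distilbert-base-cased-distilled-squad",
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "zero-shot-classification": "facebook/bart-large-mnli",
    "fill-mask": "distilroberta-base",
}


def pipeline_kwargs(task):
    """Return keyword arguments that name the model for a task explicitly."""
    if task not in PINNED_MODELS:
        raise KeyError(f"no pinned model recorded for task {task!r}")
    return {"task": task, "model": PINNED_MODELS[task]}


kwargs = pipeline_kwargs("sentiment-analysis")
print(kwargs)

# With the transformers library installed, the arguments would be used as:
#   from transformers import pipeline
#   classifier = pipeline(**kwargs)
```

Keeping the mapping in one place makes it easier to see which models a notebook depends on.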

<a name='section3'></a>

# 20.3 Pipelines for NLP Tasks

In this section, we will look at examples of how to use the `pipeline("<task-name>")` function for different NLP tasks. As mentioned, if we don't provide names for the model and tokenizer, the pipeline will assign a default model and tokenizer for the task, and it will download the model parameters and other elements required to perform the task.

To use the Transformers library by Hugging Face, we first need to install it, since it is not preinstalled in Google Colab.

In [46]:
!pip install transformers sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


<a name='section3-1'></a>

### 20.3.1 Text Generation

The first example shows how to use the `pipeline("<task-name>")` function to generate text based on a provided prompt. We will use `"text-generation"` as the task name.

When the cell is executed, the pipeline selects a default pretrained model for text generation in English, downloads the model and all other required elements, and creates a generator object.

In [4]:
from transformers import pipeline

generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.




Now let’s provide a prompt and use the generator to continue the text. Note that text generation involves randomness, so the outputs vary between runs and some of them will not be perfect.

In [7]:
outputs_1 = generator("In this course, we will teach you how to")

print(outputs_1[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this course, we will teach you how to solve these problems. These solutions are the backbone of any problem.

One thing is for certain, you shouldn't build a good game studio simply by choosing to build good games. You should build


In [13]:
outputs_2 = generator("Niagara Falls is a city located in")

print(outputs_2[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Niagara Falls is a city located in northeast Ontario. It is situated along the banks of Niagara Lakes and includes 2 lakes as well as a large lakebed. The entire city was considered the pinnacle of architectural excellence from 1927-1931. The city


In [14]:
outputs_3 = generator("Niagara Falls is a famous world attractation")

print(outputs_3[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Niagara Falls is a famous world attractation. Visitors head to the gorge for a unique treat from the Buffalo River, with a waterfall filled with riverbonds covered with ice.

The New York State Parks and Recreation, for example, has
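
As the printed outputs show, the `generated_text` field begins with the prompt itself. The sketch below separates the newly generated continuation from the prompt. The helper name `continuation` is hypothetical, and `sample_outputs` is a shortened copy of the first output above, not a new model run.

```python
def continuation(prompt, outputs):
    """Return the newly generated text, without the prompt that precedes it."""
    generated = outputs[0]["generated_text"]
    # In the outputs above, the generated text starts with the prompt.
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


prompt = "In this course, we will teach you how to"
# Shortened copy of the list-of-dictionaries structure returned by the generator.
sample_outputs = [
    {"generated_text": "In this course, we will teach you how to solve these problems."}
]

print(continuation(prompt, sample_outputs))
```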


<a name='section3-2'></a>

### 20.3.2 Question Answering

This pipeline answers questions using information from a given context. Such a pipeline can be very useful when dealing with long texts, where finding answers manually takes time.

In [15]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.




We can provide inputs to the pipeline as a dictionary with `question` and `context` as keys. The model extracts the answer from the provided context. Note that, unlike the text generation example above, the model does not generate new text to answer the question; the `start` and `end` values are the character positions of the answer within the context.

In [17]:
input_1 = {
    "question": "What didn't cross the street?",
    "context" : "The animal didn't cross the street because it was too tired",
    }

question_answerer(input_1)

{'score': 0.7553666234016418, 'start': 0, 'end': 10, 'answer': 'The animal'}

In [18]:
input_2 = {
    "question": "Why the animal didn't cross the street?",
    "context" : "The animal didn't cross the street because it was too wide",
    }

question_answerer(input_2)

{'score': 0.6076130867004395,
 'start': 43,
 'end': 58,
 'answer': 'it was too wide'}
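
To make the extractive nature of the answers concrete, the sketch below slices each context with the returned `start` and `end` offsets and compares the result with the `answer` field. The dictionaries are copies of the two outputs above, and the helper name `answer_span` is hypothetical.

```python
def answer_span(context, result):
    """Slice the answer out of the context using the returned character offsets."""
    return context[result["start"]:result["end"]]


context_1 = "The animal didn't cross the street because it was too tired"
result_1 = {"score": 0.7553666234016418, "start": 0, "end": 10, "answer": "The animal"}

context_2 = "The animal didn't cross the street because it was too wide"
result_2 = {"score": 0.6076130867004395, "start": 43, "end": 58, "answer": "it was too wide"}

for context, result in [(context_1, result_1), (context_2, result_2)]:
    span = answer_span(context, result)
    # The slice and the reported answer should be the same string.
    print(repr(span), span == result["answer"])
```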

<a name='section3-3'></a>

### 20.3.3 Sentiment Analysis

In [None]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

The pipeline also accepts a list of sentences and returns a label and a score for each sentence.

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
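
The results are returned in the same order as the input sentences, so the two lists can be paired with `zip`. The sketch below filters the sentences labeled as negative above a chosen score threshold. The `results` list is a copy of the output above, while the helper name `confident_negatives` and the threshold of 0.9 are assumptions for illustration.

```python
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Copy of the classifier output shown above.
results = [
    {"label": "POSITIVE", "score": 0.9598049521446228},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def confident_negatives(sentences, results, threshold=0.9):
    """Return sentences labeled NEGATIVE with a score at or above the threshold."""
    return [
        sentence
        for sentence, result in zip(sentences, results)
        if result["label"] == "NEGATIVE" and result["score"] >= threshold
    ]


print(confident_negatives(sentences, results))
```

Note that the score is the model's confidence in the predicted label, so a high score appears for both strongly positive and strongly negative sentences.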

<a name='section3-4'></a>

### 20.3.4 Text Summarization

Text summarization reduces a longer text into a shorter summary.

In [19]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [20]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

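Specifying the `min_length` and `max_length` arguments controls the length of the summary. Both limits are measured in tokens rather than characters or words, so a summary can be cut off mid-sentence when the upper limit is reached, as in the next example.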

In [28]:
summarizer(
    """
    Flooding on the Yangtze river remains serious although water levels on parts of the river decreased
    today, according to the state headquarters of flood control and drought relief .
    """, min_length=8, max_length=20)

[{'summary_text': ' Flooding on the Yangtze river remains serious although water levels on parts of the'}]

In [29]:
summarizer(
    """BAGHDAD -- Archaeologists in northern Iraq last week unearthed 2,700-year-old rock carvings featuring war scenes and trees from the Assyrian Empire, an archaeologist said Wednesday.

    The carvings on marble slabs were discovered by a team of experts in Mosul, Iraq’s second-largest city, who have been working to restore the site of the ancient Mashki Gate, which was bulldozed by Islamic State group militants in 2016.

    Fadhil Mohammed, head of the restoration works, said the team was surprised by discovering “eight murals with inscriptions, decorative drawings and writings.”

    Mashki Gate was one of the largest gates of Nineveh, an ancient Assyrian city of this part of the historic region of Mesopotamia.

    The discovered carvings show, among other things, a fighter preparing to fire an arrow while others show palm trees.
    """)  

[{'summary_text': ' The carvings on marble slabs were discovered by a team of experts in Mosul, Iraq’s second-largest city . They have been working to restore the site of the ancient Mashki Gate, which was bulldozed by Islamic State group militants in 2016 . Mashki gate was one of the largest gates of Nineveh, an ancient Assyrian city of this part of Mesopotamia .'}]
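
The summaries above contain a leading space, runs of spaces, and spaces before periods. A minimal post-processing sketch, assuming this cleanup is wanted for display, is shown below. The helper name `tidy_summary` is hypothetical, and `sample` is a shortened copy of the first summary above.

```python
import re


def tidy_summary(result):
    """Normalize whitespace in the 'summary_text' of a summarizer result."""
    text = result[0]["summary_text"]
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r" ([.,;:!?])", r"\1", text)  # drop the space before punctuation
    return text.strip()


# Shortened copy of the first summarizer output shown above.
sample = [{
    "summary_text": " America has changed dramatically during recent years . "
                    "Rapidly developing economies such as China and India continue "
                    "to encourage and advance the teaching of engineering ."
}]

print(tidy_summary(sample))
```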

<a name='section3-5'></a>

### 20.3.5 Machine Translation

For machine translation, we can specify the source and target languages in the task name, as in the next cell, where the task `"translation_en_to_fr"` translates text from English to French. Although the default pipeline works with several languages, machine translation most often requires specifying the name of the language model to use.

In [32]:
translator = pipeline("translation_en_to_fr")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [33]:
translator("I am a student")

[{'translation_text': 'Je suis un étudiant'}]

In [35]:
translator("Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls.")

[{'translation_text': 'Peyton Manning est devenu le premier quarterback à conduire deux équipes différentes à plusieurs Super Bowls.'}]
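
The task name used above follows the pattern `translation_<source>_to_<target>`. As a rough sketch, the helpers below assemble the task name and, optionally, a model name into keyword arguments for `pipeline()`. The helper names are hypothetical, and the model name `Helsinki-NLP/opus-mt-en-de` is an illustrative assumption that does not appear in this lecture; the Hugging Face Hub should be checked for a model that supports the required language pair.

```python
def translation_task(source, target):
    """Build a task name of the form used above, e.g. 'translation_en_to_fr'."""
    return f"translation_{source}_to_{target}"


def translation_kwargs(source, target, model=None):
    """Keyword arguments for pipeline(); the model is named only if provided."""
    kwargs = {"task": translation_task(source, target)}
    if model is not None:
        kwargs["model"] = model
    return kwargs


# Default model for the language pair, as in the cell above.
print(translation_kwargs("en", "fr"))

# Explicit model for another language pair (illustrative model name).
print(translation_kwargs("en", "de", model="Helsinki-NLP/opus-mt-en-de"))

# With the transformers library installed, the arguments would be used as:
#   from transformers import pipeline
#   translator = pipeline(**translation_kwargs("en", "fr"))
```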

<a name='section3-6'></a>

### 20.3.6 Named Entity Recognition

Named Entity Recognition (NER), also known as named entity tagging, is a task to identify parts of the input that represent entities. Examples of entities are:
- Locations (LOC)
- Organizations (ORG)
- Persons (PER)
- Miscellaneous entities (MISC)

In [36]:
ner = pipeline("ner", grouped_entities=True)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



  "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"


In [37]:
text_1 = "Abraham Lincoln was a president who lived in the United States."

print(ner(text_1))

[{'entity_group': 'PER', 'score': 0.9988935, 'word': 'Abraham Lincoln', 'start': 0, 'end': 15}, {'entity_group': 'LOC', 'score': 0.99965084, 'word': 'United States', 'start': 49, 'end': 62}]


Alternatively, we can use pandas to display the output.

In [39]:
import pandas as pd

pd.DataFrame(ner(text_1))

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.998893 | Abraham Lincoln | 0 | 15 |
| 1 | LOC | 0.999651 | United States | 49 | 62 |


In [41]:
text_2 = """BAGHDAD -- Archaeologists in northern Iraq last week unearthed 2,700-year-old rock carvings featuring war scenes and trees from the Assyrian Empire, an archaeologist said Wednesday.

    The carvings on marble slabs were discovered by a team of experts in Mosul, Iraq’s second-largest city, who have been working to restore the site of the ancient Mashki Gate, which was bulldozed by Islamic State group militants in 2016.

    Fadhil Mohammed, head of the restoration works, said the team was surprised by discovering “eight murals with inscriptions, decorative drawings and writings.”

    Mashki Gate was one of the largest gates of Nineveh, an ancient Assyrian city of this part of the historic region of Mesopotamia.

    The discovered carvings show, among other things, a fighter preparing to fire an arrow while others show palm trees.
    """

pd.DataFrame(ner(text_2))

|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | LOC | 0.434805 | BA | 0 | 2 |
| 1 | LOC | 0.999473 | Iraq | 38 | 42 |
| 2 | MISC | 0.893631 | Assyrian | 132 | 140 |
| 3 | LOC | 0.782092 | Empire | 141 | 147 |
| 4 | LOC | 0.999238 | Mosul | 256 | 261 |
| 5 | LOC | 0.999156 | Iraq | 263 | 267 |
| 6 | LOC | 0.971527 | Mashki Gate | 348 | 359 |
| 7 | ORG | 0.997262 | Islamic State | 384 | 397 |
| 8 | PER | 0.9993 | Fadhil Mohammed | 428 | 443 |
| 9 | LOC | 0.974939 | Mashki Gate | 592 | 603 |
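
The `start` and `end` columns are character offsets into the input text, so they can be used to mark the entities in place. The sketch below does this for the shorter example, using a copy of the output shown earlier for `text_1`. The helper name `annotate` and the bracket notation are assumptions for illustration.

```python
def annotate(text, entities):
    """Wrap each entity span in brackets and append its entity group."""
    # Work from the last entity to the first so that earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        tagged = f"[{text[start:end]}|{ent['entity_group']}]"
        text = text[:start] + tagged + text[end:]
    return text


text_1 = "Abraham Lincoln was a president who lived in the United States."
# Copy of the grouped NER output shown above for text_1.
entities_1 = [
    {"entity_group": "PER", "score": 0.9988935, "word": "Abraham Lincoln", "start": 0, "end": 15},
    {"entity_group": "LOC", "score": 0.99965084, "word": "United States", "start": 49, "end": 62},
]

print(annotate(text_1, entities_1))
```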


<a name='section3-7'></a>

### 20.3.7 Zero-shot Classification

Zero-shot classification is the task of classifying text documents into categories. The term *zero-shot* refers to tasks for which a pretrained model has not been trained, i.e., the model was not trained to classify documents using the provided set of labels.

In [47]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



The pipeline allows us to list candidate labels to be used for the classification. For this example, the highest probability was assigned to the "sports" category. 

In [48]:
classifier(
    "Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls.",
    candidate_labels=["education", "politics", "business", "sports"],
)

{'sequence': 'Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls.',
 'labels': ['sports', 'business', 'education', 'politics'],
 'scores': [0.9866245985031128,
  0.0067298333160579205,
  0.0034621735103428364,
  0.003183376044034958]}
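
The output pairs the `labels` and `scores` lists by position. A small sketch of reading the result, using a copy of the dictionary above, is to zip the two lists into a mapping and pick the label with the highest score; the variable names are assumptions for illustration.

```python
# Copy of the zero-shot classification output shown above.
result = {
    "sequence": "Peyton Manning became the first quarterback ever to lead "
                "two different teams to multiple Super Bowls.",
    "labels": ["sports", "business", "education", "politics"],
    "scores": [
        0.9866245985031128,
        0.0067298333160579205,
        0.0034621735103428364,
        0.003183376044034958,
    ],
}

# Map each candidate label to its score.
label_scores = dict(zip(result["labels"], result["scores"]))

# Select the label with the highest score.
top_label = max(label_scores, key=label_scores.get)
print(top_label, round(label_scores[top_label], 4))

# In this output the scores over the candidate labels sum to approximately 1.
print(round(sum(result["scores"]), 4))
```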

<a name='section3-8'></a>

### 20.3.8 Mask Filling

The pipeline with the `fill-mask` task is used to fill in blanks in an input text. 

In [49]:
mask_filling = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [53]:
mask_filling("Abraham Lincoln was a <mask> who lived in the United States.", top_k=5)

[{'score': 0.33272555470466614,
  'token': 3661,
  'token_str': ' Democrat',
  'sequence': 'Abraham Lincoln was a Democrat who lived in the United States.'},
 {'score': 0.18091009557247162,
  'token': 1172,
  'token_str': ' Republican',
  'sequence': 'Abraham Lincoln was a Republican who lived in the United States.'},
 {'score': 0.03390652686357498,
  'token': 16495,
  'token_str': ' Jew',
  'sequence': 'Abraham Lincoln was a Jew who lived in the United States.'},
 {'score': 0.028415359556674957,
  'token': 24156,
  'token_str': ' Presbyterian',
  'sequence': 'Abraham Lincoln was a Presbyterian who lived in the United States.'},
 {'score': 0.02462908811867237,
  'token': 11593,
  'token_str': ' physician',
  'sequence': 'Abraham Lincoln was a physician who lived in the United States.'}]

In [52]:
mask_filling("Flooding on the Yangtze river remains serious although <mask> levels on parts of the river decreased today.", top_k=2)

[{'score': 0.24890871345996857,
  'token': 514,
  'token_str': ' water',
  'sequence': 'Flooding on the Yangtze river remains serious although water levels on parts of the river decreased today.'},
 {'score': 0.12597322463989258,
  'token': 11747,
  'token_str': ' oxygen',
  'sequence': 'Flooding on the Yangtze river remains serious although oxygen levels on parts of the river decreased today.'}]
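
Each prediction carries the candidate token in `token_str`, with a leading space in these outputs, and its probability in `score`. The sketch below, which uses a shortened copy of the first output above, lists the cleaned candidate words and adds up how much probability the top five predictions cover; the variable names are assumptions for illustration.

```python
# Shortened copy of the fill-mask output shown above for the Abraham Lincoln sentence.
predictions = [
    {"score": 0.33272555470466614, "token_str": " Democrat"},
    {"score": 0.18091009557247162, "token_str": " Republican"},
    {"score": 0.03390652686357498, "token_str": " Jew"},
    {"score": 0.028415359556674957, "token_str": " Presbyterian"},
    {"score": 0.02462908811867237, "token_str": " physician"},
]

# Remove the leading space that the tokens carry in this output.
candidates = [p["token_str"].strip() for p in predictions]

# Total probability assigned to the listed candidates.
covered = sum(p["score"] for p in predictions)

print(candidates)
print(f"top-{len(predictions)} predictions cover {covered:.1%} of the probability")
```

The remaining probability is spread over all other tokens in the model's vocabulary, which is why none of the listed candidates is guaranteed to be factually correct.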

<a name='section4'></a>

# 20.4 Tokenizers

work in progress ....

<a name='section10'></a>

# References

1. Hugging Face Course, available at [https://huggingface.co/course/chapter1/1](https://huggingface.co/course/chapter1/1).
2. Getting Started with Hugging Face Transformers for NLP, Exxact Blog, available at [https://www.exxactcorp.com/blog/Deep-Learning/getting-started-hugging-face-transformers](https://www.exxactcorp.com/blog/Deep-Learning/getting-started-hugging-face-transformers).
3. An Introduction to Using Transformers and Hugging Face, Zoumana Keita, available at [https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face](https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face).
4. Applications of Deep Neural Networks, Course at Washington University in St. Louis, Jeff Heaton, available at [https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_huggingface.ipynb](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_huggingface.ipynb).

[BACK TO TOP](#section0)