In [5]:
import transformers
import pandas as pd
import tensorflow as tf
transformers.logging.set_verbosity_error()

<h1 style="text-align:center;">A Whirlwind Tour of the 🤗 Hugging Face Ecosystem</h1>

<br><br><br><br>

<h3 style="text-align:center;"><b>Christopher Akiki</b></h3>

<center><img src="images/chapter01_hf-ecosystem.png" width=800></center>

<center><img src="images/chapter02_hf-libraries.png" width=1800></center>

<h1 style="text-align:center;">🤗 Pipelines</h1>

<br><br>

In [8]:
from transformers import pipeline
from transformers.pipelines import get_supported_tasks

In [7]:
print(get_supported_tasks())

['audio-classification', 'automatic-speech-recognition', 'conversational', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'zero-shot-classification', 'zero-shot-image-classification']



<center><img src="images/gewandhaus_review.png" width=900></center>

In [3]:
text = """One of the best orchestra in the world. I came to Leipzig\
            mainly to have one experience with Gewanhaus Leipzig Orchestra. 
            Under the baton of Maestro Andris Nelsons, Bruckner symphony #8 was so affection. 
            The acustic and layout of the concert hall is nice."""

# Sentiment Analysis

In [2]:
p = pipeline("text-classification", model='distilbert-base-uncased-finetuned-sst-2-english')

In [19]:
outputs = p(text)
outputs[0]

{'label': 'POSITIVE', 'score': 0.9998534917831421}
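
The `score` above is a probability. As a rough sketch of where such a number comes from, a classification head emits one raw score (logit) per label, and a softmax turns those into probabilities. The logits and the `id2label` mapping below are made-up assumptions for illustration, not values taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: index 0 = NEGATIVE, index 1 = POSITIVE
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-4.2, 4.6]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely label
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

The pipeline returns the same kind of dictionary, one per input text.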

# Named-Entity Recognition

In [3]:
p = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")

In [20]:
outputs = p(text)
pd.DataFrame(outputs)

index,entity_group,score,word,start,end
0,LOC,0.999257,Leipzig,50,57
1,ORG,0.990783,Gewanhaus Leipzig Orchestra,104,131
2,PER,0.996171,Andris Nelsons,173,187
3,MISC,0.56472,B,189,190
4,ORG,0.268703,##ck,192,194
5,MISC,0.364942,##ner,194,197
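
Two details of this table are worth unpacking: words starting with `##` are WordPiece continuation pieces, and `start`/`end` are character offsets into the input string. The sketch below illustrates both in plain Python. The list of pieces is an illustrative assumption (not actual model output), and `merge_wordpieces` is a hypothetical helper, not a library function.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens; a leading '##' marks a continuation of the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # glue the continuation onto the previous word
        else:
            words.append(piece)
    return words

# Illustrative pieces only
pieces = ["B", "##ru", "##ck", "##ner", "symphony"]
merged = merge_wordpieces(pieces)
print(merged)

# start/end index into the original string (end is exclusive)
sentence = "One of the best orchestra in the world. I came to Leipzig"
entity = {"entity_group": "LOC", "start": 50, "end": 57}
span = sentence[entity["start"]:entity["end"]]
print(span)
```

Slicing the input with the reported offsets is a convenient way to recover the original surface form of an entity, including its casing.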


# Question Answering

In [4]:
p = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

In [21]:
question = "Why did I visit Leipzig?"
outputs = p(question=question, context=text)
outputs

{'score': 0.5873121023178101,
 'start': 76,
 'end': 131,
 'answer': 'to have one experience with Gewanhaus Leipzig Orchestra'}

In [22]:
question = "What music did the orchestra play?"
outputs = p(question=question, context=text)
outputs

{'score': 0.1337369829416275,
 'start': 189,
 'end': 209,
 'answer': 'Bruckner symphony #8'}
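
Extractive question answering models score every token as a possible answer start and as a possible answer end. A minimal sketch of how a span could then be chosen, assuming the span score is the product of the two probabilities and that answers are capped at a maximum length. All numbers and the `best_span` helper are hypothetical, and the real pipeline handles more cases (subword offsets, long contexts).

```python
def best_span(start_probs, end_probs, max_answer_len=15):
    """Return (start, end, score) for the best span; end is inclusive and >= start."""
    best = None
    for i, p_start in enumerate(start_probs):
        last = min(i + max_answer_len, len(end_probs))
        for j in range(i, last):
            score = p_start * end_probs[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical per-token probabilities
tokens = ["I", "came", "to", "hear", "the", "orchestra"]
start_probs = [0.05, 0.05, 0.60, 0.20, 0.05, 0.05]
end_probs = [0.02, 0.03, 0.05, 0.10, 0.10, 0.70]

start, end, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
print(answer, round(score, 2))
```

This also suggests why the scores above can be modest even when the answer looks right: the probability mass is spread over many candidate spans.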

# Translation

In [None]:
p = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

In [18]:
outputs = p(text, clean_up_tokenization_spaces=True)
print(outputs[0]['translation_text'])

Ich bin vor allem nach Leipzig gekommen, um eine Erfahrung mit dem Gewanhaus Leipzig Orchestra zu machen. Unter der Leitung von Maestro Andris Nelsons war die Bruckner Sinfonie #8 so liebevoll, dass die Akustik und Gestaltung des Konzertsaales schön ist.


<h1 style="text-align:center;">🤗 Tokenizers</h1>

<center><img src="images/tokenization_pipeline.svg" width=1200></center>

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('gutenberg')
nltk.download('punkt')

In [None]:
print(nltk.corpus.gutenberg.fileids())

In [None]:
moby_dick_raw = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
moby_dick_sentences = sent_tokenize(moby_dick_raw, language='english')

In [None]:
len(moby_dick_sentences)

In [None]:
from tokenizers import Tokenizer, normalizers, pre_tokenizers, processors
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

In [None]:
unk_token = "[UNK]"
pad_token = "[PAD]"
cls_token = "[CLS]" 
sep_token = "[SEP]"
mask_token = "[MASK]"
special_tokens = [unk_token, pad_token, cls_token, sep_token, mask_token]
vocab_size = 20_000

In [None]:
custom_tokenizer = Tokenizer(WordPiece(unk_token=unk_token))

In [None]:
custom_normalizer = normalizers.Sequence(
            [normalizers.NFKD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
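
To make the effect of this normalizer chain concrete, here is an approximate standard-library equivalent: decompose with NFKD, lowercase, then drop combining marks. It is a sketch for intuition only; `normalize` is a hypothetical helper, and the library's own handling of edge cases may differ.

```python
import unicodedata

def normalize(text):
    """Approximate NFKD -> Lowercase -> StripAccents with the standard library."""
    decomposed = unicodedata.normalize("NFKD", text)  # e.g. 'é' becomes 'e' + combining accent
    lowered = decomposed.lower()
    # Drop nonspacing combining marks (Unicode category Mn), i.e. the accents
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

print(normalize("H\u00e9llo W\u00f6rld"))
```

The order matters: accents can only be stripped as separate characters once NFKD has split them off from their base letters.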

In [None]:
custom_pre_tokenizer = pre_tokenizers.Sequence(
            [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
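
The pre-tokenizer sequence first splits on whitespace and then isolates punctuation marks. A rough plain-Python approximation, assuming ASCII punctuation only (the library also covers Unicode punctuation); `pre_tokenize` is a hypothetical helper.

```python
import string

def pre_tokenize(text):
    """Split on whitespace, then make every ASCII punctuation mark its own piece."""
    pieces = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if ch in string.punctuation:
                if current:
                    pieces.append(current)
                    current = ""
                pieces.append(ch)  # punctuation is kept, but isolated
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces

print(pre_tokenize("Call me Ishmael, please!"))
```

The WordPiece model only ever sees these pieces, so no learned token can span a whitespace or punctuation boundary.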

In [None]:
custom_trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=special_tokens, show_progress=False)

In [None]:
custom_tokenizer.normalizer = custom_normalizer
custom_tokenizer.pre_tokenizer = custom_pre_tokenizer

In [None]:
custom_tokenizer.train_from_iterator(moby_dick_sentences, trainer=custom_trainer)

In [None]:
custom_tokenizer.get_vocab_size()

In [None]:
encoding = custom_tokenizer.encode("Let us test this tokenizer")
print(encoding.tokens)
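
For intuition about the tokens printed above, WordPiece encoding can be sketched as a greedy longest-match-first lookup: take the longest vocabulary entry that prefixes the remaining characters, mark non-initial pieces with `##`, and fall back to the unknown token if nothing matches. The toy vocabulary and the `wordpiece` function below are assumptions for illustration; the trained vocabulary will differ.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary, not the one trained above
toy_vocab = {"let", "us", "test", "this", "token", "##izer", "##s"}
print(wordpiece("tokenizer", toy_vocab))
print(wordpiece("whale", toy_vocab))
```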

In [None]:
cls_token_id = custom_tokenizer.token_to_id(cls_token)
sep_token_id = custom_tokenizer.token_to_id(sep_token)

custom_post_processor = processors.TemplateProcessing(
    single=f"{cls_token}:0 $A:0 {sep_token}:0",
    pair=f"{cls_token}:0 $A:0 {sep_token}:0 $B:1 {sep_token}:1",
    special_tokens=[(cls_token, cls_token_id), (sep_token, sep_token_id)],
)

custom_tokenizer.post_processor = custom_post_processor

In [None]:
encoding = custom_tokenizer.encode("Let us test this tokenizer")
print(encoding.tokens)

In [None]:
encoding = custom_tokenizer.encode("This is the first sentence", "This is sentence number 2")
print(encoding.tokens)
print(encoding.ids)
print(encoding.type_ids)
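
The `single` and `pair` templates can be read as a recipe for wrapping sequences in special tokens and assigning type ids. A plain-Python sketch of that recipe, with `apply_template` as a hypothetical helper and hand-written tokens instead of real tokenizer output:

```python
def apply_template(tokens_a, tokens_b=None, cls="[CLS]", sep="[SEP]"):
    """Mimic '[CLS] $A [SEP]' and '[CLS] $A [SEP] $B [SEP]' with type ids 0 and 1."""
    tokens = [cls] + tokens_a + [sep]
    type_ids = [0] * len(tokens)  # first segment, including its special tokens
    if tokens_b is not None:
        tokens += tokens_b + [sep]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment and its closing [SEP]
    return tokens, type_ids

tokens, type_ids = apply_template(["first", "sentence"], ["second", "one"])
print(tokens)
print(type_ids)
```

The type ids are what lets a BERT-style model tell the two sentences of a pair apart.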

# Using our custom tokenizer with 🤗 Transformers

In [None]:
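from transformers import PreTrainedTokenizerFast

# Wrap the trained tokenizer so that 🤗 Transformers can use it directly
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=custom_tokenizer, unk_token=unk_token, pad_token=pad_token,
    cls_token=cls_token, sep_token=sep_token, mask_token=mask_token)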


<h1 style="text-align:center;">🤗 Datasets</h1>

<table>
<thead>
<tr><th align="center">Data format</th> <th align="center">Loading script</th> <th align="center">Example</th></tr>
</thead>
<tbody>
<tr><td align="center">CSV &amp; TSV</td> <td align="center"><code>csv</code></td> <td align="center"><code>load_dataset("csv", data_files="my_file.csv")</code></td></tr>
<tr><td align="center">Text files</td> <td align="center"><code>text</code></td> <td align="center"><code>load_dataset("text", data_files="my_file.txt")</code></td></tr>
<tr><td align="center">JSON &amp; JSON Lines</td> <td align="center"><code>json</code></td> <td align="center"><code>load_dataset("json", data_files="my_file.jsonl")</code></td></tr>
<tr><td align="center">Pickled DataFrames</td> <td align="center"><code>pandas</code></td> <td align="center"><code>load_dataset("pandas", data_files="my_dataframe.pkl")</code></td></tr>
</tbody>
</table>

<h1 style="text-align:center;">🤗 Transformers</h1>

<h1 style="text-align:center;">Case-study: 📜 Scientific Paper Retrieval</h1>

<h1 style="text-align:center;">(Re)sources</h1>

- https://github.com/nlp-with-transformers/notebooks

- https://github.com/huggingface/course


<center><a href="https://www.oreilly.com/library/view/natural-language-processing/9781098103231/"><img src="images/book_cover.png" width=500></a></center>