<a href="https://colab.research.google.com/github/elise-cakecoding/dotfiles/blob/master/data-first-transformer/Your_first_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to the Transformers Library (Colab Recommended) 🤗


To recap, HuggingFace is an AI company that has blown up in the last few years, especially in the realm of Natural Language Processing (NLP).

In particular, the Transformers library has revolutionized the way people work with large-scale transformer models. The goal of this challenge is to introduce you to these models for the first time and show how easy they can be to work with.

### Why you should love HuggingFace:

#### Pre-trained Models 📚:

One of the best features of the Transformers library is its huge repo of pre-trained models. Whether you're looking to employ BERT, GPT-2, T5, RoBERTa, or any of the other transformer architectures, chances are you'll find a version that suits your needs in their model hub.

#### It's super easy 👍:

The library is designed to be user-friendly. Loading a model and its corresponding tokenizer can be done in just a couple of lines of code. This simplicity extends to fine-tuning as well, allowing you to adapt these powerful models to a wide range of tasks. The `pipeline` API we'll be using lets you go from model selection to getting results in just a few lines.

#### Tokenizer  🔄 and Datasets 📊 Library:

Alongside the Transformers library, HuggingFace also offers the Tokenizers and Datasets libraries. While the first provides efficient and easy-to-use tokenization methods, the second offers a whole bunch of datasets, meaning you have all the tools and data you need in one ecosystem.

#### Community-Driven 🌐:
The HuggingFace community is very active and any community member (you included) can upload their own models and datasets.

__If you are working in Colab__, you'll need to install the appropriate libraries in your Colab environment (you will already have them locally if you followed the setup instructions).

In [1]:
# Install the transformers library from HuggingFace
!pip install transformers torch pytesseract

Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.4-py3-none-any.whl (311 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
# You'll also need some extra tools that some of these models use under the hood
! pip install sentencepiece sacremoses

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sentencepiece, sacremoses
Successfully installed sacremoses-0.1.1 sentencepiece-0.1.99


Over the course of this notebook, you'll be using Pipelines to download and easily use some very powerful models. Bear in mind that some of these models are quite large (several hundred MB each), so make sure you have some disk space free on your machine, or run this notebook in Colab, where downloads are usually faster!

We are going to be using pre-built models, and the best resource for implementing them is the [Pipelines documentation](https://huggingface.co/docs/transformers/main_classes/pipelines). If you ever want to delete the models locally after use, you can find them in your home directory at:

`~/.cache/huggingface/hub`
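
To see what is taking up space before deleting anything, a small sketch like the one below can help. It assumes the default cache location (`~/.cache/huggingface/hub`) and that each downloaded model sits in its own sub-directory; the helper name `cache_usage` is made up for this example and is not part of any HuggingFace library.

```python
from pathlib import Path


def cache_usage(cache_dir):
    """Return {sub-directory name: size in bytes} for each entry in cache_dir."""
    cache_dir = Path(cache_dir).expanduser()
    usage = {}
    if not cache_dir.is_dir():
        return usage  # nothing has been downloaded yet
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_dir():
            # Count real files only; symlinks are skipped to avoid double counting
            usage[entry.name] = sum(
                f.stat().st_size
                for f in entry.rglob("*")
                if f.is_file() and not f.is_symlink()
            )
    return usage


# ASSUMPTION: default cache location; adjust the path if you have moved your cache
for name, size in cache_usage("~/.cache/huggingface/hub").items():
    print(f"{name}: {size / 1e6:.1f} MB")
```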

### Basic Sentiment : 😀 /  😕 / 😠 / 😟

With that in mind, instantiate a pipeline for sentiment analysis __without__ specifying a model and try testing out that model with the sentence "Transformers are awesome!" Feel free to try some other sentences, too.

In [4]:
from transformers import pipeline

In [7]:
pipe = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


(…)d-sst-2-english/resolve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

(…)glish/resolve/main/tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

(…)ned-sst-2-english/resolve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

In [8]:
pipe("Transformers are awesome!")

[{'label': 'POSITIVE', 'score': 0.9998667240142822}]
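
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score` (the probability the model assigns to that label). If a single number is more convenient, the pair can be folded into a signed value. The helper below is a hypothetical convenience, not part of Transformers, and it assumes the only labels are `POSITIVE` and `NEGATIVE`, as in the output above.

```python
def signed_score(result):
    """Map {'label', 'score'} to a value in [-1, 1]; negative means NEGATIVE."""
    # ASSUMPTION: any label other than POSITIVE is treated as NEGATIVE
    sign = 1.0 if result["label"].upper() == "POSITIVE" else -1.0
    return sign * result["score"]


# Output copied from the cell above
results = [{"label": "POSITIVE", "score": 0.9998667240142822}]
print([round(signed_score(r), 4) for r in results])
```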

### Nuanced Sentiment 🤔

HuggingFace will default to using `distilbert-base-uncased-finetuned-sst-2-english` if we don't specify a model. This model works fine for many basic use cases, but it was fine-tuned on a fairly limited corpus of text, the Stanford Sentiment Treebank:

`The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.`

A model trained on movie-review sentences like these may struggle with modern slang, e.g. "These jokes were absolutely killer!" or "These beats are sick!". Try running these sentences through your pipeline now: you may get a negative label for at least one of them, even though both express quite positive sentiment.

In [9]:
pipe("These jokes were absolutely killer!")

[{'label': 'POSITIVE', 'score': 0.9855958223342896}]

In [10]:
pipe("These beats are sick!")

[{'label': 'NEGATIVE', 'score': 0.9997040629386902}]

Go to the list of HuggingFace models and see if you can find one that specializes in Twitter sentiment (`"twitter-roberta-base-sentiment-latest"` might be a good place to start); hopefully it will be a bit more up to date with all this new lingo! Now create a second pipeline, this time __specifying__ the model you want to use (with `model=`), and see how performance improves now that we're using a model fine-tuned on more relevant data.


In [23]:
model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
pipe2 = pipeline('text-classification', model=model)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The model 'RobertaForSequenceClassification' is not supported for translation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTF

In [14]:
pipe2( "These beats are sick!")



[{'label': 'positive', 'score': 0.7884202003479004}]

You should see a much more accurate interpretation of the sentiment we're trying to express.

### Sentiment in other languages

While even our first pipeline will actually perform surprisingly well on simple sentences in other languages (e.g. "C'est bon" or "Está bueno"), it breaks down when handling more sophisticated ideas in those languages.

Here is an example review for the Jurassic World Dominion movie 😬:

"This was frankly a spectacular failure from start to finish, with  remarkably uninspired performances from some very well-paid actors who acted with all the passion of a wet biscuit"

Translated into Korean, it reads: "이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다."

Try running the Korean text through either of your models; you should see that they don't pick up on how bad the review is.

In [15]:
pipe2( "이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다.")

[{'label': 'neutral', 'score': 0.7584187388420105}]

Now see if you can find a model on the HuggingFace Hub that might perform better, and use it. Try using the `"matthewburke/korean_sentiment"` model in a `text-classification` pipeline and see if your results change.

In [16]:
model_korean = "matthewburke/korean_sentiment"
pipe3 = pipeline("text-classification", model=model_korean)

(…)orean_sentiment/resolve/main/config.json:   0%|          | 0.00/887 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/498M [00:00<?, ?B/s]

(…)iment/resolve/main/tokenizer_config.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

(…)/korean_sentiment/resolve/main/vocab.txt:   0%|          | 0.00/396k [00:00<?, ?B/s]

(…)an_sentiment/resolve/main/tokenizer.json:   0%|          | 0.00/788k [00:00<?, ?B/s]

(…)ent/resolve/main/special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [17]:
pipe3( "이것은 솔직히 처음부터 끝까지 엄청난 실패였으며 젖은 비스킷의 모든 열정으로 연기한 일부 매우 보수가 좋은 배우들의 현저하게 영감을 받지 못한 연기로 끝났습니다.")


[{'label': 'LABEL_0', 'score': 0.9615505337715149}]
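
`LABEL_0` is the generic placeholder Transformers uses when a model's config does not define readable label names. The model card is the place to check what each index means; the mapping below (`LABEL_0` = negative, `LABEL_1` = positive) is an assumption made for illustration and should be verified there before you rely on it. The helper name `rename_labels` is also hypothetical.

```python
# ASSUMPTION: verify this mapping against the model card before trusting it
ASSUMED_LABELS = {"LABEL_0": "negative", "LABEL_1": "positive"}


def rename_labels(results, mapping):
    """Return a copy of the pipeline output with readable label names."""
    # Labels missing from the mapping are left unchanged
    return [{**r, "label": mapping.get(r["label"], r["label"])} for r in results]


# Output copied from the cell above
raw = [{"label": "LABEL_0", "score": 0.9615505337715149}]
print(rename_labels(raw, ASSUMED_LABELS))
```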

### Translation ✍️

Let's stick with our language theme and see if we can find a model that can handle the task of translating some sentences for us. The `opus-mt` project from the University of Helsinki is incredibly active on HuggingFace, creating and maintaining models designed to democratize the translation process for many different global languages. Try using a `"Helsinki-NLP/opus-mt-<source-language>-<destination-language>"` model to see if you can translate between two languages (e.g. English to Spanish).

In [25]:
# "fr-ca" is French -> Catalan ("ca" is the language code for Catalan)
model_HSK = "Helsinki-NLP/opus-mt-fr-ca"
pipe4 = pipeline('translation', model=model_HSK)

In [26]:
pipe4("Mon nom est Wolfgang et je vis à Berlin")

[{'translation_text': 'El meu nom és Wolfgang i visc a Berlín.'}]
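
Since the model names follow the `Helsinki-NLP/opus-mt-<source-language>-<destination-language>` pattern, the ID can be built from two language codes. The helper below is a hypothetical convenience; not every language pair has a published model, so check the Hub before relying on a generated name.

```python
def opus_mt_model(source, target):
    """Build an opus-mt model ID from two language codes, e.g. 'en' and 'es'."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


print(opus_mt_model("fr", "ca"))  # the French-to-Catalan model used above
print(opus_mt_model("en", "es"))  # English to Spanish
```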

### Summarization

Another really useful NLP task is summarizing a large amount of information into a very small number of words. BART is a model that performs well on tasks like summarization; it combines ideas from two models you've already seen briefly in the lecture, a BERT-style bidirectional encoder and an autoregressive GPT-style decoder. Check out this [link](https://www.projectpro.io/article/transformers-bart-model-explained/553) for some more information on it.

Since BART models can be quite large, try to find the `distilbart-xsum-12-6` model on HuggingFace which is one of the smallest distillations available (we'll talk more about distillations later!). Integrate that model into a `"summarization"` pipeline, then take some text (e.g. perhaps by copy-pasting or scraping from [a BBC article](https://www.bbc.com/news/topics/cx2pk70323et)) and summarize it with your pipeline!

N.B. You need to be careful about context windows - here, you may run into an issue with your input being too long for the model!
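
One simple way to stay inside the context window is to split a long article into smaller pieces and summarize each one separately. The sketch below splits on whitespace and uses the word count as a rough stand-in for the token count; tokenizers usually produce more tokens than words, so the `max_words` value here is an assumed, conservative guess rather than the model's actual limit.

```python
def chunk_text(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


article = "word " * 700  # stand-in for a long article
chunks = chunk_text(article)
print([len(c.split()) for c in chunks])  # → [300, 300, 100]
```

Each chunk can then be passed to a summarization pipeline in turn, and the partial summaries joined or summarized again.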

In [27]:
model = "sshleifer/distilbart-xsum-12-6"
pipe5 = pipeline("summarization", model = model)

(…)lbart-xsum-12-6/resolve/main/config.json:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/611M [00:00<?, ?B/s]

(…)-12-6/resolve/main/tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

(…)ilbart-xsum-12-6/resolve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

(…)ilbart-xsum-12-6/resolve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [28]:
pipe5("A cat whose pictures went viral for regularly visiting a railway station is releasing a Christmas single. Four-year-old Nala has been delighting commuters who have been taking photos of her at Stevenage station. Owner Natasha Ambler revealed the cat was releasing a single called Meow and has been approached for a book deal. The ginger tabby has also recorded a video for the song due to be released this week, under the name Nala the Station Cat.")

[{'summary_text': 'Nala the Station Cat is releasing a Christmas single.'}]

### Going further: Question Answering 🔍

What if we wanted to go further than just a summary? Perhaps asking questions about a specific dataset in an intuitive way? There's a model for that, too! Enter the (reasonably small) `roberta-base-squad2` - a model trained on question-answer pairs that can answer a `question` about a provided `context` (a body of text you will provide). Check the docs [here](https://huggingface.co/deepset/roberta-base-squad2?context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species.&question=How+many+species+are+in+the+Amazon%3F).

You know the drill: Create a `"question-answering"` pipeline with the `roberta-base-squad2` model, then try putting the `article` you picked before as your context and try asking a `question` about it.

In [33]:
model_name = "deepset/roberta-base-squad2"

In [36]:
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'What animal is Nala',
    'context': "Four-year-old Nala has been delighting commuters who have been taking photos of her at Stevenage station. Owner Natasha Ambler revealed the cat was releasing a single called Meow and has been approached for a book deal.The ginger tabby has also recorded a video for the song due to be released this week, under the name Nala the Station Cat. It has been produced by Danny Kirsch, who wrote it with Joe Killington, while Nala is also co-credited as a songwriter, as well as a vocalist. Ms Ambler said 'we want to spread the happiness that Stevenage has had, and she's had on socials to the world'. Nala's owners are hoping the cat will make it into the official UK chart with her new single. The single is officially released on Wednesday and BBC Three Counties Radio's Justin Dealey gave the single an exclusive first play on Sunday"
}
res = nlp(QA_input)

In [37]:
res

{'score': 0.17616383731365204,
 'start': 223,
 'end': 235,
 'answer': 'ginger tabby'}
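
`start` and `end` are character offsets into the `context` string, so slicing the context with them recovers the answer. The check below uses only the opening of the context from the cell above, copied character for character (including the missing space after "deal."), together with the result shown.

```python
# Opening of the context used above, copied exactly so the offsets line up
context = (
    "Four-year-old Nala has been delighting commuters who have been taking "
    "photos of her at Stevenage station. Owner Natasha Ambler revealed the "
    "cat was releasing a single called Meow and has been approached for a "
    "book deal.The ginger tabby has also recorded a video for the song"
)

# Result copied from the output above
res = {"score": 0.17616383731365204, "start": 223, "end": 235, "answer": "ginger tabby"}

# Slice the context with the returned character offsets
span = context[res["start"]:res["end"]]
print(span)  # → ginger tabby
```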

### Speech to text 🎤

One of the best models for converting speech to text is the open-source Whisper model made by OpenAI (creator of ChatGPT etc.). Take a look at the diagram of the model architecture; it should now look quite similar to those you've already seen today:


<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/whipser.png width = 450px>

Run the following command to download this audio sample and install some additional required packages:

In [39]:
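# Linux / Colab (on Windows, run this inside WSL, where apt is available)
!sudo apt install ffmpeg

# Mac users: comment out the line above and uncomment the line below
#!HOMEBREW_NO_AUTO_UPDATE=1 brew install ffmpeg

# Create the data folder (no error if it already exists) and download the sample
!mkdir -p data
!curl https://wagon-public-datasets.s3.amazonaws.com/deep_learning_datasets/harvard.wav > data/harvard.wav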

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3173k  100 3173k    0     0  2319k      0  0:00:01  0:00:01 --:--:-- 2318k


You can listen to the clip by importing `Audio` from `IPython.display` and loading the audio file (see the Algebra day recap for an example of how this is done!)

In [41]:
from IPython.display import Audio

# Leaving the Audio object as the last expression makes the notebook render a player
Audio("data/harvard.wav")

Find the smallest Whisper model version on HuggingFace (`whisper-tiny`) and use it to transcribe the audio. Try it on some other `.wav` files if you'd like!

In [46]:
model = "openai/whisper-tiny"
pipe8 = pipeline("automatic-speech-recognition",
                 model=model,
                 chunk_length_s=30,  # process the audio in 30-second chunks
                 )
prediction = pipe8("data/harvard.wav", batch_size=8)["text"]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [47]:
prediction

' The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health in zest. A salt pickle tastes fine with ham. Tacos all pastora are my favorite. A zestful food is the hot cross bun.'

### Bonus: Let's get multimodal 😎: Visual Question Answering

We can even use question-answering style models on images if we'd like. Many of these models will use chains under the hood that will extract text from an image then pass it through to a language model. In order to use the following model you will need to make sure you `pip install Pillow pytesseract` which are two libraries that will help us to extract text from our images.

Once that's done, we're going to create a `"document-question-answering"` pipeline. We'll need a model for it, so search for the `layoutlm-invoices` model on HuggingFace. Then try to ask questions about this [`receipt.webp`](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/receipt.webp) (you can download the image to your data folder, or pass the URL directly to your pipeline when you call it). Try asking how much the eggs cost, what the sales tax was, and what the total was. Feel free to try it on some of your own images!

For this to run, you'll need some dependencies:

In [48]:
# For Mac, uncomment:
#!brew install tesseract

# For Linux or Colab, run these (comment them out on Mac):
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

# Then restart your kernel and give it a try!

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 8 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (4,173 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debco

In [None]:
pass  # YOUR CODE HERE

Congrats 🎉 You've just seen how simple it can be to start working with some advanced Transformer-based models and we've only just scratched the surface.

There are so many models you can explore in the HuggingFace library for all kinds of different tasks. Your imagination is the limit (well, your compute power can also be a limit sometimes 😅). To take these models even further for custom usage, we're going to tackle fine-tuning next.

⚠️⚠️⚠️ If you have been running these models locally, don't forget to clean up your `~/.cache/huggingface/hub` if you're limited on space, or you'll have a lot of unwanted models hanging around in your cache 🧹 ⚠️⚠️⚠️