<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/1_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://github.com/huggingface/notebooks/blob/main/transformers_doc/quicktour.ipynb

# Quick tour<br>
Get up and running with 🤗 Transformers! Start using the pipeline() for rapid inference, and quickly load a pretrained model and tokenizer with an AutoClass to solve your text, vision or audio task.

In [None]:
!pip install -U datasets

**Install the transformers libraries and datasets**<br>

Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

In [None]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git
     

# Pipeline<br>
pipeline() is the easiest way to use a pretrained model for a given task.<br>
These pipelines are objects that abstract away most of the library's complex code.

The pipeline() supports many common tasks out-of-the-box:

**Text:**

- **Sentiment analysis**: classify the polarity of a given text.
- **Text generation (in English)**: generate text from a given input.
- **Named entity recognition (NER)**: label each word with the entity it represents (person, date, location, etc.).
- **Question answering**: extract the answer from the context, given some context and a question.
- **Fill-mask**: fill in the blank given a text with masked words.
- **Summarization**: generate a summary of a long sequence of text or document.
- **Translation**: translate text into another language.
- **Feature extraction**: create a tensor representation of the text.
- **Conversational**
- **Text classification**
- **Text2Text generation**
- **Zero-shot classification**



**Image:**

- **Image classification**: classify an image.
- **Image segmentation**: classify every pixel in an image.
- **Object detection**: detect objects within an image.

**Audio:**

- **Audio classification**: assign a label to a given segment of audio.
- **Automatic speech recognition (ASR)**: transcribe audio data into text.

# Pipeline usage<br>
In the following example, you will use the pipeline() for sentiment analysis.

In [None]:
from transformers import pipeline

In [None]:
#Unless specified, the code should work in both PyTorch and TensorFlow
!pip install torch
!pip install tensorflow

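Use the sentiment analysis pipeline, which is also known as the text classification pipeline.<br>
<br>
The default model used below predicts one of two classes: <br>
>POSITIVE<br>
NEGATIVE

Other sentiment models on the Hub may also predict a neutral class.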

Sentiment analysis can be used to:

- Analyze social media mentions to understand how people are talking about your brand vs your competitors.<br>
- Analyze feedback from surveys and product reviews to quickly get insights into what your customers like and dislike about your product.<br>
- Analyze incoming support tickets in real-time to detect angry customers and act accordingly to prevent churn.<br>

There are more than 215 sentiment analysis models publicly available on the Hub, and integrating them with Python takes just five lines of code.<br>
This example uses the default model:<br><br>
**DistilBERT base uncased finetuned SST-2**<br>
This model reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).

The model outputs a label and a score for each input:<br>
>[{'label': 'POSITIVE', 'score': 0.9998},<br>
 {'label': 'NEGATIVE', 'score': 0.9991}]
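
Because the output is a plain list of dictionaries, it can be post-processed with ordinary Python. The following is a minimal sketch, assuming the sample output shown above rather than a live model call. The helper name `flag_negative` and the `threshold` value are hypothetical and chosen only for illustration, for example to pick out support tickets that look confidently negative.

```python
# Sample output copied from the text above; a real call would be classifier([...]).
sample_output = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9991},
]


def flag_negative(predictions, threshold=0.9):
    """Return the indices of inputs labelled NEGATIVE with score >= threshold.

    Both the function name and the default threshold are assumptions
    made for this sketch; they are not part of the Transformers API.
    """
    return [
        index
        for index, prediction in enumerate(predictions)
        if prediction["label"] == "NEGATIVE" and prediction["score"] >= threshold
    ]


flagged = flag_negative(sample_output)
print(flagged)
```

A threshold like this is a judgment call: a higher value flags fewer inputs but with more confidence.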

In [None]:
classifier = pipeline("sentiment-analysis")

The pipeline downloads and caches a default pretrained model and tokenizer for sentiment analysis. Now you can use the classifier on your target text:



In [None]:
classifier("I don't know if I like eel, sometimes I do.")

For more than one sentence, pass a list of sentences to the pipeline(), which returns a list of dictionaries:



In [None]:
results = classifier(["I went to the park expecting it to be crowded, it wasn't!", 
                      "This park has nice views, but it had a lot of litter"])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")



---






The pipeline() can also iterate over an entire dataset. 

Create a pipeline() with the task you want to solve for and the model you want to use.

In [None]:
import torch
from transformers import pipeline

speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

Next, load a dataset (see the 🤗 Datasets Quick Start for more details) you'd like to iterate over. For example, let's load the MInDS-14 dataset:

In [None]:
!pip install datasets[audio]

In [None]:
from datasets import load_dataset, Audio

This dataset requires a connection to Dropbox. <br>
Sometimes the connection is not successful. <br>
You may need to try several times.

In [None]:
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

In [None]:
dataset

In [None]:
dataset[0]

We need to make sure that the sampling rate of the dataset matches the sampling rate facebook/wav2vec2-base-960h was trained on.

In [None]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) 
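
To see why the sampling rates must match, here is a rough sketch of what resampling does to a waveform. It assumes, purely for illustration, a recording sampled at 8 kHz and a model that expects 16 kHz. The `resample_linear` function is a hypothetical, naive linear interpolator written for this example; the resampler that 🤗 Datasets applies through `cast_column` is more sophisticated.

```python
def resample_linear(samples, source_rate, target_rate):
    """Naively resample a waveform by linear interpolation (illustration only)."""
    if not samples:
        return []
    duration = len(samples) / source_rate          # length in seconds
    target_length = round(duration * target_rate)  # samples needed at the new rate
    resampled = []
    for i in range(target_length):
        position = i * source_rate / target_rate   # fractional index into the source
        left = int(position)
        right = min(left + 1, len(samples) - 1)    # clamp at the last sample
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


# Toy waveform: 4 samples at an assumed 8 kHz, resampled to an assumed 16 kHz.
waveform_8k = [0.0, 1.0, 0.0, -1.0]
waveform_16k = resample_linear(waveform_8k, 8_000, 16_000)
print(len(waveform_8k), len(waveform_16k))
```

The duration in seconds stays the same while the number of samples changes. Without this step the model would read the same numbers at the wrong rate.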

Audio files are automatically loaded and resampled when calling the "audio" column. Let's extract the raw waveform arrays of the first 4 samples and pass them as a list to the pipeline:



In [None]:
result = speech_recognizer(dataset[:4]["audio"])
# Print one transcription per line
for d in result:
    print(d["text"])



---






**Let's listen to the speech and compare it to the text transcription.**

---






In [None]:
!pip install -q sounddevice==0.4.4
!pip install -q mediapipe==0.10.0

In [None]:
!wget -O classifier.tflite -q https://storage.googleapis.com/mediapipe-models/audio_classifier/yamnet/float32/1/yamnet.tflite

In [None]:
import urllib

We get the file path from the printout of the dataset record. <br>
The audio recording is a .wav file.

In [None]:
dataset[0]

In [None]:
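# Import under an alias so IPython's Audio does not shadow datasets.Audio, which was imported earlier
from IPython.display import Audio as IPythonAudio, display

# The hash directory in this path is specific to the local cache; copy the path printed by dataset[0]
file_name = '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'
display(IPythonAudio(file_name, autoplay=False))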

In [None]:
result = speech_recognizer(dataset[0]["audio"])
print(result)



---






# Use another model and tokenizer in the pipeline<br>

The pipeline() can accommodate any model from the Model Hub, making it easy to adapt the pipeline() for other use-cases. For example, if you'd like a model capable of handling French text, use the tags on the Model Hub to filter for an appropriate model. 

In [None]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

Use the AutoModelForSequenceClassification and AutoTokenizer to load the pretrained model and its associated tokenizer (more on an AutoClass below):

**Tokenizer example:**<br>
Consider the sentence: “Never give up”.

The most common way of forming tokens is to split on spaces. With a space as the delimiter, tokenizing the sentence yields<br>
3 tokens: Never, give, up. <br>
Because each token is a word, this is an example of word tokenization.
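
The whitespace rule described above can be reproduced with plain Python. This is only a sketch of word tokenization. Pretrained tokenizers, such as the one loaded next, typically apply additional rules, for example lowercasing or splitting rare words into smaller pieces.

```python
sentence = "Never give up"

# str.split() with no arguments splits on runs of whitespace.
tokens = sentence.split()

print(tokens)       # ['Never', 'give', 'up']
print(len(tokens))  # 3
```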

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
tokenizer

Then you can specify the model and tokenizer in the pipeline(), and apply the classifier on your target text:

In [None]:
# TensorFlow alternative: running this cell replaces the PyTorch model loaded above
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("This is a terribly noisy place")

In [None]:
classifier("This is a wonderful place")

In [None]:
classifier("This is a wonderful cafe but very noisy")

If you can't find a model for your use-case, you will need to fine-tune a pretrained model on your data. 



---






# AutoClass

Under the hood, the AutoModelForSequenceClassification and AutoTokenizer classes work together to power the pipeline(). An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task, and load its associated tokenizer with AutoTokenizer.

Let's return to our example and see how you can use the AutoClass to replicate the results of the pipeline().

# AutoTokenizer<br><br>
A tokenizer is responsible for preprocessing text into a format that the model can understand. First, the tokenizer splits the text into units called tokens, which may be words or pieces of words. Multiple rules govern the tokenization process, including how to split a word and at what level (learn more in the tokenizer summary of the 🤗 Transformers documentation). The most important thing to remember is that you need to instantiate the tokenizer with the same model name, so that you use the same tokenization rules the model was pretrained with.

Load a tokenizer with AutoTokenizer:

In [None]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model. The mapping from tokens to numbers is known as the model's vocabulary.

Pass your text to the tokenizer:

In [None]:
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoding)

The tokenizer will return a dictionary containing:

- input_ids: numerical representations of your tokens.
- attention_mask: indicates which tokens should be attended to.

Just like the pipeline(), the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:

In [None]:
#PyTorch
pt_batch = tokenizer(
    ["I am not happy with this tour and the day's schedule"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)    

In [None]:
pt_batch
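
To make padding, truncation, and the attention mask concrete, here is a toy sketch in plain Python. The vocabulary, the `encode_batch` helper, and the token IDs are all hypothetical and invented for this example. A real tokenizer uses its own vocabulary and usually adds special tokens as well, which this sketch omits.

```python
# Hypothetical toy vocabulary; real tokenizers ship with their own.
vocab = {"[PAD]": 0, "[UNK]": 1, "never": 2, "give": 3, "up": 4, "i": 5}


def encode_batch(sentences, max_length=4):
    """Encode sentences into equal-length input_ids and attention_mask lists."""
    encoded = []
    for sentence in sentences:
        ids = [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]
        encoded.append(ids[:max_length])  # truncation: drop tokens past max_length

    longest = max(len(ids) for ids in encoded)
    batch = {"input_ids": [], "attention_mask": []}
    for ids in encoded:
        padding = longest - len(ids)
        # Padding: fill shorter sequences with the [PAD] id.
        batch["input_ids"].append(ids + [vocab["[PAD]"]] * padding)
        # Mask: 1 for real tokens, 0 for padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * padding)
    return batch


batch = encode_batch(["Never give up", "I give"])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sentence is one token shorter, so it receives one padding ID and a matching 0 in its attention mask.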

In [None]:
#TensorFlow
tf_batch = tokenizer(
    ["I am not happy with this tour and the day's schedule"],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf",
)

In [None]:
tf_batch

Read the preprocessing tutorial for more details about tokenization.

# AutoModel<br>
🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. Since you are doing text - or sequence - classification, load AutoModelForSequenceClassification:

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install sentencepiece

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

See the task summary for which AutoModel class to use for which task.

Now you can pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding **:

In [None]:
pt_outputs = pt_model(**pt_batch)
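
The `**` operator is ordinary Python dictionary unpacking: each key becomes a keyword argument. The sketch below uses a hypothetical `toy_model` function and made-up token IDs in place of the real model and batch, so it runs without downloading anything.

```python
def toy_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model's forward call."""
    return {"n_tokens": len(input_ids), "has_mask": attention_mask is not None}


# Made-up token IDs, used only to illustrate the call pattern.
toy_batch = {"input_ids": [101, 2057, 102], "attention_mask": [1, 1, 1]}

# These two calls are equivalent.
unpacked = toy_model(**toy_batch)
explicit = toy_model(
    input_ids=toy_batch["input_ids"],
    attention_mask=toy_batch["attention_mask"],
)
print(unpacked == explicit)  # True
```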

The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities:

In [None]:
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
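
For intuition, softmax can also be computed by hand. The sketch below uses five made-up logits, on the assumption that the model emits one score per class. The values are not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1, -1.0, -2.0]  # made-up scores, one per class
probabilities = softmax(toy_logits)
predicted = probabilities.index(max(probabilities))

print([round(p, 4) for p in probabilities])
print("predicted class index:", predicted)
```

The largest logit always maps to the largest probability, so the predicted class can be read from either.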

For TensorFlow, 🤗 Transformers provides the same simple and unified way to load pretrained instances. This means you can load a TFAutoModel like you would load an AutoTokenizer. The only difference is selecting the correct TFAutoModel for the task. Since you are doing text - or sequence - classification, load TFAutoModelForSequenceClassification:

In [None]:
from transformers import TFAutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

See the task summary for which AutoModel class to use for which task.

Now you can pass your preprocessed batch of inputs directly to the model. The TensorFlow model accepts the batch dictionary as-is, so no unpacking is needed:

In [None]:
tf_outputs = tf_model(tf_batch)


The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities:



In [None]:
import tensorflow as tf

tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
tf_predictions

All 🤗 Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss.

Models are a standard torch.nn.Module or a tf.keras.Model so you can use them in your usual training loop. However, to make things easier, 🤗 Transformers provides a Trainer class for PyTorch that adds functionality for distributed training, mixed precision, and more. For TensorFlow, you can use the fit method from Keras. Refer to the training tutorial for more details.

🤗 Transformers model outputs are special dataclasses so their attributes are autocompleted in an IDE. The model outputs also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string) in which case the attributes that are None are ignored.
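
The tuple-like and dictionary-like behavior described above can be imitated with a small dataclass. `ToyOutput` is a hypothetical class written for this sketch and is not the library's implementation. It only shows how attributes set to None are skipped when indexing by position.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyOutput:
    """Hypothetical imitation of a model output; not the library's class."""

    loss: Optional[float] = None
    logits: Optional[list] = None

    def to_tuple(self):
        # Keep only the attributes that are not None, in declaration order.
        values = (getattr(self, field.name) for field in fields(self))
        return tuple(value for value in values if value is not None)

    def __getitem__(self, key):
        if isinstance(key, str):
            return getattr(self, key)  # dictionary-style access
        return self.to_tuple()[key]    # integer or slice access


toy_output = ToyOutput(logits=[0.2, 0.8])

print(toy_output.logits)     # attribute access
print(toy_output["logits"])  # string key
print(toy_output[0])         # loss is None, so index 0 is the logits
```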





---






# Save a model<br>
Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():



In [None]:
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():

In [None]:
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")


Once your model is fine-tuned, you can save it with its tokenizer using TFPreTrainedModel.save_pretrained():

In [None]:
tf_save_directory = "./tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
tf_model.save_pretrained(tf_save_directory)


When you are ready to use the model again, reload it with TFPreTrainedModel.from_pretrained():

In [None]:
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")


The from_pt or from_tf parameter converts a model saved in one framework when you load it in the other:

In [None]:
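from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the TensorFlow checkpoint saved above into a PyTorch model
tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)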

In [None]:
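from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Load the PyTorch checkpoint saved above into a TensorFlow model
tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)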