<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/2_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git
     

# Preprocessing Data
Before you can use your data in a model, the data needs to be processed into an acceptable format for the model. A model does not understand raw text, images or audio. These inputs need to be converted into numbers and assembled into tensors. <br>

In this tutorial, you will:

>Preprocess textual data with a tokenizer.<br>
Preprocess image or audio data with a feature extractor.<br>
Preprocess data for a multimodal task with a processor.<br>

# NLP

The main tool for processing textual data is a tokenizer. A tokenizer starts by splitting text into tokens according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.
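
As a rough illustration of those steps, here is a toy tokenizer in plain Python. The vocabulary, the regular expression, and the `toy_tokenize` and `toy_encode` helpers are all made up for this sketch; real tokenizers such as BERT's use learned sub-word vocabularies and more elaborate rules.

```python
import re

# Hypothetical toy vocabulary; a real pretrained vocab has tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "am": 5, "you": 6, "are": 7, "not": 8, ".": 9}


def toy_tokenize(text):
    """Split lowercased text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Tokenize, add special tokens, and map every token to its id."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


print(toy_tokenize("I am. You are not."))
print(toy_encode("I am. You are not."))
```

The second `print` shows the extra ids added at both ends, which is the toy equivalent of the additional inputs a real tokenizer adds for the model.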

**If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer.** This ensures the text is split the same way as the pretraining corpus, and that the same token-to-index mapping (usually referred to as the vocab) is used as during pretraining.
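
To see why the pairing matters, consider two vocabularies that assign different ids to the same tokens. This is only a sketch with invented vocabularies, but the same kind of mismatch occurs when text is encoded with one checkpoint's tokenizer and fed to a model pretrained with another.

```python
# Two invented vocabularies that contain the same tokens in a different order.
vocab_a = {"hello": 0, "world": 1, "wizard": 2}
vocab_b = {"wizard": 0, "hello": 1, "world": 2}

# Reverse lookup (id -> token) for the second vocabulary.
id_to_token_b = {index: token for token, index in vocab_b.items()}

tokens = ["hello", "world"]
ids_from_a = [vocab_a[token] for token in tokens]

# Reading ids produced with vocab_a through vocab_b yields different words.
misread = [id_to_token_b[index] for index in ids_from_a]
print(ids_from_a)  # [0, 1]
print(misread)     # ['wizard', 'hello']
```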

Get started quickly by loading a pretrained tokenizer with the AutoTokenizer class. This downloads the vocab that was used when the model was pretrained.



# Tokenize
A tokenizer takes care of:<br>
**Tokenizing** (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).<br>
**Adding new tokens** to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).<br>
**Managing special tokens** (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.

Load a pretrained tokenizer with AutoTokenizer.from_pretrained():


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then pass your sentence to the tokenizer:

Examples of token ids:<br>
>101 - beginning of text ([CLS])<br>
102 - end of text ([SEP])<br>
119 - period<br>

Punctuation marks get their own tokens.

In [None]:
encoded_input1 = tokenizer(".I and am you are the")
print(encoded_input1)

In [None]:
encoded_input2 = tokenizer("I am. You are not.")
print(encoded_input2)

In [None]:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

**Assignment:** <br>
Try different input sequences.<br>
How are word tenses handled? <br>
How about punctuation? <br>




---





The tokenizer returns a dictionary with three important items:<br>

- input_ids are the indices corresponding to each token in the sentence.<br>
- attention_mask indicates whether a token should be attended to or not.<br>
- token_type_ids identifies which sequence a token belongs to when there is more than one sequence.<br>
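
The sketch below builds the same three fields by hand for a pair of already-tokenized sequences, following the BERT layout `[CLS] A [SEP] B [SEP]`. The word ids are placeholders chosen for the example, the pad id of 0 is an assumption, and `build_pair_encoding` is a hypothetical helper, not a library function.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # BERT-style special ids; PAD_ID = 0 is an assumption


def build_pair_encoding(ids_a, ids_b, total_length):
    """Assemble input_ids, token_type_ids and attention_mask for two sequences."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Real tokens get 1; the padding appended below gets 0.
    attention_mask = [1] * len(input_ids)

    padding = total_length - len(input_ids)
    if padding < 0:
        raise ValueError("total_length is shorter than the encoded pair")
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


encoding = build_pair_encoding([11, 12], [21, 22, 23], total_length=10)
for key, value in encoding.items():
    print(key, value)
```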

You can decode the input_ids to return the original input:

In [None]:
tokenizer.decode(encoded_input["input_ids"])

As you can see, the tokenizer added two special tokens - CLS and SEP (classifier and separator) - to the sentence. Not all models need special tokens, but if they do, the tokenizer will automatically add them for you.
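
Conceptually, decoding is a reverse lookup from ids to tokens, followed by joining the pieces back together. The sketch below uses an invented id-to-token table and a hypothetical `toy_decode` helper; the real `decode` method also merges sub-word pieces and cleans up spacing, which is omitted here.

```python
# Invented id -> token table; only 101, 102 and 119 match the ids listed earlier.
ID_TO_TOKEN = {101: "[CLS]", 102: "[SEP]", 119: ".", 1: "I", 2: "am"}
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def toy_decode(input_ids, skip_special_tokens=False):
    """Map ids back to tokens and join them with spaces."""
    tokens = [ID_TO_TOKEN[index] for index in input_ids]
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
    return " ".join(tokens)


print(toy_decode([101, 1, 2, 119, 102]))
print(toy_decode([101, 1, 2, 119, 102], skip_special_tokens=True))
```

The real `tokenizer.decode` also accepts a `skip_special_tokens` argument if you want the text without CLS and SEP.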

If there are several sentences you want to process, pass the sentences as a list to the tokenizer:

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
# Token ids to look for: "?" = 136, "'" = 112

**Assignment:**<br>
Look at the encoded inputs. <br>
See if you can figure out the numbers for each word.<br>
One thing you'll notice is that there is not always a 1-to-1 correspondence between words and ids. What is going on? <br>
**Hint: try changing elevensies to eleven.**
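
Once you have tried the assignment, the following sketch may help you check your answer. BERT-style tokenizers use WordPiece, which greedily matches the longest vocabulary entry at each position and marks word-internal pieces with a `##` prefix. This is a simplified version of that matching over an invented vocabulary, so the exact pieces will differ from what bert-base-cased produces; `wordpiece_split` is a hypothetical helper.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary for illustration only.
toy_vocab = {"eleven", "##s", "##ies", "breakfast", "second"}

print(wordpiece_split("eleven", toy_vocab))
print(wordpiece_split("elevensies", toy_vocab))
print(wordpiece_split("hobbit", toy_vocab))
```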

**Pad**<br>
This brings us to an important topic. When you process a batch of sentences, they aren't always the same length. <br>
**This is a problem because tensors, the input to the model, need to have a uniform shape.** <br>
Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to sentences with fewer tokens.<br>
**Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence:**

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)
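
Conceptually, padding amounts to appending a pad id until every sequence reaches the target length, while the attention mask records which positions are real. A minimal sketch, assuming a pad id of 0 and a hypothetical `pad_batch` helper (the word ids are placeholders):

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        missing = longest - len(ids)
        input_ids.append(ids + [pad_id] * missing)
        attention_mask.append([1] * len(ids) + [0] * missing)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102], [101, 7, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])

# Every row now has the same length, so the batch is rectangular.
print({len(row) for row in padded["input_ids"]})
```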

**Truncation**<br>
Sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.

**Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model:**

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
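
As a sketch of what truncation does to a single sequence: the word tokens are cut so that, once the special tokens are added back, the total fits within the limit. The `truncate_ids` helper and the word ids are hypothetical. Note that bert-base-cased accepts up to 512 tokens, far more than the sentences above contain, so truncation=True leaves them unchanged.

```python
CLS_ID, SEP_ID = 101, 102  # BERT-style special token ids


def truncate_ids(token_ids, max_length):
    """Keep [CLS] and [SEP] and drop tokens from the end to fit max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    kept = token_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]


word_ids = [11, 12, 13, 14, 15, 16]
print(truncate_ids(word_ids, max_length=5))
print(truncate_ids(word_ids, max_length=20))
```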

**Assignment:**<br>
Write sentences of different lengths and look at the effect of padding on the attention mask.

**Build tensors**<br>
Finally, you want the tokenizer to return the actual tensors that are fed to the model.

Set the return_tensors parameter to either pt for PyTorch, or tf for TensorFlow:



In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)


In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
print(encoded_input)


# Pad and truncate
Just like with the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch of audio. Here the work is done by a feature extractor instead of a tokenizer. Take a look at the sequence lengths of these two audio samples:



In [None]:
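from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# MInDS-14 audio is sampled at 8 kHz; resample to the 16 kHz passed to the feature extractor below.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))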

In [None]:
dataset[0]["audio"]


In [None]:
dataset[0]["audio"]["array"].shape


In [None]:
dataset[1]["audio"]["array"].shape


As you can see, the first sample has a longer sequence than the second sample. Let's create a function that will preprocess the dataset. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

In [None]:

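def preprocess_function(examples):
    # Collect the raw waveform of every example in the batch.
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        padding=True,       # pad shorter clips to the longest one in the batch
        max_length=100000,  # upper bound on the number of samples per clip
        truncation=True,    # cut clips longer than max_length
    )
    return inputs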

Apply the function to the first few examples in the dataset:



In [None]:
processed_dataset = preprocess_function(dataset[:5])

Now take another look at the processed sample lengths:

In [None]:
processed_dataset["input_values"][0].shape

In [None]:
processed_dataset["input_values"][1].shape

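The first two samples now have the same length. Clips longer than the maximum length you specified are truncated to it, and the remaining clips are padded to the longest clip in the batch.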
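
The same idea can be sketched for raw audio, where the padding value is silence rather than a pad token. A minimal example in plain Python, assuming a padding value of 0.0 and using a hypothetical `fit_to_length` helper with short made-up sample lists; it is not the feature extractor's implementation.

```python
def fit_to_length(samples, max_length, pad_value=0.0):
    """Truncate or right-pad a list of audio samples to exactly max_length."""
    if len(samples) >= max_length:
        return samples[:max_length]
    return samples + [pad_value] * (max_length - len(samples))


long_clip = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4]
short_clip = [0.2, 0.1]

fitted = [fit_to_length(clip, max_length=4) for clip in (long_clip, short_clip)]
print(fitted)
print([len(clip) for clip in fitted])
```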