Load the data from a local file into a dataset

In [None]:
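from datasets import load_dataset
# Read a local semicolon-separated file into the columns "text" and "label".
emotions_local = load_dataset("csv", data_files="train.txt", sep=";",
                              names=["text", "label"])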

Convert the Dataset to a DataFrame

In [None]:
import pandas as pd

emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()

Check how many words each text contains

Texts that are longer than a model’s context size need to be truncated, which can lead to a loss in
performance if the truncated text contains crucial information; in this case, it looks
like that won’t be an issue.

In [None]:
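import matplotlib.pyplot as plt

# Count whitespace-separated words per text.
df["Words Per Tweet"] = df["text"].str.split().apply(len)
# Assumes df already has a "label_name" column holding the string label.
df.boxplot("Words Per Tweet", by="label_name", grid=False, showfliers=False,
           color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()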

After the initial data exploration, reset the format

In [None]:
emotions.reset_format()

Transformer models like DistilBERT cannot receive raw strings as input; instead, they
assume the text has been tokenized and encoded as numerical vectors. Tokenization is
the step of breaking down a string into the atomic units used in the model. There are
several tokenization strategies one can adopt, and the optimal splitting of words into
subunits is usually learned from the corpus. Before looking at the tokenizer used for
DistilBERT, let’s consider two extreme cases: character and word tokenization.

Character tokenization

In [1]:
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']


This is a good start, but we’re not done yet. Our model expects each character to be
converted to an integer, a process sometimes called numericalization. One simple way
to do this is by encoding each unique token (which are characters in this case) with a
unique integer:

In [8]:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)

{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}


In [12]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]


Each token has now been mapped to a unique integer, but feeding these IDs to a model directly is problematic: it creates a fictitious ordering between the tokens, and neural networks are really good at learning these kinds of relationships. So instead, we can use one-hot encoding: create a new column for each category and assign a 1 where the category is true, and a 0 otherwise.

Adding two integer IDs just produces another ID, which is meaningless. On the other hand, the result of adding two one-hot encodings can easily be interpreted: the two entries that are "hot" indicate that the corresponding tokens co-occur. We can create the one-hot encodings in PyTorch by converting input_ids to a tensor and applying the one_hot() function.
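A minimal sketch of that conversion, assuming PyTorch is installed. It rebuilds `token2idx` and `input_ids` from the example sentence so that the cell runs on its own; the variable name `one_hot_encodings` is chosen here for illustration.

```python
import torch
import torch.nn.functional as F

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = torch.tensor([token2idx[token] for token in tokenized_text])

# One row per input token, one column per vocabulary entry.
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
print(one_hot_encodings.shape)  # torch.Size([38, 20])

# Exactly one entry per row is "hot", at the position given by the token ID.
print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")
```

Setting `num_classes` explicitly matters: without it, `one_hot()` infers the number of columns from the largest ID in the input, which can be smaller than the vocabulary.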

From our simple example we can see that **character-level tokenization ignores any
structure in the text and treats the whole string as a stream of characters**. Although
this helps deal with misspellings and rare words, the main drawback is that **linguistic
structures such as words need to be learned from the data**. This requires significant
compute, memory, and data. For this reason, **character tokenization is rarely used in
practice**. Instead, some structure of the text is preserved during the tokenization step.
Word tokenization is a straightforward approach to achieve this, so let’s take a look at
how it works.


Instead of splitting the text into characters, we can split it into words and map each
word to an integer. Using words from the outset enables the model to skip the step of
learning words from characters, and thereby reduces the complexity of the training
process.

One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python's `split()` function directly on the raw text (just like we did to measure the tweet lengths):

In [13]:
tokenized_text = text.split()
print(tokenized_text)

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']


From here we can take the same steps we took for the character tokenizer to map
each word to an ID. However, we can already see one potential problem with this
tokenization scheme: punctuation is not accounted for, so `NLP.` is treated as a single
token. Given that words can include declensions, conjugations, or misspellings, the
size of the vocabulary can easily grow into the millions.
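The ID-mapping steps for words can be sketched as follows, reusing the example sentence. The regular-expression split at the end is only an illustrative assumption of how punctuation could be separated; the text does not prescribe it.

```python
import re

text = "Tokenizing text is a core task of NLP."

# Whitespace tokenization, then the same ID mapping used for characters.
tokenized_text = text.split()
token2idx = {tok: idx for idx, tok in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[tok] for tok in tokenized_text]
print(token2idx)
print(input_ids)  # [1, 7, 4, 2, 3, 6, 5, 0]

# Illustrative alternative: keep punctuation as separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
print(punct_tokens)
```

With the plain `split()`, `NLP.` and `NLP` would receive two unrelated IDs, which is one way the vocabulary grows faster than the number of distinct words.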

Having a large vocabulary is a problem because it requires neural networks to have an
enormous number of parameters.
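A back-of-the-envelope calculation gives a feel for the scale. The numbers are assumptions chosen for illustration (a vocabulary of one million words and a first layer that projects to 1,000 dimensions); they do not describe a specific model.

```python
vocab_size = 1_000_000  # assumed number of unique words
hidden_dim = 1_000      # assumed output size of the first dense layer

# A dense layer needs one weight per (input dimension, output dimension) pair.
num_weights = vocab_size * hidden_dim
print(f"{num_weights:,} weights in the first layer alone")
```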

Naturally, we want to avoid being so wasteful with our model parameters since models
are expensive to train, and larger models are more difficult to maintain. A common
approach is to limit the vocabulary and discard rare words by considering, say,
the 100,000 most common words in the corpus. Words that are not part of the
vocabulary are classified as “unknown” and mapped to a shared UNK token. This
means that we lose some potentially important information in the process of word
tokenization, since the model has no information about words associated with UNK.
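The vocabulary limit can be sketched on a toy corpus. The corpus, the limit of three words, and the `[UNK]` spelling are all hypothetical choices made so that the effect is visible in a few lines.

```python
from collections import Counter

corpus = [
    "i feel happy today",
    "i feel sad today",
    "i feel ecstatic",
]
max_vocab_size = 3  # tiny, hypothetical limit
UNK = "[UNK]"       # hypothetical spelling of the unknown token

# Keep only the most frequent words; everything else shares the UNK ID.
counts = Counter(word for sentence in corpus for word in sentence.split())
vocab = [word for word, _ in counts.most_common(max_vocab_size)]
token2idx = {word: idx for idx, word in enumerate([UNK] + vocab)}

def encode(sentence):
    """Map each word to its ID, falling back to the UNK ID."""
    return [token2idx.get(word, token2idx[UNK]) for word in sentence.split()]

print(token2idx)
print(encode("i feel happy"))  # [1, 2, 0]
print(encode("i feel sad"))    # [1, 2, 0]
```

The two sentences end up with identical IDs even though they express opposite emotions, which is exactly the information loss described above.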

Wouldn’t it be nice if there was a compromise between character and word tokenization
that preserved all the input information and some of the input structure? There
is: subword tokenization.

On the one hand, we want to split rare words into smaller
units to allow the model to deal with complex words and misspellings. On the other
hand, we want to keep frequent words as unique entities so that we can keep the
length of our inputs to a manageable size.
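To make the idea concrete, here is a toy greedy longest-match splitter in the style of WordPiece. The function name `subword_tokenize` and the hand-written vocabulary are assumptions made for illustration; real tokenizers learn their vocabulary from a corpus and are not implemented this way in practice.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary: frequent words stay whole, rare ones are split.
vocab = {"text", "is", "a", "core", "task", "token", "##izing", "##s"}
for word in ["text", "tokenizing", "tokens", "nlp"]:
    print(word, "->", subword_tokenize(word, vocab))
```

In this sketch `text` stays a single token, while `tokenizing` becomes `token` plus `##izing`. A learned vocabulary usually also contains single characters, so far fewer words fall back to the unknown token than in this toy setup.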

When using pretrained models, it is really important to make sure
that you use the same tokenizer that the model was trained with.
From the model’s perspective, switching the tokenizer is like shuffling
the vocabulary. If everyone around you started swapping
random words like “house” for “cat,” you’d have a hard time
understanding what was going on too!
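The effect of a mismatched tokenizer can be imitated with two toy vocabularies that contain the same tokens but assign them different IDs. Both vocabularies are hypothetical; the point is only that IDs produced by one are read through the other.

```python
tokens = ["cat", "house", "sat", "the", "in"]

# Vocabulary the (imaginary) model was trained with.
train_vocab = {tok: idx for idx, tok in enumerate(tokens)}
# A different tokenizer: same tokens, different ID assignment.
other_vocab = {tok: idx for idx, tok in enumerate(reversed(tokens))}

sentence = "the cat sat in the house".split()
ids_from_other = [other_vocab[tok] for tok in sentence]

# The model interprets IDs according to the vocabulary it was trained with.
idx2token = {idx: tok for tok, idx in train_vocab.items()}
what_the_model_sees = [idx2token[i] for i in ids_from_other]
print(ids_from_other)       # [1, 4, 2, 0, 1, 3]
print(what_the_model_sees)  # ['house', 'in', 'sat', 'cat', 'house', 'the']
```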