# HuggingFace

[Hugging Face](https://huggingface.co/) is a platform and a collection of open-source libraries for machine learning. It provides pre-trained models and datasets, along with tools for training and evaluating models, and it is especially widely used for NLP. We will use its `datasets` library to load the IMDB dataset, along with a few other helpful tools.

In [1]:
import torch
from datasets import load_dataset
from torch.utils.data import random_split

imdb_dataset = load_dataset("imdb")

# Print a summary of the available splits and their sizes
print(imdb_dataset)

# Fix the seed so that the train/validation split is reproducible
torch.manual_seed(0)
train_dataset, valid_dataset = random_split(imdb_dataset["train"], [20000, 5000])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [2]:
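import numpy as np
import pandas as pd

# Temporarily switch the output format to pandas to pull the training split into a DataFrame
imdb_dataset.set_format(type="pandas")
df = imdb_dataset["train"][:]

# Switch the output format to PyTorch tensors for later use
imdb_dataset.set_format(type="torch")
df.head()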

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


# Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. Depending on the scheme, a token can be a single character, part of a word, a whole word, or a longer phrase. The list of tokens becomes the input for further processing.

## Character Tokenization

Character tokenization is the simplest form of tokenization. It breaks the text into individual characters.

In [3]:
# Get the sorted unique characters that exist in the dataset
texts = df['text'].tolist()
unique_chars = sorted(set(' '.join(texts)))
print(unique_chars)

# Tokenize the first input
input_text = texts[0]
print(input_text)

# Create dictionaries that map unique characters to indices and back
char2idx = {ch: i for i, ch in enumerate(unique_chars)}
idx2char = {i: ch for i, ch in enumerate(unique_chars)}

# Create encoder and decoder functions
def encode(s):
    """Convert a string to a list of character indices."""
    return [char2idx[c] for c in s]

def decode(indices):
    """Convert a list of character indices back to a string."""
    return ''.join(idx2char[i] for i in indices)

input_seq = encode(input_text)

# Print the result of encoding and decoding the first input
print(input_seq)
print(decode(input_seq))

['\x08', '\t', '\x10', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x80', '\x84', '\x85', '\x8d', '\x8e', '\x91', '\x95', '\x96', '\x97', '\x9a', '\x9e', '\xa0', '¡', '¢', '£', '¤', '¦', '§', '¨', '«', '\xad', '®', '°', '³', '´', '·', 'º', '»', '½', '¾', '¿', 'À', 'Á', 'Ã', 'Ä', 'Å', 'È', 'É', 'Ê', 'Õ', 'Ø', 'Ü', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ō', '–', '‘', '’', '“', '”', '…', '₤', '\uf0b7']
I rented I AM CURIOUS-YELLOW from my video store because 

# Word-level Tokenization

Word tokenization breaks a stream of text into words. Because each token is already a whole word, the model does not have to learn how characters combine to form words.

As opposed to character-level tokenization, word tokenization requires a much larger vocabulary, because a corpus contains far more unique words than unique characters. In practice, the vocabulary is built from a subset of the most common words, and any other word is mapped to a special unknown token.
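
Capping a vocabulary at the most common words can be sketched as follows. This is a minimal illustration on a toy corpus, not the code used later in this notebook; the names `max_size`, `word2idx`, and `encode_words` are hypothetical, and the three sentences stand in for real reviews.

```python
from collections import Counter

# Toy corpus standing in for the reviews (illustrative only)
corpus = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

# Count how often each whitespace-separated word occurs
counts = Counter(word for line in corpus for word in line.split())

# Keep only the `max_size` most frequent words (hypothetical cap);
# index 0 is reserved for padding and index 1 for unknown words
max_size = 3
word2idx = {"<pad>": 0, "<unk>": 1}
for word, _ in counts.most_common(max_size):
    word2idx[word] = len(word2idx)

def encode_words(text):
    """Map each word to its index, falling back to <unk> for words outside the vocabulary."""
    return [word2idx.get(word, word2idx["<unk>"]) for word in text.split()]

print(word2idx)
print(encode_words("the movie was boring"))
```

Words that fall outside the cap, such as "boring" or "terrible" here, all share the `<unk>` index, which is the price paid for a smaller vocabulary.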

## Word Tokenization with `PyTorch` and `torchtext`

HuggingFace provides access to many popular datasets for NLP. The `torchtext` library provides a simple API for loading and processing text data. It includes a variety of datasets, tokenizers, and data iterators. Using these libraries, we will prepare a dataset as follows:

2. Create a vocabulary using the `torchtext` library.
3. Convert the text to a sequence of integers using the vocabulary.

We will start by creating our own custom tokenizer. Other tokenizers are available through the `torchtext` library; a popular choice is the `spacy` tokenizer, a rule-based tokenizer that uses the `spaCy` library to tokenize text.

In [6]:
# Create a tokenizer which removes punctuation and special characters before splitting the text into words.
# This tokenizer is from Sebastian Raschka's book "Machine Learning with PyTorch and Scikit-Learn"
import re
from collections import Counter, OrderedDict

def tokenizer(text):
    # Strip HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace runs of non-word characters with a space, then append the emoticons without the "-" nose
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = text.split()

    return tokenized

# Create a counter object to count the number of times each word appears in the dataset
counter = Counter()

# Loop through each review and tokenize it
for sample in train_dataset:
    line = sample['text']
    counter.update(tokenizer(line))

print("Vocabulary size: ", len(counter))

# Create an encoder
# `vocab` is torchtext's factory function for building a Vocab object;
# import it under an alias so the `vocab` variable below does not shadow it
from torchtext.vocab import vocab as build_vocab

sorted_tokens = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_tokens)
vocab = build_vocab(ordered_dict)

# Insert <pad> and <unk> tokens for padding when batching fixed sized sequences and for unknown items
vocab.insert_token('<pad>', 0)
vocab.insert_token('<unk>', 1)
vocab.set_default_index(1)

# Create a function to encode the text
def encode_text(text):
    return vocab.lookup_indices(tokenizer(text))

# Create a function to decode the text
def decode_text(encoded_text):
    return vocab.lookup_tokens(encoded_text)

# Create a function to encode the labels
# The Hugging Face IMDB labels are already integers (0 = negative, 1 = positive)
def encode_label(label):
    return int(label)

Vocabulary size:  69241

In [None]:
# Sample a random review and tokenize it
sample_idx = torch.randint(len(train_dataset), size=(1,)).item()
sample_review = train_dataset[sample_idx]['text']
sample_label = train_dataset[sample_idx]['label']
print("Sample review: ", sample_review)

# Encode the review
encoded_review = encode_text(sample_review)
print("Encoded review: ", encoded_review)

# Decode the review
decoded_review = decode_text(encoded_review)
print("Decoded review: ", decoded_review)
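
The `<pad>` token inserted at index 0 exists so that reviews of different lengths can be stacked into one fixed-size batch. A minimal sketch of that padding step is shown below; `pad_batch` is a hypothetical helper written for illustration, and the integer sequences are made up rather than real encoded reviews.

```python
PAD_IDX = 0  # assumed to match the index given to '<pad>' in the vocabulary

def pad_batch(sequences, pad_idx=PAD_IDX):
    """Right-pad every sequence with pad_idx up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

# Made-up encoded reviews of different lengths
batch = [[5, 8, 2], [7], [3, 9]]
print(pad_batch(batch))
```

After padding, every row has the same length, so the batch can be converted to a single tensor.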

# Subword Tokenization

Subword tokenization combines the benefits of character-level and word-level tokenization. Subword tokenizers are learned from the data: frequent words tend to be kept whole, while rarer words are broken into smaller parts, which keeps the vocabulary smaller than a word-level one. This also lets the model handle words it has not seen before, since they can be built from known subword units.

In [None]:
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Encode the review
encoded_review = enc.encode(sample_review)
print("Encoded review: ", encoded_review)

# Decode the review
decoded_review = enc.decode(encoded_review)
print("Decoded review: ", decoded_review)
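
The `gpt2` encoding used above is a byte-pair encoding (BPE). To show what "learned from the data" means, here is a toy sketch of BPE on characters: it repeatedly merges the most frequent adjacent pair of symbols. It is a simplified illustration under stated assumptions, not how `tiktoken` is implemented; `learn_bpe` and `apply_bpe` are hypothetical helpers, and the word list is made up.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy character-level BPE)."""
    # Represent each word as a tuple of symbols, starting from single characters
    word_freqs = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs
        pairs = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with a single merged symbol
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[tuple(merge_pair(list(symbols), best))] += freq
        word_freqs = new_freqs
    return merges

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def apply_bpe(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Made-up training words
words = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_bpe(words, num_merges=4)
print(merges)
print(apply_bpe("lowest", merges))
print(apply_bpe("slower", merges))  # a word that was not in the training list
```

A word that never appeared in training, such as "slower", is still tokenized: it simply falls back to smaller pieces, down to single characters where no merge applies.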

# Stopwords

Stop words are the most common words in a language, such as "the", "is", and "and". For many NLP tasks, these words carry little information and can be removed from the text. NLTK provides stop-word lists for many languages.

In [None]:
import nltk

# Download the stopwords from NLTK
nltk.download('stopwords')

# Import the stopword list
from nltk.corpus import stopwords

# Print the first 10 stopwords
stopwords.words('english')[:10]
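
Removing stop words from a list of tokens can be sketched as follows. To keep the example self-contained, it uses a tiny hand-picked stop-word set (`toy_stopwords`, an assumption made for illustration); in practice the set would be built from the NLTK list, for example `set(stopwords.words('english'))`. The helper `remove_stopwords` is likewise hypothetical.

```python
# A tiny hand-picked stop-word set, used purely for illustration
toy_stopwords = {"the", "a", "is", "and", "of", "to", "it"}

def remove_stopwords(tokens, stop_set=toy_stopwords):
    """Return the tokens that are not in stop_set, comparing case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in stop_set]

tokens = "The plot is thin and the acting is a disaster".split()
print(remove_stopwords(tokens))
```

For sentiment analysis it is worth inspecting the list before filtering, since negation words such as "not" appear in many stop-word lists and can change the meaning of a review.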

# Embeddings

Embeddings represent words as dense vectors learned from the data. They capture the meaning of words and their relationships to other words, and they are used in many NLP tasks such as machine translation, sentiment analysis, and text classification. Besides mapping our tokenized input to a space with far fewer dimensions than a one-hot encoding over the vocabulary, embeddings place words with similar meanings close together in the embedding space.

Google's Word2Vec is a popular embedding model trained on a large corpus of text. It learns either to predict a word from its neighboring words (CBOW) or to predict the neighboring words from the word itself (skip-gram). Other popular embeddings include GloVe and fastText.

In the context of Transformer-based models, the embeddings are learned as part of the model.
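
The idea that similar words sit close together in the embedding space is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors (`toy_embeddings`), which are assumptions chosen for illustration and not learned embeddings; real embeddings typically have hundreds of dimensions.

```python
import math

# Hand-picked vectors, NOT learned embeddings: "good" and "great" are chosen
# to point in a similar direction, while "terrible" points elsewhere
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.1, 0.2],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: values near 1 mean similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(toy_embeddings["good"], toy_embeddings["great"]))
print(cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"]))
```

With these toy vectors, the first similarity is close to 1 and the second is negative, mirroring what one would hope to see between learned vectors for related and opposing words.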

In [None]:
# Sample snippet from https://vaclavkosar.com/ml/transformer-embeddings-and-tokenization

from transformers import DistilBertTokenizerFast, DistilBertModel

# Use a distinct name so the custom `tokenizer` function defined earlier is not overwritten
bert_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
tokens = bert_tokenizer.encode('This is an input.', return_tensors='pt')
print("These are tokens!", tokens)
for token in tokens[0]:
    print("This is a decoded token!", bert_tokenizer.decode([token]))

# Look up the learned embedding vector for each token ID
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
embeddings = model.embeddings.word_embeddings(tokens)
print(embeddings)
for e in embeddings[0]:
    print("This is an embedding!", e)
