# Outline
- [Text Preprocessing](#Text-Preprocessing)
    - [Tokenization](#Tokenization)
    - [Stopwords](#Stopwords)
    - [Stemming](#Stemming)
    - [Lemmatization](#Lemmatization)
    - [Bag of Words](#Bag-of-Words)
    - [TF-IDF](#TF-IDF)
    - [Word Embeddings](#Word-Embeddings)
    - [Word2Vec](#Word2Vec)
    - [GloVe](#GloVe)
    - [FastText](#FastText)
    - [References](#References)

# HuggingFace

[HuggingFace](https://huggingface.co/) is a machine learning platform that provides pre-trained models and datasets, along with libraries for training and evaluating models. It is a widely used resource for NLP. We will use its `datasets` library to load the IMDB dataset, and we will draw on a few other helpful tools along the way.

In [4]:
import torch
from datasets import load_dataset
from torch.utils.data.dataset import random_split

imdb_dataset = load_dataset("imdb")

# Print a summary of the available splits, their features, and their sizes
print(imdb_dataset)

torch.manual_seed(0)
train_dataset, valid_dataset = random_split(imdb_dataset["train"], [20000, 5000])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [5]:
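import numpy as np
import pandas as pd

# Temporarily return pandas objects so the training split can be pulled into a DataFrame
imdb_dataset.set_format(type="pandas")
df = imdb_dataset["train"][:]

# Switch the dataset back to returning PyTorch tensors
imdb_dataset.set_format(type="torch")
df.head()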

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


# Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. Tokens can be individual words, phrases or even whole sentences. The list of tokens becomes input for further processing.
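
To make the idea of granularity concrete, the following minimal sketch splits one short string into sentence-level, word-level, and character-level tokens using only the standard library. The example sentence and the regular expressions are illustrative assumptions, not the rules used by any particular tokenizer later in this notebook.

```python
import re

text = "Tokenization is simple. It splits text!"

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every character, including spaces
char_tokens = list(text)

print(sentence_tokens)
print(word_tokens)
print(len(char_tokens))
```

The finer the granularity, the longer the token sequence becomes and the smaller the set of distinct tokens the model has to know.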

## Character Tokenization

Character tokenization is the simplest form of tokenization. It breaks the text into individual characters.

In [6]:
# Get unique characters that exist in the dataset
unique_chars = set(' '.join(df[:]['text'].tolist()))

# Sort the unique characters
unique_chars = sorted(list(unique_chars))
print(unique_chars)

# Tokenize the first input
input_text = df[:]['text'].tolist()[0]
print(input_text)

# Create a dictionary that maps unique characters to indices
char2idx = {u:i for i, u in enumerate(unique_chars)}
idx2char = {i:u for i, u in enumerate(unique_chars)}

# Create encoder and decoder functions
encode = lambda s: [char2idx[c] for c in s] # String to list of indices
decode = lambda s: ''.join([idx2char[c] for c in s]) # List of indices to string
input_seq = encode(input_text)

# Print the result of encoding and decoding the first input
print(input_seq)
print(decode(encode(input_text)))

['\x08', '\t', '\x10', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x80', '\x84', '\x85', '\x8d', '\x8e', '\x91', '\x95', '\x96', '\x97', '\x9a', '\x9e', '\xa0', '¡', '¢', '£', '¤', '¦', '§', '¨', '«', '\xad', '®', '°', '³', '´', '·', 'º', '»', '½', '¾', '¿', 'À', 'Á', 'Ã', 'Ä', 'Å', 'È', 'É', 'Ê', 'Õ', 'Ø', 'Ü', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ō', '–', '‘', '’', '“', '”', '…', '₤', '\uf0b7']
I rented I AM CURIOUS-YELLOW from my video store because 

## Word-level Tokenization

Word tokenization breaks a stream of text into words, and the resulting list of tokens becomes the input for further processing. Using words instead of characters means that the model does not have to learn how characters combine into words.

Compared with character-level tokenization, word tokenization requires a far larger vocabulary, because a corpus contains many more unique words than unique characters. In practice, the vocabulary is built from a subset of the most common words, and every other word is mapped to a special unknown token.
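
As a rough illustration of keeping only the most common words, the sketch below builds a capped vocabulary from a toy corpus and maps everything else to an `<unk>` token. The corpus, the `max_words` cutoff, and the `encode_words` helper are hypothetical names chosen for this example only.

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data)
corpus = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great",
]

# Count how often each whitespace-separated word occurs
counts = Counter(word for line in corpus for word in line.split())

# Keep only the most frequent words; max_words is an illustrative cutoff
max_words = 2
itos = ["<unk>"] + [word for word, _ in counts.most_common(max_words)]
stoi = {word: idx for idx, word in enumerate(itos)}

def encode_words(line):
    """Map each word to its index, falling back to <unk> (index 0)."""
    return [stoi.get(word, stoi["<unk>"]) for word in line.split()]

print(itos)
print(encode_words("the plot was boring"))
```

Words outside the capped vocabulary, whether rare in the corpus or never seen at all, all share the same index, which is the price paid for a smaller vocabulary.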

## Word Tokenization with `PyTorch` and `torchtext`

HuggingFace provides access to many popular NLP datasets, and the `torchtext` library provides a simple API for loading and processing text data, including datasets, tokenizers, and data iterators. Using these libraries, we will prepare the dataset as follows:

1. Split each review into tokens with a tokenizer.
2. Create a vocabulary using the `torchtext` library.
3. Convert the text to a sequence of integers using the vocabulary.

We will start by creating our own custom tokenizer, although other tokenizers can be used with `torchtext` as well. A popular choice is the `spacy` tokenizer, a rule-based tokenizer that uses the `spaCy` library to split text into tokens.

In [10]:
# Create a tokenizer that removes HTML tags, punctuation, and special characters before splitting the text into words.
# This tokenizer is from Sebastian Raschka's book "Machine Learning with PyTorch and Scikit-Learn"
import re
from collections import Counter, OrderedDict

def tokenizer(text):
    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( so they survive punctuation removal
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Lowercase, replace runs of non-word characters with a space, then append the emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = text.split()

    return tokenized

# Create a counter object to count the number of times each word appears in the dataset
counter = Counter()

# Loop through each review and tokenize it
for sample in train_dataset:
    line = sample['text']
    counter.update(tokenizer(line))

print("Vocabulary size: ", len(counter))

# Create an encoder
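# Import under an alias so the factory function is not shadowed by the `vocab` variable below
from torchtext.vocab import vocab as build_vocab

# Order tokens from most to least frequent so that frequent words receive small indices
sorted_tokens = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_tokens)
vocab = build_vocab(ordered_dict)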

# Insert <unk> and <pad> tokens for unknown items and padding when batching fixed sized sequences
vocab.insert_token('<pad>', 0)
vocab.insert_token('<unk>', 1)
vocab.set_default_index(1)

# Create a function to encode the text
def encode_text(text):
    return vocab.lookup_indices(tokenizer(text))

# Create a function to decode the text
def decode_text(encoded_text):
    return vocab.lookup_tokens(encoded_text)

# Create a function to encode the labels
# The HuggingFace IMDB labels are already integers (0 = negative, 1 = positive), so only a cast is needed
def encode_label(label):
    return int(label)

Vocabulary size:  69241
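
The `<pad>` token matters once reviews of different lengths are batched together. The sketch below shows the idea with plain Python lists; `pad_batch` and the example index values are hypothetical and only illustrate the technique. With tensors, `torch.nn.utils.rnn.pad_sequence` serves the same purpose.

```python
PAD_IDX = 0  # matches the index given to '<pad>' above

def pad_batch(sequences, pad_idx=PAD_IDX):
    """Right-pad each list of token indices to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

# Hypothetical encoded reviews of different lengths
batch = [[12, 7, 431], [5, 98], [3]]
padded = pad_batch(batch)
print(padded)
```

After padding, every sequence in the batch has the same length, so the batch can be stacked into a single rectangular tensor.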


In [11]:
# Sample a random review and tokenize it
sample_idx = torch.randint(len(train_dataset), size=(1,)).item()
sample_review = train_dataset[sample_idx]['text']
sample_label = train_dataset[sample_idx]['label']
print("Sample review: ", sample_review)

# Encode the review
encoded_review = encode_text(sample_review)
print("Encoded review: ", encoded_review)

# Decode the review
decoded_review = decode_text(encoded_review)
print("Decoded review: ", decoded_review)

Sample review:  I saw this movie recently because a friend brought it with him from NYC. After 30 minutes, I said to him," You've got to be kidding. Is this some sort of joke?" He thought it was good. I told him that I thought it was probably one of the silliest movies ever made. "What was it supposed to be?" I asked. "A propaganda movie made for children?" The plot is stupid. The acting is the worst ever for most of the principals and frankly people who look at this sort of tripe and think it has anything to do with life, love or even afterlife, of which it offers an incredibly idiotic view...need some psychiatric help. Please, if someone tries to get you to stick this in your DVD or Video player, consider it like you would a virus introduced into your computer...it won't destroy your player but it will destroy your evening. If they had made Razzies in the '40s, this would have won in every category. (PS. It also goes under the dubious sobriquet of "Stairway to Heaven.")
Encoded revie

## Subword Tokenization

Subword tokenization combines the benefits of character-level and word-level tokenization. Subword tokenizers are learned from the data: frequent words are kept whole, while rare words are broken into smaller parts. This results in a smaller vocabulary than word-level tokenization, and it allows the model to handle words it has not seen before by composing them from known pieces.
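
One common way to learn subword units is byte pair encoding (BPE), which repeatedly merges the most frequent pair of adjacent symbols. The following is a toy sketch of that idea, assuming character-level symbols, no end-of-word marker, and a tiny made-up corpus. Production tokenizers such as GPT-2's operate on bytes and include many more details, and the function names here (`learn_bpe`, `apply_merge`, `segment`) are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start with every word represented as a tuple of single characters
    word_freqs = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in word_freqs.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged_freqs[apply_merge(symbols, best)] += freq
        word_freqs = merged_freqs
    return merges

def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_bpe(corpus, num_merges=4)
print(merges)
print(segment("lowly", merges))
```

Because segmentation falls back to single characters whenever no merge applies, a word that never appeared in the corpus can still be encoded.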

In [14]:
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Encode the review
encoded_review = enc.encode(sample_review)
print("Encoded review: ", encoded_review)

# Decode the review
decoded_review = enc.decode(encoded_review)
print("Decoded review: ", decoded_review)

Encoded review:  [40, 2497, 428, 3807, 2904, 780, 257, 1545, 3181, 340, 351, 683, 422, 19170, 13, 2293, 1542, 2431, 11, 314, 531, 284, 683, 553, 921, 1053, 1392, 284, 307, 26471, 13, 1148, 428, 617, 3297, 286, 9707, 1701, 679, 1807, 340, 373, 922, 13, 314, 1297, 683, 326, 314, 1807, 340, 373, 2192, 530, 286, 262, 49276, 6386, 6918, 1683, 925, 13, 366, 2061, 373, 340, 4385, 284, 307, 1701, 314, 1965, 13, 366, 32, 11613, 3807, 925, 329, 1751, 1701, 383, 7110, 318, 8531, 13, 383, 7205, 318, 262, 5290, 1683, 329, 749, 286, 262, 44998, 290, 17813, 661, 508, 804, 379, 428, 3297, 286, 1333, 431, 290, 892, 340, 468, 1997, 284, 466, 351, 1204, 11, 1842, 393, 772, 45076, 11, 286, 543, 340, 4394, 281, 8131, 4686, 16357, 1570, 986, 31227, 617, 19906, 1037, 13, 4222, 11, 611, 2130, 8404, 284, 651, 345, 284, 4859, 428, 287, 534, 12490, 393, 7623, 2137, 11, 2074, 340, 588, 345, 561, 257, 9471, 5495, 656, 534, 3644, 986, 270, 1839, 470, 4117, 534, 2137, 475, 340, 481, 4117, 534, 6180, 13, 1002, 484, 5

# Stopwords

Stop words are the most common words in a language, such as "the", "is", and "my". For many NLP tasks these words carry little useful information and can be removed from the text. NLTK provides stop word lists for many languages.

In [21]:
import nltk

# Download the stopwords from NLTK
nltk.download('stopwords')

# Import the stopword list
from nltk.corpus import stopwords

# Print the first 10 stopwords
stopwords.words('english')[:10]

[nltk_data] Downloading package stopwords to /home/alex/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
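
Once a stop word list is available, removing stop words amounts to filtering the token list. The sketch below uses a small hand-picked set so that it runs without downloading anything. In the notebook, `set(stopwords.words('english'))` could be used in its place. The `remove_stopwords` helper is a hypothetical name for this example.

```python
# Small hand-picked subset of English stop words (illustrative only)
stop_words = {"i", "me", "my", "the", "a", "is", "this", "from"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in the stop-word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "i rented this film from my video store".split()
filtered = remove_stopwords(tokens, stop_words)
print(filtered)
```

Converting the list to a set first keeps each membership check constant-time, which matters when filtering tens of thousands of reviews.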