## __Natural Language Processing__
<font size=3>

Deep learning for sequence processing has advanced significantly with dense and convolutional layers. However, time-series data presents unique challenges, especially in forecasting and in retaining sequential memory so that meaningful features can be built over long sequences.

One of the most complex forms of sequential data is [natural language](https://en.wikipedia.org/wiki/Natural_language#:~:text=In%20neuropsychology%2C%20linguistics%2C%20and%20philosophy,without%20conscious%20planning%20or%20premeditation.). Natural language covers human expression in any medium: audio, images, or, most commonly in datasets, text. Deep learning tasks involving natural language, such as classification, sentiment analysis, question answering, and translation, require specialized models to interpret these intricate patterns. In this section, we will explore how deep learning models "read" text and perform [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing).

In [1]:
import string
import numpy as np
from tensorflow.keras import layers



### __Text Vectorization:__
<font size=3>

A neural network (NN) is a non-linear function whose weights and biases operate on numbers, so how can it handle text? The answer is to map characters or words to numerical vectors, a process known as __vectorization__.

To illustrate, let's use the [Zen of Python](https://peps.python.org/pep-0020/) statements as our dataset/__corpus__ of sentences:

In [2]:
corpus = ["Beautiful is better than ugly",
          "Explicit is better than implicit",
          "Simple is better than complex",
          "Complex is better than complicated",
          "Flat is better than nested",
          "Sparse is better than dense",
          "Readability counts",
          "Special cases aren't special enough to break the rules",
          "Although practicality beats purity",
          "Errors should never pass silently",
          "Unless explicitly silenced",
          "In the face of ambiguity, refuse the temptation to guess",
          "There should be one -and preferably only one- obvious way to do it",
          "Although that way may not be obvious at first unless you're Dutch",
          "Now is better than never",
          "Although never is often better than right now",
          "If the implementation is hard to explain, it's a bad idea",
          "If the implementation is easy to explain, it may be a good idea",
          "Namespaces are one honking great idea -let's do more of those!"]


### __1. Tokenization:__
<font size=3>

To vectorize a sentence, we must first decide how to split it into characters or words that can be mapped to numbers. These pieces of the sentence are commonly called tokens. There are three main forms of tokenization; let's use the first statement to illustrate them:

<font size=2.5>

   ```python
    sentence = ["Beautiful is better than ugly"]
             
   ```
<font size=3>
    
- __Character to vector__:

<font size=2.5>

   ```python
    tokens = ['B', 'e', 'a', 'u', 't', 'i', 'f', 'u', 'l', 'i', 's', 
              'b', 'e', 't', 't', 'e', 'r', 't', 'h', 'a', 'n', 
              'u', 'g', 'l', 'y']
             
   ```
<font size=3>

- __Word to vector__:

<font size=2.5>

   ```python
    tokens = ['Beautiful', 'is', 'better', 'than', 'ugly']
             
   ```
<font size=3>

- __N-gram to vector__:

<font size=2.5>

   ```python
    # 1-grams/unigram
    tokens = ['Beautiful', 'is', 'better', 'than', 'ugly']

    # 2-grams/bigram
    tokens = ['Beautiful', 'Beautiful is', 'is', 'is better', 
              'better', 'better than', 'than', 'than ugly', 'ugly'] 
    
   ```
<font size=3>

   These forms of [n-gram](https://en.wikipedia.org/wiki/N-gram) are called _bag-of-1-gram_ and _bag-of-2-grams_.

<br/>
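
The three strategies above can be sketched in a few lines of plain Python. The helper names (`char_tokens`, `word_tokens`, `bag_of_ngrams`) are illustrative and do not come from any library. Whitespace is dropped from the character tokens, as in the example above, and since a bag is unordered, the n-grams may come out in a different order than listed.

```python
def char_tokens(text):
    """Split a string into single characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Split a string into words on whitespace."""
    return text.split()


def bag_of_ngrams(text, n=2):
    """Collect every run of 1 to n consecutive words."""
    words = text.split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            grams.append(" ".join(words[start:start + size]))
    return grams


sentence = "Beautiful is better than ugly"
print(char_tokens(sentence)[:9])     # the first word, character by character
print(word_tokens(sentence))
print(bag_of_ngrams(sentence, n=2))  # unigrams first, then bigrams
```
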

The chosen strategy will depend on the task, of course, but we will now work with _word to vector_.

In [3]:
token_list = [text.split() for text in corpus]
token_list[:4]

[['Beautiful', 'is', 'better', 'than', 'ugly'],
 ['Explicit', 'is', 'better', 'than', 'implicit'],
 ['Simple', 'is', 'better', 'than', 'complex'],
 ['Complex', 'is', 'better', 'than', 'complicated']]

<font size=3>
    
After tokenization, sentence vectorization can be performed using two main methods: __one-hot encoding__ and __token/word-embeddings__.

### __2. One-hot encoding__:
<font size=3>

To one-hot encode the corpus, we will:
- define the vocabulary dictionary;
- map each word to an integer index;
- transform the sentences into lists of indices;
- one-hot encode the index lists.
  
We will do this first by hand and then with the Keras API.

#### __2.1 Handmade:__

In [4]:
# defining the vocabulary dictionary:

def reform(word):
    '''
    Lowercase a word and remove its punctuation.
    '''
    word = word.lower()
    return word.translate(str.maketrans('', '', string.punctuation))

vocab_dict = {}
for text in corpus:
    for word in text.split():
        
        word = reform(word)
        
        if word not in vocab_dict:
            vocab_dict[word] = len(vocab_dict) + 1

vocab_dict

{'beautiful': 1,
 'is': 2,
 'better': 3,
 'than': 4,
 'ugly': 5,
 'explicit': 6,
 'implicit': 7,
 'simple': 8,
 'complex': 9,
 'complicated': 10,
 'flat': 11,
 'nested': 12,
 'sparse': 13,
 'dense': 14,
 'readability': 15,
 'counts': 16,
 'special': 17,
 'cases': 18,
 'arent': 19,
 'enough': 20,
 'to': 21,
 'break': 22,
 'the': 23,
 'rules': 24,
 'although': 25,
 'practicality': 26,
 'beats': 27,
 'purity': 28,
 'errors': 29,
 'should': 30,
 'never': 31,
 'pass': 32,
 'silently': 33,
 'unless': 34,
 'explicitly': 35,
 'silenced': 36,
 'in': 37,
 'face': 38,
 'of': 39,
 'ambiguity': 40,
 'refuse': 41,
 'temptation': 42,
 'guess': 43,
 'there': 44,
 'be': 45,
 'one': 46,
 'and': 47,
 'preferably': 48,
 'only': 49,
 'obvious': 50,
 'way': 51,
 'do': 52,
 'it': 53,
 'that': 54,
 'may': 55,
 'not': 56,
 'at': 57,
 'first': 58,
 'youre': 59,
 'dutch': 60,
 'now': 61,
 'often': 62,
 'right': 63,
 'if': 64,
 'implementation': 65,
 'hard': 66,
 'explain': 67,
 'its': 68,
 'a': 69,
 'bad': 70,
 

<font size=3>
    
The $\mathtt{vocab\_dict}$ associates each token/word with an index. To do that, we lowercase the word to avoid something like

<font size=2.5>

```python
    vocab_dict = {"The": 1, "the": 2, ...}
    
```
<font size=3>

and we remove the punctuation to avoid

<font size=2.5>

```python
    vocab_dict = {..., "better": 23, "better!": 24, ...}
    
```

In [5]:
# punctuation characters to be removed:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
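
<font size=3>

The effect of this normalization on vocabulary size can be checked on a handful of tokens. This is a minimal sketch: `normalize` is a stand-in that follows the same idea as `reform`, and the word list is made up for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(word):
    """Lowercase a word and delete its punctuation characters."""
    return word.lower().translate(_STRIP_PUNCT)


raw = ["The", "the", "better", "better!", "it's", "its"]
print(len(set(raw)))                     # 6 distinct raw tokens
print(len({normalize(w) for w in raw}))  # 3 distinct normalized tokens
```

One side effect is worth keeping in mind: deleting the apostrophe merges "it's" with the possessive "its", so the two can no longer be told apart after normalization.
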

<font size=3>

Since sentences vary in length and NN architectures require fixed input sizes, we need to define a maximum length, $\mathtt{max\_len}$, for all sentences. Sentences shorter than $\mathtt{max\_len}$ are padded with zeros, while longer ones are truncated. Each word keeps its unique nonzero index, and index 0 is reserved for padding. So, let's add the "[PAD]" token to $\mathtt{vocab\_dict}$ and pad or truncate each sentence in $\mathtt{corpus}$.
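
The padding rule described above can also be applied to token lists directly, which avoids joining and re-splitting strings. A minimal sketch, assuming padding is appended at the end of the sentence; `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(tokens, max_len, pad_token="[PAD]"):
    """Return exactly max_len tokens: cut the tail or append padding."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


short_tokens = "Readability counts".split()
long_tokens = "Special cases aren't special enough to break the rules".split()

print(pad_or_truncate(short_tokens, 7))  # two words followed by five [PAD]
print(pad_or_truncate(long_tokens, 7))   # only the first seven words survive
```
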

In [6]:
max_len = 7
vocab_dict["[PAD]"] = 0

corpus_pad = [ ]
for text in corpus:
    token_list = text.split()
    
    if len(token_list) < max_len:
        text += (max_len-len(token_list))*" [PAD]"
        
    else:
        text = " ".join(token_list[:max_len])

    corpus_pad.append(text)

corpus_pad

['Beautiful is better than ugly [PAD] [PAD]',
 'Explicit is better than implicit [PAD] [PAD]',
 'Simple is better than complex [PAD] [PAD]',
 'Complex is better than complicated [PAD] [PAD]',
 'Flat is better than nested [PAD] [PAD]',
 'Sparse is better than dense [PAD] [PAD]',
 'Readability counts [PAD] [PAD] [PAD] [PAD] [PAD]',
 "Special cases aren't special enough to break",
 'Although practicality beats purity [PAD] [PAD] [PAD]',
 'Errors should never pass silently [PAD] [PAD]',
 'Unless explicitly silenced [PAD] [PAD] [PAD] [PAD]',
 'In the face of ambiguity, refuse the',
 'There should be one -and preferably only',
 'Although that way may not be obvious',
 'Now is better than never [PAD] [PAD]',
 'Although never is often better than right',
 'If the implementation is hard to explain,',
 'If the implementation is easy to explain,',
 "Namespaces are one honking great idea -let's"]

In [7]:
vocab_size = max(vocab_dict.values()) # maximum token index

onehot = np.zeros((len(corpus), max_len, vocab_size + 1)) # "+1" includes the padding index

for i, text in enumerate(corpus_pad):
    for j, word in enumerate(text.split()):
        if word != "[PAD]": 
            word = reform(word)

        index = vocab_dict.get(word)        
        onehot[i, j, index] = 1

onehot.shape

(19, 7, 81)
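
<font size=3>

The nested loop above can be replaced by a single indexing operation. A minimal sketch, assuming the token ids are already stored in a NumPy integer array; the values in `ids` are those of the first padded sentence under the handmade `vocab_dict`, and `depth` is an illustrative name.

```python
import numpy as np

# Token ids for "beautiful is better than ugly [PAD] [PAD]".
ids = np.array([[1, 2, 3, 4, 5, 0, 0]])
depth = 81  # vocab_size + 1, so that the padding index 0 has its own slot

# Row k of the identity matrix is the one-hot vector for index k,
# so indexing it with the id array builds the whole tensor at once.
onehot = np.eye(depth)[ids]

print(onehot.shape)            # (1, 7, 81)
print(onehot.argmax(axis=-1))  # recovers the original ids
```
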

In [8]:
# one sentence example:
i = 0
text = corpus_pad[i].split()

for word, vec in zip(text, onehot[i]):
    if word != "[PAD]": 
        word = reform(word)
    
    print(word, vec, "\n")
    

beautiful [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

is [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

better [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

than [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.

#### __2.2 Using Keras:__
<font size=3>

To vectorize sentences using the Keras API, we will use the [TextVectorization layer](https://keras.io/api/layers/preprocessing_layers/text/text_vectorization/).

In [9]:
max_len = 7

vectorize = layers.TextVectorization(max_tokens=vocab_size,
                                     standardize='lower_and_strip_punctuation',
                                     split='whitespace',
                                     output_mode='int',
                                     output_sequence_length=max_len)

# get vocabulary from corpus:
vectorize.adapt(corpus)

# get token indexes from corpus:
token_ids = vectorize(corpus)

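# Index 0 ('') is the padding token and index 1 ('[UNK]') stands for unknown words.
# Both count toward max_tokens, so with max_tokens=vocab_size two corpus words
# ('and', 'ambiguity') do not fit in the vocabulary and are mapped to [UNK].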
print("Vocabulary size:", vectorize.vocabulary_size())
print("Vocabulary tokens:", vectorize.get_vocabulary())

Vocabulary size: 80
Vocabulary tokens: ['', '[UNK]', 'is', 'than', 'better', 'to', 'the', 'one', 'never', 'idea', 'be', 'although', 'way', 'unless', 'special', 'should', 'of', 'obvious', 'now', 'may', 'it', 'implementation', 'if', 'explain', 'do', 'complex', 'a', 'youre', 'ugly', 'those', 'there', 'that', 'temptation', 'sparse', 'simple', 'silently', 'silenced', 'rules', 'right', 'refuse', 'readability', 'purity', 'preferably', 'practicality', 'pass', 'only', 'often', 'not', 'nested', 'namespaces', 'more', 'lets', 'its', 'in', 'implicit', 'honking', 'hard', 'guess', 'great', 'good', 'flat', 'first', 'face', 'explicitly', 'explicit', 'errors', 'enough', 'easy', 'dutch', 'dense', 'counts', 'complicated', 'cases', 'break', 'beautiful', 'beats', 'bad', 'at', 'arent', 'are']




In [10]:
token_ids

<tf.Tensor: shape=(19, 7), dtype=int64, numpy=
array([[74,  2,  4,  3, 28,  0,  0],
       [64,  2,  4,  3, 54,  0,  0],
       [34,  2,  4,  3, 25,  0,  0],
       [25,  2,  4,  3, 71,  0,  0],
       [60,  2,  4,  3, 48,  0,  0],
       [33,  2,  4,  3, 69,  0,  0],
       [40, 70,  0,  0,  0,  0,  0],
       [14, 72, 78, 14, 66,  5, 73],
       [11, 43, 75, 41,  0,  0,  0],
       [65, 15,  8, 44, 35,  0,  0],
       [13, 63, 36,  0,  0,  0,  0],
       [53,  6, 62, 16,  1, 39,  6],
       [30, 15, 10,  7,  1, 42, 45],
       [11, 31, 12, 19, 47, 10, 17],
       [18,  2,  4,  3,  8,  0,  0],
       [11,  8,  2, 46,  4,  3, 38],
       [22,  6, 21,  2, 56,  5, 23],
       [22,  6, 21,  2, 67,  5, 23],
       [49, 79,  7, 55, 58,  9, 51]])>
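
<font size=3>

In the output above, indices 0 and 1 are reserved for padding and unknown words, and the remaining indices follow word frequency. The pure-Python sketch below imitates that layout on a tiny made-up corpus. It is not the Keras implementation: `build_vocab` and `lookup` are hypothetical helpers, and ties between equally frequent words are broken by first appearance, which need not match the order the layer produces.

```python
from collections import Counter


def build_vocab(sentences, max_tokens=None):
    """Rank words by frequency; slots 0 and 1 are reserved for '' and '[UNK]'."""
    counts = Counter(word for text in sentences for word in text.lower().split())
    vocab = ["", "[UNK]"] + [word for word, _ in counts.most_common()]
    return vocab[:max_tokens]  # the cap includes the two reserved slots


def lookup(words, vocab):
    """Map words to indices, sending anything outside the vocabulary to 1."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in words]


sentences = ["flat is better than nested",
             "sparse is better than dense",
             "now is better"]
vocab = build_vocab(sentences, max_tokens=6)

print(vocab)                                              # ['', '[UNK]', 'is', 'better', 'than', 'flat']
print(lookup("flat is better than ugly".split(), vocab))  # [5, 2, 3, 4, 1]
```
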

In [11]:
onehot = np.zeros((len(corpus), max_len, vocab_size + 1))

for i, ids in enumerate(token_ids):
    for j, ID in enumerate(ids):
        onehot[i][j][ID] = 1
        

In [12]:
# one sentence example:
i = 0
text = corpus[i].split()

if len(text) < max_len:
    text += ["[PAD]"]*(max_len-len(text))

for word, vec in zip(text, onehot[i]):
    if word != "[PAD]": 
        word = reform(word)
    
    print(word, vec, "\n")
    

beautiful [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0.] 

is [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

better [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] 

than [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.

#### __2.3 In summary:__ one-hot encoding vectorization

In [13]:
# 1. get the sentence:
sentence = corpus[2]
sentence

'Simple is better than complex'

In [14]:
# 2. tokenization and padding:
token_list = sentence.split()

if len(token_list) < max_len:
    token_list += (max_len-len(token_list))*["[PAD]"]

token_list

['Simple', 'is', 'better', 'than', 'complex', '[PAD]', '[PAD]']

In [15]:
# 3. get token IDs:
token_ids = []
for word in token_list:
    
    if word != "[PAD]":
        word = reform(word)

    token_ids.append(vocab_dict.get(word))

token_ids

[8, 2, 3, 4, 9, 0, 0]
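
<font size=3>

Steps 2 and 3 can be folded into one function. Note that `vocab_dict.get(word)` returns `None` for a word that is not in the vocabulary, so the sketch below falls back to an explicit unknown-word id. This is an assumption for illustration: the handmade `vocab_dict` has no "[UNK]" entry, and the `encode` helper and the small `vocab` used here are hypothetical.

```python
import string

PAD_ID, UNK_ID = 0, 1  # hypothetical layout; the handmade vocab_dict has no [UNK] entry


def encode(sentence, vocab, max_len):
    """Normalize, look up, then pad or truncate a sentence to max_len ids."""
    table = str.maketrans("", "", string.punctuation)
    words = [w.lower().translate(table) for w in sentence.split()]
    # Tokens that were pure punctuation become empty strings and are skipped.
    ids = [vocab.get(w, UNK_ID) for w in words if w][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = {"simple": 2, "is": 3, "better": 4, "than": 5, "complex": 6}

print(encode("Simple is better than complex", vocab, max_len=7))  # [2, 3, 4, 5, 6, 0, 0]
print(encode("Simple is better than magic!", vocab, max_len=7))   # [2, 3, 4, 5, 1, 0, 0]
```
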

In [16]:
# 4. one-hot encoding:
max_len = 7
onehot = np.zeros((max_len, vocab_size+1))

for i, ID in enumerate(token_ids):
    onehot[i][ID] = 1

# Now the onehot tensor is ready to feed the NN model!

onehot

array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0.

### __3. Word-embedding:__
<font size=3>

Using one-hot encoding to represent an entire corpus vocabulary results in large, sparse vectors, which are inefficient in both memory and computation. As an alternative, we will use the __word embedding__ method to vectorize the corpus. This approach produces dense, compact vectors that need far less memory and can encode relationships between words. We will explore word embeddings in detail in the next notebook.
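
To give a feel for the size difference, the sketch below compares the number of values needed for one padded sentence under both representations. It is only an illustration: `embed_dim = 4` is an arbitrary choice, and the embedding table is filled with random numbers, whereas a real embedding layer learns its values during training.

```python
import random

vocab_size, embed_dim = 81, 4  # embed_dim is an arbitrary, illustrative choice
random.seed(0)

# One dense row per vocabulary index (random here, learned in practice).
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

token_ids = [8, 2, 3, 4, 9, 0, 0]  # "simple is better than complex [PAD] [PAD]"

# An embedding lookup is just row selection by token id.
dense = [embedding[i] for i in token_ids]

one_hot_values = len(token_ids) * vocab_size  # 7 * 81 = 567 values, almost all zero
dense_values = len(token_ids) * embed_dim     # 7 * 4 = 28 values

print(one_hot_values, dense_values)
```
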

### __References:__
<font size=3>
    
- [Deep Learning with Python](https://books.google.com.br/books/about/Deep_Learning_with_Python.html?id=Yo3CAQAACAAJ&redir_esc=y).