<a target="_blank" href="https://colab.research.google.com/github/avakanski/Fall-2022-Python-Programming-for-Data-Science/blob/main/Lectures/Theme%203%20-%20Model%20Engineering%20Pipelines/Lecture%2017%20-%20Convolutional%20NN%20with%20PyTorch/Lecture%2017%20-%20Convolutional%20NN%20with%20PyTorch.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<a name='section0'></a>
# Lecture 18 Natural Language Processing

- [18.1 Introduction to Natural Language Processing (NLP)](#section1)
- [18.2 Preprocessing Text Data](#section2)
- [18.3 Text Tokenization with Keras Tokenizer](#section3)
- [18.4 ](#section4)
- [18.5 ](#section5)
- [18.6 ](#section6)
- [References](#section7)

<a name='section1'></a>

# 18.1 Introduction to Natural Language Processing (NLP)

***Natural Language Processing (NLP)*** is a branch of computer science (and more broadly, a branch of artificial intelligence) that is concerned with providing computers with the ability to understand texts and human language. 

Common tasks in NLP include:

- *Text classification* — assign a class label to text based on the topic discussed in the text, e.g., sentiment analysis (positive or negative movie review), spam detection, content filtering (detect abusive content).
- *Text summarization/reading comprehension* — summarize a long input document with a shorter text.
- *Speech recognition* — convert spoken language to text.
- *Machine translation* — convert text in a source language to a target language.
- *Part of Speech (PoS) tagging* — mark up words in text as nouns, verbs, adverbs, etc. 
- *Question answering* — output an answer to an input question.
- *Dialog generation* — generate the next reply in a conversation given the history of the conversation.
- *Text generation/language modeling* — generate text to complete the sentence or to complete the paragraph.


<a name='section2'></a>

# 18.2 Preprocessing Text Data

In order to perform operations with text data, it first needs to be converted into a numerical representation.

Converting text data into numerical form for processing by ML models typically involves the following steps:
- *Standardization* - remove punctuation, convert the text to lower case, and optionally apply lemmatization.
- *Tokenization* - break up the text into tokens (e.g., tokens can be individual words, several consecutive words (n-grams), or individual characters).
- *Indexing* - assign a numerical index to each token in the training set (vocabulary).
- Optional step: *Embedding* - assign a numerical vector to each index (e.g., one-hot encoding or word-embedding using embedding models such as word2vec, or GloVe).
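
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production pipeline: the two-sentence `corpus`, the helper names (`standardize`, `tokenize`, `encode`, `one_hot`), and the choice to start indices at 1 are all assumptions made for this example.

```python
import string

# Hypothetical toy corpus used only for this sketch
corpus = ["The cat sat on the mat.", "The dog sat!"]

def standardize(text):
    # Standardization: lower-case the text and strip ASCII punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Tokenization: word-level tokens obtained by splitting on whitespace
    return text.split()

# Indexing: assign an integer to each new token, starting at 1 (assumed convention)
vocab = {}
for doc in corpus:
    for token in tokenize(standardize(doc)):
        if token not in vocab:
            vocab[token] = len(vocab) + 1

def encode(text):
    # Map a text to the list of indices of its tokens
    return [vocab[t] for t in tokenize(standardize(text))]

def one_hot(index, size):
    # Optional embedding step: a vector of zeros with a single 1
    vec = [0] * size
    vec[index - 1] = 1
    return vec

print(vocab)                             # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6}
print(encode("The dog sat on the mat"))  # [1, 6, 3, 4, 1, 5]
print(one_hot(2, len(vocab)))            # [0, 1, 0, 0, 0, 0]
```

Note that `encode` in this sketch raises a `KeyError` for words that are not in `vocab`; handling such words is a separate design decision.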


### Text Standardization

***Standardization*** usually includes some or all of the following steps, depending on the application:
- Remove punctuation marks (such as comma, period) or non-alphabetic characters (@, #, {, ]).
- Change all words to lower-case letters, since the model should consider *Text* and *text* as the same word.

Some NLP tasks can apply additional steps, such as:
- Correct spelling errors or replace abbreviations with the full words. 
- Remove stop words, such as *for*, *the*, *is*, *to*, *some*, etc.; for tasks such as text classification, these words usually contribute little to the meaning of the text.
- Apply stemming and lemmatization, which transform words to their base form, such as changing the word *changing* to *change*, or *grilled* to *grill* since they have a common root. 

Applying text standardization is helpful for training machine learning models, because the models do not need to consider *Text* and *text* as two different words, which reduces the need for a large training dataset. However, depending on the application, text standardization may remove information that is important for some tasks, and this should always be considered when performing text preprocessing.
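
As a rough sketch of these standardization steps, the function below lower-cases the text, removes non-alphanumeric characters, and optionally drops stop words. The `STOP_WORDS` set is a tiny hand-picked list assumed for this example (real stop-word lists are much longer), and the function name and its `remove_stop_words` flag are likewise hypothetical.

```python
import re

# A tiny, hand-picked stop-word list (an assumption for this sketch)
STOP_WORDS = {"for", "the", "is", "to", "some", "this"}

def standardize(text, remove_stop_words=False):
    # Convert to lower case so that "Text" and "text" become the same word
    text = text.lower()
    # Keep only lower-case letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

sample = "Hello, World! This is the BEST movie for the price."
print(standardize(sample))
# hello world this is the best movie for the price
print(standardize(sample, remove_stop_words=True))
# hello world best movie price
```

The same function also shows how information can be lost: after lower-casing, the country abbreviation *US* and the pronoun *us* map to the same word.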

### Tokenization

***Tokenization*** is breaking up the text into a sequence of representative symbols called tokens. A *token* is simply a piece of data that stands in for another, more valuable piece of information. 

Tokenization can be performed at different levels:
- *Character-level tokenization* - where each character is a token, represented by a unique number. One traditional character encoding technique is ASCII (American Standard Code for Information Interchange). With ASCII, we can convert nearly any character in English text to a numeric token. One disadvantage of this type of tokenization is that anagrams (words with the same letters in a different order, such as *silent* and *listen*) are made up of exactly the same tokens, only in a different order, which can affect the performance of machine learning models. Character-level tokenization is not widely used in practice.
- *Word-level tokenization* - where each word is a token, represented by a unique number. This type of encoding works well and is used more often.
- *N-gram tokenization* - where N consecutive words represent a token. For instance, N-grams consisting of two adjacent words are called bigrams, those consisting of three words are called trigrams, etc. N-gram tokens preserve the local word order and can potentially capture more information in the text. For instance, for a spam filtering task, bigram tokens such as *mailing list* or *bank account* may provide more helpful information than word-level tokens.
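
The three tokenization levels can be compared on a single sentence with plain Python. This is a simplified sketch that assumes the text is already standardized; the helper `ngrams` is a hypothetical name introduced for this example.

```python
text = "keras is built on top of tensorflow"

# Character-level: every character (including spaces) is a token
char_tokens = list(text)

# Word-level: every whitespace-separated word is a token
word_tokens = text.split()

def ngrams(words, n):
    # N-gram level: every run of n consecutive words is a token
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigram_tokens = ngrams(word_tokens, 2)

print(char_tokens[:5])  # ['k', 'e', 'r', 'a', 's']
print(word_tokens)      # ['keras', 'is', 'built', 'on', 'top', 'of', 'tensorflow']
print(bigram_tokens)
# ['keras is', 'is built', 'built on', 'on top', 'top of', 'of tensorflow']
```

A sentence with 7 words yields 6 bigrams, and in general a sequence of W words yields W - N + 1 N-grams.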

An example of text standardization and word-level tokenization is shown in the next figure.

<img src="images/tokenization.png" alt="Drawing" style="width: 500px;"/>

<a name='section3'></a>

# 18.3 Text Tokenization with Keras Tokenizer

Keras provides a text preprocessing class `Tokenizer` for converting raw text into sequences of tokens. The `Tokenizer` performs both text standardization and tokenization.

Let's import TensorFlow and `Tokenizer` to show how it can be used to encode text at the character and word level.

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer


The Keras `Tokenizer` has the following arguments:
- *num_words*: the maximum number of words to keep, based on word frequency. If we are not sure, it is better to set a high number, because if we set a number lower than the number of unique words in the text, the least frequent words will not be tokenized.
- *filters*: a string of characters that are removed from the text. By default, it contains punctuation marks and special characters, so these are removed. If we want to change that, we can provide our own string of characters to remove.
- *lower*: can be True or False. By default, it is True, and that means all text will be converted to lower case.
- *split*: separator for splitting words. The default separator is a space (" ").
- *char_level*: can be True or False. By default, it is False and the tokenizer will perform word-level tokenization. If it is True, it will perform character-level tokenization.
- *oov_token*: OOV stands for Out Of Vocabulary, and it denotes a symbol that will be added to the `word_index` and used to replace words that are not in the vocabulary when text is converted to sequences.

### Character-level Tokens

To use the `Tokenizer` for character-level tokenization, we need to set `char_level` to `True`. We will set the number of tokens to 1,000. 

Let's apply it to the following simple sentence by using the method `fit_on_texts()`.

In [4]:
# A sample sentence
sentence = ['TensorFlow is a Machine Learning framework']

In [5]:
tokenizer = Tokenizer(num_words=1000, char_level=True)

# Fitting tokenizer on sentences
tokenizer.fit_on_texts(sentence)

When the `Tokenizer` separates the characters in the text, it creates a dictionary that maps each character to its token. We can inspect the dictionary by using the attribute `word_index`, although in this case it is a character index, since we have set `char_level` to `True`.

Note that the tokens start at 1. By default, all letters are converted to lower case. The first token is the space character `' '`, the second is the letter `'e'`, etc. There are 17 unique characters in the sentence, including the space.

In [6]:
char_index = tokenizer.word_index
print(char_index)

{' ': 1, 'e': 2, 'n': 3, 'r': 4, 'a': 5, 'o': 6, 'i': 7, 's': 8, 'f': 9, 'l': 10, 'w': 11, 'm': 12, 't': 13, 'c': 14, 'h': 15, 'g': 16, 'k': 17}


The method `texts_to_sequences` outputs the tokens for the text. You can check that the word `TensorFlow` has the tokens 13, 2, 3, 8, 6, 4, 9, 10, 6, 11, where each of the tokens corresponds to a letter listed in the `char_index`.

In [9]:
print(tokenizer.texts_to_sequences(sentence))

[[13, 2, 3, 8, 6, 4, 9, 10, 6, 11, 1, 7, 8, 1, 5, 1, 12, 5, 14, 15, 7, 3, 2, 1, 10, 2, 5, 4, 3, 7, 3, 16, 1, 9, 4, 5, 12, 2, 11, 6, 4, 17]]


As we mentioned earlier, character-level tokenization is rarely used, because it is challenging to deal with words that have the same characters in a different order (anagrams), since they are made up of the same tokens.
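
The anagram issue can be checked directly. The sketch below uses ASCII codes (via Python's built-in `ord`) as character tokens instead of the Keras character index, which is an assumption made to keep the example free of dependencies; the helper name `char_tokens` is hypothetical.

```python
def char_tokens(word):
    # Use the ASCII code of each lower-cased character as its token
    return [ord(c) for c in word.lower()]

silent = char_tokens("silent")
listen = char_tokens("listen")

print(silent)  # [115, 105, 108, 101, 110, 116]
print(listen)  # [108, 105, 115, 116, 101, 110]

# The two words contain exactly the same tokens...
print(sorted(silent) == sorted(listen))  # True
# ...and differ only in the order in which the tokens appear
print(silent == listen)                  # False
```

Any model that ignores token order, for example one that only counts how often each token occurs, cannot tell these two words apart.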

### Word-level Tokens

To use the `Tokenizer` for tokenizing words instead of characters, we just need to change the argument `char_level` to `False`. Since this is the default setting, we may as well omit it.

In [12]:
# Sample sentences
sentences = ['TensorFlow is a Machine Learning framework.',
             'Keras is a well designed deep learning API!',
             'Keras is built on top of TensorFlow!']    

After the text is broken down into individual words, the `Tokenizer` builds a *vocabulary* of all words that are found in the input text, and assigns a unique integer to each word in the vocabulary. We can again inspect the words by using the attribute `word_index`.

In [13]:
tokenizer = Tokenizer(num_words=1000)

# Fitting tokenizer on sentences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'is': 1, 'tensorflow': 2, 'a': 3, 'learning': 4, 'keras': 5, 'machine': 6, 'framework': 7, 'well': 8, 'designed': 9, 'deep': 10, 'api': 11, 'built': 12, 'on': 13, 'top': 14, 'of': 15}


There are 15 unique words in the above sentences. By default, all punctuation is removed and all letters are converted to lower case.

Also, the attribute `word_counts` returns the number of times each word appears in the sentences.

In [14]:
word_counts = tokenizer.word_counts
word_counts

OrderedDict([('tensorflow', 2),
             ('is', 3),
             ('a', 2),
             ('machine', 1),
             ('learning', 2),
             ('framework', 1),
             ('keras', 2),
             ('well', 1),
             ('designed', 1),
             ('deep', 1),
             ('api', 1),
             ('built', 1),
             ('on', 1),
             ('top', 1),
             ('of', 1)])
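
Comparing `word_counts` with `word_index` suggests that the index is ordered by word frequency: *is* appears three times and has index 1, followed by the words that appear twice. The plain-Python sketch below is not the Keras implementation; it only reproduces the same index under the assumptions that words are ranked by count and that ties keep the order in which the words were first seen. The helper `simple_words` is a hypothetical name and handles only the punctuation used in these sentences.

```python
from collections import Counter

sentences = ['TensorFlow is a Machine Learning framework.',
             'Keras is a well designed deep learning API!',
             'Keras is built on top of TensorFlow!']

def simple_words(text):
    # Lower-case, drop the two punctuation marks used above, split on spaces
    return text.lower().replace('.', '').replace('!', '').split()

# Count the words; a Counter remembers the order of first appearance
counts = Counter()
for s in sentences:
    counts.update(simple_words(s))

# Rank by count, highest first; Python's sort is stable, so ties keep first-seen order
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

print(word_index)
# {'is': 1, 'tensorflow': 2, 'a': 3, 'learning': 4, 'keras': 5, 'machine': 6,
#  'framework': 7, 'well': 8, 'designed': 9, 'deep': 10, 'api': 11, 'built': 12,
#  'on': 13, 'top': 14, 'of': 15}
```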

The tokens for the above three sentences are shown below. For instance, the first list `[2, 1, 3, 6, 4, 7]` represents the first sentence in the text, `TensorFlow is a Machine Learning framework`.

In [15]:
print(tokenizer.texts_to_sequences(sentences))

[[2, 1, 3, 6, 4, 7], [5, 1, 3, 8, 9, 10, 4, 11], [5, 1, 12, 13, 14, 15, 2]]


### Out of Vocabulary Words

To handle the case when the `Tokenizer` is applied to text that contains words which were not present in the original documents, we can define a special token `oov_token`. This token will be used to replace the words that are Out Of Vocabulary (OOV).

In the example below, we set the `oov_token` argument, and the OOV token is assigned the index `1`.

In [16]:
tokenizer = Tokenizer(num_words=1000, oov_token='Word Out of Vocab')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'Word Out of Vocab': 1, 'is': 2, 'tensorflow': 3, 'a': 4, 'learning': 5, 'keras': 6, 'machine': 7, 'framework': 8, 'well': 9, 'designed': 10, 'deep': 11, 'api': 12, 'built': 13, 'on': 14, 'top': 15, 'of': 16}


In [17]:
# Converting text to sequences
print(tokenizer.texts_to_sequences(sentences))

[[3, 2, 4, 7, 5, 8], [6, 2, 4, 9, 10, 11, 5, 12], [6, 2, 13, 14, 15, 16, 3]]


Next, if we pass text with new words that the tokenizer was not fit on, the new words will be replaced with the `oov_token`.

In [18]:
new_sentences = ['I like TensorFlow', # I and like are new words
                'Keras is a superb deep learning API'] # superb is a new word 

print(tokenizer.texts_to_sequences(new_sentences))

[[1, 1, 3], [6, 2, 4, 1, 11, 5, 12]]
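
Conceptually, the replacement amounts to a dictionary lookup with a fallback value. The sketch below copies the `word_index` printed earlier and reproduces the sequences above in plain Python; it is an illustration of the idea rather than the Keras code, and the names `OOV_INDEX` and `encode` are hypothetical.

```python
# word_index copied from the tokenizer output shown earlier
word_index = {'Word Out of Vocab': 1, 'is': 2, 'tensorflow': 3, 'a': 4,
              'learning': 5, 'keras': 6, 'machine': 7, 'framework': 8,
              'well': 9, 'designed': 10, 'deep': 11, 'api': 12,
              'built': 13, 'on': 14, 'top': 15, 'of': 16}
OOV_INDEX = word_index['Word Out of Vocab']

def encode(text):
    # Lower-case and split on spaces; unknown words fall back to the OOV index
    return [word_index.get(word, OOV_INDEX) for word in text.lower().split()]

new_sentences = ['I like TensorFlow',
                 'Keras is a superb deep learning API']

print([encode(s) for s in new_sentences])
# [[1, 1, 3], [6, 2, 4, 1, 11, 5, 12]]
```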


In addition, if we work with a large dataset of documents, we can limit the number of words in the vocabulary to 20,000 or 30,000, and treat the rare words as out-of-vocabulary words. This can reduce the feature space of the model by ignoring words that appear only once or twice in the dataset.
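
The effect of capping the vocabulary can be sketched in plain Python on the small frequency-ranked `word_index` from this section. The cutoff rule used here (any index at or above `num_words` is treated as out of vocabulary) is an assumption of this sketch, and the `encode` helper is hypothetical.

```python
# Frequency-ranked word_index copied from the tokenizer output shown earlier
word_index = {'Word Out of Vocab': 1, 'is': 2, 'tensorflow': 3, 'a': 4,
              'learning': 5, 'keras': 6, 'machine': 7, 'framework': 8,
              'well': 9, 'designed': 10, 'deep': 11, 'api': 12,
              'built': 13, 'on': 14, 'top': 15, 'of': 16}
OOV_INDEX = 1

def encode(text, num_words=None):
    ids = []
    for word in text.lower().split():
        index = word_index.get(word, OOV_INDEX)
        # Assumed cutoff rule: indices at or above num_words count as rare words
        if num_words is not None and index >= num_words:
            index = OOV_INDEX
        ids.append(index)
    return ids

sentence = 'Keras is built on top of TensorFlow'
print(encode(sentence))               # [6, 2, 13, 14, 15, 16, 3]
print(encode(sentence, num_words=7))  # [6, 2, 1, 1, 1, 1, 3]
```

With the cap in place, the frequent words keep their own indices, while the words that occur only once all collapse into the single OOV index.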

Work in progress....