# Recurrent Neural Networks and Natural Language Processing

## Natural Language Processing (NLP)

Natural language processing is a field of computer science and artificial intelligence that focuses on how computers can understand, interpret, and generate human language. It is an application of machine learning in which the training data for the algorithm is human language, usually text. NLP allows computers to read and understand text, hear and interpret spoken words, and write or speak in a way that makes sense to people. If you have ever used a voice assistant like Siri, an AI chatbot like ChatGPT, or autocorrect software like Grammarly, then you have used software that has been trained to do natural language processing.

Natural language processing has the following steps. First, the text is broken down into words or parts of words, called tokens, which are converted to numbers. This process, called _tokenization_, is the first step in helping a computer make sense of human text. Computers cannot work with words directly, but they can work with numbers and sequences of numbers. Tokenization is described in more detail in the third section of these notes. After tokenization, the NLP algorithm looks for patterns in the sequences of numbers so that it can begin to learn grammar and structure. It also learns to assign meaning to the tokens so they can be used in the correct context. Note that the same word can mean different things in different contexts: a river _bank_ is significantly different from the _Bank_ of America. A simple tokenizer gives both uses the same number, so the model has to rely on the surrounding tokens to tell them apart. After the algorithm is trained, it should be able to produce sequences of tokens (which are then converted back into text) that make sense in terms of grammar, structure, and word choice.

Some common tasks that are completed with natural language processing are sentiment analysis (what feeling does the text convey), text summarization, and translation. We will do sentiment analysis this week and in two weeks the homework will involve using natural language processing to build a translator.

## Recurrent Neural Networks for Natural Language Processing

Last week we saw recurrent neural networks (RNNs) used to find patterns in numeric or _time series_ data. RNNs are built for sequenced data, making them a natural candidate for NLP. Because RNNs have a memory (the hidden state), they can remember previous inputs, which helps them pick up on context and longer-range patterns in the data. Since language is all about order and context, this is important.
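
To make the idea of "memory" concrete, here is a minimal sketch of a simple RNN with a single hidden unit, written in plain Python. The weights `w_in` and `w_rec` are made-up values chosen for illustration (a real network learns them during training), and the inputs are just numbers standing in for tokens. The point is only that the hidden state carries information forward, so the same inputs in a different order give a different result.

```python
import math

def simple_rnn(inputs, w_in=0.5, w_rec=0.9, h0=0.0):
    """Run a one-unit simple RNN over a sequence and return every hidden state.

    w_in and w_rec are illustrative, hand-picked weights (not trained values).
    """
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state,
        # which is how earlier inputs influence later steps.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

forward = simple_rnn([1.0, 2.0, 3.0])
backward = simple_rnn([3.0, 2.0, 1.0])

# Same inputs, different order: the final hidden states are not the same.
print("Final state (1, 2, 3):", forward[-1])
print("Final state (3, 2, 1):", backward[-1])
```

A model that ignored order (for example, one that only summed its inputs) would give the same answer for both sequences, which is why order-aware models matter for language.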

RNNs have some limitations in NLP applications. One is the vanishing gradient problem: during training, the gradients passed backward through many time steps shrink toward zero, so the weights stop learning from earlier parts of the sequence. RNNs can also be slow to train because they process a sequence one step at a time. Simple RNN layers have a relatively short-term memory, so they are not the best option for NLP. LSTMs and GRUs were commonly used for NLP in the 2010s since they have a longer memory and train more reliably on long sequences than simple RNNs. However, most modern NLP applications use transformers, which we will cover in the following three weeks. Transformers are better able to find and comprehend long-range patterns, have algorithms which assign different weights to all words (they do not just remember recent words), and are parallelizable during training, which speeds up the process. Though RNNs are no longer used extensively in NLP, they were the foundational models in the field, and it is important to understand how they handle data before moving on to transformers.
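
The vanishing gradient problem can be seen with a little arithmetic. The sketch below uses a one-unit simple RNN, $h_t = \tanh(w\,h_{t-1} + x)$, and applies the chain rule to track how much the final hidden state depends on the initial one. The weight `w_rec` and the constant input `x` are assumptions picked for illustration, not values from any trained model.

```python
import math

def gradient_wrt_first_state(num_steps, w_rec=0.9, x=0.5):
    """Return d h_T / d h_0 for a one-unit simple RNN after num_steps steps.

    w_rec and x are illustrative values, not trained weights or real data.
    """
    h = 0.0
    grad = 1.0
    for _ in range(num_steps):
        h = math.tanh(w_rec * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_rec * (1 - h_t^2)
        grad *= w_rec * (1.0 - h * h)
    return grad

for steps in (1, 5, 20, 50):
    print(f"{steps:>2} steps back: gradient = {gradient_wrt_first_state(steps):.3e}")
```

Each step multiplies the gradient by a factor smaller than one, so the influence of early inputs on the training signal fades quickly as the sequence gets longer. LSTMs and GRUs add gates that are designed to reduce this effect.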

## Tokenization
_Tokenization_ is the process of breaking down text into words, sub-words, characters, or sentences and converting each piece into a unique number. This is a pre-processing step which is needed before an RNN (or transformer) can be trained on text as computers can only comprehend numbers. The data set we will use this week has already been tokenized for you, but this is not always the case so it is important to understand the process.

As an example of tokenization, let's consider the sentence "The dog is playing with its ball." The first way to tokenize this sentence is for the entire sentence to be one token, i.e. a unique identifying number is assigned to the entire sentence. The second way to perform tokenization is to break the sentence down into its individual words. Thus "The", "dog", "is", "playing", "with", "its", and "ball" will all be unique tokens, and the NLP model will learn the sequence between them. A third way to do tokenization is to break the sentence down into individual characters, with each character receiving a token number. Finally, we can break the sentence down into words and sub-words. For example, "playing" does not have to be treated as a single unit; it can be broken into "play" and "ing". "Play" is the verb, and "ing" adds context to the verb (i.e. the event is happening now). This process is called sub-word tokenization and is used to train models like ChatGPT. It is useful because it allows the model to deal with rarely used or unknown words, compound words, and spelling variations.
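
The four approaches above can be sketched in a few lines of plain Python. This is only an illustration: the suffix list `SUFFIXES` and the helper `split_subwords` are hypothetical names I made up for this example, and real sub-word tokenizers learn their splits from data rather than using a hand-written rule like this one.

```python
sentence = "The dog is playing with its ball."

# 1. Sentence-level: the whole sentence is a single token.
sentence_tokens = [sentence]

# 2. Word-level: drop the final period and split on spaces.
word_tokens = sentence.rstrip(".").split()

# 3. Character-level: every character (including spaces) is a token.
char_tokens = list(sentence)

# 4. Sub-word level: a naive rule that peels off a known suffix.
SUFFIXES = ("ing",)  # hypothetical, hand-written suffix list

def split_subwords(word):
    """Split a word into [stem, suffix] if it ends in a known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]

subword_tokens = [piece for word in word_tokens for piece in split_subwords(word)]

print("Sentence: ", sentence_tokens)
print("Words:    ", word_tokens)
print("Chars:    ", char_tokens)
print("Sub-words:", subword_tokens)
```

Notice the trade-off: sentence-level tokens give a one-item sequence but a huge vocabulary, character-level tokens give a tiny vocabulary but long sequences, and word and sub-word tokens sit in between.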

Below is an example of the Keras class `Tokenizer`, which takes a sentence and converts it into tokens that can then be used to train an NLP model:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["The dog is playing with the ball."]

# Create the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(sentences)

print("Word Index:", tokenizer.word_index)
print("Sequences:", sequences)
```

Output:

```
Word Index: {'the': 1, 'dog': 2, 'is': 3, 'playing': 4, 'with': 5, 'ball': 6}
Sequences: [[1, 2, 3, 4, 5, 1, 6]]
```
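
To see where those numbers come from, here is a rough plain-Python sketch of what the tokenizer is doing for this sentence: lowercase the text, strip punctuation, count the words, and give the most frequent word the smallest number. The helper names (`clean`, `build_word_index`, `encode`, `decode`) are my own, and the punctuation handling is a simplification of the Keras default filter, so treat this as an illustration rather than a replacement for `Tokenizer`.

```python
import string
from collections import Counter

def clean(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Number words from 1, most frequent first (ties keep first-seen order)."""
    counts = Counter()
    for text in texts:
        counts.update(clean(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Convert each text to a list of token numbers, skipping unknown words."""
    return [[word_index[w] for w in clean(t) if w in word_index] for t in texts]

def decode(sequence, word_index):
    """Convert a list of token numbers back into (lowercased) text."""
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word[i] for i in sequence)

sentences = ["The dog is playing with the ball."]
word_index = build_word_index(sentences)
sequences = encode(sentences, word_index)

print("Word Index:", word_index)
print("Sequences:", sequences)
print("Decoded:", decode(sequences[0], word_index))
```

"the" appears twice, so it gets token 1 and shows up twice in the sequence. Decoding recovers the words but not the original capitalization or punctuation, since those were thrown away during cleaning.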


In the homework you have an example of a data set which has already been tokenized, but I do give you the code to convert between the tokenized form and the original text. The data set contains thousands of tokenized movie reviews, and the goal is to classify each review as positive or negative (sentiment analysis). I encourage you to play around with the code given in the homework before starting on the assignment to get a feel for the data set and the starting model. Try different values for the parameters and hyperparameters to see how things change.

Note that next week we will discuss much more complex forms of tokenization, but for now this method will work for our initial analysis of natural language processing.