# Overview of Natural Language Processing
__MATH 3480__ - Dr. Michael Olson

Reading:
* Geron, 
* [YouTube: What is NLP (Natural Language Processing)? - IBM Technology](https://www.youtube.com/watch?v=fLvJ8VdHLA0)
* [YouTube: Natural Language Processing in 5 minutes - Simplilearn](https://www.youtube.com/watch?v=CMrHM8a3hqw)

-----
## What is Natural Language Processing?
__Natural Language Processing__ (NLP) is the field concerned with getting computers to interpret human language. This is difficult because words are not numeric. However, there is still a lot we can do.

When we speak, we share thoughts in *Unstructured* language.
> Add eggs and milk to my shopping list

But computers understand thoughts in a *Structured* language.
```xml
<list name="shopping">
  <item>eggs</item>
  <item>milk</item>
</list>
```

NLP is the connection between structured and unstructured language. It works in two directions:
* Natural Language Understanding (NLU) - Going from unstructured language to structured language
* Natural Language Generation (NLG) - Going from structured language to unstructured language
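
The two directions above can be sketched in a few lines of Python. This is only a toy illustration, assuming a single fixed sentence pattern ("Add ... to my ... list"); the function names `understand` and `generate` and the dictionary layout are hypothetical choices, not part of any library.

```python
import re

def understand(sentence):
    """NLU sketch: unstructured sentence -> structured dictionary."""
    match = re.fullmatch(r"add (.+) to my (\w+) list", sentence.strip().lower())
    if match is None:
        return None  # the sentence does not fit the only pattern we know
    # Split the item phrase on commas and on the word "and"
    items = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", match.group(1))
    return {"list": match.group(2), "items": [item for item in items if item]}

def generate(command):
    """NLG sketch: structured dictionary -> unstructured sentence."""
    return f"Add {' and '.join(command['items'])} to my {command['list']} list"

command = understand("Add eggs and milk to my shopping list")
print(command)            # the structured form
print(generate(command))  # back to a sentence
```

A real system cannot rely on one hand-written pattern, which is why the rest of these notes look at more general approaches.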

Uses:
1. Machine Translation
2. Virtual assistants and chatbots
3. Sentiment analysis
4. Spam detection

## The Landscape
<img src="https://github.com/drolsonmi/math3480/blob/main/Notes/Images/3480_NLP.png?raw=true" height=300 alt="The Natural Language Processing landscape">

When deep learning (neural networks) is used in NLP, we are in the field of __Deep Natural Language Processing__ (DNLP). There are many examples of DNLP algorithms. One that is commonly used is the __Sequence-to-Sequence__ (Seq2Seq) model.

## The Steps of NLP
1. Segmentation
    * Break the document down into sentences
2. Tokenizing
    * Find the individual words used in each sentence
3. Stop words
    * Remove unimportant words that do not add much to the meaning
    * "a", "and", "the"
4. Stemming
    * Some words are different, but share the same root
    * "swim", "swims", "swimming" all reduce to the stem "swim"
5. Lemmatization
    * Map each word to its dictionary form (lemma), including irregular forms that stemming misses
    * ["Am", "Are", "Is"] come from the same verb: "Be"; "swam" comes from "swim"
6. Part-of-Speech Tagging
    * What role does each token play in the sentence?
    * Label each word as Noun, Verb, Preposition, etc.
7. Named Entity Tagging or Named Entity Recognition
    * Flagging names of locations, movies, people, etc. that may occur in the document
    * Is there an entity associated with a particular token?
        * "Utah" is a state in the United States
        * "Michael Olson" is a professor at Snow College

<img src="https://github.com/drolsonmi/math3480/blob/main/Notes/Images/3480_NLP_Process.png?raw=true" height=400 alt="The steps of Natural Language Processing">
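
The first four steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production pipeline: the stop-word set contains only the three examples from the list, and `stem` is a hypothetical, very crude suffix stripper rather than a real stemming algorithm. Steps 5 through 7 need dictionaries or trained models, so they are left out here.

```python
import re

STOP_WORDS = {"a", "and", "the"}  # only the examples above; real lists are much longer

def segment(document):
    """Step 1: split a document into sentences after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def tokenize(sentence):
    """Step 2: lowercase the sentence and pull out the individual words."""
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stop_words(tokens):
    """Step 3: drop words that add little to the meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Step 4: crude suffix stripping (hypothetical rules, for illustration only)."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if token[-1] == token[-2]:  # undo a doubled letter: "swimm" -> "swim"
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token  # irregular forms such as "swam" are left untouched

document = "The dog swims. A cat and the dog like swimming!"
for sentence in segment(document):
    tokens = remove_stop_words(tokenize(sentence))
    print([stem(t) for t in tokens])
```

Each function takes the output of the previous step, which mirrors the order of the list: sentences first, then tokens, then filtering, then stems.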

-----


## Some Examples of NLP models

__Basic NLP models:__
* If/else Rules
   * A list of possible questions and the answers to give those questions
   * Early chatbots
   * Long and cumbersome, often took users into areas unrelated to what they really wanted
* Audio Frequency Components Analysis
   * Analyze sound waves and match their patterns against typical patterns for specific words or phrases
   * Used in speech recognition
   * Utilizes Fourier analysis (discussed in MATH 3280)
* Bag-of-words model
   * Used for classification
   * A list of words and how frequently they are associated with a certain classification
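
To make the first item in this list concrete, here is a minimal sketch of an if/else rule chatbot. The questions and answers in `RULES` are invented for illustration.

```python
# Hypothetical question/answer pairs for a rule-based chatbot
RULES = {
    "what are your hours": "We are open 9 to 5, Monday through Friday.",
    "where are you located": "We are on Main Street.",
}

def reply(question):
    """Look the question up in the rule list; give up if it is not there."""
    key = question.lower().strip(" ?!.")
    if key in RULES:
        return RULES[key]
    else:
        return "Sorry, I do not understand the question."

print(reply("What are your hours?"))
print(reply("When do you open?"))  # same intent, but no matching rule
```

The second call shows why this approach gets long and cumbersome: every rewording of a question needs its own rule.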

We'll do an example of the bag-of-words model in a minute.

__Deep Natural Language Processing (DNLP) examples:__
* Bag-of-words
    * Same as above, but we can incorporate it into a Deep Neural Network (DNN)
* Recurrent Neural Networks (RNN) for text recognition
* Convolutional Neural Networks (CNN) for text recognition
    * https://dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/
* Sequence-to-Sequence (Seq2Seq)
    * Text prediction

### Bag-of-words example

Imagine you have an assignment that is graded and given feedback. Some examples are as follows:

| Grade | Feedback |
| :---: | :------- |
|   A   | Fantastic work! |
|   A   | Great job! |
|   A   | Perfect! |
|   B   | Good job |
|   B   | Almost perfect |
|   B   | So close |
|   C   | Good, but needs work |
|   C   | Needs some help |
|   D   | Poor work |
|   D   | Try harder next time |

Now, we create a list (bag) of these words and count how frequently they are associated with each grade.

| Word      |   A   |   B   |   C   |   D   |
| :-------- | :---: | :---: | :---: | :---: |
| Fantastic |   1   |   0   |   0   |   0   |
| Work      |   1   |   0   |   1   |   1   |
| Great     |   1   |   0   |   0   |   0   | 
| job       |   1   |   1   |   0   |   0   |
| Perfect   |   1   |   1   |   0   |   0   |
| Good      |   0   |   1   |   1   |   0   |
| Almost    |   0   |   1   |   0   |   0   |
| So        |   0   |   1   |   0   |   0   |
| close     |   0   |   1   |   0   |   0   |
| but       |   0   |   0   |   1   |   0   |
| needs     |   0   |   0   |   2   |   0   |
| some      |   0   |   0   |   1   |   0   |
| help      |   0   |   0   |   1   |   0   |
| Poor      |   0   |   0   |   0   |   1   |
| Try       |   0   |   0   |   0   |   1   |
| harder    |   0   |   0   |   0   |   1   |
| next      |   0   |   0   |   0   |   1   |
| time      |   0   |   0   |   0   |   1   |
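
The table above can be reproduced with a short Python sketch. It assumes a very simple tokenizer (lowercase, letters only, punctuation dropped); the names `FEEDBACK`, `tokenize`, and `counts` are chosen here for illustration.

```python
import re
from collections import Counter, defaultdict

FEEDBACK = [
    ("A", "Fantastic work!"), ("A", "Great job!"), ("A", "Perfect!"),
    ("B", "Good job"), ("B", "Almost perfect"), ("B", "So close"),
    ("C", "Good, but needs work"), ("C", "Needs some help"),
    ("D", "Poor work"), ("D", "Try harder next time"),
]

def tokenize(text):
    """Lowercase the text and keep only the words (punctuation is dropped)."""
    return re.findall(r"[a-z]+", text.lower())

# counts[grade][word] = number of times the word appears in feedback for that grade
counts = defaultdict(Counter)
for grade, text in FEEDBACK:
    counts[grade].update(tokenize(text))

vocabulary = sorted({word for c in counts.values() for word in c})

# Print one row per word and one column per grade
print(f"{'word':<10}" + "".join(f"{g:^5}" for g in "ABCD"))
for word in vocabulary:
    print(f"{word:<10}" + "".join(f"{counts[g][word]:^5}" for g in "ABCD"))
```

A `Counter` returns 0 for a word it has never seen, which is what fills in the zeros of the table.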

How big is this array of words? (Do a Google search for "How many words are used in the English language?")
* Over 170,000 words available
* Only 20,000 commonly used
* Only 3,000 really needed in everyday speech

We create an array of zeros with 20,000 positions, one for each commonly used word. This is our bag of words.
$$[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,\dots,0]$$

* Each position represents a word
* The first two spots and the last spot in the array are reserved
    * `array[0] = SOS` where SOS = Start of Sentence
    * `array[1] = EOS` where EOS = End of Sentence
    * `array[n]`, the last element of the array, counts any special words (such as names) that are not in our vocabulary
* Fill each position with a count of how often that word occurs in the text being encoded

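$$[\text{SOS}, \text{EOS}, \text{a}, \text{of}, \text{the}, \text{if}, \text{is}, \text{did}, \text{not}, \text{and}, \text{me}, \text{you}, \text{have}, \text{get}, \text{good}, \text{perfect}, \text{worth}, \text{amazing}, \text{fantastic}, \text{job}, \text{work}, \dots$$
$$\qquad\dots \text{your}, \text{mine}, \text{","}, \text{"!"}, \text{"."}, \text{know}, \text{question}, \text{shall}, \text{go}, \text{let}, \dots, (\text{special words})]$$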

Note that different forms of the same word are grouped together under one stem:
* worthy ==> worth
* questions ==> question

Quick example: On one of your assignments, I write the following feedback:
> Fantastic work, John! Your work is worthy of JPL! Let me know if you have questions. Michael.

Putting this into the bag of words (the final 3 counts the special words John, JPL, and Michael),
$$[1,1,0,1,0,1,1,0,0,0,1,1,1,0,0,0,1,0,1,0,2,\dots$$
$$\qquad 1,0,1,2,2,1,1,0,0,1,\dots, 3]$$
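
The encoding above can be checked with a short Python sketch. It makes several assumptions: `VOCAB` holds only the words actually listed in the vocabulary array (the elided parts are left out, so the array is far shorter than 20,000), `STEMS` is a hypothetical lookup standing in for a real stemmer, and SOS and EOS are counted once for the whole piece of feedback, as in the array above.

```python
import re

# Only the words listed in the notes; the elided parts of the vocabulary are omitted
VOCAB = ["SOS", "EOS", "a", "of", "the", "if", "is", "did", "not", "and", "me",
         "you", "have", "get", "good", "perfect", "worth", "amazing",
         "fantastic", "job", "work", "your", "mine", ",", "!", ".", "know",
         "question", "shall", "go", "let"]
INDEX = {word: i for i, word in enumerate(VOCAB)}
UNKNOWN = len(VOCAB)  # last slot, shared by every word not in the vocabulary

# Hypothetical lookup table standing in for a real stemmer
STEMS = {"worthy": "worth", "questions": "question"}

def encode(text):
    """Turn a piece of feedback into a bag-of-words count vector."""
    vector = [0] * (len(VOCAB) + 1)
    vector[INDEX["SOS"]] = 1  # one start marker for the whole text
    vector[INDEX["EOS"]] = 1  # one end marker for the whole text
    # Tokens are lowercase words or one of the punctuation marks , . !
    for token in re.findall(r"[a-z]+|[,.!]", text.lower()):
        token = STEMS.get(token, token)
        vector[INDEX.get(token, UNKNOWN)] += 1
    return vector

feedback = ("Fantastic work, John! Your work is worthy of JPL! "
            "Let me know if you have questions. Michael.")
print(encode(feedback))
```

With the elided positions removed, the printed vector has the same entries as the array above, ending in 3 for the three words that are not in the vocabulary.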

We would do this for all feedback given to all students, pairing each array with the grade that student received. These pairs are used to train a model (logistic regression, naive Bayes, DNN, etc.). Then we can feed the array from our example into the newly trained model, and it should be able to predict the appropriate grade for the student based on the feedback alone!
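
As a rough sketch of that training step, here is a small naive Bayes classifier written in plain Python and trained on the ten graded examples from the table earlier. Several choices are assumptions made for illustration: word counts are stored in dictionaries rather than a 20,000-element array, add-one (Laplace) smoothing is used so that unseen word and grade combinations do not get probability zero, and words that never appeared in training are skipped.

```python
import math
import re
from collections import Counter, defaultdict

FEEDBACK = [
    ("A", "Fantastic work!"), ("A", "Great job!"), ("A", "Perfect!"),
    ("B", "Good job"), ("B", "Almost perfect"), ("B", "So close"),
    ("C", "Good, but needs work"), ("C", "Needs some help"),
    ("D", "Poor work"), ("D", "Try harder next time"),
]

def tokenize(text):
    """Lowercase the text and keep only the words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per grade and how many feedback strings each grade has."""
    word_counts = defaultdict(Counter)  # word_counts[grade][word]
    grade_counts = Counter()            # number of examples per grade
    for grade, text in examples:
        grade_counts[grade] += 1
        word_counts[grade].update(tokenize(text))
    vocabulary = {w for c in word_counts.values() for w in c}
    return word_counts, grade_counts, vocabulary

def predict(text, model):
    """Return the grade with the highest naive Bayes score for the text."""
    word_counts, grade_counts, vocabulary = model
    total = sum(grade_counts.values())
    scores = {}
    for grade in grade_counts:
        # log P(grade) + sum of log P(word | grade), with add-one smoothing
        score = math.log(grade_counts[grade] / total)
        denom = sum(word_counts[grade].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[grade][word] + 1) / denom)
        scores[grade] = score
    return max(scores, key=scores.get)

model = train(FEEDBACK)
print(predict("Try harder next time", model))
print(predict("Fantastic job!", model))
```

On this toy data the sketch assigns a D to "Try harder next time" and an A to "Fantastic job!", even though the second phrase never appears in the training examples. Ten examples are far too few for a reliable model; the point is only to show how counts per grade become a prediction.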