# **Natural Language Processing**

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with understanding human language. It involves programming techniques for building models that can understand language, classify content, and even generate new compositions in human language.

### **Reference**

*   [**NLP Zero to Hero - YouTube Playlist**](https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S)

## **Top NLP Libraries**

*   Natural Language Toolkit (NLTK)
*   Gensim
*   Texthero
*   spaCy
*   TextBlob

### **Encode Language into Numbers.**

Encoding language into numbers can be performed in many ways. The most common way is to encode entire words.

Using this technique, consider a sentence like "I love my dog". We could encode it with the numbers $[1, 2, 3, 4]$. If we then wanted to encode another sentence like "I love my cat", it could be $[1, 2, 3, 5]$. The two sentences differ by only one word, and their encodings reflect that: $[1, 2, 3, 4]$ looks a lot like $[1, 2, 3, 5]$. This process of converting words into numbers is called ***Tokenization***.
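
The idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper names "build_word_index" and "encode" are hypothetical, words are split on whitespace, and indices are assigned in order of first appearance.

```python
def build_word_index(sentences):
    """Give each distinct lowercase word an integer, starting at 1, in order of first appearance."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index


def encode(sentence, index):
    """Replace each word of the sentence with its integer token."""
    return [index[word] for word in sentence.lower().split()]


corpus = ["I love my dog", "I love my cat"]
toy_index = build_word_index(corpus)
encoded = [encode(sentence, toy_index) for sentence in corpus]

print(toy_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)    # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

The two encodings differ only in the last position, which is exactly where the sentences differ.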

*TensorFlow Keras contains a module called **preprocessing** that provides several extremely useful tools to prepare data for machine learning. One of these is a **Tokenizer** that will allow us to take words and turn them into tokens.*


In [1]:
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
sentences = ["I love my dog", "I love my cat"]

"""
In this case, we create a Tokenizer object and specify the number of words that it can tokenize.
This value will be the maximum number of tokens to generate from the corpus of words.
We have a very small corpus here containing only five unique words, so we'll be well under the one hundred specified.
"""

tokenizer = Tokenizer(num_words=100)

# Once we have a tokenizer, calling "fit_on_texts" will create the tokenized word index.
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Print a set of key/value pairs for the words in the corpus.
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


In [3]:
"""
The tokenizer is quite flexible. For example, if we were to expand the corpus with another sentence containing the word "cat"
but with a question mark after it, the results show that it would be smart enough to filter out "cat?" as just "cat".
"""

sentences = ["I love my dog", "I love my cat", "Do you love my cat?"]

"""
This behavior is controlled by the filters parameter to the tokenizer, which defaults to removing all punctuation 
except the apostrophe character. Once we have the words in our sentences tokenized, the next step is to convert the 
sentences into lists of numbers, with the number being the value where the word is the key.
"""

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'cat': 4, 'dog': 5, 'do': 6, 'you': 7}
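
Notice that the index order changed in the output above: "love" is now 1, not "i". The Keras Tokenizer ranks words by how often they occur in the fitted corpus, most frequent first. The sketch below reproduces that ranking in plain Python. It is an approximation for illustration, not the Keras implementation: the helper name "frequency_ranked_index" is hypothetical, and ties are assumed to keep first-appearance order.

```python
import string
from collections import Counter

# Map every ASCII punctuation mark except the apostrophe to a space.
PUNCTUATION = string.punctuation.replace("'", "")
PUNCT_TO_SPACE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def frequency_ranked_index(sentences):
    """Rank words by corpus frequency; the most frequent word gets index 1."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(PUNCT_TO_SPACE).split())
    # sorted() is stable, so words with equal counts keep first-appearance order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


corpus = ["I love my dog", "I love my cat", "Do you love my cat?"]
ranked_index = frequency_ranked_index(corpus)
print(ranked_index)
# {'love': 1, 'my': 2, 'i': 3, 'cat': 4, 'dog': 5, 'do': 6, 'you': 7}
```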


### **Turning Sentences into Sequences:** *Encode the sentences into sequences of numbers.* 

In [4]:
sentences = ["I love my dog", "I love my cat", "Do you love my cat?"]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
print("\n")

# The tokenizer has a method called "texts_to_sequences". Passing it our list of sentences returns a list of sequences.
sequences = tokenizer.texts_to_sequences(sentences)

# The output is the sequences representing the three sentences.
print(sequences)

{'love': 1, 'my': 2, 'i': 3, 'cat': 4, 'dog': 5, 'do': 6, 'you': 7}


[[3, 1, 2, 5], [3, 1, 2, 4], [6, 7, 1, 2, 4]]
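
To check what a sequence encodes, the word index can be inverted so that tokens map back to words. The sketch below hard-codes the index and sequences printed above, and "decode" is a hypothetical helper name.

```python
word_index = {"love": 1, "my": 2, "i": 3, "cat": 4, "dog": 5, "do": 6, "you": 7}
sequences = [[3, 1, 2, 5], [3, 1, 2, 4], [6, 7, 1, 2, 4]]

# Invert the mapping: token -> word.
index_word = {token: word for word, token in word_index.items()}


def decode(sequence):
    """Turn a list of tokens back into a space-separated string of words."""
    return " ".join(index_word[token] for token in sequence)


decoded = [decode(sequence) for sequence in sequences]
print(decoded)
# ['i love my dog', 'i love my cat', 'do you love my cat']
```

The decoded text is lowercase and has no question mark, which shows what tokenization discards: letter case and punctuation.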


### **Using Out-Of-Vocabulary (OOV) tokens**

Suppose we are training a neural network on a set of data. The typical pattern is that we have a set of data used for training that we know won’t cover 100% of our needs, but we hope covers as much as possible. In the case of NLP, we might have many thousands of words in our training data, used in many different contexts, but we can’t have every possible word in every possible context. So when we show our neural network some new, previously unseen text containing previously unseen words, what might happen? The neural network has no context for those words, and, as a result, any prediction it gives will be negatively affected.

One tool for handling these situations is an ***out-of-vocabulary (OOV)*** token. It stands in for every word that was not seen during fitting, which helps the neural network keep the context of text containing previously unseen words. For example, given the previous small example corpus, suppose we want to process sentences like these:

In [5]:
test_data = ["Your dog is beautiful.", "My cat ate your rat?"]

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

{'love': 1, 'my': 2, 'i': 3, 'cat': 4, 'dog': 5, 'do': 6, 'you': 7}
[[5], [2, 4]]


In [6]:
# Without an OOV token, unseen words are silently dropped, as in the output above.
# We fix this by adding a parameter called "oov_token", as shown below.
# We can assign it any string we like, but it should not appear elsewhere in our corpus.

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

sequences = tokenizer.texts_to_sequences(sentences)

test_sequences = tokenizer.texts_to_sequences(test_data)
print(test_sequences)

{'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'cat': 5, 'dog': 6, 'do': 7, 'you': 8}
[[1, 6, 1, 1], [3, 5, 1, 1, 1]]


The output has improved. Our token list has a new item, "$<OOV>$", and our test sentences keep their original length. The first sentence is now much closer to its original meaning. The second, because most of its words aren't in the corpus, still lacks a lot of context, but it's a step in the right direction.
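
The lookup behind this behavior can be sketched as a dictionary lookup with a fallback. The code below is a simplified illustration, not the Keras implementation: it hard-codes the word index printed above, and "to_sequence" is a hypothetical helper name.

```python
import string

word_index = {"<OOV>": 1, "love": 2, "my": 3, "i": 4, "cat": 5, "dog": 6, "do": 7, "you": 8}

# Map every ASCII punctuation mark except the apostrophe to a space.
PUNCTUATION = string.punctuation.replace("'", "")
PUNCT_TO_SPACE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def to_sequence(sentence, index, oov_token="<OOV>"):
    """Encode a sentence, mapping every unknown word to the OOV token's index."""
    oov_id = index[oov_token]
    words = sentence.lower().translate(PUNCT_TO_SPACE).split()
    return [index.get(word, oov_id) for word in words]


test_data = ["Your dog is beautiful.", "My cat ate your rat?"]
oov_sequences = [to_sequence(sentence, word_index) for sentence in test_data]
print(oov_sequences)
# [[1, 6, 1, 1], [3, 5, 1, 1, 1]]
```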

### **Using Padding**

When training neural networks, we typically need all our data to be in the same shape. However, once we've tokenized the words and converted the sentences into sequences, those sequences can have different lengths. To get them to the same size and shape, we can use ***padding***.

In [7]:
""" To explore padding, let's add another, much longer, sentence to the corpus. """

sentences = [
    "I love my dog",
    "I love my cat",
    "Do you love my cat?",
    "The dog chased the cat and the cat chased the rat.",
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
"""
The output is the sequences representing the four sentences.
When we sequence the corpus, we'll see that our lists of numbers have different lengths.
"""
print(sequences)

[[5, 3, 4, 6], [5, 3, 4, 1], [8, 9, 3, 4, 1], [2, 6, 7, 2, 1, 10, 2, 1, 7, 2, 11]]


In [8]:
""" If we want to make these sequences into the same length, we can use the "pad_sequences" API. """

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Using the "pad_sequences" API is very straightforward.
# To convert our (unpadded) sequences into a padded set, we simply call "pad_sequences" like this:
padded = pad_sequences(sequences)

# We'll get a nicely formatted set of sequences. They'll also be on separate lines, like this:
print(padded)

[[ 0  0  0  0  0  0  0  5  3  4  6]
 [ 0  0  0  0  0  0  0  5  3  4  1]
 [ 0  0  0  0  0  0  8  9  3  4  1]
 [ 2  6  7  2  1 10  2  1  7  2 11]]


The sequences get padded with 0, which isn't a token in our word list. If you had wondered why the token list began at 1 when typically programmers count from 0, now you know! 

We now have something regularly shaped that we can use for training. But before going there, let's explore this API a little, because it gives us many options for improving our data. First, you may have noticed that, to bring the shorter sentences to the same shape as the longest one, the requisite number of zeros was added at the beginning. This is called ***pre-padding***, and it’s the default behavior. We can change this using the "padding" parameter. For example, if we want our sequences to be padded with zeros at the end, we can use:

In [9]:
padded = pad_sequences(sequences, padding="post")

# The words are at the beginning of the padded sequences, and the zeros are at the end.
print(padded)

[[ 5  3  4  6  0  0  0  0  0  0  0]
 [ 5  3  4  1  0  0  0  0  0  0  0]
 [ 8  9  3  4  1  0  0  0  0  0  0]
 [ 2  6  7  2  1 10  2  1  7  2 11]]


The next default behavior we may have observed is that the sentences were all made the same length as the longest one. It's a sensible default because it means we don’t lose any data. The trade-off is that we get a lot of padding. But what if we don’t want this, perhaps because one very long sentence would cause too much padding in the padded sequences? To fix this, we can use the "maxlen" parameter, specifying the desired maximum length when calling "pad_sequences", like this:

In [10]:
padded = pad_sequences(sequences, padding="post", maxlen=6)
print(padded)

[[ 5  3  4  6  0  0]
 [ 5  3  4  1  0  0]
 [ 8  9  3  4  1  0]
 [10  2  1  7  2 11]]


Now the padded sequences are all the same length, and there isn’t too much padding. We have lost some words from our longest sentence, though, and they’ve been truncated from the beginning. What if we don't want to lose the words from the beginning, but instead want them truncated from the end of the sentence? We can override the default behavior with the "truncating" parameter, as follows:

In [11]:
padded = pad_sequences(sequences, padding="post", maxlen=6, truncating="post")

# The result will show that the longest sentence is now truncated at the end instead of the beginning.
print(padded)

[[ 5  3  4  6  0  0]
 [ 5  3  4  1  0  0]
 [ 8  9  3  4  1  0]
 [ 2  6  7  2  1 10]]
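
The padding and truncating rules explored above can be summarized in a small plain-Python sketch. This is a simplified re-implementation for illustration only: the function name "pad" is hypothetical, and it returns nested lists rather than the NumPy array that "pad_sequences" produces.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        # Default: use the length of the longest sequence, so nothing is lost.
        maxlen = max(len(sequence) for sequence in sequences)
    result = []
    for sequence in sequences:
        sequence = list(sequence)
        if len(sequence) > maxlen:
            # "pre" drops tokens from the start; "post" drops them from the end.
            if truncating == "pre":
                sequence = sequence[len(sequence) - maxlen:]
            else:
                sequence = sequence[:maxlen]
        fill = [value] * (maxlen - len(sequence))
        # "pre" puts the fill values before the tokens; "post" puts them after.
        result.append(fill + sequence if padding == "pre" else sequence + fill)
    return result


sequences = [[5, 3, 4, 6], [5, 3, 4, 1], [8, 9, 3, 4, 1], [2, 6, 7, 2, 1, 10, 2, 1, 7, 2, 11]]

print(pad(sequences)[0])
# [0, 0, 0, 0, 0, 0, 0, 5, 3, 4, 6]
print(pad(sequences, padding="post", maxlen=6)[3])
# [10, 2, 1, 7, 2, 11]
print(pad(sequences, padding="post", maxlen=6, truncating="post")[3])
# [2, 6, 7, 2, 1, 10]
```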


# **spaCy: Industrial-strength NLP**

> [**spaCy - Official Website**](https://spacy.io/)

> [**spaCy - Wikipedia**](https://en.wikipedia.org/wiki/SpaCy)

> [**spaCy - GitHub**](https://github.com/explosion/spaCy)

> [**spaCy - PyPI**](https://pypi.org/project/spacy/)

In [None]:
!pip install spacy

## **Named Entity Recognition**

**Named Entity Recognition** (NER) [**[Wikipedia]**](https://en.wikipedia.org/wiki/Named-entity_recognition), also known as entity chunking or entity extraction, is the NLP task of identifying named entities in text and classifying them into pre-defined classes. The input text is parsed, and the named entities are segmented and categorized as persons, organizations, places, money, time, etc. It is a popular technique in information extraction.

NER systems are developed with various linguistic approaches, as well as statistical and machine learning methods. NER has many applications in projects and business. An NER model first identifies an entity and then assigns it to the most suitable class.
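
The two steps, identifying a span and then assigning it a class, can be illustrated with a toy rule-based recognizer. This is only a sketch of the rule-based idea and not how spaCy works: the patterns, the labels covered, and the name "find_entities" are all assumptions made for the example.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical hand-written rules: each class has one pattern that identifies its spans.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)\b")),
]


def find_entities(text):
    """Return (entity text, start, end, label) tuples in order of appearance."""
    entities = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.group(), match.start(), match.end(), label))
    return sorted(entities, key=lambda entity: entity[1])


text = "The mission was launched on 5 November 2013 at a cost of $74 million."
found = find_entities(text)
for entity_text, start, end, label in found:
    print(entity_text, start, end, label)
```

Hand-written rules like these only cover the patterns someone thought to write down, which is one reason statistical models are usually preferred for open-ended classes such as persons and organizations.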

In [13]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Named Entity Recognition function: prints each entity's text, character offsets, and label.
def named_entity_recognition(raw_text):
    NER = nlp(raw_text)
    for word in NER.ents:
        print(word.text, word.start_char, word.end_char, word.label_)
    displacy.render(NER, style="ent", jupyter=True)


sentence = """ The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 
               by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became 
               the first country to enter Mars orbit on its first attempt. It was completed at a record-low cost of $74 million. """

# Function Call.
named_entity_recognition(sentence)

The Mars Orbiter Mission (MOM 1 30 PRODUCT
Mangalyaan 53 63 PERSON
Earth 83 88 LOC
5 November 2013 98 113 DATE
the Indian Space Research Organisation 133 171 ORG
Mars 195 199 LOC
24 September 2014 209 226 DATE
India 228 233 GPE
first 266 271 ORDINAL
Mars 289 293 LOC
$74 million 363 374 MONEY


## **Part-of-Speech Tagging**

In [14]:
import spacy

nlp = spacy.load("en_core_web_sm")

sentence = """The Indian Space Research Organisation is the national space agency of India, headquartered in Bengaluru."""

# Part-of-Speech Tagging.
doc = nlp(sentence)
for token in doc:
    print(token.text, "|", token.pos_, "|", token.tag_)

The | DET | DT
Indian | PROPN | NNP
Space | PROPN | NNP
Research | PROPN | NNP
Organisation | PROPN | NNP
is | AUX | VBZ
the | DET | DT
national | PROPN | NNP
space | PROPN | NNP
agency | PROPN | NNP
of | ADP | IN
India | PROPN | NNP
, | PUNCT | ,
headquartered | VERB | VBN
in | ADP | IN
Bengaluru | PROPN | NNP
. | PUNCT | .


## **Semantic Textual Similarity**

In [15]:
import spacy
import warnings

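# "en_core_web_sm" ships without static word vectors, so spaCy warns that similarity
# scores may be unreliable. The warning is silenced here; a model with vectors,
# such as "en_core_web_md", gives more meaningful scores.
warnings.filterwarnings("ignore")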

nlp = spacy.load("en_core_web_sm")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity between the two documents.
print(doc1, "<->", doc2, "<->", doc1.similarity(doc2))

# Similarity of tokens and spans.
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, "<->", french_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. <-> 0.27134929909014804
salty fries <-> hamburgers <-> 0.40727245807647705


#### **Measuring Text Similarity Using BERT.**

In [None]:
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

In [17]:
texts = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell.",
]

# Sentence Embedding.
sentence_embeddings = model.encode(texts)
print("Shape of Embeddings is ", sentence_embeddings.shape)

# Similarity of the remaining sentences w.r.t. the first sentence.
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([sentence_embeddings[0]], sentence_embeddings[1:])

Shape of Embeddings is  (4, 768)


array([[0.33088914, 0.7219258 , 0.5548363 ]], dtype=float32)
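
Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The plain-Python sketch below shows the computation on small made-up vectors. The vectors and the function name "cosine" are assumptions for illustration; real sentence embeddings have 768 dimensions, as the shape above shows.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # Same direction as a, twice the length.
c = [3.0, 0.0, -1.0]  # Perpendicular to a (dot product is 0).

same_direction = cosine(a, b)
perpendicular = cosine(a, c)
print(same_direction, perpendicular)
```

Vectors pointing the same way score close to 1 regardless of their length, and perpendicular vectors score 0, which is why the measure suits comparing embeddings of sentences with different lengths.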