# Natural Language Processing

## What is NLP?

> "*Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text as input, and algorithms that produce natural looking text as outputs.*" (Goldberg, 2017)


## Where is NLP used?

### **Text Classification**  
  *is the process of assigning tags or categories to text according to its content.*

* **Sentiment analysis**  
    *is the process of analyzing emotions within a text and classifying them as positive, negative, or neutral*  
    
      Example: By running sentiment analysis on social media posts, product reviews, NPS surveys, and customer feedback, businesses can gain valuable insights into how customers perceive their brand. See, for example, these Zoom customer and product reviews:
     <img src="./pics/nlp/sentiment1.png" style="width:500px;">
     <img src="./pics/nlp/sentiment2.png" style="width:500px;">
      
      A sentiment classifier tries to detect the emotions people express with their words and classifies texts as Positive, Negative, or Neutral. For the images above, it can pick up the nuance of each opinion and automatically tag the first review as Negative and the second one as Positive. 
            
            
* **Topic classification**  
    *is the process of identifying the main themes or topics within a text and assigning predefined tags.*  
    To train a topic classifier, you need to be familiar with the data being analyzed so that you can define relevant categories.  
      Example: you might work for a software company and receive a lot of customer support tickets that mention technical issues, usability, and feature requests. In this case, you might define your tags as Bugs, Feature Requests, and UX/IX.

* **Intent detection**  
    *is the process of identifying the purpose, goal, or intention behind a text.*  
      
      Example: sorting outbound sales email responses by Interested, Need Information, Unsubscribe, Bounce. The tag Interested could help to spot a potential sales opportunity as soon as an email enters an inbox!  
         
         
### Text Extraction
  *is the process of extracting specific information that is already in the text.*
  
* **Keyword extraction**  
    *is the process of automatically extracting the most important words and expressions within a text.*  
    This can provide a sort of preview of the content and its main topics, without the need to read each piece. 
    
      Example: the feature request below was processed with MonkeyLearn’s public keyword extractor:
     <img src="./pics/nlp/extract.png" style="width:500px;">

* **Named Entity Recognition (NER)**  
    *is the process of extracting the entities such as names of people, companies, places, etc.*  
    
* **Machine Translation**  
    *is the process of translating speech and text to different languages*  
    This was one of the first problems addressed by NLP researchers. Online translation tools (like *Google Translate*) combine different NLP techniques to approach human-level accuracy when translating speech and text between languages. Custom translation models can be trained for a specific domain to maximize the accuracy of the results. 

* **Topic Modeling**  
    *is the process of finding relevant topics in a text by grouping texts with similar words and expressions.*   
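    This is similar to topic classification, but it is unsupervised: topics are discovered from the texts themselves rather than assigned from predefined tags. 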

* **Natural Language Generation (NLG)**  
    *is the process of analyzing unstructured data and using it as an input to automatically create content*   
    Can be used to generate automated answers, write emails, and even books!

## Tasks & Techniques

**Syntactic analysis**  
or parsing/syntax analysis, identifies the syntactic structure of a text and the dependency relationships between words, represented on a diagram called a parse tree.

**Semantic analysis**  
focuses on identifying the meaning of language. However, since language is polysemic and ambiguous, semantics is considered one of the most challenging areas in NLP. Semantic tasks analyze the structure of sentences, word interactions, and related concepts, in an attempt to discover the meaning of words, as well as understand the topic of a text. 

* Tokenization
* Part-of-speech tagging
* Dependency Parsing
* Constituency Parsing
* Lemmatization & Stemming
* Stopword Removal
* Word Sense Disambiguation
* Named Entity Recognition (NER)
* Relationship extraction

### Tokenization  
*is an essential task in NLP used to break up a string of words into semantically useful units called **tokens**.*

Sentence tokenization splits a text into sentences, and word tokenization splits a sentence into words. Generally, word tokens are separated by blank spaces, and sentence tokens by full stops. However, you can perform high-level tokenization for more complex structures, like words that often go together, otherwise known as collocations (e.g., *New York*).

    Example of how word tokenization simplifies text:  
    Customer service couldn’t be better! = “customer service” “could” “not” “be” “better”. 
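
The two levels of tokenization described above can be sketched with regular expressions. This is a minimal illustration, not how production tokenizers work: the helper names `sentence_tokenize` and `word_tokenize` and both patterns are assumptions made for this example. Unlike the example above, it keeps the contraction "couldn't" as a single token and does not group collocations such as "customer service".

```python
import re

def sentence_tokenize(text):
    """Split text wherever '.', '!' or '?' is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_tokenize(sentence):
    """Lowercase the sentence and return runs of letters/digits,
    keeping an apostrophe suffix (e.g. "couldn't") attached."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

text = "Customer service couldn't be better! I will order again."
sentences = sentence_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

A naive splitter like this breaks on abbreviations such as "U.K.", which is one reason to prefer a library tokenizer for real text.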
    
    
### Part-of-speech (PoS) tagging

*involves adding a part of speech category to each token within a text.*  

Some common PoS tags are verb, adjective, noun, pronoun, conjunction, preposition, and interjection, among others. PoS tagging is useful for identifying relationships between words and, therefore, for understanding the meaning of sentences.

    The tokenization example above would look like this:  
    “Customer service”: NOUN, “could”: VERB, “not”: ADVERB, “be”: VERB, “better”: ADJECTIVE, “!”: PUNCTUATION


### Dependency Parsing  

*refers to the way the words in a sentence are connected.* 

    Example: a dependency parser analyzes how ‘head words’ are related to and modified by other words in order to understand the syntactic structure of a sentence:
   <img src="./pics/nlp/dependency-parsing.png" style="width:900px;">


### Constituency Parsing  

*aims to visualize the entire syntactic structure of a sentence by identifying phrase structure grammar.*  

It consists of using abstract terminal and non-terminal nodes associated with words.

    Example:
   <img src="./pics/nlp/constituency-parsing.png" style="width:600px;">


### Lemmatization & Stemming  

*aims to transform the words back to their root form.*

The word as it appears in the dictionary – its **root form** – is called a **lemma**. When we speak or write, we tend to use inflected forms of a word (words in their different grammatical forms). To make these words easier for computers to understand, NLP uses lemmatization and stemming to transform them back to their root form.
 
    Example: 
    The terms “is,” “are,” “am,” “were,” and “been” are grouped under the lemma “be.” So, if we apply lemmatization to “African elephants have four nails on their front feet,” the result will look something like this:

    African elephants have four nails on their front feet = “African,” “elephant,” “have,” “four,” “nail,” “on,” “their,” “front,” “foot”

This example shows how lemmatization rewrites the sentence using base forms (e.g., the word "feet" was changed to "foot").

When we refer to stemming, the root form of a word is called a stem. Stemming "trims" words, so word stems may not always be semantically correct.

For example, stemming the words “consult,” “consultant,” “consulting,” and “consultants” would result in the root form “consult.”

While lemmatization is dictionary-based and chooses the appropriate lemma based on context, stemming operates on single words without considering the context. 

    Example: “This is better”  
    The word “better” is transformed into the word “good” by a lemmatizer but is unchanged by stemming. Even though stemmers can lead to less accurate results, they are easier to build and run faster than lemmatizers. Lemmatizers are recommended when you need more linguistically precise results.
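
The contrast between the two approaches can be sketched in a few lines. This is a toy illustration under stated assumptions: `SUFFIXES` and `LEMMA_DICT` are hand-made, hypothetical resources, and `toy_lemmatize` is a plain dictionary look-up that ignores context, unlike a real lemmatizer.

```python
# Hypothetical suffix list, tried in this order (longest first).
SUFFIXES = ("ants", "ant", "ing", "s")

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_DICT = {
    "is": "be", "are": "be", "am": "be", "were": "be", "been": "be",
    "feet": "foot", "elephants": "elephant", "nails": "nail",
    "better": "good",
}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)

for w in ["consult", "consultant", "consulting", "consultants"]:
    print(w, "->", toy_stem(w))

# The stemmer leaves "better" unchanged; the dictionary maps it to "good".
print(toy_stem("better"), toy_lemmatize("better"))
```

The stemmer needs only a rule list, which is why stemmers are cheap to build, while the lemmatizer is only as good as its dictionary.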


### Stopword Removal  

*involves filtering out high-frequency words that add little or no semantic value to a sentence, for example, which, to, at, for, is, etc.*  

Removing stop words is a common step in NLP text preprocessing. Lists of stopwords can be customized to include any words that you want to ignore.

    Example:  
    Let’s say you want to classify customer service tickets based on their topics. 
    In this example: “Hello, I’m having trouble logging in with my new password”, it may be useful to remove stop words like “hello”, “I”, “am”, “with”, “my”, so you’re left with the words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
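
A minimal sketch of stopword filtering on the ticket above. The `STOPWORDS` set is a hypothetical custom list chosen for this example, not a standard one. Because it contains "in", the phrase "logging in" is reduced to "logging"; keeping it as one unit would need collocation-aware tokenization.

```python
import re

# Hypothetical custom stopword list for support tickets.
STOPWORDS = {"hello", "i", "i'm", "am", "having", "in", "with", "my"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase and tokenize the text, then drop every token in the stopword set."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in stopwords]

ticket = "Hello, I'm having trouble logging in with my new password"
kept = remove_stopwords(ticket)
print(kept)
```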
    

### Word Sense Disambiguation (WSD)

Depending on their context, words can have different meanings. 

    Example: the word “book”
      * You should read this book; it’s a great novel!
      * You should book the flights as soon as possible.
      * You should close the books by the end of the year.
      * You should do everything by the book to avoid potential complications.

There are two main techniques that can be used for WSD: 
* **Knowledge-based (or dictionary) approach**  
  tries to infer meaning by observing the dictionary definitions of ambiguous terms within a text.
* **Supervised approach**  
  is based on NLP algorithms that learn from training data.
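
The knowledge-based idea can be sketched with a simplified version of the Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the sentence. Everything below is an assumption made for illustration: the three glosses in `SENSES`, the `IGNORED` word list, and the function names are hand-written, not taken from a real dictionary or library.

```python
import re

# Hypothetical mini-dictionary: sense label -> gloss for the word "book".
SENSES = {
    "novel": "a written work such as a novel that you read",
    "reserve": "to reserve or arrange flights tickets or a hotel in advance",
    "accounts": "financial records of a company closed at the end of the year",
}

# Hypothetical list of function words that should not count as overlap.
IGNORED = {"a", "the", "of", "or", "to", "you", "in", "at", "as",
           "that", "such", "should", "this", "it's", "by"}

def content_words(text):
    """Return the set of lowercase words in text, minus the ignored ones."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in IGNORED}

def simplified_lesk(context, senses=SENSES):
    """Return the sense whose gloss overlaps most with the context sentence."""
    context_set = content_words(context)
    return max(senses, key=lambda s: len(context_set & content_words(senses[s])))

print(simplified_lesk("You should read this book; it's a great novel!"))
print(simplified_lesk("You should book the flights as soon as possible."))
print(simplified_lesk("You should close the books by the end of the year."))
```

With only exact word overlap, an idiom such as "by the book" has nothing to match against, which hints at why supervised approaches are often preferred.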


### Named Entity Recognition (NER)  
*involves extracting entities from within a text (e.g., names, places)*  

Entities can be names, places, organizations, email addresses, and more.
<img src="./pics/nlp/ner.png" style="width:700px;">

#### Open-Source named entity recognition APIs

Open-source APIs are for developers: they are free, flexible, and entail a gentle learning curve. Here are a few options:

   * **Stanford Named Entity Recognizer (SNER)**  
       Java tool developed by Stanford University that is widely regarded as a standard library for entity extraction. It’s based on Conditional Random Fields (CRF) and provides pre-trained models for extracting person, organization, location, and other entities. 
   * **SpaCy**  
       Python framework known for being fast and very easy to use. It has an excellent statistical system that you can use to build customized NER extractors.
   * **Natural Language Toolkit (NLTK)**  
       Python library that is widely used for NLP tasks. NLTK has its own classifier for recognizing named entities, called ne_chunk, and also provides a wrapper for using the Stanford NER tagger in Python.


### Relationship extraction

*finds relationships between two nouns.*  

    Example:  
    In the phrase “Susan lives in Los Angeles,” a person (Susan) is related to a place (Los Angeles) by the semantic category “lives in.”
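
One simple way to approximate this is a hand-written pattern. The sketch below assumes that entities are capitalized words and that the relation is expressed literally as "lives in"; the `LIVES_IN` pattern and `extract_lives_in` helper are hypothetical names for this example. Real systems usually combine NER with a parser or a trained classifier.

```python
import re

# One or more capitalized words, e.g. "Susan" or "Los Angeles".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical pattern: <person> lives in <place>.
LIVES_IN = re.compile(rf"\b({NAME}) lives in ({NAME})")

def extract_lives_in(text):
    """Return (person, relation, place) triples found in the text."""
    return [(person, "lives in", place) for person, place in LIVES_IN.findall(text)]

triples = extract_lives_in("Susan lives in Los Angeles, and Tom lives in Paris.")
print(triples)
```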

## NLP in everyday life

11 of the most common and most powerful uses of natural language processing in everyday life:

* Email filters
* Virtual assistants, voice assistants, or smart speakers  
  * Amazon's Alexa, Yandex's Alice
* Online search engines   
  * Google, Yandex, ...
* Predictive text and autocorrect
* Monitor brand sentiment on social media
* Sorting customer feedback
* Automating processes in customer support
* Chatbots
* Automatic summarization
* Machine translation
* Natural language generation

<img src="./pics/nlp/nltk_main.jpeg" style="width:200px;"><img src="./pics/nlp/spacy_main.png" style="width:200px;"><img src="./pics/nlp/text_main.png" style="width:200px;">

# [**SpaCy**](https://spacy.io/usage)

<img src="./pics/nlp/spacy_installation.png" style="width:500px;">

In [1]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#!python -m spacy validate

import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm
!{sys.executable} -m spacy validate

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

[SpaCy linguistic features](https://spacy.io/usage/linguistic-features)

In [10]:
# Named Entity Recognition
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [41]:
"Apple is looking at buying U.K. startup for $1 billion"[27:31]

'U.K.'

<img src="./pics/nlp/spacy_ner.png" style="width:900px;">

In [7]:
# Spacy Visualizer
from spacy import displacy
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.render(doc, style="dep", jupyter=True)
#displacy.serve(doc, style="dep")

In [8]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.render(doc, style="ent", jupyter=True)
#displacy.serve(doc, style="ent")

In [9]:
# Tokenization
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


# [NLTK](https://www.nltk.org/)

In [33]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

[nltk_data] Downloading package punkt to /home/muha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/muha/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/muha/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/muha/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package treebank to /home/muha/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [19]:
sentence = "Apple is looking at buying U.K. startup for $1 billion"
tokens = nltk.word_tokenize(sentence)
tokens

['Apple',
 'is',
 'looking',
 'at',
 'buying',
 'U.K.',
 'startup',
 'for',
 '$',
 '1',
 'billion']

In [21]:
tagged = nltk.pos_tag(tokens)
tagged

[('Apple', 'NNP'),
 ('is', 'VBZ'),
 ('looking', 'VBG'),
 ('at', 'IN'),
 ('buying', 'VBG'),
 ('U.K.', 'NNP'),
 ('startup', 'NN'),
 ('for', 'IN'),
 ('$', '$'),
 ('1', 'CD'),
 ('billion', 'CD')]

In [43]:
entities = nltk.chunk.ne_chunk(tagged)
#entities

In [45]:
# Display parse trees
from nltk.corpus import treebank

t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

## Word embeddings

**Word embedding**  
is the collective name for a set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers.

Given a vocabulary of 10,000 words, what’s the simplest way to represent each word numerically?

* One-hot vector encoding

<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/nlp/emb1.jpeg" 
         alt="alternate text" 
         width=250 
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 250px; 
                word-wrap: break-word; 
                text-align:justify;">
        Our vocabulary of 10,000 words. <br> 
        <a href="https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>

<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/nlp/emb2.jpeg" 
         alt="alternate text" 
         width=315
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 315px; 
                word-wrap: break-word; 
                text-align:justify;">
        Our vocabulary with each word assigned an index. <br> 
        <a href="https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>


<p>&nbsp;</p>

Given this word-to-integer mapping, we could then represent a word as a vector of numbers as follows:      
   * Each word will be represented as an n-dimensional vector, where n is the vocabulary size  
   * Each word’s vector representation will be mostly “0”, except there will be a single “1” entry in the position corresponding to the word’s index in the vocabulary.  

So, some examples:  
   * The vector representation for our first vocabulary word “aardvark” will be [1, 0, 0, 0, …, 0], which is a “1” in the first position followed by 9,999 zeroes.
   * The vector representation for our second vocabulary word “ant” will be [0, 1, 0, 0, …, 0], which is a “0” in the first position, a “1” in the second position, and 9,998 zeroes afterwards.
   * And so on.

This process is called **one-hot vector encoding**. 

**Example**

Now, say our NLP project is building a translation model and we want to translate the English input sentence **“the cat is black”** into another language. We first need to represent each word with a one-hot encoding. We would first look up the index of the first word, “the”, and find that its index in our 10,000-long vocabulary list is 8676.



<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/nlp/emb3.jpeg" 
         alt="alternate text" 
         width=750 
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 750px; 
                word-wrap: break-word; 
                text-align:justify;">
        We first look up the index of the first word, “the”, and find that its index in our 10,000-long vocabulary list is 8676. <br> 
        <a href="https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>



<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/nlp/emb4.jpeg" 
         alt="alternate text" 
         width=750 
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 750px; 
                word-wrap: break-word; 
                text-align:justify;">
        We could then represent the word “the” using a length 10,000 vector, where every entry is a 0 aside from the entry at position 8676, which is a 1. <br> 
        <a href="https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>



<div style="width:image width px; 
            font-size:80%; 
            text-align:center; 
            float: left; padding-left-right-top-bottom:0.5em;  
            border-style: solid; border-color: rgba(211, 211, 211, 0.000);
            background-color: rgba(0, 0, 0, 0.000);">
    <img src="./pics/nlp/emb5.gif" 
         alt="alternate text" 
         width=750
         style="padding-bottom:0.5em;"/>
    <div style="padding: 3px; 
                width: 750px; 
                word-wrap: break-word; 
                text-align:justify;">
        We do this index look-up for every word in the input sentence, and create a vector to represent each input word. The whole process looks a bit like this. <br> 
        <a href="https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2" 
           style="float: left;"> 
           Source 
        </a>
    </div>
</div>

These one-hot vectors are a quick and easy way to represent words as vectors of real-valued numbers. 
Note that this process has generated a very sparse (mostly zero) feature vector for each input word (here, the terms “feature vector”, “embedding”, and “word representation” are used interchangeably).
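
The index look-up and vector construction described above can be sketched in plain Python. To keep the output readable, this assumes a hypothetical six-word vocabulary instead of the 10,000-word one, so the indices differ from those in the figures.

```python
# Hypothetical toy vocabulary; the position in the list is the word's index.
vocab = ["aardvark", "ant", "black", "cat", "is", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeroes with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# Encode every word of the input sentence.
sentence_vectors = [one_hot(w) for w in "the cat is black".split()]

for word, vector in zip("the cat is black".split(), sentence_vectors):
    print(word, vector)
```

Each vector has as many entries as the vocabulary has words, so with 10,000 words every vector would hold 9,999 zeroes.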

**The problems with sparse one-hot encodings**  
   * **The similarity issue.**  
       Ideally we would want similar words like “cat” and “tiger” to have somewhat similar features. But with these one-hot vectors, “cat” is as similar to “tiger” as literally any other word, which isn’t great. A related point is that we might want to do analogy-like vector operations on the word embeddings (e.g. what is “cat” - “small” + “large” equal to? Hopefully, something like a big cat, for instance “tiger” or “lion”). We’d need a sufficiently rich word representation to allow for such operations.
   * **The vocabulary size issue.**  
       With this approach, as you increase your vocabulary by n words, each feature vector also grows in length by n, because one-hot vector dimensionality equals the number of words. There are reasons why you don’t want your feature size to explode: more features mean more parameters to estimate, and you need exponentially more data to estimate those parameters well enough to build a reasonably generalisable model (see: curse of dimensionality). As a rough rule of thumb, you want orders of magnitude more training data than you have features.
   * **The computational issue.**  
       Each word’s embedding/feature vector is mostly zeroes, and many machine learning models won’t work well with very high dimensional and sparse features. Neural networks in particular struggle with this type of data. With such a large feature space, you are also in danger of running into memory and even storage concerns, especially if the models you’re working with don’t play nicely with compressed versions of sparse matrices.  
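
The similarity issue can be checked numerically. The sketch below assumes cosine similarity as the measure and a hypothetical 10-word vocabulary with arbitrary indices for "cat", "tiger", and "duvet". Any two different one-hot vectors have a dot product of zero, so every pair of distinct words looks equally unrelated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def one_hot(index, size):
    """Vector of zeroes with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]

size = 10  # hypothetical vocabulary size
cat, tiger, duvet = one_hot(2, size), one_hot(7, size), one_hot(4, size)

print(cosine(cat, tiger))  # related words
print(cosine(cat, duvet))  # unrelated words
print(cosine(cat, cat))    # a word with itself
```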
       
       
**Towards dense, semantically-meaningful representation**  
Let’s now discuss what it means to represent words using dense, semantically-meaningful feature vectors.
If we take 5 example words from our vocabulary (say… the words “aardvark”, “black”, “cat”, “duvet” and “zombie”) and examine their embedding vectors created by the one-hot encoding method discussed above, the result would look like this:

<img src="./pics/nlp/emb6.jpeg" style="width:600px;">  

*Word vectors using one-hot encoding. Each word is represented by a vector that is mostly zeroes, except there is a single “1” in the position dictated by that word’s index in the vocabulary. Note: it’s not that “black”, “cat”, and “duvet” have the same feature vector, it just looks like it here.*

But, as humans speaking some language, we know that words are these rich entities with many layers of connotation and meaning. Let’s hand-craft some semantic features for these 5 words. Specifically, let’s represent each word as having some sort of value between 0 and 1 for four semantic qualities, “animal”, “fluffiness”, “dangerous”, and “spooky”:
<img src="./pics/nlp/emb7.jpeg" style="width:500px;">  
*Hand-crafted semantic features for 5 words in the vocabulary.*

So, to explain a couple of examples:  
   * Given the word “aardvark”, I’ve given it a high value for the feature “animal” (since it’s very much an animal), and relatively low values for “fluffiness” (aardvarks have short bristles), “dangerous” (they’re small, nocturnal burrowing pigs), and “spooky” (they’re charming).
   * Given the word “cat”, I’ve given it a high value for the features “animal” and “fluffiness” (self-explanatory), a medium value for “dangerous” (self-explanatory if you’ve ever had a pet cat), and a medium value for “spooky” (try doing an image search for “sphynx cat”).

**Plotting words based on semantic feature values**

Each semantic feature can be thought of as a single dimension in the broader, higher-dimensional semantic space.
   * In the above made-up dataset, there are four semantic features, and we can plot two of these at a time as a 2D scatter plot (see below). Each feature is a different axis/dimension.
   * The coordinates of each word within this space are given by its specific values on the features of interest. For example, the coordinates of the word “aardvark” on the 2D plot of fluffiness vs. animal are (x=0.97, y=0.03).

<img src="./pics/nlp/emb8.gif" style="width:750px;">  

*Plotting word feature values on either 2 or 3 axes.*  

   * Similarly, we could consider the three features (“animal”, “fluffiness” and “dangerous”) and plot the position of words in this 3D semantic space. For example, the coordinates of the word “duvet” are (x=0.01, y=0.84, z=0.12), indicating that “duvet” is highly associated with the concept of fluffiness, is maybe slightly dangerous, and not an animal.

This is a hand-crafted toy example, but actual embedding algorithms will of course automatically generate embedding vectors for all the words in an input corpus. If you’d like, you can think of word embedding algorithms like word2vec as unsupervised feature extractors for words.
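
To see why dense features help, the hand-crafted vectors can be compared with cosine similarity. In this sketch the feature order is (animal, fluffiness, dangerous, spooky). Only the values quoted in the text (aardvark: animal 0.97, fluffiness 0.03; duvet: animal 0.01, fluffiness 0.84, dangerous 0.12) come from the example; every other number is an assumption chosen to match the verbal descriptions ("high", "medium", "low").

```python
import math

# Feature order: animal, fluffiness, dangerous, spooky.
# Values not quoted in the text are assumed for illustration.
features = {
    "aardvark": [0.97, 0.03, 0.15, 0.04],  # dangerous, spooky assumed
    "cat":      [0.96, 0.85, 0.45, 0.50],  # all four values assumed
    "duvet":    [0.01, 0.84, 0.12, 0.02],  # spooky assumed
}

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_aardvark = cosine(features["cat"], features["aardvark"])
sim_cat_duvet = cosine(features["cat"], features["duvet"])
sim_aardvark_duvet = cosine(features["aardvark"], features["duvet"])

print(round(sim_cat_aardvark, 2), round(sim_cat_duvet, 2), round(sim_aardvark_duvet, 2))
```

Under these assumed values, "cat" comes out closer to "aardvark" than to "duvet", and "aardvark" and "duvet" are nearly unrelated, a gradation that one-hot vectors cannot express.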

## References
   [1] Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Morgan & Claypool Publishers. [Amazon](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/ref=as_li_ss_tl)  
   [2] MonkeyLearn Blog. https://monkeylearn.com/blog/  
   [3] Latysheva, N. Why do we use word embeddings in NLP? Towards Data Science. https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2