# Introduction to Natural Language Processing with Python


## What is NLP?

Natural language processing is a way for computers to analyze, understand, and derive meaning from human language. With appropriate use and organization, NLP can be used to help developers perform a variety of tasks, including summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation for a given dataset or group of texts.

## What can we use it for?

Things like chatbots, voice-to-text software, and customer service sentiment analysis are examples of NLP applications. The use of (and development of tools for) NLP has experienced rapid growth over the last decade and is currently being integrated into a variety of fields. 

For example, a retailer on Amazon may run a sentiment analysis on all comments for a certain product. This could reveal general attitudes towards both the company and the item in question, ultimately leading to improvements and adjustments. As another example, Siri, Apple's personal voice assistant, is the amalgamation of years of work in NLP: Siri can recognize, conceptualize, and respond to a wide array of questions and comments. This not only improves the iPhone user experience in general but also increases the accessibility of the product.

## Basic Terminology

__Corpus__ (plural: corpora) is defined as a large collection of linguistic data. In other words, corpora serve as our datasets, or our information to process and train models on.

__Tokenization__ is the process of segmenting text into words, phrases, sentences, etc. This is one of the first steps in processing the text into workable components.

__Part-of-speech (POS) tagging__ involves assigning word types (parts of speech) to tokens, like _verb_, _noun_, _preposition_, etc. 

__Dependency Parsing__ is the process of assigning syntactic dependency labels that describe the relations between individual tokens. For example, in the sentence _The brown dog ran through the park_, dependency parsing would recognize that _brown_ is modifying the subject of the sentence, _dog_.

__Lemmatization__ is defined as assigning the base form of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".

__Sentence Boundary Detection (SBD)__ is responsible for finding and segmenting individual sentences within a text. 

__Named Entity Recognition (NER)__ involves labelling named "real-world" objects, like persons, companies, or locations. For example, in some instances, we want "Amazon" to be recognized as a technology company as opposed to the rainforest.

__Similarity__ is the process of comparing words, phrases, and documents to see how similar they are to each other. This is usually done using the cosine similarity between two vectors.


# spaCy

`spaCy` describes itself as industrial-strength natural language processing in Python. It's designed to work with the rest of the Python AI ecosystem, including `TensorFlow`, `PyTorch`, `scikit-learn`, and `Gensim`. Though a little less well known than NLTK, spaCy tends to be a little faster and functions well for large-scale information processing:

## NLTK vs spaCy Time Comparison
<img src="img/nltk_spacy.png" width="600" height="600">
source: https://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python
## NLP Package Comparisons
There's a variety of NLP libraries available, each with its own strengths and weaknesses. 
<img src="img/spacy_comp.png" width="400" height="600">
source: https://spacy.io/usage/facts-figures

In [None]:
import spacy
from spacy import displacy

We begin by importing the necessary packages, `spaCy` being the most notable. We then load the English statistical model (which is installed as its own Python package); the resulting `nlp` object is what we will be working with throughout. Using a model for a specific language enables `spaCy` to predict linguistic annotations, for example whether a word is a verb or a noun. Though some of spaCy's features are available without a language model, most of its functions require one.

In [None]:
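# 'en_core_web_sm' is the small English model; older spaCy versions also accepted the shortcut 'en'
nlp = spacy.load('en_core_web_sm')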

First, we'll walk through some of the basic functions of spaCy, based on the definitions above. We begin by reading in a few simple sentences, just to get a feel for how the package works.

In [None]:
doc = nlp("I'm having such a wonderful day in Ann Arbor! It is sunny out and there are flowers. Do you want to get some ice cream with Ms. Ellen?")

> When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer.

![spacy_pipeline](img/pipeline.svg)

You can create your own customized processing pipeline by removing or adding components, but we do not cover that in this workshop.

## Tokenizing

When we read something in with `nlp`, spaCy automatically tokenizes it. For example, it breaks _I'm_ into _I_ and _'m_, and each word is its own element. The `doc` object can be indexed to access individual tokens. 

In [None]:
for i, token in enumerate(doc[:15]):
    print(i, token)

We can also see how well the sentence boundary detection works. We view all the individual sentences from the paragraph by iterating through the `sents` attribute.

In [None]:
for sent in doc.sents:
    print(sent)
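
To see why sentence boundary detection is harder than it looks, here is a minimal sketch of a naive rule-based splitter built on Python's `re` module. It is not how spaCy segments sentences, and the `naive_split` helper is a hypothetical name used only for illustration. The rule splits after every `.`, `!`, or `?` that is followed by whitespace, so it trips over the abbreviation _Ms._

```python
import re

def naive_split(text):
    """Split after ., ! or ? followed by whitespace (a deliberately naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("I'm having such a wonderful day in Ann Arbor! "
        "It is sunny out and there are flowers. "
        "Do you want to get some ice cream with Ms. Ellen?")

pieces = naive_split(text)
for piece in pieces:
    print(piece)
```

The naive rule cuts the last sentence in two at _Ms._, which is the kind of case a trained model is meant to handle better than a fixed pattern.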

You can look at each token's fine-grained tag and coarse-grained part of speech using the `tag_` and `pos_` attributes. If you leave off the trailing `_`, you get the integer ID instead of the string. A `token` object has many attributes; here we look at just a few. We'll use `pandas` to put them in a DataFrame for easier viewing.

In [None]:
import pandas as pd
list_tokens = []
for token in doc:
    list_tokens.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_stop, token.is_sent_start))
df = pd.DataFrame(list_tokens, columns=['text','lemma','POS','tag','dependency','stopword','sentence_start'])
df

You can look at the named entities by iterating through the `ents` attribute.

In [None]:
for entity in doc.ents:
    print(entity.text, entity.label_)

### spaCy `explain`

Since `spaCy` uses many abbreviated labels, it's easy to lose track of what they all stand for. The `spacy.explain` function returns a description for many of them.

In [None]:
# A mix of entity, part-of-speech, and dependency labels
labels = ['GPE', 'ADP', 'dobj', 'cc']
for label in labels:
    print('{} : {}'.format(label, spacy.explain(label)))

# Exercise

Print out all the proper nouns (PROPN) and persons (PERSON) in this paragraph.

In [None]:
paragraph = 'Shapovalov arrived in the Spanish capital without an ATP World Tour match win on clay. In fact, he owned just a 1-4 clay-court record on the ATP Challenger Tour. But the Canadian found some of his best tennis to become the youngest quarter-finalist and semi-finalist in event history. Against Zverev, he was attempting to become the youngest Masters 1000 finalist since 18-year-old Richard Gasquet battled to the championship match at Hamburg in 2005.'

In [None]:
# write Solution Here

## Visualization

`spaCy` has a built-in visualization package called `displaCy` that plots sentence dependencies and entity recognition.

### Dependency

Dependency parsing is the process of analyzing a sentence and assigning a syntactic structure to it: labeling the subject, the verb, etc., and how the different elements of the sentence depend on one another.

In [None]:
doc = nlp('The quick brown fox jumped over the lazy dog.')
options={'distance':90}
displacy.render(doc, style='dep', jupyter=True, options=options)

### Entity Recognition

Here is how `displaCy` works for named-entity recognition.

In [None]:
doc2 = nlp("TD Ameritrade, ProQuest, Google, Domino's and the University of Michigan are companies that hire data scientists in Ann Arbor")
colors = {'GPE': 'linear-gradient(0deg, #deebf7, #3182bd)',
         'ORG': 'linear-gradient(90deg, #fee6ce, #e6550d)'}
options = {'colors': colors}
displacy.render(doc2, style='ent', jupyter=True, options=options)

We can see there are a couple of errors. The model labels _TD Ameritrade_ as a person instead of an organization, and _Domino_ as a geopolitical entity (i.e. a place) instead of an organization. Since the named-entity recognizer is a statistical model making predictions, we would need to train it further to correct these errors.

For additional visualization options, visit https://spacy.io/usage/visualizers. 

## Word Count

Here we build a list of the tokenized words, without the punctuation, using a list comprehension. For large texts, you should use a generator expression instead to save memory.

In [None]:
doc = nlp('Knox in box. Fox in socks. Knox on fox in socks in box. Socks on Knox and Knox in box.')
w = [token.text for token in doc if not token.is_punct]

There is no special function in `spaCy` to count words. We just use Python's `Counter` class on the list of token texts.

In [None]:
from collections import Counter
freq = Counter(w)
freq.most_common(10)
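
For a larger text, the intermediate list can be skipped by passing a generator expression straight to `Counter`. The sketch below uses plain string splitting as a stand-in for spaCy tokens, so it runs without a language model; with a spaCy `doc`, the expression would iterate over the tokens instead.

```python
from collections import Counter

text = 'Knox in box. Fox in socks. Knox on fox in socks in box. Socks on Knox and Knox in box.'

# Stand-in tokenizer: split on whitespace and strip trailing periods.
# The generator expression yields one word at a time; no list is built.
words = (raw.strip('.') for raw in text.split())

freq = Counter(words)
print(freq.most_common(3))
```

Note that the counts are case-sensitive here, just as they are in the cell above, so `Socks` and `socks` are tallied separately.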

# `textacy` and N-grams

There is no built-in method in spaCy to do generalized N-grams. Luckily, other packages have been built upon `spaCy` to do such processing. One of them is `textacy`. We can create a dictionary of N-grams and their frequencies, and we can request more than one N-gram size in a single call.

In [None]:
import textacy
tdoc = textacy.Doc(doc)
kwargs = {'filter_stops':False}
bow = tdoc.to_bag_of_terms(ngrams=(1,2,3), normalize='lemma', as_strings=True, **kwargs)
bow

We can then use a standard Python `Counter` to keep track of them.

In [None]:
freq = Counter(bow)
freq.most_common(10)
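
For intuition about what an N-gram is, here is a minimal standard-library sketch. It is not textacy's implementation, and the `ngrams` helper is a hypothetical name; it simply slides a window of size `n` across a list of tokens.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams in `tokens` as space-joined strings."""
    # zip over n staggered views of the token list: tokens[0:], tokens[1:], ...
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = 'knox in box fox in socks knox on fox in socks in box'.split()

bigram_freq = Counter(ngrams(tokens, 2))
print(bigram_freq.most_common(3))
```

This sketch ignores sentence boundaries, so a window can span the end of one sentence and the start of the next; whether that is acceptable depends on the analysis.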

## Exercise

Here is some Python code to read in the Dr. Seuss book, _Fox in Socks_.

In [None]:
import requests
R = requests.get('http://ai.eecs.umich.edu/people/dreeves/Fox-In-Socks.txt')
book = R.text
book = book.replace('\n\n','').replace('\n',' ').replace('  ',' ')

What are the 10 most common words, bigrams, trigrams, or 4-grams in the book, _Fox in Socks_, once you filter out stopwords? 

In [None]:
# Put Solution Here

# Word Similarity

> Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.

In [None]:
# Example here
doc = nlp('two six sick chicks')
words = len(doc)
for i in range(words):
    for j in range(i+1,words):
        token1 = doc[i]
        token2 = doc[j]
        print(token1, token2, token1.similarity(token2))

Here is the word vector for the word chicks.

In [None]:
print(token2.text, token2.vector.shape, token2.vector)

We can get the same value by using the formula for cosine similarity.
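
For reference, the cosine similarity of two vectors $v_1$ and $v_2$ is their dot product divided by the product of their Euclidean norms:

$$
\text{similarity}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}
$$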

In [None]:
import numpy as np
v1 = token1.vector
v2 = token2.vector
similarity = np.dot(v1,v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))
similarity

**Tip**: spaCy recommends that if you want to use real word vectors, you should download the large language model version (e.g. **en_core_web_lg** for English) using the command `python -m spacy download en_core_web_lg`.

The large models have already been downloaded on the lab PCs. Just use `nlp = spacy.load('en_core_web_lg')` to load the large language model. It's about 23x larger than the small model so it might take longer to load.

# Exercise

Re-run the word similarity example using the large English language model. Notice how the similarity measures have changed. Add a fake word to the document. What is the similarity of a real word to a fake word?

In [None]:
# Solution
doc = nlp('two six sick chicks') #hint: add in a fake word here

**Tip**: You can check if a word has a vector in the large model using the `has_vector` or `is_oov` (is out-of-vocabulary) attribute:

In [None]:
doc = nlp('fake news word fakenews fakeword')
for token in doc:
    print(token.text, token.has_vector, token.is_oov)

# Gensim

`Gensim`, which describes itself as "topic modelling for humans", is a powerful vector space modeling and topic modeling toolkit commonly used for a variety of NLP tasks.

## Word2Vec

`Word2vec` is a collection of related models that are used to produce word embeddings. To create a `Word2Vec` model, `Gensim` requires a list of lists, where each inner list holds the tokens of one sentence in the document.

In [None]:
sentences=[]
doc = nlp(book)
for sentence in doc.sents:
    sentences.append([token.text.lower() for token in sentence
                      if not token.is_punct and not token.is_stop
                      and token.text not in ["\n", "'s"]])

In [None]:
from gensim.models import Word2Vec
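# Written against gensim 3.x; gensim 4 renames `size` to `vector_size` and replaces `wv.vocab` with `wv.key_to_index`
model = Word2Vec(sentences, size=100, window=5, min_count=2, workers=1)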

In [None]:
print(model)

Here is the list of words in our vocabulary.

In [None]:
words = list(model.wv.vocab)
print(words)

Here is one of the word embeddings we've created. It is of length 100 as specified.

In [None]:
print(model.wv['chicks'].shape, model.wv['chicks'])

Now we can do some word algebra (using the underlying word vectors) and other NLP word tasks.

`beetle + paddles - puddle = ?`

In [None]:
model.wv.most_similar(positive=['beetle', 'paddles'], negative=['puddle'])

In [None]:
model.wv.doesnt_match("goose duck beetle puddle".split())

In [None]:
model.wv.similarity('sick','chicks')
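
To make the word algebra concrete, here is a minimal sketch of the idea behind `most_similar`, using hand-made 2-dimensional vectors. The vectors, the words, and the `nearest` helper are all hypothetical and chosen for illustration; gensim's own implementation differs in its details (for instance, it works with normalized vectors). The sketch adds and subtracts vectors, then ranks the remaining words by cosine similarity to the result.

```python
import math

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.05],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(positive, negative):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)

# king + woman - man = ?
ranking = nearest(positive=['king', 'woman'], negative=['man'])
print(ranking)
```

With a model trained on a single short book, the real embeddings are far noisier than these toy vectors, so the results of the cells above should be read as a demonstration of the API rather than of meaningful analogies.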

## Word2Vec Visualization

Let's fit a 2-D PCA model to the word embeddings for visualization purposes.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
X = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
new_vectors = pca.fit_transform(X)

We then plot the 2-dimensional projections of the 100-dimensional vectors we started with, labeling each point with its word.

In [None]:
fig = plt.figure(figsize=(12, 12))
plt.scatter(new_vectors[:,0], new_vectors[:, 1])
for i, word in enumerate(model.wv.vocab):
    plt.annotate(word, xy=(new_vectors[i,0], new_vectors[i,1]), size=16)

# Topic Modeling

One branch of NLP is topic modeling - a topic model is a kind of statistical model that is used to uncover the abstract topics and concepts that occur in a collection of documents. Topic modeling is frequently used to discover semantic structures in a text body, and as a data-mining tool to better understand large collections of data.

Now that we've seen how spaCy processes short text segments, let's explore what happens (and what we can work with) when we examine a much larger document. To work with topic modeling, we'll begin by using `spaCy` to tokenize a text file containing US news articles.

In [None]:
%%time
filename = 'en_US.news.txt'
with open(filename, 'r', encoding='utf-8') as fin:
    data = fin.read()
data = data[:900000]

Unfortunately, at the time of writing, spaCy's medium and large English models have a bug: the stopwords appear to be missing. A few work-arounds are available (visit https://github.com/explosion/spaCy/issues/922 to see the full explanations). For now, to avoid complications related to this issue, we will reload the small language model before tokenizing our data.

In [None]:
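# 'en_core_web_sm' is the small English model; older spaCy versions also accepted the shortcut 'en'
nlp = spacy.load('en_core_web_sm')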

In [None]:
%%time
news = nlp(data)

As with any large text, we can't start analysis until we clean the data. This involves removing unimportant words and punctuation. We begin by removing __stopwords__, which are commonly used words that have little value in determining sentiment or analyzing a document. Filtering out words like ‘the’, ‘is’, and ‘are’ helps speed up processes and helps keep the data clean while allowing us to focus on more significant/rarer terms. Luckily, each token has the built-in property `is_stop` to indicate whether or not it is considered a stopword.

Sometimes, in addition to removing stopwords, we'll want to remove punctuation from a piece. We may also choose to convert all words to lowercase or standardize dates and times. Note that the cleaning process will not be the same for every text or even for every analysis of the same text - sometimes we care about capitalization and punctuation as part of the sentiment and topic modeling analysis. For now, though, we'll remove the punctuation. We'll also remove any newline characters and possessive markers (`'s`).

We need to create a list of lists for the tokens in each sentence for the next section.

In [None]:
text=[]
for sentence in news.sents:
    text.append([token.text for token in sentence
                 if not token.is_punct and not token.is_stop
                 and token.text not in ["\n", "'s", "The"]])
text[:5]

Note that one of the work-arounds available for the missing stopwords issue (in the large English model) is to load the `NLTK` stopword list and filter based on that. This involves downloading a separate library (and some additional files) and importing it like so:
> `from nltk.corpus import stopwords`
>
> `stp = set(stopwords.words('english'))`

From there, you can filter by only selecting tokens based on `not in stp`.
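
A minimal sketch of that filtering step follows. To keep it runnable without NLTK or spaCy, it uses a small hand-written stopword set and plain strings in place of tokens; both are stand-ins for illustration. NLTK's English stopwords are lowercase, so the sketch lowercases each word before the membership test.

```python
# Stand-in for set(stopwords.words('english')); a tiny hypothetical subset.
stp = {'the', 'is', 'are', 'a', 'of', 'and', 'in'}

sentence = ['The', 'weather', 'in', 'Detroit', 'is', 'cold', 'and', 'windy']

# Compare in lowercase so that 'The' is filtered along with 'the'.
filtered = [word for word in sentence if word.lower() not in stp]
print(filtered)
```

With spaCy tokens, the same test would be applied to the lowercased token text rather than to a plain string.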

## LDA

Now that we've prepared our information for analysis, we'll use `gensim` to perform topic modelling. 

In [None]:
from gensim import corpora, models

We begin by creating a `gensim` dictionary containing (key, value) pairs which represent (word, integer id) respectively.

In [None]:
dictionary = corpora.Dictionary(text)
dictionary.token2id

> To convert documents to vectors, we’ll use a document representation called bag-of-words. In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

> “How many times does the word system appear in the document? Once.”

The method `doc2bow` converts the document to a bag-of-words model by counting the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector.

In [None]:
corpus = [dictionary.doc2bow(txt) for txt in text]
corpus[:3]
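
To show what this sparse representation looks like, here is a minimal standard-library sketch that mimics the shape of the output of `doc2bow`. It is an illustration, not gensim's implementation; the `build_token2id` and `to_bow` helpers are hypothetical names, and the integer ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct word, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for word in doc:
            token2id.setdefault(word, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (word_id, count) pairs for the known words in `doc`."""
    counts = Counter(token2id[word] for word in doc if word in token2id)
    return sorted(counts.items())

documents = [['fox', 'socks', 'box'], ['knox', 'box', 'fox', 'fox']]
token2id = build_token2id(documents)
bow_corpus = [to_bow(doc, token2id) for doc in documents]
print(token2id)
print(bow_corpus)
```

The vector is sparse because words that do not occur in a document get no entry at all, rather than an explicit count of zero.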

Latent Dirichlet Allocation (LDA) is a generative statistical model. It treats each document as a mixture of a small number of topics and each topic as a probability distribution over words, and it infers both from the word counts in the corpus. (It should not be confused with linear discriminant analysis, which shares the abbreviation.) We use it here to try to identify topics in our corpus.

In [None]:
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=15)

Below, we call `show_topics` to view a subset of the topics (10 by default). Each topic is shown with the top 10 keywords that the LDA model has found for it. Each word's coefficient represents its weight within the topic, and the words are listed in descending order by weight.

In [None]:
lda.show_topics()

From these keywords, we can infer what general concept each "topic" is talking about. Take a minute to think about what each collection of keywords represents.

## Streaming Text

Above, we truncated the text document for the sake of time and memory. `spaCy` will raise an error if you feed it a string longer than its default limit of 1 million characters. We can get around the memory issue by streaming the text one line at a time with the following generator function, which achieves the same objective.

In [None]:
def sentence_generator(filename, nlp):
    # Stream the file line by line; `with` closes it when the generator finishes
    with open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            news = nlp(line)
            for sentence in news.sents:
                yield [token.text for token in sentence
                       if not token.is_punct and not token.is_stop
                       and token.text not in ["\n", "'s", "The"]]

We create a generator by calling the function. The function returns a generator object (iterator) which we can iterate over (one value at a time) using `next` or a `for` loop.

In [None]:
memory_friendly_text = sentence_generator(filename,nlp)
memory_friendly_text

Let's print out the contents.

In [None]:
print(next(memory_friendly_text))
print(next(memory_friendly_text))
print(next(memory_friendly_text))
print(next(memory_friendly_text))

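To use it when constructing an LDA model, replace the variable `text` with the generator object `memory_friendly_text`. Note that a generator can be consumed only once, so call `sentence_generator` again to get a fresh generator for each pass over the data (one to build the dictionary, another to build the corpus).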
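
Because building an LDA model needs one pass for the dictionary and another for the corpus, it helps to see how a generator behaves on a second pass. The example below uses an in-memory list of lines and whitespace splitting as stand-ins for the file and for spaCy; the `line_tokens` function and the `ReiterableCorpus` class are hypothetical names. Wrapping the generator function in a class with `__iter__` gives an object that starts a fresh pass every time it is looped over.

```python
def line_tokens(lines):
    """Yield a list of lowercase tokens for each line (stand-in for spaCy)."""
    for line in lines:
        yield line.lower().split()

class ReiterableCorpus:
    """Start a new generator on every iteration, so multiple passes work."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        return line_tokens(self.lines)

lines = ['Fox in socks', 'Knox in box']

one_shot = line_tokens(lines)
first_pass = list(one_shot)
second_pass = list(one_shot)      # the generator is already exhausted

corpus_stream = ReiterableCorpus(lines)
pass_a = list(corpus_stream)
pass_b = list(corpus_stream)      # a fresh generator each time

print(len(first_pass), len(second_pass))
print(len(pass_a), len(pass_b))
```

The trade-off is that every pass repeats the tokenization work, which is the price of not holding the whole text in memory.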

## Visualization for Topic Modeling

Now that we have our topics and our keyword collections, we can present them in a visualization to get a different view on how important each topic is to the overall document, and how closely these topics are related. We will use the `pyLDAvis` library which is a port of the R package LDAvis.

In [None]:
# In pyLDAvis 3.x this module was renamed to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

pyLDAvis is compatible with `gensim`, `scikit-learn`, and `GraphLab Create`. Here is how you would use it with `gensim`. We need a `gensim` LDA model, corpus and dictionary - we'll use the ones we've just built.

In [None]:
viz = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
viz

Each bubble on the large plot represents a topic. The larger the bubble, the more frequently that topic is referenced. Ideally, we want large topic bubbles that are well separated and do not overlap with one another. A model with too many topics will have many overlaps and will be made up of many small bubbles clustered in one region of the chart. It appears that this particular dataset (with this analysis) doesn't have as clear separations as we'd like, and falls into the "too many topics" category.

The visualization is also interactive - if you hover over a bubble, it highlights the terms that the topic includes, and shows their frequency (both overall and within the selected topic).

Visit https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ for additional resources and work with topic modelling in `Gensim`.

## Sentiment Analysis

`TextBlob` is another NLP library in Python. It is written entirely in Python, which costs it some speed and makes it less suited to heavy processing. TextBlob is a bit of a "best of the best" toolkit - it pulls some of the most effective and useful methods from packages like NLTK and Pattern. Here, we'll use it to perform sentiment analysis.

In [None]:
from textblob import TextBlob

In [None]:
flowers = TextBlob('These flowers are beautiful, they brighten up the place so much! I really love them.')
angry = TextBlob("You make me so angry; I just can't stand it.")

The `sentiment` attribute of a TextBlob "returns a tuple of the form (polarity, subjectivity) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective."

Let's break this into its two parts: polarity and subjectivity. Polarity analysis takes into account the number of positive or negative terms that appear in a given sentence - words like "beautiful" and "brighten" likely contribute to a higher score, while words like "angry" and "can't" would shift the polarity in a negative direction. Subjectivity, on the other hand, is almost like an error bound on the sentiment analysis. The subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences. Words and phrases that are labeled as more subjective may shift their polarity depending on their context, making it more difficult to determine the true sentiment of the phrase.

In [None]:
print(flowers.sentiment)
print(angry.sentiment)

To see the breakdown of the numbers, we can use the `sentiment_assessments` attribute to view which words contributed to the polarity and subjectivity scores. The overall score is the average over the tuples in the assessment.

In [None]:
print(flowers.sentiment_assessments)
print(angry.sentiment_assessments)
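
The averaging step can be sketched by hand. The example below assumes each assessment has the layout `(words, polarity, subjectivity, label)` as printed by the cell above; the numbers are made up for illustration and are not TextBlob output.

```python
# Hypothetical assessments in the (words, polarity, subjectivity, label) layout.
assessments = [
    (['beautiful'], 0.8, 1.0, None),
    (['really', 'love'], 0.5, 0.6, None),
]

# Average the second and third fields across all assessed chunks.
polarity = sum(a[1] for a in assessments) / len(assessments)
subjectivity = sum(a[2] for a in assessments) / len(assessments)
print(polarity, subjectivity)
```

Words that do not appear in any assessment contribute nothing to either average, which is why a long neutral sentence with one strong word can still score as strongly polar.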

## Exercise

Test out a couple of different sentences and examine their sentiment analysis output. Try to write:
* A sentence that scores a neutral polarity (aim for a range in [-0.2, 0.2])
* A sentence that scores as highly _objective_
* A sentence that scores as highly _subjective_

In [None]:
#Solution

### Sentiment Analysis for US News

Now that we've seen how TextBlob sentiment analysis works on a few lines of text, we can see what polarity and subjectivity scores it assigns to the news dataset that we've been working with.

In [None]:
news_txt = TextBlob(data)
sents = news_txt.sentiment_assessments

In [None]:
print(sents.polarity)
print(sents.subjectivity)
sents.assessments[:20]  # first 20 (words, polarity, subjectivity, label) tuples

# References

* spaCy: https://spacy.io/usage/spacy-101
* textacy: https://textacy.readthedocs.io/en/stable/
* Gensim: https://radimrehurek.com/gensim/intro.html
* TextBlob: http://textblob.readthedocs.io/en/dev/quickstart.html
* pyLDAvis: http://pyldavis.readthedocs.io/en/latest/