# Introduction to Natural Language Processing with spaCy


## What is NLP?

Natural language processing is a way for computers to analyze, understand, and derive meaning from human language. With appropriate use and organization, NLP can be used to help developers perform a variety of tasks, including summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation for a given dataset or group of texts.

## What can we use it for?

Things like chat bots, voice-to-text software, and customer service sentiment analysis are examples of applications of NLP. The use of (and development of tools for) NLP has experienced rapid growth over the last decade and is currently being integrated into a variety of fields. 

For example, a retailer on Amazon may run a sentiment analysis on all comments for a certain product. This could reveal general attitudes towards both the company and the item in question, ultimately leading to improvements and adjustments. As another example, Siri, Apple's personal voice assistant, is the culmination of years of work in NLP: Siri can recognize, interpret, and respond to a wide array of questions and comments. This not only improves the iPhone user experience in general but also increases the accessibility of the product.

## Basic Terminology

__Corpus__ (plural: corpora) is defined as a large collection of linguistic data. In other words, corpora serve as our datasets: the information we process and train models on.

__Tokenization__ is the process of segmenting text into words, phrases, sentences etc. This is one of the first steps in processing the text into workable components.

__Part-of-speech (POS) tagging__ involves assigning word types (parts of speech) to tokens, like _verb_, _noun_, _preposition_, etc. 

__Dependency Parsing__ is the process of assigning syntactic dependency labels that describe the relations between individual tokens. For example, in the sentence _The brown dog ran through the park_, dependency parsing would recognize that _brown_ is modifying the subject of the sentence, _dog_.

__Lemmatization__ is the process of reducing a word to its base form. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.

__Sentence Boundary Detection (SBD)__ is responsible for finding and segmenting individual sentences within a text. 

__Named Entity Recognition (NER)__ involves labelling named “real-world” objects, like persons, companies, or locations. For example, in some contexts we want "Amazon" to be recognized as a technology company rather than as the rainforest.

__Similarity__ is the process of comparing words, phrases, and documents to see how similar they are to each other. 


## What is spaCy?

+ Free, open-source software
+ Describes itself as "industrial-strength natural language processing in Python"

In [7]:
import spacy
from spacy import displacy

We begin by importing the necessary packages, `spaCy` being the most notable. We then load the English statistical model (which is distributed as its own Python package); this is the main object we will be working with. Using a model for a specific language enables spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. Though some of spaCy's features are available without a language model, most of them require one.

In [8]:
nlp = spacy.load('en')

First, we'll walk through some of the basic functions of spaCy, based on the definitions above. We begin by reading in a few simple sentences, just to get a feel for how the package works.

In [83]:
doc = nlp(u"I'm having such a wonderful day in Ann Arbor! It is sunny out and there are flowers. Do you want to get some ice cream with Ellen?")

> When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer.

![spacy_pipeline](img/pipeline.svg)

You can create your own customized processing pipeline by removing or adding components, but we do not cover that in this workshop.

## Tokenizing

When we read something in with `nlp`, spaCy automatically tokenizes it - for example, it breaks _I'm_ into _I_ and _'m_, and each word is its own element. The object `doc` can be indexed to access individual tokens. 

In [76]:
for i, token in enumerate(doc[:15]):
    print(i, token)

0 I
1 'm
2 having
3 such
4 a
5 wonderful
6 day
7 in
8 Ann
9 Arbor
10 !
11 It
12 is
13 sunny
14 out


We can also see how well the sentence boundary detection works. We view all the individual sentences from the paragraph by iterating through the `sents` attribute.

In [77]:
for sent in doc.sents:
    print(sent)

I'm having such a wonderful day in Ann Arbor!
It is sunny out and there are flowers.
Do you want to get some ice cream with Ellen?


You can look at the token tags and POS using the `tag_` and `pos_` attributes. If you leave off the `_`, you will get the integer equivalent. A `token` object has a lot of attributes. Here we just look at a few. We'll use `pandas` to put it in a DataFrame for easier visualization. 

In [85]:
import pandas as pd
list_tokens = []
for token in doc:
    list_tokens.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_stop, token.is_sent_start))
df = pd.DataFrame(list_tokens, columns=['text','lemma','POS','tag','dependency','stopword','sentence_start'])
df

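| | text | lemma | POS | tag | dependency | stopword | sentence_start |
|---|---|---|---|---|---|---|---|
| 0 | I | -PRON- | PRON | PRP | nsubj | False | |
| 1 | 'm | be | VERB | VBP | aux | False | |
| 2 | having | have | VERB | VBG | ROOT | False | |
| 3 | such | such | ADJ | PDT | predet | True | |
| 4 | a | a | DET | DT | det | True | |
| 5 | wonderful | wonderful | ADJ | JJ | amod | False | |
| 6 | day | day | NOUN | NN | dobj | False | |
| 7 | in | in | ADP | IN | prep | True | |
| 8 | Ann | ann | PROPN | NNP | compound | False | |
| 9 | Arbor | arbor | PROPN | NNP | pobj | False | |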


You can look at the named entities by iterating through the `ents` attribute of the document.

In [84]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Ann Arbor GPE
Ellen PERSON


### spaCy `explain`

Since spaCy is filled with plenty of useful tools, it's easy to lose track of what all the different abbreviations stand for. The `spacy.explain` function returns a short description of a part-of-speech tag, dependency label, or entity label.

In [88]:
part_of_speech = ['GPE','ADP','dobj','cc']
for pos in part_of_speech:
    print(pos, spacy.explain(pos))

GPE Countries, cities, states
ADP adposition
dobj direct object
cc coordinating conjunction


# Exercise

Print out all the proper nouns (PROPN) and persons (PERSON) in this paragraph.

In [132]:
paragraph = 'Shapovalov arrived in the Spanish capital without an ATP World Tour match win on clay. In fact, he owned just a 1-4 clay-court record on the ATP Challenger Tour. But the Canadian found some of his best tennis to become the youngest quarter-finalist and semi-finalist in event history. Against Zverev, he was attempting to become the youngest Masters 1000 finalist since 18-year-old Richard Gasquet battled to the championship match at Hamburg in 2005.'

In [137]:
# Solution
tennis = nlp(paragraph)
for entity in tennis.ents:
    if entity.label_ == 'PERSON':
        print(entity.text, entity.label_)
print()
for token in tennis:
    if token.pos_ == 'PROPN':
        print(token.text, token.pos_)

Shapovalov PERSON
Against Zverev PERSON
Richard Gasquet PERSON

Shapovalov PROPN
ATP PROPN
World PROPN
Tour PROPN
ATP PROPN
Challenger PROPN
Tour PROPN
Canadian PROPN
Zverev PROPN
Richard PROPN
Gasquet PROPN
Hamburg PROPN


## Visualization

`spaCy` has a built-in visualization package called `displaCy` that plots sentence dependencies and entity recognition.

### Dependency

Dependency parsing is the process of analyzing a sentence and assigning a syntactic structure to it: labeling the subject, the verb, etc., and how the different elements of the sentence depend on one another.

In [90]:
doc = nlp('The quick brown fox jumped over the lazy dog.')
options={'distance':100}
displacy.render(doc, style='dep', jupyter=True, options=options)

### Entity Recognition

Here is how displaCy works for named-entity recognition.

In [44]:
doc2 = nlp("TD Ameritrade, ProQuest, Google, Domino's and the University of Michigan are companies that hire data scientists in Ann Arbor")
colors = {'GPE': 'linear-gradient(0deg, #deebf7, #3182bd)',
         'ORG': 'linear-gradient(90deg, #fee6ce, #e6550d)'}
options = {'colors': colors}
displacy.render(doc2, style='ent', jupyter=True, options=options)

We can see there are a couple of errors. It thinks _TD Ameritrade_ is a person instead of an organization and _Domino_ is a geopolitical entity (i.e. a place) instead of an organization. Since the named-entity recognizer is a statistical model making predictions, we would need to train it further to correct these errors.

### Word Count

Here we build a list of the tokenized words, without punctuation, using a list comprehension. For large texts, you should use a generator expression instead to save memory.

In [92]:
doc = nlp('Knox in box. Fox in socks. Knox on fox in socks in box. Socks on Knox and Knox in box.')
w = [token.text for token in doc if not token.is_punct]

There is no special function in `spaCy` to count words. We just apply the `Counter` class from Python's `collections` module to the token texts.

In [93]:
from collections import Counter
freq = Counter(w)
freq.most_common(10)

[('in', 5),
 ('Knox', 4),
 ('box', 3),
 ('socks', 2),
 ('on', 2),
 ('Fox', 1),
 ('fox', 1),
 ('Socks', 1),
 ('and', 1)]
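
The counts above treat `Fox` and `fox` as different words. As a minimal sketch of the generator-expression approach mentioned earlier, the block below lower-cases each word while counting, so no intermediate list is built. The `tokens` list is typed in by hand as a stand-in for the `token.text` values, so that the sketch runs without spaCy.

```python
from collections import Counter

# Hand-typed stand-in for the punctuation-free token texts of the example.
tokens = ['Knox', 'in', 'box', 'Fox', 'in', 'socks',
          'Knox', 'on', 'fox', 'in', 'socks', 'in', 'box',
          'Socks', 'on', 'Knox', 'and', 'Knox', 'in', 'box']

# The generator expression feeds Counter one lower-cased word at a time.
freq_lower = Counter(word.lower() for word in tokens)
print(freq_lower.most_common(5))
```

With a spaCy `Doc`, the same idea would read `Counter(token.text.lower() for token in doc if not token.is_punct)`.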

## N-Grams

An n-gram is a contiguous sequence of _n_ tokens. Bigrams (n = 2) and trigrams (n = 3) are the most commonly used.
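
One possible way to build n-grams is sketched below, using only the standard library. The helper name `ngrams` and the hand-typed, lower-cased `words` list (taken from the Knox/fox sentence above) are assumptions made for illustration. Note that this sketch ignores sentence boundaries, so some n-grams span two sentences.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a sequence of tokens."""
    # zip over n copies of the list, each shifted by one more position.
    return list(zip(*(tokens[i:] for i in range(n))))

# Lower-cased, punctuation-free words from the Knox/fox example,
# typed in by hand so the sketch runs without spaCy.
words = ("knox in box fox in socks knox on fox in socks in box "
         "socks on knox and knox in box").split()

bigram_freq = Counter(ngrams(words, 2))
trigram_freq = Counter(ngrams(words, 3))
print(bigram_freq.most_common(3))
print(trigram_freq.most_common(3))
```

To use this with spaCy, pass in a list of token texts such as `[token.text for token in doc if not token.is_punct]`.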

# Exercise

Here is some Python code to read in the Dr. Seuss book, Fox in Socks.

In [139]:
import requests
R = requests.get('http://ai.eecs.umich.edu/people/dreeves/Fox-In-Socks.txt')
book = R.text
book = book.replace('\n\n','').replace('\n',' ').replace('  ',' ')

What are the most common words, bigrams, and trigrams in the book, Fox in Socks? You can choose whether to account for capitalization, pronouns, stopwords, etc.

In [29]:
# Solution 

[('sir', 36),
 ("'s", 35),
 ('a', 25),
 ('and', 22),
 ('in', 20),
 ('socks', 20),
 ('on', 16),
 ('fox', 15),
 ('knox', 15),
 ('i', 12)]

# Word Similarity

In [None]:
# Example here

> Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:

In [145]:
# vector example here like doc[0].vector

array([  4.21786070e-01,   1.52834129e+00,  -6.49869680e-01,
         2.09754634e+00,   3.07454050e-01,   4.30944157e+00,
        -7.23396778e-01,  -3.03867483e+00,  -9.68035460e-01,
        -2.16291070e+00,  -8.16519499e-01,   1.05893505e+00,
        -8.97945404e-01,  -8.34591746e-01,  -2.29067397e+00,
        -1.77496433e-01,   7.76707709e-01,  -1.63465762e+00,
         1.60864425e+00,  -1.50794339e+00,  -3.20733786e+00,
         3.34711373e-01,   1.04778934e+00,  -1.06742835e+00,
        -4.02404070e+00,   4.31467474e-01,  -1.28863478e+00,
         1.15749598e+00,  -2.30950618e+00,  -1.49187148e-01,
        -4.29153174e-01,  -9.91824329e-01,   5.18571329e+00,
        -3.34975576e+00,  -1.95356190e+00,   2.24396729e+00,
        -8.12131286e-01,   5.15642524e-01,   2.54947376e+00,
        -2.06897402e+00,  -5.81855595e-01,   1.20752525e+00,
         5.31350136e-01,  -1.04145026e+00,  -1.51593256e+00,
         2.38969564e+00,  -6.87073827e-01,   3.03910208e+00,
         7.13527083e-01,
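
spaCy's similarity scores are, by default, based on the cosine similarity between vectors like the one above. The block below is a minimal sketch of that calculation in plain Python. The three-dimensional `toy_vectors` are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "word vectors" (hypothetical values).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(round(cat_dog, 3), round(cat_car, 3))
```

Vectors that point in similar directions score close to 1, which is why the toy "cat" and "dog" vectors come out as more similar than "cat" and "car".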

# Topic Modeling

One branch of NLP is topic modeling - a topic model is a kind of statistical model that is used to uncover the abstract topics and concepts that occur in a collection of documents. Topic modeling is frequently used to discover semantic structures in a text body, and as a data-mining tool to better understand large collections of data.

Now that we've seen how spaCy processes short text segments, let's explore what happens (and what we can work with) when we examine a much larger document. To work with topic modeling, we'll begin by using `spaCy` to tokenize a text file containing US news articles.

In [68]:
%%time
import os
filename = r'C:\Users\caoa\Box Sync\coursera\Capstone\SwiftKeyProject\final\en_US\en_US.news.txt'
#filename = 'en_US.news.txt'
with open(filename, 'r', encoding='utf-8') as fin:
    data = fin.read()
data = data[:900000]

Wall time: 1.31 s


In [69]:
%%time
news = nlp(data)

Wall time: 41.7 s


As with any large text, we can't start analysis until we clean the data. This involves removing unimportant words and punctuation. We begin by removing __stopwords__, which are commonly used words that have little value in determining sentiment or analyzing a document. Filtering out words like ‘the’, ‘is’, and ‘are’ helps speed up processing and helps keep the data clean while allowing us to focus on more significant or rarer terms. Luckily, each token has the built-in property `is_stop` to indicate whether or not it is considered a stopword.

Sometimes, in addition to removing stopwords, we'll want to remove punctuation from a piece. We may also choose to convert all words to lowercase or standardize dates and times. Note that the cleaning process will not be the same for every text or even for every analysis of the same text - sometimes we care about capitalization and punctuation as part of the sentiment and topic modeling analysis. For now, though, we'll remove the punctuation. We'll also remove any newline and possessive characters.

We need to create a list of lists for the tokens in each sentence for the next section.

In [125]:
text=[]
for sentence in news.sents:
    text.append([token.text for token in sentence
                 if not token.is_punct and not token.is_stop and token.text not in ["\n", "'s"]])
text[:5]

[['He', "n't", 'home', 'apparently'],
 ['The', 'St.', 'Louis', 'plant', 'close'],
 ['It', 'die', 'old', 'age'],
 ['Workers',
  'making',
  'cars',
  'onset',
  'mass',
  'automotive',
  'production',
  '1920s'],
 ['WSU', 'plans', 'quickly', 'hot', 'topic', 'local', 'online', 'sites']]

## Gensim

`Gensim`, which describes itself as "topic modelling for humans", is a powerful vector space modeling and topic modeling toolkit that is commonly used for a variety of NLP tasks. Now that we've prepared our data for analysis, we'll use `Gensim` to perform topic modelling.

In [75]:
from gensim import corpora, models



We begin by creating a `gensim` dictionary containing (key, value) pairs which represent (word, integer id) respectively.

In [76]:
dictionary = corpora.Dictionary(text)
dictionary.token2id

{'He': 0,
 'apparently': 1,
 'home': 2,
 "n't": 3,
 'Louis': 4,
 'St.': 5,
 'The': 6,
 'close': 7,
 'plant': 8,
 'It': 9,
 'age': 10,
 'die': 11,
 'old': 12,
 '1920s': 13,
 'Workers': 14,
 'automotive': 15,
 'cars': 16,
 'making': 17,
 'mass': 18,
 'onset': 19,
 'production': 20,
 'WSU': 21,
 'hot': 22,
 'local': 23,
 'online': 24,
 'plans': 25,
 'quickly': 26,
 'sites': 27,
 'topic': 28,
 'Though': 29,
 'applauded': 30,
 'biomedical': 31,
 'building': 32,
 'center': 33,
 'deplored': 34,
 'loss': 35,
 'new': 36,
 'people': 37,
 'potential': 38,
 'Alaimo': 39,
 'Group': 40,
 'Holly': 41,
 'Mount': 42,
 'Trenton': 43,
 'Water': 44,
 'Works': 45,
 'contract': 46,
 'evaluate': 47,
 'fall': 48,
 'improvements': 49,
 'suggest': 50,
 '$': 51,
 '4,500': 52,
 'But': 53,
 'June': 54,
 'PAC': 55,
 'Partners': 56,
 'Progress': 57,
 'action': 58,
 'campaign': 59,
 'committee': 60,
 'donated': 61,
 'early': 62,
 'employees': 63,
 'finance': 64,
 'political': 65,
 'records': 66,
 'released': 67,
 'to

> To convert documents to vectors, we’ll use a document representation called bag-of-words. In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

> “How many times does the word system appear in the document? Once.”

The method `doc2bow` converts a document to a bag-of-words representation: it counts the number of occurrences of each distinct word, converts each word to its integer word id, and returns the result as a sparse vector of (word id, count) pairs.

In [84]:
corpus = [dictionary.doc2bow(txt) for txt in text]
corpus[:3]

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1)]]
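
To make the bag-of-words idea concrete, here is a plain-Python approximation of the two steps above (building the word-to-id mapping, then counting). It is only a sketch: the helper names `build_token2id` and `to_bow` are hypothetical, and `gensim`'s real implementation differs in its details.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct token, document by document."""
    token2id = {}
    for doc in documents:
        # New tokens within a document receive ids in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sparse vector: sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# The first three tokenized sentences shown earlier.
docs = [
    ['He', "n't", 'home', 'apparently'],
    ['The', 'St.', 'Louis', 'plant', 'close'],
    ['It', 'die', 'old', 'age'],
]
token2id = build_token2id(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(bow_corpus)
```

A word that appears twice in a document would show up with a count of 2, and words missing from the mapping are skipped.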

Latent Dirichlet Allocation (LDA) is a generative statistical model that treats each document as a mixture of topics and each topic as a probability distribution over words. (It should not be confused with linear discriminant analysis, a classification technique that shares the same abbreviation.) We use it here to try to identify topics in our corpus.

In [129]:
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=30)

Below, we call `show_topics` to view a subset (10 by default) of the 30 topics. Each topic is summarized by the top 10 keywords that the LDA model has found for it. Each coefficient represents the keyword's weight within the topic, and the keywords are listed in descending order by weight.

In [132]:
a = lda.show_topics()

In [133]:
a

[(3,
  '0.012*"game" + 0.011*"It" + 0.009*"district" + 0.008*"additional" + 0.008*"Mountain" + 0.008*"14" + 0.008*"spent" + 0.008*"25" + 0.007*"turned" + 0.007*"ball"'),
 (10,
  '0.013*"said" + 0.011*"The" + 0.010*"bad" + 0.009*"year" + 0.008*"government" + 0.007*"coming" + 0.007*"list" + 0.007*"higher" + 0.006*"president" + 0.006*"placed"'),
 (27,
  '0.016*"New" + 0.015*"’s" + 0.009*"director" + 0.008*"Jersey" + 0.008*"room" + 0.008*"nation" + 0.007*"old" + 0.007*"chief" + 0.007*"executive" + 0.007*"public"'),
 (19,
  '0.059*"I" + 0.028*"n\'t" + 0.019*"know" + 0.016*"said" + 0.012*"want" + 0.011*"think" + 0.011*"boy" + 0.010*"But" + 0.009*"\'ll" + 0.008*"playing"'),
 (20,
  '0.011*"said" + 0.010*"Park" + 0.010*"early" + 0.009*"available" + 0.007*"company" + 0.007*"Mr." + 0.007*"quarter" + 0.006*"Service" + 0.006*"With" + 0.006*"agencies"'),
 (1,
  '0.015*"The" + 0.013*"large" + 0.012*"percent" + 0.011*"way" + 0.009*"One" + 0.009*"10" + 0.009*"oil" + 0.009*"points" + 0.008*"different" 

From these buzzwords, we can infer what general concept each "topic" is talking about. Take a minute to think about what each collection of keywords represents.
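
If you want to work with these weights programmatically, one option is to split a formatted topic string into (word, weight) pairs. The sketch below does this with a regular expression. The helper name `parse_topic` is hypothetical, and the sample string is the first half of topic 19 from the output above. `gensim` can also return such pairs directly (for example through `lda.show_topic(topic_id)`), which is usually preferable to parsing strings.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.012*"game" + 0.011*"It"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]*)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First five keywords of topic 19, copied from the output above.
topic_19 = '0.059*"I" + 0.028*"n\'t" + 0.019*"know" + 0.016*"said" + 0.012*"want"'

for word, weight in parse_topic(topic_19):
    print(f"{word:>6}  {weight:.3f}")
```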

## Visualizations for Topic Modeling

Now that we have our topics and our keyword collections, we can present them in a visualization to get a different view on how important each topic is to the overall document, and how closely these topics are related. We will use the `pyLDAvis` library which is a port of the R package LDAvis.

In [126]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

pyLDAvis is compatible with `gensim`, `scikit-learn`, and `GraphLab Create`. Here is how you would use it with `gensim`. We need a `gensim` LDA model, corpus and dictionary - we'll use the ones we've just built.

In [130]:
viz = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
viz



Each bubble on the large plot represents a topic. The larger the bubble, the more frequently that topic is referenced. Ideally, we want large topic bubbles that are well separated and do not overlap with one another. A model with too many topics will have many overlaps and will be made up of many small sized bubbles clustered in one region of the chart. It appears that this particular dataset (with this analysis) doesn't have as clear separations as we'd like, and falls into the "too many topics" category.

The visualization is also interactive - if you hover over a bubble, it highlights the terms that the topic includes, and shows their frequency (both overall and within the selected topic).

Visit https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ for additional resources and work with topic modelling in `Gensim`.

## Sentiment Analysis

`TextBlob` is another NLP library for Python. It is written entirely in Python, which costs it some speed when processing large amounts of text. TextBlob is a bit of a "best of the best" toolkit: it pulls some of the most effective and useful methods from packages like NLTK and Pattern. Here, we'll use it to perform sentiment analysis.

In [135]:
from textblob import TextBlob

In [140]:
#variety = nlp(u'I have never been to Paris, but I would love to go! How often do you travel there?')
flowers = TextBlob('These flowers are beautiful, they brighten up the place so much! I really love them.')
angry = TextBlob('You make me so angry; I just can\'t stand it.')

The `sentiment` attribute of a TextBlob "returns a tuple of the form (polarity, subjectivity) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective."

Let's break this into its two parts: polarity and subjectivity. Polarity analysis takes into account the number of positive or negative terms that appear in a given sentence: words like "beautiful" and "love" contribute to a higher score, while words like "angry" shift the polarity in a negative direction. Subjectivity, on the other hand, is almost like an error bound on the sentiment analysis. The subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences. Words and phrases that are labeled as more subjective may shift their polarity depending on their context, making it more difficult to determine the true sentiment of the phrase.

In [138]:
print(flowers.sentiment)
print(angry.sentiment)

Sentiment(polarity=0.675, subjectivity=0.8)
Sentiment(polarity=-0.5, subjectivity=1.0)


To see the breakdown of the numbers, we can use the `sentiment_assessments` attribute to view which words contributed to the polarity and subjectivity scores. Each assessment is a tuple of (words, polarity, subjectivity, label), and the overall scores are the averages of the polarity and subjectivity values across all assessments.

In [139]:
print(flowers.sentiment_assessments)
print(angry.sentiment_assessments)

Sentiment(polarity=0.675, subjectivity=0.8, assessments=[(['beautiful'], 0.85, 1.0, None), (['much', '!', 'really', 'love'], 0.5, 0.6, None)])
Sentiment(polarity=-0.5, subjectivity=1.0, assessments=[(['angry'], -0.5, 1.0, None)])
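
The averaging can be checked by hand. The sketch below recomputes the overall scores for the flowers sentence from the assessments printed above. The helper name `average_sentiment` is hypothetical, and returning `(0.0, 0.0)` for an empty list is an assumption made so the function never divides by zero.

```python
def average_sentiment(assessments):
    """Average polarity and subjectivity over (words, polarity, subjectivity, label) tuples."""
    if not assessments:
        return 0.0, 0.0  # assumption: no assessed words means neutral scores
    polarities = [polarity for _, polarity, _, _ in assessments]
    subjectivities = [subjectivity for _, _, subjectivity, _ in assessments]
    n = len(assessments)
    return sum(polarities) / n, sum(subjectivities) / n

# Assessments copied from the flowers output above.
flowers_assessments = [
    (['beautiful'], 0.85, 1.0, None),
    (['much', '!', 'really', 'love'], 0.5, 0.6, None),
]

polarity, subjectivity = average_sentiment(flowers_assessments)
print(polarity, subjectivity)
```

The result agrees with the `polarity=0.675, subjectivity=0.8` reported by TextBlob for that sentence.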


## Exercise

Test out a couple of different sentences and examine their sentiment analysis output. Try to write:
* A sentence that scores a neutral polarity (aim for a range in [-0.2, 0.2])
* A sentence that scores as highly _objective_
* A sentence that scores as highly _subjective_

In [25]:
net = TextBlob("I mean, it's pretty normal; I guess it's okay enough.")
net.sentiment_assessments

Sentiment(polarity=0.11750000000000001, subjectivity=0.6675, assessments=[(['mean'], -0.3125, 0.6875, None), (['pretty'], 0.25, 1.0, None), (['normal'], 0.15, 0.6499999999999999, None), (['okay'], 0.5, 0.5, None), (['enough'], 0.0, 0.5, None)])

In [26]:
obj = TextBlob("I really don't know...")
obj.sentiment_assessments

Sentiment(polarity=0.2, subjectivity=0.2, assessments=[(['really'], 0.2, 0.2, None)])

In [27]:
sub = TextBlob('Things are getting really funny, kind of interesting...')
sub.sentiment_assessments

Sentiment(polarity=0.45, subjectivity=0.7999999999999999, assessments=[(['really', 'funny'], 0.25, 1.0, None), (['kind'], 0.6, 0.9, None), (['interesting'], 0.5, 0.5, None)])

### Sentiment Analysis for US News

Now that we've seen how TextBlob sentiment analysis works on a few lines of text, we can see what polarity and subjectivity scores it assigns to the news dataset that we've been working with.

In [144]:
news_txt = TextBlob(data)
news_txt.sentiment_assessments

Sentiment(polarity=0.10219093467161112, subjectivity=0.43409946705408897, assessments=[(['apparently'], 0.05, 0.35, None), (['old'], 0.1, 0.2, None), (['quickly'], 0.3333333333333333, 0.5, None), (['hot'], 0.25, 0.8500000000000001, None), (['local'], 0.0, 0.0, None), (['most'], 0.5, 0.5, None), (['new'], 0.13636363636363635, 0.45454545454545453, None), (['center'], -0.1, 0.1, None), (['many'], 0.5, 0.5, None), (['potential'], 0.0, 1.0, None), (['last'], 0.0, 0.06666666666666667, None), (['total'], 0.0, 0.75, None), (['political'], 0.0, 0.1, None), (['action'], 0.1, 0.1, None), (['early'], 0.1, 0.3, None), (['more'], 0.5, 0.5, None), (['direct'], 0.1, 0.4, None), (['difficult'], -0.5, 1.0, None), (['absolutely', 'necessary'], 0.0, 1.0, None), (['serious'], -0.3333333333333333, 0.6666666666666666, None), (['enough'], 0.0, 0.5, None), (['definitely', 'not'], -0.0, 0.5, None), (['worse'], -0.4, 0.6, None), (['certain'], 0.21428571428571427, 0.5714285714285714, None), (['few'], -0.2, 0.1, N