# Introduction to Natural Language Processing with spaCy


## What is NLP?

Natural language processing is a way for computers to analyze, understand, and derive meaning from human language. With appropriate use and organization, NLP can be used to help developers perform a variety of tasks, including summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation for a given dataset or group of texts.

## What can we use it for?

Chatbots, voice-to-text software, and customer service sentiment analysis are all applications of NLP. The use of NLP, and the development of tools for it, has grown rapidly over the last decade, and NLP is now being integrated into a variety of fields.

For example, a retailer on Amazon may run a sentiment analysis on all comments for a certain product. This could reveal general attitudes towards both the company and the item in question, ultimately leading to improvements and adjustments. As another example, Siri, Apple's personal voice assistant, is the amalgamation of years of work in NLP: Siri can recognize, conceptualize, and respond to a wide array of questions and comments. This not only improves the iPhone user experience in general but also increases the accessibility of the product.

## Basic Terminology

__Corpus__ (plural: corpora) is defined as a large collection of linguistic data. In other words, corpora serve as our datasets: the information we process and train models on.

__Tokenization__ is the process of segmenting text into words, punctuation marks, etc. This is one of the first steps in processing the text into workable components.

__Part-of-speech (POS) tagging__ involves assigning word types (parts of speech) to tokens, like _verb_, _noun_, _preposition_, etc. 

__Dependency Parsing__ is the process of assigning syntactic dependency labels that describe the relations between individual tokens. For example, in the sentence _The dog ran through the park_, dependency parsing would recognize that _dog_ is the subject of the sentence, _ran_ is the main verb, and so on.

__Lemmatization__ is defined as assigning the base form of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.

__Sentence Boundary Detection (SBD)__ is responsible for finding and segmenting individual sentences within a text. 

__Named Entity Recognition (NER)__ involves labelling named “real-world” objects, like persons, companies, or locations. For example, in some contexts we want "Amazon" to be recognized as a technology company rather than a rainforest.

__Similarity__ is the process of comparing words, text spans, and documents to see how similar they are to each other. This is helpful when trying to determine a common theme, or, on the other hand, divides between content.

__Text Classification__ assigns categories or labels to a whole document or parts of a document. 

__Rule-based Matching__ finds sequences of tokens based on their text and linguistic annotations, similar to regular expressions.

__Training__ is the process of updating and improving a statistical model’s predictions.

__Serialization__ is the process of saving objects to files or byte strings.
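
To make the idea of similarity from the definitions above more concrete, here is a minimal sketch that compares two texts by the cosine of their word-count vectors. This is a simplified, count-based stand-in written with the standard library only; spaCy's own similarity is based on word vectors, and the function name `cosine_similarity` and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words the two texts share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


park = "the dog ran through the park".split()
yard = "the dog ran through the yard".split()
menu = "ice cream is on the menu".split()

sim_close = cosine_similarity(park, yard)
sim_far = cosine_similarity(park, menu)
print(sim_close, sim_far)
```

The two dog sentences share most of their words and score much higher than the unrelated pair, which is the behavior we want from any similarity measure.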

## What is spaCy?

+ Free, open-source software
+ Designed for advanced, industrial-strength natural language processing in Python

In [15]:
import spacy
from spacy import displacy

We begin by importing the necessary packages, `spaCy` being the most notable. We then load the English statistical model, which is installed as its own Python package; this is the main object we will be working with. Using a model for a specific language enables spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. Though some of spaCy's features are available without a language model, most of its functions require one.

In [12]:
# 'en' was a shortcut link in spaCy 2; spaCy 3 requires the full package name
nlp = spacy.load('en_core_web_sm')

First, we'll walk through some of the basic functions of spaCy, based on the definitions above. We begin by reading in a few simple sentences, just to get a feel for how the package works.

In [13]:
doc = nlp(u'I\'m having such a wonderful day! It is sunny out and there are flowers. Do you want to get some ice cream?')

## Tokenizing

When we read something in with `nlp`, spaCy automatically tokenizes it - for example, it breaks _I'm_ into _I_ and _'m_, and each word is its own element. The object `doc` can be indexed to access individual tokens. We can also view all the individual sentences from the paragraph by iterating through the `sents` property.

In [14]:
print(doc[0])
print(doc[1])
print(doc[5])
print(doc[7])

for sent in doc.sents:
    print(sent)

I
'm
wonderful
!
I'm having such a wonderful day!
It is sunny out and there are flowers.
Do you want to get some ice cream?
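
Tokenization on its own does not need a statistical model. As a minimal sketch, assuming spaCy is installed, `spacy.blank("en")` creates a pipeline that contains only the English tokenizer. The names `blank_nlp` and `sample` are chosen here for illustration so they do not overwrite the `nlp` and `doc` objects used in the rest of this notebook.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical model required
blank_nlp = spacy.blank("en")
sample = blank_nlp("I'm having such a wonderful day!")

# Each token exposes its text plus simple lexical flags such as is_punct
tokens = [token.text for token in sample]
punctuation = [token.text for token in sample if token.is_punct]

print(tokens)
print(punctuation)
```

Sentence boundaries, part-of-speech tags, and entities are not available from a blank pipeline, which is why the rest of this walkthrough uses the loaded model.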


## Visualization

spaCy has a built-in visualizer called `displaCy` that plots sentence dependencies, entity recognition, and more.

### Dependency

Dependency parsing is the process of analyzing a sentence and assigning a syntactic structure to it: labeling the subject, the verb, etc., and how the different elements of the sentence depend on one another.

In [15]:
doc = nlp('The quick brown fox jumped over the lazy dog.')
# 'distance' sets the spacing between words in the rendered parse
options = {'distance': 100}
displacy.render(doc, style='dep', jupyter=True, options=options)

### Entity Recognition

In [16]:
doc2 = nlp("TD Ameritrade, ProQuest, Google, Domino's and the University of Michigan are companies that hire data scientists in Ann Arbor")
colors = {'GPE': 'linear-gradient(0deg, #deebf7, #3182bd)',
         'ORG': 'linear-gradient(90deg, #fee6ce, #e6550d)'}
options = {'colors': colors}
displacy.render(doc2, style='ent', jupyter=True, options=options)

**Tip**: For more visualizer options, see https://spacy.io/api/top-level#displacy_options

### Manual Entity Recognition

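spaCy also lets you render entities that you specify yourself, without running a model. Each entity is given by its start and end character offsets in the text, plus a label: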

In [17]:
ex = [{'text': 'But Google is starting from behind.',
       'ents': [{'start': 4, 'end': 10, 'label': 'ORG'}],
       'title': 'This is my title'}]
displacy.render(ex, style='ent', jupyter=True, manual=True)
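
The `start` and `end` values in the manual example are character offsets into the text, with `end` being exclusive. Rather than counting characters by hand, one option is to compute them. The helper below is a hypothetical convenience function (not part of spaCy) that uses only the standard library and handles the first occurrence of a phrase.

```python
def entity_span(text, phrase, label):
    """Build a displaCy-style entity dict for the first occurrence of `phrase`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    # 'end' is exclusive, so text[start:end] is exactly the phrase
    return {'start': start, 'end': start + len(phrase), 'label': label}


text = 'But Google is starting from behind.'
span = entity_span(text, 'Google', 'ORG')
print(span)
print(text[span['start']:span['end']])
```

The resulting dictionary matches the hand-written entry in the cell above and could be placed in the `ents` list in the same way.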

### spaCy Explain Method

Since spaCy is filled with plenty of useful tools, it's easy to lose track of what all the different abbreviations stand for. The `spacy.explain()` function returns a short description for a given label, such as a part-of-speech tag, a dependency label, or an entity type. For example, `spacy.explain("LANGUAGE")` returns "Any named language".

In [18]:
part_of_speech = ['DET','VERB','NOUN']
for pos in part_of_speech:
    print(pos, spacy.explain(pos))

DET determiner
VERB verb
NOUN noun


### Word Count

In [None]:
from collections import Counter
# Count every token that is not a stopword, punctuation, a newline, or a dollar amount
words = (token.text for token in doc
         if not token.is_stop and not token.is_punct
         and token.text != '\n' and token.prefix_ != '$')
freq = Counter(words)
freq.most_common(20)

# Topic Modeling

One branch of NLP is topic modeling - a topic model is a kind of statistical model that is used to uncover the abstract topics and concepts that occur in a collection of documents. Topic modeling is frequently used to discover semantic structures in a text body, and as a data-mining tool to better understand large collections of data.

Now that we've seen how spaCy processes short text segments, let's explore what happens (and what we can work with) when we examine a much larger document. To work with topic modeling, we'll begin by using spaCy to tokenize a text file.

In [20]:
#%%time
with open('en_US.news.txt','r', encoding='utf-8') as fin:
    data = fin.read()
data = data[:900000]

In [None]:
news = nlp(data)

As with any large text, we can't start analysis until we clean the data. This involves removing unimportant words and punctuation. We begin by removing __stopwords__, which are commonly used words that have little value in determining sentiment or analyzing a document. Filtering out words like ‘the’, ‘is’, and ‘are’ speeds up processing and keeps the data clean while allowing us to focus on more significant or rarer terms. Luckily, each token has the built-in property `is_stop` to indicate whether or not it is considered a stopword.

Sometimes, in addition to removing stopwords, we'll want to remove punctuation from a piece. We may also choose to convert all words to lowercase or standardize dates and times. Note that the cleaning process will not be the same for every text or even for every analysis of the same text: sometimes we care about capitalization and punctuation as part of the sentiment and topic modeling analysis. For now, though, we'll remove the punctuation. We'll also remove any newline tokens and possessive markers (`'s`).

In [None]:
text = []
for sentence in news.sents:
    # Keep tokens that are not punctuation, stopwords, newlines, or possessive markers
    text.append([token.text for token in sentence
                 if not token.is_punct and not token.is_stop
                 and token.text not in ["\n", "'s"]])
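
The cleaning choices described above (stopwords, punctuation, optional lowercasing) can be sketched without spaCy to show how each one changes the result. This is a simplified illustration using only the standard library: the `STOPWORDS` set is a tiny hypothetical list, far shorter than spaCy's, and `clean_tokens` is a name made up for this example.

```python
import string

# A tiny, hypothetical stopword set for illustration; spaCy's own list is much longer
STOPWORDS = {'the', 'is', 'are', 'and', 'it', 'a'}


def clean_tokens(tokens, lowercase=True):
    """Drop stopwords, punctuation, newlines, and possessive markers."""
    cleaned = []
    for tok in tokens:
        if tok in ('\n', "'s"):
            continue
        # Skip tokens made up entirely of punctuation characters
        if all(ch in string.punctuation for ch in tok):
            continue
        if tok.lower() in STOPWORDS:
            continue
        cleaned.append(tok.lower() if lowercase else tok)
    return cleaned


raw = ['The', 'Mayor', "'s", 'office', 'is', 'closed', '!', '\n']
lowered = clean_tokens(raw)
cased = clean_tokens(raw, lowercase=False)
print(lowered)
print(cased)
```

Keeping the `lowercase` switch as a parameter reflects the point made above: whether capitalization matters depends on the analysis, so it is worth making that decision explicit.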

## Gensim

Gensim is a powerful vector space modeling and topic modeling toolkit, commonly used for a variety of NLP tasks. Now that we've prepared our information for analysis, we'll use Gensim to perform topic modeling. 

In [None]:
from gensim import corpora, models

We begin by creating a dictionary of (key, value) pairs, where each `key` is a word and each `value` is that word's integer id.

In [None]:
dictionary = corpora.Dictionary(text)
dictionary.token2id

The method `doc2bow()` counts the number of occurrences of each distinct word in a document, converts each word to its integer id, and returns the result as a sparse vector of (id, count) pairs.

In [None]:
corpus = [dictionary.doc2bow(txt) for txt in text]
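
To see what the dictionary and the bag-of-words vectors contain, here is a minimal standard-library sketch of the same idea. It is not Gensim's implementation: the helper names `build_token2id` and `to_bow` are made up for this example, and Gensim may assign different integer ids to the same words.

```python
from collections import Counter


def build_token2id(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc_tokens in documents:
        for tok in doc_tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc_tokens, token2id):
    """Return a sparse bag-of-words vector: sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc_tokens if tok in token2id)
    return sorted(counts.items())


docs = [['sunny', 'flowers', 'sunny'], ['ice', 'cream', 'flowers']]
token2id = build_token2id(docs)
bows = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bows)
```

Only words that actually occur in a document appear in its vector, which is what makes the representation sparse.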

Latent Dirichlet allocation (LDA) is a generative statistical model that treats each document as a mixture of topics and each topic as a probability distribution over words. (It should not be confused with linear discriminant analysis, which shares the abbreviation and finds a linear combination of features that separates two or more classes.) We use it here to try to identify topics in our corpus.
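
It can help to see what "generative" means here. The sketch below is a toy illustration of the story LDA assumes, not Gensim's algorithm: to produce each word of a document, first pick a topic according to the document's topic mixture, then pick a word according to that topic's word distribution. The topics, words, and probabilities are hypothetical values invented for this example. In full LDA these distributions are themselves drawn from Dirichlet priors, and fitting a model means working backwards from the observed words to the topics.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1
topics = {
    'weather': {'sunny': 0.5, 'rain': 0.3, 'cloud': 0.2},
    'dessert': {'cream': 0.6, 'sugar': 0.4},
}


def generate_document(topic_mixture, n_words, rng):
    """Sample words: pick a topic from the mixture, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


rng = random.Random(0)
# A document that is 80% about weather and 20% about dessert
generated = generate_document({'weather': 0.8, 'dessert': 0.2}, n_words=10, rng=rng)
print(generated)
```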

In [None]:
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=30)

Below, we call `show_topics` with its default arguments to view 10 of the 30 topics that the LDA model has discovered, each summarized by its 10 highest-weighted words. The number in front of each word represents that word's weight within the topic it is associated with.

In [None]:
lda.show_topics()

From these buzzwords, we can infer what general concept each "topic" is talking about. Take a minute to write down what you think each collection of keywords represents.
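
If you would rather work with the keywords programmatically, the formatted topic strings can be split into (word, weight) pairs. This sketch assumes each topic is a string of the form `weight*"word" + weight*"word"`, which is how the formatted output is typically laid out; the helper name `parse_topic` and the example string are made up for illustration and are not results from the news corpus.

```python
import re


def parse_topic(topic_string):
    """Parse a formatted topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Made-up example in the assumed output format, not a real result
example = '0.015*"said" + 0.010*"year" + 0.008*"city"'
parsed = parse_topic(example)
print(parsed)
```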

## Visualizations for Topic Modeling

Now that we have our topics and our keyword collections, we can present them in a visualization to get a different view on how important each topic is to the overall document, and how closely these topics are related.

In [8]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

pyLDAvis is compatible with gensim, scikit-learn, and GraphLab Create. Here is how you would use it with gensim. We need a gensim LDA model, corpus and dictionary - we'll use the ones we just built.

In [10]:
viz = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
viz

Each bubble on the large plot represents a topic. The larger the bubble, the more frequently that topic is referenced. Ideally, we want large topic bubbles that are well separated and do not overlap with one another. 

Visit https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ for additional resources and further work with topic modeling in Gensim.

## Sentiment Analysis

TextBlob is another NLP library in Python. Unlike spaCy, whose core is written in Cython, TextBlob is built entirely in Python, which makes it somewhat slower on large workloads. TextBlob is a bit of a "best of the best" toolkit: it pulls some of the most effective and useful methods from packages like NLTK and Pattern. Here, we'll use it to perform sentiment analysis.

In [1]:
from textblob import TextBlob

In [12]:
#variety = nlp(u'I have never been to Paris, but I would love to go! How often do you travel there?')
flowers = TextBlob('These flowers are beautiful, they brighten up the place so much! I really love them.')
angry = TextBlob('You make me so angry; I just can\'t stand it.')

The `sentiment` attribute of a TextBlob "returns a tuple of form (polarity, subjectivity ) where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective."

Let's break this into its two parts: polarity and subjectivity. Polarity reflects the positive or negative terms that appear in a given sentence: words like "beautiful" and "love" contribute to a higher score, while words like "angry" shift the polarity in a negative direction. Subjectivity, on the other hand, is almost like an error bound on the sentiment analysis. The subjectivity of words and phrases may depend on their context, and an objective document may contain subjective sentences. Words and phrases that are labeled as more subjective may shift their polarity depending on their context, making it more difficult to determine the true sentiment of the phrase.

In [14]:
print(flowers.sentiment)
print(angry.sentiment)

Sentiment(polarity=0.675, subjectivity=0.8)
Sentiment(polarity=-0.5, subjectivity=1.0)


To get even _more_ specific, we can use the `sentiment_assessments` feature to view which words contributed to the polarity and subjectivity scores.

In [13]:
print(flowers.sentiment_assessments)
print(angry.sentiment_assessments)

Sentiment(polarity=0.675, subjectivity=0.8, assessments=[(['beautiful'], 0.85, 1.0, None), (['much', '!', 'really', 'love'], 0.5, 0.6, None)])
Sentiment(polarity=-0.5, subjectivity=1.0, assessments=[(['angry'], -0.5, 1.0, None)])
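
In the output above, the overall scores for `flowers` are exactly the averages of its individual assessments. The sketch below reproduces that arithmetic with the standard library, using the numbers copied from the output. It illustrates how to read the result and is not a description of TextBlob's internals.

```python
from statistics import mean

# Assessments copied from the `flowers` output: (words, polarity, subjectivity, label)
assessments = [
    (['beautiful'], 0.85, 1.0, None),
    (['much', '!', 'really', 'love'], 0.5, 0.6, None),
]

# Average the second field (polarity) and third field (subjectivity)
polarity = mean(item[1] for item in assessments)
subjectivity = mean(item[2] for item in assessments)
print(round(polarity, 3), round(subjectivity, 3))
```

Because every assessment carries equal weight in this average, one strongly worded phrase can noticeably move the score of a short sentence.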


## Exercise

Test out a couple of different sentences and examine their sentiment analysis output. Can you write:
* A sentence that scores a neutral polarity (aim for a value in the range [-0.2, 0.2])
* A sentence that scores as highly _objective_
* A sentence that scores as highly _subjective_
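
To check your sentences against these targets, a small helper can turn the two scores into labels. This is a sketch under stated assumptions: the function name `describe_sentiment` is made up, the neutral band of 0.2 comes from the exercise above, and the 0.5 cutoff between objective and subjective is an arbitrary choice for illustration. You would pass in the `polarity` and `subjectivity` values from a TextBlob's `sentiment`.

```python
def describe_sentiment(polarity, subjectivity, neutral_band=0.2, subjective_cutoff=0.5):
    """Label a (polarity, subjectivity) pair using simple, assumed thresholds."""
    if abs(polarity) <= neutral_band:
        polarity_label = 'neutral'
    elif polarity > 0:
        polarity_label = 'positive'
    else:
        polarity_label = 'negative'
    subjectivity_label = 'subjective' if subjectivity >= subjective_cutoff else 'objective'
    return polarity_label, subjectivity_label


# Scores taken from the `flowers` and `angry` outputs earlier in this section
flowers_labels = describe_sentiment(0.675, 0.8)
angry_labels = describe_sentiment(-0.5, 1.0)
print(flowers_labels)
print(angry_labels)
```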