# Natural Language Processing (NLP) with TextaCy

In part 1, we focused on the preprocessing side of NLP. However, there are many more tasks in NLP. In this tutorial, we are going to cover some of them:

### Table of Contents
- [Getting Started](#Getting-Started)
- [Loading Data](#Loading-Data)
- [More on Named Entities](#More-on-Named-Entities)
- [Part-of-Speech (POS) Tagging](#Part-of-Speech-Tagging)
- [NP & VP Chunking](#NP-&-VP-Chunking)
- [Term Extraction](#Term-Extraction)
- [Topic Modeling](#Topic-Modeling)
- [Conclusion](#Conclusion)

## Getting Started

We start with loading the `textacy` package again.

In [None]:
import textacy

## Loading Data

In this part, we will use the `CapitolWords()` dataset, which comes with TextaCy. It contains speeches made by politicians such as Ms. Clinton and President Obama; here we load the first 600 records.

In [None]:
import textacy.datasets  # note the import
cw = textacy.datasets.CapitolWords()
#cw.download()
records = cw.records(limit=600)
text_stream, metadata_stream = textacy.io.split_records(records, 'text')
corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
print(corpus)

We can look at the basic statistics of the `corpus`.

In [None]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

[Back to Top](#Table-of-Contents)

## More on Named Entities

In part 1, we touched upon named entities, but we did not dig deep into them. Here is more on them.

Named entities refer to objects in the real world, which include: time/date (`TIME`/`DATE`), location (`GPE`), organization (`ORG`), groups such as nationalities or political parties (`NORP`), people (`PERSON`), number (`CARDINAL`), money (`MONEY`), ...

Let's use the following example to show the aforementioned types.

__NOTE__: Do you know how TextaCy knows about these entities? The answer is that TextaCy relies on a pre-trained __classification__ model to "guess"! Since it is classification, named entities will sometimes be misclassified.

In [None]:
doc = corpus[1]
for ent in textacy.extract.named_entities(doc, drop_determiners=True):
    print(ent.text, ent.label_)

In NLP practice, the importance of named entities ranges from very important (for instance, to determine the context of a document in [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval)) to not important at all (for instance, in [topic modeling](https://en.wikipedia.org/wiki/Topic_model)). So, depending on the task, we may use different strategies:
- If they are important, extract them in a list;
- If they are not important, replace them with their respective labels (e.g. `PERSON` for 'Clinton').
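
To make the two strategies concrete, here is a minimal sketch in plain Python. It does not call TextaCy: the sentence, the character offsets, and the helper names `extract_entities` and `mask_entities` are all made up for illustration, standing in for what a named entity recognizer might return.

```python
# Hypothetical entity spans: (start_char, end_char, label), as an NER model might return.
text = "Clinton met Obama in Washington on Tuesday."
entities = [(0, 7, "PERSON"), (12, 17, "PERSON"), (21, 31, "GPE"), (35, 42, "DATE")]

def extract_entities(text, entities, wanted_labels):
    """Strategy 1: collect the surface text of entities whose label is wanted."""
    return [text[start:end] for start, end, label in entities if label in wanted_labels]

def mask_entities(text, entities):
    """Strategy 2: replace each entity span with its label."""
    pieces, cursor = [], 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # text between the previous entity and this one
        pieces.append(label)
        cursor = end
    pieces.append(text[cursor:])  # whatever follows the last entity
    return "".join(pieces)

print(extract_entities(text, entities, {"GPE"}))
print(mask_entities(text, entities))
```

Masking works on character offsets rather than on string replacement, so two different entities with the same surface text can still receive different labels.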

### YOUR TURN HERE
Extract all `PERSON` entities from `doc`, and store them in a list named `person_lst`.

In [None]:
#### Complete your code here


[Back to Top](#Table-of-Contents)

## Part-of-Speech Tagging

After tokenization, spaCy/textaCy can parse and tag a given `Doc`. This is where the statistical model comes in, which enables spaCy/textaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a __noun__.

The predicted tag is available through the `pos_` attribute of each `token` object.

In [None]:
for t in doc:
    print(t, t.pos_)
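
To illustrate what "learning from examples" means, here is a toy most-frequent-tag baseline in plain Python. This is only a sketch of the idea, not how spaCy's model works: the training pairs are invented, and the helpers `train_baseline` and `predict` are hypothetical names. Real taggers also look at the surrounding context, which this baseline ignores.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled training examples: (word, coarse POS tag).
training = [
    ("the", "DET"), ("vote", "NOUN"), ("passed", "VERB"),
    ("we", "PRON"), ("vote", "VERB"), ("today", "NOUN"),
    ("the", "DET"), ("vote", "NOUN"), ("failed", "VERB"),
]

def train_baseline(examples):
    """Count how often each word was seen with each tag."""
    counts = defaultdict(Counter)
    for word, tag in examples:
        counts[word.lower()][tag] += 1
    return counts

def predict(counts, word, default="NOUN"):
    """Return the tag seen most often for this word, or a default for unseen words."""
    seen = counts.get(word.lower())
    return seen.most_common(1)[0][0] if seen else default

counts = train_baseline(training)
print([(w, predict(counts, w)) for w in ["The", "vote", "passed", "senator"]])
```

Note that "vote" is always tagged `NOUN` here, even in "we vote", because the baseline never sees the neighboring words. That gap is what a context-aware statistical model closes.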

### YOUR TURN HERE
Extract all non-stop verbs (`.pos_ == 'VERB'`) in their lemma form (`.lemma_`) from `doc`.

__HINT__: non-stop words can be filtered using the `is_stop` attribute of any token `t`.

In [None]:
#### Complete your code here


### Additional Task

From the results above, you can observe that many verbs are duplicated in the list. How can you remove the duplicates from the list? Can you return the number of unique (non-duplicate) verbs in `doc`?

__HINT__: Which of Python's data types forbids duplicates?

In [None]:
#### Complete your code here


[Back to Top](#Table-of-Contents)

## NP & VP Chunking

From common sense, we know that words (tokens) may not be the most useful linguistic unit in text. Sometimes, phrases formed by words contain inseparable senses in text. Identifying phrases in text is called phrase chunking. Phrase chunking is a natural language process that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases. [Source: Wikipedia](https://en.wikipedia.org/wiki/Phrase_chunking).

In practice, we focus mainly on Noun Phrases (NP) and Verb Phrases (VP).

For NP Chunking, textaCy provides a built-in method (`textacy.extract.noun_chunks()`):

In [None]:
for np in textacy.extract.noun_chunks(doc, drop_determiners=True):
    # this is to guarantee we are getting multi-word phrases not individual words
    if len(np.text.split()) > 1: 
        print(np.text.lower())

VP chunking is a little more complicated. You will use regular expression matching with textaCy's built-in VP pattern. See the example below:

In [None]:
pattern = textacy.constants.POS_REGEX_PATTERNS['en']['VP']
for vp in textacy.extract.pos_regex_matches(doc, pattern):
    # this is to guarantee we are getting multi-word phrases not individual words
    if len(vp.text.split()) > 1:
        print(vp.text.lower())
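
If you are curious how regular expressions can match over part-of-speech tags, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not textaCy's implementation: the tagged tokens are invented, `VP_PATTERN` is a simplified pattern of our own, and `pos_regex_chunks` is a hypothetical helper.

```python
import re

# Hypothetical pre-tagged tokens: (text, coarse POS tag).
tagged = [
    ("we", "PRON"), ("have", "AUX"), ("been", "AUX"), ("working", "VERB"),
    ("hard", "ADV"), ("and", "CCONJ"), ("will", "AUX"), ("not", "PART"), ("stop", "VERB"),
]

# Simplified verb-phrase pattern (an assumption, not textaCy's built-in one):
# optional auxiliaries, optional adverbs/particles, then one or more verbs.
VP_PATTERN = r"(<AUX>)*(<ADV>|<PART>)*(<VERB>)+"

def pos_regex_chunks(tagged_tokens, pattern):
    """Yield lists of token texts whose tag sequence matches the pattern."""
    # Encode the tag sequence as one string, e.g. "<PRON><AUX><AUX><VERB>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    for match in re.finditer(pattern, tag_string):
        # Counting '<' before each match boundary recovers the token indices.
        start = tag_string[:match.start()].count("<")
        end = tag_string[:match.end()].count("<")
        yield [text for text, _ in tagged_tokens[start:end]]

chunks = [" ".join(chunk) for chunk in pos_regex_chunks(tagged, VP_PATTERN)]
print(chunks)
```

The trick is that the regular expression never sees the words themselves, only their tags; the words are looked up again once a matching span of tags has been found.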

Combining phrase (NP & VP) chunking with named entity extraction, you can extract more complicated linguistic patterns from text data.

The code below extracts multi-word named entities from `doc`:

In [None]:
for ent in textacy.extract.named_entities(doc, drop_determiners=True):
    # this is to guarantee we are getting multi-word phrases not individual words
    if len(ent.text.split()) > 1:
        print(ent.text, ent.label_)

Looks like with the help of machine learning, machines can understand a _little bit_ of text data, right? 

[Back to Top](#Table-of-Contents)

Next, we are going to demonstrate advanced text analytics techniques.

## Term Extraction

We already learned how to extract words, named entities, or phrases from text. However, in text analytics, we do not treat every word or phrase equally: some of them are more important than others. We call these 'important' words/phrases __terms__ (short for _terminologies_). Extracting terms from texts is an important NLP task.

TextaCy provides several term extraction methods.

In [None]:
# Load the keyterms sub-package for SGRank, SingleRank & TextRank
# make sure you import the sub-package
import textacy.keyterms
# SGRank
textacy.keyterms.sgrank(doc, ngrams=(1, 2, 3, 4, 5, 6), 
                        normalize='lemma', window_width=1500, n_keyterms=10, idf=None)


In [None]:
# SingleRank
textacy.keyterms.singlerank(doc, normalize='lemma', n_keyterms=10)

In [None]:
# TextRank
textacy.keyterms.textrank(doc, normalize='lemma', n_keyterms=10)

From these terms, can you get a sense of what `doc` is about?
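
Graph-based rankers such as TextRank share one core idea: words that co-occur with many other well-connected words are considered important. The sketch below shows that idea in plain Python, using PageRank over a co-occurrence graph. It is a simplified illustration, not TextaCy's implementation: the token list is invented, and `rank_terms` with its `window`, `damping`, and `iterations` parameters is a hypothetical helper.

```python
from collections import defaultdict

def rank_terms(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct tokens that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != tok:
                neighbors[tok].add(other)
                neighbors[other].add(tok)
    # Start from a uniform score, then repeatedly redistribute scores along edges.
    scores = {tok: 1.0 / len(neighbors) for tok in neighbors}
    for _ in range(iterations):
        scores = {
            tok: (1 - damping) / len(neighbors)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[tok])
            for tok in neighbors
        }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical, already lemmatized content words from a short speech.
tokens = ["budget", "committee", "vote", "budget", "amendment", "budget", "senate"]
ranked = rank_terms(tokens)
for term, score in ranked:
    print(term, round(score, 3))
```

In this toy input, "budget" co-occurs with every other word, so it ends up with the highest score.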

[Back to Top](#Table-of-Contents)

## Topic Modeling

A more advanced technique for document understanding is __Topic Modeling__, which relies on the (co-)occurrences of words/tokens/terms to discover the themes running through a corpus.

TextaCy provides a class (`textacy.tm.topic_model.TopicModel`) for topic modeling. To build a topic model, we first need to vectorize the corpus, so that each document is represented as a vector of term weights. This functionality is built on scikit-learn.

<img src='https://cdn-images-1.medium.com/max/1080/1*2r1yj0zPAuaSGZeQfG6Wtw.png' />

In [None]:
from textacy.vsm import Vectorizer
 
tokenized_docs = (doc.to_terms_list(ngrams=1, entities=True, as_strings=True) 
                  for doc in corpus[:500])

In [None]:
vectorizer = Vectorizer(apply_idf=True, norm='l2', min_df=3, max_df=0.95, idf_type='smooth', 
                       tf_type='linear', max_n_terms=100000)

In [None]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)

In [None]:
vectorizer.terms_list[:5]
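
To see what the vectorizer computes, here is a minimal TF-IDF sketch in plain Python. It mirrors the `apply_idf=True`, `norm='l2'`, and `tf_type='linear'` settings in spirit only: the smoothed IDF formula used here is one common variant and is an assumption, the `min_df`/`max_df` filtering is left out, and the documents and the helper `tfidf_matrix` are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (each is a list of terms).
docs = [
    ["budget", "vote", "budget"],
    ["health", "care", "vote"],
    ["budget", "health"],
]

def tfidf_matrix(tokenized_docs):
    """Return (matrix, vocabulary) with smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vocabulary = {term: col for col, term in enumerate(sorted(doc_freq))}
    # Smoothed IDF (one common variant): log((1 + n_docs) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    matrix = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[vocabulary[term]] = count * idf[term]  # linear term frequency
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])  # L2 normalization
    return matrix, vocabulary

matrix, vocabulary = tfidf_matrix(docs)
print(vocabulary)
for row in matrix:
    print([round(v, 2) for v in row])
```

Terms that appear in fewer documents (such as "care") receive a larger IDF weight than terms that appear in many (such as "vote").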

With a vectorized corpus (i.e. document-term matrix) and corresponding vocabulary (i.e. mapping of term strings to column indices in the matrix), we can then initialize and train a topic model:

In [None]:
import textacy.tm  # make sure the sub-package is loaded
model = textacy.tm.TopicModel('nmf', n_topics=20)
model.fit(doc_term_matrix)
model
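
The `'nmf'` option stands for non-negative matrix factorization, which splits the document-term matrix into a document-topic matrix and a topic-term matrix. The sketch below shows the idea in plain Python using the classic multiplicative-update rule. It is a toy under stated assumptions: scikit-learn may use a different solver, the small matrix `v` is invented, and `nmf`, `matmul`, `transpose`, and `frobenius_error` are hypothetical helpers.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, iterations=200, seed=0, eps=1e-9):
    """Factor v (docs x terms) into w (docs x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(n_topics)] for _ in v]
    h = [[rng.random() for _ in v[0]] for _ in range(n_topics)]
    for _ in range(iterations):
        # Update h elementwise: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(len(h[0]))]
             for k in range(n_topics)]
        # Update w elementwise: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(len(w))]
    return w, h

# Hypothetical document-term matrix: docs 0-1 use terms 0-1, docs 2-3 use terms 2-3.
v = [[2, 1, 0, 0], [3, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]
w, h = nmf(v, n_topics=2)
print(round(frobenius_error(v, w, h), 4))
```

Because every update only multiplies by non-negative ratios, the factors stay non-negative, which is what lets us read each row of `h` as a topic's term weights.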

Now let's transform the corpus and interpret our model:

In [None]:
doc_topic_matrix = model.transform(doc_term_matrix)

In [None]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term):
    print('topic', topic_idx, ':', '   '.join(top_terms))
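
Conceptually, finding the top terms of a topic is just sorting that topic's row of term weights. Here is a minimal sketch in plain Python; the weights, the `id_to_term` mapping, and the helper `top_terms_per_topic` are invented for illustration and do not reproduce TextaCy's code.

```python
# Hypothetical topic-term weights (rows: topics, columns: terms).
id_to_term = {0: "budget", 1: "tax", 2: "health", 3: "care"}
topic_term = [
    [0.9, 0.7, 0.1, 0.0],
    [0.0, 0.2, 0.8, 0.6],
]

def top_terms_per_topic(topic_term, id_to_term, top_n=2):
    """For each topic, return the terms with the largest weights."""
    result = []
    for weights in topic_term:
        # Column indices sorted by descending weight, truncated to top_n.
        best = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:top_n]
        result.append([id_to_term[j] for j in best])
    return result

for topic_idx, terms in enumerate(top_terms_per_topic(topic_term, id_to_term)):
    print("topic", topic_idx, ":", "   ".join(terms))
```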

We can then associate topics with docs in our corpus.

In [None]:
for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1,2], top_n=2):
    # print only the topic index here; `top_terms` from the previous cell would be stale
    print('topic', topic_idx)
    for j in top_docs:
        print(corpus[j].metadata['title'])

How about the top 2 topics for each of the first 10 docs?

In [None]:
for doc_idx, topics in model.top_doc_topics(doc_topic_matrix, docs=range(10), top_n=2):
    print(corpus[doc_idx].metadata['title'], ':', topics)

We can also look at each topic's weight across the whole corpus, which can be used to gauge the importance of each topic (the __higher__ the weight, the more __important__ the topic).

In [None]:
for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
    print(i, val)
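
One simple way to think about a topic's weight is as its share of the total loading in the document-topic matrix. The sketch below assumes exactly that definition (column sums divided by the grand total); the small matrix and the helper `topic_share` are invented for illustration.

```python
# Hypothetical document-topic matrix (rows: docs, columns: topics).
doc_topic = [
    [2.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
]

def topic_share(doc_topic):
    """Return each topic's share of the total loading over all documents."""
    totals = [sum(col) for col in zip(*doc_topic)]  # per-topic column sums
    grand_total = sum(totals)
    return [total / grand_total for total in totals]

for i, val in enumerate(topic_share(doc_topic)):
    print(i, val)
```

Under this definition the weights always sum to 1, so they can be compared directly across topics.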

We can also visualize the topics:

In [None]:
model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
                  topics=-1,  n_terms=25, sort_terms_by='seriation')

You can save the trained model for future use.

In [None]:
model.save('nmf-20topics.pkl')

## Conclusion

In this tutorial, we learned some advanced text analytics techniques. These techniques are used either to extract (semi-structured) information from text or to summarize text using its most important terms or topics.

The techniques you learned in part 1 & 2 cover the most important NLP tasks in the field of text mining. Feel free to try them on your own. 

### Have fun text mining!

### Useful Links
- [TextaCy API references](https://chartbeat-labs.github.io/textacy/api_reference.html#)
- [Natural Language Processing is Fun!](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e)
- [spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)

__PLEASE complete both parts of the tutorial and submit back using GitHub classroom.__