1\. Named Entity Recognition
----------------------------

00:00 - 00:05

Welcome to our first video on named entity recognition!

2\. What is Named Entity Recognition?
-------------------------------------

00:05 - 00:40

Named Entity Recognition, or NER for short, is a natural language processing task used to identify important named entities in a text, such as people, places and organizations. Entities can also be dates, states, works of art and other categories, depending on the libraries and notation you use. NER can be used alongside topic identification, or on its own, to determine important items in a text or to answer basic natural language understanding questions such as who, what, when and where.

3\. Example of NER
------------------

00:40 - 01:30

For example, take this piece of text from the English Wikipedia article on Albert Einstein. The text has been highlighted for the different types of named entities found using the Stanford NER library. You can see the dates, locations, persons and organizations found, and extract information about the text based on these named entities. You can use NER for problems such as fact extraction, or to work out which entities are related using computational language models. For example, in this text we can see that Einstein has something to do with the United States, Adolf Hitler and Germany. We can also see by token proximity that Bertrand Russell and Einstein created the Russell-Einstein Manifesto, all from simple entity highlighting.
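The proximity idea can be sketched in a few lines: count how often two entities share a sentence and treat frequent pairs as candidates for a relationship. This is only a rough illustration. The per-sentence entity lists are hand-written placeholders rather than real Stanford NER output, and `cooccurrence_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entities_per_sentence):
    """Count how often each unordered pair of entities shares a sentence."""
    counts = Counter()
    for entities in entities_per_sentence:
        # Deduplicate and sort so ("A", "B") and ("B", "A") use the same key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Illustrative input only: one list of entity strings per sentence.
entities_per_sentence = [
    ["Einstein", "United States", "Adolf Hitler"],
    ["Einstein", "Germany"],
    ["Bertrand Russell", "Einstein"],
    ["Einstein", "United States"],
]

pair_counts = cooccurrence_counts(entities_per_sentence)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Sentence-level co-occurrence is a coarse stand-in for token proximity; a window measured in tokens would be a natural refinement.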

4\. nltk and the Stanford CoreNLP Library
-----------------------------------------

01:30 - 02:20

NLTK allows you to interact with named entity recognition via its own model, but also via the aforementioned Stanford library. The Stanford library integration requires you to perform a few steps before you can use it, including installing the required Java files and setting system environment variables. You can also use the Stanford library on its own without integrating it with NLTK, or operate it as an API server. The Stanford CoreNLP library has great support for named entity recognition as well as some related NLP tasks, such as coreference (linking pronouns and entities together) and dependency trees, which help with parsing meaning and relationships among words or phrases in a sentence.

5\. Using nltk for Named Entity Recognition
-------------------------------------------

02:20 - 02:49

For our simple use case, we will use the built-in named entity recognition with NLTK. To do so, we take a normal sentence and preprocess it via tokenization. Then, we can tag the sentence for parts of speech. This will add tags for proper nouns, pronouns, adjectives, verbs and the other parts of speech that NLTK uses, based on an English grammar. When we take a look at the tags, we see New and York are tagged NNP, which is the tag for a singular proper noun.

6\. nltk's ne_chunk()
---------------------

02:49 - 03:31

Then we pass this tagged sentence into the ne_chunk function, or named entity chunk, which will return the sentence as a tree. NLTK trees might look a bit different from trees you might use in other libraries, but they do have leaves and subtrees representing more complex grammar. This tree shows the named entities tagged as their own chunks, such as GPE, or geopolitical entity, for New York, or MOMA and Metro as organizations. It also identifies Ruth Reichl as a person. It does so without consulting a knowledge base like Wikipedia; instead, it uses trained statistical and grammatical parsers.
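To make the tree structure concrete, here is a minimal sketch that walks a chunked sentence and collects each entity's label and text. To keep it runnable without downloading NLTK's models, the tree is built by hand as a stand-in for what `nltk.ne_chunk()` might return. The tags and labels in it are assumptions for illustration, and `extract_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built stand-in for ne_chunk() output; not real model output.
tree = Tree("S", [
    ("In", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (",", ","),
    ("I", "PRP"),
    ("visit", "VBP"),
    Tree("ORGANIZATION", [("MOMA", "NNP")]),
    ("with", "IN"),
    Tree("PERSON", [("Ruth", "NNP"), ("Reichl", "NNP")]),
])


def extract_entities(chunked):
    """Return (label, text) pairs for each named-entity subtree."""
    entities = []
    for chunk in chunked:
        # Plain tokens are (word, tag) tuples; entity chunks are subtrees.
        if isinstance(chunk, Tree):
            text = " ".join(word for word, tag in chunk.leaves())
            entities.append((chunk.label(), text))
    return entities


entities = extract_entities(tree)
print(entities)
```

Checking `isinstance(chunk, Tree)` does the same job as testing for a `label` attribute: only the entity subtrees carry a label, while ordinary tokens stay as plain tuples.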

7\. Let's practice!
-------------------

03:31 - 03:37

Now it's your turn to practice some named entity recognition using nltk.

NER with NLTK
=============

You're now going to have some fun with named-entity recognition! A scraped news article has been pre-loaded into your workspace. Your task is to use `nltk` to find the named entities in this article. 

What might the article be about, given the names you found?

Along with `nltk`, `sent_tokenize` and `word_tokenize` from `nltk.tokenize` have been pre-imported.

Instructions
------------

-   Tokenize `article` into sentences.
-   Tokenize each sentence in `sentences` into words using a list comprehension.
-   Inside a list comprehension, tag each tokenized sentence into parts of speech using `nltk.pos_tag()`.
-   Chunk each tagged sentence into named-entity chunks using `nltk.ne_chunk_sents()`. Along with `pos_sentences`, specify the additional keyword argument `binary=True`.
-   Loop over each sentence and each chunk, and test whether it is a named-entity chunk by testing if it has the attribute `label`, and if the `chunk.label()` is equal to `"NE"`. If so, print that chunk.

In [None]:
# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Print each chunk that is a subtree labeled 'NE'
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

Charting practice
=================

In this exercise, you'll use some extracted named entities and their groupings from a series of newspaper articles to chart the diversity of named entity types in the articles.

You'll use a `defaultdict` called `ner_categories`, with keys representing every named entity group type, and values to count the number of each different named entity type. You have a chunked sentence list called `chunked_sentences` similar to the last exercise, but this time with non-binary category names.

You can use `hasattr()` to determine if each chunk has a `'label'` and then simply use the chunk's `.label()` method as the dictionary key.

Instructions 1/3
----------------

-   Create a `defaultdict` called `ner_categories`, with the default type set to `int`.

In [None]:
from collections import defaultdict

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

Instructions 2/3
----------------

-   Fill up the dictionary with values for each of the keys. Remember, the keys will represent the `label()`.
    -   In the outer `for` loop, iterate over `chunked_sentences`, using `sent` as your iterator variable.
    -   In the inner `for` loop, iterate over `sent`. If the chunk has a `label` attribute, increment the value for its key by 1.
    -   *Remember to use the chunk's `.label()` method as the key!*
-   For the pie chart labels, create a list called `labels` from the keys of `ner_categories`, which can be accessed using `.keys()`.

In [None]:
import matplotlib.pyplot as plt

ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

Instructions 3/3
----------------

-   Use a list comprehension to create a list called `values`, using the `.get()` method on `ner_categories` to look up the count for each label `v` in `labels`.
-   Use `plt.pie()` to create a pie chart for each of the NER categories. Along with `values` and `labels=labels`, pass the extra keyword arguments `autopct='%1.1f%%'` and `startangle=140` to add percentages to the chart and rotate the initial start angle.
    -   *This step has been done for you.*
-   Display your pie chart. Was the distribution what you expected?

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

Stanford library with NLTK
==========================

When using the Stanford library with NLTK, what is needed to get started?

##### Answer the question

#### Possible Answers

Select one answer

-   A normal installation of NLTK.

-   An installation of the Stanford Java Library.

-   Both NLTK and an installation of the Stanford Java Library.

[/] -   NLTK, the Stanford Java Libraries and some environment variables to help with integration.

1\. Introduction to spaCy
-------------------------

00:00 - 00:07

In this video, we'll take a look at spaCy, another great library for natural language processing.

2\. What is spaCy?
------------------

00:07 - 00:31

spaCy is an NLP library similar to Gensim, but with different implementations, including a particular focus on creating NLP pipelines to generate models and corpora. spaCy is open source and has several extra libraries and tools built by the same team, including displaCy, a visualization tool for viewing parse trees that uses Node.js to create interactive text.

3\. displaCy entity recognition visualizer
------------------------------------------

00:31 - 00:55

For example, if we use the displaCy entity recognition visualizer, which has a live demo online, we can enter the sentence used in the last video. Here, we can see that spaCy has identified three named entities and tagged them with the appropriate entity label, such as location or person. spaCy also has tools to build word and document vectors from text.
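The same kind of entity highlighting can be produced locally. The sketch below assumes the `displacy` module that ships with recent releases of the spaCy Python package, and uses its manual mode so that no trained model is needed. The example sentence, its entity labels, and the `char_span` helper are all illustrative assumptions, not output from the online demo.

```python
from spacy import displacy


def char_span(text, phrase, label):
    """Build a displaCy entity dict from the first occurrence of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


text = "Berlin is the capital of Germany."

# Entities supplied by hand as character offsets, in order of appearance.
# With a trained pipeline you would pass a Doc and drop manual=True.
example = {
    "text": text,
    "ents": [
        char_span(text, "Berlin", "GPE"),
        char_span(text, "Germany", "GPE"),
    ],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html)
```

The returned string can be saved to an `.html` file and opened in a browser to see the highlighted entities.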

4\. spaCy NER
-------------

00:55 - 02:19

To start using spaCy for named entity recognition, we must first install it and download all the appropriate pre-trained word vectors. You can also train vectors yourself and load them, but the pretrained ones let us get started immediately. We can load those into an object, `nlp`, which functions similarly to our Gensim dictionary and corpus. It has several linked objects, including `entity`, which is an EntityRecognizer object from the pipeline module. This is what is used to find entities in the text. Then we load a new document by passing a string into the `nlp` variable. When the document is loaded, the named entities are stored as a document attribute called `ents`. We see that spaCy properly tagged and identified the three main entities in the sentence. We can also investigate the labels of each entity by using indexing to pick out the first entity and the `label_` attribute to see the label for that particular entity. Here we see the label for Berlin is GPE, or geopolitical entity. spaCy has several other language models available, including advanced German and Chinese implementations. It's a great tool, especially if you want to build your own extraction and natural language processing pipeline quickly and iteratively.
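A minimal sketch of the `ents` and `label_` access pattern follows. To avoid depending on a model download, it uses a blank English pipeline and annotates the entities by hand. That is an assumption made for this example only; with a trained pipeline such as `en_core_web_sm`, the entities would be predicted when the text is passed to `nlp`. The sentence and the `entity_labels` helper are illustrative.

```python
import spacy
from spacy.tokens import Span


def entity_labels(doc):
    """Return (text, label) pairs for every entity stored on doc.ents."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# A blank pipeline only tokenizes; it has no entity recognizer.
nlp = spacy.blank("en")
doc = nlp("Berlin is the capital of Germany.")

# Hand-annotated spans (token start, token end) standing in for predictions.
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 5, 6, label="GPE")]

print(entity_labels(doc))

# Index into ents to inspect a single entity and its label.
first = doc.ents[0]
print(first.text, first.label_)
```

`label_` (with the trailing underscore) gives the readable string, while `label` without it is spaCy's internal integer ID for the same label.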

5\. Why use spaCy for NER?
--------------------------

02:19 - 02:49

Why use spaCy for NER? Beyond integrating with other great spaCy features like easy pipeline creation, it has a different set of entity types and often labels entities differently than NLTK. In addition, spaCy comes with informal language corpora, allowing you to more easily find entities in documents like Tweets and chat messages. It's a quickly growing library, so it might even have more languages supported by the time you are watching this video!

6\. Let's practice!
-------------------

02:49 - 02:55

For now, however, you can get started using spaCy for named entity recognition!

Comparing NLTK with spaCy NER
=============================

Using the same text you used in the first exercise of this chapter, you'll now see the results using spaCy's NER annotator. How will they compare?

The article has been pre-loaded as `article`. To minimize execution times, you'll be asked to specify the keyword argument `disable=['tagger', 'parser', 'matcher']` when loading the spaCy model, because you only care about the `entity` in this exercise.

Instructions
------------

-   Import `spacy`.
-   Load the `'en_core_web_sm'` model using `spacy.load()`. Specify the additional keyword arguments `disable=['tagger', 'parser', 'matcher']`.
-   Create a `spacy` document object by passing `article` into `nlp()`.
-   Using `ent` as your iterator variable, iterate over the entities of `doc` and print out the labels (`ent.label_`) and text (`ent.text`).

In [None]:
# Import spacy
import spacy

# Instantiate the English model: nlp
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'matcher'])

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)

spaCy NER Categories
====================

Which are the *extra* categories that `spacy` uses compared to `nltk` in its named-entity recognition?

Instructions
------------

### Possible answers

GPE, PERSON, MONEY

ORGANIZATION, WORK_OF_ART

[/] NORP, CARDINAL, MONEY, WORK_OF_ART, LANGUAGE, EVENT

EVENT_LOCATION, FIGURE