### Codio Activity 18.2: Named Entities

**Expected Time = 45 minutes**

**Total Points = 30**

This activity focuses on extracting named entities from text using the `nltk` library. You will read in the full text of Newton's *Principia* and identify the entities labeled as places.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [49]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/codio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/codio/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/codio/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/codio/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /home/codio/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

[Back to top](#-Index)

### Problem 1

#### Opening a `.txt` file.

**5 Points**

Use the `open` function with the filepath given below to open the text file containing Isaac Newton's *Principia*. Use the `readlines()` method to read the text as a list of lines, and assign that list to the variable `principia` below.
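To illustrate the difference between `read()` and `readlines()`, the sketch below writes a small throwaway file and reads it both ways. The temporary file and its two lines of text are made up for this example and are not part of the activity data.

```python
import os
import tempfile

# Write a tiny throwaway file (hypothetical content, not the activity data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first line\nsecond line\n')
    sample_path = tmp.name

# read() returns the whole file as a single string.
with open(sample_path) as f:
    as_string = f.read()

# readlines() returns a list with one string per line, newlines included.
with open(sample_path) as f:
    as_lines = f.readlines()

# Clean up the throwaway file.
os.remove(sample_path)

print(type(as_string), type(as_lines))
print(as_lines)
```

Each element of the list keeps its trailing `\n`, which is worth remembering when the lines are joined back together later.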

In [50]:
filepath = 'data/Philosophiae_Naturalis_Principia_Mathematica.txt'

In [51]:
### GRADED
with open(filepath) as f:
    principia = ''

    
### BEGIN SOLUTION
with open(filepath) as f:
    principia = f.readlines()
### END SOLUTION

### ANSWER CHECK
print(type(principia))

<class 'list'>


In [52]:
### BEGIN HIDDEN TESTS
with open(filepath) as f:
    principia_ = f.readlines()
#
#
#
assert principia == principia_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 2

#### Tokenizing the text. 

**5 Points**

Using the `principia` variable from Problem 1, join the lines into a single string with `' '.join()` and pass the result to `word_tokenize`. Assign the resulting list of tokens to `tokens` below.
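The joining step can be sketched on its own as follows. The two sample lines are hypothetical stand-ins for the output of `readlines()`, and `str.split()` is used only as a rough, whitespace-based substitute for `word_tokenize` so that the sketch runs without the `punkt` data; it does not separate punctuation the way `word_tokenize` does.

```python
# Hypothetical lines, shaped like the output of readlines().
sample_lines = ['Corpus omne perseverare\n', 'in statu suo quiescendi\n']

# ' '.join() glues the list into one string; each line keeps its newline.
joined = ' '.join(sample_lines)

# str.split() is only a stand-in for word_tokenize in this sketch.
rough_tokens = joined.split()

print(repr(joined))
print(rough_tokens)
```

Because a tokenizer expects a single string rather than a list of lines, the join has to happen before tokenizing.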

In [53]:
### GRADED
tokens = ''

    
### BEGIN SOLUTION
tokens = word_tokenize(' '.join(principia))
### END SOLUTION

### ANSWER CHECK
print(type(tokens))
print(tokens[:5])

<class 'list'>
['Philosophiae', 'Naturalis', 'Principia', 'Mathematica', 'Isaacus']


In [54]:
### BEGIN HIDDEN TESTS
with open(filepath) as f:
    principia_ = f.readlines()
    
tokens_ = word_tokenize(' '.join(principia_))
#
#
#
assert tokens == tokens_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 3

#### Part of Speech Tags 

**5 Points**

Use the `nltk.pos_tag` function to tag each token in `tokens` with its part of speech. Assign the resulting list of (token, tag) tuples to the variable `words_pos` below.

In [55]:
### GRADED
words_pos = ''

    
### BEGIN SOLUTION
words_pos = nltk.pos_tag(tokens)
### END SOLUTION

### ANSWER CHECK
print(type(words_pos))
print(words_pos[:5])

<class 'list'>
[('Philosophiae', 'NNP'), ('Naturalis', 'NNP'), ('Principia', 'NNP'), ('Mathematica', 'NNP'), ('Isaacus', 'NNP')]
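The tagged output is a plain list of (token, tag) tuples, so it can be inspected with ordinary Python tools. As a minimal sketch, the snippet below copies the five tagged tokens printed above into a hypothetical variable `sample_pos` and counts the tags; `NNP` is the Penn Treebank tag for a singular proper noun.

```python
from collections import Counter

# The five tagged tokens printed by the answer check above.
sample_pos = [('Philosophiae', 'NNP'), ('Naturalis', 'NNP'),
              ('Principia', 'NNP'), ('Mathematica', 'NNP'),
              ('Isaacus', 'NNP')]

# Each element is a (token, tag) pair, so it unpacks directly in a loop.
tag_counts = Counter(tag for _, tag in sample_pos)

print(tag_counts)
```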


In [56]:
### BEGIN HIDDEN TESTS
with open(filepath) as f:
    principia_ = f.readlines()
    
tokens_ = word_tokenize(' '.join(principia_))
words_pos_ = nltk.pos_tag(tokens_)
#
#
#
assert words_pos == words_pos_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 4

#### Named Entities

**5 Points**

Pass the tagged words in `words_pos` to `nltk.ne_chunk`, and for every chunk that carries a named entity label create a tuple of the form (entity text, entity type). Assign these tuples to the list `named_entities` below.
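`nltk.ne_chunk` returns a tree whose children are either plain (word, tag) tuples or labeled subtrees for recognized entities. The sketch below walks a small hand-built tree of that shape; the sentence and its labels are hypothetical and were written by hand rather than produced by the chunker, so no model download is needed.

```python
from nltk.tree import Tree

# Hand-built stand-in for the output of nltk.ne_chunk (hypothetical sentence).
sample_tree = Tree('S', [
    Tree('PERSON', [('Isaac', 'NNP'), ('Newton', 'NNP')]),
    ('lived', 'VBD'),
    ('in', 'IN'),
    Tree('GPE', [('London', 'NNP')]),
    ('.', '.'),
])

sample_entities = []
for node in sample_tree:
    # Only subtrees have a label() method; plain (word, tag) tuples do not.
    if hasattr(node, 'label'):
        text = ' '.join(word for word, tag in node.leaves())
        sample_entities.append((text, node.label()))

print(sample_entities)
```

Multi-word entities are joined with spaces so that each entity becomes a single string paired with its label.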

In [58]:
### GRADED
named_entities = []


    
### BEGIN SOLUTION
named_entities = []
for word in nltk.ne_chunk(words_pos):
    if hasattr(word, 'label'):
        named_entities.append((' '.join(c[0] for c in word.leaves()), word.label()))
### END SOLUTION

### ANSWER CHECK
print(type(named_entities))
print(named_entities[:5])

<class 'list'>
[('Philosophiae', 'GSP'), ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'), ('Wikisource', 'GPE'), ('INDEX Tituli', 'ORGANIZATION'), ('Auctoris', 'GPE')]


In [59]:
### BEGIN HIDDEN TESTS
with open(filepath) as f:
    principia_ = f.readlines()
    
tokens_ = word_tokenize(' '.join(principia_))
words_pos_ = nltk.pos_tag(tokens_)
named_entities_ = []
for word in nltk.ne_chunk(words_pos_):
    if hasattr(word, 'label'):
        named_entities_.append((' '.join(c[0] for c in word.leaves()), word.label()))
#
#
#
assert named_entities == named_entities_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 5

#### Keeping Only Places

**5 Points**

From the `named_entities` list, keep only the entities labeled `GPE` (geopolitical entity). Lowercase the text of each one and assign the resulting list of strings to `places` below.
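As a small sketch of this filtering step, the snippet below applies it to the first five entities printed by the Problem 4 answer check. The names `sample_entities` and `sample_places` are hypothetical and are used only to avoid overwriting the graded variables.

```python
# First five entities printed by the Problem 4 answer check.
sample_entities = [
    ('Philosophiae', 'GSP'),
    ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'),
    ('Wikisource', 'GPE'),
    ('INDEX Tituli', 'ORGANIZATION'),
    ('Auctoris', 'GPE'),
]

# Keep only GPE entities and lowercase their text.
sample_places = [text.lower() for text, label in sample_entities
                 if label == 'GPE']

print(sample_places)  # ['wikisource', 'auctoris']
```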

In [60]:
### GRADED
places = []


    
### BEGIN SOLUTION
places = [i[0].lower() for i in named_entities if i[1] == 'GPE']
### END SOLUTION

### ANSWER CHECK
print(type(places))
print(places[:5])

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']


In [61]:
### BEGIN HIDDEN TESTS
places_ = [i[0].lower() for i in named_entities_ if i[1] == 'GPE']
#
#
#
assert places == places_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 6

#### Removing stopwords

**5 Points**

Remove all English stopwords from the list `places`. Assign the remaining words as a list to `no_stops` below.
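The filtering pattern can be sketched without the `nltk` stopword corpus. Both the word list and the tiny stopword set below are hypothetical; in the activity the stopwords come from `stopwords.words('english')`.

```python
# Hypothetical inputs: a short list of words and a tiny stopword set.
sample_places = ['wikisource', 'the', 'orbibus', 'in', 'fluida']
sample_stops = {'the', 'in', 'of', 'and'}

# Keep every word that is not in the stopword set.
sample_no_stops = [w for w in sample_places if w not in sample_stops]

print(sample_no_stops)  # ['wikisource', 'orbibus', 'fluida']
```

Storing the stopwords in a set makes each membership check fast, which can matter on longer lists than the one in this activity.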

In [62]:
from nltk.corpus import stopwords

In [63]:
### GRADED
no_stops = ''


    
### BEGIN SOLUTION
no_stops = [i for i in places if i not in stopwords.words('english')]
### END SOLUTION

### ANSWER CHECK
print(type(no_stops))
print(no_stops)

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus', 'superficiebus', 'mediis', 'fluida']


In [64]:
### BEGIN HIDDEN TESTS
places_ = [i[0].lower() for i in named_entities_ if i[1] == 'GPE']
no_stops_ = [i for i in places_ if i not in stopwords.words('english')]
#
#
#
assert no_stops == no_stops_
### END HIDDEN TESTS