# Lab 8: Natural Language Processing

Write your name below [3 points]

## Name:          

The code cells below extract the plain-text Wikipedia entry for *Eastern Connecticut State University*. The details of this step are beyond the scope of our class, but if you are curious I can explain more about what this code does. To complete the assignment, you just need to understand that the text of the Wikipedia article is stored in the variable *text*, which you will first analyze using *TextBlob*.

In [None]:
# download the wikipedia entry
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Eastern_Connecticut_State_University&prop=extracts&explaintext')

if page.status_code != 200 :
    print('Error: Page Not Found. Try again or ask Dr. Dancik for help')

In [None]:
# the result is in JSON format, which we convert to a dictionary using json.loads
import json
j = json.loads(page.text)

In [None]:
# extract the text

# first find the pageId
pageId = list(j['query']['pages'].keys())[0]
pageId

# get the extract for this pageId, and store the text in the variable 'text'
text = j['query']['pages'][pageId]['extract']
print(text)
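
For the curious, the nesting that the cells above navigate can be seen on a small hand-written example. The sketch below assumes a response with the same shape (a `query` object containing `pages`, keyed by page ID). The page ID `12345`, the title, and the extract text are made up for illustration and are not real API output.

```python
import json

# Hypothetical, hand-written response with the same nesting as the API result
sample = '{"query": {"pages": {"12345": {"title": "Example", "extract": "Example text."}}}}'

# convert the JSON string to nested dictionaries
j_sample = json.loads(sample)

# 'pages' is keyed by page ID, so take the first (and only) key
sample_id = list(j_sample['query']['pages'].keys())[0]

# the plain-text article is stored under the 'extract' key for that page
sample_text = j_sample['query']['pages'][sample_id]['extract']
print(sample_id, '->', sample_text)
```

The page ID is looked up rather than typed in because it is not known ahead of time.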

## *TextBlob* questions

The code below imports *TextBlob* and creates a *blob* from the Wikipedia text.

In [None]:
from textblob import TextBlob
blob = TextBlob(text)

### Question 1  <span style = 'font-size: 80%'>(5 points each = 15 points)</span>
(a) How many words are in the text?

(b) How many sentences are in the text?

(c) What is the 3rd sentence?

### Question 2 <span style = 'font-size: 80%'>(5 points each = 10 points)</span>
(a) Print out all the sentences that contain 'Eastern Connecticut State University'. Print a blank line after each sentence.

(b) A programmer may want to repeat this kind of analysis for multiple words or phrases. This is where a *function* can be very useful.
Write a function that has the following format:

```python
def printSentences(blob, search) :
    # prints out all sentences in the 'blob' that contain the 'search' term.
```

Then use this function to print out all sentences that contain 'Willimantic'.


### Question 3 <span style = 'font-size: 80%'>(5 points each = 10 points)</span>
Recall that the word counts are stored in the *blob.word_counts* dictionary. 

(a) How many times does the word 'eastern' appear in the text? 

(b) How many times does the word 'student' appear?

### Question 4 <span style = 'font-size: 80%'>(7 points each = 21 points)</span>
(a) Use a list comprehension to create a list of word-frequency pairs, one pair for each word. Then sort this list from the most frequent word to the least frequent word. The cell below imports the *itemgetter* function, which is useful for sorting a list when each item in the list is a tuple.

In [None]:
from operator import itemgetter
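
As a small illustration of how *itemgetter* helps, the sketch below sorts a made-up list of tuples by the second element of each tuple. The words and counts are invented for the example and are not taken from the Wikipedia text.

```python
from operator import itemgetter

# made-up (word, count) pairs for illustration only
pairs = [('cat', 2), ('dog', 5), ('bird', 1)]

# itemgetter(1) returns a function that picks out element 1 (the count) of each tuple;
# reverse=True sorts from the largest count to the smallest
ordered = sorted(pairs, key=itemgetter(1), reverse=True)
print(ordered)
```

Using `itemgetter(0)` instead would sort by the word rather than by the count.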

(b) It is very common to convert a word count dictionary to a sorted list of word-frequency pairs (tuples) when doing natural language processing. Write a function called *getWordCounts* that takes a *blob* as input and returns a list of word-frequency pairs, sorted from the most frequent word to the least frequent. Then use this function to get the same list of word counts that you created in (a).

(c) Your results above should show that 'the' is the most common word. Let's now remove common words like 'the' and 'and'. The code below creates a set of stopwords that are stored in *sw*. Create a new list of word-frequency pairs but with the stopwords removed. Display the word counts, either as a list of tuples or as a data frame.

In [None]:
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
sw
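
One way to use a set like *sw* is to test membership with `in`. The sketch below uses a tiny made-up stopword set (`toy_sw`) and made-up counts (`toy_counts`) rather than the NLTK list, so the names and numbers are assumptions for illustration only.

```python
# tiny, made-up stopword set standing in for the NLTK list
toy_sw = {'the', 'and', 'of'}

# made-up (word, count) pairs
toy_counts = [('the', 10), ('university', 4), ('and', 3), ('campus', 2)]

# keep only the pairs whose word is not in the stopword set
kept = [(word, count) for word, count in toy_counts if word not in toy_sw]
print(kept)
```

The comparison is case-sensitive, so it works here because both the words and the stopwords are lowercase.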

### Question 5 <span style = 'font-size: 80%'>(7 points)</span>

Generate a word cloud directly from the text, after first converting all of the text to lowercase.

### Question 6 <span style = 'font-size: 80%'>(5 points)</span>

Use TextBlob's *translate* function to translate the first sentence of the Wikipedia entry into Spanish.

## *spaCy* questions

Run the code below to load the *en_core_web_sm* language model and carry out natural language processing using spaCy, storing the results in *doc*.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

### Question 7 <span style = 'font-size: 80%'>(5 points each = 10 points)</span>

Recall that *doc* is a sequence of tokens (words and punctuation).

(a) Use slicing to view the first 20 words.

(b) Use slicing to view the last 20 words.
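
Slicing a *doc* uses the same notation as slicing a list. As a reminder of that notation, the sketch below slices a plain Python list of made-up items (a stand-in, not a spaCy object) to get its first three and last three items.

```python
# plain list of made-up items standing in for a longer sequence
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

first_three = words[:3]    # items at positions 0, 1, and 2
last_three = words[-3:]    # a negative start counts back from the end
print(first_three)
print(last_three)
```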

### Question 8 <span style = 'font-size: 80%'>(5 points)</span>

Use displacy to view the text where the named entities are highlighted.

### Question 9 <span style = 'font-size: 80%'>(7 points)</span>

Iterate through the named entities in *doc.ents* and output all of the dates. Note that if an entity *ent* is a date, then its label *ent.label_* will be equal to 'DATE'.

In *spaCy*, the sentences from the text are stored in *doc.sents*, and we can iterate through each sentence by using:

```python
for sentence in doc.sents :
    # do something for each sentence
```

The code below creates a function named *containsDate* that returns *True* if a sentence *x* contains a date; otherwise the function returns *False*. The code demonstrates this function by testing it on the first and second sentences.

In [None]:
# check all named entities in 'x' and return True if at least one is 'DATE'
def containsDate(x) :
    for ent in x.ents:
        if ent.label_ == 'DATE' :
            return True
    return False   


# get a list of sentences
sentences = list(doc.sents)

# look at the first sentence, which does not contain a date
s = sentences[0]
print(s)
print('contains date:', containsDate(s))
print()

# look at the second sentence, which does contain a date
s = sentences[1]
print(s)
print('contains date:', containsDate(s))
print()

### Question 10 <span style = 'font-size: 80%'>(7 points)</span>

Use the *containsDate* function above to output all the sentences that contain a date. Print a blank line after each sentence.