# In this notebook:

1. **Part-of-Speech Tagging Text**
2. **Named Entity Recognition**
3. extra: Python **Functions**


# 1. Part-of-Speech Tagging Text

#### Questions & Objectives:

- How can I extract words that have a particular part of speech (POS) such as a noun or a verb?
- How can I visualise those extracted words?
- To understand what a part-of-speech (POS) is.
- To use a POS tagger to label a corpus.
- To extract words with a specific POS.
- To visualise the extracted words using a frequency distribution graph and a word cloud.

#### Key Points

- We will use NLTK’s part-of-speech tagger, ```averaged_perceptron_tagger```, to label each word with its part of speech, providing information on tense, number (plural/singular) and case.
- We will use the text from the US Presidential Inaugural speeches, in particular that from the last speech (Trump's).
- We will then extract all nouns both plural (NNS) and singular (NN).
- We will visualise the nouns extracted using a frequency distribution graph and a word cloud.

In [None]:
# run this cell now. It's the usual imports of text mining libraries

import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

In text mining it can be useful to extract words that have a particular part of speech (POS) such as a noun or a verb. For example, extracting all proper nouns can give us names and locations. This is done using a POS-tagger.

**The POS-tag of a word is a label of the word indicating its part of speech as well as grammatical categories such as tense, number (plural/singular) and case. POS tagging is the process of automatically determining the POS-tags of the tokens in a corpus.**

In this lesson, we will use NLTK’s `averaged_perceptron_tagger` as the POS-tagger. It uses the perceptron algorithm to predict which POS-tag is most likely given the word. We need to download the tagger in order to use it.

The POS-tagger outputs tokens tagged with their POS-tag. It uses the [Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) which is widely used for POS-tagging text.

POS-tagging text is very useful when analysing a corpus or document and will allow us to do more in-depth analysis and visualisations. To POS-tag text using NLTK, you also have to import `pos_tag` from the `nltk.tag` package.

We are going to use the text from the US Presidential Inaugural speeches. This is a data set that we can download from NLTK.

In [None]:
# let's import the needed libraries
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
nltk.download('inaugural')
from nltk.corpus import inaugural
!pip install wordcloud

This corpus comes in raw format but also is pre-tokenised. Therefore we can call the words() method to retrieve the tokenised text of all speeches.

In [None]:
inaugural_tokens=inaugural.words()
print(inaugural_tokens)

We can look at the tokens from the last inaugural, the one made by President Trump, by looking at the last member of the list of speeches using the fileids() method.

In [None]:
# [-1] selects only the last file id; [0:-1] would select every speech except the last
inaugural_tokens_trump = inaugural.words(inaugural.fileids()[-1])
print(inaugural_tokens_trump)
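
The choice between a slice and a single index decides which speeches are selected. A minimal sketch, using a short made-up stand-in for the list that `inaugural.fileids()` returns (the three file names are only illustrative), shows the difference:

```python
# Stand-in for inaugural.fileids(); the real list is much longer.
file_ids = ["1789-Washington.txt", "2009-Obama.txt", "2017-Trump.txt"]

all_but_last = file_ids[0:-1]   # slice: every element except the last one
last_only = file_ids[-1]        # index: just the last element

print(all_but_last)
print(last_only)
```

Passing `file_ids[-1]` to `inaugural.words()` would therefore return the tokens of a single speech, assuming the last file in your copy of the corpus is the one you want.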

We can assign POS-tags to all speeches using NLTK’s pos_tag() method and view the first 20:

In [None]:
tagged_inaugural_tokens = nltk.pos_tag(inaugural_tokens)
tagged_inaugural_tokens[:20]

But what do these tags mean? The official list is here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

But here's a cheat sheet with English examples for all of us who do not have a degree in linguistics (yet): 

```
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to, as in go ‘to’ the store
UH interjection, errrrrrrrm
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
```

This cheat sheet is from https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b

We can then set up lists to hold specific parts of speech such as nouns. First we set up an empty list and then we search for the nouns, NN singular and NNS plural. We can then print the first 20:

In [None]:
nouns = [] 
nouns = [word for (word, pos) in tagged_inaugural_tokens if (pos == 'NN' or pos == 'NNS')] 
nouns[:20]
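
The list comprehension above relies on each tagged token being a `(word, tag)` pair. As a small illustration, with hand-written pairs rather than real tagger output, the same filter can be written with a set of wanted tags:

```python
# Hand-written (word, tag) pairs standing in for the tagger's output.
tagged_sample = [("the", "DT"), ("people", "NNS"), ("govern", "VBP"),
                 ("a", "DT"), ("nation", "NN"), ("wisely", "RB")]

noun_tags = {"NN", "NNS"}  # singular and plural common nouns
sample_nouns = [word for (word, pos) in tagged_sample if pos in noun_tags]

print(sample_nouns)  # → ['people', 'nation']
```

Using a set makes it easy to add further tags later (for example `NNP` for proper nouns) without lengthening the condition.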

Now that we have created this list of nouns, we can plot their counts in the corpus as we did yesterday (see lesson on frequency counts).

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(nouns)
fdist.plot(20,title='Frequency distribution for 20 most common nouns in the inaugural corpus')
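
In NLTK 3, `FreqDist` builds on Python's `collections.Counter`, so the counting step itself can be tried out without plotting. A minimal sketch on a made-up list of nouns:

```python
from collections import Counter

# Made-up nouns, not taken from the corpus.
sample_nouns = ["nation", "people", "nation", "world", "people", "nation"]
counts = Counter(sample_nouns)

# most_common(n) returns the n most frequent items with their counts
top_two = counts.most_common(2)
print(top_two)  # → [('nation', 3), ('people', 2)]
```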

### 🐛Minitask 1.1

Plot the same but for the 30 most frequent nouns.  Take a look at the graph and spot any errors.

<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
    ### BEGIN SOLUTION
    fdist.plot(30,title='Frequency distribution for 30 most common nouns in the inaugural corpus')
    ### END SOLUTION

</details>

Notice one of the frequent nouns listed is apparently "s".  What could this be?

We can also plot the nouns as a word cloud like we did yesterday (see lesson on Counting tokens in text):

In [None]:
from wordcloud import WordCloud
%matplotlib inline
import matplotlib.pyplot as plt
cloud = WordCloud(max_font_size=60,colormap="hsv").generate(' '.join(nouns))
plt.rcParams["figure.figsize"] = (16,12)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 🐛Minitask 1.2

Change the code above to create a frequency list for the 10 most common adjectives in the inaugural corpus. The POS-tag for adjectives is ‘JJ’.

In [None]:
# Write your solution here



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
    ### BEGIN SOLUTION
    adjectives = []
    adjectives = [word for (word, pos) in tagged_inaugural_tokens if (pos == 'JJ')]
    fdist = FreqDist(adjectives)
    fdist.plot(10,title='Frequency distribution for 10 most common adjectives in the inaugural corpus')
    ### END SOLUTION
</details>

### 🐛Minitask 1.3

Plot a word cloud of the adjectives in the inaugural corpus.

In [None]:
# Write your solution here




<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    cloud = WordCloud(max_font_size=60,colormap="hsv").generate(' '.join(adjectives))
    plt.rcParams["figure.figsize"] = (16,12)
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
    ### END SOLUTION

</details>

### 🐛Minitask 1.4

You can do the same for another POS-tag. See the full list of Penn Treebank POS tags above in this notebook.

In [None]:
# Write your solution here



# 2. Named Entity Recognition


#### Questions & Objectives:

- How can I identify all the names in a corpus?
- How can I visualise those names?
- To understand what named entity recognition is.
- To understand how to use a named entity recogniser.
- To extract named entities from text.
- To visualise the extracted entities.

#### Key Points

- We will use spaCy's named entity recogniser to extract named entities from text.
- We will continue to use the text from the US Presidential Inaugural speeches.
- We will then extract certain entity types (e.g. all person names) and will see some errors in the output.
- We will visualise the entities extracted using a word cloud.



Named entity recognition (NER) or named entity tagging is a way of locating and labelling named entities in text (e.g. names of people, locations and organisations). NER can be used to identify networks of people mentioned in data or as the first step towards geo-parsing text, i.e. extracting and disambiguating locations mentioned in text.
There are several off-the-shelf NER taggers that we can use. Here we will be using the [spaCy](https://spacy.io) tagger.

To use it, you first need to install spaCy and download the model it requires. You then import spaCy and the model, and load the model. You should get a number of messages; if everything worked, they will say that the download and installation were successful.

In [None]:
!pip install -U spacy
!python3 -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

We have already loaded the inaugural corpus from NLTK as an example. So we will use its raw text (```inaugural.raw()```) as input to tag it with named entities (in this case names of people, organisations and locations). To do that we just call ```nlp()``` on that raw text. This can take a few seconds as it has to process all of the inaugural corpus. The results are stored in the variable ```doc```.

In [None]:
text=inaugural.raw()
doc=nlp(text)

We can then print the first 10 entities found in the text and their entity tags using a for loop:

In [None]:
for i in range(10):
    print(doc.ents[i].text + ", " + doc.ents[i].label_)
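
Indexing with `range(10)` assumes that at least 10 entities were found. The sketch below shows a variant that slices instead, which simply prints fewer lines when there are fewer entities. The `Ent` tuples are hypothetical stand-ins for spaCy's entity objects; only the `text` and `label_` attributes are mimicked:

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy entities (only .text and .label_ are mimicked).
Ent = namedtuple("Ent", ["text", "label_"])
sample_ents = (Ent("Senate", "ORG"), Ent("first", "ORDINAL"), Ent("United States", "GPE"))

# slicing past the end is safe: we get at most 10 items, here only 3
lines = [ent.text + ", " + ent.label_ for ent in sample_ents[:10]]
for line in lines:
    print(line)
```

With the real data, `doc.ents[:10]` can be used in place of `sample_ents[:10]`.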

We can save the entities separately in a list of tuples by extracting them for the output ```doc``` and then print the first 20 entries of that list:

In [None]:
ents=doc.ents
named_entities = [(ent.text, ent.label_) for ent in ents] 
named_entities[0:20]

Check out all the different entity types that have been found here (ORG, DATE, ORDINAL, GPE = geopolitical entity etc. etc.). The complete named entity tagset can be found in the spaCy [documentation](https://spacy.io/api/annotation#named-entities).
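
To get an overview of which entity types occur and how often, the labels can be counted. A minimal sketch on a hand-made list in the same `(text, label)` shape as `named_entities` (the entries are invented for illustration):

```python
from collections import Counter

# Invented entries in the same (text, label) shape as named_entities.
sample_entities = [("Senate", "ORG"), ("United States", "GPE"),
                   ("Washington", "PERSON"), ("America", "GPE")]

# count how many times each label occurs
label_counts = Counter(label for (_text, label) in sample_entities)
print(label_counts.most_common())
```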

### 🐛Minitask 2.1

Print the last 20 entities in the inaugural corpus.

In [None]:
# Write your solution here



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    # you need to specify the start counter to be 20 from the end (end minus 20)
    # you can determine the end by using the len() method on the list
    # the end counter can be left unspecified
    # so the following statement returns the last 20 elements of the list
    named_entities[len(named_entities)-20:]
    ### END SOLUTION

</details>
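
Python also accepts negative indices in slices, counting from the end of the list, which gives a shorter way to write the same kind of selection. A small sketch on a made-up list:

```python
numbers = [10, 20, 30, 40, 50]

# both expressions select the last 3 elements
with_len = numbers[len(numbers) - 3:]
with_negative_index = numbers[-3:]

print(with_len == with_negative_index)  # → True
print(with_negative_index)              # → [30, 40, 50]
```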

At this point we will introduce the concept of function to make this easily repeatable for different types of entities.

## 3. Functions: the most powerful tool in a programmer's hands

__Function:__ A function is a block of code that can be called to complete a task.

We have already used a lot of functions that the creators of Python wrote for us, like `print()` or `len()`. Notice what `len()` does:

`length_of_list = len( [4,5,6] )`

It's like a magic spell that Harry Potter has learned, and now he can use it in different contexts and on different things.
 
For example, once Harry learned how to freeze things with the `glacius` spell, he can then use it on various objects: 

`ice = glacius( water )`

`frozen_malfoy = glacius( malfoy )`

etc. Notice that this spell **takes** some **ARGUMENTS** which are sort of like inputs, and **returns** something new or changed.

The syntax of using a function is:

`result = function_name( argument )` or even 

`result = function_name( argument1, argument2, argument3etc )`


#### Creating our own functions:

The most amazing thing is that we can create our own functions:

In Python a function is defined using the ```def``` keyword. The name of the function is specified after ```def``` and the parameters passed to the function are specified in round brackets after the name.  The indented bit of code inside a function specifies everything that the function computes. A function can return data as a result of the computation it has performed using the ```return``` keyword.

`def my_own_function( argument1, argument2, argument3):
    do something with arguments
    return some_value`

The important detail is: when you define a function, it is like LEARNING a spell. You have not yet CALLED/TRIGGERED/CAST that spell. For example, it is possible to learn a freezing spell without actually freezing things around you.

- define a function: teach the computer what to do `def something():`
- call a function: use the skill you learned earlier `something()`

For example:

In [None]:
# here I define the function - notice nothing happened yet when you run this cell!
def add_numbers( number1, number2):
    print("Adding numbers now!")
    return number1+ number2

# "adding numbers now!" is NOT printed yet, because function is defined, but not used (or called) yet!

In [None]:
# here I call the function, it will return the value
print("I'm about to do the adding")
my_result = add_numbers(3,5)
print("I'm done adding")
print(my_result)

Here's another example of a function:

In [None]:
def is_first_number_larger( number1, number2):
    if number1 > number2:
        return True
    else:
        return False

In [None]:
my_result = is_first_number_larger(3,5)
print(my_result)
my_result = is_first_number_larger(12,5)
print(my_result)

In [None]:
# you could even put the return value of a function right into a print statement:

print(is_first_number_larger(12,5))
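
Because a comparison such as `number1 > number2` already evaluates to `True` or `False`, the same function can be written more compactly. A sketch of that variant (the name `is_first_number_larger_short` is made up here to avoid overwriting the function above):

```python
def is_first_number_larger_short(number1, number2):
    # the comparison itself produces the boolean we want to return
    return number1 > number2

print(is_first_number_larger_short(3, 5))   # → False
print(is_first_number_larger_short(12, 5))  # → True
```

Both versions behave the same; the longer `if`/`else` form can be easier to read while you are still getting used to functions.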

### 🐛Minitask 3.1

Write a function that tells you if a word is longer (in number of characters) than a given number, like ```is_word_longer_than(word, number)```, that you could call as in the example below:

In [None]:
# write your function here, so that the code below (which calls your function many times) works.



# tests:
print(is_word_longer_than("banana", 4)) # should be true
print(is_word_longer_than("banana", 12)) # should be false
print(is_word_longer_than("plum", 3)) # should be true
print(is_word_longer_than("plum", 4)) # should be false
print(is_word_longer_than("plum", 5)) # should be false

<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    def is_word_longer_than(word, number):
        if len(word) > number:
            return True
        else:
            return False
    ### END SOLUTION
    
</details>

### Back into text mining using functions to help us with our work

Below is the ```get_entities_of_type()``` function.  

It contains code that extracts named entities of a specific type from the NER-tagged output.  This function takes as input the chosen ```type``` of entity and  ```named_entities``` list. It will use them to filter entities and return only the entities of the specified type marked up in the text.

In [None]:
def get_entities_of_type(type_we_seek, entities):
    ents = [string for (string, tag) in entities if (tag == type_we_seek)] 
    # you can add indentation to the above line if you like
    print("Number of strings tagged as " + type_we_seek + " " + str(len(ents)))
    return ents

To run the function, for example to select all entities tagged as ```PERSON```, you need to specify the type and the list of named entities to select from.  The output list can be stored in a new variable (```people```). You can then inspect the first few person entities found in the inaugural speeches and you will see that the tagger is not always correct (e.g. North, Providence).

In [None]:
type="PERSON"
people = get_entities_of_type(type, named_entities)
people[:20]
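
To see what the function does without running the tagger, it can be tried on a tiny hand-made list. The sketch below repeats the function definition so that it runs on its own; the sample entities are invented for illustration:

```python
def get_entities_of_type(type_we_seek, entities):
    # keep only the strings whose tag matches the type we are looking for
    ents = [string for (string, tag) in entities if tag == type_we_seek]
    print("Number of strings tagged as " + type_we_seek + " " + str(len(ents)))
    return ents

# Invented sample in the same (text, label) shape as named_entities.
sample_entities = [("Washington", "PERSON"), ("Senate", "ORG"),
                   ("Lincoln", "PERSON"), ("America", "GPE")]

sample_people = get_entities_of_type("PERSON", sample_entities)
print(sample_people)  # → ['Washington', 'Lincoln']
```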

Notice the difference.

Before we used functions:

`type="PERSON"
persons = [string for (string, tag) in named_entities if (tag == type)]
type="ORG"
organisations = [string for (string, tag) in named_entities if (tag == type)]`

with a function:

`type="PERSON"
persons = get_entities_of_type(type, named_entities)
type="ORG"
organisations = get_entities_of_type(type, named_entities)`

It's cleaner and prettier. Right now our calculation takes just one line, but imagine we needed 10 lines of calculations in the top version: these two examples would then take us 20 lines of code.

Using a function you only need to define the code once and can execute it multiple times in different ways.

## Rule of thumb: if you do something a lot, and catch yourself copy-pasting a lot of code, create a function for it, and extract all the 'variables' (things that vary) into function arguments.


We can plot the entities extracted by their frequency or create word clouds for them (as we did in the previous lesson).  For example, we can create a word cloud just for the person names mentioned in the corpus.

To do that we need to create a Counter dictionary containing the count of how often each entity occurs. A Counter is a subclass of Python's dictionary (`dict`): elements are stored as keys and their counts as values.

As you will see, spaCy finds some person names in the inaugural speeches more often than others.

In [None]:
from collections import Counter
people_dict=Counter(people)
print("Number of unique person names: " + str(len(people_dict)))
print(people_dict)
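
A short sketch of how a `Counter` behaves like a dictionary, using a made-up list of names rather than the real `people` list:

```python
from collections import Counter

# Made-up names, not the tagger's output.
sample_people = ["Washington", "Lincoln", "Washington", "Jefferson", "Washington"]
sample_counts = Counter(sample_people)

print(isinstance(sample_counts, dict))  # → True
print(sample_counts["Washington"])      # → 3
print(len(sample_counts))               # → 3  (number of unique names)
print(sample_counts["Madison"])         # → 0  (missing keys count as zero)
```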

In [None]:
cloud = WordCloud(max_font_size=50,colormap="hsv").generate_from_frequencies(people_dict)
plt.figure(figsize=(16,12))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### 🐛Minitask 3.2

Create a similar word cloud but for the geopolitical entities (GPE), i.e. countries, cities or states, in the corpus. You can reuse the function we created earlier to select GPE entities.  If you re-use code, remember to adjust the variable names.

In [None]:
# Write your solution here



<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    gpes = get_entities_of_type("GPE", named_entities)
    gpe_dict=Counter(gpes)
    cloud = WordCloud(max_font_size=50,colormap="hsv").generate_from_frequencies(gpe_dict)
    plt.figure(figsize=(16,12))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
    ### END SOLUTION
    
</details>

### 🦋 Extra task (optional):

Look at the pieces of code that we used and reused most often in the notebooks, and turn them into functions.

These functions could be your tools that you will use later, like a personal crafted cheat-sheet.

In [None]:
# Write your solution here

