# Who Are Our Rebels

In this notebook I'm going to use some simple NLP to explore who our favorite rebels were. In the process I hope to demonstrate some of the data-wrangling challenges that go along with NLP.

### Get Data from Canvas

Canvas has a RESTful API. I'm going to use it to pull down the responses to the homework assignments.

By the way, you can also use the Canvas API to access your data.

The cell below contains the code I used to get the data from Canvas.

```Python
import json
import os
import unicodedata

from bs4 import BeautifulSoup
from canvasapi import Canvas

# Read the Canvas API token from a local file
with open(os.path.join(os.path.expanduser("~"), ".canvaslms", "quiz_token")) as f:
    token = f.read().strip()

API_URL = "https://canvas.lms.unimelb.edu.au/"
canvas = Canvas(API_URL, token)
bec = canvas.get_user(canvas.get_current_user().id)
ehealth = canvas.get_course(110024)

# This is the id number for the assignment
rebel_id = 139157

rebels = ehealth.get_assignment(rebel_id)

rebel_submissions = rebels.get_submissions()

responses = [(b.user_id, b.body) for b in rebel_submissions]


len(responses)

len(set([r[0] for r in responses]))

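# Strip the HTML markup and normalize Unicode (e.g. non-breaking spaces); skip empty submissions
rebel_text = [unicodedata.normalize("NFKC", BeautifulSoup(r[1], "html.parser").get_text()) for r in responses if r[1]]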

with open("rebel_text.json", "w") as f:
    json.dump(rebel_text, f)
```
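
To illustrate what the HTML stripping and `NFKC` normalization in the listing above accomplish, here is a minimal sketch that uses only the standard library. The `TextExtractor` class and `html_to_text` function are hypothetical names introduced for this example, and `html.parser` stands in for BeautifulSoup, so the details may differ from what the listing produces on real submissions.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; character references such as
        # &nbsp; have already been converted to Unicode characters here.
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags, then apply NFKC normalization to the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return unicodedata.normalize("NFKC", "".join(parser.chunks))


sample = "<p>Marie&nbsp;Curie won a Nobel Prize.</p>"
print(repr(html_to_text(sample)))
```

Without the normalization step, the non-breaking space from `&nbsp;` would survive as `\xa0`, which looks like a space on screen but is a different character when strings are compared.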

In [None]:
import json
import os
import random
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd


### Load the Saved Responses

In [None]:
with open("rebel_text.json", "r") as f:
    rebel_text = json.load(f)

In [None]:
rebel_text

### We are going to use the very popular [Spacy](https://spacy.io/) NLP package.

If you are interested in learning more about Spacy, we have some notebooks [here](https://github.com/Melbourne-BMDS/md3nlp_20020) that you can run online with binder to learn more.

In [None]:
import spacy
from IPython.display import SVG, YouTubeVideo
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

#### Entity Recognition

Spacy will parse the sentences and then try to recognize different entities that are named in the text, such as people or organizations or diseases. Let's see how it works.

In [None]:
for txt in rebel_text:
    doc = nlp(txt)
    displacy.render(doc, style="ent")
    print('-'*72)

### Spacy seems to do OK
#### But there are some consistent failures

Sometimes the solitary surnames are recognized as `ORG`s (organizations). This is not surprising because:

- Dyson is a vacuum cleaner brand
- Tesla is a car company
    
The answer about Nikola Tesla is particularly problematic where we see Tesla as an organization, a piece of art, and a product---everything except a person.

![Tesla labels](tesla_edison.png)

### Filtering Entities

Let's reduce the number of recognized entities by keeping only those whose label might conceivably mark one of our rebels. The Tesla case makes this harder: we cannot keep `PERSON` alone, so `ORG`, `WORK_OF_ART`, and `PRODUCT` stay in as well. Eventually my algorithm is going to count the number of times each name is mentioned and guess that the most frequently named person is the identified hero.

In [None]:
rebels = []
labels = ['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT']
for txt in rebel_text:
    doc = nlp(txt)
    rebels.append([ent for ent in doc.ents if ent.label_ in labels and ent.string != 'Freeman' and ent.string != 'Dyson'])


In [None]:
rebels

### Sort identified entities

I want to sort the identified entities for each document from longest to shortest. This is so that I can combine entities such as "Albert Einstein" and "Einstein". 

In [None]:
for r in rebels:
    r.sort(key=lambda x:len(x.string), reverse=True)


### With our sorted lists, we can try to replace partial names with full names

In [None]:
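def get_full_names(r):
    """Replace, in place, each entity whose string occurs inside a longer one.

    Assumes r is already sorted from longest to shortest string.
    """
    n = len(r)
    for i in range(n-1):
        for j in range(i+1, n):
            if r[j].string in r[i].string:
                r[j] = r[i]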

Let's use `get_full_names` to replace all partial names (e.g. 'Albert' or 'Einstein') with the full name (e.g. 'Albert Einstein').

In [None]:
for i in range(len(rebels)):
    r = rebels[i]
    print(i)
    print("Before")
    print(r)
    get_full_names(r)
    print("After")
    print(r)
    print('-'*20)
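
To see the merging logic in isolation, here is a minimal sketch that applies the same sort-then-replace idea to plain Python strings instead of spaCy spans. The function name `merge_partial_names` and the sample mentions are made up for illustration.

```python
def merge_partial_names(names):
    """Return a copy of names with partial names replaced by a longer name containing them."""
    # Sort longest first, so a full name is always visited before its fragments.
    merged = sorted(names, key=len, reverse=True)
    for i in range(len(merged) - 1):
        for j in range(i + 1, len(merged)):
            # Plain substring test: 'Einstein' occurs inside 'Albert Einstein'.
            if merged[j] in merged[i]:
                merged[j] = merged[i]
    return merged


mentions = ["Einstein", "Albert Einstein", "Nobel Prize", "Albert", "Einstein"]
print(merge_partial_names(mentions))
```

Because the test is a plain substring match, any short string that happens to occur inside a longer one is merged too, whether or not it refers to the same person. That is one reason the results need to be checked by eye.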

### How well did it work?

Most of the substitutions worked reasonably well, but cases 5 (Venter) and 6 (Tesla) clearly failed. Let's examine those to see what is happening.

We are comparing the `string` attributes (`r[j].string in r[i].string`), so let's look at the strings themselves:

In [None]:
for ent in rebels[5]:
    print("'%s'"%ent.string)

In [None]:
for ent in rebels[6]:
    print("'%s'"%ent.string)

### Extra Spaces!

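We can see that the `Venter` and `Tesla` strings have an extra space after them, so our comparison `'Venter ' in 'John Craig Venter'` fails. The same happens with `'Tesla '`. This is because `Span.string` keeps any trailing whitespace (in spaCy v3 the attribute was removed in favor of `text` and `text_with_ws`). If we use the Python `strip` method, we can delete leading and trailing whitespace.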
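
The failure is easy to reproduce with ordinary strings. The sketch below uses hand-typed values that mimic the printed entity strings, not the actual spaCy objects.

```python
full = "John Craig Venter"
partial = "Venter "  # note the trailing space

# The trailing space has no counterpart at the end of the full name.
print(partial in full)                  # False

# After stripping both sides, the substring test succeeds.
print(partial.strip() in full.strip())  # True
```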

In [None]:
def get_full_names2(r):
    """Like get_full_names, but ignore leading and trailing whitespace when comparing."""
    n = len(r)
    for i in range(n-1):
        for j in range(i+1, n):
            if r[j].string.strip() in r[i].string.strip():
                r[j] = r[i]

In [None]:
with open("rebel_text.json", "r") as f:
    rebel_text = json.load(f)

rebels = []
labels = ['ORG', 'PERSON', 'WORK_OF_ART', 'PRODUCT']
# Entity strings to exclude (compared after stripping whitespace)
excluded = {'Freeman', 'Dyson', 'Freeman Dyson'}
for txt in rebel_text:
    doc = nlp(txt)
    rebels.append([ent for ent in doc.ents
                   if ent.label_ in labels and ent.string.strip() not in excluded])

for r in rebels:
    r.sort(key=lambda x:len(x.string), reverse=True)
    
for i in range(len(rebels)):
    r = rebels[i]
    print(i)
    print("Before")
    print(r)
    get_full_names2(r)
    print("After")
    print(r)
    print('-'*20)

### Count the identified Entities

In [None]:
counted=[Counter(r) for r in rebels]

In [None]:
for c in counted:
    print(c.most_common(5))

### How did our counting work?

Again, pretty well, but sometimes a name is counted with the same frequency as a non-name entity, e.g. `(Madame Curie, 2), (a Nobel Prize, 2)`. So let's start by selecting the entities that are tied for the top frequency and then see if we can select the ones labeled `PERSON`.

In [None]:
def most_frequent(counted):
    count = counted[0][1]
    return [c for c in counted if c[1] == count]

top_counted = [most_frequent(c.most_common(5)) for c in counted if c]
top_counted

### Return the top `PERSON`

If there is more than one `PERSON`, we'll just return the first one.

In [None]:
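def get_top_person(counted):
    """Return the first (entity, count) pair labeled PERSON, or None if there is none."""
    persons = [ent for ent in counted if ent[0].label_ == 'PERSON']
    return persons[0] if persons else None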
    

In [None]:
top_counted_persons = [c[0] if len(c) == 1 else get_top_person(c) for c in top_counted]
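
The tie-breaking rule can be sketched end to end with plain tuples. Here hypothetical `(text, label)` pairs stand in for spaCy entities, and `top_person` is an illustrative name that folds the steps above into one function. Unlike the cells above, it returns the entity itself rather than an `(entity, count)` pair.

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for spaCy entities.
entities = [
    ("a Nobel Prize", "WORK_OF_ART"),
    ("Madame Curie", "PERSON"),
    ("a Nobel Prize", "WORK_OF_ART"),
    ("Madame Curie", "PERSON"),
    ("Sorbonne", "ORG"),
]


def top_person(entities, n=5):
    """Pick the most frequent entity, preferring a PERSON when there is a tie."""
    ranked = Counter(entities).most_common(n)
    if not ranked:
        return None
    top_count = ranked[0][1]
    tied = [ent for ent, count in ranked if count == top_count]
    if len(tied) == 1:
        return tied[0]
    # Several entities share the top count: keep the first one labeled PERSON.
    persons = [ent for ent in tied if ent[1] == "PERSON"]
    return persons[0] if persons else None


print(top_person(entities))
```

When the top count is shared only by non-person entities the function returns `None`, which mirrors how those responses are dropped from the final tally.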

In [None]:
top_counted_persons

In [None]:
identified_rebels = [e[0] for e in top_counted_persons if e]
identified_rebels

In [None]:
identified_rebels.sort(key=lambda x:len(x.string), reverse=True)

In [None]:
identified_rebels

In [None]:
get_full_names2(identified_rebels)


In [None]:
identified_rebels

In [None]:
counted_identified_rebels = Counter(identified_rebels)
counted_identified_rebels.most_common(60)

In [None]:
f, axs = plt.subplots(1,figsize=(15,15))
pd.DataFrame([x.string.strip() for x in identified_rebels])[0].value_counts().head(60).plot.barh(ax=axs)
axs.set_xlabel("Counts")


In [None]:
f.savefig("identified_rebels.png")

## Discussion

I took a fairly simplistic approach to identifying the named rebels. The technique was not robust to several textual features, such as typos, misspellings, and possessive forms. Because I was counting mentions of names, if someone used a lot of pronouns to refer to the rebel, I might not have identified them properly. Identify the answer you submitted. Did I correctly find your rebel? If not, can you think of things in your writing that could be edited to make the identification task easier?
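
As one possible direction for the possessive-form problem, the sketch below normalizes name strings before counting them. It was not part of the analysis above: `normalize_name` and the sample mentions are hypothetical, and the rule is deliberately crude (it would also strip a final apostrophe from a name that legitimately ends in one).

```python
import re
from collections import Counter


def normalize_name(name):
    """Drop a trailing possessive and tidy whitespace so variants compare equal."""
    name = name.replace("\u2019", "'")        # curly apostrophe -> straight
    name = re.sub(r"'s?$", "", name.strip())  # "Einstein's" -> "Einstein"
    return " ".join(name.split())             # collapse runs of whitespace


mentions = ["Einstein's", "Einstein", "Curie\u2019s ", "Curie", "Nikola  Tesla"]
print(Counter(normalize_name(m) for m in mentions).most_common())
```

Typos, misspellings, and pronouns are harder: they would need something like fuzzy string matching or coreference resolution, which goes beyond this notebook.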