# Introduction to Natural Language Processing with spaCy: Practice Challenge
Apparently, Harry Potter [fan-fiction](https://en.wikipedia.org/wiki/Fan_fiction) is worthy of [serious](http://www.the-leaky-cauldron.org/features/essays/issue4/genretheory/) [academic](https://wesscholar.wesleyan.edu/cgi/viewcontent.cgi?article=2031&context=etd_hon_theses) [study](https://dspace.library.uu.nl/bitstream/handle/1874/315567/BAThesis%20Janieke%20Koning%203858863.pdf;sequence=1). We're going to build a machine that analyses Harry Potter fan-fiction, so no-one else has to read any if they don't want to. You'll need to install [spaCy](https://spacy.io/usage/), probably using [Anaconda](https://anaconda.org/conda-forge/spacy). You'll also need the [English model](https://spacy.io/usage/models).

In [None]:
from collections import Counter
from unittest import TestCase, TestLoader, TextTestRunner
from itertools import chain, count, takewhile, tee

from bs4 import BeautifulSoup
import requests
import spacy

def runTest(case):
    # Run every test method defined on the given TestCase subclass.
    suite = TestLoader().loadTestsFromTestCase(case)
    TextTestRunner().run(suite)

# Make sure you've installed the English model first:
#   python3 -m spacy download en
# (spaCy v3 dropped the 'en' shortcut; there, download and load 'en_core_web_sm' instead.)
nlp = spacy.load('en')

Here are some functions for grabbing stories from [www.fanfiction.net](https://www.fanfiction.net):

In [None]:
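def chapter_urls(url):
    # Yield the URL of chapter 1, 2, 3, ... of the story (an endless generator).
    chunks = url.split('/')
    yield from ('/'.join(chunks[:5] + [str(c)] + [chunks[-1]]) for c in count(1))

def chapter_content(urls):
    # Download each page lazily and pull out its "storytext" divs.
    content = (requests.get(u).content for u in urls)
    soups = (BeautifulSoup(c, "html5lib").find_all("div", class_="storytext") for c in content)
    # Stop at the first page with no story text, i.e. one past the last chapter.
    texts = chain.from_iterable(takewhile(lambda s: len(s) > 0, soups))
    yield from map(lambda t: t.text, texts)

def get_fanfic(url):
    # Join all the chapters into a single string.
    return '\n'.join(chapter_content(chapter_urls(url)))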
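
To see what these helpers do without touching the network, here is a small sketch. It repeats "*chapter_urls*" from the cell above so that it runs on its own, and it replaces the downloaded pages with a made-up list called "*fake_pages*" (an assumption for illustration only, not real site data).

```python
from itertools import chain, count, islice, takewhile

def chapter_urls(url):
    # Same logic as the notebook's helper: swap the chapter number into the URL.
    chunks = url.split('/')
    yield from ('/'.join(chunks[:5] + [str(c)] + [chunks[-1]]) for c in count(1))

story = 'https://www.fanfiction.net/s/12616526/1/Fight-or-Flight'

# The generator never ends, so take only the first three URLs.
first_three = list(islice(chapter_urls(story), 3))
for u in first_three:
    print(u)

# Hypothetical stand-in for the scraped pages: each inner list plays the role
# of the "storytext" divs found on one page; an empty list means the chapter
# does not exist.
fake_pages = [['chapter one'], ['chapter two'], [], ['never reached']]

# takewhile stops at the first empty result, which is what ends the download loop.
texts = list(chain.from_iterable(takewhile(lambda s: len(s) > 0, fake_pages)))
print(texts)
```

Because everything is lazy, the real "*chapter_content*" only requests pages until it reaches the first one with no story text.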

We're going to load [this story](https://www.fanfiction.net/s/12616526/1/Fight-or-Flight) into spaCy first.

In [None]:
url = 'https://www.fanfiction.net/s/12616526/1/Fight-or-Flight'
fanfic = get_fanfic(url)
nlp_fanfic = nlp(fanfic)

In [None]:
print(fanfic[0:256])

# Named Entity Recognition
The extraction and classification of proper nouns is called named entity recognition. Look at the [types of entities](https://spacy.io/api/annotation#named-entities) that spaCy's model has been trained to recognise. We'll treat entities with the following tags as possible locations in the story.

In [None]:
location_tags = ['NORP', 'FACILITY', 'ORG', 'LOC']

The document we generated from the story has an "*ents*" attribute that holds all the named entities. We'll collect all the entities whose "*label_*" attribute is in "*location_tags*".

In [None]:
location_entities = [ent for ent in nlp_fanfic.ents if ent.label_ in location_tags]

Let's look at the first 10. Not *too* bad. Apparently "Lorcan" is a character. Your humble narrator had to look that up.

In [None]:
location_entities[0:10]

The "root" of a word is called its *lemma*. Look at the "*lemma_*" attribute of the 5th entity.

In [None]:
location_entities[4].lemma_

The "*subtree*" attribute of an entity helps describe its context.

In [None]:
tree = list(location_entities[4].subtree)
print(tree)

# Part of Speech Tagging

Each token in the tree has a "*pos_*" attribute that contains its part of speech. "Gryffindor" is correctly identified as a proper noun. "Potter" should be too, but it's been tagged as a regular noun.

In [None]:
(tree[1].pos_, tree[4].pos_)

Here we use the ["collections.Counter" class](https://docs.python.org/3/library/collections.html#collections.Counter) to get the frequencies of the location entities. We get a dictionary-like object whose keys are the lemmas, and whose values are the number of times each lemma appears.

In [None]:
location_freqs = Counter([e.lemma_ for e in location_entities])
location_freqs['gryffindor']
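
As a quick illustration of how a Counter behaves, here is a sketch on a made-up list of lemmas. The strings in "*toy_lemmas*" are invented for the example and are not taken from the story.

```python
from collections import Counter

# Made-up lemmas standing in for [e.lemma_ for e in location_entities].
toy_lemmas = ['gryffindor', 'hogwarts', 'gryffindor', 'ministry', 'hogwarts', 'gryffindor']
toy_freqs = Counter(toy_lemmas)

print(toy_freqs['gryffindor'])   # 3
print(toy_freqs['slytherin'])    # 0: a missing key counts as zero rather than raising KeyError
print(list(toy_freqs.items()))   # the (lemma, frequency) pairs
```

Looking up a missing key does not add it to the Counter, so checking for a lemma that never appears is harmless.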

## Part 1.
Your turn: complete the function "*filter_freq_count*" so that it removes all the lemmas whose frequency is less than the given threshold. You *might* want to use the built-in [filter](https://docs.python.org/3/library/functions.html#filter) function.

In [None]:
def filter_freq_count(freq, threshold=2):
    pass
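
If the built-in filter is unfamiliar, this sketch shows it on an unrelated toy dictionary. The names "*stock*" and "*in_demand*" are made up for the example. filter takes a predicate and an iterable, and lazily yields the items for which the predicate is true.

```python
# Toy data, unrelated to the story.
stock = {'wand': 7, 'cauldron': 1, 'owl': 3}

# Each item is a (key, value) pair; keep the pairs whose value is at least 3.
in_demand = dict(filter(lambda item: item[1] >= 3, stock.items()))
print(in_demand)  # {'wand': 7, 'owl': 3}
```

filter returns an iterator, so wrap the result in a dict, list or Counter if you need to index it or loop over it more than once.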


## Part 2.
Complete the "*sorted_freq_count*" function. It should return a list of tuples in which the first value is the lemma and the second is its frequency. The highest frequency should come first; you'll need the [sorted](https://docs.python.org/3/library/functions.html#sorted) built-in function.

In [None]:
def sorted_freq_count(freq):
    pass
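
As a reminder of how sorted handles pairs, here is a sketch on some toy data. The "*house_points*" list is invented for the example.

```python
# Toy (name, score) pairs, unrelated to the story.
house_points = [('ravenclaw', 42), ('hufflepuff', 57), ('slytherin', 12)]

# key picks the value to sort on; reverse=True puts the largest first.
ranked = sorted(house_points, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('hufflepuff', 57), ('ravenclaw', 42), ('slytherin', 12)]
```

sorted always returns a new list and leaves its input untouched.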

Take a look at the top 10 locations:

In [None]:
sorted_freq_count(location_freqs)[0:10]

## Part 3.
Populate the list "*character_entities*" so that it contains all the entities from our document whose label is "*PERSON*".

In [None]:
character_entities = []

## Part 4.
Now complete the "*character_lemmas*" list so that it contains the lemma of each entity in "*character_entities*".

In [None]:
character_lemmas = []

If you've got everything right, this should give us the 15 most frequently occurring characters:

In [None]:
sorted_freq_count(Counter(character_lemmas))[0:15]

## Part 5.
Given a lemma and a list of entities, the "*lemma_filter*" function should return only those entities that have that lemma.

In [None]:
def lemma_filter(lemma, ents):
    pass

## Part 6.
Complete the "*entity_adjectives*" function, which takes a sequence of entities. It should search for all the tokens in all the entities whose "*pos_*" attribute is "ADJ" (adjective). Return the lemmas of those adjectives with their frequencies, ordered by your "*sorted_freq_count*" function.

In [None]:
def entity_adjectives(entities):
    pass

Let's get the entities for some of the most frequently occurring characters:

In [None]:
roxanne = list(lemma_filter('roxanne', character_entities))
james = list(lemma_filter('james', character_entities))
violet = list(lemma_filter('violet', character_entities))
fred = list(lemma_filter('fred', character_entities))
lysander = list(lemma_filter('lysander', character_entities))

Now we can use your function to see how they're described. Some of the results are quite reasonable:

In [None]:
for char in ['roxanne', 'james', 'violet', 'fred', 'lysander']:
    print('adjectives for '+char+': ', entity_adjectives(locals()[char])) 