# NLE Assessed Coursework 3: Question 3

For this assessment, you are expected to complete and submit 4 notebook files.  There is 1 notebook file for each question (to speed up load times).  This is notebook 3 out of 4.

Marking guidelines are provided as a separate document.

In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.

In [1]:
candidateno=184521 #this MUST be updated to your candidate number so that you get a unique data sample


In [2]:
#preliminary imports
import sys
sys.path.append(r'resources')
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from sussex_nltk.corpus_readers import ReutersCorpusReader
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import math
import spacy
nlp=spacy.load('en_core_web_sm')
from nltk.corpus import gutenberg

Sussex NLTK root directory is resources


## Question 3: Named Entity Recognition and Linking (25 marks)

The code below will run the SpaCy system on the text from Persuasion by Jane Austen.  `mysample` contains a 50% sample which is unique to your candidate number.

In [3]:
#Do NOT change the code in this cell.

#preparing corpus

def clean_text(astring):
    #replace newlines with space
    newstring=re.sub("\n"," ",astring)
    #remove title and chapter headings
    newstring=re.sub("\[[^\]]*\]"," ",newstring)
    newstring=re.sub("VOLUME \S+"," ",newstring)
    newstring=re.sub("CHAPTER \S+"," ",newstring)
    newstring=re.sub("\s\s+"," ",newstring)
    #return re.sub("([^\.|^ ])  +",r"\1 .  ",newstring).lstrip().rstrip()
    return newstring.lstrip().rstrip()


def get_sample(sentslist,seed=candidateno):
    random.seed(seed)
    random.shuffle(sentslist)
    testsize=int(len(sentslist)/2)
    return sentslist[testsize:]
    
persuasion=clean_text(gutenberg.raw('austen-persuasion.txt'))
nlp_persuasion=list(nlp(persuasion).sents)

mysample=get_sample(nlp_persuasion)

In [4]:
type(mysample[0])

spacy.tokens.span.Span

a) **Write code** and **extract**:
* the 30 most common strings referring to PEOPLE in `mysample`.
* the 30 most common strings referring to PLACES in `mysample`.

\[6 marks\]

In [5]:
people=[]
places=[]
def doc_freq(doclist):
    #df dict is initialised
    df={}
    #iterate through each document in the given list as well as each item in the doc
    for doc in doclist:
        for w in doc:
            #increment the count for this item
            df[w]=df.get(w,0)+1
    #the frequency dict is returned
    return df

for sentence in mysample:
    for t in sentence:
        if (t.ent_type_=='PERSON'):
            people.append([t.text,t.ent_type_])
        if (t.ent_type_=='LOC'):
            places.append([t.text,t.ent_type_])
peopleFreq=doc_freq(people)
placeFreq=doc_freq(places)
peopleOrdered = {k: v for k, v in sorted(peopleFreq.items(), reverse=True, key=lambda item: item[1])}
placeFreq = {k: v for k, v in sorted(placeFreq.items(), reverse=True, key=lambda item: item[1])}

In [6]:
people=[]
places=[]
x=0
y=0
#keep only the first 30 keys of each dict
for k in peopleFreq:
    if(x<30):
        people.append(k)
        x=x+1
        
for k in placeFreq:
    if(y<30):
        places.append(k)
        y=y+1 

In [7]:
headings=["People"]
(pd.DataFrame(people,columns=headings))

Unnamed: 0,People
0,Nurse
1,PERSON
2,Rooke
3,Mrs
4,Musgrove
5,Captain
6,Wentworth
7,Mr
8,Shepherd
9,'s


In [8]:
headings=["Places"]
(pd.DataFrame(places,columns=headings))

Unnamed: 0,Places
0,LOC
1,the
2,Plymouth
3,Cape
4,Mrs
5,Wallis
6,Indies
7,Laconia
8,Mackenzie
9,Western
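
The counting step can also be sketched at the level of whole entity spans rather than individual tokens, so that a name such as "Captain Wentworth" is counted as one string and the labels themselves never enter the counts. This is a minimal sketch, not the method used above: `top_entities` and `fake_sent` are hypothetical names, the stand-in objects only mimic the `.ents`, `.text` and `.label_` attributes that spaCy spans expose, and treating both `GPE` and `LOC` as places is an assumption.

```python
from collections import Counter
from types import SimpleNamespace


def top_entities(sentences, labels, n=30):
    """Return the n most common entity strings whose label is in `labels`."""
    counts = Counter(
        ent.text
        for sent in sentences
        for ent in sent.ents
        if ent.label_ in labels
    )
    return counts.most_common(n)


# Stand-in sentences for illustration only (hypothetical data, not spaCy output).
def fake_sent(*ents):
    return SimpleNamespace(
        ents=[SimpleNamespace(text=t, label_=l) for t, l in ents]
    )


demo = [
    fake_sent(("Captain Wentworth", "PERSON"), ("Plymouth", "GPE")),
    fake_sent(("Captain Wentworth", "PERSON"), ("the West Indies", "LOC")),
    fake_sent(("Anne", "PERSON")),
]

demo_people = top_entities(demo, {"PERSON"})
demo_places = top_entities(demo, {"GPE", "LOC"})
```

In the notebook the same function could be called as `top_entities(mysample, {"PERSON"})` and `top_entities(mysample, {"GPE", "LOC"})`, assuming each item of `mysample` is a spaCy `Span`.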


b) Making reference to specific examples from the text in `mysample`, **discuss** the different types of errors made by the named entity recogniser. \[6 marks\]

One issue that is obvious from part a is that the labels themselves, 'LOC' and 'PERSON', appear in the output lists. This is an artefact of the counting code rather than an error made by the recogniser: each entry is stored as a [text, label] pair and `doc_freq` counts both elements of the pair. It can be removed by counting only the token text, or by adding an 'if' statement that skips the label.
Another error that can be seen in the output of part a is the incorrect tagging of prepositions and articles as locations ('LOC'). In the Places table, 'of', 'to' and 'the' occur alongside location entities but are not locations themselves, so they should be left untagged, as other prepositions in `mysample` are. Interestingly, no prepositions are incorrectly tagged in the People table. However, both ''s' and 'all' are tagged as 'PERSON' even though one is a possessive suffix and the other is a quantifier, so these should also have been left untagged.
Although I have focused on the examples shown in part a, there are other prepositions and indefinite articles that have been tagged incorrectly. For example, when going through `mysample`, 'a' was seen classed as both a 'TIME' and a 'CARDINAL' entity. This shows that failing to handle prepositions, articles and suffixes correctly is an error that runs throughout the entity recogniser and is not just a few anomalies in the sample. One reason for this error may be that these words frequently occur directly before or after words that are correctly given these entity tags, so the recogniser attaches the same tag to the neighbouring prepositions and articles. It should be noted that this is quite rare, but it does still happen.
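
The check described above, looking for function words that carry an entity label, can be sketched as a small helper. This is an illustrative sketch only: `function_words_in_entities` and `tok` are hypothetical names, the set of part-of-speech tags treated as function words is an assumption, and the stand-in tokens only mimic the `.text`, `.ent_type_` and `.pos_` attributes of spaCy tokens.

```python
from types import SimpleNamespace

# Coarse POS tags treated as function words here (an assumption).
FUNCTION_POS = {"DET", "ADP", "PART", "CCONJ", "PRON"}


def function_words_in_entities(sentences, function_pos=FUNCTION_POS):
    """List (text, entity label, POS) for function words inside entity spans."""
    found = []
    for sent in sentences:
        for token in sent:
            # ent_type_ is an empty string for tokens outside any entity
            if token.ent_type_ and token.pos_ in function_pos:
                found.append((token.text, token.ent_type_, token.pos_))
    return found


# Stand-in tokens for illustration only (hypothetical data, not spaCy output).
def tok(text, ent_type_, pos_):
    return SimpleNamespace(text=text, ent_type_=ent_type_, pos_=pos_)


demo_sentence = [
    tok("the", "LOC", "DET"),
    tok("West", "LOC", "PROPN"),
    tok("Indies", "LOC", "PROPN"),
    tok("and", "", "CCONJ"),
]
demo_hits = function_words_in_entities([demo_sentence])
```

Running `function_words_in_entities(mysample)` in the notebook would give a list of candidate cases to read through by hand; an article inside a span is not always a mistake, so the list is a starting point rather than a count of errors.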

c) **Design** and **implement** a system to track the locations of characters throughout the story.  For a given PERSON named entity, your system should return a list of time-ordered LOCATIONS for that character.  Test your system using the complete text of "Persuasion" (**not** `mysample`) for at least 3 major characters.   \[13 marks\]

In [9]:
track=[]
#go through each word of each sentence and extract PERSON and LOC tokens in text order
for sentence in nlp_persuasion:
    for t in sentence:
        if (t.ent_type_=='PERSON'):
            track.append([t.text,t.ent_type_])
        if (t.ent_type_=='LOC'):
            track.append([t.text,t.ent_type_])

In [10]:
elliot=[]
anne=[]
liz=[]

def getLoc(l,name):
    currentLoc='N/A'
    temp=[]
    for entry in l:
        #Change location
        if(entry[1]=='LOC'):
            currentLoc=entry[0]   
        #Get Location
        if(entry[0]==name):
            #if theres a previous location check its not the same as current
            if(len(temp)>1):
                if(temp[len(temp)-1]!=currentLoc):
                    temp.append(currentLoc)               
            #otherwise add it anyway
            else:
                temp.append(currentLoc)
    return temp
    



In [40]:
elliot=getLoc(track,'Elliot') 
anne=getLoc(track,'Anne') 
liz=getLoc(track,'Elizabeth') 

locations=[]
large=0

#following code is to make sure that lists are the same length
large=max(len(elliot),len(anne),len(liz))

def sameLength(l,large):
    #pad the list with empty strings until it has `large` entries
    temp=l
    while(len(temp)<large):
        temp.append("")
    return temp

elliot=sameLength(elliot,large)
anne=sameLength(anne,large)
liz=sameLength(liz,large)

#output table with all locations in
(pd.DataFrame({'Elliot': elliot,'Anne' : anne, 'Elizabeth' : liz}))

Unnamed: 0,Elliot,Anne,Elizabeth
0,,,
1,,,Sound
2,Sound,Sound,Laconia
3,Laconia,Laconia,Musgrove
4,Musgrove,Musgrove,Streights
5,Streights,Streights,Dugdale
6,Dugdale,Dugdale,'s
7,'s,'s,Cape
8,Cape,Cape,Musgrove
9,Musgrove,Musgrove,Temple
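
The tracking idea can also be sketched over whole entity spans, which avoids fragments such as ''s' being recorded as a location. This is a minimal sketch under stated assumptions, not the system used above: `track_locations` and `fake_sent` are hypothetical names, treating `GPE`, `LOC` and `FAC` as places is an assumption, and the stand-in sentences only mimic the `.ents`, `.text` and `.label_` attributes of spaCy spans. A character is assigned the most recently mentioned place, and the place is recorded only when it differs from the last one recorded for that character.

```python
from types import SimpleNamespace

# Entity labels treated as places (an assumption).
PLACE_LABELS = {"GPE", "LOC", "FAC"}


def track_locations(sentences, name, place_labels=PLACE_LABELS):
    """Return the time-ordered places associated with mentions of `name`."""
    current = None   # most recently mentioned place, if any
    visited = []
    for sent in sentences:
        for ent in sent.ents:
            if ent.label_ in place_labels:
                current = ent.text
            elif ent.label_ == "PERSON" and name in ent.text:
                # record the place only if it differs from the last one recorded
                if current is not None and (not visited or visited[-1] != current):
                    visited.append(current)
    return visited


# Stand-in sentences for illustration only (hypothetical data, not spaCy output).
def fake_sent(*ents):
    return SimpleNamespace(
        ents=[SimpleNamespace(text=t, label_=l) for t, l in ents]
    )


demo = [
    fake_sent(("Anne", "PERSON")),
    fake_sent(("Kellynch Hall", "FAC"), ("Anne", "PERSON")),
    fake_sent(("Anne", "PERSON")),
    fake_sent(("Bath", "GPE"), ("Lady Russell", "PERSON")),
    fake_sent(("Anne Elliot", "PERSON")),
]

anne_demo = track_locations(demo, "Anne")
russell_demo = track_locations(demo, "Russell")
```

With the notebook's data the call would be `track_locations(nlp_persuasion, 'Anne')`. Because a place is only picked up once it has been mentioned, a location named after the character in the same sentence is not attributed until the character's next mention; this is a limitation of the heuristic.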
