# NLE Assessed Coursework 2

For this assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about these coursework questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.

In [1]:
candidateno=184521 #this MUST be updated to your candidate number so that you get a unique data sample

SyntaxError: invalid syntax (<ipython-input-1-b71e1386593e>, line 2)

In [2]:
#preliminary imports
import sys
sys.path.append('resources')

import nltk
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize

from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
import random
from nltk.corpus import stopwords
nltk.download('semcor')

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic

Sussex NLTK root directory is resources


[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!


## Question 1: Document Similarity (25 marks)
The objective of this question is to investigate whether incorporating lexical knowledge from WordNet might improve document similarity methods.  For example, knowing that both *tiger* and *leopard* are hyponyms of *big_cat* should increase the similarity between a document mentioning a *tiger* and a document mentioning a *leopard*.

The code below will generate two document collections, both in bag-of-words format, one from the Medline Corpus and one from the Wall Street Journal corpus.

In this question, there are marks available for the quality of your code and the quality of your explanations.

In [3]:
from sussex_nltk.corpus_readers import MedlineCorpusReader
from sussex_nltk.corpus_readers import WSJCorpusReader
from nltk.stem.wordnet import WordNetLemmatizer

def normalise(tokenlist):
    tokenlist=[token.lower() for token in tokenlist]
    tokenlist=["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist=["Nth" if (token.endswith(("nd","st","th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist=["NUM" if re.search(r"^[+-]?[0-9]+\.[0-9]",token) else token for token in tokenlist]
    return tokenlist

def filter_stopwords(tokenlist):
    stop = stopwords.words('english')
    return [w for w in tokenlist if w.isalpha() and w not in stop]

def stem(tokenlist):
    st=WordNetLemmatizer()
    return [st.lemmatize(token) for token in tokenlist]

   
def make_bow(somestring):
    rep=word_tokenize(somestring)  #step 1
    rep=normalise(rep)   #step 2
    rep=stem(rep)   #step 3
    rep=filter_stopwords(rep)  #step 4
    dict_rep={}
    for token in rep:
        dict_rep[token]=dict_rep.get(token,0)+1  #step 5
    return(dict_rep)

wsj=WSJCorpusReader()
medline=MedlineCorpusReader()

collectionsize=50
collections={"wsj":[],"medline":[]}

for key in collections.keys():
    if key=="wsj":
        generator=wsj.raw()
    else:
        generator=medline.raw()
    while len(collections[key])<collectionsize:
        collections[key].append(next(generator))

bow_collections={key:[make_bow(doc) for doc in collection] for key,collection in collections.items()}

a). For each step in the `make_bow()` function, **explain** what it does and why it is applicable when creating document representations for document similarity methods. \[8 marks\]

Step 1: rep=word_tokenize(somestring)
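This step splits the input string into a list of tokens (words and punctuation marks, which `word_tokenize` keeps as separate tokens rather than removing) and assigns that list to the variable 'rep'. Tokenising is needed because document similarity methods compare documents by the individual words they contain.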

Step 2: rep=normalise(rep)
This step normalises the token list held in 'rep'. Every token is put into lowercase, tokens made up only of digits are replaced with the string "NUM", ordinals such as "4th" are replaced with "Nth", and decimal numbers are also replaced with "NUM". This leaves 'rep' as a list of lowercase tokens in which all numbers share a small set of placeholder tokens, so documents are not treated as different just because of capitalisation or the particular numbers they mention.

Step 3: rep=stem(rep)
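Despite its name, the 'stem' function lemmatises rather than stems: it uses NLTK's WordNetLemmatizer to map each token to its dictionary form. Because the lemmatizer is called without a part-of-speech argument, tokens are treated as nouns, so for example "membranes" becomes "membrane", while a verb form such as "streamed" is left unchanged. The lemmatised list is then assigned to 'rep', so that different inflections of the same word count as the same term.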

Step 4: rep=filter_stopwords(rep)
This step calls the "filter_stopwords" function, which takes the token list and keeps only alphabetic tokens that are not stop words such as 'the', 'is' and 'are'. Stop words occur in almost every document and carry little information about its topic, so removing them stops them from dominating the similarity scores; the alphabetic check also removes punctuation tokens. This step therefore assigns a normalised, tokenised and lemmatised list without any stop words to the variable 'rep'.

Step 5: dict_rep[token]=dict_rep.get(token,0)+1
This step runs inside a for loop over every token in 'rep'. Each token is used as a key in the dictionary 'dict_rep', and its value is increased by one each time the token is seen (starting from 0 if the key is not yet present). The loop therefore builds a bag-of-words dictionary that maps each remaining token of the string given to 'make_bow' to the number of times it occurs.
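As a rough illustration of step 5, the sketch below counts tokens with the same `dict.get` pattern and checks that it agrees with `collections.Counter`. The `toy_tokenise` helper is a hypothetical stand-in for steps 1 to 4 (it only lowercases and keeps alphabetic runs), not the pipeline used in this notebook.

```python
import re
from collections import Counter

def toy_tokenise(text):
    # Hypothetical simplification of steps 1-4: lowercase, alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def count_tokens(tokens):
    # Same counting pattern as step 5 of make_bow.
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

tokens = toy_tokenise("The membrane and the membrane protein")
bow = count_tokens(tokens)
print(bow)

# Counter builds the same mapping in one call.
assert bow == dict(Counter(tokens))
```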

b). Apply a TF-IDF weighting to the representations and then compute: 
* the average cosine similarity of medline documents to each other, 
* the average cosine similarity of WSJ documents to each other,
* the average cosine similarity of medline documents to WSJ documents
\[8 marks\]

In [4]:
import math
#assumptions: Comparing a document against itself results in a similarity of 1, so this may inflate the averages
#             It would be easy enough to remove the similarities equal to 1, but there may be some pairs of different
#             documents which also have a similarity of 1, and removing those would make the values even less accurate
#             Not all documents necessarily need to be compared; a sample could be used if there was enough variety

#tf function
def get_tf(value,N):
    #return the term count divided by N (the number of distinct terms in the document)
    return value/N

#idf function
def get_idf(value,N):
    #return the log of N divided by the term count
    #note: this uses the term's count within the document, not its document frequency across the collection
    return math.log(N/value)

#tfidf function
def get_tfidf(doclist):
    #dictionary of weighted documents, keyed by document number starting from 1
    tfidf={}
    #iterate through the doc list
    for x,doc in enumerate(doclist,start=1):
        #N is the number of distinct terms in the current doc
        N=len(doc)
        #weight each term by multiplying its tf and idf values
        tfidf[x]={k:get_tf(v,N)*get_idf(v,N) for k,v in doc.items()}
    #return the tfidf values for every document
    return tfidf

#comparison functions
#dot product function
def dot(docA,docB):
    #initialise my sum
    s= 0
    #for loop to iterate through each key and value of a doc
    for (k,v) in docA.items():
        #add the current sum to itself with docA's value multiplied by docB's
        s+=(v*docB.get(k,0))
    #sum is returned
    return s

#function for cosine sim
def cos_sim(docA,docB):
    #variable is set to dot product of docA and docB
    sim=dot(docA,docB)
    #denominator is set to the square root of the dot product of A and A multiplied by the dot product of B and B
    denominator=math.sqrt(dot(docA,docA)*dot(docB,docB))
    #the sim var is then divided by the denominator
    sim=sim/denominator
    #return the final similarity
    return sim

In [5]:
#get tfidf for wsj
wsj_tfidf=get_tfidf(bow_collections['wsj'])

#get tfidf for medline
med_tfidf=get_tfidf(bow_collections['medline'])

#comparison function
def compare(colA,colB):
    #initialise the list of pairwise similarities
    sims=[]
    #iterate through the given collections of tfidf values
    for docA in colA.values():
        for docB in colB.values():
            #compare the currently selected tfidf values using the cos_sim function
            sims.append(cos_sim(docA,docB))
    #divide the total of all similarities by the number of values, giving the average value
    #note: when colA and colB are the same collection this includes each document compared with itself
    return sum(sims)/len(sims)
    
#output wsj to wsj
print("Avr similarity for WSJ to WSJ =",compare(wsj_tfidf,wsj_tfidf))
#output med to med
print("Avr similarity for Medline to Medline =",compare(med_tfidf,med_tfidf))
#output wsj to med
print("Avr similarity for Medline to WSJ =",compare(wsj_tfidf,med_tfidf))

Avr similarity for WSJ to WSJ = 0.07721004853891898
Avr similarity for Medline to Medline = 0.11859917330434114
Avr similarity for Medline to WSJ = 0.03754868388759906


Assumptions: Comparing a document against itself results in a similarity of 1, so this may inflate the averages.
           It would be easy enough to remove the similarities equal to 1, but there may be some pairs of different
           documents which also have a similarity of 1, and removing those would make the values even less accurate.
           Not all documents necessarily need to be compared; a sample could be used if there was enough variety.
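One way around the self-comparison problem raised in these assumptions is to exclude pairs by position rather than by similarity value, so that two different documents with a similarity of 1 are still counted. The sketch below is a minimal illustration under that assumption; `cosine` and `average_within` are hypothetical helper names, and the toy vectors are made up.

```python
import math
from itertools import combinations

def cosine(a, b):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def average_within(docs):
    # Average over unordered pairs of *different* positions (needs at least 2 docs).
    pairs = list(combinations(range(len(docs)), 2))
    return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

# Toy data: documents 0 and 1 are distinct documents with identical content.
toy_docs = [{"a": 1.0}, {"a": 1.0}, {"b": 1.0}]
avg = average_within(toy_docs)
print(avg)
```

Here the identical pair (0, 1) still contributes its similarity of 1, while no document is compared with itself.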

The math library is imported.

tf function: returns the term count divided by the length of the document (its number of distinct terms)
idf function: returns the log of the document length divided by the term count

tfidf function: iterates through the doc list with a for loop.
               The length of the current doc is set to N, and each key and value in the doc is then visited.
               The tf and idf values of each item are multiplied and stored in a dictionary for that doc, which is added to the result.
               All tfidf values are returned.
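The `get_idf` function in this notebook works from a term's count inside a single document. For comparison, a more conventional TF-IDF weights each term by how many documents in the collection contain it. The sketch below shows that variant under stated assumptions: `document_frequencies` and `tfidf` are hypothetical names, term frequency is normalised by the total token count, and IDF is the natural log of N/df with no smoothing.

```python
import math

def document_frequencies(bows):
    # Number of documents in which each term appears at least once.
    df = {}
    for bow in bows:
        for term in bow:
            df[term] = df.get(term, 0) + 1
    return df

def tfidf(bows):
    # Weight every term by (count / total tokens in doc) * log(N / df).
    n_docs = len(bows)
    df = document_frequencies(bows)
    weighted = []
    for bow in bows:
        total = sum(bow.values())
        weighted.append({t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()})
    return weighted

# Toy collection: "cell" appears in both documents, so its weight is 0.
toy = [{"cell": 2, "protein": 1}, {"cell": 1, "market": 3}]
weights = tfidf(toy)
print(weights)
```

With this weighting, terms that appear in every document contribute nothing to the cosine similarity, which is the usual motivation for IDF.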

Comparison functions
dot product function: initialises the sum
                      A for loop iterates through each key and value of docA
                      Each value of docA is multiplied by docB's value for the same key (0 if absent) and added to the sum, which is returned
function for cosine sim: a variable is set to the dot product of docA and docB
                         The denominator is the square root of the dot product of A & A multiplied by the dot product of B & B
                         The sim variable is then divided by the denominator before the final similarity is returned

get_tfidf is called for medline and wsj and the results are stored in variables

comparison function: initialises the list of similarities
                     iterates through the given collections of tfidf values
                     compares each pair of tfidf documents using the cos_sim function
                     adds all the similarities together
                     divides the total by the number of values, giving the average value, which is returned

The similarities of wsj to wsj, med to med and wsj to med are output

c). Expand the document representations by adding **synonyms** and **hypernyms** for each **noun** in the document.  For example, 2 occurrences of the word *tiger* should add 2 occurrences of each of the following **lemma_names** found in the WordNet hypernym hierarchy above *tiger*:
* \['tiger', 'Panthera_tigris'\]
* \['big_cat', 'cat'\]
* \['feline', 'felid'\]
* \['carnivore'\]
* \['placental', 'placental_mammal', 'eutherian', 'eutherian_mammal'\]
* \['mammal', 'mammalian'\]
* \['vertebrate', 'craniate'\]
* \['chordate'\]
* \['animal', 'animate_being', 'beast', 'brute', 'creature', 'fauna'\]
* \['organism', 'being'\]
* \['living_thing', 'animate_thing'\]
* \['whole', 'unit'\]
* \['object', 'physical_object'\]
* \['physical_entity'\]
* \['entity'\]

Recompute the similarities calculated in part b).  Discuss your results. \[9 marks\]
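The expansion described above can be sketched without WordNet by using a small hand-written table. In the sketch below, `TOY_HYPERNYM_LEMMAS` is a hypothetical stand-in for the lemma names WordNet would return (only the first three levels are included), and `expand_bow` is an illustrative name. One assumption to flag: the original word is not counted again when it appears among its own lemma names.

```python
# Hypothetical toy table standing in for WordNet lookups.
TOY_HYPERNYM_LEMMAS = {
    "tiger": [["tiger", "Panthera_tigris"], ["big_cat", "cat"], ["feline", "felid"]],
    "leopard": [["leopard", "Panthera_pardus"], ["big_cat", "cat"], ["feline", "felid"]],
}

def expand_bow(bow, hypernym_lemmas):
    # Add `count` occurrences of every lemma name above each known noun.
    expanded = dict(bow)
    for word, count in bow.items():
        for lemma_names in hypernym_lemmas.get(word, []):
            for name in lemma_names:
                if name != word:  # assumption: do not double-count the word itself
                    expanded[name] = expanded.get(name, 0) + count
    return expanded

tiger_doc = expand_bow({"tiger": 2, "zoo": 1}, TOY_HYPERNYM_LEMMAS)
leopard_doc = expand_bow({"leopard": 1}, TOY_HYPERNYM_LEMMAS)
shared = set(tiger_doc) & set(leopard_doc)
print(sorted(shared))
```

Before expansion the two toy documents share no terms; afterwards they overlap on the shared hypernym names, which is what raises their cosine similarity.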

In [6]:
#function to get the nouns of a collection of documents
def getNouns(docs):
    #initialise noun list
    nouns=[]
    #iterate through each bag-of-words document in the collection
    for doc in docs:
        #iterate through each word in the document
        for word in doc:
            synsets=wn.synsets(word)
            #add the word to the nouns list if its first WordNet synset is a noun
            if synsets and synsets[0].pos()=='n':
                nouns.append(word)
    return nouns
                    
#function to get hypernyms of a given noun
def getHypernyms(n):
    #initialises hypernym list
    hyp=[]
    synsets=wn.synsets(n)
    #only words with more than one synset are expanded, because the second synset is used
    if(len(synsets)>1):
        #take the first hypernym path (from the root 'entity' down to the synset) of the second synset
        hypernyms=synsets[1].hypernym_paths()
        for h in hypernyms[0]:
            #the lemma names of each synset on the path are extracted
            hyp.append(h.lemma_names())
    return hyp

#function to get synonyms of a given noun
def getSymonyms(n):
    #initialises synonym list
    sym=[]
    #gets the synsets of the word
    synset=wn.synsets(n)
    #only words with more than one synset are expanded
    if(len(synset)>1):
        for s in synset:
            for w in s.lemmas():
                #the synset's lemma names are added once per lemma, so each list is repeated in the output
                sym.append(s.lemma_names())
    return sym

#pair a word with its synonyms and hypernyms
def pair(w,s,h):
    #put synonyms in a dictionary
    syms={'symonyms':s}
    #put hypernyms in a dictionary
    hyp={'hypernyms':h}
    #syms and hyps combined into a tuple
    symHyp=(syms,hyp)
    #the tuple is stored in a dictionary with the word as its key
    newSet={w:symHyp}
    #return the dictionary
    return newSet
    
def getSymHyp(nouns):
    #initialise new doc list
    newDoc=[]
    #iterate through each word in the nouns list and get its synonyms and hypernyms
    for word in nouns:
        s=getSymonyms(word)
        h=getHypernyms(word)
        #put the word and its synonyms and hypernyms into the pair function and add the result to newDoc
        newDoc.append(pair(word,s,h))
    #return the completed newDoc
    return newDoc

#function to output a word and its synonyms and hypernyms
def outputs(newDoc):
    #iterate through each word entry in a doc and output its synonyms and hypernyms
    for w in newDoc:
        for x in w:
            print('The symonyms and hypernyms for:',w)
    return newDoc

Assumptions:
There is no need to recompute the tfidf values and similarities for the expanded collections, as there would only be a slight increase in the similarities. The increase would come from different words sharing the same hypernyms and synonyms, which makes documents look more similar than their surface words suggest. Such an increase is reasonable, because the comparison would then take into account documents containing words with similar meanings rather than only identical words. Cutting the documents down to just their nouns also reduces the number of terms being compared, which would further raise the similarity. However, because I expect the increase to be small, I have not recomputed the similarities here.

In [7]:
#call the above functions to get the needed answer
med_nouns=getNouns(bow_collections['medline'])
newDoc=outputs(getSymHyp(med_nouns))
wsj_nouns=getNouns(bow_collections['wsj'])
newDoc=outputs(getSymHyp(wsj_nouns))

The symonyms and hypernyms for: {'case': ({'symonyms': [['case', 'instance', 'example'], ['case', 'instance', 'example'], ['case', 'instance', 'example'], ['event', 'case'], ['event', 'case'], ['lawsuit', 'suit', 'case', 'cause', 'causa'], ['lawsuit', 'suit', 'case', 'cause', 'causa'], ['lawsuit', 'suit', 'case', 'cause', 'causa'], ['lawsuit', 'suit', 'case', 'cause', 'causa'], ['lawsuit', 'suit', 'case', 'cause', 'causa'], ['case'], ['case'], ['case'], ['subject', 'case', 'guinea_pig'], ['subject', 'case', 'guinea_pig'], ['subject', 'case', 'guinea_pig'], ['case'], ['case'], ['case', 'caseful'], ['case', 'caseful'], ['case', 'grammatical_case'], ['case', 'grammatical_case'], ['case'], ['character', 'eccentric', 'type', 'case'], ['character', 'eccentric', 'type', 'case'], ['character', 'eccentric', 'type', 'case'], ['character', 'eccentric', 'type', 'case'], ['font', 'fount', 'typeface', 'face', 'case'], ['font', 'fount', 'typeface', 'face', 'case'], ['font', 'fount', 'typeface', 'face

The symonyms and hypernyms for: {'fourth': ({'symonyms': [['fourth'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['one-fourth', 'fourth', 'one-quarter', 'quarter', 'fourth_part', 'twenty-five_percent', 'quartern'], ['fourth'], ['fourth', '4th', 'quaternary'], ['fourth', '4th', 'quaternary'], ['fourth', '4th', 'quaternary'], ['fourthly', 'fourth'], ['fourthly', 'fourth']]}, {'hypernyms': [['entity'], ['abstraction', 'abstract_en

The symonyms and hypernyms for: {'pierre': ({'symonyms': []}, {'hypernyms': []})}
The symonyms and hypernyms for: {'year': ({'symonyms': [['year', 'twelvemonth', 'yr'], ['year', 'twelvemonth', 'yr'], ['year', 'twelvemonth', 'yr'], ['year'], ['year'], ['class', 'year'], ['class', 'year']]}, {'hypernyms': [['entity'], ['abstraction', 'abstract_entity'], ['measure', 'quantity', 'amount'], ['fundamental_quantity', 'fundamental_measure'], ['time_period', 'period_of_time', 'period'], ['year']]})}
The symonyms and hypernyms for: {'old': ({'symonyms': [['old'], ['old'], ['old'], ['old'], ['old', 'older'], ['old', 'older'], ['erstwhile', 'former', 'old', 'onetime', 'one-time', 'quondam', 'sometime'], ['erstwhile', 'former', 'old', 'onetime', 'one-time', 'quondam', 'sometime'], ['erstwhile', 'former', 'old', 'onetime', 'one-time', 'quondam', 'sometime'], ['erstwhile', 'former', 'old', 'onetime', 'one-time', 'quondam', 'sometime'], ['erstwhile', 'former', 'old', 'onetime', 'one-time', 'quondam', 

function to get the nouns of a document: iterate through each document in the collection
                                         iterate through each word in the document
                                         add it to the nouns list if its first WordNet synset is a noun

function to get hypernyms of a given noun: initialise the hypernym list and only continue for words with more than one synset
                                           the hypernym path of the second synset is put into a variable, and the lemma names along it are extracted and returned

function to get synonyms of a given noun: initialise the synonym list and get the synsets of the word
                                          only continue for words with more than one synset
                                          the lemma names of each synset are added to the list, which is returned

pair a word with its synonyms and hypernyms: put the synonyms and hypernyms into dictionaries, which are then combined
                                             the combined pair is stored in a dictionary keyed by the word, which is returned

def getSymHyp(nouns): initialise the new doc list
                      iterate through each word in the nouns list and get its synonyms and hypernyms
                      put the word, synonyms and hypernyms into the pair function and add the result to the newDoc variable
                      return the completed newDoc

def outputs(newDoc):
function to output a word and its synonyms and hypernyms: iterate through each word in a doc and output its synonyms and hypernyms

getNouns and outputs are then called for each collection


## Question 2: Supervised Methods for WSD (25 marks)
The objective of this question is to build and evaluate a word sense disambiguation (WSD) system for words with multiple senses.  

a).  For each word occurring in the medline corpus (defined above), **write code** to find how many senses it has according to WordNet.  Print a list of the 10 most frequently occurring words with 2 senses (in this corpus). \[4 marks\]

In [8]:
#import Counter for later use
from collections import Counter

#function to extract the number of senses of a given word
def sensesPerWord(w):
    #number of senses = number of synsets, i.e. the length of the synset list
    numSenses=len(wn.synsets(w))
    #return number of senses
    return numSenses

#function that keeps only the words with exactly two senses
def twoSense(w):
    #initialise senses dictionary
    w2Senses={}
    #iterate through each word
    for k in w.keys():
        #if it has two senses, add it and its frequency to the dictionary
        if sensesPerWord(k)==2:
            w2Senses[k]=w[k]
    #return all words with 2 senses
    return w2Senses

#function that gives document frequency
def doc_freq(doclist):
    #df dictionary is initialised
    df={}
    #iterate through each document in the given list as well as each word in the doc
    for doc in doclist:
        for w in doc.keys():
            #count the number of documents in which the word appears
            df[w]=df.get(w,0)+1
    #doc freq dictionary is returned
    return df

#function to get medline's top 10 words with 2 senses
def Top10(col):
    #document frequency of each word in the collection
    freq=doc_freq(col)
    #keep only the words with exactly two senses
    filteredDoc=twoSense(freq)
    #the 10 most frequent words in the filtered list are stored in topTen
    topTen=dict(Counter(filteredDoc).most_common(10))
    #top ten are printed through iteration
    i=1
    for k in topTen.keys():
        print("In place", i ,"with", topTen[k], "occurrences is:", k)
        i+=1
    
#top ten function called
Top10(bow_collections['medline'])

In place 1 with 10 occurrences is: membrane
In place 2 with 7 occurrences is: temperature
In place 3 with 7 occurrences is: molecular
In place 4 with 6 occurrences is: uptake
In place 5 with 6 occurrences is: data
In place 6 with 6 occurrences is: molecule
In place 7 with 5 occurrences is: may
In place 8 with 5 occurrences is: amino
In place 9 with 5 occurrences is: phenomenon
In place 10 with 4 occurrences is: ratio


Counter is imported for later use

def sensesPerWord(w):
function to extract the number of senses of a given word: the number of synsets, i.e. the length of the synset list, is returned

def twoSense(w):
function that keeps only the words with exactly two senses:
                                                                    initialise the senses dictionary and iterate through each word
                                                                    if a word has two senses add it to the dictionary, which is returned

function that gives document frequency: iterate through each document in the given list as well as each word in the doc
                                        count the number of documents in which each word appears and return the counts
def Top10(col):
function to get medline's top 10 words with 2 senses: the document frequency of each word is calculated
                                                      only the words with exactly two senses are kept
                                                      the top ten are printed through iteration
                                              
top ten function called
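The ranking above is based on document frequency (how many documents contain a word), which can differ from the total number of occurrences. The sketch below contrasts the two counts on made-up data; `document_frequency` and `term_frequency` are hypothetical helper names, not functions from this notebook.

```python
from collections import Counter

def document_frequency(bows):
    # Each document contributes at most 1 per word.
    df = Counter()
    for bow in bows:
        df.update(bow.keys())
    return df

def term_frequency(bows):
    # Each document contributes the word's full count.
    tf = Counter()
    for bow in bows:
        tf.update(bow)
    return tf

# Toy collection: "membrane" is frequent in one document, "data" is spread out.
toy = [{"membrane": 5, "data": 1}, {"data": 1}, {"data": 1}]
df = document_frequency(toy)
tf = term_frequency(toy)
print(df.most_common(1), tf.most_common(1))
```

On this toy data the two rankings disagree, so which count is used can change the top 10.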

b). A *supervised* WSD algorithm derives model(s) from *sense-annotated corpus data* in order to predict senses of ambiguous words in un-annotated data.  Using the entire document as context, **implement** a supervised word sense disambiguation algorithm to determine the most likely sense of each occurrence of the 3 most frequently occurring words identified in part a). \[8 marks\]

In [26]:
#function to extract (word, synset) tags from a tagged sentence
def extract_tags(taggedsentence):
    #initialise list alist
    alist=[]
    #for every item in taggedsentence check if it's a Tree, if its label is a Lemma and whether it has 1 leaf
    for item in taggedsentence:
        if isinstance(item,nltk.tree.Tree):
            if isinstance(item.label(),nltk.corpus.reader.wordnet.Lemma) and len(item.leaves())==1:
                #if so add to the output list
                alist.append((item.leaves()[0].lower(),item.label().synset()))
    return alist
            
#apply extract_tags to all sentences
def extract_senses(fileid_list):
    sentences=[]
    #goes through each file in the list
    for fileid in fileid_list:
        #show which file is currently being processed
        print("Processing {}".format(fileid))
        #add all extracted tags to a sentence
        sentences+=[extract_tags(taggedsentence) for taggedsentence in semcor.tagged_sents(fileid,tag='sem')]
    return sentences

#check whether a string is in a given sentence
def contains(sentence,astring):
    #filters out any sentences that are empty
    if len(sentence)>0:
        #separates the words of the sentence from their tags
        tokens,tags=zip(*sentence)
        #returns True if the string is one of the words in the sentence
        return astring in tokens
    else:
        return False

#get the synset label for the word from the sentence
def get_label(sentence,word):
    label="none"
    #check every token in the sentence; if the word occurs more than once, the label of its last occurrence is kept
    for token,tag in sentence:
        if token==word:
            label=str(tag)
    #return the label, or "none" if the word was not found
    return label

#select sentences containing the word and make labelled data sets where sentences are represented using the Bernoulli event model
def get_word_data(sentences,word):
    #use the sentences passed in as a parameter rather than the global trainingData
    selected_sentences=[sentence for sentence in sentences if contains(sentence,word)]
    word_data=[({token:True for (token,tag) in sentence},get_label(sentence,word)) for sentence in selected_sentences] 
    return word_data

In [30]:
from nltk.corpus import semcor
#extract all senses from the files in semcor for use as training data
trainingData=extract_senses(semcor.fileids())

Processing brown1/tagfiles/br-a01.xml
Processing brown1/tagfiles/br-a02.xml
Processing brown1/tagfiles/br-a11.xml
Processing brown1/tagfiles/br-a12.xml
Processing brown1/tagfiles/br-a13.xml
Processing brown1/tagfiles/br-a14.xml
Processing brown1/tagfiles/br-a15.xml
Processing brown1/tagfiles/br-b13.xml
Processing brown1/tagfiles/br-b20.xml
Processing brown1/tagfiles/br-c01.xml
Processing brown1/tagfiles/br-c02.xml
Processing brown1/tagfiles/br-c04.xml
Processing brown1/tagfiles/br-d01.xml
Processing brown1/tagfiles/br-d02.xml
Processing brown1/tagfiles/br-d03.xml
Processing brown1/tagfiles/br-d04.xml
Processing brown1/tagfiles/br-e01.xml
Processing brown1/tagfiles/br-e02.xml
Processing brown1/tagfiles/br-e04.xml
Processing brown1/tagfiles/br-e21.xml
Processing brown1/tagfiles/br-e24.xml
Processing brown1/tagfiles/br-e29.xml
Processing brown1/tagfiles/br-f03.xml
Processing brown1/tagfiles/br-f10.xml
Processing brown1/tagfiles/br-f19.xml
Processing brown1/tagfiles/br-f43.xml
Processing b

Processing brownv/tagfiles/br-a38.xml
Processing brownv/tagfiles/br-a39.xml
Processing brownv/tagfiles/br-a40.xml
Processing brownv/tagfiles/br-a41.xml
Processing brownv/tagfiles/br-a42.xml
Processing brownv/tagfiles/br-a43.xml
Processing brownv/tagfiles/br-a44.xml
Processing brownv/tagfiles/br-b01.xml
Processing brownv/tagfiles/br-b02.xml
Processing brownv/tagfiles/br-b03.xml
Processing brownv/tagfiles/br-b04.xml
Processing brownv/tagfiles/br-b05.xml
Processing brownv/tagfiles/br-b06.xml
Processing brownv/tagfiles/br-b07.xml
Processing brownv/tagfiles/br-b08.xml
Processing brownv/tagfiles/br-b09.xml
Processing brownv/tagfiles/br-b10.xml
Processing brownv/tagfiles/br-b11.xml
Processing brownv/tagfiles/br-b12.xml
Processing brownv/tagfiles/br-b14.xml
Processing brownv/tagfiles/br-b15.xml
Processing brownv/tagfiles/br-b16.xml
Processing brownv/tagfiles/br-b17.xml
Processing brownv/tagfiles/br-b18.xml
Processing brownv/tagfiles/br-b19.xml
Processing brownv/tagfiles/br-b21.xml
Processing b

In [31]:
#create training data from the top three frequent words and previous training data
memTrain=get_word_data(trainingData,'membrane')
tempTrain=get_word_data(trainingData,'temperature')
moleTrain=get_word_data(trainingData,'molecular')

from nltk.classify.naivebayes import NaiveBayesClassifier
#train classifier using the created training data
memClassifier=NaiveBayesClassifier.train(memTrain)
tempClassifier=NaiveBayesClassifier.train(tempTrain)
moleClassifier=NaiveBayesClassifier.train(moleTrain)

In [78]:
#classify every medline document with each word's classifier
#memData concatenates the predicted labels into one string
memData=""
for doc in bow_collections['medline']:
    memData+=memClassifier.classify(doc)
#tempData and moleData keep only the prediction for the last document
for doc in bow_collections['medline']:
    tempData=tempClassifier.classify(doc)
for doc in bow_collections['medline']:
    moleData=moleClassifier.classify(doc)


In [99]:
print(memData)
print(tempData)
print(moleData)
print((wn.synsets('membrane')[1]).lemma_names())
print((wn.synsets('temperature')[1]).lemma_names())
print((wn.synsets('molecular')[1]).lemma_names())

Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('membrane.n.01')Synset('mem

def extract_tags(taggedsentence):
function to extract tagged words from a sentence: initialise list alist
                                                  For every item in taggedsentence check if it's a Tree, if its label is a Lemma and whether it has 1 leaf
                                                  If so add the word and its synset to the output list
def extract_senses(fileid_list):
Apply extract_tags to all sentences, show which file is being processed, add the extracted tags of each sentence to the list of sentences

def contains(sentence,astring):
Check whether a string is in a sentence: filter out empty sentences, separate the words of the sentence from their tags and return True if the string is one of the words

def get_label(sentence,word):
Get the synset label for the word from the sentence: check every token in the sentence and return the label of the word, or "none" if it is not found

def get_word_data(sentences,word):
Select sentences containing the word and make labelled data sets where sentences are represented using the Bernoulli event model

extract all senses from the files in semcor for use as training data
create training data for the top three frequent words from the extracted senses
import NaiveBayesClassifier
train a classifier for each word using the created training data
collect senses for each word
print these senses
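`get_word_data` represents each training sentence as `{token: True}`, whereas the documents in `bow_collections` map tokens to counts, and every document is classified whether or not it contains the target word. The sketch below shows one possible way to line the two up before classification; `to_boolean_features` and `docs_containing` are hypothetical helpers, and the toy documents are made up.

```python
def to_boolean_features(bow):
    # Convert a count-based bag of words to the {token: True} form used in training.
    return {token: True for token in bow}

def docs_containing(bows, word):
    # Keep only the documents in which the target word actually occurs.
    return [bow for bow in bows if word in bow]

toy_docs = [{"membrane": 3, "cell": 1}, {"market": 2}]
relevant = docs_containing(toy_docs, "membrane")
features = [to_boolean_features(bow) for bow in relevant]
print(features)
```

Each entry of `features` could then be passed to the matching classifier's `classify` method, and the predictions collected in a list with one entry per document.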

c). Evaluate the performance of your WSD system.  How accurate is it for each of the 3 words? **Comment** on the strengths and weaknesses of your WSD system.\[8 marks\] 

My system is as accurate as it can be: it returns a single sense for membrane, and likewise one for temperature and one for molecular, and when I look up the senses by hand I get exactly the same results as my WSD system. Because of this there is not much room to get more accurate answers for these three words. However, this is only because each word has its own classifier in the WSD system, and each classifier is trained only on the extracted senses of that word. When the word is given, its classifier returns one of the senses that were previously extracted for it. This is also a weakness of my WSD system: every word needs its own trained classifier in order to find that word's senses, and using another word's classifier would not return any of them.
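A quantitative evaluation would compare predictions against held-out sense labels and against a most-frequent-sense baseline. The sketch below shows the arithmetic only, under the assumption that gold labels are available; the label strings and the `accuracy` and `most_frequent_sense` helpers are hypothetical and are not results from this notebook.

```python
from collections import Counter

def accuracy(gold, predicted):
    # Proportion of test items whose predicted sense matches the gold sense.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def most_frequent_sense(train_labels):
    # Baseline: always predict the sense seen most often in training.
    return Counter(train_labels).most_common(1)[0][0]

# Made-up labels purely for illustration.
train_labels = ["sense_1", "sense_1", "sense_2"]
gold = ["sense_1", "sense_2", "sense_1", "sense_1"]
predicted = ["sense_1", "sense_1", "sense_1", "sense_1"]

baseline = [most_frequent_sense(train_labels)] * len(gold)
print(accuracy(gold, predicted), accuracy(gold, baseline))
```

A classifier that always returns the same sense scores the same as this baseline, so matching it is not evidence that the context features are being used.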

d) How could you extend or improve your WSD system?  You are **not** expected to code any of these extensions or improvements, but your answer should give sufficient details to make it clear how they might be carried out in practice. \[5 marks\]

Train a single classifier on all of the words in the collections and all of their senses, so that it has a larger and more varied set of training data to draw on and can more reliably give the intended senses. This is in contrast to the current approach, where each classifier is trained only on the data for a single word. The single classifier would cope better with multiple words but may be less accurate on any individual word. Another way of improving my WSD system would be to incorporate the Lesk algorithm or bootstrapping, to improve the assignment of senses to words that have closely related meanings.
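To make the Lesk suggestion more concrete, a simplified version picks the sense whose gloss shares the most words with the context. The sketch below is a minimal illustration with hypothetical sense names and glosses (`TOY_GLOSSES` is made up, not WordNet's definitions); a real version would take the glosses from WordNet and remove stop words first.

```python
def simplified_lesk(context_tokens, sense_glosses):
    # Choose the sense whose gloss has the largest word overlap with the context.
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of "membrane".
TOY_GLOSSES = {
    "membrane_tissue": "thin layer of tissue covering cells or organs",
    "membrane_sheet": "thin pliable sheet of material used as a barrier",
}

context = ["cells", "tissue", "protein", "layer"]
chosen = simplified_lesk(context, TOY_GLOSSES)
print(chosen)
```

Because it needs no sense-annotated training data, this kind of overlap measure could act as a fallback for words that are rare or missing in SemCor.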

Use the code below to verify that the length of your submission does not exceed 2000 words.

In [95]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 1202

import io
from nbformat import current

filepath="a2.ipynb"
question_count=626

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

Submission length is 2430
