# NLE Assessed Coursework 2

For this assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about these coursework questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.

In [3]:
candidateno=184514 #this MUST be updated to your candidate number so that you get a unique data sample


In [11]:
#preliminary imports
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/resources')
sys.path.append(r'/Users/Bayley/Documents/resources')

import operator
import re
import nltk
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize
import random
import math
import copy
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn


## Question 1: Document Similarity (25 marks)
The objective of this question is to investigate whether incorporating lexical knowledge from WordNet might improve document similarity methods.  For example, knowing that both *tiger* and *leopard* are hyponyms of *big_cat* should increase the similarity between a document mentioning a *tiger* and a document mentioning a *leopard*.

The code below will generate two document collections, both in bag-of-words format, one from the Medline Corpus and one from the Wall Street Journal corpus.

In this question, there are marks available for the quality of your code and the quality of your explanations.

In [12]:
from sussex_nltk.corpus_readers import MedlineCorpusReader
from sussex_nltk.corpus_readers import WSJCorpusReader
from nltk.stem.wordnet import WordNetLemmatizer

def normalise(tokenlist):
    tokenlist=[token.lower() for token in tokenlist]
    tokenlist=["NUM" if token.isdigit() else token for token in tokenlist]
    tokenlist=["Nth" if (token.endswith(("nd","st","th")) and token[:-2].isdigit()) else token for token in tokenlist]
    tokenlist=["NUM" if re.search("^[+-]?[0-9]+\.[0-9]",token) else token for token in tokenlist]
    return tokenlist

def filter_stopwords(tokenlist):
    stop = stopwords.words('english')
    return [w for w in tokenlist if w.isalpha() and w not in stop]

def stem(tokenlist):
    st=WordNetLemmatizer()
    return [st.lemmatize(token) for token in tokenlist]

   
def make_bow(somestring):
    rep=word_tokenize(somestring)  #step 1
    rep=normalise(rep)   #step 2
    rep=stem(rep)   #step 3
    rep=filter_stopwords(rep)  #step 4
    dict_rep={}
    for token in rep:
        dict_rep[token]=dict_rep.get(token,0)+1  #step 5
    return(dict_rep)


wsj=WSJCorpusReader()
medline=MedlineCorpusReader()

collectionsize=50
collections={"wsj":[],"medline":[]}
        
for key in collections.keys():
    if key=="wsj":
        generator=wsj.raw()
    else:
        generator=medline.raw()
    while len(collections[key])<collectionsize:
        collections[key].append(next(generator))

Sussex NLTK root directory is /Users/Bayley/Documents/resources


In [13]:
bow_collections={key:[make_bow(doc) for doc in collection] for key,collection in collections.items()}

a). For each step in the `make_bow()` function, **explain** what it does and why it is applicable when creating document representations for document similarity methods. \[8 marks\]

**Step 1** : Takes a string and splits the sentence up into individual elements, which is then added to a list. The list would be composed of either `words or punctuation`. So if the string: *"Hello there, would you like some soup?"* was inputted into the function make_bow(), the list `['Hello', 'there', ',', 'would', 'you', 'like', 'some', 'soup', '?']`. What this does is remove any white space, and only store words, and punctuation. This is used when comparing sentences as you compare words each separate word to see how similar they are. Splitting the sentences up word by word, makes it easier to compare the sentences.

**Step 2** : Takes the list of words and punctuation and *normalizes* it. This is done by first making all letters in each word lowercase. If words where not converted to the same case, comparisons of words 'Hello' and 'hello' would return false, when they are both the same word. Converting all words to same casing would stop errors like this. Numbers are replaced with the string **NUM**. This is done as numbers in a sentence are not required to check the similarity. This is because they have no meaning on their own, therefore, so we specify that there is a number but not what that number is. Therefore tokens `'1' and '4'` would both be normalized to **NUM**. In addition, ordinal indicators are normalized to **Nth** as again they have no meaning on their own. This means tokens such as, '1st' and '2nd' would both be normalized to **'Nth'**. The final line of the normalize function removes positive and negative decimals and replaces them with 'NUM'. This is the same as before as numbers have no meaning alone, so all positive and negative decimals will be normalized. I.e. `-88.8, +5.2234 and 5.6` would all be normalized to **`'NUM'`**. The regular expression is designed to work for a decimal with any number of decimal places along with a decimal without a type (positive (+) or negative (-)) in front.

**Step 3** : This step lemmatizes each word within the list. This means to replace a word with its dictionary head word. For example the words `*playing*, *'played'* and *'plays'`* would all be replaced with **play** as this is the root/head of each of the words. We know this as *plays* is the plural of play, *played* is the past tense of play and *playing* is the present tense of play. Makes comparisons more accurate as words in their initial form may not be comparable, but have similar/the same head words. Overall, makes comparisons easier and more accurate.

**Step 4** : This step is the removal of stop-words from the list. Stop-words are common words such as *'the'* and are defined by **stopwords.words('english')**. When comparing sentences, stop-words are not looked at as these do not effect similarity. These words would be used in all documents, and have no value when comparing similarity. I.e. the sentence 'The smallest man is eating cheese' and 'The house is 10 feet tall' are not similar at all, and only have two common words **The** and **is**. These are stop-words so if used when calculating similarity would calculate a much higher similarity than if not used, which would be more accurate. This method also implements **punctuation removal**. This is done using **for w in tokenlist if w.isalpha()**. As punctuation is not alphabetic character *(a-z)*, the punctuation elements are removed from the token list. This is needed as punctuation is not relevant when comparing documents. This means the list `['the', 'old', 'man', 'is', 'very', 'slow', '.']` would become `['the', 'old', 'man', 'is', 'very', 'slow']`  after punctuation removal and **`['old', 'man', 'very', 'slow']`** after removing stop-words.

**Step 5** : Takes each token stored in the list **rep** and stores it as the key within dictionary **dict_rep**. The value is  set to zero and is incremented every time the word is found within list *rep*. Used to calculate `term frequency` which is used to calculate the `tf-idf` values, so is needed when calculating document similarity.

b). Apply a TF-IDF weighting to the representations and then compute: 
* the average cosine similarity of medline documents to each other, 
* the average cosine similarity of WSJ documents to each other,
* the average cosine similarity of medline documents to WSJ documents
\[8 marks\]

In [14]:
def dot(docA,docB):
    outproduct = 0
    for A in docA:
        if docB.get(A) != None:
            outproduct += docA[A]*docB[A]
    return outproduct

def cos_sim(docA, docB):
    return (dot(docA,docB)/(math.sqrt(dot(docA,docA)*dot(docB, docB))))


def doc_freq(doc_list):
    df={}
    for doc in doc_list:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
            
    return df

def idf(doc_list):
    df = doc_freq(doc_list)
    idf = dict()
    for feat in df:
        idf[feat] = math.log(len(doc_list)/df[feat])
    return idf

def convert_to_tfidf(doc_list, idfValues):
    out = copy.deepcopy(doc_list)
    for doc in out:
        for feat in doc:
            doc[feat] = doc[feat]*idfValues[feat]
    return out

**Dot function** : The dot product formula returns a value for `how similar two documents are`. The `higher the value outputted`, the `more alike` the documents are. It is a function which has been designed to take two documents (represented as *dictionaries*) and returns their dot product. 

The dot product of two vectors, A and B, is defined as:

\begin{eqnarray*}
A.B = \sum_{\mbox{f}} \mbox{weight}(A,f)\times \mbox{weight}(B,f) 
\end{eqnarray*}


where $\mbox{weight}(X,f)$ tells us the value associated with feature $f$ in the vector representation of $X$

Below the dot product of the `first, second and third terms` in the medline collection is calculated. The results show that the first and second document are similar in some way as they have a dot result of 70, while the second and third document, and first and third document have a result of 0. This means that the third document is **not similar** to the first or second document.

In [15]:
A = bow_collections["medline"][0]
B = bow_collections["medline"][1]
C = bow_collections["medline"][2]
print("Dot product of first and second document in the 'medline' collection: ", dot(A,B))
print("Dot product of second and third document in the 'medline' collection: ", dot(B,C))
print("Dot product of first and third document in the 'medline' collection: ", dot(A,C))

Dot product of first and second document in the 'medline' collection:  70
Dot product of second and third document in the 'medline' collection:  0
Dot product of first and third document in the 'medline' collection:  0


**Cosine Similarity function** : The more similar two documents are, the smaller the angle θ between the vectors. This is shown through `cos(0) = 1` - The same and `cos(90) = 0` - Not similar. This means that the function **cos_sim(docA, docB) = cos(θ)**. The function `cos_sim` takes two documents (represented as dictionaries) and returns their cosine similarity. 
The denominator of the cosine calculation normailises the lengths of the vectors, so we get an answer between 0 and 1. This can be defined in terms of the dot products of vectors:

\begin{eqnarray*}
\mbox{sim}_{\mbox{cosine}}(A,B) = \frac{A.B}{\sqrt{A.A \times B.B}}
\end{eqnarray*}

Below the cosine similarity of the first few documents in the medline colleciton are calculated below. From the results we can conclude that `documents 1 and 2 are similar`, but `document 3 is not similar to either of the first two documents`.

In [16]:
A = bow_collections["medline"][0]
B = bow_collections["medline"][1]
C = bow_collections["medline"][2]
print("Cosine Similarity of first and second document in the 'medline' collection: ", cos_sim(A,B))
print("Cosine Similarity of second and third document in the 'medline' collection: ", cos_sim(B,C))
print("Cosine Similarity of first and third document in the 'medline' collection: ", cos_sim(A,C))

Cosine Similarity of first and second document in the 'medline' collection:  0.20437194002395656
Cosine Similarity of second and third document in the 'medline' collection:  0.0
Cosine Similarity of first and third document in the 'medline' collection:  0.0


**Document Frequency Function** : A commonly used weight is tf-idf which stands for term frequency, inverse document frequency

\begin{eqnarray*}
\mbox{tf-idf}(D_i,f) = tf(D_i,f) \times idf(D_i,f)
\end{eqnarray*}

where $tf(D_i,f)$ is simply the frequency of feature f in document $D_i$
and

\begin{eqnarray*}
idf(D_i,f) = log \frac{N}{df(f)}
\end{eqnarray*}

where $N$ is the total number of documents and $\mbox{df}(f)$ is the number of documents containing $f$.  

As the question asks for a **TF-IDF** weighting to be applied to the representations before computing the average cosine similarity the document frequency function would be needed to help with calculating the `inverse document frequency`. The function will take a list of documents `(doc_list)` (represented as dictionaries) and compute the `document frequency for each feature`, which is the `number of documents in which a term occurs`.
Below the code calculates the document frequency for each term for both collections. From the results we can see terms such as `NUM` (numbers) are very frequent in documents along with the term `year` in the `WSJ collection`. In the `Meldine collection` again terms such as `NUM, early and case` occur in more that one document in the collection.

In [125]:
Med_Doc_Freq = doc_freq(medline)
WSJ_Doc_Freq = doc_freq(wsj)
df = pd.DataFrame(WSJ_Doc_Freq, index = ["Document Frequency WSJ"])
print(df.transpose())
dfs = pd.DataFrame(Med_Doc_Freq, index = ["Document Frequency Medline"])
print(dfs.transpose())

        Document Frequency WSJ
pierre                       1
vinken                       2
NUM                         20
year                         8
old                          3
...                        ...
fee                          1
boost                        1
simple                       2
fell                         1
slid                         1

[353 rows x 1 columns]
                 Document Frequency Medline
NUM                                      37
early                                     3
case                                      3
organ                                     1
transplantation                           1
...                                     ...
radiolabeled                              1
receptor                                  1
adsorbent                                 1
posse                                     1
strikingly                                1

[1611 rows x 1 columns]


**Inverse Document Frequency Function** : If there are *N* documents in total, inverse document frequency (idf) is given by:

\begin{eqnarray*}
idf(D_i,f) = log \frac{N}{df(f)}
\end{eqnarray*}

This is important as the **idf function** is used within the **tf-idf** formula. It is used because we want terms that are *rarer* (in less documents) to have a higher/larger weight. The code below displays the idf values for each term in each collection. The results show that the funciton works as terms such as `NUM` iun both collections now have the lowest weights, since they where shown to be common in the *df function*, while in the WSJ collection the term `pierre` which was shown to have a *low document frequency*, is shown to have a *high idf weight*, since it is a **rare word**.

In [124]:
med_idf = idf(medline)
wsj_idf = idf(wsj)
df = pd.DataFrame(wsj_idf, index = ["Inverse Document Frequency WSJ"])
print(df.transpose())
dfs = pd.DataFrame(med_idf, index = ["Inverse Document Frequency Medline"])
print(dfs.transpose())

        Inverse Document Frequency WSJ
pierre                        3.912023
vinken                        3.218876
NUM                           0.916291
year                          1.832581
old                           2.813411
...                                ...
fee                           3.912023
boost                         3.912023
simple                        3.218876
fell                          3.912023
slid                          3.912023

[353 rows x 1 columns]
                 Inverse Document Frequency Medline
NUM                                        0.301105
early                                      2.813411
case                                       2.813411
organ                                      3.912023
transplantation                            3.912023
...                                             ...
radiolabeled                               3.912023
receptor                                   3.912023
adsorbent                               

**Covert to tf-idf function** : Is a function  that takes two arguments:

    * a list of documents (represented as dictionaries)
    * a dictionary containing idf values
    
and outputs a list of documents with tfidf weights (i.e., dictionaries). The function creates a copy of the collection and calculates the tf-idf values for each feature within the each document of the collection. The function uses tghe copy of the collection to override the value stored for each feature to the tf-idf value. Copy is used so that the collection of the tf-idf values can be returned and does not affect the original collection. The term frequency is calculated and multiplied by the idf value calculated using the function and the result is stored for each individudal feature in the copy of document in each collection. TF-IDF is intended to reflect how relevant a term is in a given document.

The intuition behind it is that `if a word occurs multiple times in a document`, its `relevance` should be *boosted* as it should be more **meaningful** than other words that appear fewer times (TF). At the same time, if a word occurs many times in a document but also along many other documents, maybe it is because this word is just a `frequent word`; not because it was **relevant or meaningful** (IDF). In addition, some TF-IDF values can be applied to phrases. Below the code shows the tf-idf values calculated using the fucniton for the collection `wsj`: The higher the tf-idf value is, the rarer the word is.

In [111]:
m = bow_collections["medline"]
w = bow_collections["wsj"]
m_idf = idf(m)
w_idf = idf(w)
m_tfidf = convert_to_tfidf(m,m_idf)
w_tfidf = convert_to_tfidf(w,w_idf)
w_tfidf

[{'pierre': 3.912023005428146,
  'vinken': 3.2188758248682006,
  'NUM': 1.8325814637483102,
  'year': 1.8325814637483102,
  'old': 2.8134107167600364,
  'join': 3.912023005428146,
  'board': 3.912023005428146,
  'nonexecutive': 3.2188758248682006,
  'director': 3.2188758248682006},
 {'vinken': 3.2188758248682006,
  'chairman': 3.2188758248682006,
  'elsevier': 3.912023005428146,
  'dutch': 3.912023005428146,
  'publishing': 3.912023005428146,
  'group': 3.2188758248682006},
 {'rudolph': 3.912023005428146,
  'agnew': 3.912023005428146,
  'NUM': 0.9162907318741551,
  'year': 1.8325814637483102,
  'old': 2.8134107167600364,
  'former': 3.912023005428146,
  'chairman': 3.2188758248682006,
  'consolidated': 3.912023005428146,
  'gold': 3.912023005428146,
  'field': 3.912023005428146,
  'plc': 3.912023005428146,
  'wa': 2.120263536200091,
  'named': 3.912023005428146,
  'nonexecutive': 3.2188758248682006,
  'director': 3.2188758248682006,
  'british': 3.912023005428146,
  'industrial': 3.912

**Average Cosine Similarity Function** : This function calculates the average similarity of a collection to another collection. This functions compares each document of one collection (of documents) to another collection, which will be 0 and 1, 1 meaning the collections are identical/ the same. The higher the average simialirty, the more similar the collections are. The code below defines the Average Cosine Similarity function.

In [17]:
def avgCosSim(docsA, docsB):
    sims = []
    for A in docsA:
        for B in docsB:
            sims.append(cos_sim(A,B))
    return sum(sims)/len(sims)

The code below calculates the `average cosine similarity` for:
* Medline documents to each other (asked for in question)
* Medline documents to WSJ documents (asked for in question)
* WSJ documents to Medline documents
* WSJ documents to each other (asked for in question)

The results suggest that on average, the medline documents are less similar to other medline documents, than WSJ documents are similar to eachother. This is shown with the `(W,W)` similarity being **0.0428** and the `(M,M)` being **0.426**. Having a average cosine similarity of **0.007**, WSJ and the Medline collection, `(M,W)` are not very similar.

In [18]:
medline = bow_collections["medline"]
wsj = bow_collections["wsj"]
medline_idf = idf(medline)
wsj_idf = idf(wsj)
medline_tfidf = convert_to_tfidf(medline,medline_idf)
wsj_tfidf = convert_to_tfidf(wsj,wsj_idf)
results = dict({"Medline, Medline":avgCosSim(medline_tfidf,medline_tfidf), "Medline, WSJ":avgCosSim(medline_tfidf,wsj_tfidf), "WSJ, Medline":avgCosSim(wsj_tfidf,medline_tfidf), "WSJ, WSJ":avgCosSim(wsj_tfidf,wsj_tfidf)})
df = pd.DataFrame(results, index = ["Average Cosine Simularity"])
df.transpose()

Unnamed: 0,Average Cosine Simularity
"Medline, Medline",0.042554
"Medline, WSJ",0.007138
"WSJ, Medline",0.007138
"WSJ, WSJ",0.042766


c). Expand the document representations by adding **synonyms** and **hypernyms** for each **noun** in the document.  For example, 2 occurrences of the word *tiger* should add 2 occurrences of each of the following **lemma_names** found in the WordNet hypernym hierarchy above *tiger*:
* \['tiger', 'Panthera_tigris'\]
* \['big_cat', 'cat'\]
* \['feline', 'felid'\]
* \['carnivore'\]
* \['placental', 'placental_mammal', 'eutherian', 'eutherian_mammal'\]
* \['mammal', 'mammalian'\]
* \['vertebrate', 'craniate'\]
* \['chordate'\]
* \['animal', 'animate_being', 'beast', 'brute', 'creature', 'fauna'\]
* \['organism', 'being'\]
* \['living_thing', 'animate_thing'\]
* \['whole', 'unit'\]
* \['object', 'physical_object'\]
* \['physical_entity'\]
* \['entity'\]

Recompute the similarities calculated in part b).  Discuss your results. \[9 marks\]

In [19]:

# Part 1

nouns = {x.name().split('.', 1)[0] for x in wn.all_synsets('n')}
nouns_wsj_list = []
nouns_medline_list = []
for corpus, doclist in bow_collections.items():
    if corpus == 'wsj':
        for doc in doclist:
            for word, value in doc.items():
                if word in nouns:
                    nouns_wsj_list.append(word)
    else:
        for doc in doclist:
            for word, value in doc.items():
                if word in nouns:
                    nouns_medline_list.append(word)
                    



# Part 2
hypernyms_wsj_dict = {}
hypernyms_medline_dict = {}
word_list = []
hypernyms_list = []

for word in nouns_wsj_list:
    hypernyms_list += wn.synsets(word)
    for synset in hypernyms_list:
        while synset not in hypernyms_list:
            hypernyms_list += synset.hypernyms()
    hypernyms_wsj_dict[word] = hypernyms_wsj_dict.get(word, hypernyms_list)
    hypernyms_list = []
    
for word in nouns_medline_list:
    hypernyms_list += wn.synsets(word)
    for synset in hypernyms_list:
        while synset not in hypernyms_list:
            hypernyms_list += synset.hypernyms()
    hypernyms_medline_dict[word] = hypernyms_medline_dict.get(word, hypernyms_list)
    hypernyms_list = []
    

# Part 3
medline_synonyms_list = []
wsj_synonyms_list = []
synonyms_wsj_dict = {}
synonyms_medline_dict = {}

for word in nouns_medline_list:
    for ss in wn.synsets(word):
        synonyms_medline_dict[word] = synonyms_medline_dict.get(word, ss.lemma_names())


for word in nouns_wsj_list:
    for ss in wn.synsets(word):
        synonyms_wsj_dict[word] = synonyms_wsj_dict.get(word, ss.lemma_names())


In [20]:
print(nouns_wsj_list)

['pierre', 'year', 'old', 'board', 'director', 'dutch', 'group', 'year', 'old', 'former', 'gold', 'field', 'director', 'british', 'conglomerate', 'form', 'asbestos', 'kent', 'cigarette', 'filter', 'high', 'percentage', 'cancer', 'death', 'group', 'worker', 'year', 'asbestos', 'fiber', 'lung', 'brief', 'exposure', 'causing', 'symptom', 'show', 'decade', 'unit', 'kent', 'cigarette', 'filter', 'preliminary', 'finding', 'year', 'latest', 'result', 'today', 'england', 'journal', 'medicine', 'forum', 'attention', 'problem', 'old', 'story', 'year', 'asbestos', 'property', 'asbestos', 'product', 'worker', 'research', 'smoker', 'kent', 'cigarette', 'information', 'user', 'risk', 'james', 'boston', 'cancer', 'institute', 'team', 'national', 'cancer', 'institute', 'school', 'harvard', 'university', 'boston', 'spokeswoman', 'asbestos', 'amount', 'making', 'paper', 'filter', 'type', 'billion', 'kent', 'cigarette', 'filter', 'company', 'substance', 'three', 'time', 'number', 'four', 'five', 'worker'

**Part 1:** Creates a list of all nouns, and finds all nouns in each collection. Creates two `seperate noun lists`, one for nouns found in `medline collection`, and one for all nouns found in `wsj colleciton`. Iterates through each document in each collection, chekcing if each word in each doc is a noun. If it is a noun, it is added to the list of nouns. The code above prints all nouns within the wsj corpus. As we can see some words are in here multiple times, i.e. old appears twice on the first printed line, so we know old has atleast two senses.

**Part 2:** Two loops within part 2 find all the `hypernyms` for each of the nouns in *each collection*. 
The hypernyms will be added to a dicitonary, one for each collection. The word they relate to isn used as the key
i.e `{word:[hypernym1, hypernym2...]}`. **Hyponymy and Hypernymy** are linguistic terms which capture the idea of class inclusion, using super/sub classes to link words. An example of this is the word 'dog' would be a hyponym of 'animal', since it is a subclass of animal, all dogs are animals, while not all animals are dogs. Works by looping through noun lists for each collection one at a time. Creeates a temp list called *hypernyms_list* which will store all hypernyms of a given word. Checks if hypernym is *already* in the list, if it isnt, it is added to the list, and is stored with the word in the hypernym dictionary (for the given collection) along with any other hypernyms. The cell below prints the dictionary of hypernyms in the wsj collection for the word `rise`. As we can see all `hyponyms` of the word **rise** `(the hypernym)` are stored under the key 'rise'. This means words such as 'raise' and 'arise' are hyponyms of rise.

In [116]:
print(hypernyms_wsj_dict['rise'])

[Synset('rise.n.01'), Synset('rise.n.02'), Synset('ascent.n.01'), Synset('rise.n.04'), Synset('raise.n.01'), Synset('upgrade.n.04'), Synset('lift.n.04'), Synset('emanation.n.03'), Synset('rise.n.09'), Synset('advance.n.06'), Synset('rise.v.01'), Synset('rise.v.02'), Synset('arise.v.03'), Synset('rise.v.04'), Synset('surface.v.01'), Synset('originate.v.01'), Synset('ascend.v.08'), Synset('wax.v.02'), Synset('heighten.v.01'), Synset('get_up.v.02'), Synset('rise.v.11'), Synset('rise.v.12'), Synset('rise.v.13'), Synset('rebel.v.01'), Synset('rise.v.15'), Synset('rise.v.16'), Synset('resurrect.v.03')]


**Part 3:** Consists of two for loops which find all the synonyms for each noun in each corpus and adds
them to a dictionary in the same way as the hypernyms. A synoym is a word or phrase that means `exactly or nearly the same as` another word or phrase in the same language, for example *shut* is a **synonym** of *close*. From the code below, we can see that `decrease` and `expir`y are synoymns of the word **death.**

In [117]:
print(synonyms_wsj_dict['death'])

['death', 'decease', 'expiry']


The simplest way of defining how similar two concepts are according to WordNet is to use the pathlength measure:

\begin{eqnarray*}
\mbox{sim}(\mbox{synsetA},\mbox{synsetB})=\frac{1}{1+\mbox{lengthOfPath}(\mbox{synsetA},\mbox{synsetB})}
\end{eqnarray*}

Using semantic similarity such as synoymns and hypernyms would in most cases increases the average similarity. This is because, now more words in documents can be linked as being similar. For example, in the collection wsj, before, the word `death` was **not seen as similar** to the word `decease`, even though they mean they *same thing*. By saying that these two words are similar, if both the words occur frequently throughout documents, the `average similairy of the wsj documents containing these words would increase`. This is because some documents may use a different words, which have the same meaning to talk about the same thing. An example for the hypernyms in the wsj collection could be `lift` and `rise`. Rise is a hypernym of lift, shown above, and by creating this link between the similar words, the average similarity of a wsj documents containing lift and a document containing rise would increase, as a new similarty between words has been stated.
I have not provided recalculations of average similarities due to lack of time, but I have stated what would happen on average.

## Question 2: Supervised Methods for WSD (25 marks)
The objective of this question is to build and evaluate a word sense disambiguation (WSD) system for words with multiple senses.  

a).  For each word occurring in the medline corpus (defined above), **write code** to find how many senses it has according to WordNet.  Print a list of the 10 most frequently occurring words with 2 senses (in this corpus). \[4 marks\]

In [21]:
def find_m_freq(collection):
    word_List={}
    senses={}
    synset=[]
    list_of_two_senses = []
    
    for sentence in collection:
        for word in sentence:
            word_List[word] = word_List.get(word,0) + 1
            senses[word] = len(wn.synsets(word))
    words = sorted(word_List.items(),key=operator.itemgetter(1),reverse=True)
    
    for word,a in words:
        for key,value in senses.items():
            if(word == key) and (value==2):
                list_of_two_senses.append((key))
                    
    return list_of_two_senses

In [22]:
mostFreq = find_m_freq(medline)[0:10]
mostFreq3 = find_m_freq(medline)[0:3]
mostFreq

['membrane',
 'temperature',
 'molecular',
 'uptake',
 'data',
 'molecule',
 'may',
 'amino',
 'phenomenon',
 'ratio']

**Word Senses:** Words can often be ambigious. I.e. words can have multiple different meanings/senses. For example I placed my book on the *counter* & I placed my *counter* on the board. Here the word `counter` has multiple sense. The code above prints the most frequent words (First 10) with only 2 senses. The code displays that membrane, temperature and molecular are the top 3 most frequent words with 2 word senses.


'b). A *supervised* WSD algorithm derives model(s) from *sense-annotated corpus data* in order to predict senses of ambiguous words in un-annotated data.  Using the entire document as context, **implement** a supervised word sense disambiguation algorithm to determine the most likely sense of each occurrence of the 3 most frequently occuring words identified in part a). \[8 marks\]

In [29]:
def simplifiedLesk(word,sentence):  

    context_tokens=set(word_tokenize(sentence))-{word}   
    synsets=wn.synsets(word)
    scores=[]   
    for synset in synsets:
        sense_tok=set(word_tokenize(synset.definition()))
        scores.append((synset, synset.definition(),len(sense_tok.intersection(context_tokens))))
    scores_sorted=sorted(scores,key=operator.itemgetter(1)) 
    return scores_sorted[0]


sentence = ""
test_d = mostFreq3

for corpus, doc in collections.items():
    if corpus == 'medline':
        for sent in doc:
            sentence += sent
     
sentence_list = sent_tokenize(sentence)
for sentences in sentence_list:
    words = word_tokenize(sentences)
    for word in words:
        if word in test_d:
            print("Most likely sense of", word, "is", simplifiedLesk(word, sentences), "\n")

Most likely sense of membrane is (Synset('membrane.n.02'), 'a pliable sheet of tissue that covers or lines or connects the organs or cells of animals or plants', 4) 

Most likely sense of temperature is (Synset('temperature.n.01'), 'the degree of hotness or coldness of a body or environment (corresponding to its molecular activity)', 4) 

Most likely sense of temperature is (Synset('temperature.n.01'), 'the degree of hotness or coldness of a body or environment (corresponding to its molecular activity)', 3) 

Most likely sense of molecular is (Synset('molecular.a.01'), 'relating to or produced by or consisting of molecules', 2) 

Most likely sense of molecular is (Synset('molecular.a.01'), 'relating to or produced by or consisting of molecules', 2) 

Most likely sense of molecular is (Synset('molecular.a.01'), 'relating to or produced by or consisting of molecules', 1) 

Most likely sense of molecular is (Synset('molecular.a.01'), 'relating to or produced by or consisting of molecules'

### Word Sense Disambiguation (WSD) ###

**Supervised corpus-based methods** : Predict what the words/lemmas might be in context, `based on evidence` from sense-annotated corpa. Trained on labelled examples.
**Knowledge based methods** : Predict what words/lemmas might be in context of each sense `based on knowledge` gained from dictionaries and other lexical resources.

I have chosen the Knowledge based approach *(Lesk)* as the lesk algorithm performs disambiguation by **comparing overlap between dictionary definitions**. The overlap is the number of words/lemmas in common between the definitions of senseA and senseB. The simplified Lesk algorithm `disambiguates each ambiguous word in turn`. Finds tghe sense which leads to the  *largest/highest overlap* between its dictionary definitions along with the current context. uses the top definition, to decide the most likely sense. I have chosen the **knowledge based method** as it the only method that i felt comftable programming, which also gave a *correct prediction* of the sense for the 3 most frequent words with 2 word senses. The *recursive* algorithm prints out each use of the 3 most frequent words with two senses in the entirity of the medline corpus. It then suggests the most likely sense of the word based on the overlap between dictionary definitions. The code above states that the most likely sense for the word 
* **Molecular** is : relating to or produced by or consisting of molecules
* **Membrane** is : a pliable sheet of tissue that covers or lines or connects the organs or cells of animals or plants
* **Temperature** is : the degree of hotness or coldness of a body or environment (corresponding to its molecular activity)

c). Evaluate the performance of your WSD system.  How accurate is it for each of the 3 words? **Comment** on the strengths and weaknesses of your WSD system.\[8 marks\] 

`The Knowledge based system `I have produced takes a string and breaks it into sentences. For every sentence, it finds the occurrence of any of the 3 most frequently occurring words within the given collection. The simplifiedLesk method is then run on every individual word. This finds any and all the synsets of the given word and compares them to the other words in the sentence. If any intersections are found in the list of synsets the method calculates most likely sense of the word. This is determined by the sense with the most intersections. The definition of this `most likely sense` is then displayed along with the synset & number of intersections. 

**Strengths:** A strength of the alghorithm is that it does not just compare the word with the entire document but also the sentence it occurs within. This makes the result more reliable as there may be other senses of the same word which would makes sense wihtin other sentences in the same document but not necessarily in the sentence it was found in.

**Weaknesses**: Many semantic similarity measures can only be computed between words of same part-of-speech. This means that a particular semantic similarity measure may not cpature the correct/right semantic relationship between words. An example of this could be the word `skirt`, which has more than 1 sense. 1 sense of `skirt` is a **Garment hanging from the waist, mainily worn by women** while another sense is **slang for a women**. The sentence `The old **woman** hid it in her **skirt**`. The system possibly would not capture the correct semantic realtionship between words here.. So one weakness of knowledge based systems is that if a sentence has a word twice, but each word is of a different sense. This may cause the system to be inaccurate in some cases. However that, maximising semantic similairty is an ideal approach to disambiguity. More weakness of my system include not using Case normalisation , Number normalisation, Stopword filering & Lemmitisation. By extending my system to use these methods, the accuracy of simularity would on average increases. Moreover doesnt use sense training on annotated data with a NB classifiers. This would ensure a greater accuracy when determining a sense of a word due to the classifier being trained, rather than relying on knowledge based methods. The Lesk algorithm is also computationally very expensive to compare all possible sense combinations of words in a sentence.  If each word has just 2 senses, then there are $2^n$ possible sense combinations.

**Accuracy:**
A way to check how accurate the system is would be to print every sentence with the words `molecular, temperature and membrane` and ensure that the correct most likely sense has been given to each word. The code below I have printed `all sentences with the word temperature`, which there is a number of `88` sentences. After reading through each sentences, the **majority** of the sentences relate to: *the degree of hotness or coldness of a body or environment (corresponding to its molecular activity) i.e having a fever*, while the other sentences relate to the *somatic sensation of cold or heat*. This is not the best way to evalute your results as it would take a very long time if there where more documents and more frequent word sense being derived. From the basic evaluaiton done, I can conclude my system is accurate at determining the most likely sense of each occurrence of the word temperature.

In [40]:
for corpus, doc in collections.items():
    if corpus == 'medline':
        for sent in doc:
            sentence += sent
counter = 0    
sentence_list = sent_tokenize(sentence)
for sentences in sentence_list:
    words = word_tokenize(sentences)
    for word in words:
        if word == "temperature":
            counter += 1
            print(counter,".", sentences, "\n")

1 . In all the strains tested a shift to the elevated temperature resulted in an immediate decrease in growth rate which was due to limitation in the availability of endogenous methionine. 

2 . The first biosynthetic enzyme of the methionine pathway-homoserine transsuccinylase-was studied in extracts of Aerobacter aerogenes, Salmonella typhimurium, and Escherichia coli and was shown to be temperature sensitive in all of them.A uracil-requiring auxotroph of Anacystis nidulans was isolated after treatment with N-methyl-N'-nitrosoguanidine. 

3 . One of the mutations was genetically mapped at a site in or near the acrA and mtc loci at approximately 10.5 min on the Taylor and Trotter map (1972).The barotolerant nature of protein synthesis in Pseudomonas fluorescens is shown to be associated with the 30S ribosomal subunit.In a toluene-treated mutant of Escherichia coli K-12 having a temperature-sensitive, conditionally lethal mutation in the structural gene for deoxyribonucleic acid (DNA) 

35 . One of the mutations was genetically mapped at a site in or near the acrA and mtc loci at approximately 10.5 min on the Taylor and Trotter map (1972).The barotolerant nature of protein synthesis in Pseudomonas fluorescens is shown to be associated with the 30S ribosomal subunit.In a toluene-treated mutant of Escherichia coli K-12 having a temperature-sensitive, conditionally lethal mutation in the structural gene for deoxyribonucleic acid (DNA) ligase, an extensive DNA repair synthesis occurred in X-irradiated cells at the nonpermissive temperature, 42 C. At the permissive temperature, 30 C, nearly normal semiconservative synthesis and limited repair synthesis were observed when DNA ligase was activated by the addition of nicotinamide adenine dinucleotide.Histidine affects de novo purine nucleotide synthesis and purine nucleotide pool utilization in Neurospora crassa. 

36 . One of the mutations was genetically mapped at a site in or near the acrA and mtc loci at approximately 10.

75 . One of the mutations was genetically mapped at a site in or near the acrA and mtc loci at approximately 10.5 min on the Taylor and Trotter map (1972).The barotolerant nature of protein synthesis in Pseudomonas fluorescens is shown to be associated with the 30S ribosomal subunit.In a toluene-treated mutant of Escherichia coli K-12 having a temperature-sensitive, conditionally lethal mutation in the structural gene for deoxyribonucleic acid (DNA) ligase, an extensive DNA repair synthesis occurred in X-irradiated cells at the nonpermissive temperature, 42 C. At the permissive temperature, 30 C, nearly normal semiconservative synthesis and limited repair synthesis were observed when DNA ligase was activated by the addition of nicotinamide adenine dinucleotide.Histidine affects de novo purine nucleotide synthesis and purine nucleotide pool utilization in Neurospora crassa. 

76 . One of the mutations was genetically mapped at a site in or near the acrA and mtc loci at approximately 10.

d) How could you extend or improve your WSD system?  You are **not** expected to code any of these extensions or improvements, but your answer should give sufficient details to make it clear how they might be carried out in practice. \[5 marks\]

To extend the system, I could include:
* **Case normalisation** - Including case normalisation `removes any errors caused by cases`, by casing a word incorrectly, a return of a different sense for that word would affect the accuracy of the WSD. 

* **Number normalisation** - Including number normisation increases the accuracy as creates links between a number just being invloved rather than the value of the numbers. This is because it would not make a difference to the sense of the word, the only thing that would make a difference is whether there is a number there or not.

* **Stopword filering** - Removing stopwords should be included as, like numbers, stopwords would affect the accuracy of the system as they may affect what sense a word should have. Different stopwords would mean different senses for the same word which is incorrect, reducing the accuracy.

* **Lemmitisation** - Lemmitisation would improve the accuracy as it would find the sense for the head word for each word. This would make the sense found for each word more accurate as the sense of each word may be different to the sense of the headword. 

**Algorithmic improvemnts could include** : 

* Using a WSD system over the knowledge based method used. - Since these are trained on labelled examples. An example of this would be: Training on sense annotated data with NB classifier. - Gives an overall more accurate most-likely sense due to having training on testing data, and uses conditional and prior probability, generated by the training data to calculate the sense, along with making naive assumptions i.e. making features independent.

* Using supervised corpus-based methods: Uses featue extraction, a sense inventory (a pre-specified set if class labels) for every word of interest, along with training data, which would be a corpus of examples annotated with the class labels. An example of the training corpus could be SemCor


Use the code below to verify that the length of your submission does not exceed 2000 words.

In [45]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 1202

import io
from nbformat import current

filepath="a2.ipynb"
question_count=626

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

Submission length is 1680
