## NLP Project 13: Metaphor detection in poetry

Danila Goncharenko, 2303788

Ana Ferreira, 2308587

Mikhail Bichagov, 2304806

### This project explores the detection of metaphors in poetry using natural language processing, aiming to distinguish figurative and non-figurative language. 

We treat the common use of a phrase as its literal use, and a violation of that common use as an indication of metaphorical use. The project initially imitates the approach of Neuman et al. (2013), "Metaphor Identification in Large Texts Corpora", published in the journal PLOS ONE and available online at [`Metaphor Identification in Large Texts Corpora (plos.org)`](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062343). We first consider the British National Corpus (BNC), read through NLTK's `BNCCorpusReader` (see also [`British National Corpus, XML edition (ox.ac.uk)`](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2554)). For testing, we use the annotated corpus available at https://www.eecs.uottawa.ca/~diana/resources/metaphor/type1_metaphor_annotated.txt

In this corpus, the annotation at the end of each sentence, e.g. `@1@y`, indicates whether the sentence is a metaphor (`y`) or not (`n`). For the sentence "poise is a club .", the `y` marks it as a metaphor, whereas the `1` points to the head word of the sentence, here "poise", by its position in the part-of-speech tag sequence.
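The annotation format described above can be parsed with a small helper. This is a minimal sketch, assuming each line has the shape `<sentence>@<index>@<y|n>` and that the index is a 1-based token position; the function name `parse_annotated_line` is hypothetical and not part of the project code.

```python
import re

# Assumed line shape: "<sentence>@<1-based head index>@<y|n>"
ANNOTATION = re.compile(r"^(?P<text>.*?)\s*@(?P<index>\d+)@(?P<label>[yn])\s*$")


def parse_annotated_line(line):
    """Split one annotated line into (text, head_index, head_word, is_metaphor)."""
    match = ANNOTATION.match(line)
    if match is None:
        return None  # the line carries no annotation
    text = match.group("text")
    index = int(match.group("index"))
    tokens = text.split()
    # The index is 1-based; guard against indices outside the sentence.
    head_word = tokens[index - 1] if 1 <= index <= len(tokens) else None
    return text, index, head_word, match.group("label") == "y"


example = parse_annotated_line("poise is a club .@1@y")
print(example)
```

Returning `None` for unannotated lines keeps such rows visible to the caller instead of silently assigning them a default head word.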


In [5]:
# Imports
import nltk
import pandas as pd
from nltk.corpus import stopwords, CategorizedPlaintextCorpusReader
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist, bigrams
from itertools import chain

from nltk.collocations import *

# Download the NLTK stopword list (the BNC itself is read from a local copy below)
nltk.download('stopwords')

# Stopwords, Lemmatizer, Bi-gram
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
bigram_measures = nltk.collocations.BigramAssocMeasures()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from nltk.corpus.reader.bnc import BNCCorpusReader

# CHANGE THE PATH
bnc_reader = BNCCorpusReader(root="C:/Users/Dan/Desktop/NLP Project/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')

# list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
# finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
# scored = finder.score_ngrams(bigram_measures.raw_freq)

## Task 1

First, we use the mutual information, expression (2) in the Neuman et al. (2013) paper, as a guideline for the metaphor reasoning. Other available implementations of mutual information can serve as inspiration, for example [`Collocations (nltk.org)`](https://www.nltk.org/howto/collocations.html) and [`FNLP 2011: Tutorial 8: Working with corpora: mutual information (ed.ac.uk)`](http://www.inf.ed.ac.uk/teaching/courses/fnlp/lectures/8/tutorial.html). Consider the words “woman”, “use”, “dream” and “body”. Write a program that identifies all adjectives, adverbs and verbs that occur within 2 lexical units of each word (span = 2 in the mutual information formula) in the BNC corpus and whose mutual information is greater than or equal to 3, taken as the minimum level of statistical significance. Suggest appropriate adjustments (e.g., a greater span) if no results meet the mutual information criterion.
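For reference, the score computed by the `Mutual_information` function further down in this notebook can be written as follows. The notation is transcribed from that code rather than from the paper, so the symbol names are assumptions:

$$
\mathrm{MI}(A, B) = \log_2 \frac{f(A, B) \cdot N}{f(A) \cdot f(B) \cdot \mathrm{span}}
$$

Here $f(A, B)$ is the frequency of collocate $B$ near node word $A$, $f(A)$ and $f(B)$ are the corpus frequencies of the two words, $N$ is the corpus size, and $\mathrm{span}$ is the total window size. The criterion $\mathrm{MI} \ge 3$ is therefore equivalent to requiring the ratio inside the logarithm to be at least $2^3 = 8$.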

### BNC Baby corpus, different formula

In [16]:
import nltk
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk import FreqDist, bigrams
from itertools import chain
from nltk.corpus import stopwords
from nltk.collocations import *
nltk.download('stopwords')


# Replace 'path_to_bnc_data' with the actual path to your downloaded BNC XML data
path_to_bnc_data = 'C:/Users/jklbichami/OneDrive - Valmet/Documents/School/Porgramming/NLP/bnc/Texts/news/'

# Initialize the BNC corpus reader
bnc_reader = BNCCorpusReader(root=path_to_bnc_data, fileids=r'[A-K]/\w*/\w*\.xml')


bigram_measures = nltk.collocations.BigramAssocMeasures()

# The words from the different XML files still need to be merged together.
# This is just an example, using a single file, of what the result could look like.

corpus = bnc_reader.words('A1E.xml')

words_to_include = ['woman', 'use', 'dream', 'body']

# Get the words from the file in list format
es = stopwords.words('english')

g = FreqDist(bigrams(w.lower() for w in corpus if (w.isalpha() and w.lower() not in es)))

f = FreqDist()
for k in g.keys():
    if k[1] in words_to_include or k[0] in words_to_include:
        f[k] = g.get(k)

# Here we probably should not limit words by anything, except just removing stop words
u = FreqDist(w.lower() for w in corpus if (w.isalpha() and not(w.lower() in es)))

from math import log
def mutInf(p,u1,u2,b):
    """Mutual information (base 2) of bigram p, or None if it cannot be computed."""
    try:
        return log((float(b[p])/float(b.N()))/
                   ((float(u1[p[0]])*float(u2[p[1]]))/
                    (float(u1.N())*float(u2.N()))),
                   2)
    except (ZeroDivisionError, ValueError):
        # A zero frequency makes the ratio undefined
        return None

fmi = {}
for p in f.keys():
    fmi[p]  = mutInf(p,u,u,f)

# Keep only pairs whose mutual information is at least 3, skipping undefined scores
fmi = {key: value for key, value in fmi.items() if value is not None and value >= 3}

dict(sorted(fmi.items(), key=lambda item: item[1], reverse=True))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jklbichami\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{('body', 'iosco'): 21.734829502493657,
 ('body', 'middlemen'): 20.734829502493657,
 ('regulators', 'body'): 20.734829502493657,
 ('regulatory', 'body'): 20.1498670017725}

In [23]:
import nltk
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk import FreqDist, bigrams
from itertools import chain
from nltk.corpus import stopwords
from nltk.collocations import *
nltk.download('stopwords')
import os
import glob
import zipfile


corpus = []

zip_file_path = 'C:/Users/jklbichami/OneDrive - Valmet/Documents/School/Porgramming/NLP/2554.zip'

bnc_reader = BNCCorpusReader(root=zip_file_path, fileids=r'[A-K]/\w*/\w*\.xml')

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # List the files and folders in the zip archive
    file_list = zip_ref.namelist()

    for item in file_list:
        if 'download/Texts/' in item and item.endswith('.xml'):
            # Each matching archive entry is one BNC document; collect its sentences
            corpus.append(bnc_reader.sents(item))

bigram_measures = nltk.collocations.BigramAssocMeasures()

words_to_include = ['woman', 'use', 'dream', 'body']

# Get the words from the file in list format
es = stopwords.words('english')


# Mikhail: this step took 41 minutes to run on my machine
g = FreqDist(bigrams(w.lower() for doc in corpus for sent in doc for w in sent if (w.isalpha() and w.lower() not in es)))

f = FreqDist()
for k in g.keys():
    if k[1] in words_to_include or k[0] in words_to_include:
        f[k] = g.get(k)

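## Here we probably should not limit words by anything, except just removing stop words
# `corpus` is a list of documents, each a list of sentences, so flatten twice to reach the words
u = FreqDist(w.lower() for w in chain.from_iterable(chain.from_iterable(corpus)) if (w.isalpha() and not(w.lower() in es)))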

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jklbichami\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
g

FreqDist({('per', 'cent'): 38049, ('gon', 'na'): 12436, ('last', 'year'): 10421, ('years', 'ago'): 10205, ('prime', 'minister'): 9467, ('last', 'night'): 8482, ('first', 'time'): 8379, ('would', 'like'): 8269, ('two', 'years'): 7427, ('united', 'states'): 7086, ...})

In [30]:
f = FreqDist()
for k in g.keys():
    if k[1] in words_to_include or k[0] in words_to_include:
        f[k] = g.get(k)

# Here we probably should not limit words by anything, except just removing stop words
u = FreqDist(w.lower() for w in chain.from_iterable(chain.from_iterable(corpus)) if (w.isalpha() and not(w.lower() in es)))

In [31]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Your list of strings
string_list = ["string1", "string2", "string3"]

# Create a DataFrame
df = spark.createDataFrame(string_list, StringType())

df.show()


Py4JJavaError: An error occurred while calling o38.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (86H26M3.vstage.co executor driver): java.net.SocketException: Connection reset


## BNC full Corpus

### Expression (2) in Neuman et al.’2013 paper

In [43]:
# See expression (2) in Neuman et al.’2013 paper
import math
def Mutual_information(bigram_item, filteredCorpus, Corpus, span = 2):
    '''
    Calculates mutual information between node and collocate words.
    bigram_item = The (node, collocate) bigram considered in the equation.
    filteredCorpus = Bigram frequencies restricted to the considered words.
    Corpus = Unigram frequencies of the whole corpus.
    span = span of words
    '''

    # filteredCorpus[bigram_item] = frequency of collocate near the node word (e.g., color near purple)
    # Corpus.N() = size of the corpus (for instance 96,263,399, BNC)
    # Corpus[bigram_item[0]] = frequency of node word w1 (e.g., purple): 1262
    # Corpus[bigram_item[1]] = frequency of collocate word w2 (e.g., color): 115
    # span = span of words (e.g., 1 to left and 1 to right of the node word: 2)

    # sizeCorpus = 96 132 981 tokens in BNC

    try:
        # Use the function argument, not the global loop variable `p`
        node, collocate = bigram_item
        return math.log10((filteredCorpus[bigram_item] * Corpus.N()) / (Corpus[node] * Corpus[collocate] * span))/math.log10(2)
    except (ZeroDivisionError, ValueError):
        # A zero frequency makes the score undefined
        return None

### Frequencies calculation

In [46]:
# corpus = bnc_reader.words('A1E.xml')
corpus = bnc_reader.words('A/A0/A01.xml')

# Consider the words “woman”, “use”, “dream”, “body”.
words_to_include = ['woman', 'use', 'dream', 'body']

# Calculate the frequency of bigrams of all words in the corpus
bigram_frequency = FreqDist(bigrams(w.lower() for w in corpus if (w.isalpha() and w.lower() not in stop_words)))

# Calculate the frequency of bigrams of the words to include that are in the corpus
fr_words_to_include = FreqDist()
for key_word in bigram_frequency.keys():
    # If first or second word in bigram, e.g. ('woman', 'receiving'), has a word_to_include
    # Pass the frequency of this bigram from bigram_frequency to fr_words_to_include
    if key_word[1] in words_to_include or key_word[0] in words_to_include:
        fr_words_to_include[key_word] = bigram_frequency.get(key_word)

# Frequency of all words in the corpus, except stop words
unigram_frequency = FreqDist(w.lower() for w in corpus if (w.isalpha() and not(w.lower() in stop_words)))

### Calculate Mutual information frequency

In [45]:
# Identify all adjectives, adverbs and verbs that occur within 2 lexical units (span = 2 in the formula of mutual information) in BNC corpus
# Mutual information frequency
fmi = {}
for p in fr_words_to_include.keys():
    fmi[p]  = Mutual_information(p, fr_words_to_include, unigram_frequency)

# Keep pairs whose mutual information is greater than or equal to 3, considered the minimum statistical significance
fmi = {key: value for key, value in fmi.items() if value is not None and value >= 3}

dict(sorted(fmi.items(), key=lambda item: item[1], reverse=True))

{('woman', 'receiving'): 9.925183519434208,
 ('illness', 'woman'): 8.925183519434208,
 ('multiplies', 'body'): 8.603255424546846,
 ('body', 'seriously'): 8.603255424546846,
 ('damage', 'body'): 8.603255424546846,
 ('body', 'kill'): 8.603255424546846,
 ('enters', 'body'): 7.603255424546846,
 ('body', 'enough'): 7.603255424546846,
 ('weakened', 'body'): 7.603255424546846,
 ('words', 'use'): 6.925183519434207,
 ('cells', 'body'): 6.6032554245468456,
 ('body', 'much'): 6.6032554245468456,
 ('use', 'condom'): 6.340221018713052,
 ('use', 'legacy'): 6.340221018713052,
 ('always', 'use'): 5.340221018713052,
 ('value', 'use'): 5.340221018713052,
 ('body', 'get'): 5.143823805909549,
 ('leaflet', 'use'): 4.925183519434208,
 ('could', 'use'): 4.2247438012931156,
 ('use', 'drugs'): 3.7552585179918956,
 ('use', 'deed'): 3.603255424546846,
 ('income', 'use'): 3.401621563377195,
 ('payment', 'use'): 3.0672025243066363,
 ('people', 'use'): 3.0182929238256895}

### Suggest appropriate adjustments (e.g., greater span), if no results

In [None]:
# Suggest appropriate adjustments (e.g., greater span), if no results
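One possible adjustment is to count co-occurrences in a wider window rather than only adjacent bigrams. The sketch below is a minimal illustration on a toy token list, not on the BNC. The helper names (`window_cooccurrences`, `mutual_information`, `collocates`) are hypothetical, and `span` is taken as the total window size (left plus right), following the comment in `Mutual_information`.

```python
import math
from collections import Counter


def window_cooccurrences(tokens, node, half_window):
    """Count words appearing within `half_window` tokens on either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        counts.update(left + right)
    return counts


def mutual_information(pair_count, node_count, collocate_count, corpus_size, span):
    """MI as in the notebook: log2(f(A,B) * N / (f(A) * f(B) * span))."""
    return math.log2(pair_count * corpus_size / (node_count * collocate_count * span))


def collocates(tokens, node, half_window, threshold=3.0):
    """Return {collocate: MI} for collocates of `node` scoring at least `threshold`."""
    unigrams = Counter(tokens)
    span = 2 * half_window  # total window size, left plus right
    scores = {}
    for word, pair_count in window_cooccurrences(tokens, node, half_window).items():
        score = mutual_information(pair_count, unigrams[node], unigrams[word], len(tokens), span)
        if score >= threshold:
            scores[word] = score
    return scores


# Toy token list (hypothetical data, not taken from the BNC)
tokens = "the frail body fought the fever while the strong body rested".split()
for half_window in (1, 2, 3):
    print(half_window, sorted(collocates(tokens, "body", half_window, threshold=1.0)))
```

A wider window finds more candidate collocates, but the larger `span` in the denominator lowers every score, so widening the window and keeping the threshold at 3 pull in opposite directions.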

## Task 2

We would like to test this process on the metaphor-annotated dataset introduced above. For this purpose, consider the following approach. Write a program that reads each sentence of the annotated corpus together with its head word (given in the annotation), then calculates the mutual distance between the head word and each of the first two words occurring on either the left-hand side or the right-hand side of the head word. If the average of the mutual distances from the head word to each of the two words situated within two lexical units is greater than 3, the sentence is considered not to be a metaphor; otherwise, it is a metaphor. Test this reasoning, report the result for each annotated sentence and save it in your database. Given the ground truth of the annotated dataset, calculate the corresponding accuracy, and comment on the efficiency of the proposed approach.
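The decision rule above can be sketched as follows. This is one reading of the task, under two assumptions: the "mutual distance" is the mutual information score from Task 1, and up to two neighbours are taken on each side of the head word. The names `neighbours`, `classify` and the toy score table are hypothetical.

```python
def neighbours(tokens, head_index, reach=2):
    """Return up to `reach` tokens on each side of the 1-based `head_index`."""
    i = head_index - 1
    return tokens[max(0, i - reach):i] + tokens[i + 1:i + 1 + reach]


def classify(tokens, head_index, mi_lookup, threshold=3.0):
    """Label a sentence 'n' (literal) when the mean score exceeds `threshold`, else 'y'."""
    head = tokens[head_index - 1]
    scores = [mi_lookup(head, word) for word in neighbours(tokens, head_index)]
    scores = [s for s in scores if s is not None]  # skip pairs unseen in the corpus
    if not scores:
        return "y"  # assumption: no evidence of common use counts as metaphor
    mean = sum(scores) / len(scores)
    return "n" if mean > threshold else "y"


# Hypothetical score table standing in for values computed from the BNC
toy_mi = {("sky", "the"): 4.0, ("sky", "is"): 3.5, ("sky", "cloud"): 5.0}


def lookup(head, word):
    return toy_mi.get((head, word))


print(classify("the sky is cloud on cloud".split(), 2, lookup))
print(classify("poise is a club .".split(), 1, lookup))
```

Passing the scoring function in as `mi_lookup` keeps the rule independent of how the corpus statistics are stored, so the same function can be run against the BNC frequencies once they are available.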

In [40]:
url = "https://www.eecs.uottawa.ca/~diana/resources/metaphor/type1_metaphor_annotated.txt"

# Read the data into a DataFrame
df = pd.read_csv(url, delimiter='\t', header=None, names=['Text'])

# Extract the symbol ('y' or 'n') and number into separate columns
df['Symbol'] = df['Text'].str.extract(r'@(\d+)@([yn])')[1]
df['Number'] = df['Text'].str.extract(r'@(\d+)@([yn])')[0]

# Delete it from the original text
df['Text'] = df['Text'].str.replace(r'(@\d+@y|@\d+@n)', '', regex=True)

# Replace NaN values in 'Number' with 0, and then convert to integer
df['Number'] = df['Number'].fillna(0).astype(int)

# Extract the word with the specified index and save it in the 'Head-word' column
df['Head-word'] = df.apply(lambda row: row['Text'].split()[int(row['Number']) - 1], axis=1)

# Display the resulting DataFrame
df

| | Text | Symbol | Number | Head-word |
|---|---|---|---|---|
| 0 | poise is a club . | y | 1 | poise |
| 1 | destroying alexandria . sunlight is silence | y | 4 | sunlight |
| 2 | feet are no anchor . gravity sucks at the mind | y | 1 | feet |
| 3 | on the day 's horizon is a gesture of earth | y | 5 | horizon |
| 4 | he said good-by as if good-by is a number . | y | 6 | good-by |
| ... | ... | ... | ... | ... |
| 614 | as the season of cold is the season of darkness | n | 5 | cold |
| 615 | else all beasts were tigers , | y | 3 | beasts |
| 616 | without which earth is sand | n | 3 | earth |
| 617 | the sky is cloud on cloud | n | 2 | sky |


In [3]:
# Input each sentence of the annotated corpus

# Read the head word (given in the annotation)

# Calculate the mutual distance between the head-word and each of the first two words
# either on the left hand side part or right hand side part of the head-word.

# If average is greater than 3, sentence is not a metaphor

# Report the result for each annotated sentence 

# Save it in your database

# Calculate the corresponding accuracy

# Comment on the efficiency of the proposed approach.
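Once a predicted label exists for every sentence, the accuracy step could look like the sketch below. The helper names `accuracy` and `confusion` are hypothetical, and the label lists are made-up placeholders, not results from the dataset.

```python
def accuracy(predicted, gold):
    """Share of sentences whose predicted label equals the annotated label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def confusion(predicted, gold, positive="y"):
    """Count true/false positives and negatives, treating `positive` as 'metaphor'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, g in zip(predicted, gold):
        if p == positive:
            counts["tp" if g == positive else "fp"] += 1
        else:
            counts["fn" if g == positive else "tn"] += 1
    return counts


# Hypothetical labels for five sentences (not real results)
gold = ["y", "y", "n", "n", "y"]
predicted = ["y", "n", "n", "y", "y"]
print(accuracy(predicted, gold))
print(confusion(predicted, gold))
```

Reporting the confusion counts next to the accuracy helps when the classes are unbalanced, since a rule that labels almost everything as metaphor can still reach a high accuracy.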


## Task 3

We consider the adjective-noun type of metaphor (referred to as Metaphor type III). A metaphor is assumed to occur when the categories of the noun and the adjective are such that one is concrete and the other is abstract. The WordStat noun categorization based on WordNet, which classifies 69,817 nouns into 25 categories, of which 13 are concrete (e.g., artifact), provides a database for such a categorization. It is freely available at [`Wordnet based categorization dictionary - Provalis Research`](https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/wordnet-based-categorization-dictionary/). Write a program that retrieves the category of the noun and of the adjective/adverb in a sentence according to WordStat.

In [4]:
# adjective-noun type of metaphor is a Metaphor type III.

# If categories of noun and adjective: one is concrete and the other one is abstract,
# Then it is a Metaphor.

# WordStat noun categorization based on WordNet provides a database for such a categorization.

# Retrieve the category of noun and adjective / adverb in a sentence according to WordStat.
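The lookup can also be built in a single pass over the category file. The sketch below assumes the layout that the parsing code in this notebook implies: untabbed section headers such as `NOUNS`, one-tab category lines such as `NOUN.ARTIFACT`, and two-tab word lines ending in a count such as `(1)`. The inline `SAMPLE` text, the helper names and the one-element set of concrete categories are hypothetical stand-ins, not the real WordStat data.

```python
import re

# Tiny hand-written stand-in for the category file (hypothetical content)
SAMPLE = (
    "NOUNS\n"
    "\tNOUN.ARTIFACT\n"
    "\t\tHAMMER (1)\n"
    "\t\tBRIDGE (1)\n"
    "\tNOUN.FEELING\n"
    "\t\tJOY (1)\n"
    "ADJECTIVES\n"
    "\tADJ.ALL\n"
    "\t\tSWEET (1)\n"
)


def parse_categories(lines):
    """Map each lower-cased word to a list of (part_of_speech, category) pairs."""
    entries = {}
    pos = category = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("\t\t"):
            # Word line: drop the trailing "(n)" count and normalise case
            word = re.sub(r"\s*\(\d+\)\s*$", "", line.strip()).lower()
            entries.setdefault(word, []).append((pos, category))
        elif line.startswith("\t"):
            # Category line such as "NOUN.ARTIFACT"
            pos, _, category = line.strip().partition(".")
        # Untabbed section headers (e.g. "NOUNS") carry no extra information
    return entries


def is_concrete(word, entries, concrete_categories):
    """True when any noun entry of `word` falls in one of `concrete_categories`."""
    return any(
        pos == "NOUN" and category in concrete_categories
        for pos, category in entries.get(word.lower(), [])
    )


entries = parse_categories(SAMPLE.splitlines(keepends=True))
print(entries["hammer"], is_concrete("hammer", entries, {"ARTIFACT"}))
print(entries["joy"], is_concrete("joy", entries, {"ARTIFACT"}))
```

Keeping a list of pairs per word allows a word to appear under several categories or parts of speech without one entry overwriting another.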


In [83]:
import json
import pandas as pd
import nltk
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk import FreqDist, bigrams
import re

file_p = r'C:\Users\jklbichami\OneDrive - Valmet\Documents\School\Porgramming\NLP\WordNet2\WordNet Words & Phrases.CAT'

with open(file_p, 'r') as file:
    cat_data = file.readlines()

pattern = r'\.(.+)'

topics = [a.strip('\n') for a in cat_data if '\t' not in a ]
topics_indices = [cat_data.index(a + '\n') for a in topics]

noun_categories = [re.search(pattern, a).group(1).strip('\n') for a in cat_data if '\t\t' not in a and a.startswith('\tNOUN.')]
noun_categories_idices = [cat_data.index('\tNOUN.' + a + '\n') for a in noun_categories]

verb_categories = [re.search(pattern, a).group(1).strip('\n') for a in cat_data if '\t\t' not in a and a.startswith('\tVERB.')]
verb_categories_idices = [cat_data.index('\tVERB.' + a + '\n') for a in verb_categories]

adj_categories = [re.search(pattern, a).group(1).strip('\n') for a in cat_data if '\t\t' not in a and a.startswith('\tADJ.')]
adj_categories_idices = [cat_data.index('\tADJ.' + a + '\n') for a in adj_categories]

cat_df = pd.DataFrame(columns=['Type', 'Category', 'Word'])


## Mikhail: this is not finished, but the idea is simple
## 1. Get the indices of all topics
## 2. Get the indices of their respective categories
## 3. Iterate through each topic
## 4. Iterate through its categories
## 5. Assign all results to the DataFrame based on the indices
## Here is an example for nouns
for indx, num in enumerate(noun_categories_idices):
    temp_dict = {}
    if indx == 0:
        temp_dict['Category'] = [noun_categories[indx]]*abs(cat_data.index('NOUNS\n')+1-noun_categories_idices[indx+1]+1)
        temp_dict['Type'] = ['Noun']*abs(cat_data.index('NOUNS\n')+1-noun_categories_idices[indx+1]+1)
        temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[cat_data.index('NOUNS\n')+2:noun_categories_idices[indx+1]]]
        cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
    else:
        try:
            temp_dict['Category'] = [noun_categories[indx]]*abs(noun_categories_idices[indx]+1-noun_categories_idices[indx+1])
            temp_dict['Type'] = ['Noun']*abs(noun_categories_idices[indx]+1-noun_categories_idices[indx+1])
            temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[noun_categories_idices[indx]+1:noun_categories_idices[indx+1]]]
            cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
        except IndexError:
            pass

for indx, num in enumerate(verb_categories):
    temp_dict = {}
    if indx == 0:
        temp_dict['Category'] = [verb_categories[indx]]*abs(cat_data.index('VERBS\n')+1-verb_categories_idices[indx+1]+1)
        temp_dict['Type'] = ['Verb']*abs(cat_data.index('VERBS\n')+1-verb_categories_idices[indx+1]+1)
        temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[cat_data.index('VERBS\n')+2:verb_categories_idices[indx+1]]]
        cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
    else:
        try:
            temp_dict['Category'] = [verb_categories[indx]]*abs(verb_categories_idices[indx]+1-verb_categories_idices[indx+1])
            temp_dict['Type'] = ['Verb']*abs(verb_categories_idices[indx]+1-verb_categories_idices[indx+1])
            temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[verb_categories_idices[indx]+1:verb_categories_idices[indx+1]]]
            cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
        except IndexError:
            pass

for indx, num in enumerate(adj_categories):
    temp_dict = {}
    if indx == 0:
        temp_dict['Category'] = [adj_categories[indx]]*abs(cat_data.index('ADJECTIVES\n')+1-adj_categories_idices[indx+1]+1)
        temp_dict['Type'] = ['Adjective']*abs(cat_data.index('ADJECTIVES\n')+1-adj_categories_idices[indx+1]+1)
        temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[cat_data.index('ADJECTIVES\n')+2:adj_categories_idices[indx+1]]]
        cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
    else:
        try:
            temp_dict['Category'] = [adj_categories[indx]]*abs(adj_categories_idices[indx]+1-adj_categories_idices[indx+1])
            temp_dict['Type'] = ['Adjective']*abs(adj_categories_idices[indx]+1-adj_categories_idices[indx+1])
            temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[adj_categories_idices[indx]+1:adj_categories_idices[indx+1]]]
            cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)
        except IndexError:
            pass

# Adverbs: the entries between the ADVERBS header and the NOUNS header
temp_dict = {}
temp_dict['Category'] = [adj_categories[indx]]*abs(cat_data.index('ADVERBS\n')+2-cat_data.index('NOUNS\n'))
temp_dict['Type'] = ['Adverb']*abs(cat_data.index('ADVERBS\n')+2-cat_data.index('NOUNS\n'))
temp_dict['Word'] = [a[2:].strip('(1)\n)').lower() for a in  cat_data[cat_data.index('ADVERBS\n')+2:cat_data.index('NOUNS\n')]]
cat_df = pd.concat([cat_df, pd.DataFrame(temp_dict)], ignore_index=True)



In [84]:
cat_df[cat_df['Category'] == 'ALL']

| | Type | Category | Word |
|---|---|---|---|
| 150326 | Adverb | ALL | a |
| 150327 | Adverb | ALL | a.d. |
| 150328 | Adverb | ALL | a.k.a. |
| 150329 | Adverb | ALL | a.m. |
| 150330 | Adverb | ALL | a_bit |
| ... | ... | ... | ... |
| 154988 | Adverb | ALL | youthfully |
| 154989 | Adverb | ALL | zealously |
| 154990 | Adverb | ALL | zestfully |
| 154991 | Adverb | ALL | zestily |


## Task 4

Now we would like to imitate the procedure described in Neuman et al.'s paper for type III metaphors. Write a program that identifies the occurrence of a Noun-Adjective/Adverb part-of-speech pattern in a given sentence. Then use the WordNet lexical database to find the number of senses of each adjective. If every adjective has a single sense, return "no metaphor". If the noun has no entry in WordNet, return UNKNOWN. Otherwise (the adjective has more than one sense and the noun has an entry in WordNet), identify the set S of nouns in the BNC corpus that collocate with the given noun of the sentence (that is, the nouns whose mutual information value is greater than or equal to 3). Next, for each noun in S, use the WordStat categorization to identify those that belong to a concrete class. Let S1 be the subset of S that contains these “concrete”-category nouns. If S1 is large, restrict it to the three elements with the highest mutual information values. Finally, to decide whether the sentence containing adjective A and noun N is a metaphor, test the compatibility of each element of S1 with N. If no element of S1 is compatible with N, the sentence is considered a metaphor; otherwise, it is not. To evaluate this compatibility, you can use the Wu and Palmer WordNet semantic similarity already implemented in NLTK: assume that if the Wu and Palmer similarity of at least one of the nouns in S1 with N is greater than a threshold of 0.4, then compatibility between S1 and N is granted. (Note that this is only a very rough approximation.) Write code that implements this reasoning and test it on two simple examples of your choice. Repeat the test for other threshold values (e.g., 0.3, 0.5, 0.6).
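The decision steps above can be sketched with the similarity function passed in as a parameter, so the logic can be tried without WordNet installed. The names `judge`, `wordnet_wup` and `stub_similarity` are hypothetical, and the similarity values in the stub are invented for illustration. `wordnet_wup` uses NLTK's `wn.synsets` and `Synset.wup_similarity`, and is only needed when real WordNet scores are wanted.

```python
def wordnet_wup(noun_a, noun_b):
    """Best Wu-Palmer similarity over all noun senses; needs NLTK's WordNet data."""
    from nltk.corpus import wordnet as wn  # imported lazily so the rest runs without NLTK
    best = 0.0
    for sense_a in wn.synsets(noun_a, pos=wn.NOUN):
        for sense_b in wn.synsets(noun_b, pos=wn.NOUN):
            score = sense_a.wup_similarity(sense_b)
            if score is not None and score > best:
                best = score
    return best


def judge(adjective_senses, noun_known, noun, concrete_collocates,
          similarity, threshold=0.4, top_k=3):
    """Apply the Task 4 rule; `concrete_collocates` is S1 as (noun, MI) pairs."""
    if adjective_senses <= 1:
        return "NO METAPHOR"
    if not noun_known:
        return "UNKNOWN"
    # Keep the top_k collocates with the highest mutual information
    top = sorted(concrete_collocates, key=lambda item: item[1], reverse=True)[:top_k]
    compatible = any(similarity(candidate, noun) > threshold for candidate, _ in top)
    return "NO METAPHOR" if compatible else "METAPHOR"


# Invented similarity values, used only to exercise the rule
toy_similarity = {("cake", "thought"): 0.2, ("fruit", "thought"): 0.25, ("cake", "apple"): 0.6}


def stub_similarity(a, b):
    return toy_similarity.get((a, b), 0.0)


s1 = [("cake", 5.0), ("fruit", 4.0)]
for threshold in (0.3, 0.4, 0.5, 0.6):
    print(threshold,
          judge(3, True, "thought", s1, stub_similarity, threshold),
          judge(3, True, "apple", s1, stub_similarity, threshold))
```

To run the rule against WordNet, pass `wordnet_wup` as `similarity` and obtain the sense count with `len(wn.synsets(adjective, pos=wn.ADJ))`.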

In [5]:
from nltk.corpus import wordnet as wn

# Imitate the procedure described in Neuman’s paper for type III metaphors

# Identify the occurrence of Noun-Adjective/Adverb part-of-speech in a given sentence.

# Use WordNet lexical database to find out the number of senses of each adjective.

# If every adjective has one single sense, then no metaphor

# If the Noun has no entry in wordnet, then return UNKNOWN.

# Otherwise: adjective has more than one sense and noun has an entry in WordNet

# Identify the set S of nouns in the BNC corpus that collocate with the given Noun of the given sentence
# set of nouns: mutual information value is >= 3

# Use the WordStat categorization for each noun in S to identify those that belong to the concrete class.

# Let S1 be the subset of S that contains these “concrete”-category nouns.
# If S1 is large, restrict it to the three elements with the highest mutual information values.

# Test the compatibility of each element of S1 with N
# to check whether the sentence containing adjective A and noun N is a metaphor.

# If no element of S1 is compatible with N, the sentence is considered a metaphor.

# Evaluate compatibility using the Wu and Palmer WordNet semantic similarity from NLTK.

# If the similarity of at least one of the nouns in S1 with N is > 0.4, then S1 and N are compatible.

# Test it on two simple examples of your choice. 

# Test this process for other values of threshold values (e.g., 0.3, 0.5, 0.6) 
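
The decision steps outlined in the comments above could be sketched as follows. This is a minimal sketch under stated assumptions, not a finished implementation: `classify_pair`, `wordnet_wup` and the return labels are hypothetical names, the similarity function is passed in as a parameter so that the logic can be tried without any corpus, and the toy similarity values are invented for illustration (they are not WordNet outputs). `wordnet_wup` shows how NLTK's `wup_similarity` could be plugged in once the WordNet data is available.

```python
from typing import Callable, Dict


def wordnet_wup(word_a: str, word_b: str) -> float:
    """Highest Wu-Palmer similarity over all noun synset pairs (needs NLTK's WordNet data)."""
    from nltk.corpus import wordnet as wn  # imported lazily so the rest runs without NLTK

    scores = [
        s1.wup_similarity(s2) or 0.0
        for s1 in wn.synsets(word_a, pos=wn.NOUN)
        for s2 in wn.synsets(word_b, pos=wn.NOUN)
    ]
    return max(scores, default=0.0)


def classify_pair(
    adjective_senses: int,
    noun_in_wordnet: bool,
    noun: str,
    concrete_collocates: Dict[str, float],  # S1 as {noun: mutual information}
    similarity: Callable[[str, str], float],
    threshold: float = 0.4,
) -> str:
    """Apply the task's rules in order and return a label."""
    if adjective_senses <= 1:
        return "NOT_METAPHOR"
    if not noun_in_wordnet:
        return "UNKNOWN"
    # Restrict S1 to the three collocates with the highest mutual information.
    top3 = sorted(concrete_collocates, key=concrete_collocates.get, reverse=True)[:3]
    # Compatible if at least one collocate is similar enough to the noun.
    # Note: an empty S1 has no compatible element and is therefore labelled METAPHOR.
    compatible = any(similarity(noun, c) > threshold for c in top3)
    return "NOT_METAPHOR" if compatible else "METAPHOR"


# Toy similarity table (hypothetical values, for illustration only).
toy_scores = {("dream", "bed"): 0.45, ("dream", "pillow"): 0.35}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


s1 = {"bed": 5.2, "pillow": 4.1}
for t in (0.3, 0.4, 0.5, 0.6):
    print(t, classify_pair(3, True, "dream", s1, toy_similarity, threshold=t))
```

Passing `similarity=wordnet_wup` instead of `toy_similarity` would run the same rules against WordNet. The threshold loop shows how the label can flip as the threshold rises.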


## Task 5

Test the above reasoning on the first half of the dataset https://www.eecs.uottawa.ca/~diana/resources/metaphor/type1_metaphor_annotated.txt, restricted to the sentences in which an adjective-noun relationship occurs. Motivate your reasoning and answers. Estimate the accuracy accordingly, and report individual results in your database.

In [7]:
# Test the above reasoning on the first half of the dataset

# Motivate your reasoning and answers.

# Estimate the accuracy accordingly, and report individual results in your database.
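
A possible starting point for the evaluation is sketched below. It assumes that each annotated line ends with a marker of the form `@<head index>@<y|n>`, as in the `@1@y` example from the introduction; the exact layout of the file should be checked before relying on this. The helper names are hypothetical, the sample sentence is invented rather than taken from the dataset, and skipping UNKNOWN predictions when computing accuracy is a design assumption, not something the task prescribes.

```python
import re
from typing import List, Optional, Sequence, Tuple

# Assumed trailing annotation: "@<head word index>@<y or n>"
ANNOTATION = re.compile(r"@(\d+)@([yn])\s*$")


def parse_line(line: str) -> Optional[Tuple[str, int, bool]]:
    """Split an annotated line into (sentence, head index, is_metaphor)."""
    match = ANNOTATION.search(line)
    if match is None:
        return None  # line does not follow the assumed format
    sentence = line[: match.start()].strip()
    return sentence, int(match.group(1)), match.group(2) == "y"


def first_half(items: Sequence) -> Sequence:
    """Return the first half of the dataset (rounded down)."""
    return items[: len(items) // 2]


def accuracy(gold: List[bool], predicted: List[Optional[bool]]) -> float:
    """Share of correct predictions; None (UNKNOWN) predictions are skipped."""
    scored = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    if not scored:
        return 0.0
    return sum(g == p for g, p in scored) / len(scored)


# Invented sample line in the assumed format.
print(parse_line("He poised himself on the edge of reason .@1@y"))
print(accuracy([True, False, True, False], [True, True, None, False]))
```

Reporting the number of skipped UNKNOWN cases next to the accuracy keeps the two effects separate.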


## Task 6

Instead of computing the semantic similarity between N and each element of S1 in step 4, we would like to use the WordNet domain of each individual word. For this purpose, download WordNet Domains from [`WordNet Domains (fbk.eu)`](https://wndomains.fbk.eu/download.html). The compatibility between N and an element N1 of S1 is then granted if N and N1 belong to the same WordNet domain. Write a program that implements this reasoning and test it on simple sentences of your choice.

In [9]:
# Use the WordNet domain of each individual word, instead of the similarity between N and each element of S1

# Download the wordnet domain

# If N and N1 belong to the same wordnet domain, compatibility between N and an element N1 of S1 is granted

# Test it on simple sentences of your choice
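
The domain-based compatibility test could look like the sketch below. It assumes the downloaded resource has already been loaded into a mapping from each word to its set of domain labels; that loading step is left out because the file layout is not described here. The function names and the toy mapping are hypothetical. Ignoring the label `factotum` is also an assumption: as far as we are aware, WordNet Domains uses it for entries with no specific domain, so sharing it would say little about compatibility.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Assumed generic label that should not count as a shared domain.
GENERIC_LABELS: FrozenSet[str] = frozenset({"factotum"})


def same_domain(
    noun: str,
    candidate: str,
    domains_of: Dict[str, Set[str]],
    ignore: FrozenSet[str] = GENERIC_LABELS,
) -> bool:
    """True if the two words share at least one non-generic domain label."""
    shared = domains_of.get(noun, set()) & domains_of.get(candidate, set())
    return bool(shared - ignore)


def is_metaphor_by_domain(noun: str, s1: Iterable[str], domains_of: Dict[str, Set[str]]) -> bool:
    """Metaphor if no element of S1 shares a domain with the noun."""
    return not any(same_domain(noun, c, domains_of) for c in s1)


# Hand-written toy mapping (hypothetical, not read from the WordNet Domains file).
toy_domains = {
    "dream": {"psychology"},
    "sleep": {"psychology", "medicine"},
    "sugar": {"gastronomy"},
    "thing": {"factotum"},
}

print(is_metaphor_by_domain("dream", ["sleep"], toy_domains))
print(is_metaphor_by_domain("dream", ["sugar", "thing"], toy_domains))
```

Words missing from the mapping are treated as having no domain, which makes them incompatible with everything; a real run may prefer to report such cases as UNKNOWN.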


## Task 7

Test the reasoning of 6) on the same subset of the annotated metaphor dataset used in 5) and compare the performance in terms of accuracy. Save the individual results in your database as well.

In [10]:
# Test the reasoning of 6) on dataset used in 5)

# Compare the performance in terms of accuracy

# Save individual results in your database as well
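
One way to store the individual results and compare the two methods is sketched below. The project's actual database is not specified in this section, so the use of SQLite (through Python's standard `sqlite3` module), the table layout and the function names are all assumptions, and the rows are invented placeholders rather than real results.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def save_results(
    conn: sqlite3.Connection,
    method: str,
    rows: Iterable[Tuple[str, bool, bool]],  # (sentence, gold label, predicted label)
) -> None:
    """Store one row per sentence for the given method."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(method TEXT, sentence TEXT, gold INTEGER, predicted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [(method, sentence, int(gold), int(pred)) for sentence, gold, pred in rows],
    )
    conn.commit()


def accuracy_by_method(conn: sqlite3.Connection) -> Dict[str, float]:
    """Accuracy per method: the average of (gold = predicted), which SQLite evaluates to 0 or 1."""
    cursor = conn.execute(
        "SELECT method, AVG(gold = predicted) FROM results GROUP BY method"
    )
    return dict(cursor.fetchall())


# Placeholder rows (invented, for illustration only).
conn = sqlite3.connect(":memory:")
save_results(conn, "wu_palmer", [("s1", True, True), ("s2", False, True), ("s3", True, True)])
save_results(conn, "wn_domains", [("s1", True, True), ("s2", False, False), ("s3", True, True)])
print(accuracy_by_method(conn))
```

Keeping both methods in one table, keyed by method name, makes a per-sentence comparison possible with a single query.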


## Task 8

Repeat 6) and 7) using the Reuters corpus (also accessible via NLTK) instead of the BNC corpus. Conclude on the impact of the corpus on the accuracy of metaphor identification.

In [11]:
# Repeat 6) and 7) using the Reuters corpus from NLTK

# Conclude on the impact of the corpus on the accuracy of metaphor identification


## Task 9

You may want to enhance the reasoning of any of the project specifications above. Feel free to suggest any state-of-the-art approach that you judge relevant and adapt it to achieve the goal. Motivate your choice with a concise literature review. Use appropriate literature to discuss your findings.

In [14]:
# Enhance the reasoning of any of project specifications above

# Suggest any relevant state-of-the-art approach

# Motivate your choice by concise literature review. 

# Use appropriate literature to discuss your findings.
