## Exercises

You may work in groups for these exercises. They are due at the beginning of next class. You can submit them as a link to a Colab notebook or GitHub CodeSpace.

0. Use the corpora that you assembled last week (Pausanias++):
1. Using programming techniques from the course so far, find other potential collocates for a word of your choice.
2. Calculate the μ and Mutual Information scores for at least 5 of these collocate pairs. How do your results change depending on your definition of a collocation? What might these changes mean? (Write your answers to these questions down.)
3. Calculate the Delta P for these same five pairs. Do any results stand out? Why? What might they tell us about your corpus.

In [118]:
from lxml import etree
from MyCapytain.common.constants import Mimetypes

In [119]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
from lxml import etree
from MyCapytain.common.constants import Mimetypes
import pandas as pd
urns_pausanias = []
raw_xmls_pausanias = []
unannotated_strings_pausanias = []
urns_iliad = []
raw_xmls_iliad = []
unannotated_strings_iliad = []

In [120]:
with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    textPausanias = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)

for ref in textPausanias.getReffs(level=len(textPausanias.citation)):
    urn = f"{textPausanias.urn}:{ref}"
    node = textPausanias.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns_pausanias.append(urn)
    raw_xmls_pausanias.append(raw_xml)
    unannotated_strings_pausanias.append(s)

In [121]:
with open("../tei/tlg0012.tlg002.perseus-eng3.xml") as f:
    textIliad = CapitainsCtsText(urn="urn:cts:greekLit:tlg0012.tlg002.perseus-eng3", resource=f)

for ref in textIliad.getReffs(level=len(textIliad.citation)):
    urn = f"{textIliad.urn}:{ref}"
    node = textIliad.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns_iliad.append(urn)
    raw_xmls_iliad.append(raw_xml)
    unannotated_strings_iliad.append(s)

In [122]:
pausanias_df = pd.DataFrame({
    "urn": pd.Series(urns_pausanias, dtype="string"),
    "raw_xml": raw_xmls_pausanias,
    "unannotated_strings": pd.Series(unannotated_strings_pausanias, dtype="string")
    
})


iliad_df = pd.DataFrame({
    "urn": pd.Series(urns_iliad, dtype="string"),
    "raw_xml": raw_xmls_iliad,
    "unannotated_strings": pd.Series(unannotated_strings_iliad, dtype="string")
})


In [195]:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [124]:
import spacy
import pandas as pd
from collections import defaultdict
from math import log2

nlp = spacy.load("en_core_web_sm")
window_size = 3  
def find_collocates(doc, node_word, window_size=3):
    collocate_dict = defaultdict(int)  
    tokens = [token.text.lower() for token in doc]  
    
    for idx, token in enumerate(tokens):
        if token == node_word:
            
            left_window = tokens[max(0, idx - window_size): idx]  
            right_window = tokens[idx + 1: idx + 1 + window_size]  
            
            
            collocates = left_window + right_window
            
            
            for collocate in collocates:
                collocate_dict[collocate] += 1
    
    return collocate_dict
pausanias_df['nlp_docs'] = pausanias_df['unannotated_strings'].apply(lambda text: nlp(text))

node_word = "son"

pausanias_df['collocates'] = pausanias_df['nlp_docs'].apply(lambda doc: find_collocates(doc, node_word))

collocates_first_doc = pausanias_df['collocates'].iloc[0]
print(f"Collocates for '{node_word}' in the first document: {collocates_first_doc}")


Collocates for 'son' in the first document: defaultdict(<class 'int'>, {'by': 1, 'ptolemy': 3, ',': 6, 'of': 4, 'lagus': 1, 'when': 1, 'antigonus': 1, 'demetrius': 1})


In [196]:
import spacy
import pandas as pd
from collections import defaultdict
from math import log2

nlp = spacy.load("en_core_web_sm")
window_size = 3  
def find_collocates(doc, node_word, window_size=3):
    collocate_dict = defaultdict(int)  
    tokens = [token.text.lower() for token in doc]  
    
    for idx, token in enumerate(tokens):
        if token == node_word:
            
            left_window = tokens[max(0, idx - window_size): idx]  
            right_window = tokens[idx + 1: idx + 1 + window_size]  
            
            
            collocates = left_window + right_window
            
            
            for collocate in collocates:
                collocate_dict[collocate] += 1
    
    return collocate_dict
iliad_df['nlp_docs'] = iliad_df['unannotated_strings'].apply(lambda text: nlp(text))

node_word = "son"

iliad_df['collocates'] = iliad_df['nlp_docs'].apply(lambda doc: find_collocates(doc, node_word))

collocates_first_doc = iliad_df['collocates'].iloc[0]
print(f"Collocates for '{node_word}' in the first document: {collocates_first_doc}")


Collocates for 'son' in the first document: defaultdict(<class 'int'>, {',': 3, 'agamemnon': 1, "'s": 1, 'had': 1, 'slain': 1, 'wife': 1, 'of': 3, 'the': 2, 'atreus': 2, 'vengeance': 1, 'for': 1, 'when': 1})


One problem I encountered here is that I couldn't find any collocate for any type other than "son". I don't know if it has anything to do with text striping or stop words. So I decided to use only son as my node word

In [193]:
import spacy

quotation_lines_pausanias = []
for unannotated_string in unannotated_strings_pausanias:
    
    lines = unannotated_string.splitlines()
    for line in lines:
        if line.strip() != "":
            quotation_lines_pausanias.append(line.split())
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000

text = " ".join([" ".join(line) for line in quotation_lines_pausanias])
doc = nlp(text)

for token in doc:
    if token.lemma_ == "son":  
        print(f"Token: {token.text} (lemma: {token.lemma_})")
        for child in token.children:
            print(f"Child: {child.text}")


Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: sons (lemma: son)
Child: his
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: the
Child: -
Child: in
Child: of
Token: son (lemma: son)
Child: a
Child: Erysichthon
Token: son (lemma: son)
Child: This
Token: son (lemma: son)
Child: the
Child: reputed
Child: of
Token: son (lemma: son)
Child: His
Token: son (lemma: son)
Child: his
Child: and
Child: EvagorasEvagoras
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: Ajax
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: the
Child: bastard
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: o

In [191]:
import spacy

quotation_lines_iliad = []
for unannotated_string in unannotated_strings_iliad:
    
    lines = unannotated_string.splitlines()
    for line in lines:
        if line.strip() != "":
            quotation_lines_iliad.append(line.split())
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000

text = " ".join([" ".join(line) for line in quotation_lines_iliad])
doc = nlp(text)

for token in doc:
    if token.lemma_ == "son":  
        print(f"Token: {token.text} (lemma: {token.lemma_})")
        for child in token.children:
            print(f"Child: {child.text}")


Token: son (lemma: son)
Child: Agamemnon
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: his
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: his
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: his
Token: son (lemma: son)
Child: her
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)
Child: the
Child: dear
Child: of
Token: son (lemma: son)
Child: his
Child: dear
Token: son (lemma: son)
Child: the
Child: dear
Child: of
Token: sons (lemma: son)
Child: the
Child: of
Token: sons (lemma: son)
Child: the
Child: of
Token: son (lemma: son)
Child: of
Token: son (lemma: son)


Below is the code for The Iliad

frequency of collocation

In [201]:
node = 'son'
collocate1 = 'of'
def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

iliad_df['agalma_megas_1l-1r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate1))

observed_1l_1r_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_1l-1r'] > 0].shape[0]

print (observed_1l_1r_freq_agalma_megas)

collocate2 = 'by'

iliad_df['agalma_megas_1l-1r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate2))

observed_1l_1r_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_1l-1r'] > 0].shape[0]

print (observed_1l_1r_freq_agalma_megas)

collocate3 = 'Agamemnon'
iliad_df['agalma_megas_1l-1r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate3))

observed_1l_1r_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_1l-1r'] > 0].shape[0]

print (observed_1l_1r_freq_agalma_megas)

collocate4 = 'his'

iliad_df['agalma_megas_1l-1r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate4))

observed_1l_1r_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_1l-1r'] > 0].shape[0]

print (observed_1l_1r_freq_agalma_megas)

collocate5 = 'her'

iliad_df['agalma_megas_1l-1r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate5))

observed_1l_1r_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_1l-1r'] > 0].shape[0]

print (observed_1l_1r_freq_agalma_megas)


151
4
8
53
12


mutual infomation

In [203]:

from collections import Counter
import pandas as pd
import math
node = 'son'
collocate1 = 'of'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()] 
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1




def getMu (node, collocate):
    iliad_df['agalma_megas_dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))
    observed_dep_freq_agalma_megas = iliad_df[iliad_df['agalma_megas_dependencies'] > 0].shape[0]
    iliad_df['agalma_megas_5l-5r'] = iliad_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))
    observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]    
    expected_freq_agalma_megas = expected_frequency_of_collocation(iliad_df, node, collocate)
    mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
    mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
    mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas

    print(f"μ for son with dependency of: {mu_deps}\n\nμ for son and of in a 1L, 1R window: {mu_1l_1r}")
    print(f"μ for son and of in a 5L, 5R window: {mu_5l_5r}")





In [204]:
getMu('son','of')
getMu('son','by')
getMu('son','Agamemnon')
getMu('son','his')
getMu('son','her')

  observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]
  observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]


μ for son with dependency of: 12.331213605993117

μ for son and of in a 1L, 1R window: 1.006629682121887
μ for son and of in a 5L, 5R window: 0.33554322737396236
μ for son with dependency of: 0.8278204622712887

μ for son and of in a 1L, 1R window: 9.933845547255466
μ for son and of in a 5L, 5R window: 3.3112818490851548


  observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]
  observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]


μ for son with dependency of: 15.044737096930378

μ for son and of in a 1L, 1R window: 180.53684516316451
μ for son and of in a 5L, 5R window: 60.17894838772151
μ for son with dependency of: 11.801502265867523

μ for son and of in a 1L, 1R window: 2.6720382488756655
μ for son and of in a 5L, 5R window: 0.8906794162918885
μ for son with dependency of: 9.933845547255466

μ for son and of in a 1L, 1R window: 9.933845547255466
μ for son and of in a 5L, 5R window: 3.3112818490851548


  observed_5l_5r_freq_agalma_megas = iliad_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]


Inversed MI

In [206]:
import math
def getInvMu(node,collocate):
    iliad_df['megas_agalma_collocations'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))
    expected_freq_agalma_megas = expected_frequency_of_collocation(iliad_df, node, collocate)
    observed_freq_megas_agalma = iliad_df[iliad_df['megas_agalma_collocations'] > 0].shape[0]
    mu = observed_freq_megas_agalma / expected_freq_agalma_megas

    mutual_information_megas_agalma = math.log(mu, 2)
    print(f"the inversed MI for node {node} and collocate {collocate} is {mutual_information_megas_agalma}")

I encountered math domain errors in the calculation of inversed MI for node son and collocates 'his' and 'her'

In [209]:
getInvMu('son','of')
getInvMu('son','by')
getInvMu('son','Agamemnon')
# getInvMu('son','his')
# getInvMu('son','her')

the inversed MI for node son and collocate of is 0.816887965854154
the inversed MI for node son and collocate by is -0.27261018498494866
the inversed MI for node son and collocate Agamemnon is 6.718541913096526


delta_p

In [210]:
def count_node_frequency(doc, node_word):
    node_count = 0
    
    node_word_lower = node_word.lower()
    
    
    for token in doc:
        
        if token.text.lower() == node_word_lower:
            node_count += 1
    
    return node_count


def count_non_node_tokens(doc, node_word):
    non_node_count = 0
    node_word_lower = node_word.lower() 
    
    for token in doc:
        
        if token.text.lower() != node_word_lower:
            non_node_count += 1
    
    return non_node_count

observed_freq_megas_agalma = iliad_df[iliad_df['megas_agalma_collocations'] > 0].shape[0]
def delta_p_calc(doc,node,collocate):
    delta_p = observed_freq_megas_agalma/count_node_frequency(doc,node)-count_node_frequency(doc,collocate)/count_non_node_tokens(doc,node)

    return delta_p

In [212]:
print (delta_p_calc(doc,'son','of'))
print (delta_p_calc(doc,'son','by'))
print (delta_p_calc(doc,'son','Agamemnon'))
print (delta_p_calc(doc,'son','his'))
print (delta_p_calc(doc,'son','her'))

-0.041657897333128
-0.004320008727683066
0.0049309969221077
-0.0015194239759594424
0.0031222859366195265


A positive delta_p indicates the probability that a collocate occurs increases as the node occurs; a negative delta_p indicates the probability that a collocate occurs decreases as the node occurs. Compared to the delta_p in Pausanias, there are more positive delta_p in the Iliad.

Below is the code for Pausanias

frequency of co-occurence

Collocation of 'son' and 'of'

In [126]:
node = 'son'
collocate = 'of'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

843

In [127]:
def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_1l-1r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_1l_1r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_1l-1r'] > 0].shape[0]

observed_1l_1r_freq_agalma_megas


856

In [128]:
node = 'son'
collocate = 'of'
def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_5l-5r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_5l_5r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]

observed_5l_5r_freq_agalma_megas


897

In [129]:
node = 'son'
collocate = 'of'
def count_ngram_collocations(x, w1, w2, l_size: int = 10, r_size: int = 10):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_10l-10r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_10l_10r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_10l-10r'] > 0].shape[0]

observed_10l_10r_freq_agalma_megas


923

In [156]:

from collections import Counter
import pandas as pd
import math
node = 'son'
collocate = 'of'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()] 
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)
observed_10l_10r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_10l-10r'] > 0].shape[0]
mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency of: {mu_deps}\n\nμ for son and of in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and of in a 5L, 5R window: {mu_5l_5r}")
print(f"μ for son and of in a 10L, 10R window: {mu_10l_10r}")




μ for son with dependency of: 1.4922412766477178

μ for son and of in a 1L, 1R window: 6.715085744914729
μ for son and of in a 5L, 5R window: 0.42054072341890225
μ for son and of in a 10L, 10R window: 6.715085744914729


Mutual information calculation

In [154]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

5.750908117446363
7.920833118888675


Inversed MI

In [157]:
import math
pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]

## Note that the expected frequency does not change depending on which direction the dependency goes
mu = observed_freq_megas_agalma / expected_freq_agalma_megas

mutual_information_megas_agalma = math.log(mu, 2)

mutual_information_megas_agalma

-0.5034391752858247

Delta's P

- The observed frequency of the collocate pair in the corpus (O11), divided by the frequency of the node in the corpus (R1)
  - minus the observed frequency of _the collocate **without** the node_ in the corpus (O21), divided by the tokens that are not the node in the corpus (R2)

AND

- The observed frequency of the collocate pair (O11), divided by the frequency of the *collocate* in the corpus (C1)
  - minus the observed frequency of _the node **without** the collocate_ (O12), divided by the tokens that are not the collocate (C2)

In [184]:
def count_node_frequency(doc, node_word):
    node_count = 0
    
    node_word_lower = node_word.lower()
    
    
    for token in doc:
        
        if token.text.lower() == node_word_lower:
            node_count += 1
    
    return node_count

print (count_node_frequency(doc,'son'))

def count_non_node_tokens(doc, node_word):
    non_node_count = 0
    node_word_lower = node_word.lower() 
    
    for token in doc:
        
        if token.text.lower() != node_word_lower:
            non_node_count += 1
    
    return non_node_count


1364


delta_p calculation function

In [185]:
def count_node_frequency(doc, node_word):
    node_count = 0
    
    node_word_lower = node_word.lower()
    
    
    for token in doc:
        
        if token.text.lower() == node_word_lower:
            node_count += 1
    
    return node_count
def delta_p_calc(doc,node,collocate):
    delta_p = observed_freq_megas_agalma/count_node_frequency(doc,node)-count_node_frequency(doc,collocate)/count_non_node_tokens(doc,node)

    return delta_p

print (delta_p_calc(doc,'son','of'))

-0.04678986214251216


Collocation of 'son' and 'by'

In [132]:
node = 'son'
collocate = 'by'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

18

In [133]:
node = 'son'
collocate = 'by'
def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_5l-5r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_5l_5r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]

observed_5l_5r_freq_agalma_megas


250

In [158]:

from collections import Counter
import pandas as pd
node = 'son'
collocate = 'by'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency by: {mu_deps}\n\nμ for son and by in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and by in a 5L, 5R window: {mu_5l_5r}")


μ for son with dependency by: 7.387003713446435

μ for son and by in a 1L, 1R window: 33.24151671050896
μ for son and by in a 5L, 5R window: 2.0817919556076316


MI

In [135]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

0.27355459081180294
5.845096575770637


Inversed MI

In [159]:
import math
pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]

## Note that the expected frequency does not change depending on which direction the dependency goes
mu = observed_freq_megas_agalma / expected_freq_agalma_megas

mutual_information_megas_agalma = math.log(mu, 2)

mutual_information_megas_agalma

0.4959470121482507

delta_p

In [186]:
print (delta_p_calc(doc,'son','by'))

-0.00945197353706723


collocation between 'son' and 'his'

In [136]:
node = 'son'
collocate = 'his'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

107

In [137]:
node = 'son'
collocate = 'his'
def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_5l-5r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_5l_5r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]

observed_5l_5r_freq_agalma_megas


228

In [163]:

from collections import Counter
import pandas as pd
node = 'son'
collocate = 'his'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)
print (expected_freq_agalma_megas)
mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency his: {mu_deps}\n\nμ for son and his in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and his in a 5L, 5R window: {mu_5l_5r}")


10.478864017152725
μ for son with dependency his: 10.497321066476513

μ for son and his in a 1L, 1R window: 47.23794479914431
μ for son and his in a 5L, 5R window: 2.958335936916108


MI

In [139]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

3.3520565644905207
6.352056564490521


Inversed MI

In [166]:
# import math
# node = 'son'
# collocate = 'his'
# pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

# observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]

# print (observed_freq_megas_agalma)

# ## Note that the expected frequency does not change depending on which direction the dependency goes
# mu = observed_freq_megas_agalma / expected_freq_agalma_megas

# mutual_information_megas_agalma = math.log(mu, 2)

# mutual_information_megas_agalma

0


ValueError: math domain error

delta_p

In [187]:
print (delta_p_calc(doc,'son','his'))

-0.006651388785343606


There's a math domain error. After printing out both observed freq. and expected freq., it's clear that the observed freq. here is 0

Collocation between 'son' and 'the'

In [140]:
node = 'son'
collocate = 'the'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

633

In [141]:
node = 'son'
collocate = 'the'
def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_5l-5r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_5l_5r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]

observed_5l_5r_freq_agalma_megas


863

In [169]:

from collections import Counter
import pandas as pd
node = 'son'
collocate = 'the'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)
print (expected_freq_agalma_megas)
mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency the: {mu_deps}\n\nμ for son and the in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and the in a 5L, 5R window: {mu_5l_5r}")


137.53253689569064
μ for son with dependency the: 0.7998107392102259

μ for son and the in a 1L, 1R window: 3.5991483264460165
μ for son and the in a 5L, 5R window: 0.22540120832288185


MI

In [143]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

2.202432533633878
2.6378378306066836


Inversed MI

In [170]:
# import math
# pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

# observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]
# print (observed_freq_megas_agalma)
# ## Note that the expected frequency does not change depending on which direction the dependency goes
# mu = observed_freq_megas_agalma / expected_freq_agalma_megas

# mutual_information_megas_agalma = math.log(mu, 2)

# mutual_information_megas_agalma

0


ValueError: math domain error

delta_p

In [188]:
print (delta_p_calc(doc,'son','the'))

-0.08729785709886648


collocation between 'son' and 'her'

In [144]:
node = 'son'
collocate = 'her'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

19

In [145]:
node = 'son'
collocate = 'her'
def count_ngram_collocations(x, w1, w2, l_size: int = 5, r_size: int = 5):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_5l-5r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_5l_5r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_5l-5r'] > 0].shape[0]

observed_5l_5r_freq_agalma_megas


31

In [171]:

from collections import Counter
import pandas as pd
node = 'son'
collocate = 'her'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency her: {mu_deps}\n\nμ for son and her in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and her in a 5L, 5R window: {mu_5l_5r}")


μ for son with dependency her: 53.85125707102451

μ for son and her in a 1L, 1R window: 242.33065681961028
μ for son and her in a 5L, 5R window: 15.176263356379634


In [147]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

3.2174759173652885
8.71101539032285


In [176]:
# import math
# node = 'son'
# collocate = 'her'
# pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

# observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]
# print (observed_freq_megas_agalma)
# ## Note that the expected frequency does not change depending on which direction the dependency goes
# mu = observed_freq_megas_agalma / expected_freq_agalma_megas

# mutual_information_megas_agalma = math.log(mu, 2)

# mutual_information_megas_agalma

0


ValueError: math domain error

delta p

In [189]:
print (delta_p_calc(doc,'son','her'))

-0.0020096788727646375


collocation between 'son' and 'a'

In [148]:
node = 'son'
collocate = 'a'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

110

In [149]:
node = 'son'
collocate = 'a'
def count_ngram_collocations(x, w1, w2, l_size: int = 10, r_size: int = 10):
    lemmata = [t.lemma_ for t in x]

    # the right-hand side of a slice in Python is exclusive, so we add 1 to make sure
    # we're actually getting one element to the right
    chunked_lemmata = [lemmata[i - l_size:i + r_size + 1] for i in range(0, len(lemmata))]

    cooccurrences = [1 for l in chunked_lemmata if w1 in l and w2 in l]

    return sum(cooccurrences)

pausanias_df['agalma_megas_10l-10r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_1l_1r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_10l-10r'] > 0].shape[0]

observed_1l_1r_freq_agalma_megas


495

In [177]:

from collections import Counter
import pandas as pd
node = 'son'
collocate = 'her'
def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    # """
    # `node` and `collocate` should be the string representations
    # of the associated lemmata
    # """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas
mu_5l_5r = observed_5l_5r_freq_agalma_megas / expected_freq_agalma_megas
mu_10l_10r = observed_10l_10r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for son with dependency a: {mu_deps}\n\nμ for son and a in a 1L, 1R window: {mu_1l_1r}")
print(f"μ for son and a in a 5L, 5R window: {mu_5l_5r}")


μ for son with dependency a: 53.85125707102451

μ for son and a in a 1L, 1R window: 242.33065681961028
μ for son and a in a 5L, 5R window: 15.176263356379634


MI

In [151]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print (mutual_information_agalma_megas_deps)

mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
print (mutual_information_agalma_megas_1l_1r)

5.750908117446363
7.920833118888675


Inversed MI

In [180]:
# import math
# node = 'son'
# collocate = 'a'
# pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

# observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]

# ## Note that the expected frequency does not change depending on which direction the dependency goes
# mu = observed_freq_megas_agalma / expected_freq_agalma_megas

# mutual_information_megas_agalma = math.log(mu, 2)

# mutual_information_megas_agalma

ValueError: math domain error

delta p

In [190]:
print (delta_p_calc(doc,'son','a'))

-0.019062776533433603


Observation as the window size increases: As the window size increases, the collocation frequency increases as well. That's because a window of larger size captures more collocates around the node. Meanwhile, with a larger window size the base of tokens that are potentially collocates increases as well because a larger window covers more tokens itself.

With a window size of 1, only the words to the left and right of the node are counted. With a window size of 10, a total of 20 words surrounding the node is counted.

Nevertheless, Mu decreases significantly as window size increases

For all the collocations I chose with regard to "son", the mutual information is larger than 1, indicating that the occurence of the node comes with the occurence of the collocate

In calculating the inversed MI, I got three different kinds of results. The first one is an inverse MI < 0 while > -1. This aligns with the expected result since adjectives aren't expected to rule the nouns. The second one is an inverse MI > 0 while < 1. This looks like that the adjectives, or the collocates, are somehow affecting the occurences of the nodes. The third one is a math fault. After trying to print out the expected and observed frequencies, it appeared that the observed frequencies in those cases after I switched the position of node and collocate, is 0. I have not yet figured out what caused the denominator to be 0.

In all cases of Pausanias, the delta p is negative, indicating that when the node appears, the collocate is less likely to appear to its right or left. (This doesn't seem reasonable because a collocation pair is supposed to occur together)