## HW 4 Submission
### Aadit Zaveri and Rohan Valia

In [1]:
#loading files
from utils import load_pausanias

pausanias_df = load_pausanias('eng') 

In [2]:
from utils import load_iliad

iliad_df = load_iliad()

When calculating co-occurrences in Greek, it is generally insufficient to use the L and R windows that Brezina uses for English (@Brezina2018 67–70). Instead, we'll look for a dependency relationship between the **node** and its **collocates**. Below, you can see that we can access the dependencies of a token through its `children` property.

In [3]:
test_token = pausanias_df['nlp_docs'][0][1]

# we use a list comprehension to evaluate the generator at `test_token.children`
f"token: '{test_token}, {test_token.lemma_}', dependencies: {[(c, c.lemma_) for c in test_token.children]}"

"token: 'the, the', dependencies: []"

In [16]:
# Expected frequency of a collocation (the denominator of Mu)

from collections import Counter
import pandas as pd

def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    """
    `node` and `collocate` should be the string representations
    of the associated lemmata
    """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)



In [17]:
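def deltaP(node, collocate, df):
    """
    Delta P of `collocate` given `node`, with co-occurrence counted
    in a window of lemmata around each occurrence of `node`.
    """
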
    def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
        lemmata = [t.lemma_ for t in x]

        indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]

        cooccurrences = 0

        for i in indexes:
            left = max(i - l_size, 0)
            right = min(i + r_size + 1, len(lemmata))
            window = lemmata[left:right]

            if w2 in window:
                cooccurrences += 1
                
        return cooccurrences

    # o11: occurrences of the node that have the collocate inside the window
    o11 = df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate)).sum()

    all_tokens = df['nlp_docs'].explode()

    r1 = len([t for t in all_tokens if t.lemma_ == node])
    r2 = len(all_tokens) - r1
    c1 = len([t for t in all_tokens if t.lemma_ == collocate])
    o21 = c1 - o11

    return (o11 / r1) - (o21 / r2)

Notice that we're also accessing the `lemma_` property here. Because Greek is heavily inflected, we'll tend to focus on collocations of lemmata, rather than types -- but you might find in your own work that it is interesting to look at type collocations instead. Just be sure to note which kind of "word" you're examining.

### Frequency of co-occurrence

The frequency of co-occurrence reports the presence of both a **node** (`w1`) and a **collocate** (`w2`). Given a DataFrame like `pausanias_df`, we can calculate the frequency of co-occurrence in two different ways. 

We can either count when the collocate is a dependency of the node, like so:

In [18]:
print('Pausanias: ')
node = 'god'
collocate = 'the' # the

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = pausanias_df[pausanias_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, pausanias_df)
print('Delta P: ', deltaP_1)



Pausanias: 
Observed Freq:  322
Mu:  8.314623403386307
Mutual Information:  3.0556509205935973
Delta P:  0.6868650246447927


In [19]:
print('Iliad: ')
node = 'god'
collocate = 'the' # the

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

iliad_df['dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = iliad_df[iliad_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(iliad_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, iliad_df)
print('Delta P: ', deltaP_1)



Iliad: 
Observed Freq:  179
Mu:  6.554758580371991
Mutual Information:  2.712542645235118
Delta P:  0.4037241526145461


In [20]:
node = 'god' 
collocate = 'a' 

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = pausanias_df[pausanias_df['dependencies'] > 0].shape[0]

observed_freq

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

mutual_info = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_info)

deltaP_2 = deltaP(node, collocate, pausanias_df)
print('Delta P: ', deltaP_2)

Observed Freq:  15
Mu:  1.7737622063788958
Mutual Information:  0.8268126124164578
Delta P:  0.01462683309133956


In [21]:
print('Iliad: ')
node = 'god'
collocate = 'a'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

iliad_df['dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = iliad_df[iliad_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(iliad_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, iliad_df)
print('Delta P: ', deltaP_1)



Iliad: 
Observed Freq:  40
Mu:  8.611590793436276
Mutual Information:  3.106279766984668
Delta P:  0.09968310338954181


In [22]:
node = 'god'  
collocate = 'of'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = pausanias_df[pausanias_df['dependencies'] > 0].shape[0]

observed_freq

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

mutual_info = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_info)

deltaP_3 = deltaP(node, collocate, pausanias_df)
print('Delta P: ', deltaP_3)

Observed Freq:  17
Mu:  0.8190066925627196
Mutual Information:  -0.2880528539031336
Delta P:  0.009484953336988497


In [23]:
print('Iliad: ')
node = 'god'
collocate = 'of'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

iliad_df['dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = iliad_df[iliad_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(iliad_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, iliad_df)
print('Delta P: ', deltaP_1)



Iliad: 
Observed Freq:  15
Mu:  0.9944817380983576
Mutual Information:  -0.007983216132855285
Delta P:  0.032784619315278485


In [24]:
node = 'god'
collocate = 'in'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = pausanias_df[pausanias_df['dependencies'] > 0].shape[0]

observed_freq

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

mutual_info = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_info)

deltaP_4 = deltaP(node, collocate, pausanias_df)
print('Delta P: ', deltaP_4)

Observed Freq:  8
Mu:  1.1282628849552394
Mutual Information:  0.1741032544687521
Delta P:  0.013209740295223218


In [25]:
print('Iliad: ')
node = 'god'
collocate = 'in'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

iliad_df['dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = iliad_df[iliad_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(iliad_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, iliad_df)
print('Delta P: ', deltaP_1)



Iliad: 
Observed Freq:  6
Mu:  0.9205443293085306
Mutual Information:  -0.11944089788176607
Delta P:  0.009850962357349684


In [26]:
node = 'god'
collocate = 'great'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = pausanias_df[pausanias_df['dependencies'] > 0].shape[0]

observed_freq

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

mutual_info = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_info)

deltaP_5 = deltaP(node, collocate, pausanias_df)
print('Delta P: ', deltaP_5)

Observed Freq:  1
Mu:  2.341525372775373
Mutual Information:  1.2274486711691057
Delta P:  0.0012806986548452477


In [27]:
print('Iliad: ')
node = 'god'
collocate = 'great'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

iliad_df['dependencies'] = iliad_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_freq = iliad_df[iliad_df['dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_freq)

expected_freq = expected_frequency_of_collocation(iliad_df, node, collocate)

mu_deps = observed_freq / expected_freq

print('Mu: ', mu_deps)

import math

mutual_info = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_info)

deltaP_1 = deltaP(node, collocate, iliad_df)
print('Delta P: ', deltaP_1)



Iliad: 
Observed Freq:  3
Mu:  3.218633075579945
Mutual Information:  1.686448118843056
Delta P:  0.004544966785896781


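Or we can count when the collocate and node co-occur within a given window. The `count_ngram_collocations` helper defined inside `deltaP` above does this, with a default window of one token on each side of the node.

You can experiment in your own notebooks by adjusting the `l_size` and `r_size` args passed to the `count_ngram_collocations` function.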
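
As a minimal sketch of the window-based count, the function below works on a plain list of lemma strings rather than spaCy tokens. The name `count_window_cooccurrences` and the toy sentence are illustrative assumptions, not part of the course `utils`. Unlike the helper in `deltaP`, this version leaves the node itself out of the window, which only matters when the node and collocate are the same lemma.

```python
from typing import Sequence


def count_window_cooccurrences(lemmata: Sequence[str], w1: str, w2: str,
                               l_size: int = 1, r_size: int = 1) -> int:
    """Count occurrences of w1 that have w2 within l_size tokens to the
    left or r_size tokens to the right (the node itself is excluded)."""
    count = 0
    for i, lemma in enumerate(lemmata):
        if lemma != w1:
            continue
        left = lemmata[max(i - l_size, 0):i]
        right = lemmata[i + 1:i + 1 + r_size]
        if w2 in left or w2 in right:
            count += 1
    return count


# Hypothetical toy sentence, already lemmatized
toy = ["the", "god", "of", "the", "sea", "honor", "the", "great", "god"]

narrow = count_window_cooccurrences(toy, "god", "the")        # L1/R1 window
wide = count_window_cooccurrences(toy, "god", "the", 2, 2)    # L2/R2 window
print(narrow, wide)
```

With this toy sentence, widening the window from L1/R1 to L2/R2 also catches "the great god", so the count rises from 1 to 2.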

## Exercises

You may work in groups for these exercises. They are due at the beginning of next class. You can submit them as a link to a Colab notebook or GitHub CodeSpace.

0. Use the corpora that you assembled last week (Pausanias++):
1. Using programming techniques from the course so far, find other potential collocates for a word of your choice (indicative verb)
2. Calculate the μ and Mutual Information scores for at least 5 of these collocate pairs. How do your results change depending on your definition of a collocation? What might these changes mean? (Write your answers to these questions down.)
3. Calculate the Delta P for these same five pairs. Do any results stand out? Why? What might they tell us about your corpus?
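
For exercise 3, the Delta P arithmetic can be sketched on plain counts, following the same contingency-table terms used in the `deltaP` function above (`o11`, `r1`, `r2`, `c1`, `o21`). The function name `delta_p_from_counts` and the counts in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def delta_p_from_counts(o11: int, node_count: int,
                        collocate_count: int, corpus_size: int) -> float:
    """Delta P of the collocate given the node, from raw counts."""
    r1 = node_count                 # tokens that are the node
    r2 = corpus_size - r1           # tokens that are not the node
    o21 = collocate_count - o11     # collocate occurrences away from the node
    return o11 / r1 - o21 / r2


# Hypothetical counts: the node occurs 100 times in a 1,000-token corpus,
# the collocate occurs 275 times, and 50 of those are next to the node.
example = delta_p_from_counts(o11=50, node_count=100,
                              collocate_count=275, corpus_size=1000)
print(example)
```

Here the collocate appears near half of the node's occurrences (50 / 100) but near only a quarter of the remaining tokens (225 / 900), so Delta P is 0.25. Values near 0 mean the node does little to predict the collocate.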

## 2. Mu and Mutual Information Score Analysis

In our dispersion lab we chose the words "god" and "goddess". Since we are working with the English translations, we kept the masculine form, "god", as our node and paired it with five different collocates ("the", "a", "of", "in", and "great") to see which words associate most strongly with "god" in Pausanias and the Iliad.

### Definitions
Mutual information measures how much the appearance of one word in our collocate pair suggests the appearance of the other word. We can calculate it by taking the log<sub>2</sub> of `mu` (observed / expected frequency).

> For μ > 1 we speak of positive association (where the components are more likely to occur together than if they were independent), and for μ < 1 we speak of negative association (where the components are less likely to occur together than if they were independent).

In other words, if μ is greater than 1, the words co-occur more frequently than expected; if it is less than 1, they co-occur less frequently than expected.
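
To make the relationship between μ and Mutual Information concrete, here is a small sketch on plain counts. It uses the same expected-frequency formula as `expected_frequency_of_collocation` above; the function name `mu_and_mi` and the counts are hypothetical and only illustrate the arithmetic.

```python
import math


def mu_and_mi(observed: float, node_count: int, collocate_count: int,
              corpus_size: int, window_size: int = 1):
    """Return (mu, mutual information) for one node/collocate pair."""
    expected = (node_count * collocate_count * window_size) / corpus_size
    mu = observed / expected
    return mu, math.log2(mu)


# Hypothetical counts in a 10,000-token corpus
# Positive association: observed 20, expected 100 * 500 / 10,000 = 5
mu_pos, mi_pos = mu_and_mi(observed=20, node_count=100,
                           collocate_count=500, corpus_size=10_000)

# Negative association: observed 5, expected 100 * 1,000 / 10,000 = 10
mu_neg, mi_neg = mu_and_mi(observed=5, node_count=100,
                           collocate_count=1_000, corpus_size=10_000)

print(mu_pos, mi_pos)
print(mu_neg, mi_neg)
```

The first pair gives μ = 4 and MI = 2; the second gives μ = 0.5 and MI = -1. MI is positive exactly when μ > 1 and negative exactly when μ < 1.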

### Analysis
#### Top Mutual Information Values (Positive Association)
"god" in Pausanias:

"the": Highest MI score (3.06), with μ of 8.31, meaning "the" occurs as a dependent of "god" about eight times more often than expected by chance. Part of this is likely grammatical, since determiners attach to nouns in a dependency parse.

"great": MI of 1.23 (μ of 2.34). This is a positive association on paper, but it rests on a single observed co-occurrence, so it should be read with caution.

"a": MI of 0.83 (μ of 1.77), a mild positive association from 15 observed co-occurrences.

"god" in the Iliad:

"a": Highest MI score (3.11), with μ of 8.61 from 40 observed co-occurrences.

"the": MI of 2.71, with μ of 6.55 from 179 observed co-occurrences. This is the most frequent pairing in the Iliad, though again the definite article is expected to attach to a noun like "god".

"great": MI of 1.69 (μ of 3.22). This fits the expectation that "great" would accompany "god" in an epic like the Iliad, but with only 3 observed co-occurrences the evidence is thin.

#### Low Mutual Information Values (Negative Association)
"god" in Pausanias:

"of": MI of -0.29 (μ of 0.82), indicating that "of" appears as a dependent of "god" slightly less often than expected.

"in": MI of 0.17 (μ of 1.13), a marginally positive value that is close to what independence would predict.

"god" in the Iliad:

"in": MI of -0.12 (μ of 0.92), reflecting co-occurrence slightly below expectation.

"of": MI of -0.008 (μ of 0.99), almost exactly the frequency expected under independence, so there is no measurable association.

#### Observations and Insights
High μ values for "the" in both texts, and for "a" in the Iliad, show that articles attach to "god" far more often than chance would predict, which likely reflects grammar as much as style. "great" also scores above 1 in both texts, but on very few observations (1 in Pausanias, 3 in the Iliad).

MI values for "of" and "in" sit close to zero in both texts, so these prepositions occur with "god" at roughly the rate expected under independence and do not form notable collocations with it.