## Lab 4 Submission
### Rohan and Aadit

In [2]:
# I've created a utils.py file for frequently reused functionality -- you can import from it like so
from utils import load_pausanias

pausanias_df = load_pausanias('eng') # 'eng' loads the English version

When calculating co-occurrences in Greek, the L and R windows that Brezina uses for English are generally insufficient (@Brezina2018 67–70). Instead, we'll look for a dependency relationship between the **node** and its **collocates**. A token's dependents are available through its `children` property, as the next cell shows.

In [3]:
test_token = pausanias_df['nlp_docs'][0][1]

# we use a list comprehension to evaluate the generator at `test_token.children`
f"token: '{test_token}, {test_token.lemma_}', dependencies: {[(c, c.lemma_) for c in test_token.children]}"

"token: 'the, the', dependencies: []"

In [4]:
# Expected frequency of collocation (the denominator of Mu)

from collections import Counter
import pandas as pd

def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    """
    `node` and `collocate` should be the string representations
    of the associated lemmata
    """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)
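
As a quick sanity check on the formula used in `expected_frequency_of_collocation`, here is a minimal sketch on a hand-made list of lemmata. The helper name `expected_frequency` and the toy corpus are made up for illustration; the arithmetic (node count times collocate count times window size, divided by corpus size) is the same as in the function above.

```python
from collections import Counter

def expected_frequency(lemmata, node, collocate, window_size=1):
    # same formula as above, but on a plain list of lemma strings
    counts = Counter(lemmata)
    return (counts[node] * counts[collocate] * window_size) / len(lemmata)

# hypothetical ten-lemma corpus: 'god' occurs 2 times, 'the' occurs 3 times
toy_lemmata = ["the", "god", "of", "the", "sea", "and", "the", "god", "of", "war"]

e11 = expected_frequency(toy_lemmata, "god", "the")
print(f"Expected frequency: {e11:.2f}")  # (2 * 3 * 1) / 10
```

Widening the window scales the expected frequency linearly, which is why the observed count should be collected with the same window size.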



In [5]:
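def deltaP(node, collocate):
    """
    Delta P of `collocate` given `node`:
    P(collocate | node) - P(collocate | not node).
    `node` and `collocate` should be lemma strings.
    """

    def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
        # count the occurrences of w1 in doc x that have w2 inside the window
        lemmata = [t.lemma_ for t in x]

        indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]

        cooccurrences = 0

        for i in indexes:
            left = max(i - l_size, 0)
            right = min(i + r_size + 1, len(lemmata))
            # note: this slice includes the node's own position
            window = lemmata[left:right]

            if w2 in window:
                cooccurrences += 1

        return cooccurrences

    # side effect: stores the per-row window counts on the global DataFrame
    pausanias_df['o11_temple_a_ngrams'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

    # o11: node and collocate together
    o11 = pausanias_df['o11_temple_a_ngrams'].sum()

    all_tokens = pausanias_df['nlp_docs'].explode()

    r1 = len([t for t in all_tokens if t.lemma_ == node])  # all occurrences of the node
    r2 = len(all_tokens) - r1  # all tokens that are not the node
    c1 = len([t for t in all_tokens if t.lemma_ == collocate])  # all occurrences of the collocate
    o21 = c1 - o11  # collocate without the node

    return (o11 / r1) - (o21 / r2)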
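
The Delta P arithmetic is easier to follow on bare contingency-table counts. The sketch below uses invented numbers and a hypothetical helper named `delta_p`; it only illustrates the formula `o11 / r1 - o21 / r2` and the fact that Delta P is directional.

```python
def delta_p(o11, r1, c1, n):
    """
    Delta P of the collocate given the node.
    o11: co-occurrences, r1: node count, c1: collocate count, n: corpus size.
    """
    r2 = n - r1      # tokens that are not the node
    o21 = c1 - o11   # collocate occurrences without the node
    return o11 / r1 - o21 / r2

# invented counts: node 50 times, collocate 130 times, together 30 times, 1050 tokens
collocate_given_node = delta_p(o11=30, r1=50, c1=130, n=1050)

# swapping the roles of node and collocate gives the other direction
node_given_collocate = delta_p(o11=30, r1=130, c1=50, n=1050)

print(f"Delta P (collocate | node): {collocate_given_node:.3f}")
print(f"Delta P (node | collocate): {node_given_collocate:.3f}")
```

The two directions differ because the rarer word predicts the more frequent one better than the reverse, which is worth keeping in mind when a very frequent collocate such as an article is involved.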

Notice that we're also accessing the `lemma_` property here. Because Greek is heavily inflected, we'll tend to focus on collocations of lemmata, rather than types -- but you might find in your own work that it is interesting to look at type collocations instead. Just be sure to note which kind of "word" you're examining.
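
To make the lemma/type distinction concrete, here is a small sketch with hand-written `(type, lemma)` pairs. The pairs are invented for illustration and do not come from the corpus; in the notebook the same information would come from each token's text and `lemma_`.

```python
from collections import Counter

# hypothetical (type, lemma) pairs
tokens = [
    ("gods", "god"),
    ("god", "god"),
    ("temples", "temple"),
    ("god", "god"),
    ("temple", "temple"),
]

type_counts = Counter(form for form, _ in tokens)
lemma_counts = Counter(lemma for _, lemma in tokens)

print(type_counts["god"])   # 2: only the exact form 'god'
print(lemma_counts["god"])  # 3: 'god' and 'gods' are grouped together
```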

### Frequency of co-occurrence

The frequency of co-occurrence counts how often a **node** (`w1`) and a **collocate** (`w2`) appear together. Given a DataFrame like `pausanias_df`, we can calculate it in two different ways.

We can either count when the collocate is a dependency of the node, like so:

In [6]:
node = 'god'
collocate = 'the'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

# number of rows with at least one node -> collocate dependency (not the total number of pairings)
observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq_agalma_megas)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas

print('Mu: ', mu_deps)

import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)
print('Mutual Information: ', mutual_information_agalma_megas_deps)

deltaP_1 = deltaP(node, collocate)
print('Delta P: ', deltaP_1)



Observed Freq:  322
Mu:  8.314623403386307
Mutual Information:  3.0556509205935973
Delta P:  0.6868650246447927
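
The relationship between the printed values can be checked by hand. The sketch below plugs in the observed frequency and Mu from the output above to recover the implied expected frequency, and confirms that Mutual Information is the base-2 logarithm of Mu. The variable names are chosen here for illustration.

```python
import math

observed = 322            # Observed Freq from the output above
mu = 8.314623403386307    # Mu from the output above

# mu = observed / expected, so expected = observed / mu
expected = observed / mu

# Mutual Information is log2 of Mu
mutual_information = math.log2(mu)

print(f"Implied expected frequency: {expected:.2f}")
print(f"Mutual Information: {mutual_information:.4f}")
```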


In [15]:
node = 'god' 
collocate = 'child' 

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq_agalma_megas)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas

print('Mu: ', mu_deps)

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_agalma_megas_deps)

deltaP_2 = deltaP(node, collocate)
print('Delta P: ', deltaP_2)

Observed Freq:  1
Mu:  3.132581241956242
Mutual Information:  1.6473519256197655
Delta P:  -0.0007135831517232069


In [9]:
node = 'god'
collocate = 'of'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq_agalma_megas)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas

print('Mu: ', mu_deps)

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_agalma_megas_deps)

deltaP_3 = deltaP(node, collocate)
print('Delta P: ', deltaP_3)

Observed Freq:  17
Mu:  0.8190066925627196
Mutual Information:  -0.2880528539031336
Delta P:  0.009484953336988497


In [10]:
node = 'god'
collocate = 'in'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq_agalma_megas)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas

print('Mu: ', mu_deps)

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_agalma_megas_deps)

deltaP_4 = deltaP(node, collocate)
print('Delta P: ', deltaP_4)

Observed Freq:  8
Mu:  1.1282628849552394
Mutual Information:  0.1741032544687521
Delta P:  0.013209740295223218


In [11]:
node = 'god'
collocate = 'great'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq_agalma_megas)

expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas

print('Mu: ', mu_deps)

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_agalma_megas_deps)

deltaP_5 = deltaP(node, collocate)
print('Delta P: ', deltaP_5)

Observed Freq:  1
Mu:  2.341525372775373
Mutual Information:  1.2274486711691057
Delta P:  0.0012806986548452477


Or we can count when the collocate and node co-occur within a given window. The `count_ngram_collocations` helper defined inside `deltaP` above takes this approach.

You can experiment in your own notebooks by adjusting the `l_size` and `r_size` args passed to the `count_ngram_collocations` function.
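
The window-based count can be sketched on a plain list of lemmata, which makes the effect of `l_size` and `r_size` easy to see. The function name `count_window_cooccurrences` and the toy sentence are assumptions made for this example. Unlike the helper inside `deltaP`, this version leaves the node's own position out of the window, so a word is never counted as its own collocate.

```python
def count_window_cooccurrences(lemmata, w1, w2, l_size=1, r_size=1):
    """Count occurrences of w1 that have w2 within l_size/r_size positions."""
    count = 0
    for i, lemma in enumerate(lemmata):
        if lemma != w1:
            continue
        left = max(i - l_size, 0)
        right = min(i + r_size + 1, len(lemmata))
        # left context plus right context, skipping the node itself
        window = lemmata[left:i] + lemmata[i + 1:right]
        if w2 in window:
            count += 1
    return count

# hypothetical lemmatized sentence
toy = ["the", "great", "god", "of", "the", "sea", "and", "the", "god"]

narrow = count_window_cooccurrences(toy, "god", "the")                      # 1
wide = count_window_cooccurrences(toy, "god", "the", l_size=2, r_size=2)    # 2
print(narrow, wide)
```

With a window of one, the adjective in "the great god" hides the article from the node; widening the window to two picks it up. A dependency-based count would not depend on this distance.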

## Exercises

You may work in groups for these exercises. They are due at the beginning of next class. You can submit them as a link to a Colab notebook or GitHub CodeSpace.

0. Use the corpora that you assembled last week (Pausanias++).
1. Using programming techniques from the course so far, find other potential collocates for a word of your choice (indicative verb).
2. Calculate the μ and Mutual Information scores for at least 5 of these collocate pairs. How do your results change depending on your definition of a collocation? What might these changes mean? (Write your answers to these questions down.)
3. Calculate the Delta P for these same five pairs. Do any results stand out? Why? What might they tell us about your corpus?