## Lab 4 Submission
### Rohan and Aadit

In [3]:
# I've created a utils.py file for frequently reused functionality -- you can import from it like so
from utils import load_pausanias

pausanias_df = load_pausanias('grc') # you can use `load_pausanias('eng')` to load the English version

When calculating co-occurrences in Greek, it is generally insufficient to use the L and R windows that Brezina uses for English (@Brezina2018 67–70). Instead, we'll look for a dependency relationship between the **node** and its **collocates**. Below, you can see that we can access the dependencies of a token through its `children` property.

In [None]:
test_token = pausanias_df['nlp_docs'][0][1]

# we use a list comprehension to evaluate the generator at `test_token.children`
f"token: '{test_token}, {test_token.lemma_}', dependencies: {[(c, c.lemma_) for c in test_token.children]}"

"token: 'ἠπείρου, ἤπειρος', dependencies: [(τῆς, 'ὁ'), (Ἑλληνικῆς, 'Ἑλληνικός')]"

In [None]:
# Expected frequency of a collocation (the denominator of Mu)

from collections import Counter
import pandas as pd

def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    """
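    Expected frequency of `node` and `collocate` co-occurring by chance:
    (node count * collocate count * window size) / total number of tokens.

    `node` and `collocate` should be the string representations
    of the associated lemmata.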
    """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)



Notice that we're also accessing the `lemma_` property here. Because Greek is heavily inflected, we'll tend to focus on collocations of lemmata, rather than types -- but you might find in your own work that it is interesting to look at type collocations instead. Just be sure to note which kind of "word" you're examining.
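
To make the arithmetic behind the expected frequency concrete, here is a minimal sketch that uses made-up counts rather than the Pausanias corpus. The helper names (`expected_frequency`, `mu`, `mutual_information`) and all of the numbers are assumptions for illustration only. The formulas mirror what the cells in this notebook compute: Mu is observed over expected frequency, and Mutual Information is the base-2 logarithm of Mu.

```python
import math

def expected_frequency(node_count, collocate_count, corpus_size, window_size=1):
    # same formula as expected_frequency_of_collocation, but on plain integers
    return (node_count * collocate_count * window_size) / corpus_size

def mu(observed, expected):
    # ratio of observed to expected co-occurrence frequency
    return observed / expected

def mutual_information(observed, expected):
    # base-2 logarithm of Mu
    return math.log2(observed / expected)

# hypothetical counts, not taken from the corpus
expected = expected_frequency(node_count=10, collocate_count=20, corpus_size=1000)
observed = 4

print('Expected Freq: ', expected)
print('Mu: ', mu(observed, expected))
print('Mutual Information: ', mutual_information(observed, expected))
```

A Mu above 1 (equivalently, a positive Mutual Information) means the pair co-occurs more often than chance alone would predict under these counts.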

### Frequency of co-occurrence

The frequency of co-occurrence counts how often a **node** (`w1`) and a **collocate** (`w2`) appear together. Given a DataFrame like `pausanias_df`, we can calculate the frequency of co-occurrence in two different ways.

We can either count when the collocate is a dependency of the node, like so:

In [None]:
import math

node = 'ἄγαλμα' # statue
collocate = 'θεός' # god

def count_dependency_collocations(x, w1, w2):
    # number of tokens in doc `x` with lemma `w1` that have a child with lemma `w2`
    return len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

pausanias_df['dependency_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

# observed frequency = number of rows of `pausanias_df` with at least one node -> collocate dependency
observed_dep_freq = pausanias_df[pausanias_df['dependency_collocations'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

# Mu is the ratio of observed to expected frequency
mu_deps = observed_dep_freq / expected_freq

print('Mu: ', mu_deps)

# Mutual Information is the base-2 logarithm of Mu
mutual_information_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_deps)



Observed Freq:  9
Mu:  4.920898794300329
Mutual Information:  2.298921845575457


In [None]:
node = 'μέγας' # great
collocate = 'θεός' # god

# reuses `count_dependency_collocations` and `math` from the first cell above
pausanias_df['dependency_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq = pausanias_df[pausanias_df['dependency_collocations'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq / expected_freq

print('Mu: ', mu_deps)

mutual_information_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_deps)

Observed Freq:  4
Mu:  4.1311249137336095
Mutual Information:  2.0465346839411716


In [None]:
node = 'μήτηρ' # mother
collocate = 'θεός' # god

# reuses `count_dependency_collocations` and `math` from the first cell above
pausanias_df['dependency_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq = pausanias_df[pausanias_df['dependency_collocations'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq / expected_freq

print('Mu: ', mu_deps)

mutual_information_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_deps)

Observed Freq:  10
Mu:  32.024896915287854
Mutual Information:  5.001122021580986


In [None]:
node = 'ναός' # temple
collocate = 'θεός' # god

# reuses `count_dependency_collocations` and `math` from the first cell above
pausanias_df['dependency_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq = pausanias_df[pausanias_df['dependency_collocations'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq / expected_freq

print('Mu: ', mu_deps)

mutual_information_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_deps)

Observed Freq:  6
Mu:  5.1383767185428155
Mutual Information:  2.361312664868964


In [None]:
node = 'ἱερόν' # sanctuary
collocate = 'θεός' # god

# reuses `count_dependency_collocations` and `math` from the first cell above
pausanias_df['dependency_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq = pausanias_df[pausanias_df['dependency_collocations'] > 0].shape[0]

print('Observed Freq: ', observed_dep_freq)

expected_freq = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq / expected_freq

print('Mu: ', mu_deps)

mutual_information_deps = math.log(mu_deps, 2)

print('Mutual Information: ', mutual_information_deps)

NameError: name 'pausanias_df' is not defined

Or we can count when the collocate and node co-occur within a given window, as follows:

You can experiment in your own notebooks by adjusting the `l_size` and `r_size` args passed to the `count_ngram_collocations` function.
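
A window-based count along these lines might look like the following sketch. The signature of `count_ngram_collocations` here is an assumption based only on the `l_size` and `r_size` arguments mentioned above, and it operates on a plain list of lemma strings (a made-up toy sequence) so that it runs without the corpus or spaCy.

```python
from typing import Sequence

def count_ngram_collocations(lemmata: Sequence[str], w1: str, w2: str,
                             l_size: int = 1, r_size: int = 1) -> int:
    """
    Count occurrences of `w1` that have `w2` within `l_size` tokens
    to the left or `r_size` tokens to the right.
    (Hypothetical signature; only `l_size` and `r_size` come from the text.)
    """
    count = 0
    for i, lemma in enumerate(lemmata):
        if lemma != w1:
            continue
        left = lemmata[max(0, i - l_size):i]     # up to l_size tokens before the node
        right = lemmata[i + 1:i + 1 + r_size]    # up to r_size tokens after the node
        if w2 in left or w2 in right:
            count += 1
    return count

# toy lemma sequence, invented for illustration
toy = ['ὁ', 'ἄγαλμα', 'ὁ', 'θεός', 'καί', 'ναός']

narrow = count_ngram_collocations(toy, 'ἄγαλμα', 'θεός', l_size=1, r_size=1)
wide = count_ngram_collocations(toy, 'ἄγαλμα', 'θεός', l_size=2, r_size=2)
print(narrow, wide)
```

In the notebook itself, one would presumably build the lemma list for each row with something like `[t.lemma_ for t in doc]` and apply the function over `pausanias_df['nlp_docs']`. Widening the window lets the intervening article through, which is one reason window counts and dependency counts can disagree.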

## Exercises

You may work in groups for these exercises. They are due at the beginning of next class. You can submit them as a link to a Colab notebook or GitHub CodeSpace.

0. Use the corpora that you assembled last week (Pausanias++).
1. Using programming techniques from the course so far, find other potential collocates for a word of your choice (indicative verb).
2. Calculate the μ and Mutual Information scores for at least 5 of these collocate pairs. How do your results change depending on your definition of a collocation? What might these changes mean? (Write your answers to these questions down.)
3. Calculate the Delta P for these same five pairs. Do any results stand out? Why? What might they tell us about your corpus?
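
For the Delta P exercise, one possible starting point is the sketch below. It assumes the common formulation of Delta P over a 2x2 contingency table; the cell labels `a`, `b`, `c`, `d`, the function name `delta_p`, and the counts are all hypothetical and not taken from the corpus. Unlike Mu and Mutual Information, Delta P is directional, so the sketch returns both directions.

```python
def delta_p(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """
    a: contexts containing both w1 and w2
    b: contexts containing w1 but not w2
    c: contexts containing w2 but not w1
    d: contexts containing neither
    Returns (Delta P of w2 given w1, Delta P of w1 given w2).
    """
    # how much more likely w2 is when w1 is present than when it is absent
    w2_given_w1 = a / (a + b) - c / (c + d)
    # how much more likely w1 is when w2 is present than when it is absent
    w1_given_w2 = a / (a + c) - b / (b + d)
    return w2_given_w1, w1_given_w2

# made-up counts for illustration
forward, backward = delta_p(a=10, b=40, c=90, d=860)
print('Delta P (w2 | w1): ', forward)
print('Delta P (w1 | w2): ', backward)
```

A large gap between the two directions suggests that one word predicts the other much better than the reverse.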