# 1.3 Statistical Dependence

In [1]:
import numpy as np
from utils1 import load_dataset, counter, simple_tokenize

## Load Dataset and Preprocess

In [2]:
text = load_dataset("King James Bible")
tokens = simple_tokenize(text, remove_symbols=True)
word_to_count = counter(tokens)
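
The helpers imported from `utils1` are not shown in this notebook. As a rough guide to what the preprocessing step probably does, here is a minimal sketch. It assumes `simple_tokenize` lowercases the text and replaces symbols with spaces, and that `counter` maps each item to its number of occurrences. The names `simple_tokenize_sketch` and `counter_sketch` are hypothetical stand-ins, not the actual `utils1` implementations.

```python
import re
from collections import Counter


def simple_tokenize_sketch(text: str, remove_symbols: bool = True) -> list[str]:
    """Hypothetical stand-in for utils1.simple_tokenize (assumed behaviour)."""
    text = text.lower()
    if remove_symbols:
        # Replace everything except letters, digits and whitespace with a space.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def counter_sketch(items) -> dict:
    """Hypothetical stand-in for utils1.counter: item -> number of occurrences."""
    return dict(Counter(items))


sample = "In the beginning God created the heaven and the earth."
sample_tokens = simple_tokenize_sketch(sample)
sample_counts = counter_sketch(sample_tokens)
print(sample_tokens)
print(sample_counts["the"])
```

Under this assumption punctuation disappears entirely, so words on either side of a comma become adjacent tokens. This matters for the lowest-PMI pairs discussed later.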

In [3]:
IGNORE_COUNTS = 10
# Build adjacent (X_{i}, X_{i+1}) token pairs and count each distinct pair,
# keeping only pairs in which both words occur at least IGNORE_COUNTS times
token_pairs = tuple(zip(tokens, tokens[1:]))
assert token_pairs[0] == (tokens[0], tokens[1]) and token_pairs[-1] == (tokens[-2], tokens[-1])
pair_to_counts = counter([
    x for x in token_pairs
    if word_to_count[x[0]]>=IGNORE_COUNTS and word_to_count[x[1]]>=IGNORE_COUNTS
])

## Calculate PMI

In [4]:
pair_pmi: list[tuple[str, str, float]] = []
for (x, y), cnt in pair_to_counts.items():
    # PMI(x, y) = log(p(x, y) / (p(x) * p(y))), with each probability estimated as count / len(tokens)
    pmi = np.log((cnt * len(tokens)) / (word_to_count[x] * word_to_count[y]))
    pair_pmi.append((x, y, pmi))
pair_pmi = sorted(pair_pmi, key=lambda x: x[-1])
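
To see how the formula behaves, the same computation can be run on a tiny made-up sequence. This is only an illustrative sketch: `toy_tokens` and `toy_pmi` are hypothetical names, and the toy data is not from the dataset. As in the cell above, `p(x, y)` is estimated as the pair count divided by the number of tokens (strictly there is one fewer pair than tokens, a negligible difference on a large corpus).

```python
import math
from collections import Counter

# Made-up toy sequence, used only to illustrate the formula.
toy_tokens = "a b a b c c a b".split()
toy_word_counts = Counter(toy_tokens)
toy_pair_counts = Counter(zip(toy_tokens, toy_tokens[1:]))


def toy_pmi(x: str, y: str) -> float:
    """PMI of the adjacent pair (x, y); the pair must occur at least once."""
    n = len(toy_tokens)
    return math.log(
        toy_pair_counts[(x, y)] * n / (toy_word_counts[x] * toy_word_counts[y])
    )


# "a" is followed by "b" every time, so the pair is more frequent than chance.
print(toy_pmi("a", "b") > 0)
# "b" is followed by "a" only once, which is less frequent than chance.
print(toy_pmi("b", "a") < 0)
```

A positive PMI means the pair occurs more often than it would if the two words were independent, and a negative PMI means it occurs less often.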

## Results

In [5]:
print("Highest 20 PMI:")
for x, y, pmi in pair_pmi[-1:-21:-1]:
    print(f"{x:<16} {y:<16} {pmi:>9.6f}")

Highest 20 PMI:
shadrach         meshach          10.804800
badgers          skins            10.329681
aha              aha              10.075285
ill              favoured         10.026495
judas            iscariot          9.884665
curious          girdle            9.721113
brook            kidron            9.717611
jonas            lovest            9.710642
poureth          contempt          9.669820
measuring        reed              9.633350
persecution      ariseth           9.574510
divers           colours           9.565460
mary             magdalene         9.505848
precept          precept           9.479200
overflowing      scourge           9.392188
shem             ham               9.301711
wreathen         chains            9.277778
sharp            sickle            9.264355
fiery            furnace           9.264355
committeth       adultery          9.251110


In [6]:
print("Lowest 20 PMI:")
for x, y, pmi in pair_pmi[:20]:
    print(f"{x:<16} {y:<16} {pmi:>9.6f}")

Lowest 20 PMI:
of               of               -7.322357
his              the              -6.528247
of               to               -6.385284
of               he               -6.121739
the              said             -5.777407
of               is               -5.722350
that             and              -5.638593
to               in               -5.379939
shall            of               -5.371118
the              israel           -5.337212
of               and              -5.325498
unto             and              -5.277327
of               in               -5.218400
of               will             -5.122443
from             of               -5.070545
in               shall            -5.058920
and              son              -5.051178
then             and              -4.953314
the              he               -4.943330
his              in               -4.909552


## Discussion

- The word pairs with the 20 highest PMI values are mostly fixed phrases that are uncommon in everyday English. For example:
  - `shadrach meshach` is a pair of names that usually occur together rather than on their own. (A third name, Abednego, accompanies them, but it cannot be counted because the algorithm only considers two-word pairs.)
  - `judas iscariot` is a person's full name, and `brook kidron` is a place.
  - `badgers skins` appears to be an object used in the story, whose words rarely occur apart from each other. The same goes for `curious girdle`.

- The words in the 20 lowest-PMI pairs are all familiar from everyday English, and they typically occur without a fixed "partner" word. The pairs themselves, however, are unusual and often look grammatically incorrect. Still, a pair can only appear in this list if its two words occur next to each other at least once in the text. It is surprising that a pair like `of of` occurs at all, so let's look at the contexts:

In [7]:
CONTEXT_TOKENS_CNT = 10
for x, y, _ in pair_pmi[:10]:
    # First occurrence of the pair; token_pairs[i] starts at tokens[i]
    index = token_pairs.index((x, y))
    print(
        " ".join(tokens[max(0, index-CONTEXT_TOKENS_CNT):index]),
        " ".join(tokens[index:index+2]).upper(),
        " ".join(tokens[index+2:index+2+CONTEXT_TOKENS_CNT]),
    )

own sight and of the maidservants which thou hast spoken OF OF them shall i be had in honour therefore michal the
any manner of lost thing which another challengeth to be HIS THE cause of both parties shall come before the judges and
it and a flattering mouth worketh ruin boast not thyself OF TO morrow for thou knowest not what a day may bring
not in the blood of bullocks or of lambs or OF HE goats when ye come to appear before me who hath
and chief estates of galilee and when the daughter of THE SAID herodias came in and danced and pleased herod and them
is come upon me and that which i was afraid OF IS come unto me i was not in safety neither had
skins dyed red and a covering of badgers skins above THAT AND he made boards for the tabernacle of shittim wood standing
may bless thee in all that thou settest thine hand TO IN the land whither thou goest to possess it when thou
he also reap for he that soweth to his flesh SHALL OF the flesh reap corruption but he that soweth to the
this ru

- `of of` is an artifact of the tokenization. The original text reads `which thou hast spoken of, of them ...`, which is grammatically correct once the comma is restored. The same applies to `his the`.
- `of to` is part of `of to morrow`, where `to morrow` is an older spelling of `tomorrow`.
- `of he` is part of `of he goats`, where `he goats` is an older term for male goats.