# Experimentation for AOC computational poetry book

## First attempt: NLTK Ngram Language Modeling

See [nltk.lm package](https://www.nltk.org/api/nltk.lm.html) for relevant documentation.

Note that to calculate [entropy](https://www.nltk.org/api/nltk.lm.api.html?highlight=perplexity#nltk.lm.api.LanguageModel.entropy) (i.e. cross-entropy) the [function](https://www.nltk.org/_modules/nltk/lm/api.html#LanguageModel.entropy) takes a sequence of ngram tuples and calculates the negative mean of the logscore (base-2 log probability) for each ngram, scoring the last word of each ngram given the words before it. The [perplexity](https://www.nltk.org/api/nltk.lm.api.html?highlight=perplexity#nltk.lm.api.LanguageModel.perplexity) is 2^cross-entropy for the text.
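Written out as a formula, the relationship looks roughly like this. The symbols are my own notation, not NLTK's: $N$ is the number of ngrams passed in, $w_i$ is the last word of the $i$-th ngram, and $c_i$ is the words before it.

$$
H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid c_i), \qquad \mathrm{PP} = 2^{H}
$$

So a lower entropy (and perplexity) means the model found the ngrams more predictable.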

### Test run following [this](https://www.nltk.org/api/nltk.lm.html) documentation

In [237]:
from nltk.lm.preprocessing import padded_everygram_pipeline

text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]
n = 3
train, vocab = padded_everygram_pipeline(n, text)

In [238]:
from nltk.lm import MLE
lm = MLE(n)
lm.fit(train, vocab)

In [239]:
print(lm.vocab, lm.counts)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 9 items> <NgramCounter with 3 ngram orders and 45 ngrams>


In [240]:
lm.vocab.lookup(text[0]), lm.vocab.lookup(["aliens", "from", "Mars"])

(('a', 'b', 'c'), ('<UNK>', '<UNK>', '<UNK>'))

In [241]:
lm.counts['a'], lm.counts[['a']]['b'], lm.counts[['a', 'b']]['c'], 

(2, 1, 1)

Calculate the score (probability) and logscore (base-2 log probability) of "b" given that it is preceded by "a":

In [242]:
lm.score("b", ["a"]), lm.logscore("b", ["a"])

(0.5, -1.0)

In [243]:
lm.logscore("b", ["a", "a"]), lm.logscore("d", ["a", "c"])

(-inf, 0.0)

In [244]:
test = [('a', 'b'), ('c', 'd')]
lm.entropy(test), lm.perplexity(test)

(1.292481250360578, 2.449489742783178)

In [245]:
test = [('a', 'b'), ('a', 'b')]
lm.entropy(test), lm.perplexity(test)

(1.0, 2.0)
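As a sanity check, the two results above can be reproduced by hand without NLTK. This is a minimal sketch, not NLTK's implementation; the function names are made up for illustration. The probabilities come from the padded toy corpus: "a" is followed by "b" once out of two times, and "c" is followed by "d" once out of three times (the other two being "e" and the end-of-sentence padding).

```python
import math

def cross_entropy(probs):
    """Negative mean of the base-2 log probabilities, in bits."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(probs)

# Hand-derived probabilities for the toy corpus (see the note above).
first_test = [1 / 2, 1 / 3]   # P(b | a), P(d | c)
second_test = [1 / 2, 1 / 2]  # P(b | a) twice

print(cross_entropy(first_test), perplexity(first_test))
print(cross_entropy(second_test), perplexity(second_test))
```

The printed values should agree with the `lm.entropy` and `lm.perplexity` outputs above, up to floating-point rounding.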

### Try with actual writing data

In [246]:
import string

def simple_tokenize(s):
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))
    return s.split(" ")

data = []
with open("part_a.txt", "r") as fle:
    for line in fle:
        if line[0] == "[":
            text = line.split("]")[1]
            data.append(simple_tokenize(text.strip()))

(' ').join(data[0]), (' ').join(data[21])

('i visualize two paths my life might take the one where i have a child and the one where i do not i try to place them on equal footing imagining each as something i truly want though i know both will be filled with disappointment and loss',
 'we learn that our dog can still lick his wound even with the cone he nudges the cone up his neck and holds it there with his knee then with his everflexible neck he can reach beyond the edge of the cone to his genitals what is the point of this cone')

In [247]:
n = 3
train, vocab = padded_everygram_pipeline(n, data)

In [248]:
lm = MLE(n)
lm.fit(train, vocab)

In [249]:
print(lm.vocab, lm.counts)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 3103 items> <NgramCounter with 3 ngram orders and 50337 ngrams>


In [250]:
lm.logscore("dog", ["my"])

-5.139551352398794

In [251]:
test = [("i", "truly"), ("truly", "want")]

lm.entropy(test), lm.perplexity(test)

(5.23176218663559, 37.57658845611188)

### Okay now try with a model that smooths (i.e. can deal with unseen ngrams)

In [252]:
from nltk.lm import KneserNeyInterpolated
n = 4 
train, vocab = padded_everygram_pipeline(n, data)

model = KneserNeyInterpolated(n) 
model.fit(train, vocab)

In [253]:
test = [("i", "truly", "am"), ("i", "truly", "was")]

model.entropy(test), model.perplexity(test)

(16.66375382337209, 103822.08377491328)
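To see why smoothing matters, here is a small sketch on the toy corpus from the test run. Note the swap: it uses add-one (Laplace) smoothing on unpadded bigrams as a simpler stand-in, not Kneser-Ney, and all the function names are hypothetical. The point is only that an unsmoothed estimate gives an unseen ngram a log score of `-inf`, while a smoothed one stays finite.

```python
import math
from collections import Counter

corpus = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

bigram_counts = Counter()
context_counts = Counter()
vocab = set()
for sentence in corpus:
    vocab.update(sentence)
    # No padding symbols here, unlike padded_everygram_pipeline.
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def mle_logscore(word, prev):
    """Unsmoothed estimate: unseen bigrams get -inf."""
    count = bigram_counts[(prev, word)]
    if count == 0:
        return float('-inf')
    return math.log2(count / context_counts[prev])

def laplace_logscore(word, prev):
    """Add-one smoothing: every bigram gets a nonzero probability."""
    numerator = bigram_counts[(prev, word)] + 1
    denominator = context_counts[prev] + len(vocab)
    return math.log2(numerator / denominator)

print(mle_logscore('f', 'a'), laplace_logscore('f', 'a'))
```

With a single `-inf` in the list, the mean (and therefore the entropy) of a whole test sequence becomes infinite, which is why the MLE model is not usable for scoring arbitrary joins.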

In [254]:
test = [data[0][0:3], data[0][1:4], data[0][2:5]]

model.entropy(test)

0.07920825408287488

In [255]:
test = [data[0], data[1]]

model.entropy(test)

0.536621703609348

In [256]:
model.entropy(data)

0.23284617042905795

Note that `logscore` effectively calculates the likelihood of a word given only the previous n-1 words, regardless of how many words are in the context list passed as the second argument. That is, the two cells below have the same output, even though the first one passes all of the poem's preceding words and the second one passes only the three words before the last one (n-1 = 3 for n = 4).

In [257]:
model.logscore(data[0][-1], data[0][:-1])

-0.014475099493263041

In [258]:
model.logscore(data[0][-1], data[0][-4:-1])

-0.014475099493263041
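The behavior observed in these two cells can be pictured as trimming the context before scoring. The helper below is hypothetical and only illustrates the observed effect; it is not a claim about how NLTK arrives at the result internally.

```python
def effective_context(context, order):
    """Return the last order-1 words, i.e. the part of the context
    that an order-n model can actually condition on."""
    if order <= 1:
        return ()
    return tuple(context[-(order - 1):])

words = ['i', 'know', 'both', 'will', 'be', 'filled']
# For a 4-gram model, only the last three context words matter.
print(effective_context(words, 4))
```

In practice this means there is no need to slice the poem before calling `logscore`.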

In [259]:
def score_join_ngram(a, b):
    """a and b should be 'poems' from the dataset, where each one is a list of words"""
    joining_ngrams = [a[-n+i:] + b[:i] for i in range(1,n)]
    return model.entropy(joining_ngrams)

score_join_ngram(data[0], data[1])

8.488897042811038
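To make the list comprehension in `score_join_ngram` easier to read, here is a sketch that pulls out just the ngram construction and runs it on toy lists. The function name and the toy data are made up for illustration; `n` is passed explicitly instead of read from the notebook's global.

```python
def joining_ngrams(a, b, n):
    """Build the n-1 ngrams that straddle the boundary between the
    end of list a and the start of list b."""
    return [a[-n + i:] + b[:i] for i in range(1, n)]

end_of_a = ['a1', 'a2', 'a3', 'a4']
start_of_b = ['b1', 'b2', 'b3', 'b4']
for ngram in joining_ngrams(end_of_a, start_of_b, 4):
    print(ngram)
# ['a2', 'a3', 'a4', 'b1']
# ['a3', 'a4', 'b1', 'b2']
# ['a4', 'b1', 'b2', 'b3']
```

Each ngram ends on a word from `b`, so the entropy measures how predictable the first few words of `b` are given the end of `a`. If either list has fewer than n-1 words, the slices simply come out shorter.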

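Shared function for all the scoring methods (see below). Right now it scores each pair of poems in one direction only (the `i > j` half of the matrix) and ignores asymmetries. This is for the sake of speed and should eventually be removed.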

In [265]:
import numpy as np

def calc_all_scores(data, score_join_func):
    scores = np.empty([len(data), len(data)])
    for i in range(len(data)):
        for j in range(len(data)):
            if i <= j:
                s = np.nan
            else:
                s = score_join_func(data[i], data[j])
            scores[i][j] = s
    return scores
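Once speed is less of a concern, the one-direction restriction could be lifted along these lines. This is a sketch under the assumption that the score function is asymmetric, so joining a to b can differ from joining b to a; the function name is hypothetical, and it does roughly twice the work of `calc_all_scores`.

```python
import numpy as np

def calc_all_scores_both_directions(data, score_join_func):
    """Score every ordered pair (i, j) with i != j; the diagonal
    (a poem joined to itself) is left as NaN."""
    size = len(data)
    scores = np.full((size, size), np.nan)
    for i in range(size):
        for j in range(size):
            if i != j:
                scores[i, j] = score_join_func(data[i], data[j])
    return scores

# Toy asymmetric score: difference in length between the two lists.
toy_data = [['x'], ['x', 'y'], ['x', 'y', 'z']]
toy_scores = calc_all_scores_both_directions(
    toy_data, lambda a, b: len(a) - len(b)
)
print(toy_scores)
```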

In [266]:
len(data)

168

In [267]:
# join_scores = [score_join(data[i], data[i+1]) for i in range(0, len(data)-1)]
join_scores = calc_all_scores(data[:20], score_join_ngram)
join_scores

array([[       nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan],
       [8.18718355,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan],
       [7.87419281, 8.10650522,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,        nan,        nan],
       [8.23301806, 8.20139822, 8.2381051 ,        nan,        nan,
               nan,        nan,        nan,        nan,        nan,
               nan,        nan,        nan,  

In [280]:
def get_top_scores(join_scores, thisdata, asc=True, n=5):
    flattened = join_scores.flatten()
    if asc:
        sorted_indices = np.argsort(flattened)
    else:
        sorted_indices = np.argsort(-flattened)
    indices_2d = np.unravel_index(sorted_indices, join_scores.shape)
    sorted_indices_list = list(zip(indices_2d[0], indices_2d[1]))
    
    for i, val in enumerate(sorted_indices_list[0:n]):
        print('indices:', val)
        print('score:', join_scores[val[0],val[1]])
        print('>>>', ' '.join(thisdata[val[0]]), '\n>>>>', ' '.join(thisdata[val[1]]), '\n')

    return sorted_indices_list

In [285]:
testdata = ['a', 'b', 'aa', 'bb']
testscores = np.array([[np.nan, 1, 10, 0], [np.nan, np.nan, 0, 10], [np.nan, np.nan, np.nan, 0], [np.nan, np.nan, np.nan, np.nan]])
x = get_top_scores(testscores, testdata, asc=False, n=8)

indices: (0, 2)
score: 10.0
>>> a 
>>>> a a 

indices: (1, 3)
score: 10.0
>>> b 
>>>> b b 

indices: (0, 1)
score: 1.0
>>> a 
>>>> b 

indices: (0, 3)
score: 0.0
>>> a 
>>>> b b 

indices: (1, 2)
score: 0.0
>>> b 
>>>> a a 

indices: (2, 3)
score: 0.0
>>> a a 
>>>> b b 

indices: (0, 0)
score: nan
>>> a 
>>>> a 

indices: (1, 0)
score: nan
>>> b 
>>>> a 
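In the test output above, the NaN entries show up after all the real scores, because `np.argsort` places NaN values at the end in both sort directions. If those entries are unwanted, they could be filtered out before printing. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def ranked_pairs(scores, ascending=True):
    """Return (row, col) index pairs sorted by score, skipping NaN."""
    flat = scores.flatten()
    order = np.argsort(flat if ascending else -flat)
    # Drop the positions whose score is NaN.
    order = order[~np.isnan(flat[order])]
    rows, cols = np.unravel_index(order, scores.shape)
    return list(zip(rows.tolist(), cols.tolist()))

toy = np.array([[np.nan, 1, 10],
                [np.nan, np.nan, 0],
                [np.nan, np.nan, np.nan]])
print(ranked_pairs(toy))
print(ranked_pairs(toy, ascending=False))
```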



In [286]:
x = get_top_scores(join_scores, data)

indices: (19, 12)
score: 5.047822885747899
>>> i dont know kirkegaard but i can give you some poetry in derek walcotts love after love he implores me to love myself you will love again the stranger who was your self what stranger i have become so worried at the edge of the sea 
>>>> for long stretches of time i accept having to wait for the winds of chance to determine my fate i feel unrushed but then it descends on me the panic of uncertainty the fear of words like “never” i try to avoid catastrophic thinking but how to avoid my most central tendencies perhaps avoidance is the wrong way to view my worries you dont break habits you replace them but i fear replacing worry with hope 

indices: (13, 7)
score: 6.265058387215162
>>> i can kind of feel my brain wisening up to its anxiety like perhaps i am just feeling sad and the object of the sadness would be something else if i didn’t have this particular anxiety in mind i can feel the lightening my perspective on reality shifts like a lon

## Second attempt: edit distance

NLTK [Levenshtein edit distance](https://www.nltk.org/api/nltk.metrics.distance.html#nltk.metrics.distance.edit_distance_align) documentation. Edit distance is being calculated at the character level.

In [287]:
from nltk.metrics.distance import edit_distance

# testing

dog = "who is the dog here?"

edit_distance('test', 'task'), edit_distance(dog, "whose dog is here?"), edit_distance(dog, "who is the cat here?", substitution_cost=5)

(2, 8, 6)
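For intuition about what the character-level distance is counting, here is a sketch of the standard dynamic-programming recurrence for Levenshtein distance. It is not NLTK's code, and it only covers insertions, deletions, and substitutions with an adjustable substitution cost.

```python
def levenshtein(s, t, substitution_cost=1):
    """Minimum total cost of insertions (1), deletions (1), and
    substitutions (substitution_cost) turning string s into t."""
    # previous[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else substitution_cost
            current.append(min(
                previous[j] + 1,        # delete char_s
                current[j - 1] + 1,     # insert char_t
                previous[j - 1] + cost  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein('test', 'task'))
```

With a substitution cost of 5, swapping "dog" for "cat" is cheaper as three deletions plus three insertions, which is where the 6 in the output above comes from.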

In [289]:
def score_join_edit(a, b, n=20):
    """a and b should be 'poems' from the dataset, where each one is a list of words"""
    return edit_distance(' '.join(a)[-n:], ' '.join(b)[:n])

' '.join(data[0])[-20:],' '.join(data[1])[:20], score_join_edit(data[0], data[1])

('appointment and loss', 'to push away disappo', 19)

In [290]:
join_scores = calc_all_scores(data, score_join_edit)

In [294]:
x = get_top_scores(join_scores, data, asc=True)

indices: (96, 58)
score: 9.0
>>> i am on the verge of tears the edge of crying i look over the precipice and it is no good out there id like to stay here where the emotions live behind the eyes where im still able to walk to work and make myself dinner i am overflowing with what with a lack of breath with a fear of the future with a desire to stay home with a hope that the people close to me can hold me right with a weight that bears down so hard on my ankles they swell at the elastic of my socks words should contain this much words should be on the verge should look out over the precipice should want so much more than they have to give the best books end just after the words do they hang around in the room with you swiveled and swirling that amorphous ghost they have birthed not ready not quite yet to leave 
>>>> i stare at the leaves which skirt the tree and look up at those yet to fall the backside of each leaf being a pale reflection colors are muted but still its their range the s

## Third attempt: word overlap

Just use the intersection function of sets: count the words shared by the end of one poem and the start of the next.

In [295]:
n = 10

x = set(data[0][-n:])
y = set(data[1][:n])
' '.join(x), ' '.join(y), ' '.join(x.intersection(y)), len(x.intersection(y))

('both with filled i loss know and will disappointment be',
 'push “worst” to away prepare i the for disappointment',
 'i disappointment',
 2)
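The score in this section relies on set intersection (the words the two windows share) rather than union (all the words in either window). A small sketch with made-up word lists to show the difference:

```python
# Hypothetical toy windows, not taken from the dataset.
end_of_a = ['on', 'the', 'mat']
start_of_b = ['the', 'mat', 'was']

shared = set(end_of_a) & set(start_of_b)    # intersection
combined = set(end_of_a) | set(start_of_b)  # union

print(sorted(shared), len(shared))
print(sorted(combined), len(combined))
```

Because sets drop repeated words, a word that appears several times in a window still counts only once toward the overlap.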

In [296]:
def score_join_overlap(a, b, n=10):
    """a and b should be 'poems' from the dataset, where each one is a list of words"""
    x = set(a[-n:])
    y = set(b[:n])
    return len(x.intersection(y))

# join_scores = [score_join(data[i], data[i+1]) for i in range(0, len(data)-1)]

In [298]:
join_scores = calc_all_scores(data, score_join_overlap)

In [299]:
x = get_top_scores(join_scores, data, asc=True)

indices: (49, 24)
score: 0.0
>>> i remember reading ann lammot talking about the day her book is released you expect something to happen she writes but very little does that moment when the book turns into something that anyone can read that moment you think it becomes real is a small wave that lulls onto the shore and then retracts leaving the bubbling of air from the mollusks and crabs this is not what ive been told of birthing a human instead the moment when a child goes from something within you to something without is i am told momentous still i suspect that mostly we are made by the soft waves 
>>>> it does not feel like there is anything inside me that is alive no more alive than the rest of me my lungs and heart and liver all living within i am up all night vomiting and remember that 

indices: (112, 36)
score: 0.0
>>> i am breech born baby backwards tending to this head i can feel just below my ribs at my touch he may shift around little bird inside me unsure how to fly he won