# Experimental feature: Calculating lexical diversity of limericks to rank and select
Rami Ariss, March 30th 2022

NLTK (Natural Language Toolkit, https://www.nltk.org/) is a Python package for computing text statistics and metrics.

For example, we can tokenize each limerick with NLTK, calculate its lexical diversity, and rank a set of generated limericks by that score.

NLTK supports metrics such as:
- lexical diversity: the fraction of tokens in a text that are distinct (unique tokens / total tokens), so repeated words lower the score
- collocations: words used together often
- word counts
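
As a rough illustration of the three metrics above, the following sketch computes them with the standard library only. The whitespace tokenizer and the sample text are assumptions for illustration; NLTK's `word_tokenize` would split punctuation and contractions differently, and repeated bigrams are only a crude proxy for collocations.

```python
from collections import Counter

# Assumed sample text; a plain whitespace split stands in for a real tokenizer.
text = "the clouds in the sky and the clouds in the night"
tokens = text.split()

# Lexical diversity: distinct tokens divided by total tokens.
lexical_diversity = len(set(tokens)) / len(tokens)

# Word counts: how often each token occurs.
word_counts = Counter(tokens)

# Adjacent word pairs (bigrams) that occur more than once.
bigram_counts = Counter(zip(tokens, tokens[1:]))
repeated_bigrams = [pair for pair, n in bigram_counts.items() if n > 1]

print(f"lexical diversity: {lexical_diversity:.2f}")
print(word_counts.most_common(2))
print(repeated_bigrams)
```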

Resources:
- Parts of Speech Tags: https://www.ibm.com/docs/en/watson-explorer/10.0.0?topic=analytics-part-speech-tag-sets
- NLTK examples: https://www.nltk.org/book/ch01.html
- NLTK Pre-Process notebook example: https://colab.research.google.com/github/gal-a/blog/blob/master/docs/notebooks/nlp/nltk_preprocess.ipynb#scrollTo=0JzUMH4jdXm7

# Notebook Preparation

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Not connected to a GPU


In [5]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 
nltk.download('treebank')

import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import re
import string
import json
import glob
import os

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


# Data

Limerick files can be found under `/data/raw` in the GitHub repository. Upload them manually to `sample_data` in Colab (mounting caused issues).

In [26]:
DATA_DIR = ''
!ls $DATA_DIR

free_form.txt  limericks.json  once_upon_a_time.txt  sample_data  sample.txt
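
Since the files sit next to each other in one directory, a small helper can collect the generated-poem text files instead of naming each one. This is a sketch only: `list_poem_files` is a hypothetical helper, and treating an empty `DATA_DIR` as the current directory is an assumption based on the listing above.

```python
import glob
import os

def list_poem_files(data_dir=""):
    """Return sorted paths of .txt files directly under data_dir (hypothetical helper)."""
    # An empty data_dir falls back to the current working directory.
    pattern = os.path.join(data_dir or ".", "*.txt")
    return sorted(glob.glob(pattern))


DATA_DIR = ""
print(list_poem_files(DATA_DIR))
```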


## Load OEDILF Limericks

In [18]:
# use a context manager so the file handle is closed after loading
with open('limericks.json') as f:
  poems_json = json.load(f)

In [19]:
poems_json['limericks']['0']['lines']

["cap'n jack was washed over the side.",
 'his crew searched but found not hair nor hide.',
 'no longer the helm,',
 'but the deep benthic realm,',
 'is where jack will forever reside.']

In [20]:
len(poems_json['limericks'])

72432
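
The JSON layout shown above (a `limericks` mapping from string ids to records with a `lines` list) can be flattened into the list-of-lines format used for the generated poems. The helper name `json_to_poems` and the two-record sample are assumptions for illustration; only the `limericks` and `lines` keys come from the output above, and the sketch assumes every id is an integer string.

```python
def json_to_poems(poems_json):
    """Flatten an OEDILF-style dict into a list of poems (hypothetical helper).

    Each poem is returned as a list of line strings.
    """
    # Sort by numeric id so the order does not depend on dict ordering.
    ids = sorted(poems_json["limericks"], key=int)
    return [poems_json["limericks"][i]["lines"] for i in ids]


# Tiny stand-in with the same shape as the loaded file.
sample_json = {
    "limericks": {
        "1": {"lines": ["second poem, line one", "second poem, line two"]},
        "0": {"lines": ["cap'n jack was washed over the side.",
                        "his crew searched but found not hair nor hide."]},
    }
}

poems_from_json = json_to_poems(sample_json)
print(len(poems_from_json))
print(poems_from_json[0][0])
```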

# Generated Poems

In [69]:
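def generated_poems(fname):
  """Read blank-line-separated poems from a file and report their lexical diversity.

  :param fname: str, path to a text file of generated poems
  :return: tuple, (poems as lists of lines, list of lexical diversities)
  """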
  # read file
  with open(fname) as f:
    lines = f.readlines()
  
  # parse into list of poems
  poems = [[]]
  i = 0  # limerick index
  for l in lines:
    l = l.strip()  # remove '\n'
    # if empty line (new limerick)
    if len(l) == 0:
      poems.append([])
      i += 1
    else:
      poems[i].append(l)
  poems = poems[:-1] # exclude last blank

  lexical_diversity = calculate_lexical_diversities(poems)

  i_most_diverse = np.argmax(lexical_diversity)
  i_least_diverse = np.argmin(lexical_diversity)

  print(f'Number of poems: {len(poems)}\n Lexical Diversity:\n  Mean: {np.round(np.mean(lexical_diversity) * 100)}%\n  Max: {np.round(np.max(lexical_diversity)* 100)}%\n  Min: {np.round(np.min(lexical_diversity) * 100)}%')

  print(f'\nLexical Diversity: {np.round(lexical_diversity[i_most_diverse]*100)}%\n', format_poem(poems[i_most_diverse]))

  print(f'\nLexical Diversity: {np.round(lexical_diversity[i_least_diverse]*100)}%\n', format_poem(poems[i_least_diverse]))

  return poems, lexical_diversity
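
The parsing loop above always drops the last entry, so it relies on the file ending with a blank line. A variant that tolerates a missing trailing blank line or repeated blank lines might look like the sketch below. `split_poems` is a hypothetical name, and it works on a string rather than a file so it can be tried without the data files.

```python
import re

def split_poems(raw_text):
    """Split blank-line-separated text into poems (lists of stripped lines)."""
    # One or more blank lines (possibly containing spaces) separate poems.
    blocks = re.split(r"\n\s*\n", raw_text.strip())
    poems = []
    for block in blocks:
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:  # skip empty blocks
            poems.append(lines)
    return poems


raw = "line a1\nline a2\n\nline b1\nline b2\n\n\nline c1"
parsed = split_poems(raw)
print(len(parsed))
print(parsed[-1])
```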

In [72]:
free_form_poems, free_form_ld = generated_poems('free_form.txt')

Number of poems: 499
 Lexical Diversity:
  Mean: 81.0%
  Max: 100.0%
  Min: 39.0%

Lexical Diversity: 100.0%
 though einstein's theorem was sound
with its logic intact, it would soon
be considered in depth
how a math axiomal
of the whole world without end can't defeat

Lexical Diversity: 39.0%
 as a girl, were my dreams quite unfulfilled
were they bright? were they dull? were they slurred
were their faces dull? were they bright
were they dull? were they dull
were they faded? were they faded? were they blurbed


In [73]:
once_poems, once_ld = generated_poems('once_upon_a_time.txt')

Number of poems: 500
 Lexical Diversity:
  Mean: 82.0%
  Max: 100.0%
  Min: 51.0%

Lexical Diversity: 100.0%
 once upon a time
logometry's being sublime
we know geometry
and clear analogy
has made all your life in the slime

Lexical Diversity: 51.0%
 once upon a time
you thought that, to me, was a rhyme
you thought that was a rhyme
was a time, not a time
and it not was in many a crime


In [74]:
sample_poems, sample_ld = generated_poems('sample.txt')

Number of poems: 4980
 Lexical Diversity:
  Mean: 79.0%
  Max: 100.0%
  Min: 25.0%

Lexical Diversity: 100.0%
 to keep infantry marching in line
they defend us, we stand and define
the army of battalions
our strong british galleons
with self-uppance: there is a sign

Lexical Diversity: 25.0%
 tomorrow... tomorrow... tomorrow..
tomorrow... tomorrow... tomorrow... tomorrow....
tomorrow...... tomorrow.. tomorrow....
encumber...... tomorrow..
encumber..... tomorrow..


## Raw Poems for Testing

In [75]:
poems_raw = """
I'd rather watch the clouds in the sky
though don't climb to a field, and i'll try
make the sky always clear
because nobody's here
you'll say i'll not fly off the eye

I'd rather watch the clouds in the sky
but ignore it, take care if you'd die
if you fly on a fly
you'd be seen in the sky
or a comerfly, put your way high

I'd rather watch the clouds in the sky
and clouds, if i'd take on a try
of the cloud, and, say
that the clouds had held sway..
being such an observant was i

I'd rather watch the clouds in the sky
that were conically shaped like a pie
and they'd fall in the night
simply fall, not just right
and to fall, they could fall way up high

I'd rather watch the clouds in the sky
for his eyes with a scientist's eye
to observe and observe
to observe, observe, observe
are a change from my mind, smile and sigh

I'd rather watch the clouds in the sky
ae and stars that are bigger than i
far from sea to up high
from a view in the sky
help me up. so thanks to heaven, i'm high

I'd rather watch the clouds in the sky
i look over; my love, so i cry
if i give you my love
and you come up above
if i sit there? bye, dear, goodbye

I'd rather watch the clouds in the sky
like the clouds, though they fly way up high
cloudy arcs in the sky
all that arc as they fly
or the shadow that flies like the sky

I'd rather watch the clouds in the sky
at convenience store, purchase and buy
hop to shop for a day
do some think they're away
at the convenience store, purchase and buy

I'd rather watch the clouds in the sky
i tried hard just to climb up, and then try
to come up and to fly
i'm to get to the sky
it would sure come away with my sigh
"""

In [76]:
def raw_output_to_poems(raw_output: str):
    """Parse a raw text representation of multiple poems into a list of poems

    :param raw_output: poems as a single string, separated by blank lines (like the txt files we sent to Rita)
    :return: list of poems, where each poem is a list of lines
    """
    poems = raw_output.split('\n\n')
    poems = [poem.strip() for poem in poems]
    poems = [poem.split('\n') for poem in poems]
    return poems

In [5]:
poems = raw_output_to_poems(poems_raw)

# Lexical Diversity

In [64]:
def calculate_lexical_diversity(text):
  """Given a tokenized text, calculate its lexical diversity.

  :param text: list, tokenized text
  :return: float, lexical diversity
  """
  return len(set(text)) / len(text)

def calculate_lexical_diversities(poems):
  """Calculate lexical diversity for a set of poems

  :param poems: list, poems as lists of lines (strings)
  :return: list, lexical diversities (decimals)
  """
  lexical_diversity = []
  for poem in poems:
    # flatten each poem into a line
    flattened_poem = ' '.join(poem)

    # tokenize
    tokens = nltk.word_tokenize(flattened_poem)

    lexical_diversity.append(calculate_lexical_diversity(tokens))
  return lexical_diversity
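
Because `nltk.word_tokenize` returns punctuation as separate tokens and the ratio is case-sensitive, marks such as `...` and variants such as `I` and `i` all influence the score. One possible variant, shown as a sketch and not as the notebook's method, lowercases the text and keeps only word-like tokens. The regular expression and the name `word_level_diversity` are assumptions for illustration.

```python
import re

def word_level_diversity(poem_lines):
    """Distinct words / total words, ignoring case and punctuation.

    Returns 0.0 for a poem without any word-like tokens.
    """
    text = " ".join(poem_lines).lower()
    # Words may contain internal apostrophes, e.g. "i'd" or "nobody's".
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    if not words:
        return 0.0
    return len(set(words)) / len(words)


repetitive = ["tomorrow... tomorrow... tomorrow..", "encumber...... tomorrow.."]
varied = ["I'd rather watch the clouds", "because nobody's here"]
print(word_level_diversity(repetitive))
print(word_level_diversity(varied))
```

Scores from this variant are not comparable with the percentages reported in this notebook, since the token sets differ.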

In [66]:
def format_poem(poem):
  """Reformat a poem given as a list of lines for cleaner printing

  :param poem: list, each line is a string in list
  :return: str
  """
  return '\n'.join(poem)

In [None]:
def format_poems_json(poems_json):
  pass  # TODO: not implemented yet

In [22]:
nltk.word_tokenize(''.join(poems_json['limericks']['0']['lines']))

["cap'n",
 'jack',
 'was',
 'washed',
 'over',
 'the',
 'side.his',
 'crew',
 'searched',
 'but',
 'found',
 'not',
 'hair',
 'nor',
 'hide.no',
 'longer',
 'the',
 'helm',
 ',',
 'but',
 'the',
 'deep',
 'benthic',
 'realm',
 ',',
 'is',
 'where',
 'jack',
 'will',
 'forever',
 'reside',
 '.']
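
In the output above, `side.his` and `hide.no` appear as single tokens because the lines were concatenated with an empty separator, so the last word of one line runs into the first word of the next. The sketch below shows the effect on the joined strings only, without calling NLTK. Joining with a space, as `calculate_lexical_diversities` does, keeps the words apart.

```python
lines = [
    "cap'n jack was washed over the side.",
    "his crew searched but found not hair nor hide.",
    "no longer the helm,",
]

fused = "".join(lines)    # line boundaries disappear
spaced = " ".join(lines)  # line boundaries become spaces

print("side.his" in fused.split())
print("side.his" in spaced.split())
```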

In [12]:
lexical_diversity = calculate_lexical_diversities(poems)

In [13]:
i_most_diverse = np.argmax(lexical_diversity)
i_least_diverse = np.argmin(lexical_diversity)

print(f'Number of poems: {len(poems)}\n Lexical Diversity:\n  Mean: {np.round(np.mean(lexical_diversity) * 100)}%\n  Max: {np.round(np.max(lexical_diversity)* 100)}%\n  Min: {np.round(np.min(lexical_diversity) * 100)}%')

print(f'\nLexical Diversity: {np.round(lexical_diversity[i_most_diverse]*100)}%\n', format_poem(poems[i_most_diverse]))

print(f'\nLexical Diversity: {np.round(lexical_diversity[i_least_diverse]*100)}%\n', format_poem(poems[i_least_diverse]))

Number of poems: 10
 Lexical Diversity:
  Mean: 74.0%
  Max: 82.0%
  Min: 66.0%

Lexical Diversity: 82.0%
 I'd rather watch the clouds in the sky
though don't climb to a field, and i'll try
make the sky always clear
because nobody's here
you'll say i'll not fly off the eye

Lexical Diversity: 66.0%
 I'd rather watch the clouds in the sky
like the clouds, though they fly way up high
cloudy arcs in the sky
all that arc as they fly
or the shadow that flies like the sky
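
To go from scores to a selection, the poems can be sorted by lexical diversity and the top few kept. The sketch below assumes a list of scores aligned with the list of poems, as returned by `calculate_lexical_diversities`. The function name `select_most_diverse`, the parameter `k`, and the toy poems and scores are assumptions for illustration.

```python
def select_most_diverse(poems, scores, k=3):
    """Return the k (score, poem) pairs with the highest scores, best first."""
    if len(poems) != len(scores):
        raise ValueError("poems and scores must have the same length")
    # Sort on the score only; Python's sort is stable, so ties keep input order.
    ranked = sorted(zip(scores, poems), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


toy_poems = [["poem a"], ["poem b"], ["poem c"], ["poem d"]]
toy_scores = [0.74, 0.82, 0.66, 0.82]

for score, poem in select_most_diverse(toy_poems, toy_scores, k=2):
    print(f"{score:.0%}", "\n".join(poem))
```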


# Parts of Speech Tagging

In [14]:
tokens = nltk.word_tokenize(' '.join(poems[0]))
tagged = nltk.pos_tag(tokens)
tagged

[('I', 'PRP'),
 ("'d", 'MD'),
 ('rather', 'RB'),
 ('watch', 'VB'),
 ('the', 'DT'),
 ('clouds', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('sky', 'NN'),
 ('though', 'IN'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('climb', 'VB'),
 ('to', 'TO'),
 ('a', 'DT'),
 ('field', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('i', 'NN'),
 ("'ll", 'MD'),
 ('try', 'VB'),
 ('make', 'VB'),
 ('the', 'DT'),
 ('sky', 'NN'),
 ('always', 'RB'),
 ('clear', 'JJ'),
 ('because', 'IN'),
 ('nobody', 'NN'),
 ("'s", 'POS'),
 ('here', 'RB'),
 ('you', 'PRP'),
 ("'ll", 'MD'),
 ('say', 'VB'),
 ('i', 'JJ'),
 ("'ll", 'MD'),
 ('not', 'RB'),
 ('fly', 'VB'),
 ('off', 'RP'),
 ('the', 'DT'),
 ('eye', 'NN')]
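
The tagged output is a list of `(token, tag)` pairs, so tag frequencies can be tallied directly. The sketch below uses a hand-copied subset of the pairs above instead of calling `nltk.pos_tag`, so it runs without the tagger data. Grouping by tag prefix (`NN` for nouns, `VB` for verbs) follows Penn Treebank tag naming.

```python
from collections import Counter

# Subset of the (token, tag) pairs shown above.
tagged_sample = [
    ("I", "PRP"), ("'d", "MD"), ("rather", "RB"), ("watch", "VB"),
    ("the", "DT"), ("clouds", "NN"), ("in", "IN"), ("the", "DT"),
    ("sky", "NN"), ("though", "IN"), ("do", "VBP"), ("n't", "RB"),
    ("climb", "VB"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sample)

# Collect nouns and verbs by tag prefix.
nouns = [tok for tok, tag in tagged_sample if tag.startswith("NN")]
verbs = [tok for tok, tag in tagged_sample if tag.startswith("VB")]

print(tag_counts.most_common(3))
print(nouns)
print(verbs)
```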