# Loading and parsing the OEDILF dataset from [PoetRNN](https://github.com/sballas8/PoetRNN)

A quick notebook for downloading the dataset, exploring it, and parsing it into JSON.
Note that PoetRNN contains only 90,000 limericks, whereas the OEDILF website already has 113,885 approved ones by now, so we may have to scrape the rest ourselves if we need them.

### Downloading the data

In [44]:
url = 'https://github.com/sballas8/PoetRNN/raw/master/data/limericks.csv'
!wget $url
!ls

--2022-03-18 12:03:22--  https://github.com/sballas8/PoetRNN/raw/master/data/limericks.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sballas8/PoetRNN/master/data/limericks.csv [following]
--2022-03-18 12:03:22--  https://raw.githubusercontent.com/sballas8/PoetRNN/master/data/limericks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15723453 (15M) [text/plain]
Saving to: ‘limericks.csv’


2022-03-18 12:03:23 (15.5 MB/s) - ‘limericks.csv’ saved [15723453/15723453]

get_data.ipynb limericks.csv
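
If `wget` is not available (for example on some Windows setups), the same download can probably be done from Python itself. The following is a minimal sketch using only the standard library; the helper name `download_file` is hypothetical and not part of the original notebook.

```python
import urllib.request


def download_file(url, dest_path):
    """Download url to dest_path and return the local path."""
    # urlretrieve follows HTTP redirects, such as the 302 that GitHub
    # issues towards raw.githubusercontent.com in the log above
    local_path, _headers = urllib.request.urlretrieve(url, dest_path)
    return local_path
```

Calling `download_file(url, 'limericks.csv')` with the `url` defined above should leave the same file in the working directory as the `wget` call.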


### Parsing into JSON

The OEDILF dataset is only about 15 MB, so for now we can load it into memory all at once and parse it into JSON.

In [45]:
import json

In [46]:
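# Read the whole file at once; at about 15 MB it fits comfortably in memory
with open('limericks.csv', 'r') as limericks_file:
    content = limericks_file.read()
# Split on double quotes; this assumes each limerick is wrapped in a pair of them
limericks = content.split('"')
# Strip the newlines between records, then discard the empty fragments
stripped = [limerick.strip('\n') for limerick in limericks]
filtered = [limerick for limerick in stripped if limerick]
filtered[:3]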

["cap'n jack was washed over the side.\nhis crew searched but found not hair nor hide.\nno longer the helm,\nbut the deep benthic realm,\nis where jack will forever reside.",
 "ablactation, to wean off the breast,\nshould wait 'til age 2; this is best.\nthough some men never quit\n(bet you thought i'd rhyme tit)\nbecause they're mammarially obsessed.",
 "as a soup, bisque is best when served hot.\nmade with lobster, it hits the right spot.\ni think it tastes dreamy;\nit's so rich and creamy.\nit's the soup you'd be served on a yacht."]
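
Splitting on `"` works as long as no limerick contains a double quote of its own. As a possible alternative, here is a sketch that lets the standard `csv` module handle the quoting. It assumes `limericks.csv` is ordinary CSV with each limerick stored in a double-quoted field, which has not been verified against the full file; the helper name `read_limericks` is hypothetical.

```python
import csv
import io


def read_limericks(csv_text):
    """Return every non-empty CSV field in csv_text as one limerick string."""
    # newline='' keeps the line breaks inside quoted fields untouched
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    limericks = []
    for row in reader:
        for field in row:
            field = field.strip('\n')
            if field:
                limericks.append(field)
    return limericks
```

With the `content` string read above, `read_limericks(content)[:3]` could then be compared against `filtered[:3]` to see whether the two approaches agree.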

In [47]:
def parse_limerick(limerick):
    """Parse a full limerick into a dictionary containing a list of lines

    :param limerick: full limerick as a single string
    :return: dictionary whose 'lines' field is a list of the 5 limerick lines,
        or None if the limerick does not have exactly 5 lines
    """
    lines = limerick.split('\n')
    # Skip anything that is not exactly five lines long
    if len(lines) != 5:
        return None
    return {'lines': lines}

def limericks_to_json(limericks, json_path):
    """Parse a list of limerick strings and dump it as JSON

    :param limericks: list of limerick strings
    :param json_path: output path where JSON will be dumped
    :return: None
    """
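    output = {'count': 0, 'limericks': {}}
    for index, limerick in enumerate(limericks):
        limerick_dict = parse_limerick(limerick)
        # Keep only well-formed limericks; keys are positions in the input
        # list, so skipped entries leave gaps in the numbering
        if limerick_dict:
            output['limericks'][index] = limerick_dict

    # Record how many limericks were actually kept
    output['count'] = len(output['limericks'])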

    with open(json_path, 'w') as outfile:
        json.dump(output, outfile)

In [48]:
limericks_to_json(filtered, 'limericks.json')
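
To check the result, the file can be read back and its `count` field compared with the number of stored limericks. This is a small sketch rather than part of the original notebook, and the helper name `load_limericks` is hypothetical. One thing it makes visible: JSON object keys are always strings, so the integer indices used in `limericks_to_json` come back as `'0'`, `'1'`, and so on.

```python
import json


def load_limericks(json_path):
    """Load the dumped JSON and check that 'count' matches the entries."""
    with open(json_path, 'r') as infile:
        data = json.load(infile)
    # Fail loudly if the stored count and the actual number of entries differ
    if data['count'] != len(data['limericks']):
        raise ValueError('count does not match the number of limericks')
    return data
```

For instance, `data = load_limericks('limericks.json')` followed by `data['count']` would show how many five-line limericks survived the filtering.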