In [88]:
import string
import numpy as np 


# Prepare text data

We prepare the text data (i.e. the image descriptions) in five steps:
<ol>
    <li>Load the data</li>
    <li>Create a dictionary that maps each photo to its descriptions</li>
    <li>Clean the descriptions (text to lowercase, remove punctuation, remove words that contain numbers)</li>
    <li>Create a vocabulary containing all the words in the descriptions</li>
    <li>Save the cleaned data to a file</li>
</ol>   

## 1. Load data

In [89]:
with open('./Flickr8k_text/Flickr8k.token.txt', 'r') as f:
    corpus = f.read()

## 2. Create Dictionary

This is an example of how the data looks inside the `corpus` variable:

         
```956164675_9ee084364e.jpg#0\tA runner in a yellow shirt is cresting a hill .\n956164675_9ee084364e.jpg#1\tA runner with one green shoe and one white shoe runs uphill .\n956164675_9ee084364e.jpg#2\tA single runner is watched by onlookers in a race .\n956164675_9ee084364e.jpg#3\tMan wearing green sneakers runs down highway .\n956164675_9ee084364e.jpg#4\tThe runner in red and yellow has just made it up the hill .\n```   


Lines are separated by `\n`. Within each line, the photo name and its description are separated by `\t`, and the photo name ends with `#` followed by the caption index (0 to 4 in the excerpt above).
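
As a minimal sketch of that layout, the snippet below parses a single line copied from the excerpt above. The names `sample_line`, `raw_id`, and `caption_idx` are hypothetical and used only for illustration. Splitting with `rsplit('#', 1)` is an alternative to slicing off the last two characters; it does not assume the caption index is a single digit.

```python
# One line copied from the corpus excerpt above (illustrative only)
sample_line = "956164675_9ee084364e.jpg#0\tA runner in a yellow shirt is cresting a hill ."

# The tab separates the photo identifier from its description
raw_id, desc = sample_line.split('\t')

# Split the '#<n>' caption index off from the right-hand side
photo_id, caption_idx = raw_id.rsplit('#', 1)

print(photo_id)     # file name without the caption index
print(caption_idx)  # caption index, still a string
print(desc)         # raw, uncleaned description
```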

In [90]:
doc = corpus.split('\n')
photo_to_desc = dict()

# Skip doc[-1]: the file ends with '\n', so the last element is an empty string
for line in doc[:-1]:
    photo_id, desc = line.split('\t')
    # Drop the trailing '#<n>' caption index (e.g. '#0')
    photo_id = photo_id[:-2]
    if photo_id in photo_to_desc:
        photo_to_desc[photo_id].append(desc)
    else:
        photo_to_desc[photo_id] = [desc]

## 3. Clean Descriptions

In [91]:
for key, desc_lst in photo_to_desc.items():
    for idx, desc in enumerate(desc_lst):
        # Lower case
        desc = desc.lower()
        # Remove punctuation
        desc = desc.translate(str.maketrans('', '', string.punctuation))
        # Remove words that contain numbers (or any other non-alphabetic character)
        words = [word for word in desc.split() if word.isalpha()]
        # Rejoin with single spaces; this also drops leading and trailing whitespace
        desc = ' '.join(words)

        photo_to_desc[key][idx] = desc

In [92]:
print(photo_to_desc['1000268201_693b08cb0e.jpg'])

['a child in a pink dress is climbing up a set of stairs in an entry way', 'a girl going into a wooden building', 'a little girl climbing into a wooden playhouse', 'a little girl climbing the stairs to her playhouse', 'a little girl in a pink dress going into a wooden cabin']
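
The cleaning rules listed at the start (lowercase, no punctuation, no words that contain numbers) can also be sketched as a standalone function, which makes them easy to check on a single caption. `clean_description` is a hypothetical helper name, and the sample caption is made up to exercise the digit rule; it is not taken from the dataset.

```python
import string

def clean_description(desc):
    """Hypothetical helper: lowercase, strip punctuation, drop non-alphabetic words."""
    desc = desc.lower()
    # Delete every character listed in string.punctuation
    desc = desc.translate(str.maketrans('', '', string.punctuation))
    # Keep only purely alphabetic words, so '2nd' is dropped as a whole
    words = [word for word in desc.split() if word.isalpha()]
    return ' '.join(words)

# Made-up caption, used only to exercise the digit rule
print(clean_description("Two dogs run past the 2nd gate ."))
# prints: two dogs run past the gate
```

Filtering whole words keeps fragments such as `nd` out of the vocabulary, which a character-by-character filter would leave behind.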


## 4. Create Vocabulary

In [93]:
vocab = set()

for desc_lst in photo_to_desc.values():
    for desc in desc_lst:
        vocab.update(desc.split())
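
To illustrate what the vocabulary ends up containing, here is a small sketch on made-up data in the same shape as `photo_to_desc`. The names `toy` and `toy_vocab` and the captions are assumptions for illustration only.

```python
# Toy mapping in the same shape as photo_to_desc (made-up data)
toy = {
    'a.jpg': ['a dog runs', 'a dog jumps'],
    'b.jpg': ['a cat sleeps'],
}

toy_vocab = set()
for desc_lst in toy.values():
    for desc in desc_lst:
        # A set keeps each distinct word once, however often it occurs
        toy_vocab.update(desc.split())

print(sorted(toy_vocab))
# prints: ['a', 'cat', 'dog', 'jumps', 'runs', 'sleeps']
```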

## 5. Save The Cleaned Data

In [95]:
lines = list()

for key, desc_lst in photo_to_desc.items():
    for desc in desc_lst:
        lines.append(f"{key} {desc}")

data = '\n'.join(lines)

with open('descriptions.txt', 'w') as f:
    f.write(data)
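
A quick way to check the saved layout is to read it back into a dictionary. The sketch below is one possible approach, not part of the original notebook: `load_descriptions` is a hypothetical helper, and it runs on made-up lines written to a temporary file in the same `<photo_id> <description>` layout. It relies on the photo id containing no spaces.

```python
import os
import tempfile

def load_descriptions(path):
    """Hypothetical helper: rebuild the photo -> descriptions mapping from a saved file."""
    mapping = {}
    with open(path, 'r') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # The photo id contains no spaces, so the first space ends it
            photo_id, _, desc = line.partition(' ')
            mapping.setdefault(photo_id, []).append(desc)
    return mapping

# Round trip on made-up data written in the same layout as descriptions.txt
toy_lines = ["a.jpg a dog runs", "a.jpg a dog jumps", "b.jpg a cat sleeps"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'descriptions.txt')
    with open(path, 'w') as f:
        f.write('\n'.join(toy_lines))
    loaded = load_descriptions(path)

print(loaded)
```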