# Image Caption Generator

This project uses deep learning models to recognize the context of an image and describe it in natural language, here English.

The required libraries are imported first.

In [1]:
import os   # handling the files
import pickle # storing numpy features
import numpy as np

## Data Collection

The data used in this analysis comes from two datasets:
1. Flickr8k dataset 
2. IAPRTC12 dataset

## Data Preprocessing

The captions of the two datasets are stored in two separate files. They are loaded into a single dictionary, with the image name as the key and a list of that image's captions as the value. The captions are then cleaned to remove noise.

### Load captions data

In [27]:
# Reading the captions file
with open('Data/captions_dataset1.txt', 'r') as file:
    captions_dataset1 = file.readlines()
    
with open('Data/captions_dataset2.txt', 'r') as file:
    captions_dataset2 = file.readlines()

In [28]:
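# Mapping each image name to the list of its captions
mapping = {}

# Dataset1 (comma-separated; the first line is skipped)
for line in captions_dataset1[1:]:
    # Split on the first comma only, so commas inside a caption are kept
    tokens = line.split(',', 1)
    if len(tokens) < 2:
        continue  # skip blank or malformed lines
    key, caption = tokens
    if key not in mapping:
        mapping[key] = []
    mapping[key].append(caption)

# Dataset2 (semicolon-separated; the first line is skipped)
for line in captions_dataset2[1:]:
    # Split on the first semicolon only, for the same reason
    tokens = line.split(';', 1)
    if len(tokens) < 2:
        continue  # skip blank or malformed lines
    key, caption = tokens
    if key not in mapping:
        mapping[key] = []
    mapping[key].append(caption)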

print("Total number of images: " + str(len(mapping)))

Total number of images: 12800
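
The grouping step can also be tried on a few lines in isolation. The following is a minimal sketch, not the notebook's own code: `build_mapping` is a hypothetical helper, the sample lines are made up rather than taken from either dataset, and a two-column `image<delimiter>caption` layout is assumed.

```python
from collections import defaultdict


def build_mapping(lines, delimiter):
    """Group captions by image name. `lines` must not include the header row."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # partition splits on the first delimiter only, so a caption that
        # contains the delimiter is kept whole
        key, _, caption = line.partition(delimiter)
        grouped[key].append(caption)
    return dict(grouped)


# Made-up sample lines (assumption: same layout as the first captions file)
sample = [
    "img_001.jpg,A dog runs across the grass .\n",
    "img_001.jpg,A brown dog playing outside , near a fence .\n",
    "img_002.jpg,Two children sit on a bench .\n",
]
sample_mapping = build_mapping(sample, ",")
print(len(sample_mapping))
print(sample_mapping["img_001.jpg"])
```

Using `defaultdict(list)` removes the explicit `if key not in mapping` check; the result is converted back to a plain `dict` so that later lookups of missing keys still raise `KeyError`.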


### Cleaning the captions

The captions are cleaned by converting them to lowercase, removing irrelevant characters such as digits and special characters, and trimming extra spaces. Single-character words are dropped, and each caption is wrapped in `startseq` and `endseq` markers.

In [29]:
import re

# Converting the case, removing irrelevant characters, and adding sequence markers
for image, captions in mapping.items():
    for i in range(len(captions)):
        # str.replace only matches literal text, so the pattern is applied with re.sub
        cleaned_caption = re.sub(r'[^a-z\s]', '', captions[i].lower())
        # split() + join collapses extra whitespace; single-character words are dropped
        processed_caption = 'startseq ' + " ".join([word for word in cleaned_caption.split() if len(word)>1]) + ' endseq'
        captions[i] = processed_caption
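
To check the cleaning rules on a single caption, here is a minimal sketch. `clean_caption` is a hypothetical helper, not part of the notebook, and it assumes the pattern is applied as a regular expression through `re.sub`. The example caption is made up.

```python
import re


def clean_caption(caption):
    """Lowercase, keep only letters and whitespace, drop one-letter words, add markers."""
    caption = caption.lower()
    # Remove digits, punctuation, and any other non-letter characters
    caption = re.sub(r'[^a-z\s]', '', caption)
    # split() discards extra whitespace; the filter drops single-character words
    words = [word for word in caption.split() if len(word) > 1]
    return 'startseq ' + ' '.join(words) + ' endseq'


example = "A child in a pink dress is climbing up 2 sets of stairs !\n"
print(clean_caption(example))
# startseq child in pink dress is climbing up sets of stairs endseq
```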

In [30]:
# Storing all the possible captions into a list for tokenization
all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)
        
print("Total number of possible captions: " + str(len(all_captions)))

Total number of possible captions: 48899


In [31]:
# Processed captions after cleaning
all_captions[:5]

['startseq child in pink dress is climbing up set of stairs in an entry way endseq',
 'startseq girl going into wooden building endseq',
 'startseq little girl climbing into wooden playhouse endseq',
 'startseq little girl climbing the stairs to her playhouse endseq',
 'startseq little girl in pink dress going into wooden cabin endseq']
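
The caption list is collected for tokenization, which is not shown in this section. As a rough sanity check beforehand, the vocabulary size and the longest caption length can be inspected with the standard library alone. This is only a sketch: the names `word_counts`, `vocab_size`, and `max_length` are illustrative, and the two captions are copied from the output above.

```python
from collections import Counter

# Two of the cleaned captions shown above
sample_captions = [
    'startseq girl going into wooden building endseq',
    'startseq little girl climbing into wooden playhouse endseq',
]

# Count how often each word occurs across all captions
word_counts = Counter(word for caption in sample_captions for word in caption.split())

vocab_size = len(word_counts)  # number of distinct words, markers included
max_length = max(len(caption.split()) for caption in sample_captions)  # longest caption in words

print(vocab_size, max_length)
# 10 8
```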