# **Image Captioning using ResNet-152 and LSTM**

## **Part 1: Encoding Images Using a Convolutional Neural Network (CNN)**

The first step in the project involves encoding images into feature representations using a **pretrained Convolutional Neural Network (CNN)**. The CNN serves as an **encoder**, transforming high-dimensional image data into a lower-dimensional feature vector that captures essential visual information.

### Steps:

1. **Select a Pretrained CNN**: A pretrained ResNet-152 is used as the backbone for feature extraction.
2. **Remove the Fully Connected Layer**: Only the convolutional base of the CNN (including its global pooling layer) is retained; the final classification layer is removed so that the network outputs a **2048-dimensional feature vector**.
3. **Extract Feature Vectors**: Each image is passed through the CNN, and the pooled output of the convolutional base (a tensor of shape `[1, 2048]` per image) is taken as the image representation.
4. **Store Feature Representations**: The extracted feature vectors are saved for efficient retrieval during training.

This process enables the transformation of images into meaningful numerical embeddings that can be fed into the sequence modeling component of the system.

## **Dataset: Flickr8k**
- Contains **8,000 images**, each annotated with **5 captions**.
- Preprocessing includes **resizing, cropping, normalization**, and **tokenization**.

## **Technologies & Libraries**
- **Deep Learning Framework:** PyTorch and TensorFlow
- **Pretrained Model:** ResNet-152
- **Text Processing:** custom tokenizer
- **Dataset:** Flickr8k




---

If necessary, download the data.

In [None]:
# !wget https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
# !unzip Flickr8k_Dataset.zip
# !git clone https://github.com/ysbecca/flickr8k-custom

---

In [None]:
import sys
sys.path.insert(0, '../')

import torch
from tqdm import tqdm

from Pipeline.data_retrieving.Image_Caption_data_retriever import Image_Caption_data_retriever
from Pipeline.preprocessing.Image_Caption_preprocessing import Image_Caption_preprocessing
from Pipeline.modelling.dataloader.Image_Local_Dataloader import Image_Local_DataLoader
from Pipeline.modelling.models.EncoderCNN import *

---

### Create an instance of the data retriever for the images and captions

In [2]:
training_data_retriever = Image_Caption_data_retriever()

training_data_retriever.retrieve_data('./data/flickr8k-custom/captions/Flickr8k_train.token.txt')

# print the head
display(training_data_retriever.get_data().head())

print(f"there are {len(training_data_retriever.get_data())} training examples in the training set")

| Unnamed: 0 | image_ID | caption |
|---|---|---|
| 0 | 1000268201_693b08cb0e.jpg | A child in a pink dress is climbing up a set o... |
| 1 | 1000268201_693b08cb0e.jpg | A girl going into a wooden building . |
| 2 | 1000268201_693b08cb0e.jpg | A little girl climbing into a wooden playhouse . |
| 3 | 1000268201_693b08cb0e.jpg | A little girl climbing the stairs to her playh... |
| 4 | 1000268201_693b08cb0e.jpg | A little girl in a pink dress going into a woo... |


there are 35445 training examples in the training set


### Clean and tokenize the captions

In [4]:
# define the preprocessing

# the preprocessing of the images is done in the data loader
input_preprocessing_params = {}

# preprocess the captions by removing the undesired characters and by tokenizing
# NOTE: Custom_Tokenizer is not imported explicitly in this notebook; it is assumed to be in scope via the Pipeline imports above
output_preprocessing_params = {
    'lower_case': True,
    'remove_punctuation': True,
    'remove_stopwords': True,
    'remove_digits': True,
    'tokenizer': Custom_Tokenizer
}

preprocessor = Image_Caption_preprocessing(input_preprocessing_params, output_preprocessing_params)
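
The internals of `Image_Caption_preprocessing` are not shown in this notebook. As a minimal sketch of what the four cleaning flags above could do, the example below assumes plain whitespace tokenization and a tiny made-up stopword list; the function name `clean_caption` and the `STOPWORDS` set are hypothetical and are not part of the project's pipeline.

```python
import string

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "up"}


def clean_caption(caption, lower_case=True, remove_punctuation=True,
                  remove_stopwords=True, remove_digits=True):
    """Apply the cleaning flags in order and return a list of tokens."""
    if lower_case:
        caption = caption.lower()
    if remove_punctuation:
        caption = caption.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        caption = caption.translate(str.maketrans("", "", string.digits))
    tokens = caption.split()  # whitespace tokenization
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOPWORDS]
    return tokens


tokens = clean_caption("A girl going into a wooden building .")
print(tokens)  # ['girl', 'going', 'into', 'wooden', 'building']
```

Note that stopword removal shortens captions considerably; whether that is desirable for caption generation depends on how the decoder is trained and evaluated.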

In [5]:
import pickle

# preprocess the captions
list_of_captions = preprocessor.preprocess_output_data(training_data_retriever.get_data()['caption'])

# save the captions
with open("captions.pkl", "wb") as file:
    # Serialize the list and save it to the file
    pickle.dump(list_of_captions, file)

In [6]:
# save the tokenizer to use it later in the decoder
with open("custom_tokenizer.pkl", "wb") as file:
    # Serialize the tokenizer and save it to the file
    pickle.dump(preprocessor.get_output_preprocessing_params()['tokenizer'], file)
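
For reference, the pickled objects can be read back with `pickle.load`. The sketch below round-trips an illustrative caption list through a temporary file; the data and file location are made up for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative data standing in for the preprocessed captions.
captions = [["girl", "going", "into", "wooden", "building"]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "captions.pkl"

    # Serialize the list to disk
    with open(path, "wb") as file:
        pickle.dump(captions, file)

    # Load it back, as the decoder notebook would do later
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == captions)
```

When the pickled object is an instance of a custom class (such as a tokenizer), the class definition must be importable in the session that loads it, because pickle stores a reference to the class rather than its code.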

In [7]:
# add the local path to the images 
local_path = './data/Flicker8k_Dataset/'
training_data_retriever.get_data()['image_ID'] = training_data_retriever.get_data()['image_ID'].map(lambda x: local_path + x)

---

### Create a dataloader for the images

In [8]:
from torchvision import transforms

# Preprocessing applied to the images before they are fed to the ResNet (see the PyTorch documentation)
resenet_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

In [None]:
# NOTE: the IDs inside the list are repeated 5 times, because each image has 5 captions.
# Instead of computing the same feature 5 times, I compute it only once and then copy the result 5 times
image_dataloader = Image_Local_DataLoader(
    x=training_data_retriever.get_data()['image_ID'].to_list()[::5],
    batch_size=1,
    shuffle=False,
    image_preprocessing_fn = resenet_transform
)
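
The `[::5]` slice relies on every image having exactly 5 consecutive rows. A small sketch of that assumption, using made-up image IDs, together with a check that could be run before trusting the stride:

```python
# Made-up IDs: three images with five captions each, stored consecutively.
image_ids = ["a.jpg"] * 5 + ["b.jpg"] * 5 + ["c.jpg"] * 5

# Taking every fifth entry keeps one ID per image.
unique_ids = image_ids[::5]

# The stride is only valid if each block of 5 rows refers to the same image.
stride_is_safe = len(image_ids) % 5 == 0 and all(
    image_ids[i] == unique_ids[i // 5] for i in range(len(image_ids))
)

print(unique_ids, stride_is_safe)  # ['a.jpg', 'b.jpg', 'c.jpg'] True
```

If the check fails (for example because an image has fewer than 5 captions), the features would need to be matched to captions by image ID rather than by position.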

---
### Create an instance of the EncoderCNN

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [11]:
# Load the pretrained ResNet-152 and replace top fc layer
encoder_net = EncoderCNN()

# move the network to the GPU
encoder_net = encoder_net.to(device)

In [12]:
# Manually pass a dummy input to check the output shape of the EncoderCNN
# random tensor of shape (1, 3, 224, 224): a batch of 1 image with 3 channels of size 224x224 (the input shape accepted by the model)
dummy_input = torch.randn(1, 3, 224, 224).to(device)
print(f'Output shape of the model: {encoder_net(dummy_input).shape}')

Output shape of the model: torch.Size([1, 2048])


---
### Encode the images using the EncoderCNN and the dataloader just created

In [13]:
# create an empty tensor to hold the features
# with batch_size=1 the dataloader length equals the number of images, and each feature must repeat 5 times
N_samples = len(image_dataloader) * 5
feature_lenght = 2048  # the output has length 2048: I inspected the architecture of the model and verified this value
features = torch.empty((N_samples, feature_lenght), device=device)

In [14]:
# make sure layers such as batch normalization run in inference mode
encoder_net.eval()

# iterate over the whole dataset
start_idx = 0

# gradients are not needed for feature extraction
with torch.no_grad():
    for inputs in tqdm(image_dataloader):
        # convert from tf to torch
        inputs = torch.tensor(inputs.numpy(), dtype=torch.float32).to(device)
        batch_size = inputs.shape[0]

        # run the encoding
        feature = encoder_net(inputs)

        # repeat each feature five times, once per caption of the image
        feature_copied_five_times = feature.repeat_interleave(5, dim=0)

        # store the features in the preallocated tensor
        end_idx = start_idx + 5 * batch_size
        features[start_idx:end_idx] = feature_copied_five_times

        # update the index
        start_idx = end_idx
    

100%|██████████| 7089/7089 [02:34<00:00, 45.88it/s]
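
To illustrate the five-fold copy step on a tiny tensor: assigning a feature row to a slice of five rows and calling `repeat_interleave` produce the same result. The 2x2 tensor below is made up for the example.

```python
import torch

# Two made-up feature vectors of length 2 (the real ones have length 2048).
feature = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Explicit version: write each row into a block of 5 consecutive rows.
looped = torch.empty((5 * feature.shape[0], feature.shape[1]))
for i, f in enumerate(feature):
    looped[5 * i : 5 * i + 5] = f

# Vectorized version: repeat every row 5 times along the first dimension.
repeated = feature.repeat_interleave(5, dim=0)

print(torch.equal(looped, repeated))  # True
```

Rows 0 to 4 hold the first feature and rows 5 to 9 hold the second, which is the same ordering as the caption rows in the training data.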


In [15]:
# save the features locally
torch.save(features.to('cpu'),'data/encoded_images_features.pt')
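
A minimal sketch of reading the saved features back, assuming they are later paired with the captions by row index. It uses a small random tensor and a temporary file rather than the real `data/encoded_images_features.pt`.

```python
import tempfile
from pathlib import Path

import torch

# Small random stand-in for the (N_samples, 2048) feature matrix.
features = torch.randn(10, 2048)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "encoded_images_features.pt"
    torch.save(features, path)
    restored = torch.load(path)

# Row i of the restored tensor belongs to caption i.
print(restored.shape, torch.equal(restored, features))
```

Before training, it is worth checking that the number of rows in the loaded tensor equals the number of preprocessed captions.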