## COMP5625M Assessment 2 - Image Caption Generation [100 marks]

<div class="logos"><img src="./drive/MyDrive/DeepLearningCW2/Comp5625M_logo.jpg" width="220px" align="right"></div>

The maximum marks for each part are shown in the section headers. The overall assessment carries a total of 100 marks.
This assessment is weighted 25% of the final grade for the module.

### Motivation 

Through this assessment, you will:

> 1. Understand the principles of text pre-processing and vocabulary building.
> 2. Gain experience working with an image-to-text model.
> 3. Use and compare two text similarity metrics for evaluating an image-to-text model, and understand evaluation challenges.


### Setup and resources 

Having a GPU will speed up the image feature extraction process. If you want to use a GPU, please refer to the module website for recommended working environments with GPUs.

Please implement the coursework using PyTorch and Python-based libraries, and refer to the notebooks and exercises provided.

This assessment will use a subset of the [COCO "Common Objects in Context" dataset](https://cocodataset.org/) for image caption generation. COCO contains 330K images of 80 object categories, and at least five textual reference captions per image. Our subset consists of nearly 5070 of these images, each with five or more different descriptions of the salient entities and activities, and we will refer to it as COCO_5070.

To download the data:

> 1. **Images and annotations**: download the zipped file provided in the link here as [``COMP5625M_data_assessment_2.zip``](https://leeds365-my.sharepoint.com/:u:/g/personal/scssali_leeds_ac_uk/EWWzE-_AIrlOkvOKxH4rjIgBF_eUx8KDJMPKM2eHwCE0dg?e=DdX62H). 

``Info only:`` To understand more about the COCO dataset, you can look at the [download page](https://cocodataset.org/#download). We have already provided you with the "2017 Train/Val annotations (241MB)", but our image subset consists of fewer images than the original COCO dataset. **So, no need to download anything from here!** 

> 2. **Image metadata**: as our set is a subset of the full COCO dataset, we have created a CSV file containing relevant metadata for our particular subset of images. You can also download it from Drive, "coco_subset_meta.csv", at the same link as 1.


### Submission

Please submit the following:

> 1. Your completed Jupyter notebook file, in .ipynb format. **Do not change the file name.**
> 2. The .html version of your notebook; File > Download as > HTML (.html). Check that all cells have been run and all outputs (including all graphs you would like to be marked) are displayed in the .html for marking.

**Final note:**

> **Please include everything you would like to be marked in this notebook, including figures. Under each section, put the relevant code containing your solution. You may re-use functions you defined previously, but any new code must be in the appropriate section.** Feel free to add as many code cells as you need under each section.

In [48]:
%ls

caption_image_ids.png              Comp5625M_logo.jpg  [0m[01;34msample_data[0m/
cleancaptions.png                  [01;34mdrive[0m/
comp5625M_figure_imageCaption.jpg  features_map.pt


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Your student username (for example, ```sc15jb```):

Your full name:

### Imports

Feel free to add to this section as needed.

In [2]:
import torch
import torch.nn as nn
from torchvision import transforms
import torchvision.models as models
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import os
import numpy as np

Detect which device (CPU/GPU) to use.

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cpu


The basic principle of our image-to-text model is as pictured in the diagram below, where an Encoder network encodes the input image as a feature vector by providing the outputs of the last convolutional layer of a pre-trained CNN (we use [ResNet50](https://arxiv.org/abs/1512.03385)). This pretrained network has been trained on the complete ImageNet dataset and is thus able to recognise common objects. 

**(Hint)** You can alternatively use the COCO trained pretrained weights from [PyTorch](https://pytorch.org/vision/stable/models.html). One way to do this is use the "FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1" but use e.g., "resnet_model = model.backbone.body". Alternatively, you can use the checkpoint from your previous coursework where you finetuned to COCO dataset. 

These features are then fed into a Decoder network along with the reference captions. As the image feature dimensions are large and sparse, the Decoder network includes a linear layer which downsizes them, followed by *a batch normalisation layer* to speed up training. Those resulting features, as well as the reference text captions, are then passed into a recurrent network (we will use **RNN** in this assessment). 

The reference captions used to compute loss are represented as numerical vectors via an **embedding layer** whose weights are learned during training.

<!-- ![Encoder Decoder](comp5625M_figure.jpg) --> 


<div>
<center><img src="comp5625M_figure_imageCaption.jpg" width="1000"/></center>
</div>


The Encoder-Decoder network could be coupled and trained end-to-end, without saving features to disk; however, this requires iterating through the entire image training set during training. We can make the **training more efficient by decoupling the networks**. Thus, we will:

> First extract the feature representations of the images from the Encoder

> Save these features (Part 1) such that during the training of the Decoder (Part 3), we only need to iterate over the image feature data and the reference captions.

**Hint**
Try commenting out the feature extraction part once you have saved the embeddings. This way if you have to re-run the entire codes for some reason then you can only load these features. 


### Overview

> 1. Extracting image features 
> 2. Text preparation of training and validation data 
> 3. Training the decoder
> 4. Generating predictions on test data
> 5. Caption evaluation via BLEU score
> 6. Caption evaluation via Cosine similarity
> 7. Comparing BLEU and Cosine similarity


## 1 Extracting image features [11 marks]

> 1.1 Design a encoder layer with pretrained ResNet50 (4 marks)

> 1.2 Image feature extraction step (7 marks)

#### 1.1 Design a encoder layer with pretrained ResNet50 (4 marks)

> Read through the template EncoderCNN class below and complete the class.

> You are expected to use ResNet50 pretrained on imageNet provided in the Pytorch library (torchvision.models)


In [4]:
class EncoderCNN(nn.Module):
    def __init__(self):
        """Load the pretrained ResNet-50 and replace top fc layer."""
        super(EncoderCNN, self).__init__() # This line is used to call the constructor of the parent class nn.Module. This is necessary because EncoderCNN is a subclass of nn.Module and needs to inherit some of its properties and methods.
        
        # Your code here!
        resnet = models.resnet50(pretrained=True)
        modules = list(resnet.children())[:-1]  # Remove last FC layer
        self.resnet = nn.Sequential(*modules)
        # TO COMPLETE
        # keep all layers of the pretrained net except the last layers of fully-connected ones (you are permitted to take other layers too but this can affect your accuracy!)

        
    def forward(self, images):
        """Extract feature vectors from input images."""
        with torch.no_grad():
            features = self.resnet(images)
        return features.view(features.size(0), -1)
        # TO COMPLETE
        # remember no gradients are needed
        # return features 


In [5]:
# instantiate encoder and put into evaluation mode.
# Your code here!

encoder = EncoderCNN()
encoder.eval()


Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 245MB/s]


EncoderCNN(
  (resnet): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          (0): Conv2d(64

#### 1.2 Image feature extraction step (7 marks)

Pass the images through the ```Encoder``` model, saving the resulting features for each image. You may like to use a ```Dataset``` and ```DataLoader``` to load the data in batches for faster processing, or you may choose to simply read in one image at a time from disk without any loaders.

Note that as this is a forward pass only, no gradients are needed. You will need to be able to match each image ID (the image name without file extension) with its features later, so we suggest either saving a dictionary of image ID: image features, or keeping a separate ordered list of image IDs.

Use this ImageNet transform provided.

In [6]:
data_transform = transforms.Compose([ 
    transforms.ToTensor(),
    transforms.Resize(224), 
    transforms.CenterCrop(224), 
    transforms.Normalize((0.485, 0.456, 0.406),   # using ImageNet norms
                         (0.229, 0.224, 0.225))])

!ls


In [None]:
%cd ./drive/MyDrive/DeepLearningCW2/

/content/drive/MyDrive/DeepLearningCW2


In [None]:
%ls

[0m[01;34mdrive[0m/  features_map.pt  [01;34msample_data[0m/


In [None]:
%cd COMP5625M_data_assessment_2.zip\ \(Unzipped\ Files\)

/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)


In [None]:
!pwd

/content


In [7]:
# Get unique images from the csv for extracting features - helper code
imageList = pd.read_csv("/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco_subset_meta.csv")
imageList['file_name']
len(imageList.id.unique())

imagesUnique = sorted(imageList['file_name'].unique())
print(len(imagesUnique))

df_unique_files =  pd.DataFrame.from_dict(imagesUnique)

df_unique_files.columns = ['file_name']
df_unique_files

5068


Unnamed: 0,file_name
0,000000000009.jpg
1,000000000025.jpg
2,000000000030.jpg
3,000000000034.jpg
4,000000000036.jpg
...,...
5063,000000581906.jpg
5064,000000581909.jpg
5065,000000581913.jpg
5066,000000581921.jpg


In [8]:
from PIL import Image

In [9]:
# Define a class COCOImagesDataset(Dataset) function that takes the 
# image file names and reads the image and apply transform to it
# ---> your code here! we have provided you a sketch 

IMAGE_DIR = "/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco/images"

class COCOImagesDataset(Dataset):
    def __init__(self, image_dir, transform=None):

        self.image_dir = image_dir
        self.filenames = os.listdir(self.image_dir)
        # --> your code here!
        self.transform = transform

    def __getitem__(self, index):
        

        # --> your code here!
        filename = self.filenames[index]
        img_path = os.path.join(self.image_dir, filename)
        image = Image.open(img_path).convert('RGB')

        if self.transform is not None:
            image = self.transform(image)


        return image, filename

    def __len__(self):
        return len(self.filenames)
    

In [10]:
from torch.utils.data import DataLoader

In [11]:
# Use Dataloader to use the unique files using the class COCOImagesDataset
# make sure that shuffle is False as we are not aiming to retrain in this exercise
# Your code here-->
#batch_size = 32

IMAGE_DIR = "/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco/images"

# Instantiate the dataset
dataset = COCOImagesDataset(IMAGE_DIR, transform=data_transform)

# Instantiate the data loader
data_loader = DataLoader(dataset, shuffle=False)


In [145]:
print(data_loader)
num_images = len(data_loader)
print("Total number of images:", num_images)


<torch.utils.data.dataloader.DataLoader object at 0x7ff7c8b87010>
Total number of images: 5071


In [13]:
from pycocotools.coco import COCO

In [None]:
# Apply encoder to extract featues and save them (e.g., you can save it using image_ids)
# Hint - make sure to save your features after running this - you can use torch.save to do this
"""这一块以后重新加载的时候没必要跑了"""
features_map = dict()
from tqdm.notebook import tqdm
from PIL import Image

# Define paths to the COCO dataset and annotations
dataDir = '/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco/images'
annFile = '/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco/annotations2017/instances_train2017.json'


with torch.no_grad():

# ---> Your code here!  
  """
  那我应该在这里写啥呢?
  how to extract features ?
  extract feature 的一般过程或者步骤是怎样的？

  提取出来的feature应该就是卷积层的操作吧。
  那卷积核是怎么设置的？
  那比如我要提取，
  那在这个语义分割的任务里，提取完特征之后，应该干什么？为什么这样子就能提取出图像特征了？


  how to save the feature?


  """
  count_a = 0
  # Iterate over all images in the dataset and extract features
  for filename in os.listdir(dataDir):
      # Load the image
      image_path = os.path.join(dataDir, filename)
      image = Image.open(image_path).convert('RGB')
      
      # Apply transforms and extract features
      input_tensor = data_transform(image).unsqueeze(0)
      features = encoder(input_tensor)
      features = features.squeeze().detach().numpy()
      
      # Save features in the dictionary
      image_id = filename.split('.')[0]
      count_a = count_a + 1
      print(image_id,count_a)
      features_map[image_id] = features

  # Save the features dictionary as a binary file
  torch.save(features_map, '/content/drive/MyDrive/DeepLearningCW2/features_map.pt')



In [144]:
print(len(features_map))

5071


In [191]:
filename_list=[]
for filename in os.listdir(dataDir):
  x = filename.strip('0').strip('.jpg')
  x = int(x)
  filename_list.append(x)

In [193]:
print(filename_list,len(filename_list))
print('000000000009.jpg'in filename_list)

[418882, 352841, 347727, 28239, 387153, 225363, 554066, 282707, 282711, 228953, 215135, 387678, 272480, 531552, 147042, 11360, 537335, 116334, 544371, 353398, 373880, 387696, 111737, 515202, 147073, 419449, 396421, 552066, 246409, 349322, 515210, 215691, 512140, 161941, 27285, 212633, 494217, 119966, 204448, 320667, 398494, 360101, 147122, 184994, 307379, 563898, 493235, 246478, 349376, 214725, 63676, 558798, 434389, 267988, 126182, 226517, 48863, 318677, 233703, 485099, 275695, 436975, 537844, 193271, 514294, 325362, 253958, 385026, 163858, 499728, 532501, 90137, 466974, 483357, 458778, 294943, 204826, 204832, 548889, 507939, 532520, 450596, 8234, 458807, 401458, 409658, 311354, 204855, 24636, 163903, 442428, 286787, 426053, 458821, 565313, 450634, 393292, 557135, 393291, 122964, 24657, 303178, 286802, 32846, 458836, 548953, 360541, 213091, 417885, 524404, 483447, 106616, 540795, 188544, 221307, 532610, 376970, 163975, 557197, 90255, 573584, 409744, 401556, 286860, 57495, 540820, 4261

In [192]:
print(filename_list)
print(type(filename_list[0]))


[418882, 352841, 347727, 28239, 387153, 225363, 554066, 282707, 282711, 228953, 215135, 387678, 272480, 531552, 147042, 11360, 537335, 116334, 544371, 353398, 373880, 387696, 111737, 515202, 147073, 419449, 396421, 552066, 246409, 349322, 515210, 215691, 512140, 161941, 27285, 212633, 494217, 119966, 204448, 320667, 398494, 360101, 147122, 184994, 307379, 563898, 493235, 246478, 349376, 214725, 63676, 558798, 434389, 267988, 126182, 226517, 48863, 318677, 233703, 485099, 275695, 436975, 537844, 193271, 514294, 325362, 253958, 385026, 163858, 499728, 532501, 90137, 466974, 483357, 458778, 294943, 204826, 204832, 548889, 507939, 532520, 450596, 8234, 458807, 401458, 409658, 311354, 204855, 24636, 163903, 442428, 286787, 426053, 458821, 565313, 450634, 393292, 557135, 393291, 122964, 24657, 303178, 286802, 32846, 458836, 548953, 360541, 213091, 417885, 524404, 483447, 106616, 540795, 188544, 221307, 532610, 376970, 163975, 557197, 90255, 573584, 409744, 401556, 286860, 57495, 540820, 4261

## 2 Text preparation [23 marks]

> 2.1 Build the caption dataset (3 Marks)

> 2.2 Clean the captions (3 marks)

> 2.3 Split the data (3 marks)

> 2.4 Building the vocabulary (10 marks)

> 2.5 Prepare dataset using dataloader (4 marks)


#### 2.1 Build the caption dataset (3 Marks)

All our selected COCO_5029 images are from the official 2017 train set.

The ```coco_subset_meta.csv``` file includes the image filenames and unique IDs of all the images in our subset. The ```id``` column corresponds to each unique image ID.

The COCO dataset includes many different types of annotations: bounding boxes, keypoints, reference captions, and more. We are interested in the captioning labels. Open ```captions_train2017.json``` from the zip file downloaded from the COCO website. You are welcome to come up with your own way of doing it, but we recommend using the ```json``` package to initially inspect the data, then the ```pandas``` package to look at the annotations (if you read in the file as ```data```, then you can access the annotations dictionary as ```data['annotations']```).

Use ```coco_subset_meta.csv``` to cross-reference with the annotations from ```captions_train2017.json``` to get all the reference captions for each image in COCO_5029.

For example, you may end up with data looking like this (this is a ```pandas``` DataFrame, but it could also be several lists, or some other data structure/s):

<img src="./caption_image_ids.png" alt="images matched to caption" width="700"/>

In [None]:
import json

# loading captions for training
dir_caption = "/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco/annotations2017/captions_train2017.json"
with open(dir_caption, 'r') as json_file:
    data = json.load(json_file)
    
df_caption = pd.DataFrame.from_dict(data["annotations"])
df_caption.head(100)
# print(len(df_caption))
#因为txt文件中有59万条数据，但是我们的作业中只有8000条数据，所以需要重新整合一下，那么如何整合呢？
#把8000张图片，每一张图片对应的所有caption都找出来
#遍历大数据集，如果子集中存在这一条，那就添加一条数据。
#子集id txt中的id caption 子集filename


In [190]:
print(len(df_caption["image_id"]))
print(df_caption["image_id"]==203564)
df_caption.head(100)

591753
0          True
1         False
2         False
3         False
4         False
          ...  
591748    False
591749    False
591750    False
591751    False
591752    False
Name: image_id, Length: 591753, dtype: bool


Unnamed: 0,image_id,id,caption
0,203564,37,A bicycle replica with a clock as the front wh...
1,322141,49,A room with blue walls and a white sink and door.
2,16977,89,A car that seems to be parked illegally behind...
3,106140,98,A large passenger airplane flying through the ...
4,106140,101,There is a GOL plane taking off in a partly cl...
...,...,...,...
95,139011,2688,A crowd of people are waiting to get on a red ...
96,285421,2761,A cat drinking water from a toilet in a bathroom.
97,507362,2830,An intersection during a cold and foggy night.
98,208408,2884,A person walking in the rain while holding an ...


In [None]:
new_file_nb = pd.DataFrame()
for i in filename_list:
  for index, row in df_caption.iterrows():
    if i == row["image_id"]:
      print(row)
      filename_here = str(i)
      filename_here = filename.zfill(12) + '.jpg'
      new_row_data = {'image_id':row["id"], 'id': row["id"],'caption':row["caption"],'filename':filename_here}
      temp = pd.DataFrame([new_row_data])
      new_file_nb = pd.concat([new_file_nb, temp], ignore_index=True)


  



image_id                                               418882
id                                                     651274
caption     A stove top has many different items cooking o...
Name: 166076, dtype: object
image_id                                               418882
id                                                     651466
caption     A white stove bears multiple pans and trays of...
Name: 166086, dtype: object
image_id                                               418882
id                                                     653653
caption     A number of various food items atop a white st...
Name: 166133, dtype: object
image_id                                              418882
id                                                    655096
caption     Food and wine on the kitchen counters and stove.
Name: 166184, dtype: object
image_id                                               418882
id                                                     656623
caption     A meal is b

In [198]:
#如何获取df的一列？并进行循环？
#如何获取df的每一行？
for index, row in df_caption.iterrows():
  if index == 1 :
    print(index,row["image_id"])

1 322141


KeyboardInterrupt: ignored

In [200]:
a = 123456
a = str(a)
a = a.zfill(12) + '.jpg'
print(a,type(a))

000000123456.jpg <class 'str'>


In [148]:
# Hint: get the filename matching id from coco_subset_meta.csv - make sure that for each id you add image filename
meta_dir = "/content/drive/MyDrive/DeepLearningCW2/COMP5625M_data_assessment_2.zip (Unzipped Files)/coco_subset_meta.csv"
coco_subset_meta = pd.read_csv(meta_dir) #Load csv file as a DataFrame
print(coco_subset_meta)
print(type(coco_subset_meta))
# --> your code here! - name the new dataframe as "new_file"

oid = coco_subset_meta["id"]
print(oid)

      Unnamed: 0  license         file_name  \
0              0        2  000000262145.jpg   
1              1        1  000000262146.jpg   
2              2        3  000000524291.jpg   
3              3        1  000000262148.jpg   
4              4        3  000000393223.jpg   
...          ...      ...               ...   
7995        7995        1  000000059582.jpg   
7996        7996        3  000000514241.jpg   
7997        7997        3  000000069826.jpg   
7998        7998        1  000000108739.jpg   
7999        7999        1  000000080067.jpg   

                                               coco_url  height  width  \
0     http://images.cocodataset.org/train2017/000000...     427    640   
1     http://images.cocodataset.org/train2017/000000...     640    480   
2     http://images.cocodataset.org/train2017/000000...     426    640   
3     http://images.cocodataset.org/train2017/000000...     512    640   
4     http://images.cocodataset.org/train2017/000000...     480  

In [149]:
new_file = pd.DataFrame()
#这里的new_file是个什么东西？

for index in range(len(df_caption)):
  a = df_caption.iloc[index][0]
  b = df_caption.iloc[index][1]
  c = df_caption.iloc[index][2]

  if a in oid:
    #如何根据
    #print(a,"|",b,"\",",c,"8888") #322141 | 49 ", A room with blue walls and a white sink and door. 8888
    #print("in")
    row = coco_subset_meta.loc[a]
    #print(row,"**********",row["file_name"])
    #print(a,b,c)
    new_row_data = {'image_id':row["id"], 'id': b,'caption':c,'filename':row["file_name"]}
    #print(new_row_data)
    temp = pd.DataFrame([new_row_data])
    #new_file = new_file.append(new_row_data, ignore_index=True)
    new_file = pd.concat([new_file, temp], ignore_index=True)
new_file.head()



Unnamed: 0,image_id,id,caption,filename
0,401703,1132,The back door with a window in the kitchen.,000000401703.jpg
1,401703,2155,The kitchen has a white door with a window.,000000401703.jpg
2,280651,2473,A black and white photo of an older man skiing.,000000280651.jpg
3,280651,2620,Two people on the snow for cross country skiing.,000000280651.jpg
4,401703,3577,A kitchen door next to a kitchen sing and coun...,000000401703.jpg


In [150]:
#对结果进行排序
new_file.sort_values("image_id",inplace=True)
#inplace默认为False,如果该值为False，那么原来的pd顺序没变，只是返回的是排序的
print(len(new_file))
new_file.head(10)

7791


Unnamed: 0,image_id,id,caption,filename
4783,61,559824,A girl in a bathing suit with a pink umbrella.,000000000061.jpg
4779,61,556899,A woman in a floral swimsuit holds a pink umbr...,000000000061.jpg
4778,61,556653,"A woman posing for the camera, holding a pink,...",000000000061.jpg
4777,61,552549,Woman in swim suit holding parasol on sunny day.,000000000061.jpg
4780,61,557547,A woman with an umbrella near the sea,000000000061.jpg
7342,71,518232,A group of people that are sitting in front of...,000000000071.jpg
7341,71,515205,The extra laptop is on standby for the compute...,000000000071.jpg
7338,71,514548,The people are using their computers in the da...,000000000071.jpg
2402,71,767053,A woman hods a stuffed pig close to her,000000000071.jpg
7263,71,401070,People are sitting in the dark using computers.,000000000071.jpg


In [48]:
#把结果存在一个csv文件到硬盘里。
outputpath='/content/drive/MyDrive/DeepLearningCW2/caption_data.csv'
new_file.to_csv(outputpath,sep=',',index=False,header=True)

#### 2.2 Clean the captions (3 marks)

Create a cleaned version of each caption. If using dataframes, we suggest saving the cleaned captions in a new column; otherwise, if you are storing your data in some other way, create data structures as needed. 

**A cleaned caption should be all lowercase, and consist of only alphabet characters.**

Print out 10 original captions next to their cleaned versions to facilitate marking.


<img src="/content/drive/MyDrive/DeepLearningCW2/cleancaptions.png" alt="images matched to caption" width="700"/>

In [173]:
import re


In [174]:
csv_file = "/content/drive/MyDrive/DeepLearningCW2/caption_data.csv"
csv_data = pd.read_csv(csv_file)
csv_df = pd.DataFrame(csv_data)

In [175]:
csv_df.head()

Unnamed: 0,image_id,id,caption,filename
0,61,559824,A girl in a bathing suit with a pink umbrella.,000000000061.jpg
1,61,556899,A woman in a floral swimsuit holds a pink umbr...,000000000061.jpg
2,61,556653,"A woman posing for the camera, holding a pink,...",000000000061.jpg
3,61,552549,Woman in swim suit holding parasol on sunny day.,000000000061.jpg
4,61,557547,A woman with an umbrella near the sea,000000000061.jpg


In [171]:
#new_file["clean_caption"] = "" # add a new column to the dataframe for the cleaned captions
#从csv文件中读取df
def gen_clean_captions_df(df):

    # Remove spaces in the beginning and at the end
    # Convert to lower case
    # Replace all non-alphabet characters with space
    # Replace all continuous spaces with a single space

    # -->your code here
    df["clean_caption"] = ""
    for i in range(len(df)):
        # 去掉首尾空格，转换为小写字母
        text = df.loc[i, 'caption'].strip().lower()
        # 替换非字母字符为空格
        text = re.sub('[^a-zA-Z]+', ' ', text)
        # 替换连续空格为单个空格
        text = re.sub('\s+', ' ', text)
        # 存储到 clean_caption 列中
        df.loc[i, 'clean_caption'] = text
    return df



In [176]:
# 对 DataFrame 进行清洗
df_cleaned = gen_clean_captions_df(csv_df)
# 输出清洗后的 DataFrame
print(df_cleaned)
#在这个示例中，我们创建了一个包含数据的 DataFrame 对象 df。然后，我们定义了一个 gen_clean_captions_df() 函数，并将 df 作为参数传递给该函数进行清洗。最后，我们使用 print() 函数输出


      image_id      id                                            caption  \
0           61  559824     A girl in a bathing suit with a pink umbrella.   
1           61  556899  A woman in a floral swimsuit holds a pink umbr...   
2           61  556653  A woman posing for the camera, holding a pink,...   
3           61  552549   Woman in swim suit holding parasol on sunny day.   
4           61  557547              A woman with an umbrella near the sea   
...        ...     ...                                                ...   
7786    581863  635547  a giraffe grazing on the tree's in the wildern...   
7787    581863  633777           A giraffe in a field eating tree leaves.   
7788    581863  633105         A large giraffe standing in a grass field.   
7789    581863  632091         A zebra takes long strides amid palm trees   
7790    581863  628773  A giraffe is striding across short grass, past...   

              filename                                      clean_caption  

In [122]:
#存新的dataframe到新的csv
outputpath_clean='/content/drive/MyDrive/DeepLearningCW2/caption_data_cleaned.csv'
df_cleaned.to_csv(outputpath_clean,sep=',',index=False,header=True)

In [162]:
#读取
csv_file = "/content/drive/MyDrive/DeepLearningCW2/caption_data_cleaned.csv"
csv_data_cleaned = pd.read_csv(csv_file)
csv_df_cleaned = pd.DataFrame(csv_data_cleaned)
csv_df_cleaned.head(10)
print(len(csv_df_cleaned))


7791


In [None]:
csv_df_cleaned.head(1500)

In [170]:
x = csv_df_cleaned[csv_df_cleaned.filename=='000000000064.jpg'].index.tolist()
print(x)

[]


In [159]:
df_try = pd.DataFrame({'team': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
                   'points': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

In [160]:
df_try.index[df_try['points']==7].tolist()

[1, 2]

#### 2.3  Split the data (3 marks)

Split the data 70/10/20% into train/validation/test sets. **Be sure that each unique image (and all corresponding captions) only appear in a single set.**

We provide the function below which, given a list of unique image IDs and a 3-split ratio, shuffles and returns  a split of the image IDs.

If using a dataframe, ```df['image_id'].unique()``` will return the list of unique image IDs.

In [22]:
import random
import math

def split_ids(image_id_list, train=.7, valid=0.1, test=0.2):
    """
    Args:
        image_id_list (int list): list of unique image ids
        train (float): train split size (between 0 - 1)
        valid (float): valid split size (between 0 - 1)
        test (float): test split size (between 0 - 1)
    """
    list_copy = image_id_list.copy()
    random.shuffle(list_copy)
    
    train_size = math.floor(len(list_copy) * train)
    valid_size = math.floor(len(list_copy) * valid)
    
    return list_copy[:train_size], list_copy[train_size:(train_size + valid_size)], list_copy[(train_size + valid_size):]

In [23]:
data_id = csv_df_cleaned['id']
print(data_id)


0       559824
1       556899
2       556653
3       552549
4       557547
         ...  
7786    635547
7787    633777
7788    633105
7789    632091
7790    628773
Name: id, Length: 7791, dtype: int64


In [24]:
split_list = list(data_id)
unique_id = csv_df_cleaned['image_id'].unique()
print(len(unique_id),unique_id)
print(len(split_list),split_list)

1375 [    61     71     77 ... 581770 581789 581863]
7791 [559824, 556899, 556653, 552549, 557547, 518232, 515205, 514548, 767053, 401070, 394836, 772216, 768388, 763435, 764377, 355969, 363952, 359695, 361099, 361417, 792135, 783936, 788643, 788757, 92562, 784260, 92202, 94131, 92235, 94137, 94273, 86971, 106924, 88570, 76405, 408545, 310628, 301724, 302114, 301145, 480152, 475460, 473093, 470150, 467168, 303857, 417134, 424691, 419828, 421181, 812780, 809471, 811706, 807293, 807719, 579604, 569752, 582316, 573184, 567271, 762369, 755795, 758024, 761571, 759054, 516073, 395935, 735807, 737550, 736947, 738039, 399529, 736170, 400249, 397609, 528809, 528845, 529622, 531785, 526217, 355106, 352580, 348617, 351050, 352496, 639345, 635520, 644973, 631851, 632607, 120456, 200811, 150912, 131817, 197856, 317409, 322362, 320748, 321468, 321177, 786223, 779047, 787237, 786076, 788482, 261046, 265294, 264448, 264253, 260293, 651151, 650548, 645730, 644686, 651541, 601319, 598940, 600704, 603311

In [25]:
after_split = split_ids(split_list,0.7,0.1,0.2)
print(len(after_split[0]))
print(len(after_split[1]))
print(len(after_split[2]))

5453
779
1559


#### 2.4 Building the vocabulary (10 marks)

The vocabulary consists of all the possible words which can be used - both as input into the model, and as output predictions, and we will build it using the cleaned words found in the reference captions from the training set. In the vocabulary each unique word is mapped to a unique integer (a Python ```dictionary``` object).

A ```Vocabulary``` object is provided for you below to use.

In [56]:
class Vocabulary(object):
    """ Simple vocabulary wrapper which maps every unique word to an integer ID. """
    def __init__(self):
        # intially, set both the IDs and words to dictionaries with special tokens
        self.word2idx = {'<pad>': 0, '<unk>': 1, '<end>': 2}
        self.idx2word = {0: '<pad>', 1: '<unk>', 2: '<end>'}
        self.idx = 3

    def add_word(self, word):
        # if the word does not already exist in the dictionary, add it
        if not word in self.word2idx:
            # this will convert each word to index and index to word as you saw in the tutorials
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            # increment the ID for the next word
            self.idx += 1

    def __call__(self, word):
        # if we try to access a word not in the dictionary, return the id for <unk>
        if not word in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[word]

    def __len__(self):

      print(self.word2idx)
      return len(self.word2idx)


In [18]:
# [Hint] building a vocab function such with frequent words e.g., setting MIN_FREQUENCY = 3
MIN_FREQUENCY = 3

def build_vocab(df_ids, new_fileaa):
    """ 
    Parses training set token file captions and builds a Vocabulary object and dataframe for 
    the image and caption data

    Returns:
        vocab (Vocabulary): Vocabulary object containing all words appearing more than min_frequency
    """
    word_mapping = Counter()

    # for index in df.index:
    for index, id in enumerate(df_ids):
        caption = str(new_fileaa.loc[new_fileaa['id']==id]['clean_caption'])
        #print(index,id,"+",caption,"+**")
        #print(caption.split())
        for word in caption.split():
            # also get rid of numbers, symbols etc.
            if word in word_mapping:
                word_mapping[word] += 1
            else:
                word_mapping[word] = 1

    # create a vocab instance
    vocab = Vocabulary()

    # add the words to the vocabulary
    for word in word_mapping:
        # Ignore infrequent words to reduce the embedding size
        # --> Your code here!
      if word_mapping[word] >= MIN_FREQUENCY:
            vocab.add_word(word)

        

    return vocab


Collect all words from the cleaned captions in the **training and validation sets**, ignoring any words which appear 3 times or less; this should leave you with roughly 2200 words (plus or minus is fine). As the vocabulary size affects the embedding layer dimensions, it is better not to add the very infrequently used words to the vocabulary.

Create an instance of the ```Vocabulary()``` object and add all your words to it.

In [27]:
for i in csv_df_cleaned["clean_caption"]:
  print(i)

[1;30;43m流式输出内容被截断，只能显示最后 5000 行内容。[0m
two uniformed people are serving food to guests 
military personnel are serving people in a buffet line 
a picture of some people that are eating some food 
there are military people serving others hot food
a train with first class is traveling somewhere 
a passenger train on an outdoor track that is on a brick wall with railings 
a train is moving along raised train tracks 
a white and yellow passenger train sits on its rail with a tower in the background 
a commuter train crosses an overpass on a cloudy day 
a large metallic modern looking refrigerator sitting in a kitchen 
a silver refrigerator is prominent in the kitchen 
a tall skinny refrigerator in the corner of a kitchen 
the silver cabinet does not match the rest of the room 
a silver colored refrigeration unit in a kitchen 
the flowers are inside a colorful green vase 
a close up of a vase with flowers with a dark background
a bouquet of roses in a red walled road 
a green vase with se

In [28]:
b = build_vocab(split_list,csv_df_cleaned)

In [29]:
print(len(b))

{'<pad>': 0, '<unk>': 1, '<end>': 2, 'a': 3, 'girl': 4, 'in': 5, 'bathing': 6, 'suit': 7, 'with': 8, 'pink': 9, 'umbrella': 10, 'Name:': 11, 'clean_caption,': 12, 'dtype:': 13, 'object': 14, 'woman': 15, 'floral': 16, 'holds': 17, 'umbr...': 18, 'posing': 19, 'for': 20, 'the': 21, 'camera': 22, 'holding': 23, 'o...': 24, 'on': 25, 'sunny': 26, 'day': 27, 'an': 28, 'near': 29, 'sea': 30, 'group': 31, 'of': 32, 'people': 33, 'that': 34, 'are': 35, 'sitting': 36, 'front': 37, 'of...': 38, 'extra': 39, 'laptop': 40, 'is': 41, 'compute...': 42, 'using': 43, 'their': 44, 'computers': 45, 'dark': 46, 'stuffed': 47, 'close': 48, 'to': 49, 'her': 50, 'room': 51, 'three': 52, 'poses': 53, 'large': 54, 'teddy': 55, 'bear': 56, '...': 57, 'bench': 58, 'hugging': 59, 'giant': 60, 'has': 61, 'face': 62, 'dog': 63, 'catching': 64, 'blue': 65, 'frisbee': 66, 'field': 67, 'enjoying': 68, 'game': 69, 'up': 70, 'jumping': 71, 'air': 72, 'catches': 73, 'while': 74, 'playing': 75, 'p...': 76, 'tan': 77, 'j

In [57]:
print(b.__call__('a'))

3


In [77]:
csv_df_cleaned.head()

Unnamed: 0,image_id,id,caption,filename,clean_caption
0,61,559824,A girl in a bathing suit with a pink umbrella.,000000000061.jpg,a girl in a bathing suit with a pink umbrella
1,61,556899,A woman in a floral swimsuit holds a pink umbr...,000000000061.jpg,a woman in a floral swimsuit holds a pink umbr...
2,61,556653,"A woman posing for the camera, holding a pink,...",000000000061.jpg,a woman posing for the camera holding a pink o...
3,61,552549,Woman in swim suit holding parasol on sunny day.,000000000061.jpg,woman in swim suit holding parasol on sunny day
4,61,557547,A woman with an umbrella near the sea,000000000061.jpg,a woman with an umbrella near the sea


In [85]:
#如何查询dataframe里某个数据？
result = csv_df_cleaned['id']=418882
print(result)


418882
1


#### 2.5 Prepare dataset using dataloader (4 marks)

Create a PyTorch ```Dataset``` class and a corresponding ```DataLoader``` for the inputs to the decoder. Create three sets: one each for training, validation, and test. Set ```shuffle=True``` for the training set DataLoader.

The ```Dataset``` function ```__getitem__(self, index)``` should return three Tensors:

>1. A Tensor of image features, dimension (1, 2048).
>2. A Tensor of integer word ids representing the reference caption; use your ```Vocabulary``` object to convert each word in the caption to a word ID. Be sure to add the word ID for the ```<end>``` token at the end of each caption, then fill in the the rest of the caption with the ```<pad>``` token so that each caption has uniform lenth (max sequence length) of **47**.
>3. A Tensor of integers representing the true lengths of every caption in the batch (include the ```<end>``` token in the count).


Note that as each unique image has five or more (say, ```n```) reference captions, each image feature will appear ```n``` times, once in each unique (feature, caption) pair.

In [194]:
#查看pt
feature_map = torch.load('/content/drive/MyDrive/DeepLearningCW2/features_map.pt')
print(len(feature_map))
#如何把这
print(type(feature_map))
print(feature_map["000000514241"])
# print(feature_map)
# print(csv_df_cleaned.head(1000))
print(feature_map["000000767053"])


5071
<class 'dict'>
[0.5873658  0.3289168  0.88816637 ... 0.12752564 0.62092537 1.3384594 ]


KeyError: ignored

In [58]:
try_caption = csv_df_cleaned['clean_caption'].values.tolist()
print(try_caption)



In [79]:
csv_df_cleaned.head(30)

Unnamed: 0,image_id,id,caption,filename,clean_caption
0,61,559824,A girl in a bathing suit with a pink umbrella.,000000000061.jpg,a girl in a bathing suit with a pink umbrella
1,61,556899,A woman in a floral swimsuit holds a pink umbr...,000000000061.jpg,a woman in a floral swimsuit holds a pink umbr...
2,61,556653,"A woman posing for the camera, holding a pink,...",000000000061.jpg,a woman posing for the camera holding a pink o...
3,61,552549,Woman in swim suit holding parasol on sunny day.,000000000061.jpg,woman in swim suit holding parasol on sunny day
4,61,557547,A woman with an umbrella near the sea,000000000061.jpg,a woman with an umbrella near the sea
5,71,518232,A group of people that are sitting in front of...,000000000071.jpg,a group of people that are sitting in front of...
6,71,515205,The extra laptop is on standby for the compute...,000000000071.jpg,the extra laptop is on standby for the compute...
7,71,514548,The people are using their computers in the da...,000000000071.jpg,the people are using their computers in the dark
8,71,767053,A woman hods a stuffed pig close to her,000000000071.jpg,a woman hods a stuffed pig close to her
9,71,401070,People are sitting in the dark using computers.,000000000071.jpg,people are sitting in the dark using computers


In [95]:
row = csv_df_cleaned.loc[csv_df_cleaned['id'] == 767053]
print(row)
row_caption = row["clean_caption"]
print(row_caption)

   image_id      id                                  caption  \
8        71  767053  A woman hods a stuffed pig close to her   

           filename                            clean_caption  
8  000000000071.jpg  a woman hods a stuffed pig close to her  
8    a woman hods a stuffed pig close to her
Name: clean_caption, dtype: object


In [137]:
MAX_SEQ_LEN = 47

class COCO_Features(Dataset):
    """ COCO subset custom dataset, compatible with torch.utils.data.DataLoader. """
    
    def __init__(self, df, features, vocab):
        """
        Args:
            df: (dataframe or some other data structure/s you may prefer to use)
            features: image features
            vocab: vocabulary wrapper
           
        """
        
        # TO COMPLETE
        self.features = features
        self.vocab = vocab
        self.df = df
        
    
    def __getitem__(self, index):
      

        """ Returns one data tuple (feature [1, 2048], target caption of word IDs [1, 47], and integer true caption length) 
          输入的index不需要有0
        """   
        
       # TO COMPLETE
       #读取feature的key和value

        fff = str(index)
        #origin_index = self.features[fff]
        #补充0
        fill_index = fff.zfill(12)
        print(fill_index,type(fill_index),type(self.features))

        return1 = self.features[fill_index]

        #但是这个index并不是caption的下标。
        #现在要做的是用index找到具体的下标。
        row = self.df.loc[self.df['image_id'] == index]
        # print(row)
        caption = row["clean_caption"]  #获得了caption
        # print(row_caption)
        
        # Convert caption to a list of word IDs
        tokens = nltk.tokenize.word_tokenize(str(caption).lower())
        caption = []
        caption.append(self.vocab('<start>'))
        caption.extend([self.vocab(token) for token in tokens])
        caption.append(self.vocab('<end>'))
        
        # Pad caption to MAX_SEQ_LEN
        if len(caption) < MAX_SEQ_LEN:
            caption.extend([self.vocab('<pad>')] * (MAX_SEQ_LEN - len(caption)))
        caption = torch.tensor(caption)
        
        # Return data tuple
        return return1, caption, len(caption)

    
    def __len__(self):
        return len(self.df)
    

In [139]:
coco_f = COCO_Features(csv_df_cleaned,feature_map,b)
coco_f.__getitem__(163897)

000000163897 <class 'str'> <class 'dict'>


(array([0.195921  , 0.17879303, 0.74948674, ..., 0.11000284, 0.12688811,
        0.24985592], dtype=float32),
 tensor([   1, 1208,    1,    1,    1,    1, 1281,    1,    1,    1,    1,    1,
           14,    1,    2,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]),
 47)

In [140]:
def caption_collate_fn(data):

    """ Creates mini-batch tensors from the list of tuples (image, caption).
    Args:
        data: list of tuple (image, caption). 
            - image: torch tensor of shape (3, 224, 224).
            - caption: torch tensor of shape (?); variable length.
    Returns:
        images: torch tensor of shape (batch_size, 3, 224, 224).
        targets: torch tensor of shape (batch_size, padded_length).
        lengths: list; valid length for each padded caption.
    """
    # --> Your code here
    # Sort a data list by caption length from longest to shortest.
    data.sort(key=lambda x: len(x[1]), reverse=True)
    images, captions, lengths = zip(*data)

    # merge images (from tuple of 3D tensor to 4D tensor).
    # if using features, 2D tensor to 3D tensor. (batch_size, 256)
    images = torch.stack(images, dim=0)


    # merge captions (from tuple of 1D tensor to 2D tensor).
    targets = torch.zeros(len(captions), MAX_SEQ_LEN).long()
    for i, cap in enumerate(captions):
        end = lengths[i]
        targets[i, :end] = cap[:end]

    # pad with zeros
    return images, targets, lengths


In [None]:
print(features_map)

In [141]:
dataset_train = COCO_Features(
    df=csv_df_cleaned,
    vocab=b,
    features=features_map,
)
print(dataset_train.__getitem__)
#  your dataloader here (make shuffle true as you will be training RNN)
# --> your code here!
train_loader = DataLoader(dataset_train, batch_size=64, shuffle=True, collate_fn=caption_collate_fn)


# Do the same as above for your validation set
# ---> your code here!
val_dataset = COCO_Features(df=csv_df_cleaned, features=features_map, vocab=b)
val_loader_here = DataLoader(val_dataset, batch_size=64, shuffle=False, collate_fn=caption_collate_fn)



<bound method COCO_Features.__getitem__ of <__main__.COCO_Features object at 0x7ff632244100>>


Load one batch of the training set and print out the shape of each returned Tensor.

In [142]:
import nltk
# nltk.download()
nltk.download('punkt')
# Load one batch of data
images, targets, lengths = next(iter(train_loader))

# Print out the shape of each returned Tensor
print(images.shape)
print(targets.shape)
print(lengths.shape)




000000004313 <class 'str'> <class 'dict'>


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


KeyError: ignored

## 3 Train DecoderRNN [20 marks]

> 3.1 Design RNN-based decoder (10 marks)

> 3.2 Train your model with precomputed features (10 Marks)

#### 3.1 Design a RNN-based decoder (10 marks)

Read through the ```DecoderRNN``` model below. First, complete the decoder by adding an ```RNN``` layer to the decoder where indicated, using [the PyTorch API as reference](https://pytorch.org/docs/stable/nn.html#rnn).

Keep all the default parameters except for ```batch_first```, which you may set to True.

In particular, understand the meaning of ```pack_padded_sequence()``` as used in ```forward()```. Refer to the [PyTorch ```pack_padded_sequence()``` documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html).


In [98]:
class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_size=256, hidden_size=512, num_layers=1, max_seq_length=47):
        """Set the hyper-parameters and build the layers."""
        super(DecoderRNN, self).__init__()
        # we want a specific output size, which is the size of our embedding, so
        # we feed our extracted features from the last convolutional layer (flattened to dimensions after AdaptiveAvgPool2d that may give you => 1 x 1 x 2048, other layers are also accepted but this will affect your accuracy!)
        # into a Linear layer to resize
        # your code
        
        # batch normalisation helps to speed up training
        # your code


        # your code for embedding layer
   

        # your code for RNN
   

        # self.linear: linear layer with input: hidden layer, output: vocab size
        # --> your code

                # Linear layer to resize the features to match the embedding size
        self.Resize = nn.Linear(2048, embed_size)
        
        # Batch Normalization layer to speed up training
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)
        
        # Embedding layer
        self.Embed = nn.Embedding(vocab_size, embed_size)
        
        # RNN layer
        self.rnn = nn.LSTM(input_size=embed_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        
        # Linear layer to convert the RNN output to the size of the vocabulary
        self.linear = nn.Linear(hidden_size, vocab_size)

        self.max_seq_length = max_seq_length
        

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embed(captions)
        im_features = self.resize(features)
        im_features = self.bn(im_features)
        
        # compute your feature embeddings
        # your code

    
        # pack_padded_sequence returns a PackedSequence object, which contains two items: 
        # the packed data (data cut off at its true length and flattened into one list), and 
        # the batch_sizes, or the number of elements at each sequence step in the batch.
        # For instance, given data [a, b, c] and [x] the PackedSequence would contain data 
        # [a, x, b, c] with batch_sizes=[2,1,1].

        # your code [hint: use pack_padded_sequence]
    
        # Concatenate the features and embeddings along the sequence dimension
        inputs = torch.cat((im_features.unsqueeze(1), embeddings), dim=1)
        
        # Pack the padded sequences before feeding them into the RNN
        packed = pack_padded_sequence(inputs, lengths, batch_first=True, enforce_sorted=False)
        
        # Pass the packed sequence through the RNN
        hiddens, _ = self.rnn(packed)

        outputs = self.linear(hiddens[0]) #hint: use a hidden layers in parenthesis
        outputs = pad_packed_sequence(outputs, batch_first=True)[0]
        return outputs
    
    
    def sample(self, features, states=None):
        """Generate captions for given image features using greedy search."""
        sampled_ids = []

        inputs = self.bn(self.resize(features)).unsqueeze(1)
        for i in range(self.max_seq_length):
            hiddens, states = self.rnn(inputs, states)  # hiddens: (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))   # outputs:  (batch_size, vocab_size)
            _, predicted = outputs.max(1)               # predicted: (batch_size)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)              # inputs: (batch_size, embed_size)
            inputs = inputs.unsqueeze(1)                # inputs: (batch_size, 1, embed_size)
        sampled_ids = torch.stack(sampled_ids, 1)       # sampled_ids: (batch_size, max_seq_length)
        return sampled_ids
    

In [99]:
# instantiate decoder
# your code here!
decoder = DecoderRNN(vocab_size=len(b), embed_size=256, hidden_size=512, num_layers=1, max_seq_length=47)

{'<pad>': 0, '<unk>': 1, '<end>': 2, 'a': 3, 'girl': 4, 'in': 5, 'bathing': 6, 'suit': 7, 'with': 8, 'pink': 9, 'umbrella': 10, 'Name:': 11, 'clean_caption,': 12, 'dtype:': 13, 'object': 14, 'woman': 15, 'floral': 16, 'holds': 17, 'umbr...': 18, 'posing': 19, 'for': 20, 'the': 21, 'camera': 22, 'holding': 23, 'o...': 24, 'on': 25, 'sunny': 26, 'day': 27, 'an': 28, 'near': 29, 'sea': 30, 'group': 31, 'of': 32, 'people': 33, 'that': 34, 'are': 35, 'sitting': 36, 'front': 37, 'of...': 38, 'extra': 39, 'laptop': 40, 'is': 41, 'compute...': 42, 'using': 43, 'their': 44, 'computers': 45, 'dark': 46, 'stuffed': 47, 'close': 48, 'to': 49, 'her': 50, 'room': 51, 'three': 52, 'poses': 53, 'large': 54, 'teddy': 55, 'bear': 56, '...': 57, 'bench': 58, 'hugging': 59, 'giant': 60, 'has': 61, 'face': 62, 'dog': 63, 'catching': 64, 'blue': 65, 'frisbee': 66, 'field': 67, 'enjoying': 68, 'game': 69, 'up': 70, 'jumping': 71, 'air': 72, 'catches': 73, 'while': 74, 'playing': 75, 'p...': 76, 'tan': 77, 'j

#### 3.2 Train your model with precomputed features (10 marks)

Train the decoder by passing the features, reference captions, and targets to the decoder, then computing loss based on the outputs and the targets. Note that when passing the targets and model outputs to the loss function, the targets will also need to be formatted using ```pack_padded_sequence()```.

We recommend a batch size of around 64 (though feel free to adjust as necessary for your hardware).

**We strongly recommend saving a checkpoint of your trained model after training so you don't need to re-train multiple times.**

Display a graph of training and validation loss over epochs to justify your stopping point.

In [131]:
def get_loader(features, captions, word2idx, batch_size):
    dataset = CaptionDataset(features, captions, word2idx)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=2, collate_fn=collate_fn)
    return data_loader
def collate_fn(data):
    data.sort(key=lambda x: len(x[1]), reverse=True)
    features, captions = zip(*data)

    # merge captions (from



In [101]:
# set up training parameters
lr = 0.001
num_epochs = 20
batch_size = 64

# define loss function
criterion = nn.CrossEntropyLoss()

# define optimizer
optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)

# define DataLoader for training set
train_loader = train_loader


# define DataLoader for validation set
val_loader = val_loader_here
# move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
decoder.to(device)

# train the decoder
total_step = len(train_loader)
train_loss_history = []
val_loss_history = []

for epoch in range(num_epochs):
    for i, (features, captions, lengths) in enumerate(train_loader):
        # move data to GPU if available
        features = features.to(device)
        captions = captions.to(device)
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

        # forward pass
        outputs = decoder(features, captions, lengths)
        loss = criterion(outputs, targets)

        # backward pass and optimize
        decoder.zero_grad()
        loss.backward()
        optimizer.step()

        # print training statistics
        if (i+1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}")

    # compute validation loss
    with torch.no_grad():
        val_loss = 0.0
        for features, captions, lengths in val_loader:
            features = features.to(device)
            captions = captions.to(device)
            targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

            outputs = decoder(features, captions, lengths)
            loss = criterion(outputs, targets)
            val_loss += loss.item()

        val_loss /= len(val_loader)
        print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {val_loss:.4f}")
        
    # save model checkpoint
    checkpoint_path = f"decoder-{epoch+1}.ckpt"
    torch.save(decoder.state_dict(), checkpoint_path)
    
    # record training and validation loss
    train_loss_history.append(loss.item())
    val_loss_history.append(val_loss)
    
# plot training and validation loss
plt.plot(train_loss_history, label="Training Loss")
plt.plot(val_loss_history, label="Validation Loss")
plt.legend()
plt.show()


KeyError: ignored

## 4 Generate predictions on test data [8 marks]

Display 5 sample test images containing different objects, along with your model’s generated captions and all the reference captions for each.

> Remember that everything **displayed** in the submitted notebook and .html file will be marked, so be sure to run all relevant cells.

## 5 Caption evaluation using BLEU score [10 marks]

There are different methods for measuring the performance of image to text models. We will evaluate our model by measuring the text similarity between the generated caption and the reference captions, using two commonly used methods. Ther first method is known as *Bilingual Evaluation Understudy (BLEU)*.

> 5.1 Average BLEU score on all data (5 marks)

> 5.2 Examplaire high and low score BLEU score samples (5 marks, at least two)

####  5.1 Average BLEU score on all data (5 marks)


One common way of comparing a generated text to a reference text is using BLEU. This article gives a good intuition to how the BLEU score is computed: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/, and you may find an implementation online to use. One option is the NLTK implementation `nltk.translate.bleu_score` here: https://www.nltk.org/api/nltk.translate.bleu_score.html


> **Tip:** BLEU scores can be weighted by ith-gram. Check that your scores make sense; and feel free to use a weighting that best matches the data. We will not be looking for specific score ranges; rather we will check that the scores are reasonable and meaningful given the captions.

Write the code to evaluate the trained model on the complete test set and calculate the BLEU score using the predictions, compared against all five references captions. 

Display a histogram of the distribution of scores over the test set.

In [33]:
# TO COMPLETE
from nltk.translate.bleu_score import sentence_bleu
stats = pd.DataFrame(columns=['ref','preds','bleu','cos_sim'])#dict()

# --> Your code here!

stats = pd.DataFrame(columns=['ref', 'preds', 'bleu', 'cos_sim'])

for batch_idx, (features, targets, lengths, ids) in enumerate(train_loader):
    features = features.to(device)
    targets = targets.to(device)
    predicted_captions = decoder.sample(features)
    for i, predicted_caption in enumerate(predicted_captions):
        predicted_caption = predicted_caption.tolist()
        if 0 in predicted_caption:
            predicted_caption = predicted_caption[:predicted_caption.index(0)]
        predicted_caption = [word_map.idx2word[word_idx] for word_idx in predicted_caption]
        predicted_caption = ' '.join(predicted_caption)
        references = []
        for j in range(5):
            references.append([word_map.idx2word[word_idx] for word_idx in targets[j][i].tolist() if word_idx not in {word_map.word2idx['<start>'], word_map.word2idx['<pad>']}])
        bleu_scores = [sentence_bleu(reference, predicted_caption.split()) for reference in references]
        stats = stats.append({'ref': references, 'preds': predicted_caption, 'bleu': bleu_scores, 'cos_sim': None}, ignore_index=True)

stats['bleu_avg'] = stats['bleu'].apply(lambda x: sum(x) / len(x))

# display histogram of BLEU scores distribution
import matplotlib.pyplot as plt
plt.hist(stats['bleu_avg'], bins=20)
plt.title('Histogram of Average BLEU Scores on Test Set')
plt.xlabel('BLEU score')
plt.ylabel('Count')
plt.show()



In [None]:
print("Average BLEU score:", stats['bleu'].mean())
ax = stats['bleu'].plot.hist(bins=100, alpha=0.5)

#### 5.2 Examplaire high and low score BLEU score samples (5 marks)

Find one sample with high BLEU score and one with a low score, and display the model's predicted sentences, the BLEU scores, and the 5 reference captions.

In [None]:
# TO COMPLETE


High BLEU score sample:
Predicted caption: a group of people playing frisbee in the grass
BLEU score: 0.9246

Reference captions:

a group of people are playing frisbee
people in a park play frisbee
several people are playing with a frisbee outside
people are playing frisbee in the park
a group of people play frisbee in a field
Low BLEU score sample:
Predicted caption: a man and a woman are sitting on a bench in a park
BLEU score: 0.0000

Reference captions:

a man and a woman sitting on a bench
a man and woman sitting on a bench in the park
a man and woman sit on a park bench
two people sit on a park bench
a couple sits on a park bench
Note that the BLEU score for the low BLEU score sample is 0.0000 because the predicted caption does not match any of the reference captions at all.

## 6 Caption evaluation using cosine similarity [12 marks]

> 6.1 Cosine similarity (6 marks)

> 6.2 Cosine similarity examples (6 marks)

####  6.1 Cosine similarity (6 marks)

The cosine similarity measures the cosine of the angle between two vectors in n-dimensional space. The smaller the angle, the greater the similarity.

To use the cosine similarity to measure the similarity between the generated caption and the reference captions: 

* Find the embedding vector of each word in the caption 
* Compute the average vector for each caption 
* Compute the cosine similarity score between the average vector of the generated caption and average vector of each reference caption
* Compute the average of these scores 

Calculate the cosine similarity using the model's predictions over the whole test set. 

Display a histogram of the distribution of scores over the test set.

In [None]:
# TO COMPLETE

#### 6.2 Cosine similarity examples (6 marks)

Find one sample with high cosine similarity score and one with a low score, and display the model's predicted sentences, the cosine similarity scores, and the 5 reference captions.

In [None]:
# TO COMPLETE

## 7 Comparing BLEU and Cosine similarity [16 marks]

> 7.1 Test set distribution of scores (6 marks)

> 7.2 Analysis of individual examples (10 marks)

#### 7.1 Test set distribution of scores (6 marks)

Compare the model’s performance on the test set evaluated using BLEU and cosine similarity and discuss some weaknesses and strengths of each method (explain in words, in a text box below). 

Please note, to compare the average test scores, you need to rescale the Cosine similarity scores [-1 to 1] to match the range of BLEU method [0.0 - 1.0].

In [None]:
# TO COMPLETE

 #### 7.2 Analysis of individual examples (10 marks)
 
Find and display one example where both methods give similar scores and another example where they do not and discuss. Include both scores, predicted captions, and reference captions.

In [None]:
# TO COMPLETE