# CSGY 6953 Deep Learning Final Project
In this project, we implement a dual-encoder model for image search. Given a text query, the model returns several images related to that query. To do this, the model is trained to embed both image and text data into the same space so that related items lie close to each other in that space. We build two encoders, one for images and one for text, and train them with a similarity-based loss function. <br>
<br>
In this notebook, I download the COCO image data to my Google Drive and construct the dataset for training the dual-encoder.
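To make the idea of a similarity-based loss concrete, here is a minimal NumPy sketch. It assumes a symmetric cross-entropy over cosine similarities, which is one common choice for dual encoders. The project text does not say which loss is used, so the function names and the `temperature` value below are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = text_emb @ image_emb.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Each caption should pick its own image (rows) and each image its own caption (columns).
    text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (text_to_image + image_to_text)
```

The loss is small when each caption embedding is most similar to its own image embedding, and it grows when a caption sits closer to a different image.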

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import requests
import json
import cv2
import random
import tqdm
import pickle

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
with open("drive/My Drive/finalproj/data/nocaps_val_4500_captions.json", "r") as f:
    captions = json.load(f)

In [None]:
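# Download every image to Google Drive. This takes approximately 2 hours.
for image in tqdm.tqdm(captions["images"]):
    response = requests.get(image["coco_url"], timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page as an image
    # Files are named by image id and have no extension.
    with open("drive/My Drive/finalproj/data/images/"+str(image["id"]), "wb") as handler:
        handler.write(response.content)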
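Because the download runs for about two hours, a Colab disconnect partway through is a real risk. The sketch below shows one way to make the loop resumable: it skips files that already exist and records failures instead of stopping. The helper names `download_missing` and `fetch_with_requests` are hypothetical and are not part of the original notebook.

```python
import os

def download_missing(images, out_dir, fetch):
    """Download each image unless a non-empty file for its id already exists.

    `images` is a list of dicts with "id" and "coco_url" keys.
    `fetch` is any callable that maps a URL to the raw bytes of the image.
    Returns a list of (image_id, error_message) for downloads that failed.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for image in images:
        path = os.path.join(out_dir, str(image["id"]))
        if os.path.exists(path) and os.path.getsize(path) > 0:
            continue  # already downloaded in an earlier run
        try:
            data = fetch(image["coco_url"])
        except Exception as exc:
            failed.append((image["id"], str(exc)))
            continue
        with open(path, "wb") as handler:
            handler.write(data)
    return failed

def fetch_with_requests(url, timeout=30):
    # Imported lazily so the helper above can be used without requests installed.
    import requests
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

In the notebook this could be called as `download_missing(captions["images"], "drive/My Drive/finalproj/data/images", fetch_with_requests)`. Running it again retries only the images that are still missing.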

In [None]:
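image_size = (299, 299)  # (width, height) for cv2.resize
images = []
# Assumes image ids run contiguously from 0 up to the id of the last entry.
for i in tqdm.tqdm(range(captions["images"][-1]["id"]+1)):
    img = cv2.imread("drive/My Drive/finalproj/data/images/"+str(i))  # BGR array, or None on failure
    if img is None:
        raise FileNotFoundError(f"Could not read image {i}")
    images.append(cv2.resize(img, image_size))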

100%|██████████| 4500/4500 [29:30<00:00,  2.54it/s]
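The loading loop iterates over `range(last_id + 1)`, so it relies on the image ids being exactly 0, 1, ..., N-1. A small check like the following can confirm the contiguity part of that assumption before the 30-minute loading pass starts. The function name `check_contiguous_ids` is hypothetical.

```python
def check_contiguous_ids(image_records):
    """Return (ok, missing, unexpected) for the ids in a list of image records.

    ok is True only when the ids are exactly 0..N-1 with no gaps or duplicates.
    """
    ids = sorted(record["id"] for record in image_records)
    expected = list(range(len(ids)))
    missing = sorted(set(expected) - set(ids))
    unexpected = sorted(set(ids) - set(expected))
    return ids == expected, missing, unexpected
```

In the notebook, `check_contiguous_ids(captions["images"])` would be the natural call. Note that the loop also assumes the last record carries the largest id, which this check does not cover.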


In [None]:
# Collect captions in annotation order. The image each caption belongs to is not stored here.
texts = []
for dic in tqdm.tqdm(captions["annotations"]):
    texts.append(dic["caption"])

100%|██████████| 45000/45000 [00:00<00:00, 957924.42it/s]
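There are 45000 captions for 4500 images, so training needs a way to tell which image each caption describes. The sketch below keeps that link explicitly. It assumes each annotation carries an `image_id` field, as in the standard COCO captions format. This notebook never reads that field, so treat its presence as an assumption to verify against the JSON file.

```python
from collections import defaultdict

def pair_captions_with_images(annotations):
    """Return parallel lists: texts[k] is a caption and image_ids[k] is its image id."""
    texts, image_ids = [], []
    for annotation in annotations:
        texts.append(annotation["caption"])
        image_ids.append(annotation["image_id"])  # assumed COCO-style field
    return texts, image_ids

def group_captions_by_image(annotations):
    """Return a dict that maps each image id to the list of its captions."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation["image_id"]].append(annotation["caption"])
    return dict(grouped)
```

With `texts, image_ids = pair_captions_with_images(captions["annotations"])`, the matching image for `texts[k]` is `images[image_ids[k]]`, provided the image ids are contiguous from 0 as the loading loop assumes.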


In [None]:
with open("drive/My Drive/finalproj/data/dataset.pkl", "wb") as f:
    pickle.dump({"images": images, "texts": texts}, f)
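It helps to know roughly how large this pickle will be before writing it to Drive. The estimate below assumes each image is a 299x299 array with 3 channels of 8-bit values, which is what `cv2.imread` returns by default, and uses the count of 4500 images shown in the progress bar. The function name `raw_image_bytes` is hypothetical, and the figure ignores pickle overhead and the caption strings.

```python
def raw_image_bytes(num_images, width, height, channels=3, bytes_per_value=1):
    # Size of the raw pixel data only, without pickle framing or Python object overhead.
    return num_images * width * height * channels * bytes_per_value

total = raw_image_bytes(4500, 299, 299)
print(f"{total} bytes, about {total / 2**30:.2f} GiB")
# prints: 1206913500 bytes, about 1.12 GiB
```

At just over 1 GiB, the whole dataset has to fit in memory both when it is dumped and when it is loaded back for training.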