### The Challenge: Build a large-scale image search engine!

You and your team of **three Cornell Tech students** are surely on the path to fame and fortune! You have been recruited by Google to disrupt Google Image Search by building a better search engine using novel statistical learning techniques.

The specifications are simple: We need a way to **search for relevant images** given a natural language query. For instance, if a user types "dog jumping to catch frisbee," your system will **rank-order the most relevant images** from a large database.

---


**During training**, you have a dataset of 10,000 samples. 

Each sample has the following data available for learning:
- A 224x224 JPG image.
- A list of tags indicating the objects that appear in the image.
- Feature vectors extracted using [ResNet](https://arxiv.org/abs/1512.03385), a state-of-the-art deep CNN (you don't have to train or run ResNet -- we are providing the features for you). See [here](http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50) for an illustration of the ResNet-101 architecture. The features are extracted from the pool5 and fc1000 layers.
- A five-sentence description, used to train your search engine.

**During testing**, your system matches a single five-sentence description against a pool of 2,000 candidate samples from the test set. 

Each sample has:
- A 224x224 JPEG image.
- A list of tags for that image.
- ResNet feature vectors for that image.

**Output**:
For each description, your system must score every test image by how likely it is to match that description. Your system then returns the names of the 20 most relevant images, delimited by spaces. See "sample_submission.csv" on the data page for more details on the output format.
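As a minimal sketch of this output step, assume you already have one score per test image for a given description (higher means more relevant). The function name `top_k_names` and the toy scores below are hypothetical, and the exact CSV columns should be taken from "sample_submission.csv" rather than from this example.

```python
import numpy as np

def top_k_names(scores, image_names, k=20):
    """Return the names of the k highest-scoring images, space-delimited."""
    scores = np.asarray(scores, dtype=float)
    # Sorting the negated scores yields indices from best to worst;
    # a stable sort keeps tied images in their original order.
    order = np.argsort(-scores, kind="stable")[:k]
    return " ".join(image_names[i] for i in order)

# Toy data: four candidate images and made-up scores for one description.
names = ["0.jpg", "1.jpg", "2.jpg", "3.jpg"]
scores = [0.1, 0.9, 0.4, 0.7]
print(top_k_names(scores, names, k=3))  # prints "1.jpg 3.jpg 2.jpg"
```

With the real data, `k` stays at its default of 20 and `scores` would hold 2,000 values, one per test image.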

**Evaluation metric**:
There are 2,000 test descriptions. For each one, you must score it against the entire 2,000-image test set and rank-order the test images. We will use **MAP@20** as the evaluation metric: if the image corresponding to a description is among your algorithm's 20 highest-scoring images, you receive a score that depends on that image's rank. Please refer to the evaluation page for more details.

Use all of your skills, tools, and experience. It is OK to use libraries like numpy, scikit-learn, pandas, etc., as long as you cite them. Use cross-validation on the training set to debug your algorithm. Submit your results to the Kaggle leaderboard and send your complete writeup to CMS. The data you use, and the way you use it, is completely up to you.
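For local cross-validation it can help to have a stand-in for MAP@20. The sketch below assumes the textbook definition of average precision, which reduces to 1/rank when each description has exactly one correct image. The competition's evaluation page is authoritative and may weight ranks differently, so treat this as an approximation. The function names are hypothetical.

```python
def average_precision_at_k(ranked_names, true_name, k=20):
    """AP@k with a single relevant item: 1/rank if it is in the top k, else 0."""
    top_k = list(ranked_names)[:k]
    if true_name not in top_k:
        return 0.0
    rank = top_k.index(true_name) + 1  # ranks are 1-based
    return 1.0 / rank

def map_at_k(all_rankings, all_true_names, k=20):
    """Mean of AP@k over all descriptions."""
    scores = [average_precision_at_k(r, t, k)
              for r, t in zip(all_rankings, all_true_names)]
    return sum(scores) / len(scores)

# Toy check: the correct image "a" is ranked 1st, 2nd, and missing.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b"]]
truths = ["a", "a", "a"]
print(map_at_k(rankings, truths))  # prints 0.5
```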

**Note**:
The best teams of **three Cornell Tech students** might use: visualization techniques for debugging (e.g., show the top images retrieved by your algorithm and check whether they make sense); preprocessing; a good way to compare tags and descriptions; visual features combined with tags and descriptions; and supervised and/or unsupervised learning to work out how best to take advantage of each available data source.
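One very simple way to compare tags and descriptions is word overlap. The following is only a baseline sketch, not the intended solution: it assumes tags have the "supercategory:category" form and scores a description by the fraction of tag category words it mentions. The helper names and the toy tags are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def tag_words(tags):
    """Collect the words of the category part of each tag."""
    words = set()
    for tag in tags:
        # "sports:frisbee" -> "frisbee"; a tag without a colon is kept whole.
        category = tag.split(":", 1)[-1]
        words |= tokenize(category)
    return words

def overlap_score(description_sentences, tags):
    """Fraction of tag words that occur somewhere in the description."""
    desc_words = set()
    for sentence in description_sentences:
        desc_words |= tokenize(sentence)
    t_words = tag_words(tags)
    if not t_words:
        return 0.0
    return len(desc_words & t_words) / len(t_words)

description = ["A dog jumping to catch a frisbee."]
tags = ["animal:dog", "sports:frisbee", "person:person"]
print(overlap_score(description, tags))  # 2 of 3 tag words are mentioned
```

Exact word matching misses synonyms and plurals ("dogs", "puppy"), which is one reason learned text features are likely to do better.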

---

**File descriptions**:

- images_train - 10,000 training images of size 224x224.
- images_test - 2,000 test images of size 224x224.
- tags_train - image tags corresponding to the training images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- tags_test - image tags corresponding to the test images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- features_train - features extracted from a pre-trained Residual Network (ResNet) on the training set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5). Each dimension of the fc1000 feature corresponds to a WordNet synset.
- features_test - features extracted from the same Residual Network (ResNet) on the test set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5).
- descriptions_train - image descriptions corresponding to the training images. Each image has 5 sentences describing its content.
- descriptions_test - image descriptions for the test images. Each image has 5 sentences describing its content. Note that each test description corresponds to exactly one test image. Your task is to return the top 20 images in the test set for each test description.
- sample_submission.csv - a sample submission file in the correct format.
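Because every tag is a short "supercategory:category" string, one possible way to turn the tag files into numeric features is a multi-hot encoding. The sketch below is an assumption about how one might use the tags, not part of the assignment. The function names are hypothetical, and the vocabulary is built from whatever tag lists you pass in.

```python
import numpy as np

def build_tag_vocab(tag_lists):
    """Map every distinct tag to a column index (sorted for reproducibility)."""
    all_tags = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(all_tags)}

def encode_tags(tag_lists, vocab):
    """Return an (n_images, n_tags) 0/1 matrix; unknown tags are ignored."""
    X = np.zeros((len(tag_lists), len(vocab)), dtype=np.float32)
    for row, tags in enumerate(tag_lists):
        for tag in tags:
            if tag in vocab:
                X[row, vocab[tag]] = 1.0
    return X

# Toy data in the same format as the tag files.
train_tags = [["kitchen:bowl", "food:carrot"],
              ["person:person", "food:carrot"]]
vocab = build_tag_vocab(train_tags)
X = encode_tags(train_tags, vocab)
print(vocab)  # {'food:carrot': 0, 'kitchen:bowl': 1, 'person:person': 2}
print(X)
```

Building the vocabulary on the training tags and reusing it for the test tags keeps the columns aligned between the two sets.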

---


In [4]:
import os
import sys
import csv
import operator
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.cm as cm
from matplotlib import pylab as plt
%matplotlib inline


### Read in training data

# Sort files in ascending order
def order_keys(text):
    return int(text.split('.')[0])

# Define paths
my_path = os.getcwd()
image_train_path = os.path.join(my_path, 'images_train')
desc_train_path  = os.path.join(my_path, 'descriptions_train')
tags_train_path  = os.path.join(my_path, 'tags_train')
features_train_path = os.path.join(my_path, 'features_train')

# Define arrays
images_train_arr   = []
desc_train_arr     = []
tags_train_arr     = []
features_train_arr = []




In [14]:
# Read in the images
image_files = os.listdir(image_train_path)
image_files.sort(key = order_keys)
for image_file in image_files:
    # Open each image file
    im = Image.open(os.path.join(image_train_path, image_file), 'r')
    
    # Convert to an np array
    images_train_arr.append(np.asarray(im))

    # Close the file
    im.close()

In [5]:
# Read in the descriptions
desc_files = os.listdir(desc_train_path)

desc_files.sort(key = order_keys)
print(len(desc_files))
for desc_file in desc_files:
    # Open each text file (closed automatically). Strip leading/trailing whitespace
    with open(os.path.join(desc_train_path, desc_file)) as f:
        lines = [line.strip() for line in f]
    
    # Convert to an np array
    desc_train_arr.append(np.asarray(lines))
    
# print(desc_train_arr)
print(len(desc_train_arr))




10000
10000


In [13]:
# Read in the tags
tag_files = os.listdir(tags_train_path)
print(tag_files)
tag_files.sort(key = order_keys)
for tag_file in tag_files:
    # Open each text file (closed automatically). Strip leading/trailing whitespace
    with open(os.path.join(tags_train_path, tag_file)) as f:
        lines = [line.strip() for line in f]
    
    # Convert to an np array
    tags_train_arr.append(np.asarray(lines))

print(tags_train_arr)
print(len(tags_train_arr))

['0.txt', '1.txt', '10.txt', '100.txt', '1000.txt', '1001.txt', '1002.txt', '1003.txt', '1004.txt', '1005.txt', '1006.txt', '1007.txt', '1008.txt', '1009.txt', '101.txt', '1010.txt', '1011.txt', '1012.txt', '1013.txt', '1014.txt', '1015.txt', '1016.txt', '1017.txt', '1018.txt', '1019.txt', '102.txt', '1020.txt', '1021.txt', '1022.txt', '1023.txt', '1024.txt', '1025.txt', '1026.txt', '1027.txt', '1028.txt', '1029.txt', '103.txt', '1030.txt', '1031.txt', '1032.txt', '1033.txt', '1034.txt', '1035.txt', '1036.txt', '1037.txt', '1038.txt', '1039.txt', '104.txt', '1040.txt', '1041.txt', '1042.txt', '1043.txt', '1044.txt', '1045.txt', '1046.txt', '1047.txt', '1048.txt', '1049.txt', '105.txt', '1050.txt', '1051.txt', '1052.txt', '1053.txt', '1054.txt', '1055.txt', '1056.txt', '1057.txt', '1058.txt', '1059.txt', '106.txt', '1060.txt', '1061.txt', '1062.txt', '1063.txt', '1064.txt', '1065.txt', '1066.txt', '1067.txt', '1068.txt', '1069.txt', '107.txt', '1070.txt', '1071.txt', '1072.txt', '1073.t

[array(['vehicle:airplane', 'outdoor:bench', 'sports:skateboard',
       'person:person', 'vehicle:truck', 'accessory:backpack',
       'accessory:handbag', 'furniture:dining table'], dtype='<U22'), array(['kitchen:bowl', 'food:carrot', 'kitchen:spoon'], dtype='<U13'), array(['vehicle:car', 'vehicle:truck', 'outdoor:traffic light',
       'person:person'], dtype='<U21'), array(['person:person', 'outdoor:bench', 'sports:frisbee', 'vehicle:car'],
      dtype='<U14'), array(['person:person', 'sports:baseball bat'], dtype='<U19'), array(['furniture:bed', 'furniture:chair', 'electronic:mouse',
       'electronic:keyboard', 'indoor:book', 'kitchen:cup',
       'electronic:tv', 'electronic:laptop'], dtype='<U19'), array(['person:person', 'food:donut', 'vehicle:bicycle'], dtype='<U15'), array(['person:person', 'accessory:tie'], dtype='<U13'), array(['vehicle:car', 'person:person'], dtype='<U13'), array(['vehicle:car', 'vehicle:bus', 'accessory:backpack'], dtype='<U18'), array(['accessory:suitc

In [10]:
# Read in the features
feature_files = os.listdir(features_train_path)
feature_files

# for feature_file in feature_files:
#     reader = csv.reader(open(os.path.join(features_train_path, feature_file)), delimiter=",")
#     sortedlist = sorted(reader, key = lambda row: int(row[0].split('/')[1].split('.')[0]))
#     features_arr = sortedlist

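# Read the fc1000 features; the first column holds the image path (e.g. images_train/12.jpg)
with open(os.path.join(features_train_path, 'features_resnet1000_train.csv')) as f:
    reader = csv.reader(f, delimiter=",")
    # Sort rows by the numeric image id so they line up with the other arrays
    sortedlist = sorted(reader, key = lambda row: int(row[0].split('/')[1].split('.')[0]))
features_resnet1000_train_arr = sortedlist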

# print(features_resnet1000_train_arr)

# Read the pool5 (intermediate) features, sorted the same way
with open(os.path.join(features_train_path, 'features_resnet1000intermediate_train.csv')) as f:
    reader = csv.reader(f, delimiter=",")
    sortedlist = sorted(reader, key = lambda row: int(row[0].split('/')[1].split('.')[0]))
features_resnet1000intermediate_train_arr = sortedlist

# print(features_resnet1000intermediate_train_arr)

In [17]:
# Format features into pd dataframe
# print(len(images_train_arr))
# print(len(desc_train_arr))
# print(len(tags_train_arr))
# print(len(features_resnet1000_train_arr))

df_train = pd.DataFrame()
df_train['images'] = images_train_arr
df_train['descriptions'] = desc_train_arr
df_train['tags'] = tags_train_arr
df_train['features_resnet1000'] = features_resnet1000_train_arr
df_train['features_resnet1000intermediate'] = features_resnet1000intermediate_train_arr



In [16]:
df_train['features_resnet1000intermediate']

0       [images_train/0.jpg, 1.0331509113311768, 0.148...
1       [images_train/1.jpg, 0.23184368014335632, 0.12...
2       [images_train/2.jpg, 0.6228247880935669, 0.250...
3       [images_train/3.jpg, 0.2176360934972763, 0.160...
4       [images_train/4.jpg, 0.024830954149365425, 0.1...
5       [images_train/5.jpg, 0.05333779752254486, 0.67...
6       [images_train/6.jpg, 0.13950727880001068, 0.50...
7       [images_train/7.jpg, 0.5640271902084351, 0.295...
8       [images_train/8.jpg, 0.9344930648803711, 1.236...
9       [images_train/9.jpg, 0.5699881315231323, 1.187...
10      [images_train/10.jpg, 0.18309567868709564, 0.7...
11      [images_train/11.jpg, 0.38731083273887634, 0.0...
12      [images_train/12.jpg, 0.5061988234519958, 0.00...
13      [images_train/13.jpg, 0.26488766074180603, 0.0...
14      [images_train/14.jpg, 0.10956430435180664, 0.2...
15      [images_train/15.jpg, 0.2634522318840027, 0.59...
16      [images_train/16.jpg, 0.3794688284397125, 0.12...
17      [image