### The Challenge: Build a large-scale image search engine!

You and your team of **three Cornell Tech students** are surely on the path to fame and fortune! You have been recruited by Google to disrupt Google Image Search by building a better search engine using novel statistical learning techniques.

The specifications are simple: We need a way to **search for relevant images** given a natural language query. For instance, if a user types "dog jumping to catch frisbee," your system will **rank-order the most relevant images** from a large database.

---


**During training**, you have a dataset of 10,000 samples. 

Each sample has the following data available for learning:
- A 224x224 JPG image.
- A list of tags indicating the objects that appear in the image.
- Feature vectors extracted using [ResNet](https://arxiv.org/abs/1512.03385), a state-of-the-art deep CNN (you don't have to train or run ResNet; we are providing the features for you). See [here](http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50) for an illustration of the ResNet-101 architecture. The features are extracted from the pool5 and fc1000 layers.
- A five-sentence description, used to train your search engine.

**During testing**, your system matches a single five-sentence description against a pool of 2,000 candidate samples from the test set. 

Each sample has:
- A 224x224 JPEG image.
- A list of tags for that image.
- ResNet feature vectors for that image.

**Output**:
For each description, your system must score each test image by how likely it is to match the given description. Your system then returns the names of the 20 most relevant images, delimited by spaces. See "sample_submission.csv" on the data page for more details on the output format.
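
As a minimal sketch of the ranking step described above, the function below turns a list of per-image scores into a space-delimited string of the top-k image names. The function name `top_k_names`, the toy scores, and the file names such as `0.jpg` are hypothetical; check "sample_submission.csv" for the exact naming the competition expects.

```python
def top_k_names(scores, image_names, k=20):
    """Return the k highest-scoring image names, best first, joined by spaces."""
    # Sort image indices by descending score; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return " ".join(image_names[i] for i in order[:k])


# Toy example: 4 hypothetical images, keep the best 2.
names = ["0.jpg", "1.jpg", "2.jpg", "3.jpg"]
scores = [0.1, 0.9, 0.4, 0.7]
row = top_k_names(scores, names, k=2)
```

Here `row` is the string that would go into the image column of one submission line, with `k=20` in the real setting.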

**Evaluation metric**:
There are 2,000 test descriptions, and each one must be compared against the entire 2,000-image test set; that is, you rank-order the test images for each test description. We will use **MAP@20** as the evaluation metric. If the image corresponding to a description is among your algorithm's 20 highest-scoring images, the metric awards a score based on that image's rank. Please refer to the evaluation page for more details.

Use all of your skills, tools, and experience. It is OK to use libraries like numpy, scikit-learn, pandas, etc., as long as you cite them. Use cross-validation on the training set to debug your algorithm. Submit your results to the Kaggle leaderboard and send your complete writeup to CMS. Both the data you use and the way you use it are completely up to you.
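
For local debugging, the MAP@20 metric mentioned above can be approximated with a short sketch. It assumes the standard definition of average precision with exactly one relevant image per description, which reduces to 1/rank when the correct image is in the top 20 and 0 otherwise. The evaluation page is the authoritative definition and may score ranks differently; the function names and toy rankings here are hypothetical.

```python
def average_precision_at_k(ranked_names, true_name, k=20):
    """AP@k with a single relevant image: 1/rank if it is in the top k, else 0."""
    for rank, name in enumerate(ranked_names[:k], start=1):
        if name == true_name:
            return 1.0 / rank
    return 0.0


def map_at_k(all_rankings, all_truths, k=20):
    """Mean of AP@k over all descriptions."""
    scores = [average_precision_at_k(r, t, k)
              for r, t in zip(all_rankings, all_truths)]
    return sum(scores) / len(scores)


# Toy example: the correct image "a" is ranked 1st, 2nd, and not at all.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "d"]]
truths = ["a", "a", "a"]
toy_map = map_at_k(rankings, truths)
```

Under this assumed definition, the three toy descriptions score 1, 0.5, and 0, so `toy_map` is 0.5.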

**Note**:
The best teams of **three Cornell Tech students** might use visualization techniques for debugging (e.g., showing the top images retrieved by your algorithm to see whether they make sense), preprocessing, a good way to compare tags and descriptions, visual features combined with tags and descriptions, and supervised and/or unsupervised learning to make the most of each available data source.

---

**File descriptions**:

- images_train - 10,000 training images of size 224x224.
- images_test - 2,000 test images of size 224x224.
- tags_train - image tags corresponding to the training images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- tags_test - image tags corresponding to the test images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- features_train - features extracted from a pre-trained Residual Network (ResNet) on the training set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5). Each dimension of the fc1000 feature corresponds to a WordNet synset.
- features_test - features extracted from the same Residual Network (ResNet) on the test set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5).
- descriptions_train - image descriptions corresponding to the training images. Each image has 5 sentences describing the image content.
- descriptions_test - image descriptions for the test images. Each image has 5 sentences describing the image content. Note that each test description corresponds to exactly one test image. Your task is to return the top 20 images in the test set for each test description.
- sample_submission.csv - a sample submission file in the correct format.
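
Because tags come as "supercategory:category" strings and descriptions as free-text sentences, one simple way to compare the two is word overlap. The sketch below computes a Jaccard similarity between the words in an image's tags and the words in a description. The helper names and the example tags are hypothetical, and this is only one of many possible ways to compare them.

```python
import re


def tag_words(tags):
    """Split 'supercategory:category' tags into a set of lowercase words."""
    words = set()
    for tag in tags:
        for part in tag.lower().split(":"):
            words.update(part.split())
    return words


def description_words(sentences):
    """Collect the set of lowercase alphabetic words from a list of sentences."""
    words = set()
    for sentence in sentences:
        words.update(re.findall(r"[a-z]+", sentence.lower()))
    return words


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))


# Toy example with hypothetical tags and a one-sentence description.
tags = ["animal:dog", "sports:frisbee"]
sentences = ["A dog jumping to catch a frisbee."]
overlap = jaccard(tag_words(tags), description_words(sentences))
```

In this toy case the two word sets share "dog" and "frisbee" out of 8 distinct words, so `overlap` is 0.25. Removing stop words first would likely make the score more informative.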

---


In [1]:
import os
import sys
import csv
import operator
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.cm as cm
from matplotlib import pylab as plt


In [2]:
%matplotlib inline

In [3]:
### Read in training data

# Sort files in ascending order
def order_keys(text):
    return int(text.split('.')[0])

# Define paths
my_path = os.getcwd()
image_train_path = os.path.join(my_path, 'images_train')
desc_train_path  = os.path.join(my_path, 'descriptions_train')
tags_train_path  = os.path.join(my_path, 'tags_train')
features_train_path = os.path.join(my_path, 'features_train')

# Define arrays
images_arr   = []
desc_arr     = []
tags_arr     = []
features_arr = []


In [4]:
# Read in the images
image_files = os.listdir(image_train_path)
image_files.sort(key = order_keys)
for image_file in image_files:
    # Open each image file
    im = Image.open(os.path.join(image_train_path, image_file), 'r')
    
    # Convert to an np array
    images_arr.append(np.asarray(im))

    # Close the file
    im.close()


In [5]:
# Read in the descriptions
desc_files = os.listdir(desc_train_path)
desc_files.sort(key = order_keys)
for desc_file in desc_files:
    # Open each text file (closed automatically). Strip leading/trailing whitespace
    with open(os.path.join(desc_train_path, desc_file)) as f:
        lines = [line.strip() for line in f]
    
    # Convert to an np array
    desc_arr.append(np.asarray(lines))


In [6]:
# Read in the tags
tag_files = os.listdir(tags_train_path)
tag_files.sort(key = order_keys)
for tag_file in tag_files:
    # Open each text file (closed automatically). Strip leading/trailing whitespace
    with open(os.path.join(tags_train_path, tag_file)) as f:
        lines = [line.strip() for line in f]
    
    # Convert to an np array
    tags_arr.append(np.asarray(lines))


In [14]:
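# Read in the features
# Each feature file is kept under its own file name so one does not overwrite the other.
# features_arr still holds the rows of the last file read, as before.
features_by_file = {}
feature_files = sorted(os.listdir(features_train_path))
for feature_file in feature_files:
    with open(os.path.join(features_train_path, feature_file)) as f:
        reader = csv.reader(f, delimiter=",")
        # The first column is an image path; sort rows by the numeric image id in it
        sortedlist = sorted(reader, key = lambda row: int(row[0].split('/')[1].split('.')[0]))
    features_by_file[feature_file] = sortedlist
    features_arr = sortedlist

# TODO - decide how to combine the two feature lists, e.g. for the test set:
#     features_resnet1000_test.csv
#     features_resnet1000intermediate_test.csv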


In [None]:
# Format features into pd dataframe
# print len(images_arr)
# print len(desc_arr)
# print len(tags_arr)
# print len(features_arr)

df = pd.DataFrame()
df['images'] = images_arr
df['descriptions'] = desc_arr
df['tags'] = tags_arr
df['features'] = features_arr

print(df)


In [None]:
## OLD STUFF

# Some of the training data had no space after punctuation, so adjacent words were treated as one word.
# To account for this, punctuation is replaced with a blank space so the text can be split later. Words are also converted to lowercase.
# def fixpunct(s):
#     new_sent = ''
#     for i in s:
#         if i not in string.punctuation:
#             new_sent += i
#         else:
#             new_sent += ' '
#     return new_sent.lower()


# # Iterate through the training data and generate a list of unique words
# uniquewords1 = []
# lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
# for sentence in training_data1:
#     sentencefixed = fixpunct(sentence)
#     tokens = nltk.word_tokenize(sentencefixed)
#     tokens = [lmtzr.lemmatize(word, 'v') for word in tokens]
#     tokens = [word for word in tokens if word not in stopset]
#     for word in tokens:
#         if word not in uniquewords1:
#             uniquewords1.append(word)
            
            

In [None]:
# Output results to CSV

# with open('test_results.csv', 'w') as csvfile:
#     writer = csv.writer(csvfile, delimiter=',')
#     writer.writerow(['Description_ID','Top_20_Image_IDs'])


In [None]:
# Rando notes

# Preprocessing
# - Lowercase all of the words.
# - Lemmatize all of the words (i.e., convert every word to its root so that "running,"
#   "run," and "runs" all become "run," and "good," "well," "better," and "best"
#   all become "good"; this is easily done using nltk.stem).
# - Strip punctuation.
# - Strip the stop words, e.g., "the", "and", "or".

# Processing
# - Bag of words, 2-grams, and PCA on the bag-of-words vectors.
# - Common words like "the", "a", "to" are almost always the terms with highest frequency in the text.
#   Thus, having a high raw count does not necessarily mean that the corresponding word is more important.
#   To address this problem, one of the most popular ways to "normalize" the term frequencies is to weight
#   a term by the inverse of document frequency, or tf–idf.
# https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering


# Postprocessing
# - log-normalization
# - l1 normalization
# - l2 normalization
# - Standardize the data by subtracting the mean and dividing by the standard deviation.


# https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624




# maybe preprocessing / bag of words to match query with an image
# then k-means on the feature vectors of images to find closely related images
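
The notes above (lowercasing, stripping punctuation and stop words, tf–idf weighting, l2 normalization) can be combined into a small text-matching sketch. It uses only the standard library, skips lemmatization, and uses a tiny illustrative stop-word list and a smoothed idf formula; all of these, along with the function names and toy documents, are assumptions rather than the approach the notebook eventually takes.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "or", "a", "to"}  # tiny illustrative list (assumption)


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one l2-normalized {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term count times a smoothed inverse document frequency.
        vec = {w: c * (math.log((1.0 + n_docs) / (1.0 + doc_freq[w])) + 1.0)
               for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two l2-normalized sparse vectors."""
    return sum(weight * v.get(w, 0.0) for w, weight in u.items())


# Toy example: rank three hypothetical descriptions against a query.
docs = ["A dog jumping to catch a frisbee.",
        "Two people riding bikes on the road.",
        "The dog runs on the grass."]
query = "dog catching a frisbee"
vectors = tfidf_vectors(docs + [query])  # shared vocabulary and idf
query_vec = vectors[-1]
scores = [cosine(query_vec, v) for v in vectors[:-1]]
best = max(range(len(scores)), key=lambda i: scores[i])
```

In this toy case the first document shares both "dog" and "frisbee" with the query, so it scores highest, while the second shares no words and scores zero. Lemmatization would additionally let "catching" match "catch".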
