### The Challenge: Build a large-scale image search engine!

You and your team of **three Cornell Tech students** are surely on the path to fame and fortune! You have been recruited by Google to disrupt Google Image Search by building a better search engine using novel statistical learning techniques.

The specifications are simple: We need a way to **search for relevant images** given a natural language query. For instance, if a user types "dog jumping to catch frisbee," your system will **rank-order the most relevant images** from a large database.

---


**During training**, you have a dataset of 10,000 samples. 

Each sample has the following data available for learning:
- A 224x224 JPG image.
- A list of tags indicating objects appeared in the image.
- Feature vectors extracted using [Resnet](https://arxiv.org/abs/1512.03385), a state-of-the-art Deep-learned CNN (You don't have to train or run ResNet -- we are providing the features for you). See [here](http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50) for the illustration of the ResNet-101 architecture. The features are extracted from pool5 and fc1000 layer.
- A five-sentence description, used to train your search engine.

**During testing**, your system matches a single five-sentence description against a pool of 2,000 candidate samples from the test set. 

Each sample has:
- A 224x224 JPEG image.
- A list of tags for that image.
- ResNet feature vectors for that image.

**Output**:
For each description, your system must rank-score each testing image with the likelihood of that image matches the given sentence. Your system then returns the name of the top 20 relevant images, delimited by space. See "sample_submission.csv" on the data page for more details on the output format.

**Evaluation metric**:
There are 2,000 descriptions, and for each description, you must compare against the entire 2,000-image test set. That is, rank-order test images for each test description. We will use **MAP@20** as the evaluation metric. If the corresponding image of a description is among your algorithm's 20 highest scoring images, this metric gives you a certain score based on the ranking of the corresponding image. Please refer to the evaluation page for more details. Use all of your skills, tools, and experience. It is OK to use libraries like numpy, scikit-learn, pandas, etc., as long as you cite them. Use cross-validation on training set to debug your algorithm. Submit your results to the Kaggle leaderboard and send your complete writeup to CMS. The data you use --- and the way you use the data --- is completely up to you.

**Note**:
The best teams of **three Cornell Tech students** might use visualization techniques for debugging (e.g., show top images retrieved by your algorithm and see whether they make sense or not), preprocessing, a nice way to compare tags and descriptions, leveraging visual features and combining them with tags and descriptions, supervised and/or unsupervised learning to best understand how to best take advantage of each data source available to them.

---

**File descriptions**:

- images_train - 10,000 training images of size 224x224.
- images_test - 2,000 test images of size 224x224.
- tags_train - image tags correspond to training images. Each image have several tags indicating the human-labeled object categories appear in the image, in the form of "supercategory:category".
- tags_test - image tags correspond to test images. Each image have several tags indicating the human-labeled object categories appear in the image, in the form of "supercategory:category".
features_train - features extracted from a pre-trained Residual Network (ResNet) on training set, including 1,000 dimensional feature from classification layer (fc1000) and 2,048 dimensional feature from final convolution layer (pool5). Each dimension of the fc1000 feature corresponds to a WordNet synset here.
- features_test - features extracted from the same Residual Network (ResNet) on test set, including 1,000 dimensional feature from classification layer (fc1000) and 2,048 dimensional feature from final convolution layer (pool5).
- descriptions_train - image descriptions correspond to training images. Each image have 5 sentences for describing the image content.
- descriptions_test - image descriptions for test images. Each image have 5 sentences for describing the image content. Notice that one test description corresponds to one test image. The task you need to do is to return top 20 images in test set for each test description.
- sample_submission.csv - a sample submission file in the correct format.

---


In [1]:
import os
import re
import sys
import csv
import nltk
import random
import operator
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.cm as cm
from string import punctuation
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from scipy.spatial import distance
from matplotlib import pylab as plt
from sklearn import model_selection
from nltk.stem import SnowballStemmer
from sklearn.preprocessing import normalize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler 
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('wordnet')
nltk.download('stopwords')


[nltk_data] Downloading package wordnet to /Users/Conor/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/Conor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
%matplotlib inline

In [3]:
# Define paths for training/testing data
my_path = os.getcwd()
tags_train_path = os.path.join(my_path, 'tags_train')
tags_test_path  = os.path.join(my_path, 'tags_test')
desc_train_path = os.path.join(my_path, 'descriptions_train')
desc_test_path  = os.path.join(my_path, 'descriptions_test')
image_train_path = os.path.join(my_path, 'images_train')
image_test_path  = os.path.join(my_path, 'images_test')
features_train_res_path = os.path.join(my_path, 'features_train/features_resnet1000_train.csv')
features_test_res_path  = os.path.join(my_path, 'features_test/features_resnet1000_test.csv')
features_train_res_int_path = os.path.join(my_path, 'features_train/features_resnet1000intermediate_train.csv')
features_test_res_int_path  = os.path.join(my_path, 'features_test/features_resnet1000intermediate_test.csv')

# Sort files in ascending order
def order_keys(text):
    return int(text.split('.')[0])

# Read/return list of images within a folder
def read_images(folder_path):
    images = []
    image_files = os.listdir(folder_path)
    image_files.sort(key = order_keys)
    for image_file in image_files:
        # Open each image file
        im = Image.open(os.path.join(folder_path, image_file), 'r')
        # Convert to an np array
        images.append(np.asarray(im))
        # Close the file
        im.close()
    return images

# Read/return lists of elements from files within a folder
def read_files(folder_path):
    elements = []
    files = os.listdir(folder_path)
    files.sort(key = order_keys)
    for f in files:
        # Open each text file. Strip leading/trailing whitespace
        lines = [line.strip() for line in open(os.path.join(folder_path, f))]
        elements.append(' '.join(lines))
    return elements

# Read/return list of features
def read_features(file_path):
    reader = csv.reader(open(file_path), delimiter=",")
    features = sorted(reader, key = lambda row: int(row[0].split('/')[1].split('.')[0]))
    # Skip the image name
    for ind, image in enumerate(features):
        feats = [float(i) for i in image[1:]]
        features[ind] = np.array(feats, dtype=float)
    return features

# Print random rows from a dataframe
def aux_print_1(df, num_rows):    
    for index in range(num_rows):
        rand_index = random.randrange(0, len(df.index))
        row = df.iloc[rand_index]
        print "Image num. {}".format(rand_index)
        plt.imshow(row['image'])
        plt.show()
        print row['tag']
        print row['description']
        print row['resnet_feats'][:10]
        print row['resnet_feats_int'][:10]

# Print random rows from a dataframe
def aux_print_2(df, num_rows):    
    for index in range(num_rows):
        rand_index = random.randrange(0, len(df.index))
        row = df.iloc[rand_index]
        print "Image num. {}".format(rand_index)
        plt.imshow(row['image'])
        plt.show()
        print row['tag']
        print row['description']
        print row['resnet_feats'][:10]
        print row['resnet_feats_int'][:10]
        print zip(row['description_vectors'], vectorizer.get_feature_names())

# Transform input to lowercase
def to_lowercase(sent):
    return sent.lower()

# Lemmatize input
lemmatizer = WordNetLemmatizer()
def lemmatize(sent):
    words = sent.split(' ')
    # Lemmatize as verbs
    # lemmed_words = [lemmatizer.lemmatize(word, 'v') for word in words]
    lemmed_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmed_words)

# Stem input
stemmer = SnowballStemmer('english')
def stem(sent):
    words = sent.split(' ')
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)
    
# Remove stop words from input
def remove_stopwords(sent):
    stops = set(stopwords.words("english"))
    words = sent.split(' ')
    unstopped_words = [word for word in words if not word in stops]
    return ' '.join(unstopped_words)

# Remove special characters from input
def remove_special_chars(sent):
    unspecial_words = []
    words = sent.split(' ')
    for word in words:
        unspecial_word = ''.join(char for char in word if char.isalnum())
        unspecial_words.append(unspecial_word)
    return ' '.join(unspecial_words)

# Fit with count vectorizer
vectorizer = CountVectorizer()
def count_vectorize_fit(corpus):
    vectorizer.fit(corpus)
    
def count_vectorize(corpus):
    features = []
    count_vects = vectorizer.transform(corpus)
    count_vects = count_vects.toarray()
    for vects in count_vects:
        counts = [float(i) for i in vects]
        features.append(np.array(counts, dtype=float))
    return features

# Normalize with magnitude on vectorized input
def get_norm(sent):
    return np.divide(sent, np.linalg.norm(sent))


In [4]:
# Read in the training data
train_df = pd.DataFrame()
train_df['tag'] = read_files(tags_train_path)[:1000]
train_df['image'] = read_images(image_train_path)[:1000]
train_df['description'] = read_files(desc_train_path)[:1000]
train_df['resnet_feats'] = read_features(features_train_res_path)[:1000]
train_df['resnet_feats_int'] = read_features(features_train_res_int_path)[:1000]


In [5]:
# Preprocess the training description data
for index, desc in enumerate(train_df['description']):
    desc = to_lowercase(desc)
    desc = remove_special_chars(desc)
    desc = remove_stopwords(desc)
    desc = lemmatize(desc)
    # desc = stem(desc)
    train_df['description'][index] = desc

# Count vectorize the description data
count_vectorize_fit(train_df['description'])
train_df['description_vectors'] = count_vectorize(train_df['description'])

# Normalize the vectorized data
for index, desc in enumerate(train_df['description_vectors']):
    desc = get_norm(desc)
    train_df['description_vectors'][index] = desc


2863
2863


In [15]:
# Merge resnet_feats with images to create our feature vectors
train_df['final_train_features'] = ""
for index in range(len(train_df['resnet_feats'])):
    train_df['final_train_features'][index] = np.append(train_df['resnet_feats'][index], train_df['description_vectors'][index])


In [16]:
# Format the training data. 
# We will combine each resnet feature vector with the correct corresponding normalized vector count
# and label that as 1. We will then combine the same resnet feature vector with a different normalized
# vector count and label that as 0. In this way, we are providing the NN with what we consider to be 
# correct (1) or incorrect (0) values. For each resent feature vector, we will provide 1 correct and
# 2 incorrect values.
X_training_vals   = []
X_training_labels = []
for ind_out, img_feat in enumerate(train_df['resnet_feats']):
    # Get euclidean distance between chosen resnet feature with all
    # other resnet features
    dist = []
    for ind_in, comp_img_feat in enumerate(train_df['resnet_feats']):
        euc_dist = distance.euclidean(img_feat, comp_img_feat)
        dist.append((ind_in, euc_dist))
    dist.sort()

    # Add the correct resnet feature vector and vector count
    X_training_vals.append(np.append(img_feat, train_df['description_vectors'][ind_out]))
    X_training_labels.append(1)
    
    # Add the incorrect resnet feature vector and vector count
    ind = dist.pop()[0]
    X_training_vals.append(np.append(train_df['resnet_feats'][ind], train_df['description_vectors'][ind]))
    X_training_labels.append(0)
    ind = dist.pop()[0]
    X_training_vals.append(np.append(train_df['resnet_feats'][ind], train_df['description_vectors'][ind]))
    X_training_labels.append(0)

X_train = pd.DataFrame(X_training_vals, X_training_labels)
X_train = shuffle(X_train)


In [21]:
splits = 3
kf = model_selection.KFold(n_splits = splits)

# Cross validation on our model
for train_index, test_index in kf.split(X_train):
    
    X_cross_train = X_train.iloc[train_index]
    X_cross_test  = X_train.iloc[test_index]
    
    # Set up our model
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    clf.fit(X_cross_train.values, X_cross_train.index) 
    pred_probs = clf.predict_proba(X_cross_test.values)
    preds =  clf.predict(X_cross_test.values)
    print zip(X_cross_test.index, pred_probs, preds)
    print clf.score(X_cross_test.values, X_cross_test.index)
    

[(1, array([0.00355364, 0.99644636]), 1), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (1, array([8.67784802e-04, 9.99132215e-01]), 1), (0, array([0.73784197, 0.26215803]), 0), (1, array([0.68888419, 0.31111581]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.73784197, 0.26215803]), 0), (0, array([0.73784197, 0.26215803]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.73784197, 0.26215803]), 0), (0, array([0.72691678, 0.27308322]), 0), (1, array([4.85407842e-08, 9.99999951e-01]), 1), (1, array([0.29109558, 0.70890442]), 1), (1, array([0.00247118, 0.99752882]), 1), (0, array([0.73784197, 0.26215803]), 0), (0, array([0.72691678, 0.27308322]), 0), (0, array([0.72691678, 0.27308322]), 0), (1, array([0.11771993, 0.88228007]), 1), (1, array([2.35032238e-09, 9.99999998e-0

[(0, array([0.99758138, 0.00241862]), 0), (1, array([0., 1.]), 1), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (1, array([0., 1.]), 1), (0, array([0.99758138, 0.00241862]), 0), (1, array([0., 1.]), 1), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (1, array([0., 1.]), 1), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99758138, 0.00241862]), 0), (1, array([0., 1.]), 1), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (1, array([0., 1.]), 1), (1, array([0., 1.]), 1), (1, array([0., 1.]), 1), (1, array([0., 1.]), 1), (0, array([0.99807357, 0.00192643]), 0), (0, array([0.99807357, 0.00192643]), 0), (1, array([0., 1.]), 1), (1, array([0., 1.]), 1), (0, array([0.99758138, 0.00241862]), 0), (0, array([0.99758138, 0.00241862]), 0), (1, array([0., 1.]), 1), (1, array([0.93336627, 0.06663373]), 0), (1

[(0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (1, array([0., 1.]), 1), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, array([0.99844622, 0.00155378]), 0), (0, 

In [9]:
# # Fit the model
# clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
# clf.fit(X_train.values, X_train.index)


In [10]:
# Read in the testing data
test_df = pd.DataFrame()
test_df['tag'] = read_files(tags_test_path)
test_df['image'] = read_images(image_test_path)
test_df['description'] = read_files(desc_test_path)
test_df['resnet_feats'] = read_features(features_test_res_path)
test_df['resnet_feats_int'] = read_features(features_test_res_int_path)


In [11]:
# Preprocess the testing data
for index, desc in enumerate(test_df['description']):
    desc = to_lowercase(desc)
    desc = remove_special_chars(desc)
    desc = remove_stopwords(desc)
    desc = lemmatize(desc)
    # desc = stem(desc)
    test_df['description'][index] = desc

# Count vectorize the testing data
test_df['description_vectors'] = count_vectorize(test_df['description'])

# Normalize the vectorized data
for index, desc in enumerate(test_df['description_vectors']):
    desc = get_norm(desc)
    test_df['description_vectors'][index] = desc


2863
2000


In [14]:
# Output results to CSV

# with open('test_results.csv', 'w') as csvfile:
#     writer = csv.writer(csvfile, delimiter=',')
#     writer.writerow(['Description_ID','Top_20_Image_IDs'])


2863


---

**Preprocessing**:
- Remove stop words
- Remove stemming
- Remove special characters 
- Lowercase everything
- https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

**Postprocessing (?)**:
- log-normalization
- l1 normalization
- l2 normalization
- Standardize the data by subtracting the mean and dividing by the variance.

**Clustering descriptions**:
- Bag of words, 2-gram, maybe PCA
- With BoW, use tf–idf
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e

**Clustering images**:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- Random Decision Forests
- K-means or K-medoids
- PCA is good for reducing noise, but the input data should be "manifoldy", we are probably dealing with "clumpy"
- Niave bayes

**CNN/RNN links**:
- https://becominghuman.ai/extract-a-feature-vector-for-any-image-with-pytorch-9717561d1d4c
- http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/
- http://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://medium.freecodecamp.org/learn-to-build-a-convolutional-neural-network-on-the-web-with-this-easy-tutorial-- 2d617ffeaef3
- https://blog.insightdatascience.com/the-unreasonable-effectiveness-of-deep-learning-representations-4ce83fc663cf
- https://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model
- https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624
