### The Challenge: Build a large-scale image search engine!

You and your team of **three Cornell Tech students** are surely on the path to fame and fortune! You have been recruited by Google to disrupt Google Image Search by building a better search engine using novel statistical learning techniques.

The specifications are simple: We need a way to **search for relevant images** given a natural language query. For instance, if a user types "dog jumping to catch frisbee," your system will **rank-order the most relevant images** from a large database.

---


**During training**, you have a dataset of 10,000 samples. 

Each sample has the following data available for learning:
- A 224x224 JPG image.
- A list of tags indicating the objects that appear in the image.
- Feature vectors extracted using [ResNet](https://arxiv.org/abs/1512.03385), a state-of-the-art deep CNN (you don't have to train or run ResNet; we are providing the features for you). See [here](http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50) for an illustration of the ResNet-101 architecture. The features are extracted from the pool5 and fc1000 layers.
- A five-sentence description, used to train your search engine.

**During testing**, your system matches a single five-sentence description against a pool of 2,000 candidate samples from the test set. 

Each sample has:
- A 224x224 JPEG image.
- A list of tags for that image.
- ResNet feature vectors for that image.

**Output**:
For each description, your system must score each test image by the likelihood that it matches the given description, and rank the images by that score. Your system then returns the names of the top 20 most relevant images, delimited by spaces. See "sample_submission.csv" on the data page for more details on the output format.

**Evaluation metric**:
There are 2,000 descriptions, and each description must be compared against the entire 2,000-image test set. That is, rank-order the test images for each test description. We will use **MAP@20** as the evaluation metric. If the image corresponding to a description is among your algorithm's 20 highest-scoring images, this metric gives you a score based on the rank of that image. Please refer to the evaluation page for more details.

Use all of your skills, tools, and experience. It is OK to use libraries like numpy, scikit-learn, pandas, etc., as long as you cite them. Use cross-validation on the training set to debug your algorithm. Submit your results to the Kaggle leaderboard and send your complete writeup to CMS. The data you use, and the way you use it, is completely up to you.
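To make the MAP@20 idea concrete for local debugging, here is a minimal sketch. It assumes each description has exactly one correct image and uses the standard average-precision definition, which then reduces to 1/rank when the image is in the top 20 and 0 otherwise. The competition's evaluation page may weight ranks differently, and `map_at_k` and its arguments are hypothetical names.

```python
def map_at_k(ranked_lists, truths, k=20):
    """Mean average precision at k, assuming one relevant item per query."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        top_k = list(ranked)[:k]
        if truth in top_k:
            # With a single relevant item, AP reduces to 1 / (1-based rank).
            total += 1.0 / (top_k.index(truth) + 1)
    return total / len(truths)


# Toy check: first query hits at rank 1, second at rank 2, third misses.
predictions = [["0.jpg", "5.jpg"], ["7.jpg", "1.jpg"], ["3.jpg", "4.jpg"]]
answers = ["0.jpg", "1.jpg", "2.jpg"]
score = map_at_k(predictions, answers)
```

A function like this can be run on held-out training descriptions to compare models before submitting to the leaderboard.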

**Note**:
The best teams of **three Cornell Tech students** might use visualization techniques for debugging (e.g., show the top images retrieved by your algorithm and check whether they make sense), preprocessing, a good way to compare tags and descriptions, visual features combined with tags and descriptions, and supervised and/or unsupervised learning to take full advantage of each available data source.

---

**File descriptions**:

- images_train - 10,000 training images of size 224x224.
- images_test - 2,000 test images of size 224x224.
- tags_train - image tags corresponding to the training images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- tags_test - image tags corresponding to the test images. Each image has several tags indicating the human-labeled object categories that appear in the image, in the form "supercategory:category".
- features_train - features extracted from a pre-trained Residual Network (ResNet) on the training set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5). Each dimension of the fc1000 feature corresponds to a WordNet synset.
- features_test - features extracted from the same Residual Network (ResNet) on the test set, including a 1,000-dimensional feature from the classification layer (fc1000) and a 2,048-dimensional feature from the final convolution layer (pool5).
- descriptions_train - image descriptions corresponding to the training images. Each image has 5 sentences describing its content.
- descriptions_test - image descriptions for the test images. Each image has 5 sentences describing its content. Note that each test description corresponds to exactly one test image. Your task is to return the top 20 images in the test set for each test description.
- sample_submission.csv - a sample submission file in the correct format.
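
As a small illustration of the "supercategory:category" tag format, the sketch below splits the contents of one tag file into pairs. The sample text is made up for illustration, and `parse_tags` is a hypothetical helper rather than part of the provided data tooling.

```python
def parse_tags(text):
    """Split the lines of one tag file into (supercategory, category) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        supercategory, _, category = line.partition(":")
        pairs.append((supercategory, category))
    return pairs


# Made-up file contents, for illustration only.
sample = "animal:dog\nsports:frisbee\n"
tags = parse_tags(sample)
```

Keeping the two parts separate makes it possible to match a description against either the broad supercategory or the specific category.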

---


In [1]:
import os
import csv
import nltk
import scipy
import gensim
import random
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn import preprocessing
from nltk.corpus import stopwords
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
%matplotlib inline

In [3]:
# Define paths for training/testing data
my_path = os.getcwd()
tags_train_path = os.path.join(my_path, 'tags_train')
tags_test_path  = os.path.join(my_path, 'tags_test')
desc_train_path = os.path.join(my_path, 'descriptions_train')
desc_test_path  = os.path.join(my_path, 'descriptions_test')
image_train_path = os.path.join(my_path, 'images_train')
image_test_path  = os.path.join(my_path, 'images_test')
features_train_res_path = os.path.join(my_path, 'features_train/features_resnet1000_train.csv')
features_test_res_path  = os.path.join(my_path, 'features_test/features_resnet1000_test.csv')
features_train_res_int_path = os.path.join(my_path, 'features_train/features_resnet1000intermediate_train.csv')
features_test_res_int_path  = os.path.join(my_path, 'features_test/features_resnet1000intermediate_test.csv')

# Sort files in ascending order
def order_keys(text):
    return int(text.split('.')[0])

# Read/return list of images within a folder
def read_images(folder_path):
    images = []
    image_files = os.listdir(folder_path)
    image_files.sort(key = order_keys)
    for image_file in image_files:
        # Open each image file
        im = Image.open(os.path.join(folder_path, image_file), 'r')
        # Convert to an np array
        images.append(np.asarray(im))
        # Close the file
        im.close()
    return images

# Read/return list of strings from files within a folder
def read_strings(folder_path):
    elements = []
    files = os.listdir(folder_path)
    files.sort(key = order_keys)
    for f in files:
        # Read each text file, stripping leading/trailing whitespace from every line
        with open(os.path.join(folder_path, f)) as text_file:
            lines = [line.strip() for line in text_file]
        elements.append(' '.join(lines))
    return elements

# Read/return list of tags from files within a folder
def read_tags(folder_path):
    elements = []
    files = os.listdir(folder_path)
    files.sort(key = order_keys)
    for f in files:
        # Read each text file, stripping leading/trailing whitespace from every line
        with open(os.path.join(folder_path, f)) as text_file:
            lines = [line.strip() for line in text_file]
        tags  = [line.split(':') for line in lines]
        tags  = [item for sublist in tags for item in sublist]
        elements.append(tags)
    return elements

# Read/return list of features
def read_features(file_path):
    # Read all rows, closing the file when done
    with open(file_path) as csv_file:
        rows = list(csv.reader(csv_file, delimiter=","))
    # Sort rows by the numeric image ID in the first column ('<folder>/<number>.<ext>')
    rows.sort(key = lambda row: int(row[0].split('/')[1].split('.')[0]))
    # Drop the image name and convert the remaining columns to floats
    return [[float(i) for i in row[1:]] for row in rows]

# Transform input to lowercase
def to_lowercase(sent):
    return sent.lower()

# OLD - Lemmatize input
# lemmatizer = WordNetLemmatizer()
# def lemmatize(sent):
#     words = sent.split(' ')
#     lemmed_words = [lemmatizer.lemmatize(word) for word in words]
#     return ' '.join(lemmed_words)

# OLD - Stem input
# stemmer = SnowballStemmer('english')
# def stem(sent):
#     words = sent.split(' ')
#     stemmed_words = [stemmer.stem(word) for word in words]
#     return ' '.join(stemmed_words)
    
# OLD - Remove stop words from input
def remove_stopwords(sent):
    stops = set(stopwords.words("english"))
    words = sent.split(' ')
    unstopped_words = [word for word in words if word not in stops]
    return ' '.join(unstopped_words)

# Remove special characters from input
def remove_special_chars(sent):
    unspecial_words = []
    words = sent.split(' ')
    for word in words:
        unspecial_word = ''.join(char for char in word if char.isalnum())
        unspecial_words.append(unspecial_word)
    return ' '.join(unspecial_words)

# Fit with count vectorizer
vectorizer = CountVectorizer()
def count_vectorize_fit(corpus):
    vectorizer.fit(corpus)

# Perform count vectorization
def count_vectorize(corpus):
    features = []
    count_vects = vectorizer.transform(corpus)
    count_vects = count_vects.toarray()
    for vects in count_vects:
        counts = [float(i) for i in vects]
        features.append(counts)
    return features

# ALT - Fit with TF-IDF vectorizer
tfidf = TfidfVectorizer()
def tfid_vectorize_fit(corpus):
    tfidf.fit(corpus)
    
# ALT - Perform TF-IDF vectorization
def tfid_vectorize(corpus):
    features = []
    count_vects = tfidf.transform(corpus)
    count_vects = count_vects.toarray()
    for vects in count_vects:
        counts = [float(i) for i in vects]
        features.append(counts)
    return features

# OLD - Normalize with magnitude
def get_norm(sent):
    return np.divide(sent, np.linalg.norm(sent))


In [4]:
# Read in the training data
X_tags   = read_tags(tags_train_path)
X_images = read_images(image_train_path)
X_descs  = read_strings(desc_train_path)
X_feats  = read_features(features_train_res_path)
X_feats_int = read_features(features_train_res_int_path)


In [5]:
# Preprocess the training descriptions
for index in range(len(X_descs)):
    X_descs[index] = to_lowercase(X_descs[index])
    X_descs[index] = remove_special_chars(X_descs[index])


In [6]:
# TF-IDF vectorize the description data (the vectorizer is fit on the training descriptions)
tfid_vectorize_fit(X_descs)
X_descs_vect = tfid_vectorize(X_descs)


In [7]:
# # Format the training data.
# # We will combine each resnet feature vector with its matching normalized count vector
# # and label that pair as 1. We will then combine the same resnet feature vector with a different
# # normalized count vector and label that pair as 0. In this way, we are providing the NN with what
# # we consider to be correct (1) or incorrect (0) values. For each resnet feature vector, we will
# # provide 1 correct value and several incorrect values (the loop below draws 20).
# X_training_vals   = []
# X_training_labels = []

# for index, img_feat in enumerate(X_feats):
#     # Add the correct resnet feature vector and vector count
#     X_training_vals.append(np.append(X_descs_vect[index], img_feat))
#     X_training_labels.append(1)
    
#     for i in range(20):
#         # Choose random index other than the correct one
#         rand_index = random.randrange(0, len(X_feats))
#         while (rand_index == index):
#             rand_index = random.randrange(0, len(X_feats))

#         # Add the resnet feature vector paired with a non-matching count vector
#         X_training_vals.append(np.append(X_descs_vect[rand_index], img_feat))
#         X_training_labels.append(0)

# X_train = pd.DataFrame(X_training_vals, X_training_labels)
# X_train = shuffle(X_train)
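
The pairing scheme described in the comments above can be sketched as a standalone function. This is only an illustration on toy arrays: it assumes a negative example pairs the current image's features with a randomly chosen other description, and `build_pairs`, `n_negatives`, and `seed` are hypothetical names.

```python
import random

import numpy as np


def build_pairs(desc_vecs, img_feats, n_negatives=2, seed=0):
    """Build (description + image feature) rows labeled 1 (match) or 0 (mismatch)."""
    rng = random.Random(seed)
    values, labels = [], []
    n = len(img_feats)
    for index, img_feat in enumerate(img_feats):
        # Positive pair: the description that belongs to this image.
        values.append(np.concatenate([desc_vecs[index], img_feat]))
        labels.append(1)
        for _ in range(n_negatives):
            # Negative pair: a description drawn from any other image.
            rand_index = rng.randrange(n)
            while rand_index == index:
                rand_index = rng.randrange(n)
            values.append(np.concatenate([desc_vecs[rand_index], img_feat]))
            labels.append(0)
    return np.array(values), np.array(labels)


# Toy data: 4 "descriptions" with 3 dimensions, 4 "images" with 2 dimensions.
descs = np.arange(12, dtype=float).reshape(4, 3)
feats = np.arange(8, dtype=float).reshape(4, 2)
pairs, pair_labels = build_pairs(descs, feats)
```

The ratio of negatives to positives is a tuning choice; more negatives give the classifier more contrast but make the classes more imbalanced.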


In [8]:
# Fit KNN model - Regression
knn = KNeighborsRegressor(n_neighbors = 15)
knn.fit(X_descs_vect, X_feats)


KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=15, p=2,
          weights='uniform')
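
The regression approach above maps description vectors into image feature space, after which candidate images can be ranked by distance to the predicted features. A minimal sketch of that idea, with a hold-out split for local validation, might look like the following. All data here is synthetic and the split sizes are arbitrary assumptions; it is not the notebook's actual data or a tuned model.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
# Synthetic stand-ins: 200 "image features" (10-dim) and "descriptions" (30-dim).
img_feats = rng.normal(size=(200, 10))
projection = rng.normal(size=(10, 30))
desc_vecs = img_feats @ projection + 0.01 * rng.normal(size=(200, 30))

# Hold out the last 50 samples for validation.
train_idx, val_idx = np.arange(150), np.arange(150, 200)
knn = KNeighborsRegressor(n_neighbors=15)
knn.fit(desc_vecs[train_idx], img_feats[train_idx])
predicted = knn.predict(desc_vecs[val_idx])

# Rows are validation descriptions, columns are candidate validation images.
dist = cdist(predicted, img_feats[val_idx])
top20 = np.argsort(dist, axis=1)[:, :20]
# Fraction of descriptions whose true image (same row index) is in the top 20.
hit_rate = np.mean([i in top20[i] for i in range(len(val_idx))])
```

Running this kind of check on a held-out slice of the training set gives a quick signal before a leaderboard submission.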

In [9]:
# # Fit KNN model - Classification
# knn = KNeighborsClassifier(n_neighbors = 15)
# knn.fit(X_train.values, X_train.index)


In [10]:
# Read in the testing data
X_tags_test   = read_tags(tags_test_path)
X_images_test = read_images(image_test_path)
X_descs_test  = read_strings(desc_test_path)
X_feats_test  = read_features(features_test_res_path)
X_feats_int_test = read_features(features_test_res_int_path)


In [11]:
# Preprocess the testing descriptions
for index in range(len(X_descs_test)):
    X_descs_test[index] = to_lowercase(X_descs_test[index])
    X_descs_test[index] = remove_special_chars(X_descs_test[index])


In [12]:
# TF-IDF vectorize the test descriptions (reusing the vectorizer fit on the training data)
X_descs_vect_test = tfid_vectorize(X_descs_test)


In [13]:
# Run the model with testing data
result_images = knn.predict(X_descs_vect_test)

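# Get matrix of distances between predicted and actual test features.
# Rows are descriptions and columns are test images, so each row can be ranked on its own.
from scipy.spatial.distance import cdist
feat_dist = cdist(result_images, X_feats_test)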


In [14]:
# Write the outputs to CSV - Regression
with open('test_results.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    
    writer.writerow(['Descritpion_ID','Top_20_Image_IDs'])

    for index, dist in enumerate(feat_dist):
        file_names = []
        sorted_results = np.argsort(dist)
        results = sorted_results[:20]

        # Convert results to a string
        for res in results:
            file_names.append(str(res) + '.jpg')

        # Write results to file
        writer.writerow([str(index) + '.txt',' '.join(file_names)])
    

In [15]:
# # Run the model with testing data and write the outputs to CSV - Classification
# with open('test_results.csv', 'w') as csvfile:
#     writer = csv.writer(csvfile, delimiter=',')
    
#     writer.writerow(['Description','Top_20_Image_IDs'])

#     # Compare each description to each resnet feature
#     for ind_outer, desc in enumerate(X_descs_vect_test):
        
#         indices = []
#         file_names = []
#         testing_vals = []

#         for ind_inner, feature in enumerate(X_feats_test):
#             indices.append(ind_inner)
#             testing_vals.append(np.append(desc, feature))
        
#         # Predict
#         results_prob = knn.predict_proba(testing_vals)
#         results_pred = knn.predict(testing_vals)
#         results_class_1, results_class_2 = zip(*results_prob)
#         results = zip(indices, results_pred, results_class_1, results_class_2)

#         print results
        
#         # Sort results by positive matches
#         sorted_results = sorted(results, key=lambda tup: tup[2])
            
#         # Convert results to a string
#         for res in sorted_results[:20]:
#             file_names.append(str(res[0]) + '.jpg')

#         # Write results to file
#         writer.writerow([str(ind_outer) + '.txt',' '.join(file_names)])


---

**Preprocessing**:
- Remove stop words
- Stem words (reduce them to their root form)
- Remove special characters 
- Lowercase everything
- https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
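
A compact sketch of these preprocessing steps using only the standard library follows. The stop-word set is a tiny illustrative assumption (a real run would use a fuller list, such as nltk's stopwords corpus imported earlier), stemming is left out, and `preprocess` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word list; not a complete one.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "is", "and"}


def preprocess(sentence):
    """Lowercase, drop special characters, and remove stop words."""
    lowered = sentence.lower()
    # Delete every character that is not a lowercase letter, digit, or space.
    cleaned = re.sub(r"[^a-z0-9 ]+", "", lowered)
    # split() with no argument also collapses repeated spaces.
    tokens = [tok for tok in cleaned.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


example = preprocess("A dog is jumping to catch the Frisbee!")
```

Lowercasing before stop-word removal matters here, since the stop-word set contains only lowercase entries.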

**Postprocessing (?)**:
- log-normalization
- l1 normalization
- l2 normalization
- Standardize the data by subtracting the mean and dividing by the standard deviation.
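
The normalization options listed above can be sketched with NumPy as follows. This assumes a 2-D array with one sample per row; standardization here divides by the standard deviation, which is the usual convention, and the function names are hypothetical.

```python
import numpy as np


def log_normalize(x):
    """log(1 + x), which assumes non-negative inputs such as counts."""
    return np.log1p(x)


def l1_normalize(x):
    """Scale each row so its absolute values sum to 1."""
    norms = np.abs(x).sum(axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def standardize(x):
    """Zero mean and unit standard deviation per column."""
    std = x.std(axis=0)
    return (x - x.mean(axis=0)) / np.where(std == 0, 1.0, std)


X = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
X_log = log_normalize(X)
X_l1 = l1_normalize(X)
X_l2 = l2_normalize(X)
X_std = standardize(X)
```

The `np.where` guards avoid division by zero for all-zero rows or constant columns, which can occur with sparse text features.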

**Clustering descriptions**:
- Bag of words, 2-gram, maybe PCA
- With BoW, use tf–idf
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e
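
One possible way to combine these ideas is a TF-IDF bag of words over unigrams and 2-grams followed by dimensionality reduction. The sketch below uses `TruncatedSVD` in place of PCA because it accepts sparse matrices directly; that substitution, the toy corpus, and the component count are all assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the image descriptions.
corpus = [
    "dog jumping to catch frisbee",
    "a dog catches a frisbee in the park",
    "two people riding bikes on a street",
    "a man riding a bike down the street",
]

# ngram_range=(1, 2) keeps both single words and 2-grams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)

# Reduce the sparse TF-IDF matrix to a small dense representation.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_matrix)
```

The reduced vectors can then be fed to a clustering or nearest-neighbor step at a much lower memory cost than the full vocabulary-sized vectors.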

**Clustering images**:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- Random Decision Forests
- K-means or K-medoids
- PCA is good for reducing noise, but the input data should be "manifoldy"; ours is probably "clumpy"
- Naive Bayes
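
As one example of the clustering options above, here is a minimal K-means sketch on synthetic feature vectors. The two well-separated blobs stand in for image features, and the cluster count is an assumption chosen to match the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated blobs standing in for image feature vectors.
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
features = np.vstack([blob_a, blob_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)
```

On real ResNet features the right number of clusters is not known in advance, so it would need to be chosen by inspection or a validation score.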

**CNN/RNN links**:
- https://becominghuman.ai/extract-a-feature-vector-for-any-image-with-pytorch-9717561d1d4c
- http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/
- http://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://medium.freecodecamp.org/learn-to-build-a-convolutional-neural-network-on-the-web-with-this-easy-tutorial-2d617ffeaef3
- https://blog.insightdatascience.com/the-unreasonable-effectiveness-of-deep-learning-representations-4ce83fc663cf
- https://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model
- https://medium.com/@14prakash/understanding-and-implementing-architectures-of-resnet-and-resnext-for-state-of-the-art-image-cf51669e1624


In [16]:
# PCA set up:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=1100)
# Y = np.array(list(train_df['final_train_features'].values), dtype=float)
# X = pca.fit_transform(Y)
# pca_final_train_features = X
# for index in range(len(X)):
#     train_df['pca_final_train_features'][index] = X[index]
# train_df['pca_final_train_features']
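
A runnable version of the PCA idea on synthetic data is sketched below. The array shape and component count are arbitrary assumptions; note that scikit-learn's `PCA` cannot keep more components than the smaller of the sample count and the feature count.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic stand-in for a (samples x features) matrix.
features = rng.normal(size=(100, 50))

pca = PCA(n_components=10)
reduced = pca.fit_transform(features)
# Share of the total variance kept by the retained components.
explained = pca.explained_variance_ratio_.sum()
```

Checking `explained` is a quick way to judge whether the chosen number of components discards too much information.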