## Learning by applying code
  
From ["Finding similar images using Deep learning and Locality Sensitive Hashing"](https://towardsdatascience.com/finding-similar-images-using-deep-learning-and-locality-sensitive-hashing-9528afee02f5) on towardsdatascience.com.
  
>"A simple walkthrough on finding similar images through image embedding by a ResNet 34 using FastAI & Pytorch. Also doing fast semantic similarity search in huge image embeddings collections."
  
  
> The process to achieve the above result can be broken down in these few steps -
1. Transfer learning from a ResNet-34 model(trained on ImageNet) to detect 101 classes in Caltech-101 dataset using FastAI and Pytorch.
2. Take the output of second last fully connected layer from trained ResNet 34 model to get embedding for all 9,144 Caltech-101 images.
3. Use Locality Sensitive hashing to create LSH hashing for our image embedding which enables fast approximate nearest neighbor search
4. Then given an image, we can convert it into image embedding using our trained model and then search similar images using Approximate nearest neighbor on Caltech-101 dataset.

### Part 1: Data understanding and transfer learning

"The first exercise in our project is to obtain a deep learning network which can classify these categories accurately. For this task, we will use a pre-trained ResNet 34 network which is trained on the ImageNet database and transfer learn it to classify 101 categories of Caltech-101 database using Pytorch 1.0 and FastAI library."

Testing my own dataset - ~120 images, 8 categories. This could be terrible lol.

In [None]:
import pandas as pd
import pickle
import numpy as np
from fastai.vision import *
from fastai.callbacks.hooks import *
from lshash_2 import LSHash
#from lshash import LSHash
from PIL import Image
import matplotlib.pyplot as plt
from tqdm import notebook
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
pd.set_option('display.max_columns', 500)

In [None]:
path = '../../repo/data/model_images/'

In [None]:
# get_transforms is a fastai function
tfms = get_transforms(
    do_flip=False, 
    flip_vert=False, 
    max_rotate=0, 
    max_lighting=0, 
    max_zoom=1, 
    max_warp=0
)
data = (ImageList.from_folder(path)
        .split_by_rand_pct(0.2)
        .label_from_folder()
        .transform(tfms=tfms, size=224)
        .databunch(bs=64))

In [None]:
# data

In [None]:
# print('Number of classes {0}'.format(data.c))
# print(data.classes)

In [None]:
# print('Train dataset size: {0}'.format(len(data.train_ds.x)))
# print('Test dataset size: {0}'.format(len(data.valid_ds.x)))

In [None]:
## Show sample data
data.show_batch(rows=3, figsize=(10,6), hide_axis=False) 

In [None]:
## Creating the model
learn = cnn_learner(data, models.resnet34, pretrained=True, metrics=accuracy)

In [None]:
### This took about 23 mins
## Finding ideal learning rate
learn.lr_find()
learn.recorder.plot()

In [None]:
# Fitting 5 epochs (one epoch = one full pass over the training images)
learning_rate = 1e-2 # chosen from the lr_find plot (loss ~0.25)
learn.fit_one_cycle(5, max_lr=learning_rate)
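
An epoch is one full pass over the training set, so fitting 5 epochs shows the model every training image five times. Below is a minimal sketch of the arithmetic, assuming roughly 96 training images (80% of ~120) and the batch size of 64 used when building the databunch. `batches_per_epoch` is a hypothetical helper written for this note, not a fastai function, and whether the last partial batch is dropped depends on the data loader.

```python
import math

def batches_per_epoch(n_train, batch_size, drop_last=False):
    """Number of weight updates in one full pass over the training set."""
    if drop_last:
        return n_train // batch_size        # incomplete final batch is skipped
    return math.ceil(n_train / batch_size)  # incomplete final batch is kept

n_train = 96      # assumed: 80% of ~120 images
batch_size = 64   # bs used when building the databunch
epochs = 5

for drop_last in (False, True):
    per_epoch = batches_per_epoch(n_train, batch_size, drop_last)
    print(f"drop_last={drop_last}: {per_epoch} batches/epoch, "
          f"{per_epoch * epochs} updates over {epochs} epochs")
```

Under these assumptions each epoch is only one or two batches, so 5 epochs amounts to a handful of weight updates.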

In [None]:
# Saving stage 1 model weights
learn.save('stg1-rn34')

In [None]:
## Unfreezing layers and finding ideal learning rate
learn.unfreeze()
learn.lr_find() # this took 30 mins for 100 images
learn.recorder.plot()

In [None]:
## Fitting 5 epochs
learn.fit_one_cycle(5, slice(1e-5, 1e-2/5))
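
Passing a `slice` asks fastai to train earlier layers with a smaller learning rate than later ones. Here is a rough sketch of that spreading, assuming geometric (log-even) spacing across three layer groups. `spread_lrs` is a hypothetical stand-in written for illustration, not the fastai API, and the group count of 3 is an assumption.

```python
def spread_lrs(start, stop, n_groups):
    """Geometrically spaced learning rates from start to stop (inclusive)."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# slice(1e-5, 1e-2/5) from the cell above; 3 layer groups is an assumption
lrs = spread_lrs(1e-5, 1e-2 / 5, n_groups=3)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.2e}")
```

The earliest (most general) layers get the smallest rate and the newly added head gets the largest, so the pretrained weights change less than the new ones.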

In [None]:
## Saving model weights
learn.save('stg2-rn34')

## Extracting Feature

Creating a hook on the network head, at `learn.model[1][5]` (the output before the last fully connected layer of the fine-tuned ResNet-34), which yields a 512-length vector for each 224×224 input image.

In [None]:
# this is a hook (learned about it here: 
# https://forums.fast.ai/t/how-to-find-similar-images-based-on-final-embedding-layer/16903/13)
# hooks are used for saving intermediate computations
class SaveFeatures():
    def __init__(self, m): 
        self.features = None
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, input, output): 
        # Runs after every forward pass of m; stacks each batch's output row-wise
        out = output.detach().cpu().numpy()
        if self.features is None:
            self.features = out
        else:
            self.features = np.vstack((self.features, out))
    def remove(self): 
        self.hook.remove()
        
sf = SaveFeatures(learn.model[1][5]) ## Output before the last FC layer

Creating Feature Vector

In [None]:
## Running this saves the feature vectors in the sf variable initiated above
# DatasetType.Fix is the training set without shuffling, so rows line up with data.train_ds.items
_= learn.get_preds(DatasetType.Fix)
_= learn.get_preds(DatasetType.Valid)

Converting into a dictionary of {img_path: feature_vector}

In [None]:
img_path = [str(x) for x in (list(data.train_ds.items)+list(data.valid_ds.items))]
feature_dict = dict(zip(img_path,sf.features))
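
`zip` stops at the shorter input without warning, so if the hook collected more or fewer rows than there are image paths (for instance, if `get_preds` was run twice), the mismatch would go unnoticed. A small guard, sketched with a hypothetical helper `build_feature_dict` and made-up paths and vectors:

```python
def build_feature_dict(paths, features):
    """Pair each image path with its feature vector, refusing mismatched lengths."""
    if len(paths) != len(features):
        raise ValueError(
            f"{len(paths)} paths but {len(features)} feature rows; "
            "was the hook run on a different set of images?"
        )
    return dict(zip(paths, features))

# Toy stand-ins for img_path and sf.features
paths = ["cat/1.jpg", "cat/2.jpg", "dog/1.jpg"]
features = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]

feature_dict_demo = build_feature_dict(paths, features)
print(len(feature_dict_demo))

try:
    build_feature_dict(paths, features[:2])
except ValueError as err:
    print("caught:", err)
```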

In [None]:
## Exporting as pickle
with open(path+"feature_dict.p", "wb") as f:
    pickle.dump(feature_dict, f)

## Using Locality Sensitive hashing to find near similar images

In [None]:
## Loading feature dictionary
with open(path+'feature_dict.p', 'rb') as f:
    feature_dict = pickle.load(f)

In [None]:
len(feature_dict)

In [None]:
## Locality Sensitive Hashing
# params
k = 10 # hash size
L = 5  # number of tables
d = 512 # Dimension of Feature vector
lsh = LSHash(hash_size=k, input_dim=d, num_hashtables=L)

# LSH on all the images
for img_path, vec in notebook.tqdm(feature_dict.items()):
    lsh.index(vec.flatten(), extra_data=img_path)
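
To make the parameters above concrete: in the usual random-projection form of LSH, each table draws `k` random hyperplanes, and a vector's hash is the `k`-bit pattern recording which side of each hyperplane it falls on. `L` independent tables give similar vectors more chances to share a bucket. The following is a minimal standard-library sketch of that idea with tiny made-up dimensions. It is not the `lshash` implementation, and every name in it is hypothetical.

```python
import random

def make_planes(k, d, rng):
    """k random hyperplanes through the origin in d dimensions."""
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]

def hash_vector(planes, vec):
    """k-bit string: '1' where vec lies on the positive side of a plane."""
    return "".join(
        "1" if sum(p * v for p, v in zip(plane, vec)) > 0 else "0"
        for plane in planes
    )

class TinyLSH:
    def __init__(self, k, d, L, seed=0):
        rng = random.Random(seed)
        # One (planes, buckets) pair per hash table
        self.tables = [(make_planes(k, d, rng), {}) for _ in range(L)]

    def index(self, vec, label):
        for planes, buckets in self.tables:
            buckets.setdefault(hash_vector(planes, vec), []).append((vec, label))

    def query(self, vec):
        """Labels sharing a bucket with vec in any table, nearest first."""
        candidates = {}
        for planes, buckets in self.tables:
            for stored, label in buckets.get(hash_vector(planes, vec), []):
                # Squared Euclidean distance, used only to rank candidates
                candidates[label] = sum((a - b) ** 2 for a, b in zip(stored, vec))
        return sorted(candidates, key=candidates.get)

lsh_demo = TinyLSH(k=4, d=3, L=5)
lsh_demo.index([1.0, 0.9, 0.0], "a")
lsh_demo.index([0.9, 1.0, 0.1], "b")    # nearly the same direction as "a"
lsh_demo.index([-1.0, -0.8, 0.2], "c")  # roughly the opposite direction
print(lsh_demo.query([1.0, 0.9, 0.0]))
```

With `k = 10`, each table has up to 2^10 = 1024 possible buckets, far more than the ~120 images here. A larger `k` means fewer candidates per query, and a smaller `k` or a larger `L` means more.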

In [None]:
## Exporting as pickle
with open(path+'lsh.p', "wb") as f:
    pickle.dump(lsh, f)

## Visualizing Output

In [None]:
## Loading feature dictionary and LSH index
with open(path+'feature_dict.p', 'rb') as f:
    feature_dict = pickle.load(f)
with open(path+'lsh.p', 'rb') as f:
    lsh = pickle.load(f)

In [None]:
feature_dict

In [None]:
np.shape(lsh.query(feature_dict[list(feature_dict.keys())[0]].flatten()))

In [None]:
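def get_similar_item(idx, feature_dict, lsh_variable, n_items=5):
    # Query with the idx-th image; ask for one extra result because the
    # query image itself is normally returned as the closest match
    query_vec = feature_dict[list(feature_dict.keys())[idx]].flatten()
    response = lsh_variable.query(query_vec, num_results=n_items+1,
                                  distance_func='hamming')

    columns = 3
    rows = int(np.ceil((n_items+1)/columns))  # parentheses fix the operator precedence
    fig = plt.figure(figsize=(3*columns, 3*rows))
    # LSH is approximate, so fewer than n_items+1 results may come back
    for i, result in enumerate(response, start=1):
        img = Image.open(result[0][1])  # [0][1] holds the extra_data (image path)
        fig.add_subplot(rows, columns, i)
        plt.imshow(img)
    plt.show()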

In [None]:
get_similar_item(0, feature_dict, lsh,5)

In [None]:
get_similar_item(50, feature_dict, lsh,5)

In [None]:
get_similar_item(20, feature_dict, lsh, 8)

In [None]:
get_similar_item(30, feature_dict, lsh,11)

In [None]:
get_similar_item(100, feature_dict, lsh,11)

In [None]:
get_similar_item(90, feature_dict, lsh,11)