
# Approximate Nearest Neighbors
* LSH
* Exhaustive search
* Product Quantization
* Trees and Graphs
* HNSW

# Library Imports

In [1]:
!pip install datasketch



In [2]:
import numpy as np
import pandas as pd
import re
import time
import datasketch
from datasketch import MinHash, MinHashLSHForest

In [3]:
from google.colab import drive
drive.mount('/content/gdrive/')

%matplotlib inline

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


# LSH
> LSH (locality-sensitive hashing) is a hashing-based algorithm for finding approximate nearest neighbors. It hashes each item's embedding so that spatial locality is preserved: items that are similar in the high-dimensional space have a higher chance of receiving the same hash value.

Dataset: [Indian Food 101](https://www.kaggle.com/nehaprabhavalkar/indian-food-101)

Task: build a recommendation engine that suggests similar dishes based on their ingredients.
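
To make the "higher chance of receiving the same hash value" claim concrete, here is a minimal sketch of the classic banding scheme for LSH. It is an illustration under assumptions: the function name and the choice of 32 bands of 4 rows are hypothetical, and `MinHashLSHForest` (used in this notebook) relies on prefix trees rather than this exact fixed-band layout.

```python
def lsh_collision_probability(similarity: float, rows: int, bands: int) -> float:
    """Probability that two items share at least one band (bucket).

    Classic banding scheme: a signature of rows * bands hash values is cut
    into `bands` groups of `rows` values; two items become candidates if
    any group matches exactly.
    """
    return 1.0 - (1.0 - similarity ** rows) ** bands


# 128 hash values split into 32 bands of 4 rows (illustrative choice)
for s in (0.2, 0.5, 0.8):
    p = lsh_collision_probability(s, rows=4, bands=32)
    print(f"similarity={s:.1f} -> P(same bucket)={p:.3f}")
```

The probability rises steeply with similarity, which is why near-duplicates almost always collide while dissimilar items rarely do.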

In [20]:
df_indian_food = pd.read_csv('/content/gdrive/MyDrive/DataMining/indian_food.csv')

In [21]:
df_indian_food.head(1)

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
0,Balu shahi,"Maida flour, yogurt, oil, sugar",vegetarian,45,25,sweet,dessert,West Bengal,East


## Shingles
Convert the text to shingles (tokens). Here each comma-separated ingredient becomes one shingle.

In [23]:
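def preprocessIngredients(text):
    # lowercase so that 'Sugar' and 'sugar' map to the same shingle
    text = text.lower()
    # each comma-separated ingredient is one shingle
    tokens = text.split(', ')
    # drop empty strings left by stray separators
    tokens = list(filter(lambda x: len(x) > 0, tokens))
    return tokens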

In [24]:
preprocessIngredients('Maida flour, yogurt, oil, sugar')

['maida flour', 'yogurt', 'oil', 'sugar']
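
Shingle sets like the one above are typically compared with Jaccard similarity, which is the quantity MinHash approximates. The helper below is a small sketch, not part of the original notebook; `jaccard` is a hypothetical name, and the two ingredient lists are lowercased versions of rows that appear in the recommendation output later on.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)


kalakand = ['milk', 'cottage cheese', 'sugar']
basundi = ['sugar', 'milk', 'nuts']
print(jaccard(kalakand, basundi))  # 2 shared / 4 distinct = 0.5
```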

## MinHash and LSH
Apply MinHash to each shingle set and index the resulting signatures with LSH, which maps similar sets to the same bucket.
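
To show what a MinHash signature is before handing the work to `datasketch`, here is a standard-library sketch. It is a simplification under assumptions: each "permutation" is simulated by salting a 64-bit BLAKE2b hash with the permutation index, which is not how `datasketch` generates its permutations, and both function names are hypothetical.

```python
import hashlib


def minhash_signature(tokens, num_perm=128):
    """Return num_perm minimum hash values for a non-empty token collection."""
    unique_tokens = set(tokens)
    signature = []
    for i in range(num_perm):
        # a different salt per index stands in for a different permutation
        salt = i.to_bytes(8, 'little')
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode('utf8'), digest_size=8, salt=salt).digest(),
                'little')
            for t in unique_tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = ['milk', 'cottage cheese', 'sugar']
b = ['sugar', 'milk', 'nuts']
# the exact Jaccard similarity of these sets is 0.5; the estimate should be close
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

More permutations reduce the variance of the estimate at the cost of longer signatures, which is the trade-off behind the `permutations` parameter.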

In [8]:
# Number of permutations
permutations = 128  #@param
# Number of recommendations to return
num_recommendations = 5  #@param

In [25]:
df_indian_food.shape

(255, 9)

In [42]:
def create_forest(data, perms, label):
    start_time = time.time()
    minhash = []
    for text in data[label]:
        # convert each item in the dataset to a list of shingles
        tokens = preprocessIngredients(text)
        # create a MinHash with the requested number of permutations
        m = MinHash(num_perm=perms)
        # add every shingle of the item to its MinHash
        for s in tokens:
            m.update(s.encode('utf8'))
        # keep the MinHash of this item
        minhash.append(m)
    # build a forest from the MinHash objects computed above,
    # keyed by the row position of each item
    forest = MinHashLSHForest(num_perm=perms)
    for i, m in enumerate(minhash):
        forest.add(i, m)
    # index the forest to make it searchable
    forest.index()
    print('It took %s seconds to build forest.' % (time.time() - start_time))
    return forest

In [43]:
forest = create_forest(df_indian_food, permutations, 'ingredients')

It took 0.38170933723449707 seconds to build forest.


## Similarity search
Conduct a similarity search between the query item and the other items in the bucket

In [44]:
def get_recommendations(text, database, labels, perms, num_results, forest):
    start_time = time.time()
    # tokenize the query and MinHash it with the same number of permutations as the forest
    tokens = preprocessIngredients(text)
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))
    # retrieve the keys of the approximate nearest neighbors
    idx_array = np.array(forest.query(m, num_results))
    if len(idx_array) == 0:
        return None  # the forest returned no candidates for this query
    # forest keys are row positions (from enumerate), so select rows by position
    result = database.iloc[idx_array][labels]
    print('It took %s seconds to query forest.' % (time.time() - start_time))
    return result

In [52]:
ingredients = 'sugar, milk'
result = get_recommendations(ingredients, df_indian_food, ['name', 'ingredients'], permutations, num_recommendations, forest)
print('\n Top Recommendation(s) is(are) \n', result)

It took 0.008231878280639648 seconds to query forest.

 Top Recommendation(s) is(are) 
               name                  ingredients
8         Kalakand  Milk, cottage cheese, sugar
11           Lassi    Yogurt, milk, nuts, sugar
43  Kakinada khaja           Wheat flour, sugar
21   Chhena kheeri          Chhena, sugar, milk
56         Basundi            Sugar, milk, nuts
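
The rows above are not printed in order of similarity to the query. One possible follow-up step, sketched here as an assumption rather than something the notebook does, is to re-rank the returned candidates by exact Jaccard similarity. `rerank_by_jaccard` is a hypothetical helper, and the candidate lists are the lowercased ingredients from the output above.

```python
def rerank_by_jaccard(query_tokens, candidates):
    """Sort (label, tokens) pairs by exact Jaccard similarity to the query."""
    query = set(query_tokens)

    def score(tokens):
        token_set = set(tokens)
        union = query | token_set
        return len(query & token_set) / len(union) if union else 0.0

    scored = [(label, score(tokens)) for label, tokens in candidates]
    # sorted() is stable, so ties keep their original order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    ('Kalakand', ['milk', 'cottage cheese', 'sugar']),
    ('Lassi', ['yogurt', 'milk', 'nuts', 'sugar']),
    ('Kakinada khaja', ['wheat flour', 'sugar']),
    ('Chhena kheeri', ['chhena', 'sugar', 'milk']),
    ('Basundi', ['sugar', 'milk', 'nuts']),
]
for name, sim in rerank_by_jaccard(['sugar', 'milk'], candidates):
    print(f'{name}: {sim:.2f}')
```

With this ordering, dishes that contain both query ingredients and little else come first, and the dish that shares only sugar comes last.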


# Exhaustive search

Dataset: [Million Headlines](https://www.kaggle.com/therohk/million-headlines?select=abcnews-date-text.csv)

In [4]:
data = pd.read_csv('/content/gdrive/MyDrive/DataMining/abcnews-date-text.csv')

In [5]:
data.head(1)

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...


In [6]:
def preprocess(text):
    # lowercase, then split on runs of whitespace;
    # str.split() with no argument already drops empty tokens
    return text.lower().split()

In [7]:
preprocess(' this IS a test string with  With extra whitespaces ')

['this', 'is', 'a', 'test', 'string', 'with', 'with', 'extra', 'whitespaces']
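
The notebook stops at tokenization here, so the following is only a sketch of what an exhaustive (brute-force) search over tokenized headlines could look like: score the query against every document and keep the best `k`. The function names, the choice of cosine similarity over word counts, and the sample headlines are all assumptions made for illustration; the headlines are made up and are not rows from the dataset.

```python
import heapq
import math
from collections import Counter


def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exhaustive_search(query, corpus, k=3):
    """Score the query against every document; return the k best (score, index) pairs."""
    query_counts = Counter(query.lower().split())
    scored = (
        (cosine(query_counts, Counter(doc.lower().split())), idx)
        for idx, doc in enumerate(corpus)
    )
    return heapq.nlargest(k, scored)


# Made-up headlines for illustration; not rows from the dataset
headlines = [
    'council approves new water plan',
    'farmers welcome rain after drought',
    'council rejects water plan',
    'local team wins grand final',
]
print(exhaustive_search('council water plan', headlines, k=2))
```

Every query touches every document, so the cost grows linearly with the corpus size. That cost is what approximate methods such as LSH try to avoid.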

# References

* [Finding similar images using Deep learning and Locality Sensitive Hashing](https://towardsdatascience.com/finding-similar-images-using-deep-learning-and-locality-sensitive-hashing-9528afee02f5)
* [Implementing LSH in Python](https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/)

* [Pinecone examples notebook: deduplication_scholarly_articles.ipynb](https://colab.research.google.com/github/pinecone-io/examples/blob/master/deduplication/deduplication_scholarly_articles.ipynb)