Discussion 6. LSH (Locality-Sensitive Hashing)

1. Using the code provided, repeat the experiment run in the article: Predicting NIPS articles based on titles.

2. Scroll through the NIPS article titles and identify one that interests you. Then find the top 5 recommendations for it using the LSH method.

3. Assess the recommendations: are the recommended titles similar to the title of your original paper? Share your results in the discussion!

Code reference: https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/


In [3]:
# load Python packages
import numpy as np
import pandas as pd
import re
import time
from datasketch import MinHash, MinHashLSHForest

In [4]:
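# preprocess data
def preprocess(text):
    # strip punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # lowercase, then split on whitespace into word tokens
    tokens = text.lower().split()
    return tokens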
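
To see what the preprocessing step hands to MinHash, the short sketch below runs the same function on two titles that appear in this notebook. It is only an illustration. The variable names `plain` and `hyphenated` are made up for this example. Note that removing punctuation also removes hyphens, so "Top-Down" becomes the single token "topdown".

```python
import re

def preprocess(text):
    # same steps as the cell above: drop punctuation, lowercase, split on whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

# two titles taken from the queries and results in this notebook
plain = preprocess('Using a neural net to instantiate a deformable model')
hyphenated = preprocess('Top-Down Control of Visual Attention: A Rational Account')

print(plain)
print(hyphenated)
```

Repeated words such as "a" stay in the token list, but MinHash treats the input as a set, so duplicates do not change the signature.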

In [5]:
# choose parameters
# Number of permutations
permutations = 128

# Number of recommendations to return
num_recommendations = 1

In [6]:
def get_forest(data, perms):
    start_time = time.time()

    # build one MinHash signature per document from its word tokens
    minhash = []
    for text in data['text']:
        tokens = preprocess(text)
        m = MinHash(num_perm=perms)
        for s in tokens:
            m.update(s.encode('utf8'))
        minhash.append(m)

    # add every signature to the forest, keyed by its row position in data
    forest = MinHashLSHForest(num_perm=perms)
    for i, m in enumerate(minhash):
        forest.add(i, m)

    # index() must be called before the forest can be queried
    forest.index()

    print('It took %s seconds to build forest.' % (time.time() - start_time))

    return forest
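
The idea behind the signatures built above can be sketched without any third-party library. The toy code below is not how datasketch implements MinHash. It is a minimal illustration of the principle, assuming one salted BLAKE2b hash per "permutation". The helper names `toy_signature`, `estimate_jaccard` and `exact_jaccard` are hypothetical. The fraction of signature positions on which two documents agree approximates the Jaccard similarity of their token sets.

```python
import hashlib

def toy_signature(tokens, num_perm=128):
    # hypothetical helper: one salted hash function per "permutation";
    # for each one, keep the smallest hash value over the distinct tokens
    distinct = set(tokens)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, 'little')
        signature.append(min(
            hashlib.blake2b(tok.encode('utf8'), digest_size=8, salt=salt).digest()
            for tok in distinct
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    # share of positions where the two signatures hold the same minimum
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def exact_jaccard(tokens_a, tokens_b):
    # |A intersect B| / |A union B| on the token sets
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

# already-preprocessed tokens of two titles from this notebook
tokens_a = ['a', 'neurodynamical', 'approach', 'to', 'visual', 'attention']
tokens_b = ['topdown', 'control', 'of', 'visual', 'attention', 'a', 'rational', 'account']

sig_a = toy_signature(tokens_a)
sig_b = toy_signature(tokens_b)

exact = exact_jaccard(tokens_a, tokens_b)
estimate = estimate_jaccard(sig_a, sig_b)
print('exact Jaccard:    ', exact)
print('estimated Jaccard:', estimate)
```

More permutations make the estimate tighter at the cost of longer signatures, which is the trade-off behind the `permutations` parameter chosen earlier.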

In [7]:
# evaluate queries
def predict(text, database, perms, num_results, forest):
    start_time = time.time()
    
    tokens = preprocess(text)
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))
        
    # ask the forest for the approximate top num_results keys (row positions)
    idx_array = np.array(forest.query(m, num_results))
    if len(idx_array) == 0:
        return None # no neighbours found for this query
    
    result = database.iloc[idx_array]['title']
    
    print('It took %s seconds to query forest.' %(time.time()-start_time))
    
    return result

In [8]:
# Test recommendation engine on NIPS Conference Papers
db = pd.read_csv('papers.csv')
db['text'] = db['title'] + ' ' + db['abstract']
forest = get_forest(db, permutations)

It took 10.54420280456543 seconds to build forest.


In [9]:
num_recommendations = 5
title = 'Using a neural net to instantiate a deformable model'
result = predict(title, db, permutations, num_recommendations, forest)
print('\n Top Recommendation(s) is(are) \n', result)

It took 1.6689658164978027 seconds to query forest.

 Top Recommendation(s) is(are) 
 995     Neural Network Weight Matrix Synthesis Using O...
5       Using a neural net to instantiate a deformable...
5191    A Self-Organizing Integrated Segmentation and ...
2069    Analytic Solutions to the Formation of Feature...
2457    Inferring Neural Firing Rates from Spike Train...
Name: title, dtype: object


In [15]:
title = 'A Neurodynamical Approach to Visual Attention'
result = predict(title, db, permutations, num_recommendations, forest)
print('\n Top Recommendation(s) is(are) \n', result)

It took 0.08273482322692871 seconds to query forest.

 Top Recommendation(s) is(are) 
 7040    Functional Models of Selective Attention and C...
457     Correlates of Attention in a Model of Dynamic ...
2064    Top-Down Control of Visual Attention: A Ration...
7057    Neurobiology, Psychophysics, and Computational...
821         A Neurodynamical Approach to Visual Attention
Name: title, dtype: object


In [20]:
# print the recommended titles in full (the Series display above truncates them)
for rec_title in result:
    print(rec_title)

Functional Models of Selective Attention and Context Dependency
Correlates of Attention in a Model of Dynamic Visual Recognition
Top-Down Control of Visual Attention: A Rational Account
Neurobiology, Psychophysics, and Computational Models of Visual Attention
A Neurodynamical Approach to Visual Attention
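
One rough way to assess these recommendations is to drop the query paper itself (it appears in its own result list) and rank the rest by how many title words they share with the query. The sketch below does this with exact Jaccard similarity on title tokens. Keep in mind that this is an assumption-laden check: the forest was built on title plus abstract, so title-only overlap is not the score the LSH method used. The helper names `title_jaccard` and `ranked` are hypothetical.

```python
import re

def preprocess(text):
    # same preprocessing as earlier in the notebook
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

def title_jaccard(title_a, title_b):
    # exact Jaccard similarity between the word sets of two titles
    a, b = set(preprocess(title_a)), set(preprocess(title_b))
    return len(a & b) / len(a | b)

query = 'A Neurodynamical Approach to Visual Attention'

# the five titles printed above
recommendations = [
    'Functional Models of Selective Attention and Context Dependency',
    'Correlates of Attention in a Model of Dynamic Visual Recognition',
    'Top-Down Control of Visual Attention: A Rational Account',
    'Neurobiology, Psychophysics, and Computational Models of Visual Attention',
    'A Neurodynamical Approach to Visual Attention',
]

# drop the query itself, then sort by title overlap, highest first
ranked = sorted(
    ((title_jaccard(query, rec), rec) for rec in recommendations if rec != query),
    reverse=True,
)

for score, rec in ranked:
    print('%.3f  %s' % (score, rec))
```

A low title score does not mean a poor recommendation, since two papers can share abstract vocabulary without sharing title words. It is still a quick way to put a number next to a subjective judgment.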
