## LSH

### Get data

In [34]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [35]:
!pip install faiss-cpu --no-cache



In [36]:
import pandas as pd
import numpy as np
import faiss
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
path = "/content/gdrive/My Drive/data"


In [37]:
train = pd.read_csv(path + "/gensim/ag_news_train.csv")

In [38]:
train.shape

(120000, 3)

In [39]:
train.head(5)

Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
1,3,Carlyle Looks Toward Commercial Aerospace (Reuters),"Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
3,3,Iraq Halts Oil Exports from Main Southern Pipeline (Reuters),"Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
4,3,"Oil prices soar to all-time record, posing new menace to US economy (AFP)","AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."


In [40]:
train.columns

Index(['Class Index', 'Title', 'Description'], dtype='object')

In [41]:
train['Description'][0:5]

0    Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.                                                                                                                        
1    Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
2    Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.                              
3    Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.                   
4    AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three

In [42]:
sentences = train['Description'][0:10000]

In [43]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer



In [44]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(10000, 768)

In [None]:
sentence_embeddings.shape[0]

10000

### Save embeddings

In [None]:
with open(path + '/AG_news.npy', 'wb') as file:
    np.save(file, sentence_embeddings)

### Load embeddings

In [45]:
with open(path + '/AG_news.npy', 'rb') as f:
    # A plain float32 array does not need pickle support to load
    sentence_embeddings = np.load(f)

In [46]:
sentence_embeddings.shape

(10000, 768)

In [47]:
sentence_embeddings[0:5]

array([[-0.26105028,  0.8585296 ,  0.03941074, ...,  1.0689917 ,
         1.1770816 , -0.74388623],
       [-0.2222097 , -0.03594436,  0.5209106 , ...,  0.15727971,
        -0.3867779 ,  0.49948674],
       [-0.3001758 , -0.41582862,  0.86036515, ..., -0.6246218 ,
         0.52692914, -0.36817163],
       [ 0.3295024 ,  0.22334357,  0.30229023, ..., -0.41823167,
         0.01728885, -0.05920589],
       [-0.22277102,  0.7840586 ,  0.2004052 , ..., -0.9121561 ,
         0.2918987 , -0.12284964]], dtype=float32)

In [48]:
dim = sentence_embeddings.shape[1]
dim

768

### Build LSH index

In [58]:
n_bits = 48  # also tried 32; the search results differ between 32 and 48
n_dim = sentence_embeddings.shape[1]  # 768
lshIndex = faiss.IndexLSH(n_dim, n_bits)

In [59]:
lshIndex.add(sentence_embeddings)

In [60]:
lshIndex.ntotal

10000
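
For intuition about what `n_bits` controls, here is a minimal NumPy sketch of random-hyperplane hashing, the general idea behind this kind of index. It is an illustration under assumptions, not FAISS's actual implementation: the helper name `lsh_codes`, the Gaussian projection matrix, and the zero threshold are hypothetical choices, and the random `vectors` array only stands in for `sentence_embeddings`.

```python
import numpy as np

def lsh_codes(x, projections):
    """Hash each row of x to a binary code: one bit per random hyperplane."""
    # A bit is 1 when the vector lies on the positive side of that hyperplane.
    return (x @ projections > 0).astype(np.uint8)

rng = np.random.default_rng(0)
n_dim, n_bits = 768, 48

# One random direction per output bit (assumed Gaussian for this sketch).
projections = rng.standard_normal((n_dim, n_bits)).astype(np.float32)

# Stand-in for sentence_embeddings so the sketch runs on its own.
vectors = rng.standard_normal((1000, n_dim)).astype(np.float32)

codes = lsh_codes(vectors, projections)
print(codes.shape)  # (1000, 48)
```

Each 768-dimensional embedding is reduced to `n_bits` bits, so a smaller `n_bits` gives a more compact but coarser code. That is consistent with the results changing between 32 and 48 bits.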

### Search example 1:

In [61]:
qry1 = model.encode(["economic booming and stock market"])

In [62]:
k = 5  # number of nearest neighbours to return
d, I = lshIndex.search(qry1, k)  # d: distances, I: row indices of the hits

In [63]:
print(I)

[[4464 9767 1876 2580 2869]]


In [64]:
%%time
d, I = lshIndex.search(qry1, k)
print(I)

[[4464 9767 1876 2580 2869]]
CPU times: user 14.4 ms, sys: 0 ns, total: 14.4 ms
Wall time: 20.9 ms


In [65]:
print(d)

[[10. 10. 11. 11. 11.]]
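
The distances are small whole numbers because the index compares binary codes rather than the original float vectors. To my understanding they are Hamming distances, that is, the number of differing bits out of `n_bits`. The sketch below shows that ranking on toy 4-bit codes; `hamming_topk` and the example codes are hypothetical and are not part of the FAISS API.

```python
import numpy as np

def hamming_topk(query_code, db_codes, k):
    """Return (distances, ids) of the k codes with the fewest differing bits."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    # A stable sort keeps the original order among tied distances.
    ids = np.argsort(dists, kind="stable")[:k]
    return dists[ids], ids

# Toy database of four 4-bit codes (made up for illustration).
db_codes = np.array([[0, 1, 1, 0],
                     [1, 1, 1, 0],
                     [1, 0, 0, 1],
                     [0, 1, 0, 0]], dtype=np.uint8)
query_code = np.array([0, 1, 1, 0], dtype=np.uint8)

dists, ids = hamming_topk(query_code, db_codes, k=3)
print(dists, ids)  # [0 1 1] [0 1 3]
```

With 48-bit codes a distance can take only 49 distinct values (0 to 48), so ties such as `10, 10` and `11, 11, 11` are expected, and the order among tied hits carries little information.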


In [66]:
for i in I[0]:
  print(train['Description'][i])

The Tehran Stock Exchange has performed magnificently, but the market's list of risks is outsized.
Astronomers are claiming to have found a  quot;super-Earth quot; orbiting a star some 50 light years away. They say the finding could significantly boost the hunt for worlds beyond our Solar System.
Aug. 17 (Bloomberg) -- Air Canada creditors including a General Electric Co. unit and Deutsche Bank AG cleared a plan that gives them most of the company #39;s equity when the carrier emerges from bankruptcy protection at the end of September. 
 APPLIED MATERIALS INC. &lt;A HREF="http://www.investor.reuters.com/FullQuote.aspx?ticker=AMAT.O target=/stocks/quickinfo/fullquote"&gt;AMAT.O&lt;/A&gt;:
If you're going to spend a lot producing a slick annual report, go all in.


### Search example 2:

In [None]:
qry2 = model.encode(["Red sox won the game"])
k=5

In [None]:
%%time
d, I = lshIndex.search(qry2, k)
print(I)

[[9291 1319 6667 6669 8463]]
CPU times: user 2.22 ms, sys: 0 ns, total: 2.22 ms
Wall time: 1.25 ms


In [None]:
for i in I[0]:
  print(train['Description'][i])

Derek Jeter, who suffered a bone bruise on his left elbow when he was plunked by a pitch Monday, not only played Tuesday, but he turned out to be the star.
AP - Daryle Ward, Albert Pujols and Chipper Jones had big nights at the plate at the expense of National League pitchers.
Reuters - Sudanese Darfur rebels arrived in Nigeria\on Sunday ahead of peace talks under the African Union (AU) to\resolve a conflict that has killed up to 50,000 and displaced\more than a million people.
 ABUJA (Reuters) - Sudanese Darfur rebels arrived in Nigeria  on Sunday ahead of peace talks under the African Union (AU) to  resolve a conflict that has killed up to 50,000 and displaced  more than a million people.
ST. LOUIS -- Over time, Cal Eldred has learned to embrace the bullpen. The righthander came to the St. Louis Cardinals in the spring of 2003 looking to compete for a spot in the rotation. Before undergoing reconstructive elbow surgery, he had been a productive starter for the Brewers, and won 10 gam
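
Some of the hits above are off-topic (for example, the Darfur articles returned for a baseball query), so it can help to measure how often the LSH index agrees with an exact search. The following is a minimal sketch, assuming squared L2 distance as the exact baseline; `exact_topk` and `recall_at_k` are hypothetical helpers, and the random data and the hard-coded `approx_ids` are placeholders for `sentence_embeddings` and `I[0]`.

```python
import numpy as np

def exact_topk(query, vectors, k):
    """Brute-force k nearest neighbours by squared L2 distance."""
    dists = np.sum((vectors - query) ** 2, axis=1)
    return np.argsort(dists, kind="stable")[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)

# Stand-in for sentence_embeddings so the sketch runs on its own.
vectors = rng.standard_normal((200, 16)).astype(np.float32)
query = vectors[3] + 0.01  # a query that sits very close to row 3

exact_ids = exact_topk(query, vectors, k=5)

# Placeholder for I[0] as returned by lshIndex.search(...).
approx_ids = np.array([3, 10, 20, 30, 40])

recall = recall_at_k(approx_ids.tolist(), exact_ids.tolist())
print(recall)
```

In the notebook, `vectors` would be `sentence_embeddings`, `query` would be `qry1[0]` or `qry2[0]`, and `approx_ids` would be `I[0]`. Averaging the recall over many queries gives a rough basis for choosing `n_bits`.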