# **Milestone 2: Semantic Search with ML and BERT**


In [3]:
!pip install faiss-cpu
!pip install transformers


### **Setting up the environment**

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### **Importing the required modules**

In [1]:
# import libraries
import json
import torch
import numpy as np
import faiss
from transformers import AutoModel, AutoTokenizer
from pprint import pprint

### **Getting the data**

In [2]:
DATA_DIR = '/content/drive/MyDrive/SearchToolwNLP/02_Implement Semantic Search with ML and BERT/data/'

In [3]:
# load the json file (a list of sentences)
with open(DATA_DIR + 'sentences.json', 'r') as infile:
    sentences = json.load(infile)

In [4]:
# show the first two sentences
sentences[:2]

['A pandemic is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people.',
 'The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century.']

In [5]:
print(len(sentences))

11


### **Vectorizing the dataset**

In [6]:
# load the DistilBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
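# function that vectorizes the text: mean of the final-layer token embeddings
def encode(doc):
  # truncate to the model's maximum input length so long documents do not raise an error
  tokens = tokenizer(doc, return_tensors='pt', truncation=True)
  print(tokens)  # debug output: token IDs and attention mask
  print("---")
  # inference only, so skip gradient tracking
  with torch.no_grad():
    vector = model(**tokens)[0].squeeze(0)  # shape: (seq_len, 768)
  return torch.mean(vector, dim=0)  # average over tokens -> (768,)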

In [None]:
# vectorize the documents
vectors = [encode(d) for d in sentences]

In [11]:
print(vectors[0][:10])

tensor([ 0.0486,  0.0974, -0.0493, -0.2006,  0.2463, -0.2616,  0.2512,  0.9330,
        -0.1771, -0.0981])
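
The `encode` function above averages over every token position, which is fine when one sentence is encoded at a time. If several sentences were tokenized together with padding, the padded positions would also enter the average. The following is a minimal sketch of mask-aware mean pooling, written with NumPy on toy arrays so it runs without downloading a model. The function name `masked_mean_pool` and the toy inputs are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    emb = np.asarray(token_embeddings, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, seq_len, 1)
    summed = (emb * mask).sum(axis=1)                # sum of real-token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens, never zero
    return summed / counts

# toy example: one sequence of three positions, the last one is padding
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(emb, mask)
print(pooled)  # the padded vector [100, 100] does not affect the mean
```

With real model output, the same idea would apply to the last hidden state and the tokenizer's `attention_mask`, converted to arrays of matching shapes.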


### **Building a faiss index**

In [12]:
# create a flat (exact) inner-product faiss index for 768-dimensional vectors
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# add the vectors into the index; faiss expects float32 vectors and int64 IDs
index.add_with_ids(np.stack([vec.numpy() for vec in vectors]).astype('float32'),
                   np.arange(len(sentences), dtype='int64')) # IDs from 0 to len(sentences) - 1
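
`IndexFlatIP` ranks by raw inner product, so a score reflects vector length as well as direction. If cosine similarity is wanted instead, one common approach is to scale every vector to unit length before adding it to the index, and to do the same for each query. The sketch below shows that normalization step with NumPy only; the helper name `l2_normalize` is an assumption made for illustration (faiss also ships an in-place `faiss.normalize_L2` for float32 arrays).

```python
import numpy as np

def l2_normalize(matrix, eps=1e-12):
    """Scale each row to unit L2 norm so that inner product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)  # eps guards against all-zero rows

# toy example with two 2-dimensional vectors
demo = l2_normalize([[3.0, 4.0], [0.0, 2.0]])
print(demo)                          # rows now have length 1
print(np.linalg.norm(demo, axis=1))  # row norms after scaling
```

Whether normalization improves the ranking for this dataset is not established here; it mainly makes the scores comparable across sentences of different lengths.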


### **Searching the index**

In [13]:
# function to search faiss
def search(query, k=5):
  query_encoded = encode(query).unsqueeze(dim=0).numpy()  # shape: (1, 768)
  scores, ids = index.search(query_encoded, k)  # each has shape (1, k)
  # faiss pads with ID -1 when fewer than k vectors are indexed; skip those
  return [(sentences[_id], score) for _id, score in zip(ids[0], scores[0]) if _id != -1]

In [14]:
# test a query
pprint(search("cholera infection dangerous", k=5))

{'input_ids': tensor([[  101, 25916,  8985,  4795,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
---
[('Cholera is an infection of the small intestine by some strains of the '
  'bacterium Vibrio cholerae.',
  45.404118),
 ('Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AIDS.', 42.768066),
 ('The Spanish flu, also known as the 1918 flu pandemic, was an unusually '
  'deadly influenza pandemic caused by the H1N1 influenza A virus.',
  42.646935),
 ('A pandemic is an epidemic of an infectious disease that has spread across a '
  'large region, for instance multiple continents or worldwide, affecting a '
  'substantial number of people.',
  41.176533),
 ('As of 2018, approximately 37.9 million people are infected with HIV '
  'globally.',
  39.692703)]
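
A flat inner-product index performs an exhaustive search: it scores the query against every stored vector and keeps the `k` largest scores. For a dataset this small, the same ranking can be reproduced with a few lines of NumPy, which is a handy sanity check on the index results. This is a sketch on toy vectors; `brute_force_top_k` is a hypothetical helper, not a faiss function.

```python
import numpy as np

def brute_force_top_k(doc_vectors, query_vector, k=5):
    """Return (ids, scores) of the k documents with the largest inner product."""
    doc_vectors = np.asarray(doc_vectors, dtype=np.float32)
    query_vector = np.asarray(query_vector, dtype=np.float32)
    all_scores = doc_vectors @ query_vector   # one inner product per document
    k = min(k, len(all_scores))               # cannot return more than we have
    top_ids = np.argsort(-all_scores)[:k]     # indices sorted by descending score
    return top_ids, all_scores[top_ids]

# toy example: three 2-dimensional "documents" and one query
docs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
ids, scores = brute_force_top_k(docs, np.array([1.0, 0.5]), k=2)
print(ids, scores)
```

In the notebook, the stacked `vectors` and an encoded query would play the roles of `docs` and the query array, assuming both are converted to NumPy first.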
