<a href="https://colab.research.google.com/github/VinishUchiha/Fine-Tuning-BERT/blob/master/Semantic_Similarity_Search/semantic_similarity_search_using_fine_tuned_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# copy the saved document classification model
!cp -r /content/drive/My\ Drive/BERT_doc_classification/saved_model /content/saved_model/

In [8]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)

In [10]:
# check the gpu availability and set the device
import torch

if torch.cuda.is_available():
  device = torch.device('cuda')
  print('GPU :',torch.cuda.get_device_name(0))
else:
  device = torch.device('cpu')

GPU : Tesla P100-PCIE-16GB


In [0]:
import urllib.request  # the request submodule must be imported explicitly
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [0]:
# read the dataset
import pandas as pd

comments = pd.read_csv('attack_annotated_comments.tsv',sep='\t',index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',sep='\t')

In [0]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

# join labels and comments
comments['attack'] = labels

# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
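
The labelling rule above (mean vote strictly above 0.5) can be sketched in plain Python. This is only an illustration on made-up data: `toy_annotations` and `majority_labels` are hypothetical names, not part of the notebook or the dataset.

```python
from collections import defaultdict

# Hypothetical toy annotations: (rev_id, attack) pairs, one per annotator vote.
toy_annotations = [
    (101, 1), (101, 1), (101, 0),   # 2 of 3 annotators flagged an attack
    (102, 0), (102, 1),             # tie: the mean is exactly 0.5
    (103, 0), (103, 0), (103, 0),   # nobody flagged an attack
]

def majority_labels(rows):
    """Return {rev_id: bool}, True when the mean vote is strictly above 0.5."""
    votes = defaultdict(list)
    for rev_id, attack in rows:
        votes[rev_id].append(attack)
    return {rev_id: sum(v) / len(v) > 0.5 for rev_id, v in votes.items()}

toy_labels = majority_labels(toy_annotations)
print(toy_labels)  # {101: True, 102: False, 103: False}
```

Note that a tie (mean of exactly 0.5) is not labelled as an attack, because the comparison is strict.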

In [0]:
from transformers import BertForSequenceClassification, BertTokenizer

model_dir = '/content/saved_model'

model = BertForSequenceClassification.from_pretrained(model_dir,
                                                      output_hidden_states=True)
#load the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_dir)

model.to(device)

In [0]:
import torch
from keras.preprocessing.sequence import pad_sequences

# text to embedding function
def text_to_embedding(tokenizer,model,text):

  MAX_LEN = 128
  input_ids = tokenizer.encode(text,
                               add_special_tokens=True,
                               max_length = MAX_LEN)
  results = pad_sequences([input_ids],maxlen=MAX_LEN,dtype='long',
                          truncating='post',padding='post')
  
  input_ids = results[0]

  attn_mask = [int(i>0) for i in input_ids]

  # convert to tensors
  input_ids = torch.tensor(input_ids)
  attn_mask = torch.tensor(attn_mask)

  # add one extra dim
  input_ids = input_ids.unsqueeze(0)
  attn_mask = attn_mask.unsqueeze(0)

  model.eval()

  # move to the selected device (GPU if available, otherwise CPU)
  input_ids = input_ids.to(device)
  attn_mask = attn_mask.to(device)

  with torch.no_grad():

    logits, encoded_layers = model(input_ids,token_type_ids=None,
                                   attention_mask = attn_mask)
    
  layer = 12 # output of the last BERT layer (index 0 is the embedding output)
  batch = 0  # only one sequence in the batch
  token = 0  # first position, i.e. the [CLS] token

  vec = encoded_layers[layer][batch][token]

  # move to cpu
  vec = vec.detach().cpu().numpy()

  return (vec)
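
The padding and attention-mask step inside `text_to_embedding` can be illustrated without Keras or PyTorch. The sketch below is a plain-Python stand-in for `pad_sequences(..., truncating='post', padding='post')` followed by the mask rule; `pad_and_mask` is a hypothetical helper, and the token ids are illustrative values that assume the standard BERT vocabulary, where 0 is `[PAD]`, 101 is `[CLS]` and 102 is `[SEP]`.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the attention mask."""
    ids = list(token_ids[:max_len])          # truncating='post': drop the tail
    ids += [pad_id] * (max_len - len(ids))   # padding='post': pad on the right
    mask = [int(i > 0) for i in ids]         # same rule as in text_to_embedding
    return ids, mask

# Illustrative ids only: [CLS], two word pieces, [SEP].
ids, mask = pad_and_mask([101, 2023, 3931, 102], max_len=8)
print(ids)   # [101, 2023, 3931, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask tells the model to ignore the padded positions, so the `[CLS]` vector does not depend on how much padding was added.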

In [22]:
# text from one of the comments
input_text = comments.iloc[10].comment

print(input_text)

vec = text_to_embedding(tokenizer, model, input_text)

print('Embedding Shape',vec.shape)

  :Correct. Full biographical details will put down his birth details, etc. It is just a marker to me at the moment to detail the WR aspect. He certainly wasn't Belarus; as a geo-political entity it had no real existence at the time. I have put a tbc marker on this article for now. 
Embedding Shape (768,)


In [0]:
# Helper function
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as h:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
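
A quick way to check the helper, together with the arithmetic for a remaining-time estimate. This is a minimal sketch: `format_time` is restated so the block runs on its own, and `estimate_remaining` is a hypothetical helper introduced here for illustration.

```python
import datetime

def format_time(elapsed):
    """Same helper as above: seconds -> 'h:mm:ss' string."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))

def estimate_remaining(elapsed_sec, rows_done, rows_total):
    """Hypothetical helper: project remaining seconds from the average time per row."""
    sec_per_row = elapsed_sec / rows_done
    return sec_per_row * (rows_total - rows_done)

print(format_time(83.6))  # 0:01:24
print(format_time(estimate_remaining(10.0, 100, 300)))  # 0:00:20
```

The estimate assumes every row takes roughly the same time, which is only approximately true when comment lengths vary.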

In [25]:
t0 = time.time()

embeddings = []

num_comments = len(comments)

print(f'Generating Sentence Embedding of {num_comments} comments')

row_num = 0

for index, row in comments.iterrows():
  if row_num % 2000 == 0 and not row_num == 0:
    elapsed = format_time(time.time() - t0)

    # calculate the remaining time
    sec_per_row = (time.time() - t0) / row_num
    remaining_sec = sec_per_row * (num_comments - row_num)
    remaining = format_time(remaining_sec)

    print(f'  Comment {row_num} of {num_comments} Elapsed: {elapsed} Remaining: {remaining}')

  # Vectorize the comment
  vec = text_to_embedding(tokenizer,model,row.comment)

  #store the embedding
  embeddings.append(vec)

  row_num += 1

Generating Sentence Embedding of 115864 comments
  Comment 2000 of 115864 Elapsed: 0:00:23 Remaining: 0:21:32
  Comment 4000 of 115864 Elapsed: 0:00:46 Remaining: 0:21:16
  Comment 6000 of 115864 Elapsed: 0:01:08 Remaining: 0:20:50
  Comment 8000 of 115864 Elapsed: 0:01:31 Remaining: 0:20:26
  Comment 10000 of 115864 Elapsed: 0:01:53 Remaining: 0:19:53
  Comment 12000 of 115864 Elapsed: 0:02:15 Remaining: 0:19:27
  Comment 14000 of 115864 Elapsed: 0:02:38 Remaining: 0:19:07
  Comment 16000 of 115864 Elapsed: 0:03:00 Remaining: 0:18:42
  Comment 18000 of 115864 Elapsed: 0:03:22 Remaining: 0:18:21
  Comment 20000 of 115864 Elapsed: 0:03:45 Remaining: 0:17:57
  Comment 22000 of 115864 Elapsed: 0:04:07 Remaining: 0:17:33
  Comment 24000 of 115864 Elapsed: 0:04:29 Remaining: 0:17:09
  Comment 26000 of 115864 Elapsed: 0:04:50 Remaining: 0:16:43
  Comment 28000 of 115864 Elapsed: 0:05:12 Remaining: 0:16:19
  Comment 30000 of 115864 Elapsed: 0:05:34 Remaining: 0:15:56
  Comment 32000 of 115864

In [26]:
import numpy as np

# convert the list of vec into 2D array
vecs = np.stack(embeddings)

vecs.shape

(115864, 768)

In [27]:
# k-NN with faiss (Facebook AI Similarity Search)
!pip install faiss

Collecting faiss
  Downloading https://files.pythonhosted.org/packages/bd/1c/4ae6cb87cf0c09c25561ea48db11e25713b25c580909902a92c090b377c0/faiss-1.5.3-cp36-cp36m-manylinux1_x86_64.whl (4.7MB)
Installing collected packages: faiss
Successfully installed faiss-1.5.3


In [28]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading https://files.pythonhosted.org/packages/a8/69/0e3f56024bb1423a518287673071ae512f9965d1faa6150deef5cc9e7996/faiss_gpu-1.6.3-cp36-cp36m-manylinux2010_x86_64.whl (35.5MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.6.3


In [30]:
import faiss

# build a flat cpu index
cpu_index = faiss.IndexFlatL2(vecs.shape[1])

print('Number of Available GPUs : ',faiss.get_num_gpus())

# for multiple GPU
co = faiss.GpuMultipleClonerOptions()
co.shard = True

# Make it into gpu index
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index,co = co,ngpu = 1)

# add vec to our gpu index
t0 = time.time()
gpu_index.add(vecs)
elapsed = time.time() - t0
print(f'Time Taken to add vec: {elapsed}')

Number of Available GPUs :  1
Time Taken to add vec: 0.07732367515563965
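
Conceptually, a flat L2 index does an exhaustive comparison of the query against every stored vector, and faiss reports squared Euclidean distances for `IndexFlatL2`. The sketch below mimics that behaviour in plain Python on a toy database; `squared_l2`, `flat_l2_search` and `toy_vecs` are hypothetical names, and this is not how faiss is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(database, query, k):
    """Exhaustive k-NN: return (distances, indices) of the k closest rows, nearest first."""
    scored = sorted((squared_l2(row, query), idx) for idx, row in enumerate(database))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
toy_D, toy_I = flat_l2_search(toy_vecs, query=[0.0, 0.0], k=3)
print(toy_D)  # [0.0, 1.0, 8.0]
print(toy_I)  # [0, 1, 3]
```

As with the index, querying with a vector that is already stored returns that vector first with distance 0.0.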


In [33]:
# Semantic Similarity Search
print(f'Comment #4: {comments.iloc[4].comment}')

#find top 5 similar content
D,I = gpu_index.search(vecs[4].reshape(1,768),k=5)

print("   Top 5 Results   ")

for i in range(I.shape[1]):
  result = I[0,i]
  text = comments.iloc[result].comment
  print(f'Comment Number {result}')
  print(f'L2 Distance {D[0,i]}')
  print(text)
  print()

Comment #4: This page will need disambiguation. 
   Top 5 Results   
Comment Number 4
L2 Distance 0.0
This page will need disambiguation. 

Comment Number 2872
L2 Distance 13.7762451171875
DISAMBIGUATION PAGE needed  

Comment Number 39578
L2 Distance 14.284393310546875
  This page needs to be expand.   

Comment Number 45760
L2 Distance 14.841827392578125
So what is m? This page fails to define it.

Comment Number 77417
L2 Distance 15.710906982421875
       A couple of these images should be added to the article.   



In [34]:
# query a text
query_text = 'the content in this page is fake'

# vectorize the text
query_vec = text_to_embedding(tokenizer,model,query_text)

#find top 5 similar content
D,I = gpu_index.search(query_vec.reshape(1,768),k=5)

print("   Top 5 Results   ")

for i in range(I.shape[1]):
  result = I[0,i]
  text = comments.iloc[result].comment
  print(f'Comment Number {result}')
  print(f'L2 Distance {D[0,i]}')
  print(text)
  print()

   Top 5 Results   
Comment Number 97265
L2 Distance 43.85711669921875
  This pages is not accurate it is full of information that refer to jokes and not actual data. Abstain from using it for any project. I recomend for this page to be locked    JP

Comment Number 94394
L2 Distance 45.39288330078125
I don't think this page has any issues !

Comment Number 90833
L2 Distance 45.830902099609375
 :Where is that permission given? The linked page does not contain any such statement. There is no evidence of any CC-BY-SA licence on that page either.   

Comment Number 35394
L2 Distance 45.919952392578125
  == Delete == I think this page should be deleted its obviously not in standard for wikipedia.

Comment Number 53121
L2 Distance 46.393035888671875
  : This article is completely fake. Pyrrhus was ILLIRYAN, NOT GREEK.

