In this notebook, I implement asymmetric semantic search with the retrieve and re-rank pipeline - [Retrieve-Rerank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html). The sentence encoding step is similar to the other notebooks, but we don't use FAISS here. We first retrieve the top k passages for the query from our dataset with a bi-encoder, with k being a number >= 100. We then use a Cross-Encoder to re-rank those k passages.
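The two-stage flow described above can be sketched without any model. This is a toy illustration only: `cheap_score` and `careful_score` are hypothetical stand-ins for the bi-encoder similarity and the Cross-Encoder score, and the three-document corpus is made up.

```python
def cheap_score(query, doc):
    """Stand-in for bi-encoder similarity: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)


def careful_score(query, doc):
    """Stand-in for a cross-encoder: rewards query words that appear early in doc."""
    d_words = doc.lower().split()
    score = 0.0
    for word in set(query.lower().split()):
        if word in d_words:
            score += 1.0 / (1 + d_words.index(word))
    return score


def retrieve_and_rerank(query, corpus, top_k=100):
    # Stage 1: score every document cheaply and keep the top_k candidates.
    candidates = sorted(range(len(corpus)),
                        key=lambda i: cheap_score(query, corpus[i]),
                        reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the expensive scorer.
    return sorted(candidates,
                  key=lambda i: careful_score(query, corpus[i]),
                  reverse=True)


toy_corpus = [
    "a survey of graph algorithms",
    "methods for extraction of temporal expression spans",
    "temporal expression extraction with rules",
]
ranking = retrieve_and_rerank("temporal expression extraction", toy_corpus, top_k=2)
print(ranking)
```

The point of the design is cost: the cheap scorer touches every document, while the expensive one only sees the `top_k` survivors.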

In [1]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence_transformers
  Building wheel for sentence_transformers (setup.py) ... done
  Created wheel for sentence_transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=80ad1cac22e0934d8615a0198798e904185053fcb7d9ff1d3780f761587670b3
  Stored in directory: /root/.cache/pip/wheels/83/71/2b/40d17d21937fed496fb99145227eca8f20b4891240ff60c86f
Successfully built sentence_transformers
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-2.2.2

In [4]:
import numpy as np
import pandas as pd

from string import digits
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import re
from tqdm import tqdm, notebook

from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
/kaggle/input/arxiv-cs-papers-abstract-from-2010/cs_arxiv_from_2010.csv


In [5]:
docs_df = pd.read_csv('/kaggle/input/arxiv-cs-papers-abstract-from-2010/cs_arxiv_from_2010.csv')
docs_df.head()



Unnamed: 0,id,authors,title,category,abstract
0,704.0213,Ketan D. Mulmuley Hariharan Narayanan,Geometric Complexity Theory V: On deciding non...,['cs.CC'],This article has been withdrawn because it h...
1,704.1409,Yao HengShuai,Preconditioned Temporal Difference Learning,"['cs.LG', 'cs.AI']",This paper has been withdrawn by the author....
2,704.1829,"Stefan Felsner, Kamil Kloch, Grzegorz Matecki,...",On-line Chain Partitions of Up-growing Semi-or...,['cs.DM'],On-line chain partition is a two-player game...
3,705.0561,Jing-Chao Chen,Iterative Rounding for the Closest String Problem,"['cs.DS', 'cs.CC']",The closest string problem is an NP-hard pro...
4,705.1025,David Eppstein,Recognizing Partial Cubes in Quadratic Time,['cs.DS'],We show how to test whether a graph with n v...


In [6]:
dim=384

In [7]:
device = 'cuda'
if torch.cuda.is_available():      
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


In [8]:
docs_text = (docs_df['title'] + ' ' + docs_df['abstract']).values.tolist()
docs_text[:5]

["Geometric Complexity Theory V: On deciding nonvanishing of a generalized\n  Littlewood-Richardson coefficient   This article has been withdrawn because it has been merged with the earlier\narticle GCT3 (arXiv: CS/0501076 [cs.CC]) in the series. The merged article is\nnow available as:\n  Geometric Complexity Theory III: on deciding nonvanishing of a\nLittlewood-Richardson Coefficient, Journal of Algebraic Combinatorics, vol. 36,\nissue 1, 2012, pp. 103-110. (Authors: Ketan Mulmuley, Hari Narayanan and Milind\nSohoni)\n  The new article in this GCT5 slot in the series is:\n  Geometric Complexity Theory V: Equivalence between blackbox derandomization\nof polynomial identity testing and derandomization of Noether's Normalization\nLemma, in the Proceedings of FOCS 2012 (abstract), arXiv:1209.5993 [cs.CC]\n(full version) (Author: Ketan Mulmuley)\n",
 'Preconditioned Temporal Difference Learning   This paper has been withdrawn by the author. This draft is withdrawn for its\npoor quality in

In [9]:
def clean_text(text, remove_stopwords=True):
    text = text.lower()
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words('english'))
        text = [w for w in text if w not in stops]
        text = ' '.join(text)
        
    return text
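To see what these substitutions do in practice, the sketch below copies only the regular-expression steps of `clean_text` (the stop-word branch is left out, since the notebook calls `clean_text(doc, False)`). The name `clean_text_no_stopwords` and the sample string are made up for illustration.

```python
import re


def clean_text_no_stopwords(text):
    # Same substitutions as clean_text, without the stop-word branch.
    text = text.lower()
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    return text


sample = ("State-of-the-art results (code: https://example.com) on TempEval-2.\n"
          "Second line stays.")
cleaned = clean_text_no_stopwords(sample)
print(cleaned.split())
```

Note that the URL pattern uses `.*`, so it removes everything from the URL to the end of that line, including the words "on TempEval-2" in this sample. If that is not the intent, a pattern such as `https?://\S+` would stop at the first whitespace instead.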

In [10]:
def clean_data(data):
    cleaned_data = []
    for doc in notebook.tqdm(data):
        text = clean_text(doc, False)
        cleaned_data.append(text)
    return cleaned_data

In [12]:
cleaned_docs = clean_data(docs_text)

  0%|          | 0/484027 [00:00<?, ?it/s]

In [11]:
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=device)


In [13]:
def get_embeddings(data):
    return embedder.encode(data, convert_to_tensor=True)

In [15]:
abstract_embeddings = get_embeddings(cleaned_docs)

Batches:   0%|          | 0/15126 [00:00<?, ?it/s]
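The batch count in the progress bar can be checked with a little arithmetic. This is a back-of-the-envelope sketch that assumes `encode` is running with a batch size of 32 (its default, as far as I know) and that the embeddings are stored as float32; neither is stated in the notebook.

```python
import math

num_docs = 484027        # total shown by the cleaning progress bar
batch_size = 32          # assumption: default batch size of encode()
embedding_dim = 384      # matches dim = 384 set earlier
bytes_per_float = 4      # assumption: float32 embeddings

num_batches = math.ceil(num_docs / batch_size)
embedding_bytes = num_docs * embedding_dim * bytes_per_float

print(num_batches)                        # 15126
print(round(embedding_bytes / 1e9, 2))    # 0.74 (GB)
```

Under those assumptions the full embedding matrix is roughly 0.74 GB, which fits comfortably in the 16 GB of the GPU reported earlier.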

In [37]:
query = ['temporal expression extraction']
query_embedding = get_embeddings(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Getting the top 100 hits for reranking

In [22]:
hits = util.semantic_search(query_embedding, abstract_embeddings, top_k=100)
hits = hits[0]

In [23]:
hits[:5]

[{'corpus_id': 16069, 'score': 0.7779459953308105},
 {'corpus_id': 154036, 'score': 0.6949880719184875},
 {'corpus_id': 66896, 'score': 0.6799792051315308},
 {'corpus_id': 13600, 'score': 0.6713610291481018},
 {'corpus_id': 192765, 'score': 0.6707550287246704}]
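As far as I know, `util.semantic_search` scores with cosine similarity by default and returns, for each query, a list of `{'corpus_id', 'score'}` dicts sorted by decreasing score. A minimal pure-Python sketch of that idea follows; `cosine`, `top_k_hits` and the 2-dimensional toy vectors are illustrative names and data, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_hits(query_vec, corpus_vecs, top_k):
    # Score every corpus vector against the query, then keep the best top_k.
    scored = [{'corpus_id': i, 'score': cosine(query_vec, vec)}
              for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda hit: hit['score'], reverse=True)
    return scored[:top_k]


toy_corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_hits = top_k_hits([1.0, 0.0], toy_corpus_vecs, top_k=2)
print(toy_hits)
```

The `corpus_id` is the row index into the corpus, which is why the notebook can later look up `cleaned_docs[hit['corpus_id']]`.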

Re-ranking all the hits using cross encoder

Preparing cross encoder input

In [31]:
cross_inp = [[query[0], cleaned_docs[hit['corpus_id']]] for hit in hits]

In [32]:
cross_inp[:5]

[['temporal expression extraction',
  'temporal expression normalisation in natural language texts   automatic annotation of temporal expressions is a research challenge of great\ninterest in the field of information extraction  in this report  i describe a\nnovel rule based architecture  built on top of a pre existing system  which is\nable to normalise temporal expressions detected in english texts  gold standard\ntemporally annotated resources are limited in size and this makes research\ndifficult  the proposed system outperforms the state of the art systems with\nrespect to tempeval 2 shared task  value attribute  and achieves substantially\nbetter results with respect to the pre existing system on top of which it has\nbeen developed  i will also introduce a new free corpus consisting of 2822\nunique annotated temporal expressions  both the corpus and the system are\nfreely available on line \n'],
 ['temporal expression extraction',
  'temporal information extraction by predicting 

In [33]:
cross_scores = cross_encoder.predict(cross_inp)

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

In [34]:
cross_scores

array([  7.040183  ,   6.9775553 ,   8.60662   ,   6.6921773 ,
         4.8595524 ,   4.453336  ,   0.23782401,   5.2521605 ,
        -0.03346934,   5.7105794 ,  -2.97583   ,  -5.008688  ,
         4.5531425 ,  -1.3967086 ,  -0.59001285,   5.164037  ,
         1.280134  ,   4.7221255 ,   3.7857106 ,   6.349493  ,
        -0.31447238,   4.8841915 ,   1.5201616 ,   3.6251032 ,
         1.9316106 ,   2.231379  ,  -4.74449   ,   3.029835  ,
        -4.7363033 ,   2.6730936 ,   3.7265553 ,  -1.1291184 ,
        -5.1323185 ,   0.6796724 ,   2.254343  ,   1.5796994 ,
         4.3050575 ,  -2.2977514 ,  -2.5092812 ,  -1.3380922 ,
         3.170241  ,  -1.7978094 ,   1.0735011 ,  -0.3843995 ,
        -4.662538  ,  -5.0860114 ,   4.263301  ,   3.6782656 ,
         2.560791  ,   5.199521  ,  -0.58925045,  -6.0470433 ,
        -9.1827545 ,  -1.4201727 ,   1.0260693 ,   2.036706  ,
        -1.8110634 ,  -4.968667  ,   6.988722  ,  -4.896039  ,
        -4.125615  ,   4.179108  ,   3.663168  ,  -2.48
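The cross-encoder scores above are unbounded (they range from about -9 to 8.6 here), so they are not on the same scale as the cosine scores and only their order matters for re-ranking. If a bounded value is wanted for display, one option is to pass them through a logistic sigmoid. This is an assumption for illustration and not something the notebook does; the sketch uses the first five scores copied from the output.

```python
import math


def sigmoid(x):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# First five raw scores copied from the output above.
raw_scores = [7.040183, 6.9775553, 8.60662, 6.6921773, 4.8595524]
squashed = [sigmoid(s) for s in raw_scores]

# Sigmoid is monotonic, so the ranking is the same before and after squashing.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
order_squashed = sorted(range(len(squashed)), key=lambda i: squashed[i], reverse=True)
print(order_raw == order_squashed)
```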

In [35]:
for idx in range(len(cross_scores)):
    hits[idx]['cross-score'] = cross_scores[idx]
    
cross_score_sorted_hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
cross_score_sorted_hits[:5]

[{'corpus_id': 66896, 'score': 0.6799792051315308, 'cross-score': 8.60662},
 {'corpus_id': 16069, 'score': 0.7779459953308105, 'cross-score': 7.040183},
 {'corpus_id': 400663, 'score': 0.5173180103302002, 'cross-score': 6.988722},
 {'corpus_id': 154036, 'score': 0.6949880719184875, 'cross-score': 6.9775553},
 {'corpus_id': 13600, 'score': 0.6713610291481018, 'cross-score': 6.6921773}]

In [36]:
hits[:5]

[{'corpus_id': 16069, 'score': 0.7779459953308105, 'cross-score': 7.040183},
 {'corpus_id': 154036, 'score': 0.6949880719184875, 'cross-score': 6.9775553},
 {'corpus_id': 66896, 'score': 0.6799792051315308, 'cross-score': 8.60662},
 {'corpus_id': 13600, 'score': 0.6713610291481018, 'cross-score': 6.6921773},
 {'corpus_id': 192765, 'score': 0.6707550287246704, 'cross-score': 4.8595524}]
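The two outputs above differ only in order because `sorted` returns a new list and leaves `hits` untouched, while the `'cross-score'` key was written into the dicts themselves, which both lists share. A small sketch with made-up ids and scores:

```python
# Toy hits with made-up ids and scores, in bi-encoder order.
toy_hits = [{'corpus_id': 10, 'score': 0.9},
            {'corpus_id': 20, 'score': 0.8},
            {'corpus_id': 30, 'score': 0.7}]
toy_cross_scores = [1.5, 4.0, -2.0]

for hit, cross_score in zip(toy_hits, toy_cross_scores):
    hit['cross-score'] = cross_score   # mutates the dict in place

reranked = sorted(toy_hits, key=lambda h: h['cross-score'], reverse=True)

print([h['corpus_id'] for h in toy_hits])   # bi-encoder order is preserved
print([h['corpus_id'] for h in reranked])   # cross-encoder order
print(reranked[0] is toy_hits[1])           # same dict object in both lists
```

Keeping both lists around is what makes the before/after comparison possible without a second search.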

As the results show, the bi-encoder already does fairly well on its own: 4 of the top 5 results are shared between the bi-encoder and cross-encoder rankings, although in a different order.
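The overlap can be checked directly from the `corpus_id` values printed in the two outputs above. The variable names below are only for this sketch.

```python
bi_top5 = [16069, 154036, 66896, 13600, 192765]      # from hits[:5]
cross_top5 = [66896, 16069, 400663, 154036, 13600]   # from cross_score_sorted_hits[:5]

overlap = set(bi_top5) & set(cross_top5)
print(len(overlap))

# Rank change for each shared document (positive = moved up after re-ranking).
rank_change = {doc_id: bi_top5.index(doc_id) - cross_top5.index(doc_id)
               for doc_id in bi_top5 if doc_id in overlap}
print(rank_change)

# Documents that entered the top 5 only after re-ranking.
promoted = set(cross_top5) - set(bi_top5)
print(promoted)
```

The one newcomer, `400663`, has a bi-encoder score of only about 0.52, so it was pulled up from further down the 100 retrieved candidates.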

Let's write a function that combines all the previous steps to search the abstract corpus for a query.

In [56]:
def get_top_hits(query):
    query_embedding = get_embeddings([query])
    hits = util.semantic_search(query_embedding, abstract_embeddings, top_k=100)[0]
    
    cross_inp = [[query, cleaned_docs[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)
    
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]
    
    cross_score_sorted_hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    
    ## print the score, abstract and title of the top 2 hits before and after re-ranking for comparison
    for idx in range(2):
        print('Hit {} before ranking - '.format(idx+1))
        print('\nScore - ', hits[idx]['score'])
        print('\nAbstract - ', docs_df.iloc[hits[idx]['corpus_id']].abstract)
        print('\nTitle - ', docs_df.iloc[hits[idx]['corpus_id']].title)
        
        print('\n\nHit {} after ranking - '.format(idx+1))
        print('\nCross encoder Score - ', cross_score_sorted_hits[idx]['cross-score'])
        print('\nAbstract - ', docs_df.iloc[cross_score_sorted_hits[idx]['corpus_id']].abstract)
        print('\nTitle - ', docs_df.iloc[cross_score_sorted_hits[idx]['corpus_id']].title)

In [57]:
get_top_hits('what is temporal expression extraction?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Hit 1 before ranking - 

Score -  0.6939192414283752

Abstract -    Automatic annotation of temporal expressions is a research challenge of great
interest in the field of information extraction. In this report, I describe a
novel rule-based architecture, built on top of a pre-existing system, which is
able to normalise temporal expressions detected in English texts. Gold standard
temporally-annotated resources are limited in size and this makes research
difficult. The proposed system outperforms the state-of-the-art systems with
respect to TempEval-2 Shared Task (value attribute) and achieves substantially
better results with respect to the pre-existing system on top of which it has
been developed. I will also introduce a new free corpus consisting of 2822
unique annotated temporal expressions. Both the corpus and the system are
freely available on-line.


Title -  Temporal expression normalisation in natural language texts


Hit 1 after ranking - 

Cross encoder Score -  7.6009026

Ab

In [58]:
get_top_hits('what is cross entropy loss?')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Hit 1 before ranking - 

Score -  0.5716948509216309

Abstract -    Minimizing cross-entropy is a widely used method for training artificial
neural networks. Many training procedures based on backpropagation use
cross-entropy directly as their loss function. Instead, this theoretical essay
investigates a dual process model with two processes, in which one process
minimizes the Kullback-Leibler divergence while its dual counterpart minimizes
the Shannon entropy. Postulating that learning consists of two dual processes
complementing each other, the model defines an equilibrium state for both
processes in which the loss function assumes its minimum. An advantage of the
proposed model is that it allows deriving the optimal learning rate and
momentum weight to update network weights for backpropagation. Furthermore, the
model introduces the golden ratio and complex numbers as important new concepts
in machine learning.


Title -  A Dual Process Model for Optimizing Cross Entropy in Neural N

For the first query, the responses are fairly even before and after re-ranking. For the second query, we get much better and more relevant responses after re-ranking.

In [59]:
get_top_hits('rebalancing in kafka')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Hit 1 before ranking - 

Score -  0.5709049701690674

Abstract -    Apache Kafka addresses the general problem of delivering extreme high volume
event data to diverse consumers via a publish-subscribe messaging system. It
uses partitions to scale a topic across many brokers for producers to write
data in parallel, and also to facilitate parallel reading of consumers. Even
though Apache Kafka provides some out of the box optimizations, it does not
strictly define how each topic shall be efficiently distributed into
partitions. The well-formulated fine-tuning that is needed in order to improve
an Apache Kafka cluster performance is still an open research problem. In this
paper, we first model the Apache Kafka topic partitioning process for a given
topic. Then, given the set of brokers, constraints and application requirements
on throughput, OS load, replication latency and unavailability, we formulate
the optimization problem of finding how many partitions are needed and show
that it is 

For the above query, both systems returned the same abstracts but in a different order. Re-ranking moved the more relevant abstract above the other one, whereas before re-ranking the more relevant abstract was second.