# Semantic Search: Sentence Transformers and Elasticsearch


### Table of Contents

* [I. Requirements](#I)
    * [I.1. Installations](#I1)
    * [I.2. Packages](#I2)
    * [I.3. Datasets](#I3)

* [II. Sentence Transformers](#II)
    * [II.1. Model](#II1)
    * [II.2. Search with Transformer](#II2)

* [III. Elasticsearch](#III)
    * [III.1. Insert data](#III1)
    * [III.2. Search with Elastic](#III2)


### I. Requirements <a id="I"></a>

#### I.1. Installations <a id="I1"></a>

In [1]:
# Install sentence-transformers
!pip install sentence-transformers

# Install elasticsearch 
!pip install elasticsearch



#### I.2. Packages <a id="I2"></a>

In [2]:
import json
import os

from IPython.display import display, HTML

from sentence_transformers import SentenceTransformer, util

from elasticsearch import Elasticsearch
from elasticsearch import helpers

#### I.3. Datasets <a id="I3"></a>

As the corpus, we use all EMNLP publications from 2016 to 2018.


In [3]:
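def get_papers_dataset(url_dataset="https://sbert.net/datasets/emnlp2016-2018.json", dataset="data/emnlp2016-2018.json"):
    # Download the dataset only once; later runs reuse the local copy
    if not os.path.exists(dataset):
        util.http_get(url_dataset, dataset)

    # Each entry is a dict with the keys title, abstract, url, venue and year
    with open(dataset) as f:
        papers = json.load(f)
    return papers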

In [4]:
papers = get_papers_dataset()
print(json.dumps(papers[0], indent=4))

{
    "title": "Rule Extraction for Tree-to-Tree Transducers by Cost Minimization",
    "abstract": "Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restrictions encountered, without involving the use of a large set of complex rules difficult to analyze. We here show that these representations can be made very compact, indicate how to perform the corresponding minimization, and point out interesting linguistic side-effects of this operation.",
    "url": "http://aclweb.org/anthology/D16-1002",
    "venue": "EMNLP",
    "year": "2016"
}


In [5]:
# To encode the papers, we combine each title and abstract into a single string
papers_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]

### II. Sentence Transformers <a id="II"></a>

Sentence-BERT (SBERT) is a modification of BERT (Bidirectional Encoder Representations from Transformers) that produces sentence embeddings whose similarity reflects the semantic similarity between sentences. Thus, unlike traditional search, which only finds documents through keyword matching, the SBERT approach can also retrieve documents that use synonyms or are otherwise semantically related.
<center><img src="image/sbert.png" width="800" height="400"/></center>
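
Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below illustrates the idea on made-up 3-dimensional vectors; the vectors and the `cosine_similarity` helper are assumptions for illustration and are not produced by SBERT.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings", invented for illustration (not model output).
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

sim_synonyms = cosine_similarity(car, automobile)
sim_unrelated = cosine_similarity(car, banana)

# Vectors pointing in a similar direction score higher than unrelated ones.
print(sim_synonyms > sim_unrelated)
```

With a real model, the vectors would come from `model.encode(...)`, and two texts with no words in common can still end up close to each other.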

Resources:

- [Sentence Transformers documentation](https://pypi.org/project/sentence-transformers/)
- [SBERT models](https://www.sbert.net/docs/pretrained_models.html#model-overview)
- [Arxiv SBERT](https://arxiv.org/pdf/1908.10084.pdf)
- [Arxiv BERT](https://arxiv.org/pdf/1810.04805.pdf)

#### II.1. Model <a id="II1"></a>

In [6]:
# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings for all papers
corpus_embeddings = model.encode(papers_texts, convert_to_tensor=True)

#### II.2. Search with Transformer <a id="II2"></a>

In [7]:
def search_papers(title, search_hits, papers=papers):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 800px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 5px;
        font: 13px Oswald;
    }
    </style>
    """

    html += "<h3></h3><table><thead><tr><th>Score</th><th>Title</th><th>Venue</th><th>Year</th></tr></thead>"
    for hit in search_hits:
        related_papers = papers[hit['corpus_id']]
        html += "<tr><td>%.4f</td><td>%s</td><td>%s</td><td>%s</td></tr>" % (hit['score'], related_papers['title'], related_papers['venue'], related_papers['year'])
    html += "</table>"
    
    print("Paper:", title)
    
    display(HTML(html))

In [8]:
title='Digital Voicing of Silent Speech'
abstract='In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.'

paper_embedding = model.encode(title+'[SEP]'+ abstract, convert_to_tensor=True)
search_hits = util.semantic_search(paper_embedding, corpus_embeddings, top_k=10)

search_papers(title, search_hits[0])

Paper: Digital Voicing of Silent Speech


| Score | Title | Venue | Year |
|---|---|---|---|
| 0.4329 | Nonparametric Bayesian Models for Spoken Language Understanding | EMNLP | 2016 |
| 0.4313 | Speech segmentation with a neural encoder model of working memory | EMNLP | 2017 |
| 0.4186 | Using Context Information for Dialog Act Classification in DNN Framework | EMNLP | 2017 |
| 0.4144 | Session-level Language Modeling for Conversational Speech | EMNLP | 2018 |
| 0.4136 | Charmanteau: Character Embedding Models For Portmanteau Creation | EMNLP | 2017 |
| 0.396 | ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection | EMNLP | 2018 |
| 0.3958 | A Co-Attention Neural Network Model for Emotion Cause Analysis with Emotional Context Awareness | EMNLP | 2018 |
| 0.3893 | Learning a Lexicon and Translation Model from Phoneme Lattices | EMNLP | 2016 |
| 0.3808 | Reasoning about Pragmatics with Neural Listeners and Speakers | EMNLP | 2016 |
| 0.3786 | Supervised Domain Enablement Attention for Personalized Domain Classification | EMNLP | 2018 |
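
To make the ranking step more concrete, here is a minimal pure-Python sketch of a brute-force top-k search by cosine similarity, which is conceptually what `util.semantic_search` does with its default scoring. The helper names `cosine` and `top_k_search` and the toy vectors are assumptions for illustration; the library itself works on tensors and processes them in chunks.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_search(query_vec, corpus_vecs, top_k=10):
    """Score every corpus vector against the query and keep the k best hits."""
    scored = (
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    )
    # nlargest returns the hits sorted from highest to lowest score
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])

# Toy 2-dimensional corpus, invented for illustration
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k_search([1.0, 0.1], toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in hits])
```

Each hit uses the same `corpus_id` and `score` keys that `search_papers` reads above, so the structure of the result should look familiar.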


### III. Elasticsearch <a id="III"></a>

Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases [[elastic.co](https://www.elastic.co/elasticsearch/)]. Elasticsearch makes it possible to index dense vectors and use them for document scoring.

An advantage of Elasticsearch is that it is easy to add new documents to an index and that we can also store other data along with our vectors. A disadvantage is the slow performance, as it compares the query embedding with all stored embeddings. This has a linear run-time and might be too slow for large (>100k) corpora [[elasticsearch semantic search](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search#elasticsearch)].


Steps for running this part:
- [Download elasticsearch](https://www.elastic.co/fr/downloads/elasticsearch)
- In your terminal, go to the Elasticsearch folder and run ``bin/elasticsearch``

Resources:

- [Elasticsearch](https://pypi.org/project/elasticsearch/)

#### III.1. Insert data <a id="III1"></a>

In [9]:
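# Connect to the local node started above (assumes the default address http://localhost:9200)
es = Elasticsearch("http://localhost:9200")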

es.info()

{'name': 'LAPTOP-GV57OKPL',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': '3Jcz9bwdTYCp20csRb0JTQ',
 'version': {'number': '7.9.0',
  'build_flavor': 'default',
  'build_type': 'zip',
  'build_hash': 'a479a2a7fce0389512d6a9361301708b92dff667',
  'build_date': '2020-08-11T21:36:48.204330Z',
  'build_snapshot': False,
  'lucene_version': '8.6.0',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

In [10]:
mappings = {
    "properties": {
        "paper": {
            "type": "text"
        },
        "paper_vector": {
            "type": "dense_vector",
            "dims": 384
        }
    }
}

es.indices.create(index='papers', mappings=mappings, ignore=[400])

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'papers'}
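
The `dims` value of a `dense_vector` field has to match the size of the embeddings that get indexed. A small sketch of building the mapping from a given dimension follows; the helper name `build_paper_mappings` is hypothetical. In the notebook, the dimension could be read from the model with `model.get_sentence_embedding_dimension()` instead of being hard-coded.

```python
def build_paper_mappings(dims):
    """Return an index mapping whose vector field expects `dims` values."""
    return {
        "properties": {
            "paper": {"type": "text"},
            "paper_vector": {"type": "dense_vector", "dims": dims},
        }
    }

# 384 is the value used in the mapping above; with a different model,
# pass that model's embedding size instead.
mappings = build_paper_mappings(384)
print(mappings["properties"]["paper_vector"])
```

Deriving the number from the model avoids a mismatch if the model is later swapped for one with a different embedding size.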

In [11]:
rows = 0
requests = []

for _id, (paper, embedding) in enumerate(zip(papers, corpus_embeddings)):
    requests.append({"_op_type": "index",
                "_index": "papers",
                "_id": _id,
                "_source": {
                    "paper": paper["title"],
                    "venue": paper["venue"],
                    "year": paper["year"],
                    # Plain list of floats; tolist() also works if the tensor lives on a GPU
                    "paper_vector": embedding.tolist()
                    }
                })
    rows += 1
helpers.bulk(es, requests)

print("Total papers inserted: {}".format(rows))

Total papers inserted: 974
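
Since `helpers.bulk` accepts any iterable of actions, the actions could also be produced lazily by a generator instead of being collected in a list first. The sketch below shows one possible shape; the function name `generate_actions` and the toy inputs are assumptions for illustration, and no Elasticsearch connection is needed to run it.

```python
def generate_actions(papers, embeddings, index_name="papers"):
    """Yield one bulk 'index' action per (paper, embedding) pair."""
    for _id, (paper, embedding) in enumerate(zip(papers, embeddings)):
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": _id,
            "_source": {
                "paper": paper["title"],
                "venue": paper["venue"],
                "year": paper["year"],
                # Convert each component to a plain Python float
                "paper_vector": [float(x) for x in embedding],
            },
        }

# Toy inputs, invented for illustration
toy_papers = [{"title": "A toy paper", "venue": "EMNLP", "year": "2016"}]
toy_embeddings = [[0.1, 0.2, 0.3]]
actions = list(generate_actions(toy_papers, toy_embeddings))
print(actions[0]["_id"], actions[0]["_source"]["paper"])
```

In the notebook this could be used as `helpers.bulk(es, generate_actions(papers, corpus_embeddings))`. The call returns a tuple whose first element is the number of successfully executed actions, which could replace the manual `rows` counter.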


In [12]:
def get_papers_elastic(query_text, size=10):
    # Encode the query with the same model that encoded the corpus
    query_embedding = model.encode(query_text)
    query = {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                # Adding 1.0 keeps the score non-negative (cosine similarity lies in [-1, 1])
                "source": "cosineSimilarity(params.queryVector, 'paper_vector') + 1.0",
                "params": {
                    "queryVector": query_embedding.tolist()
                }
            }
        }
    }

    results = es.search(index="papers", query=query, size=size)["hits"]["hits"]
    # Subtract the 1.0 offset again to report the plain cosine similarity
    results = [{"score": (result["_score"] - 1.0), "paper": result["_source"]["paper"],  "venue": result["_source"]["venue"], "year": result["_source"]["year"]} for result in results]

    return results
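
The `+ 1.0` in the script and the `- 1.0` applied to the results go together: Elasticsearch does not accept negative scores from `script_score`, while cosine similarity ranges from -1 to 1. A minimal sketch of that round trip, with hypothetical helper names, could look like this:

```python
def to_elastic_score(cosine_similarity):
    """Shift a cosine similarity from [-1, 1] into the non-negative range [0, 2]."""
    return cosine_similarity + 1.0

def from_elastic_score(elastic_score):
    """Undo the shift to recover the original cosine similarity."""
    return elastic_score - 1.0

# Even the lowest possible similarity maps to a valid, non-negative score
lowest = to_elastic_score(-1.0)
recovered = from_elastic_score(to_elastic_score(0.25))
print(lowest, recovered)
```

Because the offset is a constant, it does not change the ranking of the hits; it only changes the reported value.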

In [13]:
def display_papers(query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 800px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 5px;
        font: 13px Oswald;
    }
    </style>
    """

    html += "<h3></h3><table><thead><tr><th>Score</th><th>Title</th><th>Venue</th><th>Year</th></tr></thead>"
    for result in rows:
        html += "<tr><td>%.4f</td><td>%s</td><td>%s</td><td>%s</td></tr>" % (result["score"], result["paper"], result["venue"], result["year"])
    html += "</table>"
    
    print("Paper:", query)
    
    display(HTML(html))

#### III.2. Search with Elastic <a id="III2"></a>

In [14]:
title='Digital Voicing of Silent Speech'
abstract='In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.'
rows = get_papers_elastic(title+'[SEP]'+ abstract)
display_papers(title, rows)

Paper: Digital Voicing of Silent Speech


| Score | Title | Venue | Year |
|---|---|---|---|
| 0.4329 | Nonparametric Bayesian Models for Spoken Language Understanding | EMNLP | 2016 |
| 0.4313 | Speech segmentation with a neural encoder model of working memory | EMNLP | 2017 |
| 0.4186 | Using Context Information for Dialog Act Classification in DNN Framework | EMNLP | 2017 |
| 0.4144 | Session-level Language Modeling for Conversational Speech | EMNLP | 2018 |
| 0.4136 | Charmanteau: Character Embedding Models For Portmanteau Creation | EMNLP | 2017 |
| 0.396 | ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection | EMNLP | 2018 |
| 0.3893 | Learning a Lexicon and Translation Model from Phoneme Lattices | EMNLP | 2016 |
| 0.3808 | Reasoning about Pragmatics with Neural Listeners and Speakers | EMNLP | 2016 |
| 0.3786 | Supervised Domain Enablement Attention for Personalized Domain Classification | EMNLP | 2018 |
| 0.3718 | Estimating Marginal Probabilities of n-grams for Recurrent Neural Language Models | EMNLP | 2018 |
