# Sentence Embeddings using Siamese BERT-Networks

Based on the code from: 
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_nli_bert.py
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_clustering.py
- FAISS: https://github.com/bhavsarpratik/transformers/blob/master/5.%20semantic-search-USE.ipynb

This is a simple test application for sentence embeddings: semantic search in Norwegian.

We have a corpus with various sentences. Then, for a given query sentence, we want to find the most similar sentence in this corpus.

For each query, this notebook prints the top 5 most similar sentences in the corpus.
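
The retrieval step itself can be sketched without any model. This is only an illustration: the three-dimensional vectors are made up (real sentence embeddings have hundreds of dimensions), and `top_k` is a hypothetical helper, not part of any library.

```python
import numpy as np

def top_k(corpus_vectors: np.ndarray, query_vector: np.ndarray, k: int = 5):
    """Return (index, squared L2 distance) pairs for the k closest corpus vectors."""
    diffs = corpus_vectors - query_vector      # broadcast the query over all rows
    distances = np.sum(diffs ** 2, axis=1)     # squared Euclidean distance per row
    order = np.argsort(distances)[:k]          # smallest distance first
    return [(int(i), float(distances[i])) for i in order]

# Made-up "embeddings" for three corpus sentences
corpus = ["a painting of a bird", "a novel about a painting", "the weather is nice"]
corpus_vectors = np.array([[0.9, 0.1, 0.0],
                           [0.6, 0.5, 0.1],
                           [0.0, 0.1, 0.9]], dtype=np.float32)
query_vector = np.array([1.0, 0.0, 0.0], dtype=np.float32)

for i, dist in top_k(corpus_vectors, query_vector, k=2):
    print(f"{corpus[i]}, {dist:.2f}")
```

An exhaustive scan like this is fine for a handful of sentences; a dedicated index becomes useful once the corpus grows.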


In [1]:
import numpy as np

# FAISS search

In [2]:
import faiss

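class FAISS:
    """Thin wrapper around a flat (exhaustive) FAISS index with L2 distance."""
    def __init__(self, dimensions:int):
        self.dimensions = dimensions
        self.index = faiss.IndexFlatL2(dimensions)
        self.vectors = {}  # position in the index -> (text, vector)
        self.counter = 0
    
    def add(self, text:str, v:np.ndarray):
        """v: float32 array of shape (1, dimensions)"""
        self.index.add(v)
        self.vectors[self.counter] = (text, v)
        self.counter += 1
        
    def search(self, v:np.ndarray, k:int=10):
        """Print the k nearest texts. IndexFlatL2 reports squared L2 distances."""
        distance, item_index = self.index.search(v, k)
        for dist, i in zip(distance[0], item_index[0]):
            if i == -1:  # fewer than k vectors in the index
                break
            print(f'{self.vectors[i][0]}, {dist:.2f}')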

In [12]:
from sentence_transformers import SentenceTransformer, models

class SENTENCE_TRANSFORMERS():
    """Sentence Transformers models
    For semantic search these are recommended:
     - https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/sts-models.md
     
     Input:
        model_name: name of the model. See the list.
        device: cpu,cuda
    """
    def __init__(self, model_name='distilbert-base-nli-stsb-mean-tokens', device='cpu'):
        self.embeddings = SentenceTransformer(model_name, device=device)
        
    def encode(self, text:list):
        result = self.embeddings.encode(text)
        return np.asarray(result, dtype=np.float32)


class DistilBERT():
    """DistilBERT models. These come from HuggingFace transformers.
    The assumption is that these produce somewhat worse sentence embeddings for semantic search,
    since they are not fine-tuned for sentence similarity.
    
     Input:
        model_name: name of the model. See the list.
        device: cpu,cuda
    """
    def __init__(self, model_name='distilbert-base-multilingual-cased', use_mean_pooling=True, use_cls_pooling=False, use_max_pooling=False, device='cpu'):
        model = models.DistilBERT(model_name)
        pooling =  models.Pooling(model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=use_mean_pooling,
                               pooling_mode_cls_token=use_cls_pooling,
                               pooling_mode_max_tokens=use_max_pooling)
        self.embeddings = SentenceTransformer(modules=[model, pooling], device=device) 
        
    def encode(self, text:list):
        result = self.embeddings.encode(text)
        return np.asarray(result, dtype=np.float32)
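
The three pooling flags above decide how per-token vectors are collapsed into one sentence vector. The following is a conceptual sketch in plain NumPy, not the library's implementation; the token embeddings and the attention mask are made-up values.

```python
import numpy as np

# Made-up token embeddings for one sentence: 4 token positions, 3 dimensions.
# The last position is padding and must be ignored when pooling.
token_embeddings = np.array([[1.0, 0.0, 2.0],   # first token ([CLS])
                             [3.0, 2.0, 0.0],
                             [2.0, 4.0, 1.0],
                             [9.0, 9.0, 9.0]])  # padding
attention_mask = np.array([1, 1, 1, 0])

real_tokens = token_embeddings[attention_mask == 1]

cls_pooled = token_embeddings[0]         # CLS pooling: first token only
mean_pooled = real_tokens.mean(axis=0)   # mean pooling: average over real tokens
max_pooled = real_tokens.max(axis=0)     # max pooling: element-wise maximum

print(cls_pooled, mean_pooled, max_pooled)
```

Whichever mode is used, the result has the same width as a single token vector, which is why the index dimension does not depend on sentence length.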

In [13]:
from tqdm import tqdm

class SemanticSearch():
    def __init__(self, encoder, dimension):
        self.encoder = encoder
        self.index = FAISS(dimension)
            
    def ingest(self, text:list):
        """text: a list of strings"""
        for t in tqdm(text):
            emb = self.encoder.encode([t])
            self.index.add(t, emb)
    
    def search(self, query, top:int=10):
        emb = self.encoder.encode([query])
        return self.index.search(emb, top)

# Elasticsearch setup

In [27]:
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", id=1, body=doc)
print(res['result'])

res = es.get(index="test-index", id=1)
print(res['_source'])

es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])

created
{'author': 'kimchy', 'text': 'Elasticsearch: cool. bonsai cool.', 'timestamp': '2020-02-21T13:44:28.654084'}
Got 1 Hits:
2020-02-21T13:44:28.654084 kimchy: Elasticsearch: cool. bonsai cool.


# Test Vector Search

In [14]:
encoder = SENTENCE_TRANSFORMERS()

100%|██████████| 245M/245M [00:30<00:00, 7.92MB/s] 


In [17]:
dimension = encoder.encode(['hello']).shape[1]  # encode returns an array of shape (n_sentences, dim)
print(dimension)

768


In [18]:
index = FAISS(dimension)

# index word
t1 = 'hello'
v1 = encoder.encode([t1])
index.add(t1, v1)

# index word
t1 = 'bye'
v1 = encoder.encode([t1])
index.add(t1, v1)

# search similar word
t1 = 'hi'
v1 = encoder.encode([t1])
print('word,  distance')
index.search(v1)

word,  distance
hello, 127.27
bye, 348.01
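
`faiss.IndexFlatL2` reports squared Euclidean distances, so the values printed above are unbounded and depend on the length of the embedding vectors. One common option, sketched here under the assumption that embeddings are normalized to unit length before indexing, is to read them as cosine similarity through the identity `d^2 = 2 - 2 * cos`. The vectors below are random stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768).astype(np.float32)  # stand-in for one embedding
b = rng.normal(size=768).astype(np.float32)  # stand-in for another

# Normalize both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

squared_l2 = float(np.sum((a_unit - b_unit) ** 2))
cosine = float(np.dot(a_unit, b_unit))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_l2, 2 - 2 * cosine, atol=1e-5))
```

With unit-length vectors the ranking by L2 distance and the ranking by cosine similarity are the same, and the distances fall between 0 and 4.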


# Test with real data 

## Test with News

### https://en.wikipedia.org/wiki/The_Goldfinch_(painting)

In [16]:
article = "The Goldfinch (Dutch: Het puttertje) is a painting by the Dutch Golden Age artist Carel Fabritius of a life-size chained goldfinch. Signed and dated 1654, it is now in the collection of the Mauritshuis in The Hague, Netherlands. The work is a trompe-l'œil oil on panel measuring 33.5 by 22.8 centimetres (13.2 in × 9.0 in) that was once part of a larger structure, perhaps a window jamb or a protective cover. It is possible that the painting was in its creator's workshop in Delft at the time of the gunpowder explosion that killed him and destroyed much of the city. A common and colourful bird with a pleasant song, the goldfinch was a popular pet, and could be taught simple tricks including lifting a thimble-sized bucket of water. It was reputedly a bringer of good health, and was used in Italian Renaissance painting as a symbol of Christian redemption and the Passion of Jesus. The Goldfinch is unusual for the Dutch Golden Age painting period in the simplicity of its composition and use of illusionary techniques. Following the death of its creator, it was lost for more than two centuries before its rediscovery in Brussels. It plays a central role in the Pulitzer Prize-winning novel The Goldfinch by Donna Tartt and its film adaptation."
lines = article.split('. ')
print(lines)

['The Goldfinch (Dutch: Het puttertje) is a painting by the Dutch Golden Age artist Carel Fabritius of a life-size chained goldfinch', 'Signed and dated 1654, it is now in the collection of the Mauritshuis in The Hague, Netherlands', "The work is a trompe-l'œil oil on panel measuring 33.5 by 22.8 centimetres (13.2 in × 9.0 in) that was once part of a larger structure, perhaps a window jamb or a protective cover", "It is possible that the painting was in its creator's workshop in Delft at the time of the gunpowder explosion that killed him and destroyed much of the city", 'A common and colourful bird with a pleasant song, the goldfinch was a popular pet, and could be taught simple tricks including lifting a thimble-sized bucket of water', 'It was reputedly a bringer of good health, and was used in Italian Renaissance painting as a symbol of Christian redemption and the Passion of Jesus', 'The Goldfinch is unusual for the Dutch Golden Age painting period in the simplicity of its composit
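
Splitting on `'. '` leaves the final period attached to the last sentence, which shows up in the search output further down as `adaptation.,`. A slightly more careful splitter can be sketched with the standard `re` module. `split_sentences` is a hypothetical helper and still a heuristic (it will split after abbreviations, for example), not a real sentence tokenizer.

```python
import re

def split_sentences(text: str) -> list:
    """Split after . ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    # Drop trailing periods so every sentence is formatted the same way
    return [p.rstrip('.') for p in parts if p]

sample = ("Signed and dated 1654, it is now in The Hague. "
          "The work measures 33.5 by 22.8 centimetres. "
          "It was lost for two centuries.")
print(split_sentences(sample))
```

The decimal points in `33.5` and `22.8` are left alone because they are not followed by whitespace.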

In [19]:
encoder_1 = SENTENCE_TRANSFORMERS()
dimension_1 = encoder_1.encode(['hello']).shape[1]
ss_1 = SemanticSearch(encoder_1,dimension_1)

In [20]:
ss_1.ingest(lines)

100%|██████████| 9/9 [00:00<00:00, 25.27it/s]


In [21]:
ss_1.search('What is the Goldfinch', top=5)

The Goldfinch (Dutch: Het puttertje) is a painting by the Dutch Golden Age artist Carel Fabritius of a life-size chained goldfinch, 168.03
The Goldfinch is unusual for the Dutch Golden Age painting period in the simplicity of its composition and use of illusionary techniques, 244.26
It plays a central role in the Pulitzer Prize-winning novel The Goldfinch by Donna Tartt and its film adaptation., 263.63
A common and colourful bird with a pleasant song, the goldfinch was a popular pet, and could be taught simple tricks including lifting a thimble-sized bucket of water, 351.30
The work is a trompe-l'œil oil on panel measuring 33.5 by 22.8 centimetres (13.2 in × 9.0 in) that was once part of a larger structure, perhaps a window jamb or a protective cover, 369.19


In [28]:
encoder_2 = DistilBERT()
dimension_2 = encoder_2.encode(['hello']).shape[1]
ss_2 = SemanticSearch(encoder_2,dimension_2)

In [29]:
ss_2.ingest(lines)

100%|██████████| 9/9 [00:00<00:00, 12.78it/s]


In [30]:
ss_2.search('What is the Goldfinch', top=5)

The Goldfinch (Dutch: Het puttertje) is a painting by the Dutch Golden Age artist Carel Fabritius of a life-size chained goldfinch, 39.16
The Goldfinch is unusual for the Dutch Golden Age painting period in the simplicity of its composition and use of illusionary techniques, 42.99
A common and colourful bird with a pleasant song, the goldfinch was a popular pet, and could be taught simple tricks including lifting a thimble-sized bucket of water, 43.97
It is possible that the painting was in its creator's workshop in Delft at the time of the gunpowder explosion that killed him and destroyed much of the city, 52.11
It plays a central role in the Pulitzer Prize-winning novel The Goldfinch by Donna Tartt and its film adaptation., 52.82
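
The two encoders agree on the first two hits and diverge further down. The raw distances are not comparable across models, since each model has its own embedding space, so a rough way to compare the results is by set overlap of the two top-5 lists. In this sketch the lists hold the positions of the returned sentences in `lines`, transcribed by hand from the outputs above; `jaccard` is a hypothetical helper.

```python
def jaccard(a, b) -> float:
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Positions in `lines` of the top-5 hits, transcribed from the outputs above
top5_sentence_transformers = [0, 6, 8, 4, 2]
top5_distilbert = [0, 6, 4, 3, 8]

shared = sorted(set(top5_sentence_transformers) & set(top5_distilbert))
print(shared)
print(round(jaccard(top5_sentence_transformers, top5_distilbert), 2))
```

Four of the five hits are shared, giving a Jaccard overlap of 0.67. A single query on nine sentences is far too little to rank the encoders against each other.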
