In this tutorial, we look at how the AquilaDB vector database can support efficient semantic retrieval with Google's Universal Sentence Encoder (USE).

This tutorial follows the same idea described [in the latest post on Universal Sentence Encoder on the Google AI blog](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html). One difference is that we use a different model from the one mentioned in the blog post: `universal-sentence-encoder-large`, which belongs to the same `USE` family. We encourage you to read that post before proceeding, because it gives useful context for what we do below.

The image below is taken from that blog post and shows a recommended pipeline for textual similarity. `AquilaDB` covers the `pre-encoded Candidates` data store and the `ANN search` modules in this pipeline. Cool, right?

![A prototypical semantic retrieval pipeline, used for textual similarity](https://1.bp.blogspot.com/-q1g13xLR-9E/XSi8ZewIXzI/AAAAAAAAETQ/Oek9K51ZrAQvbZL3t3rme5HcegzCNm98QCEwYBhgL/s640/image1.png)

In [1]:
# Let's import required modules

import tensorflow as tf
import tensorflow_hub as hub

### Load pretrained encoder
We need to load the pretrained USE model from TensorFlow Hub. We use this model to encode our sentences before sending them to AquilaDB for indexing and querying.

In [2]:
use_module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"

# load the Universal Sentence Encoder module from TensorFlow Hub
embed_module = hub.Module(use_module_url)

Let's test our loaded model with some random texts before proceeding.

In [3]:
# let's create some test sentences
test_messages = ["AquilaDB is a Resillient, Replicated, Decentralized, Host neutral storage for Feature Vectors along with Document Metadata.", 
            "Do k-NN retrieval from anywhere, even from the darkest rifts of Aquila (in progress). It is easy to setup and scales as the universe expands."]

We feed our text array to the model to get embeddings. We wrap the embedding logic in a function so that we can reuse it.

In [4]:
# helper function to generate embeddings for an input array of sentences
def generate_embeddings (messages_in):
    # generate embeddings
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        message_embeddings = session.run(embed_module(messages_in))
        
    return message_embeddings

In [5]:
# print generated embeddings
print(generate_embeddings(test_messages))

[[-0.00570544  0.01024008  0.04416275 ...  0.03282805 -0.01723128
   0.00956334]
 [ 0.0124177   0.09862255  0.06958324 ... -0.00700251  0.02332876
  -0.09377097]]


As you can see above, we were able to encode our random texts into the corresponding sentence embeddings with `USE`.
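
To give a feel for what these embeddings are used for, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are toy stand-ins for real `USE` output, and the helper name `cosine_similarity` is chosen for this sketch only. Whether AquilaDB uses this exact metric internally is not stated in this tutorial.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for sentence embeddings (illustrative values only)
vec_a = [1.0, 0.0, 0.0]
vec_b = [1.0, 1.0, 0.0]
vec_c = [0.0, 0.0, 1.0]

# vec_b shares a direction with vec_a, vec_c is orthogonal to it
print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

A higher score means the two vectors point in a more similar direction, which is the intuition behind retrieving "semantically similar" sentences.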

### Let's load actual data
We will be loading some text from a text file. This is a small wiki article set in plain text format.

In [7]:
with open('article_set.txt', 'r') as file_in:
    lines = file_in.readlines()

Let's write some helper functions that will help us communicate with AquilaDB. You don't have to worry about this part now. Keep it as is, except for the IP address `192.168.1.100`: replace it with the address of your AquilaDB installation. If AquilaDB runs on the machine you are using now, use `localhost`.
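
If you would rather not hard-code the address, one option is to read it from an environment variable. This is a minimal sketch: the variable name `AQUILADB_HOST` and the helper `aquiladb_target` are hypothetical names made up for this example, not something AquilaDB defines. The port `50051` is the one used in the cell below.

```python
import os

def aquiladb_target(default_host="localhost", port=50051):
    """Build the 'host:port' string to pass to grpc.insecure_channel."""
    # AQUILADB_HOST is a hypothetical variable name chosen for this sketch
    host = os.environ.get("AQUILADB_HOST", default_host)
    return "{}:{}".format(host, port)

print(aquiladb_target())
```

The returned string could then be passed to `grpc.insecure_channel(...)` in place of the literal address.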

In [8]:
# helper functions to generate documents

import grpc

import vecdb_pb2
import vecdb_pb2_grpc

channel = grpc.insecure_channel('192.168.1.100:50051')
stub = vecdb_pb2_grpc.VecdbServiceStub(channel)

# API interface to add documents to AquilaDB
def addDocuments (documents_in):
    response = stub.addDocuments(vecdb_pb2.addDocRequest(documents=documents_in))
    return response


import base64
import json

# helper function to convert native data to API friendly data
def convertDocuments(vector, document):
    return {
            "vector": {
                "e": vector
            },
            "b64data": json.dumps(document, separators=(',', ':')).encode('utf-8')
        }


# API interface to get nearest documents from AquilaDB
def getNearest (matrix_in, k_in):
    response = stub.getNearest(vecdb_pb2.getNearestRequest(matrix=matrix_in, k=k_in))
    return response


# helper function to convert native data to API friendly data
def convertMatrix(vector):
    return [{
            "e": vector
    }]
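
To see what `convertDocuments` actually produces, here is a small standalone sketch that mirrors its body and runs without an AquilaDB connection. The function name `convert_document_sketch` and the sample values are made up for illustration. Note that, as written in the helper above, the `b64data` field carries compact UTF-8 JSON bytes.

```python
import json

def convert_document_sketch(vector, document):
    # Same shape as convertDocuments above: the vector goes under "e",
    # the metadata is serialized to compact JSON and encoded as UTF-8 bytes
    return {
        "vector": {"e": vector},
        "b64data": json.dumps(document, separators=(',', ':')).encode('utf-8'),
    }

# Illustrative values only: a tiny vector and a one-field metadata document
payload = convert_document_sketch([0.1, 0.2, 0.3], {"text": "Swans are birds."})
print(payload["b64data"])

# Decoding the bytes recovers the original metadata
restored = json.loads(payload["b64data"].decode('utf-8'))
print(restored["text"])
```

This round trip is why the query code later in the tutorial can read the original text back from the metadata of a returned document.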

### Send documents to AquilaDB for indexing
As mentioned previously, we need to store pre-encoded candidates in a vector database so that we can perform semantic similarity retrieval later. Here we take each line from the wiki articles, encode it with the `USE` model, attach the original text to the resulting vector as metadata, and send the documents to AquilaDB for indexing.

In [9]:
# number of lines to encode and send per request
batch_len = 200
# counts the batches sent so far
counter = 0

for start in range(0, len(lines), batch_len):
    # slicing also yields the final, shorter batch, so no trailing lines are dropped
    lbatch = lines[start:start + batch_len]
    counter = counter + 1
    # generate embeddings
    vectors = generate_embeddings(lbatch)
    # pair each vector with its source line as metadata
    docs_gen = [convertDocuments(vectors[i], {"text": lbatch[i]}) for i in range(len(vectors))]
    # add documents to AquilaDB
    response = addDocuments(docs_gen)
    print("index: "+str(counter), "inserted: "+str(len(response._id)))

index: 1 inserted: 199
index: 2 inserted: 178
index: 3 inserted: 144
index: 4 inserted: 200
index: 5 inserted: 198
index: 6 inserted: 199
index: 7 inserted: 197
index: 8 inserted: 185
index: 9 inserted: 196
index: 10 inserted: 170
index: 11 inserted: 200
index: 12 inserted: 193
index: 13 inserted: 197
index: 14 inserted: 194
index: 15 inserted: 190
index: 16 inserted: 200
index: 17 inserted: 151
index: 18 inserted: 200
index: 19 inserted: 195
index: 20 inserted: 200
index: 21 inserted: 183
index: 22 inserted: 200
index: 23 inserted: 186
index: 24 inserted: 198
index: 25 inserted: 200
index: 26 inserted: 197
index: 27 inserted: 200
index: 28 inserted: 163
index: 29 inserted: 198
index: 30 inserted: 179
index: 31 inserted: 200
index: 32 inserted: 190
index: 33 inserted: 188
index: 34 inserted: 195
index: 35 inserted: 198
index: 36 inserted: 200
index: 37 inserted: 194
index: 38 inserted: 184
index: 39 inserted: 200
index: 40 inserted: 192
index: 41 inserted: 197
index: 42 inserted: 199

### Query the database
Now we need to retrieve the sentence in the database that is semantically most similar to our input query. It is straightforward: encode the query text with the same `USE` model and then perform a k-NN query on the database.

In [12]:
# Method to query for nearest neighbours
def query_nn (query):
    query = [query]
    vector = generate_embeddings(query)[0]
    
    converted_vector = convertMatrix(vector)
    nearest_docs_result = getNearest(converted_vector, 1)
    nearest_docs_result = json.loads(nearest_docs_result.documents)
    
    return nearest_docs_result

In [15]:
# Let's try an example query.

print(query_nn('what are the subfamilies of duck')[0]['doc']['text'])

Swans are birds of the family Anatidae, which also includes geese and ducks. Swans are grouped with the closely related geese in the subfamily Anserinae where they form the tribe Cygnini. Sometimes, they are considered a distinct subfamily, Cygninae. Swans usually mate for life, though 'divorce' does sometimes occur, particularly following nesting failure. The number of eggs in each clutch ranges from three to eight.
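
Conceptually, the k-NN query above returns the stored vectors closest to the query vector. The sketch below shows that idea as an exact brute-force search over a tiny in-memory list. It is only an illustration: the names `squared_distance` and `brute_force_nearest`, the toy vectors, and the choice of Euclidean distance are assumptions of this example. AquilaDB performs approximate nearest-neighbour search instead, and its distance metric may differ.

```python
import heapq

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_nearest(query_vec, candidates, k=1):
    """Return the k (vector, text) pairs closest to query_vec, nearest first."""
    return heapq.nsmallest(
        k, candidates, key=lambda item: squared_distance(query_vec, item[0])
    )

# Toy 2-dimensional "embeddings" paired with their source text (illustrative only)
candidates = [
    ([1.0, 0.0], "sentence A"),
    ([0.0, 1.0], "sentence B"),
    ([0.7, 0.7], "sentence C"),
]

for vec, text in brute_force_nearest([0.9, 0.1], candidates, k=2):
    print(text)
```

Brute force compares the query against every stored vector, which becomes slow as the collection grows. That cost is the reason for handing the search to an ANN index.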



That's all for this tutorial. Thanks, happy hacking..!

created with ❤️ a-mma.indic (a_മ്മ)