Apache License

In [16]:
'\nCopyright 2019 Carlos Rodriguez\n\nLicensed under the Apache License, Version 2.0 (the "License");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an "AS IS" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n'

# Writing a Q&A Chat-bot from scratch using the Universal Sentence Encoder and KNN Vector Search

This tutorial explores how semantic textual similarity and nearest-neighbor search can map natural-language questions to answers.


First, let's make sure Google Colab is running TensorFlow 2.x.

In [3]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


Install the [NMSLib](https://github.com/nmslib/nmslib) library. NMSLib is an efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

In [0]:

!pip3 install --quiet nmslib

Import the dependencies.

In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import nmslib

import os
import time
import sys
import datetime
import random
from dataclasses import dataclass, asdict

We'll use version 4 of Google's Universal Sentence Encoder. The `universal-sentence-encoder-large` model is trained with a Transformer encoder and is designed to embed sentences for tasks such as semantic similarity, classification, and clustering.

In [0]:
# initialize encoder
USENC_4: str = "https://tfhub.dev/google/universal-sentence-encoder-large/4"
encoder: hub.module.Module = hub.load(USENC_4)
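
To make "semantic similarity" concrete: the encoder maps each sentence to a fixed-length vector (512 dimensions for this model), and sentences with similar meaning should end up pointing in similar directions. A minimal sketch of the cosine similarity score follows. The 3-dimensional vectors are hypothetical stand-ins chosen by hand, not real encoder output.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for sentence embeddings (made-up values).
favorite_team = [0.9, 0.1, 0.0]
best_team = [0.8, 0.2, 0.1]
hometown = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(favorite_team, best_team)
sim_far = cosine_similarity(favorite_team, hometown)
print(f"favorite vs best: {sim_close:.2f}")
print(f"favorite vs hometown: {sim_far:.2f}")
```

The first pair scores close to 1 and the second close to 0, which is the kind of separation the search step relies on.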

Let's create some small helper functions that wrap our encoder and search index.

In [0]:
"""Encoder"""

# extract embeddings and convert the returned tensor to a numpy array
def encode(messages: list, encoder: hub.module.Module=encoder) -> np.ndarray:
    return encoder(messages)["outputs"].numpy()

"""Search"""

# initialize a search index
def create_index(embeddings: np.ndarray, method: str='hnsw') -> nmslib.dist.FloatIndex:
    # ref: https://github.com/nmslib/nmslib/blob/master/manual/methods.md

    # initialize a new HNSW index; the 'cosinesimil' space reports cosine
    # distance (1 - cosine similarity), so smaller values mean closer matches
    search_index: nmslib.dist.FloatIndex = nmslib.init(method=method, space='cosinesimil')
    search_index.addDataPointBatch(embeddings)
    search_index.createIndex({'post': 2}, print_progress=True)

    return search_index

# perform a knn search
def search(query_vector: np.ndarray, search_index: nmslib.dist.FloatIndex, n_results:int = 3) -> tuple:
    idx, dist = search_index.knnQuery(query_vector, k=n_results)
    return (idx, dist)
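
HNSW is an approximate method, so it can help to have an exact reference to compare against on small data. The sketch below is a brute-force k-NN search by cosine distance in plain NumPy. It is not how NMSLib works internally, the function name `brute_force_knn` is hypothetical, it assumes a 1-D query vector, and the 2-d vectors are made-up toy data.

```python
import numpy as np

def brute_force_knn(query: np.ndarray, embeddings: np.ndarray, k: int = 3) -> tuple:
    """Exact k-nearest neighbors by cosine distance (1 - cosine similarity)."""
    # normalize rows so that a dot product equals cosine similarity
    emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    distances = 1.0 - emb_unit @ query_unit
    # indices of the k smallest distances, closest first
    idx = np.argsort(distances)[:k]
    return idx, distances[idx]

# Hypothetical 2-d "embeddings" for three key-phrases.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_idx, toy_dist = brute_force_knn(np.array([2.0, 0.1]), toy_embeddings, k=2)
print(toy_idx, toy_dist)
```

With only a handful of key-phrases a scan like this is fast enough on its own. An index pays off as the number of key-phrases grows.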

Let's create a simple Q&A bot that defines a few sample questions and their corresponding answers.

In [0]:
"""Bot"""

@dataclass
class QABot:
    queries: dict
    answers: dict

    @property
    def keyphrases(self) -> list:
        return list(self.queries.keys())


Now we can seed our bot with some data.

In [0]:
# provide some sample key phrases
sample_queries: dict = {
    "favorite baseball team": "fav_baseball",
    "best baseball team": "fav_baseball",
    "favorite basketball team": "fav_basketball",
    "best basketball team": "fav_basketball",
    "best NBA team": "fav_basketball",
    "grew up": "hometown",
    "hometown": "hometown",
    "grow up": "hometown"
}

# provide some answers
answers: dict = {
    'fav_baseball': ["NY Yankees, obviously", "have to say...Yankees"],
    'fav_basketball': ["Grew up in the Jordan era...Bulls", "Bulls", "Chicago"],
    'hometown': ["South Norwalk, CT", "Connecticut", "Southern Connecticut right outside of NY"]
}

qa_bot: QABot = QABot(sample_queries, answers)
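
Because each key-phrase points to an answer key by name, a typo in the seed data only surfaces later as a `KeyError` at chat time. A small consistency check can catch that earlier. This is a sketch: the helper name `missing_answer_keys` is hypothetical, and it takes the two plain dicts rather than the bot object so that it runs on its own.

```python
def missing_answer_keys(queries: dict, answers: dict) -> set:
    """Return answer keys that a key-phrase references but `answers` does not define."""
    return set(queries.values()) - set(answers.keys())

# Toy seed data: the second mapping references a key with no answers.
ok = missing_answer_keys({"hometown": "hometown"}, {"hometown": ["Connecticut"]})
broken = missing_answer_keys({"favorite color": "fav_color"}, {"hometown": ["Connecticut"]})
print(ok)
print(broken)
```

In the notebook this could be called as `missing_answer_keys(qa_bot.queries, qa_bot.answers)` right after seeding.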

Whenever we update the seed data, we'll want to re-create a search index.

In [0]:
# re-create the search index anytime new data is added to the bot 
keyphrase_embeddings: np.ndarray = encode(qa_bot.keyphrases)
search_index: nmslib.dist.FloatIndex = create_index(keyphrase_embeddings)
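
One way to keep the bot and the index in sync is to route every update through a single helper that adds the key-phrase and then rebuilds the index from scratch. The sketch below assumes that approach. `add_keyphrase` is a hypothetical name, and the encoder and index builder are passed in as callables so the example runs without the model or NMSLib.

```python
from typing import Callable

def add_keyphrase(queries: dict, phrase: str, answer_key: str,
                  encode_fn: Callable, create_index_fn: Callable):
    """Register a new key-phrase, then rebuild the index over all key-phrases."""
    queries[phrase] = answer_key
    keyphrases = list(queries.keys())
    # re-encode everything so index position i still corresponds to keyphrases[i]
    return create_index_fn(encode_fn(keyphrases))

# Stand-in callables (assumptions for this sketch); in the notebook you would
# pass `encode` and `create_index` instead.
fake_encode = lambda phrases: [[float(len(p))] for p in phrases]
fake_create_index = lambda embeddings: list(embeddings)

demo_queries = {"hometown": "hometown"}
demo_index = add_keyphrase(demo_queries, "where are you from", "hometown",
                           fake_encode, fake_create_index)
print(len(demo_index))
```

Rebuilding from scratch is simple and keeps positions aligned with `keyphrases`; for a handful of phrases the cost should be negligible.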

Now, let's create a simple chat interface.

In [0]:
"""Chat Interface"""

def _bubbles(pause: int):
    # credit https://gist.github.com/Y4suyuki/6805818
    animation = "|/-\\"

    for i in range(pause):
        time.sleep(0.1)
        sys.stdout.write("\r" + animation[i % len(animation)])
        sys.stdout.flush()

def chat(message: str, bot: QABot=qa_bot, search_index: nmslib.dist.FloatIndex=search_index):
    # delay animation
    _bubbles(5)

    # encode query
    query_vector: np.ndarray = encode([message])

    # get search results
    idx, dist = search(query_vector, search_index)

    # traverse to the first answer
    # (check the length: idx.any() is False when the only match is index 0)
    if len(idx) > 0 and dist[0] < 0.75:
        # match the search result index to a corresponding key-phrase
        search_result: str = bot.keyphrases[idx[0]]

        # use the plain text key-phrase to map to an answer
        answer_key: str = bot.queries[search_result]
        answer: list = bot.answers[answer_key]

        # randomize the answer for variety
        return random.choice(answer)

    # no results, or the closest key-phrase is too far away
    return "Sorry, I don't have an answer for that." # TODO no english
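
The decision step inside `chat` (take the nearest key-phrase, accept it only if its distance is under a cutoff, otherwise fall back) can also be written as a pure function, which makes it easy to test without loading the encoder. This is a sketch under that assumption. `pick_answer` and its parameters are hypothetical names, and the indices and distances below are made up.

```python
import random

def pick_answer(idx, dist, keyphrases: list, queries: dict, answers: dict,
                max_distance: float = 0.75,
                fallback: str = "Sorry, I don't have an answer for that.") -> str:
    """Map k-NN results to an answer, or fall back when nothing is close enough."""
    if len(idx) == 0 or dist[0] >= max_distance:
        return fallback
    answer_key = queries[keyphrases[idx[0]]]
    return random.choice(answers[answer_key])

toy_keyphrases = ["hometown", "best NBA team"]
toy_queries = {"hometown": "hometown", "best NBA team": "fav_basketball"}
toy_answers = {"hometown": ["Connecticut"], "fav_basketball": ["Bulls"]}

# nearest neighbor is index 1 at distance 0.2, which is under the cutoff
hit = pick_answer([1, 0], [0.2, 0.6], toy_keyphrases, toy_queries, toy_answers)
# nearest neighbor is at distance 0.9, which is over the cutoff
miss = pick_answer([0, 1], [0.9, 0.95], toy_keyphrases, toy_queries, toy_answers)
print(hit)
print(miss)
```

Exposing the cutoff as a parameter also makes it easier to experiment with values other than 0.75.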

In [217]:
chat("What's your favorite baseball team?")

|

'NY Yankees, obviously'

In [212]:
chat("Where did you grow up?")

|

'South Norwalk, CT'

In [213]:
chat("Where's your hometown?")

|

'South Norwalk, CT'

In [214]:
chat("What's the best team in the NBA?")

|

'Grew up in the Jordan era...Bulls'

In [218]:
chat("what's the weather?")

|

"Sorry, I don't have an answer for that."