# Ask the docs anything about SuperDuperDB

In this notebook we show you how to implement the much-loved document Q&A task, using SuperDuperDB
together with MongoDB.

In [None]:
!pip install superduperdb

In [1]:
import os

# Set your key here if it is not already exported in the environment
# os.environ['OPENAI_API_KEY'] = '<your-key>'
if not os.environ.get('OPENAI_API_KEY'):
    raise Exception('Environment variable "OPENAI_API_KEY" not set')

In [2]:
import os


# Leave exactly one of the following lines uncommented to choose a MongoDB deployment.
# The mongomock option is handy for testing; the active line expects MongoDB on localhost.

# mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")
mongodb_uri = "mongodb://localhost:27017/testTest"
# mongodb_uri = "mongodb://superduper:superduper@mongodb:27017/documents"
# mongodb_uri = "mongodb://<user>:<pass>@<mongo_cluster>/<database>"
# mongodb_uri = "mongodb+srv://<username>:<password>@<atlas_cluster>/<database>"

# Super-Duper your Database!
from superduperdb import superduper
db = superduper(mongodb_uri)

In this example we use text from the `superduperdb` project's own API documentation, with the "meta"-goal of
creating a chatbot that answers questions about the very project we are using.

Uncomment the following cell if you have the `superduperdb` project checked out locally and would like to build the chunks from the latest version of the documentation.
Otherwise, load the prepared data in the cells after it.

In [None]:
# import glob

# ROOT = '../docs/content/docs'

# STRIDE = 5       # stride in numbers of lines
# WINDOW = 10       # length of window in numbers of lines

# content = sum([open(file).readlines()
#                for file in glob.glob(f'{ROOT}/*/*.md')
#                + glob.glob(f'{ROOT}/*.md')], [])
# chunks = ['\n'.join(content[i: i + WINDOW]) for i in range(0, len(content), STRIDE)]
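
To make the roles of `STRIDE` and `WINDOW` concrete, here is a minimal sketch of the same sliding-window chunking on a toy list of lines. The `content` list is a stand-in invented for illustration, not the real documentation. Because the stride is smaller than the window, consecutive chunks overlap.

```python
# Toy stand-in for the lines read from the Markdown files (hypothetical data)
content = [f"line {i}" for i in range(12)]

STRIDE = 5   # how far the start of the window advances each step, in lines
WINDOW = 10  # number of lines per chunk

# One chunk per window start; the last chunks are shorter than WINDOW
starts = list(range(0, len(content), STRIDE))
chunks = ["\n".join(content[i:i + WINDOW]) for i in starts]

for start, chunk in zip(starts, chunks):
    print(start, len(chunk.splitlines()))
```

With these toy values, lines 5 to 9 appear in both the first and the second chunk, so a passage that straddles a window boundary is still kept together in at least one chunk.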

In [3]:
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/superduperdb_docs.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  139k  100  139k    0     0   510k      0 --:--:-- --:--:-- --:--:--  511k


In [4]:
import json

with open('superduperdb_docs.json') as f:
    chunks = json.load(f)

Each chunk mixes snippets of code with explanation,
which is the raw material for a document Q&A chatbot.

In [5]:
from IPython.display import Markdown
Markdown(chunks[1])

  ML directly to the datastore: to databases, object-storage, data-lakes and data-warehouses.

- Empower developers, data scientists and architects to leverage the vast AI

  **open-source ecosystem** in their datastore deployments.

- Enable ways-of-working with AI and data which **enable scalability** and industrial scale deployment,

  as well as providing easy-to-use tools for the **individual developer**.

- Make possible continued use of **already existing or deployed datastores**, in combination with AI and ML;

  **no migration of data** to a fancy, new fangled vector-database required.

- Follow a **fully open-source approach**, in particular prioritizing open-source integrations

  in our roadmap going forward

- Enable individuals and organizations to **circumvent vendor lock-in strategies** now ubiquitous


As usual we insert the data:

In [7]:
from superduperdb.container.document import Document
from superduperdb.db.mongodb.query import Collection

# Wrap each chunk in a Document and insert it into the 'questiondocs' collection
collection = Collection('questiondocs')
db.execute(collection.insert_many([Document({'txt': chunk}) for chunk in chunks]))

INFO:root:found 0 uris


(<pymongo.results.InsertManyResult at 0x2a0521270>,
 TaskWorkflow(database=<superduperdb.db.base.db.DB object at 0x2a0774d50>, G=<networkx.classes.digraph.DiGraph object at 0x2a04c9b90>))

We set up a standard `superduperdb` vector-search index using `openai` (although there are many options
here: `torch`, `sentence_transformers`, `transformers`, ...)

In [8]:
from superduperdb.container.vector_index import VectorIndex
from superduperdb.container.listener import Listener
from superduperdb.ext.openai.model import OpenAIEmbedding

db.add(
    VectorIndex(
        identifier='my-index',
        indexing_listener=Listener(
            model=OpenAIEmbedding(model='text-embedding-ada-002'),
            key='txt',
            select=collection.find(),
        ),
    )
)

INFO:root:Adding model text-embedding-ada-002 to db
INFO:root:Done.
373it [00:00, 18593.94it/s]
100%|██████████| 4/4 [00:06<00:00,  1.66s/it]
INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 373it [00:00, 1989.43it/s]


[]
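
Conceptually, a vector index stores one embedding per document and, at query time, returns the documents whose embeddings are closest to the query's embedding. The following is a minimal sketch of that idea using cosine similarity over a hand-written toy index. It is not how `superduperdb` implements the search, and the names `index`, `cosine` and `top_n` and the 3-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical 3-dimensional embeddings; real embeddings have far more dimensions
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.9, 0.1, 0.0],
}


def top_n(query_vector, n=2):
    """Return the ids of the n documents most similar to the query vector."""
    ranked = sorted(index, key=lambda k: cosine(index[k], query_vector), reverse=True)
    return ranked[:n]


nearest = top_n([1.0, 0.0, 0.0])
print(nearest)
```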

Now we create a chat-completion component, and add this to the system:

In [10]:
from superduperdb.ext.openai.model import OpenAIChatCompletion

chat = OpenAIChatCompletion(
    model='gpt-3.5-turbo',
    prompt=(
        'Use the following description and code-snippets about SuperDuperDB to answer this question about SuperDuperDB\n'
        'Do not use any other information you might have learned about other python packages\n'
        'Only base your answer on the code-snippets retrieved\n'
        '{context}\n\n'
        'Here\'s the question:\n'
    ),
)

db.add(chat)

[]
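
The `{context}` placeholder in the prompt is where the retrieved chunks end up. As a rough sketch of the idea, and assuming the snippets are simply joined with newlines and the question is appended at the end (the library's actual formatting may differ), the final prompt could be assembled like this. `TEMPLATE` is a shortened version of the prompt above, and `build_prompt` is a hypothetical helper, not a `superduperdb` function.

```python
# Shortened version of the prompt template used above
TEMPLATE = (
    "Only base your answer on the code-snippets retrieved\n"
    "{context}\n\n"
    "Here's the question:\n"
)


def build_prompt(template, snippets, question):
    """Fill the {context} slot with the retrieved snippets, then append the question."""
    context = "\n".join(snippets)
    return template.format(context=context) + question


# Hypothetical retrieved snippets, standing in for the output of a vector search
snippets = ["VectorIndex wraps a Listener.", "Listeners compute model outputs."]
question = "What does a VectorIndex wrap?"

prompt = build_prompt(TEMPLATE, snippets, question)
print(prompt)
```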

We can verify that the model is now registered in the system:

In [11]:
print(db.show('model'))

['gpt-3.5-turbo', 'text-embedding-ada-002']


Finally, we can ask questions about the documents. The context handed to the chat model is chosen by a query,
here a vector search for the five chunks most similar to the question. Because this is part of a MongoDB query,
vector search can be combined with ordinary filtering rules:

In [12]:
from superduperdb.container.document import Document
from IPython.display import display, Markdown

q = 'can I use vector search with MongoDB community'

output, context = db.predict(
    model_name='gpt-3.5-turbo',
    input=q,
    context_select=(
        collection
            .like(Document({'txt': q}), vector_index='my-index', n=5)
            .find()
    ),
    context_key='txt',
)

Markdown(output.content)

INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 373it [00:00, 1866.17it/s]


Yes, you can use vector search with MongoDB community. The code snippets and description provided mention that SuperDuperDB includes a composite API that enables support for vector search together with the query API of the database. It also mentions that if one or more `VectorIndex` instances have been configured together with the `DB`, they can be used in hybrid queries together with standard databasing queries.
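
To illustrate what combining vector search with filtering rules means, here is a small self-contained sketch: candidates are first narrowed by an exact-match filter and only then ranked by similarity. This is plain Python over toy data, not the `superduperdb` or MongoDB query API. The `lang` field, the 2-dimensional vectors and the `hybrid_search` helper are hypothetical, and a dot product stands in for whatever similarity measure the index uses.

```python
# Toy documents with a metadata field and a hypothetical 2-d embedding
docs = [
    {"txt": "vector search guide", "lang": "en", "vec": [1.0, 0.0]},
    {"txt": "Vektorsuche Anleitung", "lang": "de", "vec": [0.99, 0.1]},
    {"txt": "installation notes", "lang": "en", "vec": [0.0, 1.0]},
    {"txt": "hybrid queries", "lang": "en", "vec": [0.8, 0.6]},
]


def hybrid_search(query_vec, filter_, n):
    """Keep documents matching every key/value in filter_, then rank by dot product."""
    candidates = [d for d in docs if all(d.get(k) == v for k, v in filter_.items())]

    def score(d):
        return sum(a * b for a, b in zip(d["vec"], query_vec))

    ranked = sorted(candidates, key=score, reverse=True)
    return [d["txt"] for d in ranked[:n]]


results = hybrid_search([1.0, 0.0], {"lang": "en"}, n=2)
print(results)
```

The German document is the second-closest vector overall, yet it never reaches the ranking step because the filter excludes it first.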