Documentation issues #134

Open
pkpro opened this issue Feb 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

pkpro commented Feb 26, 2024

Describe the bug

I'm referring to the documentation at the following URL: https://epsilla-inc.gitbook.io/epsilladb/vector-database

  1. Missing documentation on querying existing databases
  2. Missing documentation on querying existing tables
  3. Missing documentation on querying fields of existing tables
  4. Missing documentation on filtering syntax
  5. Missing documentation on indexing (see additional context).

Additional context

There is some rudimentary documentation on indexing, but some key points are missing and some are unclear:
a) There is no information on how to create a table with an index on an embedding field of dataType VECTOR_FLOAT when the embedding is not created by a "model" but is provided as part of the data during insert.

My use case: inserting billions of natural-language sentences (STRINGs) together with their embeddings, then querying by embedding vector later on to retrieve a sentence.

b) It is not clear if the "Embedding" name of the field in the table is a keyword or if the embedding vector column can have an arbitrary name.
c) It is also unclear whether an externally built embedding that is inserted into the table along with the data will be indexed automatically (by its name "Embedding" or by its type "VECTOR_FLOAT").
d) Is it possible to index any dataType other than STRING? From the documentation:
"When creating tables, you can define indices to let Epsilla automatically create embeddings for the STRING fields"
And then later on:
"Then you can insert records in their raw format and let Epsilla handle the embedding"
This is followed by an example that inserts text data together with their embeddings, even though the "Embedding" column is not defined in the table (in the previous code snippet) and even though Epsilla promises to create the embeddings automatically.

pkpro added the bug label Feb 26, 2024

pkpro commented Feb 26, 2024

One more missing point: how to query the number of records in a table?

pkpro commented Feb 27, 2024

Just to be clear, the following points are all about metadata (the list of databases, tables, and fields of the tables):
Missing documentation on querying existing databases
Missing documentation on querying existing tables
Missing documentation on querying fields of existing tables

@richard-epsilla
Contributor

Thank you so much for identifying the missing pieces of our documentation/functionality. We are on it.

pkpro commented Mar 1, 2024

You may also state in your documentation and in the description of your database that inserts take constant time (at least, that is what I have experienced with 4.7M records inserted into a single database). It would also be an advantage to state that index creation is not required at all, and that index creation is first and foremost a call to a model to create embeddings.
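
One way to check the constant-time observation is to time each insert batch separately and see whether the per-batch time stays flat as the table grows. The following is only a minimal sketch: `time_batches` and `fake_insert` are hypothetical helper names, and `fake_insert` is a stand-in so the snippet runs without a database. With a real client, `insert_batch` could wrap a call such as `epsilla_client.insert(table_name="sentences", records=batch)`.

```python
import time
from typing import Callable, List, Sequence


def time_batches(insert_batch: Callable[[Sequence[dict]], None],
                 records: Sequence[dict],
                 batch_size: int) -> List[float]:
    """Insert `records` in fixed-size batches; return seconds taken per batch."""
    timings = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        began = time.perf_counter()
        insert_batch(batch)
        timings.append(time.perf_counter() - began)
    return timings


# Stand-in for a real insert call (assumption: no database is available here).
stored: List[dict] = []


def fake_insert(batch: Sequence[dict]) -> None:
    stored.extend(batch)


records = [{"id": i, "sentence": f"sentence {i}"} for i in range(10)]
timings = time_batches(fake_insert, records, batch_size=4)
print(len(timings), len(stored))  # 3 batches, 10 records stored
```

If the timings for later batches are roughly the same as for the first ones, that supports the claim; a steady upward trend would contradict it.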

You may also use the following example in any form. It uses an external model to create embeddings, places them into the table, and then queries the data:

import os
import sys
import time
import orjson
import argparse
import pprint
from pyepsilla import vectordb
from sentence_transformers import SentenceTransformer

serial = 25000001
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2', device='cuda:0')

parser = argparse.ArgumentParser()
parser.add_argument('--host', type=str, required=True, help='epsilla database host')
parser.add_argument('--port', type=int, required=True, help='epsilla database port')
parser.add_argument('--dbname', type=str, required=True, help='epsilla database name')
# store_true flags: with type=bool, argparse would turn any non-empty string (even "False") into True.
parser.add_argument('--insert', action='store_true', help='Insert the example sentences')
parser.add_argument('--create', action='store_true', help='Create the sentences table')
parser.add_argument('--threshold', type=float, required=False, default=0.06, help='Similarity threshold')

args = parser.parse_args()

with open("sentences.json", "r") as json_file:
    data = orjson.loads(json_file.read())
    sentences=[item['sentence'] for item in data]
    languages=[item['language'] for item in data]
    stime=time.time()
    embeddings = model.encode(sentences, batch_size=len(sentences))
    etime=time.time()
    print(f"Embeddings were generated in {etime-stime}s")

    epsilla_client=vectordb.Client(host=args.host, port=args.port)
    epsilla_client.load_db(db_name=args.dbname, db_path="/data/epsilla/")
    epsilla_client.use_db(db_name=args.dbname)

    if args.create:
      status_code, response = epsilla_client.create_table(
        table_name="sentences",
        table_fields=[
            {"name": "id", "dataType": "INT", "primaryKey": True},
            {"name": "sentence", "dataType": "STRING"},
            {"name": "language", "dataType": "STRING"},
            {"name": "vector", "dataType": "VECTOR_FLOAT", "dimensions": 768, "metricType": "COSINE"}
        ]
      )
      print(f"Table creation Status Code: {status_code}, response: {response}")
      if status_code not in (200, 409):
        sys.exit(1)


    if args.insert:
        # Ids continue from the global `serial`; the loop variable gets its own name to avoid shadowing it.
        data_to_insert = [
          {'id': record_id, 'sentence': sentence, 'language': language, 'vector': embedding.tolist()}
          for record_id, (sentence, language, embedding) in enumerate(zip(sentences, languages, embeddings), start=serial + 1)
        ]

        status_code, response = epsilla_client.insert(
          table_name="sentences",
          records=data_to_insert
        )

        print(f"Insert status: {status_code}, response: {response}")

    stime=time.time()
    status_code, response = epsilla_client.query(
      table_name="sentences",
      query_field="vector",
      response_fields=["id", "language", "sentence"],
      query_vector=embeddings[0].tolist(),
      limit=10,
      with_distance=True
    )
    etime=time.time()

    print(f"Query status: {status_code}")
    if status_code == 200:
        # The small negative lower bound accounts for floating-point precision error.
        # The distance for exactly the same embedding is close to 0, but it might not be exactly 0 and may well be slightly negative.
        records = [record for record in response['result'] if -0.00001 < record['@distance'] < args.threshold]
        pp = pprint.PrettyPrinter(indent=2, width=120, depth=None, compact=False)
        pp.pprint(records)
    else:
        print(f"Epsilla Error: {response}")
    print(f"Epsilla responded to query in {etime-stime}s")

The above code is to be used with data like:

[
  { "sentence": "- Two beautiful race cars are about to start. Which one will win, Bob?\n - Of course, the red one! Everyone knows that red cars are the fastest, Alice!", "language" : "en" },
  { "sentence": "- Два красивых гоночных автомобиля собираются начать гонку. Какой победит, Боб?\n - Конечно, красный! Все знают, что красные машины самые быстрые, Элис!", "language" : "ru" },
  { "sentence": "- Zwei wunderschöne Rennwagen stehen kurz vor dem Start. Welcher wird gewinnen, Bob?\n - Natürlich der rote! Jeder weiß, dass rote Autos die schnellsten sind, Alice!", "language" : "de" }
]

Output:

Embeddings were generated in 0.23432421684265137s
[INFO] Connected to localhost:8888 successfully.
Query status: 200
[ { '@distance': -1.1920928955078125e-07,
    'id': 25000002,
    'language': 'en',
    'sentence': '- Two beautiful race cars are about to start. Which one will win, Bob?\n'
                ' - Of course, the red one! Everyone knows that red cars are the fastest, Alice!'},
  { '@distance': 0.04510009288787842,
    'id': 25000004,
    'language': 'de',
    'sentence': '- Zwei wunderschöne Rennwagen stehen kurz vor dem Start. Welcher wird gewinnen, Bob?\n'
                ' - Natürlich der rote! Jeder weiß, dass rote Autos die schnellsten sind, Alice!'},
  { '@distance': 0.05582815408706665,
    'id': 25000003,
    'language': 'ru',
    'sentence': '- Два красивых гоночных автомобиля собираются начать гонку. Какой победит, Боб?\n'
                ' - Конечно, красный! Все знают, что красные машины самые быстрые, Элис!'}]
Epsilla responded to query in 0.011300325393676758s
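
The slightly negative distance in the first result (about -1.19e-07) is the kind of value the filter has to tolerate. Below is a minimal sketch of that filtering step in isolation, using the distances printed above. It assumes the COSINE distance is computed as 1 minus cosine similarity, which is my assumption and not something the Epsilla documentation confirms; `cosine_distance` and `filter_by_distance` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed definition; rounding can push it slightly off 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filter_by_distance(records: List[Dict], threshold: float,
                       tolerance: float = 1e-5) -> List[Dict]:
    """Keep records closer than `threshold`, tolerating tiny negative distances."""
    return [r for r in records if -tolerance < r["@distance"] < threshold]


# Distances copied from the query output above.
results = [
    {"id": 25000002, "@distance": -1.1920928955078125e-07},
    {"id": 25000004, "@distance": 0.04510009288787842},
    {"id": 25000003, "@distance": 0.05582815408706665},
]
print([r["id"] for r in filter_by_distance(results, threshold=0.05)])
# prints [25000002, 25000004]
```

With the default threshold of 0.06 from the script, all three records pass; lowering it to 0.05 drops the Russian sentence.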

You have an amazing product, guys. I just stumbled upon it by chance and I'm really glad I found your project. Amazing performance and usability. I hope your project will get the attention it really deserves.
