# Vector Database Basics

Vector databases help us store, manage, and query the embeddings we created for generative AI, recommenders, and search engines.

Across many of the common use cases, users often find that they need to manage more than just vectors.
To make it easier for practitioners, vector databases should store and manage all of the data they need:
- embedding vectors
- categorical metadata
- numerical metadata
- timeseries metadata
- text / pdf / images / video / point clouds

And support a wide range of query workloads:
- Vector search (may require ANN-index)
- Keyword search (requires full text search index)
- SQL (for filtering)

For this exercise we'll use LanceDB, since it's open source and easy to set up.

In [1]:
# pip install -U --quiet lancedb pandas pydantic [This has been pre-installed for you]

## Creating tables and adding data

Let's create a LanceDB table called `cats_and_dogs` under the local database directory `~/.lancedb`.
This table should have 4 fields:
- the embedding vector
- a string field indicating the species (either "cat" or "dog")
- the breed
- average weight in pounds

We're going to use pydantic to make this easier. First, let's create a pydantic model with those fields.

In [1]:
from lancedb.pydantic import vector, LanceModel

class CatsAndDogs(LanceModel):
    vector: vector(2)
    species: str
    breed: str
    weight: float

Now connect to a local database at `~/.lancedb` and create an empty LanceDB table called `cats_and_dogs`.

In [2]:
import lancedb

db = lancedb.connect("~/.lancedb")
table_name = "cats_and_dogs"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, schema=CatsAndDogs)

Let's add some data

First some cats

In [3]:
data = [
    CatsAndDogs(
        vector=[1., 0.],
        species="cat",
        breed="shorthair",
        weight=12.,
    ),
    CatsAndDogs(
        vector=[-1., 0.],
        species="cat",
        breed="himalayan",
        weight=9.5,
    ),
]

Now call the `LanceTable.add` API to insert these two records into the table

In [4]:
table.add([dict(d) for d in data])

Let's preview the data

In [5]:
table.head().to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5


Now let's add some dogs

In [6]:
data = [
    CatsAndDogs(
        vector=[0., 10.],
        species="dog",
        breed="samoyed",
        weight=47.5,
    ),
    CatsAndDogs(
        vector=[0., -1.],
        species="dog",
        breed="corgi",
        weight=26.,
    )
]

In [7]:
table.add([dict(d) for d in data])

In [8]:
table.head().to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5
2,"[0.0, 10.0]",dog,samoyed,47.5
3,"[0.0, -1.0]",dog,corgi,26.0


## Querying tables

Vector databases allow us to retrieve data for generative AI applications. Let's see how that's done.

Let's say we have a new animal with the embedding [10.5, 10.]. Which animal in the table would you expect to be the most similar?
Can you use the table we created above to answer the question?

**HINT** You'll need the `search` API for LanceTable, along with the `limit` and `to_df` APIs. For examples, refer to the [LanceDB documentation](https://lancedb.github.io/lancedb/basic/#how-to-search-for-approximate-nearest-neighbors).

**SOLUTION** If you answered "samoyed", you're right! Here we pass the vector into `search` and chain a call to `limit` with 1 as the parameter. We then call `to_df` to execute the query and convert the results to a pandas DataFrame. In addition to the data columns, you'll also see a `_distance` column, which contains the distance between the query vector and the returned vector. In this case, the distance is the square of the Euclidean distance.

In [9]:
table.search([10.5, 10.]).limit(1).to_df()

Unnamed: 0,vector,species,breed,weight,_distance
0,"[0.0, 10.0]",dog,samoyed,47.5,110.25
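
The `_distance` value can be checked by hand. The following is a minimal sketch in plain Python, not LanceDB code: it copies the four vectors from the table and computes the squared Euclidean distance to the query. The helper name `squared_l2` is made up for this illustration.

```python
# Vectors copied from the cats_and_dogs table above.
animals = {
    "shorthair": [1.0, 0.0],
    "himalayan": [-1.0, 0.0],
    "samoyed": [0.0, 10.0],
    "corgi": [0.0, -1.0],
}
query = [10.5, 10.0]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


distances = {breed: squared_l2(query, vec) for breed, vec in animals.items()}
nearest = min(distances, key=distances.get)
print(nearest, distances[nearest])  # samoyed 110.25
```

The samoyed differs from the query only along the first axis, so its squared distance is 10.5² = 110.25, matching the output above.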


Now what if we use cosine distance instead? Would you expect that we get the same answer? Why or why not?

**HINT** you can add a call to `metric` in the call chain

**SOLUTION** Remember that cosine distance depends only on the angle between vectors, not on their lengths. So while the query vector [10.5, 10.] is closer to the samoyed in Euclidean distance, its angle is slightly closer to that of the shorthair cat.

In [10]:
table.search([10.5, 10.]).metric("cosine").limit(1).to_df()

Unnamed: 0,vector,species,breed,weight,_distance
0,"[1.0, 0.0]",cat,shorthair,12.0,0.275862
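
To see why the ranking flips, here is a small plain-Python sketch that computes cosine distance as 1 minus the cosine similarity for each row. It assumes this is the definition behind the `_distance` value shown above, which the 0.275862 result is consistent with. The helper name `cosine_distance` is made up for this illustration.

```python
import math

# Vectors copied from the cats_and_dogs table above.
animals = {
    "shorthair": [1.0, 0.0],
    "himalayan": [-1.0, 0.0],
    "samoyed": [0.0, 10.0],
    "corgi": [0.0, -1.0],
}
query = [10.5, 10.0]


def cosine_distance(a, b):
    """1 - cos(angle between a and b); vector lengths cancel out."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


cos = {breed: cosine_distance(query, vec) for breed, vec in animals.items()}
ranked = sorted(cos, key=cos.get)
for breed in ranked:
    print(breed, round(cos[breed], 6))
```

The shorthair comes out at about 0.2759 and the samoyed at about 0.3103, so the samoyed's much longer vector no longer helps it once only direction matters.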


## Filtering tables

In practice, we often need to specify more than just a search vector for good quality retrieval. Oftentimes we need to filter the metadata as well.

Please write code to retrieve the two examples most similar to the embedding [10.5, 10.], but only show the results that are cats.

**HINT** In LanceDB, for additional filtering, you can add a call to `where` in the call chain and pass in a SQL-like filter string.

In [11]:
table.search([10.5, 10.]).limit(2).where("species='cat'").to_df()

Unnamed: 0,vector,species,breed,weight,_distance
0,"[1.0, 0.0]",cat,shorthair,12.0,190.25
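
The query asked for two rows but only one came back. One explanation consistent with this output (an assumption, not something the notebook states) is that the filter was applied after the top-2 vector search rather than before it. The plain-Python sketch below contrasts the two orders; the helper names are hypothetical.

```python
# Rows copied from the cats_and_dogs table above.
rows = [
    {"breed": "shorthair", "species": "cat", "vector": [1.0, 0.0]},
    {"breed": "himalayan", "species": "cat", "vector": [-1.0, 0.0]},
    {"breed": "samoyed", "species": "dog", "vector": [0.0, 10.0]},
    {"breed": "corgi", "species": "dog", "vector": [0.0, -1.0]},
]
query = [10.5, 10.0]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(candidates, k):
    """Return the k candidates closest to the query."""
    return sorted(candidates, key=lambda r: squared_l2(query, r["vector"]))[:k]


# Post-filtering: take the top 2 over all rows, then keep only cats.
post = [r["breed"] for r in top_k(rows, 2) if r["species"] == "cat"]

# Pre-filtering: keep only cats, then take the top 2 among them.
cats = [r for r in rows if r["species"] == "cat"]
pre = [r["breed"] for r in top_k(cats, 2)]

print(post)  # ['shorthair']
print(pre)   # ['shorthair', 'himalayan']
```

With post-filtering, the samoyed takes one of the two slots and is then discarded, which leaves a single cat.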


## Creating ANN indices

For larger tables (e.g., >1M rows), searching through all of the vectors becomes quite slow. This is where the Approximate Nearest Neighbor (ANN) index comes into play. While there are many different ANN indexing algorithms, they all have the same purpose: to shrink the search space as much as possible while losing as little accuracy as possible.

For this problem we will create an ANN index on a LanceDB table and see how that impacts performance.
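
To make "limiting the search space" concrete, here is a toy partition-based index in plain Python. It is a sketch of the general idea only, not LanceDB's implementation: it picks random rows as centroids instead of training them with k-means, and all names (`ann_search`, `nprobes`, and so on) are chosen for this illustration.

```python
import random

random.seed(0)
DIM, N, NUM_PARTITIONS = 4, 1000, 10


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]

# Simplification: random rows act as centroids (a real index would train them).
centroids = random.sample(vectors, NUM_PARTITIONS)

# Assign every vector to the partition of its nearest centroid.
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for idx, vec in enumerate(vectors):
    nearest = min(range(NUM_PARTITIONS), key=lambda i: squared_l2(vec, centroids[i]))
    partitions[nearest].append(idx)


def ann_search(query, k, nprobes):
    """Scan only the nprobes partitions whose centroids are closest to the query."""
    probe = sorted(range(NUM_PARTITIONS), key=lambda i: squared_l2(query, centroids[i]))[:nprobes]
    candidates = [idx for p in probe for idx in partitions[p]]
    best = sorted(candidates, key=lambda idx: squared_l2(query, vectors[idx]))[:k]
    return best, len(candidates)


query = [random.gauss(0, 1) for _ in range(DIM)]
exact = sorted(range(N), key=lambda i: squared_l2(query, vectors[i]))[:5]
ids, scanned = ann_search(query, k=5, nprobes=2)
print(f"scanned {scanned} of {N} vectors, overlap with exact top 5: {len(set(ids) & set(exact))}")
```

Probing fewer partitions scans fewer vectors but risks missing true neighbors that fell into an unprobed partition; probing all of them reduces to an exact search.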

### First let's create some data

Given the constraints of the classroom workspace, we'll complete this exercise by creating 100,000 16-dimensional vectors in a new table. Here the embedding values don't matter, so we simply generate random embeddings as a 2D numpy array. We then use the `vec_to_table` function to convert that into an Arrow table, which can then be added to the table.

In [12]:
from lance.vector import vec_to_table
import numpy as np

mat = np.random.randn(100_000, 16)
table_name = "exercise3_ann"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, vec_to_table(mat))

### Let's establish a baseline without an index

Before we create the index, let's make sure we know what we need to compare against.

We'll generate a random query vector and record its value in the `query` variable so we can use the same query vector with and without the ANN index.

In [13]:
query = np.random.randn(16)
table.search(query).limit(10).to_df()

Unnamed: 0,vector,_distance
0,"[-0.3439972, -1.2907236, 0.44763204, -0.762746...",3.840627
1,"[0.74268633, -0.3067523, -0.46703756, -0.21678...",4.360204
2,"[-0.48958203, -0.21373776, -1.0575016, -0.6153...",4.387673
3,"[-0.6745144, -0.08040455, -0.84383017, -0.2910...",4.670218
4,"[-0.2973392, -1.2008524, -0.03314567, 0.082003...",4.680505
5,"[0.5707078, -1.329692, -0.111194655, 0.2509397...",4.748597
6,"[0.09398429, 0.026763173, -1.0471693, 0.103852...",4.838634
7,"[0.026056314, -2.1881604, 0.396483, 0.30784622...",4.854974
8,"[0.023356318, -0.5887882, -0.88108236, -0.5575...",5.159759
9,"[-0.45675716, -0.8042018, 0.43461478, -0.67579...",5.23562


Please write code to compute the average latency of this query.

**SOLUTION** There are several possible solutions. Given that we're in a notebook environment, the easiest is probably to use the `%timeit` magic function, which runs the command many times and reports the average.

In [14]:
%timeit table.search(np.random.randn(16)).limit(10).to_arrow();

20.4 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
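
Outside a notebook, `%timeit` is not available. As one alternative, the sketch below times repeated calls with `time.perf_counter` from the standard library. The helper `average_latency_ms` is a hypothetical name, and a stand-in workload replaces the LanceDB query so the example runs on its own.

```python
import statistics
import time


def average_latency_ms(fn, repeats=100):
    """Call fn() repeatedly and return (mean, stdev) latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; in the notebook you would pass something like
# lambda: table.search(query).limit(10).to_arrow()
mean_ms, std_ms = average_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per call")
```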


### Now let's create an index

There are many possible index types, ranging from hash-based to tree-based to partition-based to graph-based.
For this task, we'll create an IVFPQ index (a partition-based index with product quantization compression) using LanceDB.

Please create an IVFPQ index on the LanceDB table such that each partition holds 4,000 rows and each PQ subvector has 8 dimensions.

**HINT** 
1. Total vectors / number of partitions = number of vectors in each partition
2. Total dimensions / number of subvectors = number of dimensions in each subvector
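
Applying the two hints to this exercise's table (100,000 vectors of 16 dimensions) is simple arithmetic. The variable names below are chosen for this illustration.

```python
total_vectors = 100_000      # rows in the exercise3_ann table
total_dims = 16              # dimensions per vector
rows_per_partition = 4_000   # target from the task statement
dims_per_sub_vector = 8      # target from the task statement

num_partitions, rem_rows = divmod(total_vectors, rows_per_partition)
num_sub_vectors, rem_dims = divmod(total_dims, dims_per_sub_vector)

# Both targets divide evenly, so the parameters are whole numbers.
assert rem_rows == 0 and rem_dims == 0
print(num_partitions, num_sub_vectors)  # 25 2
```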

In [None]:
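# 100,000 vectors / 4,000 rows per partition = 25 partitions
# 16 dimensions / 8 dimensions per subvector = 2 subvectors
table.create_index(num_partitions=25, num_sub_vectors=2)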

Now let's search through the data again. Notice how the answers now appear different.
This is because an ANN index is always a tradeoff between latency and accuracy.

In [None]:
table.search(query).limit(10).to_df()
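
A common way to quantify the accuracy side of this tradeoff is recall@k: the fraction of the exact top-k results that the approximate search also returns. The sketch below uses made-up row ids purely for illustration; in the notebook you would compare the ids returned before and after building the index.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search recovered."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical row ids, not taken from the notebook's output.
exact_ids = [17, 4, 92, 8, 51, 33, 70, 2, 64, 29]
approx_ids = [17, 4, 8, 92, 33, 70, 11, 64, 5, 29]

print(recall_at_k(exact_ids, approx_ids))  # 0.8
```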

Now write code to compute the average latency for querying the same table using the ANN index.

**SOLUTION** The index is an implementation detail, so you should just run the same code as above. You should see a speed-up of almost an order of magnitude. On larger datasets, this performance difference should be even more pronounced.

In [None]:
%timeit table.search(np.random.randn(16)).limit(10).to_arrow();

## Deleting rows

As with other kinds of databases, you should be able to remove rows from the table.
Let's go back to our table of cats and dogs.

In [19]:
table = db["cats_and_dogs"]

In [20]:
len(table)

4

Can you use the `delete` API to remove all of the cats from the table?

**HINT** Use a SQL-like filter string to specify which rows to delete from the table.

In [21]:
table.delete("species='cat'")

In [22]:
len(table)

2

## What if I messed up?

Errors are a common occurrence in AI. What's hard about errors in vector search is that a bad vector often doesn't cause a crash but just produces nonsensical answers. Being able to roll back the state of the database is therefore very important for debugging and reproducibility.

So far we've accumulated 4 actions on the table:
1. creation of the table
2. added cats
3. added dogs
4. deleted cats

What if you realized that you should have deleted the dogs instead of the cats?

Here we can see the 4 versions that correspond to the 4 actions we've performed.

In [23]:
table.list_versions()

[{'version': 1,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 208330),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 236584),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 294913),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 33, 50, 823708),
  'metadata': {}}]

Please write code to restore the version that still contains the whole dataset, then delete the dogs instead.

In [24]:
table = db["cats_and_dogs"]

In [25]:
len(table)

2

In [26]:
table.restore(3)

In [27]:
table.delete("species='dog'")

In [28]:
table.list_versions()

[{'version': 1,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 208330),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 236584),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 26, 27, 294913),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 33, 50, 823708),
  'metadata': {}},
 {'version': 5,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 33, 50, 861895),
  'metadata': {}},
 {'version': 6,
  'timestamp': datetime.datetime(2023, 8, 30, 14, 33, 50, 903319),
  'metadata': {}}]
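
Note that restoring version 3 did not erase version 4; it added a version 5, and the second delete added version 6. The toy class below models that append-only behavior in plain Python. It is a conceptual sketch only, not how LanceDB stores versions, and `VersionedTable` is a made-up name.

```python
import copy


class VersionedTable:
    """Toy append-only version log: every change commits a new snapshot."""

    def __init__(self):
        self._snapshots = [[]]  # version 1: the empty table

    @property
    def version(self):
        return len(self._snapshots)

    @property
    def rows(self):
        return copy.deepcopy(self._snapshots[-1])

    def add(self, new_rows):
        self._snapshots.append(self.rows + list(new_rows))

    def delete(self, predicate):
        self._snapshots.append([r for r in self.rows if not predicate(r)])

    def restore(self, version):
        # Copy the old snapshot forward as a new version; history is never rewritten.
        self._snapshots.append(copy.deepcopy(self._snapshots[version - 1]))


t = VersionedTable()                               # version 1
t.add([{"species": "cat"}, {"species": "cat"}])    # version 2
t.add([{"species": "dog"}, {"species": "dog"}])    # version 3
t.delete(lambda r: r["species"] == "cat")          # version 4
t.restore(3)                                       # version 5
t.delete(lambda r: r["species"] == "dog")          # version 6
print(t.version, [r["species"] for r in t.rows])   # 6 ['cat', 'cat']
```

Because old snapshots are kept, the mistaken delete in version 4 remains inspectable even after the table has moved on.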

In [29]:
table.to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5


## Dropping a table

You can also choose to drop a table, which also completely removes the data.
Note that this operation is not reversible.

In [30]:
"cats_and_dogs" in db

True

Write code to irrevocably remove the table "cats_and_dogs" from the database

In [31]:
db.drop_table("cats_and_dogs")

How would you verify that the table has indeed been deleted?

In [32]:
table.name in db

False

## Summary

Congrats! In this exercise you've learned the basic operations of vector databases, from creating tables to adding and querying data. You've learned how to create indices and seen firsthand how an index changes performance and accuracy. Lastly, you've learned how to debug and roll back when errors happen.