# Vector-db Tutorial

In this tutorial, we'll walk through using vector-db as the retrieval layer of a RAG pipeline. More specifically, we'll show you
the ways you can interact with the database to get a retrieval system running.

In [1]:
!pip install -q requests

Set the base URL where your database is running.

In [38]:
DB_BASE_URI = "http://localhost:8000/api"

Let's start by looking at all the existing libraries in the database.

In [39]:
import requests
import datetime
import json

In [41]:
LIBRARY_RESOURCE = "library"

libraries = requests.get(f"{DB_BASE_URI}/{LIBRARY_RESOURCE}").json()
libraries

[]

As expected, we do not have any libraries yet, so we'll create one. To look at a sample request, head over to `${DB_BASE_URI}/redoc#tag/Library/operation/add_library_library_post`.

In [42]:
library_name = "test_library"

payload = {
  "name": library_name,
  "metadata": {
    "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
    "description": "This is a test library."
  }
}

vector_index_type = "flatl2"

response = requests.post(
    f"{DB_BASE_URI}/{LIBRARY_RESOURCE}",
    params={"index_type": vector_index_type},
    json=payload
)
print(response.json())

{'message': 'Library added successfully'}


In [43]:
# Let's now query the library:
response = requests.get(f"{DB_BASE_URI}/{LIBRARY_RESOURCE}/{library_name}")
print(response.json())

{'name': 'test_library', 'metadata': {'date_created': '2025-06-10', 'description': 'This is a test library.'}}


Now that we have a library, we can create a document. A document stores the text of a file we want to keep,
for example PDF contents, image contents, etc. For this project, we'll think of a document as a text-based document split up into
chunks. These chunks of text are what will be returned to a user when the user queries the database.
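
To make the hierarchy concrete, here is a minimal sketch of how a library, its documents, and their chunks relate to each other. The class names and fields are assumptions made for illustration (only `num_of_chunks` mirrors a field the API returns); this is not the database's actual internal schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    # A piece of text plus free-form metadata (hypothetical structure)
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Document:
    # A named container for the chunks that came from one source text
    name: str
    chunks: List[Chunk] = field(default_factory=list)

    @property
    def num_of_chunks(self) -> int:
        return len(self.chunks)


@dataclass
class Library:
    # A named collection of documents
    name: str
    documents: List[Document] = field(default_factory=list)


library = Library(name="test_library")
doc = Document(name="document_1")
library.documents.append(doc)
doc.chunks.append(Chunk(text="New York is a city.", metadata={"page_number": 0}))
print(doc.num_of_chunks)
```

A freshly created document therefore starts with zero chunks, which is what we should expect to see until we add some.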

In [44]:
DOCUMENT_RESOURCE = "document"

documents = requests.get(
    f"{DB_BASE_URI}/{DOCUMENT_RESOURCE}",
    params={"library_name": library_name}
).json()
documents
# This should be empty as we still haven't added any documents

[]

Let's create a few empty documents where we can store our data. To get a sample request format, head over to: `${DB_BASE_URI}/redoc#tag/Document/operation/add_document_document_post`

In [45]:
documents = [
    {
        "name": "document_1",
        "library_name": library_name,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "source": "local",
        }
    },
    {
        "name": "document_2",
        "library_name": library_name,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "source": "local",
        }
    },
    {
        "name": "document_3",
        "library_name": library_name,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "source": "local",
        }
    }
]

for document in documents:
    try:
        response = requests.post(
            f"{DB_BASE_URI}/{DOCUMENT_RESOURCE}",
            json=document
        )
        print(response.json())
    except Exception as e:
        print(e)

{'id': 'c2ffc916-ecb0-4818-b05b-a1a111766179', 'name': 'document_1', 'num_of_chunks': 0, 'metadata': {'date_created': '2025-06-10', 'source': 'local'}}
{'id': '0cfddb3c-3881-4fe6-898f-21e16e3eeedb', 'name': 'document_2', 'num_of_chunks': 0, 'metadata': {'date_created': '2025-06-10', 'source': 'local'}}
{'id': '123f1cc2-4105-489d-9fbf-6040787509da', 'name': 'document_3', 'num_of_chunks': 0, 'metadata': {'date_created': '2025-06-10', 'source': 'local'}}


Great! Now that we have our documents created, we're going to add some content to them. We'll start by defining a custom TextSplitter that splits a given text into `chunk_size` chunks of roughly equal length. Note that, despite its name, `chunk_size` here is the number of chunks to produce, not the size of each chunk.

In [46]:
from typing import List

class TextSplitter:
    def __init__(
        self, 
        chunk_size: int
    ):
        self.chunk_size = chunk_size
        
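    def split_text(self, text: str) -> List[str]:
        # chunk_size is used as the number of chunks to produce, so each
        # chunk holds roughly len(text) / chunk_size characters.
        avg_len = len(text) / self.chunk_size
        chunks = []
        start = 0
        for i in range(self.chunk_size):
            # Round the cumulative boundary so the chunks cover the whole text
            end = round((i + 1) * avg_len)
            chunks.append(text[start:end])
            start = end
        return chunks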
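
The splitter above cuts at fixed character offsets, so a chunk boundary can land in the middle of a word. As an alternative, here is a minimal sketch of a splitter that packs whole words into chunks of at most `max_chars` characters. The function name `split_on_words` and the `max_chars` parameter are hypothetical and not part of vector-db; note that this version also collapses newlines and repeated spaces.

```python
from typing import List


def split_on_words(text: str, max_chars: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole, so such a chunk
    can exceed the limit.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding the word would overflow: close the chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


word_chunks = split_on_words("the quick brown fox jumps", max_chars=10)
print(word_chunks)
```

The rest of this tutorial keeps using `TextSplitter`; the sketch only illustrates one possible design choice.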

In [47]:
# Copied over from https://en.wikipedia.org/wiki/New_York_City
nyc = """
New York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]

With an estimated population in 2024 of 8,478,072[5][6] distributed over 300.46 square miles (778.2 km2),[4] the city is the most densely populated major city in the United States. New York City has more than double the population of Los Angeles, the nation's second-most populous city.[20] With more than 20.1 million people in its metropolitan statistical area[21] and 23.5 million in its combined statistical area as of 2020, New York City is one of the world's most populous megacities.[22] The city and its metropolitan area are the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York City,[23] making it the most linguistically diverse city in the world. In 2021, the city was home to nearly 3.1 million residents born outside the United States,[20] the largest foreign-born population of any city in the world.[24]

New York City traces its origins to Fort Amsterdam and a trading post founded on Manhattan Island by Dutch colonists around 1624. The settlement was named New Amsterdam in 1626 and was chartered as a city in 1653. The city came under English control in 1664 and was temporarily renamed New York after King Charles II granted the lands to his brother, the Duke of York,[25] before being permanently renamed New York in November 1674. Following independence from Great Britain, the city was the national capital of the United States from 1785 until 1790.[26] The modern city was formed by the 1898 consolidation of its five boroughs: Manhattan, Brooklyn, Queens, the Bronx, and Staten Island.
"""

In [48]:
# Let's try out the text splitter. We'll split the above text into 3 chunks
splitter = TextSplitter(chunk_size=3)
chunks = splitter.split_text(nyc)
chunks

["\nNew York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]\n\nWith an estimated population",
 " in 2024 of 8,478,072[5][6] distributed over 300.46 square miles (778.2 km2),[4] the city is the most densely populated major city in the United States. New York City has more than double the population of Los Angeles, 

Let's put this together in document_1. To inspect a sample request for adding chunks to a document, head over to `${DB_BASE_URI}/redoc#tag/Chunk/operation/add_chunk_chunk_post`.
The payload should look as follows:

```json
{
  "library_name": "string",
  "chunks": [
    {
      "text": "string",
      "metadata": {
        "date_created": "2025-03-30 10:20:46",
        "doc_id": "string",
        "page_number": 0,
        "summary": "string"
      }
    }
  ]
}
```
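
Since we'll build this payload more than once, it can help to wrap it in a small helper. The following is a sketch only: the function name `build_chunk_payload` is hypothetical, and it assumes the `%Y-%m-%d` date format and the empty `summary` used elsewhere in this notebook rather than the timestamp format shown in the sample above.

```python
import datetime
from typing import Any, Dict, List


def build_chunk_payload(library_name: str, doc_id: str, texts: List[str]) -> Dict[str, Any]:
    """Build an add-chunks request body for one document (illustrative helper)."""
    today = datetime.datetime.now().strftime("%Y-%m-%d")
    return {
        "library_name": library_name,
        "chunks": [
            {
                "text": text,
                "metadata": {
                    "date_created": today,
                    "doc_id": doc_id,
                    # Use the chunk's position as its page number
                    "page_number": page_number,
                    "summary": "",
                },
            }
            for page_number, text in enumerate(texts)
        ],
    }


example_payload = build_chunk_payload("test_library", "some-doc-id", ["first", "second"])
print(len(example_payload["chunks"]))
```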

In [49]:
# Let's query the documents again so we know their IDs

documents = requests.get(
    f"{DB_BASE_URI}/document",
    params={"library_name": library_name}
).json()
documents

[{'id': 'c2ffc916-ecb0-4818-b05b-a1a111766179',
  'name': 'document_1',
  'num_of_chunks': 0,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}},
 {'id': '0cfddb3c-3881-4fe6-898f-21e16e3eeedb',
  'name': 'document_2',
  'num_of_chunks': 0,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}},
 {'id': '123f1cc2-4105-489d-9fbf-6040787509da',
  'name': 'document_3',
  'num_of_chunks': 0,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}}]

In [50]:
# We'll add these chunks to the first document (document_1), using its id from the listing above

payload = {
  "library_name": library_name,
  "chunks": []
}
for i, chunk in enumerate(chunks):
    chunk_payload = {
        "text": chunk,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "doc_id": documents[0]["id"],
            "page_number": i,
            "summary": "",
        }
    }
    payload["chunks"].append(chunk_payload)

print(json.dumps(payload, indent=4))

{
    "library_name": "test_library",
    "chunks": [
        {
            "text": "\nNew York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]\n\nWith an estimated population",
            "metadata": {
                "date_created": "2025-06-10",
                "doc_id": "c2ffc916-ecb0-4818-b05b-a1a111766179",


In [51]:
# Add chunks:

CHUNK_RESOURCE = "chunk"

response = requests.post(
    f"{DB_BASE_URI}/{CHUNK_RESOURCE}",
    json=payload
)
print(response.json())

{'message': 'Chunks added successfully'}


In [52]:
# Let's check on those chunks

response = requests.get(
    f"{DB_BASE_URI}/{CHUNK_RESOURCE}",
    params={"library_name": library_name}
)
response.json()

[{'id': 'a3c60dc2-9f7f-447f-88fb-6c5e93067ceb',
  'text': "\nNew York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]\n\nWith an estimated population",
  'embedding': [0.023666382,
   0.026565552,
   0.022583008,
   -0.004108429,
   -0.013206482],
  'metadata': {'date_created': '2025-06-10',
   'doc_id': 'c2ffc916-

In [53]:
# Remove a Chunk. Uncomment the following lines to remove a chunk

# chunk_id = "c1cefd2a-4929-4d75-8af8-d5f782d8cdd3"

# response = requests.delete(
#     f"{DB_BASE_URI}/{CHUNK_RESOURCE}/{chunk_id}",
#     params={"library_name": library_name}
# )
# response.json()

Now that we've added chunks to one document, let's do the same for the other two.

In [54]:
times_square = """
Times Square is a major commercial intersection, tourist destination, entertainment hub, and neighborhood in the Midtown Manhattan section of New York City. It is formed by the junction of Broadway, Seventh Avenue, and 42nd Street. Together with adjacent Duffy Square, Times Square is a bowtie-shaped plaza five blocks long between 42nd and 47th Streets.[2]

Times Square is brightly lit by numerous digital billboards and advertisements as well as businesses offering 24/7 service. One of the world's busiest pedestrian areas,[3] it is also the hub of the Broadway Theater District[4] and a major center of the world's entertainment industry.[5] Times Square is one of the world's most visited tourist attractions, drawing an estimated 50 million visitors annually.[6] Approximately 330,000 people pass through Times Square daily,[7] many of them tourists,[8] while over 460,000 pedestrians walk through Times Square on its busiest days.[2] The Times Square–42nd Street and 42nd Street–Port Authority Bus Terminal stations have consistently ranked as the busiest in the New York City Subway system, transporting more than 200,000 passengers daily.[9]

Formerly known as Longacre Square, Times Square was renamed in 1904 after The New York Times moved its headquarters to the then newly erected Times Building, now One Times Square.[10] It is the site of the annual New Year's Eve ball drop, which began on December 31, 1907, and continues to attract over a million visitors to Times Square every year,[11] in addition to a worldwide audience of one billion or more on various digital media platforms.[12]

Times Square, specifically the intersection of Broadway and 42nd Street, is the eastern terminus of the Lincoln Highway, the first road across the United States for motorized vehicles.[13] Times Square is sometimes referred to as "the Crossroads of the World"[14] and "the heart of the Great White Way".[15][16][17]
"""

central_park = """ 
Central Park is an urban park between the Upper West Side and Upper East Side neighborhoods of Manhattan in New York City, and the first landscaped park in the United States. It is the sixth-largest park in the city, containing 843 acres (341 ha), and the most visited urban park in the United States, with an estimated 42 million visitors annually as of 2016. It is also one of the most filmed locations in the world.

The creation of a large park in Manhattan was first proposed in the 1840s, and a 778-acre (315 ha) park approved in 1853. In 1858, landscape architects Frederick Law Olmsted and Calvert Vaux won a design competition for the park with their "Greensward Plan". Construction began in 1857; existing structures, including a majority-Black settlement named Seneca Village, were seized through eminent domain and razed. The park's first areas were opened to the public in late 1858. Additional land at the northern end of Central Park was purchased in 1859, and the park was completed in 1876. After a period of decline in the early 20th century, New York City parks commissioner Robert Moses started a program to clean up Central Park in the 1930s. The Central Park Conservancy, created in 1980 to combat further deterioration in the late 20th century, refurbished many parts of the park starting in the 1980s.

The park's main attractions include the Ramble and Lake, Hallett Nature Sanctuary, the Jacqueline Kennedy Onassis Reservoir, and Sheep Meadow; amusement attractions such as Wollman Rink, Central Park Carousel, and the Central Park Zoo; formal spaces such as the Central Park Mall and Bethesda Terrace; and the Delacorte Theater. The biologically diverse ecosystem has several hundred species of flora and fauna. Recreational activities include carriage-horse and bicycle tours, bicycling, sports facilities, and concerts and events such as Shakespeare in the Park. Central Park is traversed by a system of roads and walkways and is served by public transportation.

Its size and cultural position make it a model for the world's urban parks. Its influence earned Central Park the designations of National Historic Landmark in 1963 and of New York City scenic landmark in 1974. Central Park is owned by the New York City Department of Parks and Recreation but has been managed by the Central Park Conservancy since 1998, under a contract with the municipal government in a public–private partnership. The Conservancy, a non-profit organization, raises Central Park's annual operating budget and is responsible for all basic care of the park.
"""

In [55]:
chunks_time_square = splitter.split_text(times_square)
chunks_central_park = splitter.split_text(central_park)

In [56]:
# Let's query the documents again so we know their IDs

documents = requests.get(
    f"{DB_BASE_URI}/document",
    params={"library_name": library_name}
).json()
documents

[{'id': 'c2ffc916-ecb0-4818-b05b-a1a111766179',
  'name': 'document_1',
  'num_of_chunks': 3,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}},
 {'id': '0cfddb3c-3881-4fe6-898f-21e16e3eeedb',
  'name': 'document_2',
  'num_of_chunks': 0,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}},
 {'id': '123f1cc2-4105-489d-9fbf-6040787509da',
  'name': 'document_3',
  'num_of_chunks': 0,
  'metadata': {'date_created': '2025-06-10', 'source': 'local'}}]

In [57]:
# We'll add the two chunk sets to the last two documents (document_2 and document_3)

payload = {
  "library_name": library_name,
  "chunks": []
}
for i, chunk in enumerate(chunks_time_square):
    chunk_payload = {
        "text": chunk,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "doc_id": documents[-2]["id"],
            "page_number": i,
            "summary": "",
        }
    }
    payload["chunks"].append(chunk_payload)
    
for i, chunk in enumerate(chunks_central_park):
    chunk_payload = {
        "text": chunk,
        "metadata": {
            "date_created": datetime.datetime.now().strftime("%Y-%m-%d"),
            "doc_id": documents[-1]["id"],
            "page_number": i,
            "summary": "",
        }
    }
    payload["chunks"].append(chunk_payload)

print(json.dumps(payload, indent=4))

{
    "library_name": "test_library",
    "chunks": [
        {
            "text": "\nTimes Square is a major commercial intersection, tourist destination, entertainment hub, and neighborhood in the Midtown Manhattan section of New York City. It is formed by the junction of Broadway, Seventh Avenue, and 42nd Street. Together with adjacent Duffy Square, Times Square is a bowtie-shaped plaza five blocks long between 42nd and 47th Streets.[2]\n\nTimes Square is brightly lit by numerous digital billboards and advertisements as well as businesses offering 24/7 service. One of the world's busiest pedestrian areas,[3] it is also the hub of the Broadway Theater District[4] and a major center of the world's entertainment indust",
            "metadata": {
                "date_created": "2025-06-10",
                "doc_id": "0cfddb3c-3881-4fe6-898f-21e16e3eeedb",
                "page_number": 0,
                "summary": ""
            }
        },
        {
            "text": "ry.[5] Tim

In [60]:
# Add chunks:

CHUNK_RESOURCE = "chunk"

response = requests.post(
    f"{DB_BASE_URI}/{CHUNK_RESOURCE}",
    json=payload
)
print(response.json())

{'message': 'Chunks added successfully'}


In [61]:
# Let's check on all the chunks:

response = requests.get(
    f"{DB_BASE_URI}/{CHUNK_RESOURCE}",
    params={"library_name": library_name}
)
response.json()

[{'id': 'a3c60dc2-9f7f-447f-88fb-6c5e93067ceb',
  'text': "\nNew York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]\n\nWith an estimated population",
  'embedding': [0.023666382,
   0.026565552,
   0.022583008,
   -0.004108429,
   -0.013206482],
  'metadata': {'date_created': '2025-06-10',
   'doc_id': 'c2ffc916-

With all the chunks in place, let's index them so we can start querying.

In [62]:
response = requests.patch(
    f"{DB_BASE_URI}/{LIBRARY_RESOURCE}/query",
    params={"library_name": library_name},
)
response.json()

{'message': 'Index built successfully'}
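
To give some intuition for what a search over the index might do, here is a minimal sketch of an exhaustive nearest-neighbour search using L2 (Euclidean) distance. It assumes that the `flatl2` index type we chose earlier refers to this kind of brute-force search; the function names are hypothetical, and the database's actual implementation may differ.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance between two vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_l2_search(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_l2_search([0.9, 1.2], vectors, k=2)
print(nearest)
```

Every stored vector is compared against the query, so the cost grows linearly with the number of chunks. That is simple and exact, which is reasonable for a library as small as ours.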

Now let's start querying.

In [63]:
payload = {
  "library_name": library_name,
  "query": "What is NYC famous for?",
  "k": 2 # Get the 2 closest chunks
}

In [64]:
response = requests.post(
    f"{DB_BASE_URI}/{LIBRARY_RESOURCE}/query",
    params={"library_name": library_name},
    json=payload
)
response.json()

[{'id': 'a3c60dc2-9f7f-447f-88fb-6c5e93067ceb',
  'text': "\nNew York, often called New York City[b] or NYC, is the most populous city in the United States, located at the southern tip of New York State on one of the world's largest natural harbors. The city comprises five boroughs, each coextensive with a respective county. The city is the geographical and demographic center of both the Northeast megalopolis and the New York metropolitan area, the largest metropolitan area in the United States by both population and urban area. New York is a global center of finance[12] and commerce, culture, technology,[13] entertainment and media, academics and scientific output,[14] the arts and fashion, and, as home to the headquarters of the United Nations, international diplomacy.[15][16][17][18][19]\n\nWith an estimated population",
  'embedding': [0.023666382,
   0.026565552,
   0.022583008,
   -0.004108429,
   -0.013206482],
  'metadata': {'date_created': '2025-06-10',
   'doc_id': 'c2ffc916-
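
In a RAG pipeline, the retrieved chunks are typically passed to a language model as context. Here is a minimal sketch of that last step, assuming each query result is a dict with a `text` field as in the response above. The function name `build_prompt` and the prompt wording are hypothetical; vector-db itself only handles retrieval.

```python
from typing import Dict, List


def build_prompt(question: str, results: List[Dict[str, object]]) -> str:
    """Combine retrieved chunk texts and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{i + 1}] {str(result['text']).strip()}" for i, result in enumerate(results)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


sample_results = [{"text": "\nNew York is the most populous city in the United States. "}]
prompt = build_prompt("What is NYC famous for?", sample_results)
print(prompt)
```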

In [69]:
# Delete the library

response = requests.delete(
    f"{DB_BASE_URI}/{LIBRARY_RESOURCE}/{library_name}",
)
print(response.json())

{'message': 'Library removed successfully'}
