## Creating our vector database

We will use [lancedb](https://lancedb.github.io/lancedb/) here, an embedded database. Take a look at their documentation to see the pros and cons of this choice.
In our case, the database is small enough that we don't need to deal with managed servers: we can create it locally on disk, much like SQLite, and move it
around as needed.

- Install the dependencies

In [9]:
#!pip install lancedb
#!pip install datasets
#!pip install sentence-transformers
#!pip install huggingface_hub  # Install the `huggingface_hub` library to interact with the hub programmatically

In [730]:
from datasets import load_dataset
import pandas as pd
import tqdm

### Import lancedb

We first create a local database, and then load our fine-tuned sentence transformers model to generate the embeddings for us.

In [None]:
import lancedb

# Create a database locally, called `lancedb`
db = lancedb.connect("./lancedb")

In [13]:
import torch
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

model_name = "plaguss/bge-base-argilla-sdk-matryoshka"
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"

model = get_registry().get("sentence-transformers").create(name=model_name, device=device)

### Create a table with our embeddings

The next step consists of creating the table in our database to store the embeddings. We just need to run the following cell:

In [14]:
class Docs(LanceModel):
    query: str = model.SourceField()
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

The `Docs` class is a special type of `Pydantic` model that represents the data. It contains the following 3 fields:

- `query`: the queries that were generated in our `distilabel` pipeline, i.e. the `positives` that emulate questions from a user. Their
embeddings are what we store and search against, in order to retrieve the `text` that has the content for the answer.
- `text`: the original chunks of the documentation, the pieces we will feed our LLM to help it answer user questions.
- `vector`: the embedding vector computed from each `query` (the cell that populates the table embeds the `positive` column, not the documentation chunks).

And creating a table is as simple as running a single line:

In [None]:
table_name = "docs"
table = db.create_table(table_name, schema=Docs)

### Download the dataset

The database and the corresponding table are ready.

Let's download the dataset previously generated with our [distilabel pipeline](https://huggingface.co/datasets/plaguss/argilla_sdk_docs_queries)
to get the content we are going to embed for our RAG application.

In [731]:
ds = load_dataset("plaguss/argilla_sdk_docs_queries", split="train")

#### Populate our database with the chunks

We will iterate over the dataset (using an arbitrary batch size of 50 rows), generate the embeddings for the queries, and add them to the database together with their chunks:

In [742]:
batch_size = 50
for batch in tqdm.tqdm(ds.iter(batch_size), total=len(ds) // batch_size):
    embeddings = model.generate_embeddings(batch["positive"])
    df = pd.DataFrame.from_dict({"query": batch["positive"], "text": batch["anchor"], "vector": embeddings})
    table.add(df)


20it [00:05,  3.77it/s]                                                                                                                                                                                                                                              
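
The `total` passed to `tqdm` above uses floor division, so it undercounts by one whenever the dataset size is not a multiple of the batch size (assuming the last, partial batch is still yielded). A minimal sketch of the arithmetic, using a plain list as a stand-in for the dataset; the `iter_batches` helper and the size of 120 rows are hypothetical and chosen only for illustration:

```python
import math
from typing import Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` items (hypothetical helper)."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Stand-in for the dataset: 120 rows is an arbitrary size, not the real one.
rows = [f"row-{i}" for i in range(120)]
batch_size = 50

batches = list(iter_batches(rows, batch_size))
floor_total = len(rows) // batch_size           # misses the final partial batch
ceil_total = math.ceil(len(rows) / batch_size)  # counts the final partial batch

print(len(batches), floor_total, ceil_total)
```

Using the ceiling value as `total` keeps the progress bar in step with the real number of batches.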


This is the relationship between the fields in our synthetic dataset and the corresponding fields in our database, plus the vectors we have just generated:

- `batch["positive"]` -> `query`
- `batch["anchor"]` -> `text`
- `vector`

The `query` in our `docs` table corresponds to the synthetically generated query, which was stored in the `positive` column.

The `text` field in our `docs` table is obtained from the `anchor` column in our dataset, which corresponds to the chunks from the docs.

And finally, the `vector` holds the embeddings already generated with our model.
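
The same mapping can be sketched on a toy batch without any database. The strings and the three-dimensional vectors below are made up for illustration; only the column names (`positive`, `anchor`) and field names (`query`, `text`, `vector`) come from the cells above:

```python
from typing import Dict, List

# Toy batch in the column-wise layout that batched iteration yields.
batch: Dict[str, List[str]] = {
    "anchor": ["chunk about users", "chunk about datasets"],
    "positive": ["How do I list users?", "How do I create a dataset?"],
}
# Placeholder vectors standing in for the model's embeddings of `positive`.
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# One row per example, renaming the dataset columns to the table fields.
rows = [
    {"query": query, "text": text, "vector": vector}
    for query, text, vector in zip(batch["positive"], batch["anchor"], embeddings)
]

print(rows[0])
```

Each row pairs a query and its embedding with the chunk that answers it, which is what lets a search over query vectors return documentation text.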

We can see in action how to search for a given chunk within our database.

We will embed an example query with our model, search the table using a specific metric, and select the fields we want to grab (from the schema we defined).
The data will be returned as a list with at most as many records as the limit we set.

More information can be found in the [search section](https://lancedb.github.io/lancedb/search/) of the lancedb documentation.

In [None]:
query = "How can I get the current user?"
embedded_query = model.generate_embeddings([query])

retrieved = (
    table
        .search(embedded_query[0])
        .metric("cosine")
        .limit(3)
        .select(["text"])  # Just grab the chunk to use for context
        .to_list()
)

In [26]:
retrieved

[{'text': 'python\nuser = client.users("my_username")\n\nThe current user of the rg.Argilla client can be accessed using the me attribute:\n\npython\nclient.me\n\nClass Reference\n\nrg.User\n\n::: argilla_sdk.users.User\n    options:\n        heading_level: 3',
  '_distance': 0.1881886124610901},
 {'text': 'python\nuser = client.users("my_username")\n\nThe current user of the rg.Argilla client can be accessed using the me attribute:\n\npython\nclient.me\n\nClass Reference\n\nrg.User\n\n::: argilla_sdk.users.User\n    options:\n        heading_level: 3',
  '_distance': 0.20238929986953735},
 {'text': 'Retrieve a user\n\nYou can retrieve an existing user from Argilla by accessing the users attribute on the Argilla class and passing the username as an argument.\n\n```python\nimport argilla_sdk as rg\n\nclient = rg.Argilla(api_url="", api_key="")\n\nretrieved_user = client.users("my_username")\n```',
  '_distance': 0.20401990413665771}]
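
The `_distance` values above are cosine distances. Assuming the usual definition (one minus the cosine similarity), they range from 0 for vectors pointing the same way to 2 for opposite vectors, so smaller means closer. The brute-force sketch below ranks made-up three-dimensional vectors in that way; `cosine_distance` and `top_k` are hypothetical helpers written for this example, not lancedb functions:

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Dict], k: int) -> List[Tuple[str, float]]:
    """Score every row against the query and keep the k closest ones."""
    scored = [(row["text"], cosine_distance(query, row["vector"])) for row in rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy table: the texts and vectors are invented for illustration.
rows = [
    {"text": "chunk A", "vector": [1.0, 0.0, 0.0]},
    {"text": "chunk B", "vector": [0.0, 1.0, 0.0]},
    {"text": "chunk C", "vector": [1.0, 1.0, 0.0]},
]

results = top_k([1.0, 0.1, 0.0], rows, k=2)
for text, distance in results:
    print(f"{text}: {distance:.4f}")
```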

---

### Push the database to the Hugging Face Hub

Given the nature of this database, we can move it around like any other dataset or model, keep it together with our datasets, and retrieve it when needed.

Let's define some helper functions to compress and decompress a folder (in this case our `lancedb` database).

In [20]:
import tarfile
from pathlib import Path


def make_tarfile(source: Path) -> Path:
    """Creates a tar file from a directory and compresses it
    using gzip.

    Args:
        source: Path to a directory.

    Returns:
        path: Path of the new generated file.

    Raises:
        FileNotFoundError: If the directory doesn't exist.
    """
    print(f"Creating tar file from path: {source}...")
    source = Path(source)
    if not source.is_dir():
        raise FileNotFoundError(source)
    with tarfile.open(str(source) + ".tar.gz", "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    print(f"File generated at: {str(source) + '.tar.gz'}")
    return Path(str(source) + ".tar.gz")

def untar_file(source: Path) -> Path:
    """Untar and decompress a file created by `make_tarfile`.

    Args:
        source: Path pointing to a .tar.gz file.

    Returns:
        filename: Path of the decompressed directory.
    """
    # It assumes the file ends with .tar.gz
    new_filename = source.parent / source.stem.replace(".tar", "")
    with tarfile.open(source, "r:gz") as f:
        f.extractall(source.parent)
    print(f"File decompressed: {new_filename}")
    return new_filename
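
A quick way to gain confidence in the two helpers is a round trip on a throwaway directory. The sketch below repeats the same `tarfile` calls inline, so that it runs on its own, against a temporary folder containing a made-up file. It also shows why the archive name must end in `.tar.gz` for the `stem`-based renaming to work:

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "lancedb"
    source.mkdir()
    # Made-up file standing in for the real table data.
    (source / "example.txt").write_text("hello")

    # Compress: the same calls used in `make_tarfile`.
    archive = Path(str(source) + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(str(source), arcname=source.name)

    # `Path.stem` strips only the last suffix, hence the extra `.replace`.
    stem = archive.stem
    restored_name = stem.replace(".tar", "")

    # Extract into a separate folder so the copy can be compared with the original.
    target = Path(tmp) / "restored"
    target.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)

    restored_text = (target / restored_name / "example.txt").read_text()

print(stem, restored_name, restored_text)
```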

Compress the database to have a single file for simplicity:

In [758]:
lancedb_path = Path.cwd() / "lancedb"
lancedb_tar = make_tarfile(lancedb_path)

Creating tar file from path: /Users/agus/github_repos/argilla-io/distilabel-workbench/projects/argilla-sdk-bot/lancedb...
File generated at: /Users/agus/github_repos/argilla-io/distilabel-workbench/projects/argilla-sdk-bot/lancedb.tar.gz


### Push the database to the Hugging Face Hub

Now we are ready to push the database file to the Hugging Face Hub. Let's create a helper function for it.

In [23]:
from pathlib import Path
from huggingface_hub import HfApi
import os


def upload_database(
    database_path: Path,
    repo_id: str,
    path_in_repo: str = "lancedb.tar.gz",
    token: str = os.getenv("HF_API_TOKEN")
):
    database_path = make_tarfile(database_path)
    HfApi().upload_file(
        path_or_fileobj=database_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
        token=token,
    )


We then specify the path to the database and the name of the repo in the Hugging Face Hub, which is the same one where the dataset is stored. The remaining
arguments are optional:

In [26]:
# Path to the database in your local directory
local_dir = Path.home() / ".cache/argilla_sdk_docs_db"

upload_database(
    local_dir / "lancedb",
    repo_id="plaguss/argilla_sdk_docs_queries",
    path_in_repo="testing.tar.gz",
    token=os.getenv("HF_API_TOKEN")
)

Creating tar file from path: /Users/agus/.cache/argilla_sdk_docs_db/lancedb...
File generated at: /Users/agus/.cache/argilla_sdk_docs_db/lancedb.tar.gz


lancedb.tar.gz: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.91M/2.91M [00:00<00:00, 3.45MB/s]
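
One detail worth knowing about `upload_database`: a default such as `token: str = os.getenv("HF_API_TOKEN")` is evaluated once, when the function is defined, so setting the variable later in the same session does not change the default. A minimal sketch of resolving the token at call time instead; `resolve_token` is a hypothetical helper, and `HF_API_TOKEN` is simply the variable name used above:

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None, env_var: str = "HF_API_TOKEN") -> str:
    """Return the explicit token, else the environment variable, else fail early."""
    resolved = token or os.getenv(env_var)
    if not resolved:
        raise ValueError(f"No token given and the {env_var} variable is not set")
    return resolved


# Demonstration with a dummy value; never hard-code a real token.
os.environ["HF_API_TOKEN"] = "hf_dummy"
from_env = resolve_token()               # read from the environment at call time
from_argument = resolve_token("explicit")  # an explicit argument takes precedence

print(from_env, from_argument)
```

Failing early with a clear message can be easier to debug than an authentication error raised in the middle of an upload.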


### Download the database

And finally, let's download the database and check it works as expected:

In [4]:
import os
from pathlib import Path  # Needed for the `local_dir` default below

from huggingface_hub import hf_hub_download


def download_database(
    repo_id: str,
    lancedb_file: str = "lancedb.tar.gz",
    local_dir: Path = Path.home() / ".cache/argilla_sdk_docs_db",
    token: str = os.getenv("HF_API_TOKEN")
) -> Path:
    lancedb_download = Path(
        hf_hub_download(
            repo_id,
            lancedb_file,
            repo_type="dataset",
            token=token,
            local_dir=local_dir
        )
    )
    return untar_file(lancedb_download)
     

In [None]:
repo_id = "plaguss/argilla_sdk_docs_queries"  # Same repo the database was uploaded to
db_path = download_database(repo_id)

In [11]:
# In case there is any error with the new numpy upgrade:
#!pip install numpy==1.26.4

Connect again to the database and open the table:

In [11]:
import lancedb
from pathlib import Path
db = lancedb.connect(db_path)
table_name = "docs"
table = db.open_table(table_name)

In [19]:
query = "how can I delete users?"

retrieved = (
    table
    .search(query)
    .metric("cosine")
    .limit(1)
    .to_pydantic(Docs)
)
for d in retrieved:
    print("======\nQUERY\n======")
    print(d.query)
    print("======\nDOC\n======")
    print(d.text)

QUERY
Is it possible to remove a user from Argilla by utilizing the delete function on the User class?
DOC
Delete a user

You can delete an existing user from Argilla by calling the delete method on the User class.

```python
import argilla_sdk as rg

client = rg.Argilla(api_url="", api_key="")

user_to_delete = client.users('my_username')

deleted_user = user_to_delete.delete()
```
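
The retrieved chunks are what we would hand to the LLM as context. A minimal sketch of that step, assuming the list-of-dicts shape returned by `.to_list()` earlier. The `build_prompt` helper, the prompt wording, and the shortened example chunks are all illustrative assumptions, not part of the original pipeline:

```python
from typing import Dict, List


def build_prompt(question: str, retrieved: List[Dict[str, object]]) -> str:
    """Join the retrieved chunks into a context block followed by the question."""
    context = "\n\n---\n\n".join(str(row["text"]) for row in retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Shortened, made-up stand-ins for rows returned by the search.
retrieved = [
    {"text": "You can delete an existing user by calling the delete method on the User class.", "_distance": 0.19},
    {"text": "You can retrieve an existing user by passing the username to client.users.", "_distance": 0.20},
]

prompt = build_prompt("how can I delete users?", retrieved)
print(prompt)
```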


---