# Load data into the local vector database

This notebook expects a set of Markdown documents in the `corpus` folder.

I used the Apache 2.0 licensed repository https://github.com/simonw/til.

To download a copy, run the following command:

```bash
git clone https://github.com/simonw/til.git corpus
```

In [15]:
import chromadb
import os

In [2]:
client = chromadb.PersistentClient(path="db/")
collection_name = "Corpus"
device = "cuda"
corpus_dir = "corpus"

In [3]:
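# Delete only if this collection exists; list_collections() returns names or
# Collection objects depending on the chromadb version.
existing = [getattr(c, "name", c) for c in client.list_collections()]
if collection_name in existing:
    print("Removing collection")
    client.delete_collection(name=collection_name)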

Removing collection


In [4]:
collection = client.create_collection(name=collection_name)

In [5]:
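def add_file_to_collection(full_path, file_id):
    """Read one file and store it as a single document in the collection."""
    # Explicit encoding avoids depending on the platform default.
    with open(full_path, 'rt', encoding='utf-8') as f:
        doc = f.read()
    # Zero-pad the numeric ID to nine digits, e.g. 398 -> "000000398".
    collection.add(documents=[doc], metadatas=[{"source": full_path}], ids=[f"{file_id:09}"])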

In [12]:
class FileIdGenerator:
    """Hands out sequential integer IDs, starting at 1."""

    def __init__(self):
        self.file_id = 0

    def get_id(self):
        self.file_id += 1
        return self.file_id

def recurse_directory(directory, file_id_gen):
    """Add every .md file under `directory` (searched recursively) to the collection."""
    for file_name in os.listdir(directory):
        full_path = os.path.join(directory, file_name)
        if os.path.isfile(full_path):
            if file_name.lower().endswith('.md'):
                add_file_to_collection(full_path, file_id_gen.get_id())
        else:
            recurse_directory(full_path, file_id_gen)

recurse_directory(corpus_dir, FileIdGenerator())
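
The recursive walk above relies on `os.listdir`, which returns entries in arbitrary order, so the ID a file receives can differ between runs. As a minimal sketch of an alternative, assuming a stable order is wanted, the same traversal can be written with `os.walk` and sorted names. The helper name `iter_markdown_files` is hypothetical and not part of the original notebook, and the visiting order differs from `recurse_directory` (each directory's files come before its subdirectories).

```python
import os


def iter_markdown_files(root):
    """Yield (path, file_id) for every .md file under root, in a stable order.

    Hypothetical helper: IDs are sequential integers starting at 1.
    """
    next_id = 1
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place makes os.walk descend in sorted order.
        dirnames.sort()
        for file_name in sorted(filenames):
            if file_name.lower().endswith(".md"):
                yield os.path.join(dirpath, file_name), next_id
                next_id += 1
```

Each pair could then be passed to the existing function, for example `for path, file_id in iter_markdown_files(corpus_dir): add_file_to_collection(path, file_id)`, which keeps the nine-digit zero-padded ID format unchanged.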

In [14]:
collection.query(
    query_texts=["How do I serve traffic to a subdomain?"],
    n_results=2
)

{'ids': [['000000398', '000000030']],
 'distances': [[1.1146669387817383, 1.115147590637207]],
 'metadatas': [[{'source': 'corpus/fly/custom-subdomain-fly.md'},
   {'source': 'corpus/azure/all-traffic-to-subdomain.md'}]],
 'embeddings': None,
 'documents': [['# Assigning a custom subdomain to a Fly app\n\nI deployed an app to [Fly](https://fly.io/) and decided to point a custom subdomain to it.\n\nMy fly app is https://datasette-apache-proxy-demo.fly.dev/\n\nI wanted the URL to be https://datasette-apache-proxy-demo.datasette.io/ (see [issue #1524](https://github.com/simonw/datasette/issues/1524)).\n\nRelevant documentation: [SSL for Custom Domains](https://fly.io/docs/app-guides/custom-domains-with-fly/).\n\n## Assign a CNAME\n\nFirst step was to add a CNAME to my `datasette.io` domain.\n\nI pointed `CNAME` of `datasette-apache-proxy-demo.datasette.io` at `datasette-apache-proxy-demo.fly.dev.` using Vercel DNS:\n\n<img width="586" alt="image" src="https://user-images.githubusercontent