# How to load a Hugging Face dataset into Qdrant?

> Loading a Hugging Face dataset into Qdrant is easy. This post shows how to do it.


In [39]:
%pip install datasets qdrant-client toolz tqdm rich --quiet


## Loading our dataset

For this post we'll use the [Cohere/wikipedia-22-12-simple-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings) dataset, which already includes precomputed embeddings. It is part of a Cohere release of embeddings for millions of Wikipedia articles; the subset used here covers Simple English Wikipedia. See this [post](https://txt.cohere.com/embedding-archives-wikipedia/) for more details.

We'll use the [Hugging Face datasets library](https://huggingface.co/docs/datasets/index.html) to load the dataset.

In [2]:
from datasets import load_dataset

dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")



Let's take a quick look at the dataset.

In [3]:
dataset

Dataset({
    features: ['id', 'title', 'text', 'url', 'wiki_id', 'views', 'paragraph_id', 'langs', 'emb'],
    num_rows: 485859
})

We can see the dataset has an `emb` column which contains the embedding for each row. Alongside this we have the `title` and `text` of each article plus some other metadata.

Let's also take a quick look at the features of the dataset. Hugging Face `Dataset` objects have a `features` attribute which describes the type of each column. We can see that the `emb` column is a `Sequence` of `float32` values, while the other columns hold `string`, `int32` and `float32` values.


In [6]:
dataset.features

{'id': Value(dtype='int32', id=None),
 'title': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None),
 'wiki_id': Value(dtype='int32', id=None),
 'views': Value(dtype='float32', id=None),
 'paragraph_id': Value(dtype='int32', id=None),
 'langs': Value(dtype='int32', id=None),
 'emb': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None)}

Qdrant payloads [support](https://qdrant.tech/documentation/concepts/payload/) a pretty varied range of types. All of the types in our dataset are supported by Qdrant, so we don't need to do any conversion.
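
Since Qdrant payloads are JSON-like, one quick sanity check before loading a different dataset is to see whether a row's non-vector fields survive `json.dumps`. The snippet below is only a rough sketch: `is_json_compatible` is a hypothetical helper and the example payload is made up to mirror the columns above.

```python
import json


def is_json_compatible(payload):
    """Return True if the payload dict can be serialised to JSON."""
    try:
        json.dumps(payload)
    except (TypeError, ValueError):
        return False
    return True


# Made-up payload with the same kinds of values as our dataset (strings, ints, floats).
example_payload = {
    "id": 0,
    "title": "Example title",
    "text": "Example paragraph text.",
    "views": 1234.5,
    "paragraph_id": 0,
}

print(is_json_compatible(example_payload))
# A Python set has no JSON equivalent, so a payload like this would need converting first.
print(is_json_compatible({"tags": {"a", "b"}}))
```

If a check like this fails for your own dataset, converting the offending column (for example a set to a list) before building the payload is usually enough.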



## Creating a Qdrant collection

We'll use the [Qdrant Python client](https://github.com/qdrant/qdrant-client) for this post. This client is really nice since it lets you create a local collection using pure Python, i.e. there's no need to run a Qdrant server. This is great for testing and development. Once you're ready to deploy your collection, you can use the same client to connect to a remote Qdrant server.

In [7]:
from qdrant_client import QdrantClient

We first create a client, in this case using a local path for our DB. 

In [8]:
client = QdrantClient(path="db")  # Persists changes to disk

### Configuring our Qdrant collection

Qdrant is very flexible, but we need to let it know a few things about our collection. These include the name and a config for the vectors we want to store. This config includes the dimensionality of the vectors and the distance metric we want to use. Let's first check the dimensionality of our vectors.

In [10]:
vector_size = len(dataset[0]['emb'])
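
The `length=-1` in the features output means the schema itself doesn't declare a fixed embedding length, so reading the size from the first row assumes every row matches it. A minimal sketch of how one might spot-check that assumption is below; `find_mismatched_rows` is a hypothetical helper and the toy rows are made up.

```python
def find_mismatched_rows(rows, expected_size, vector_key="emb"):
    """Return the positions of rows whose vector length differs from expected_size."""
    return [
        position
        for position, row in enumerate(rows)
        if len(row[vector_key]) != expected_size
    ]


# Toy rows standing in for dataset rows; the last one is deliberately too short.
toy_rows = [
    {"emb": [0.1, 0.2, 0.3]},
    {"emb": [0.4, 0.5, 0.6]},
    {"emb": [0.7, 0.8]},
]
toy_size = len(toy_rows[0]["emb"])

print(find_mismatched_rows(toy_rows, toy_size))
```

With the real dataset, passing something like `dataset.select(range(1000))` as `rows` should work in the same way, since iterating a dataset yields one dictionary per row.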

We'll also store our collection name in a variable so we can use it later.

In [11]:
collection_name = "cohere_wikipedia"

In [14]:
from qdrant_client.models import Distance, VectorParams

client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
)

True

## Adding our data to Qdrant

**Note:** this code can be heavily optimized, but it gives an idea of how easy adding data to Qdrant can be. For many datasets this naive approach will work fine.

The approach we'll take below is to loop through our dataset and yield each row as a `PointStruct`. This is a Qdrant object that contains the vector and any other data, referred to as the payload, that we want to store. 

In [16]:
from qdrant_client.models import PointStruct

In [17]:
def yield_rows(dataset):
    for idx, row in enumerate(dataset, start=1):
        vector = row["emb"] # grab the vector
        payload = {k: v for k, v in row.items() if k != "emb"} # grab the rest of the fields without the vector
        yield PointStruct(id=idx, vector=vector, payload=payload)
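
`yield_rows` numbers the points with `enumerate`, which is fine here. Qdrant point IDs can be unsigned integers or UUIDs, so if you'd rather have IDs that stay the same no matter what order the rows arrive in, one option is to derive a UUID from a stable key. The sketch below assumes that `url` plus `paragraph_id` uniquely identifies a row, which this post doesn't verify; `stable_point_id` is a hypothetical helper and the example rows are made up.

```python
import uuid


def stable_point_id(row):
    """Derive a deterministic UUID string from fields assumed to identify the row."""
    key = f"{row['url']}#{row['paragraph_id']}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Made-up rows using the same column names as the dataset.
row_a = {"url": "https://example.org/wiki/Some_page", "paragraph_id": 0}
row_b = {"url": "https://example.org/wiki/Some_page", "paragraph_id": 1}

print(stable_point_id(row_a))
print(stable_point_id(row_a) == stable_point_id(row_a))  # same row, same ID
print(stable_point_id(row_a) == stable_point_id(row_b))  # different paragraph, different ID
```

The resulting string could then be passed as the `id` when building each `PointStruct` instead of `idx`.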

For this post we'll use a smallish subset of the dataset: the first 100,000 rows. That's big enough to be interesting but small enough to play around with quickly.

In [22]:
sample = dataset.select(range(100_000))

We'll use the `toolz` library's `partition_all` function to get batches from our `yield_rows` generator. We'll use `tqdm` to show a progress bar.

In [19]:
from toolz import partition_all
from tqdm.auto import tqdm

In [23]:
%%time
bs = 100
for batch in tqdm(partition_all(bs, yield_rows(sample)), total=len(sample) // bs):
    client.upsert(collection_name=collection_name, points=list(batch), wait=False)

  0%|          | 0/1000 [00:00<?, ?it/s]

CPU times: user 30.9 s, sys: 35.7 s, total: 1min 6s
Wall time: 1min 19s


On my 2021 MacBook Pro with an M1 chip this takes a bit over a minute to run (1 min 19 s in the run above). As mentioned above, this can be heavily optimized, but it gives an idea of how easy it is to add data to Qdrant from a Hugging Face dataset.
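
The batching step doesn't strictly need `toolz`. As a rough sketch, the same behaviour can be had from the standard library with `itertools.islice`. The `batched` and `num_batches` helpers below are hypothetical names, not part of any library used in this post. Rounding the batch count up with `math.ceil` also keeps the progress bar total correct when the number of rows isn't an exact multiple of the batch size.

```python
import math
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def num_batches(n_items, batch_size):
    """Number of batches needed, counting a final partial batch."""
    return math.ceil(n_items / batch_size)


batches = list(batched(range(10), 4))
print(batches)             # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(num_batches(10, 4))  # 3
```

In the loop above, `batched(yield_rows(sample), bs)` could stand in for `partition_all(bs, yield_rows(sample))`, with `num_batches(len(sample), bs)` as the `total` for `tqdm`.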

## Searching our Qdrant collection

What can we do with our Qdrant collection? We can use our embeddings to find similar Wikipedia articles. Let's see how we can do that.

First we'll use the `get_collection` method to see some information about our collection.

In [26]:
from rich import print

print(client.get_collection(collection_name))

We can see a bunch of information about our collection, including the vector count, the dimensionality of the vectors and the distance metric we're using. You'll see that there are plenty of knobs to turn here to optimize your collection, but that's for another post.

We can use the `scroll` method to get the first point from our collection.

In [29]:
print(client.scroll(collection_name,limit=1)[0][0])

We can also grab items from the payload for each point. 

In [31]:
print(client.scroll(collection_name, limit=1)[0][0].payload['text'])

We can see this article is about the 24-hour clock system. Let's see what other pages are similar to this one. By default `scroll` doesn't return vectors, so we pass `with_vectors=True` to get the vector for this point.

In [32]:
vector = client.scroll(collection_name, limit=1, with_vectors=True)[0][0].vector

We can use our vector as a query to find similar vectors in our collection. We'll use the `search` method to do this. 

In [34]:
query_vector = client.scroll(collection_name, limit=1, with_vectors=True)[0][0].vector
hits = client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=15,  # Return the 15 closest points
)

Let's look at some of the results. We can see that the first result is the same article we used as the query. The rest also seem to be about time and 24-hour clock systems!

In [36]:
for hit in hits:
    print(f"{hit.payload['title']} | {hit.payload['text']}")
    print("---")
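
To build some intuition for what a cosine-similarity search returns, here is a tiny brute-force version in plain Python. It is only an illustration: Qdrant relies on an index rather than scanning every point, and `cosine_similarity`, `brute_force_search` and the toy points are all made up for this sketch.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, points, limit=3):
    """Score every (title, vector) pair against the query and keep the top `limit`."""
    scored = ((cosine_similarity(query, vector), title) for title, vector in points)
    return heapq.nlargest(limit, scored)


# Toy 2-dimensional "embeddings"; real ones have many more dimensions.
toy_points = [
    ("A", [1.0, 0.0]),
    ("B", [0.9, 0.1]),
    ("C", [0.0, 1.0]),
    ("D", [-1.0, 0.0]),
]

results = brute_force_search([1.0, 0.0], toy_points, limit=2)
for score, title in results:
    print(f"{title} | {score:.3f}")
```

Just as in the search above, a point queried with its own vector comes back first with a similarity of 1, followed by the points pointing in the most similar direction.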

## Conclusion

This post showed how it's possible to easily convert a Hugging Face dataset into a Qdrant collection. We then showed how we can use this collection to find similar articles.

There is a lot of scope for optimization here. For example, we could use a more efficient way to add data to Qdrant. We could also use a more efficient way to search our collection. It would be very cool to directly have a `from_hf_datasets` method in the Qdrant Python client that would do all of this for us and include some optimizations! 

I hope this post has shown how easy it is to use Qdrant with Hugging Face datasets. If you have any questions or comments please let me know on [Twitter](https://twitter.com/vanstriendaniel).