# Process and load data from the CSV into Qdrant

The loading process involves the following steps:
0. Retrieve the data and check that it meets the required level of cleanliness and quality.
1. Convert the original dataset to JSON. The data is semi-structured (it includes free-text book blurbs), so the text fields are converted to embeddings with a sentence transformer model.
2. Load a sentence transformer model such as MiniLM.
3. Create a collection in Qdrant.
4. Populate the collection with key-value pairs built from specific fields of the dataset. For example, with the ISBN as the key, we could store the embeddings computed from the title, the author's name, the publisher, or the blurb.
5. Persist these to Qdrant by instantiating a Qdrant client object and uploading to the collection.
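Step 4 mentions using the ISBN as the key. Qdrant point IDs must be unsigned integers or UUIDs, and an ISBN can carry leading zeros or an `X` check digit, so it does not always map cleanly to an integer. One possible approach, sketched here as an assumption rather than what this notebook does (the cells below use the row index as the ID), is to derive a deterministic UUID from the ISBN. The names `BOOK_NAMESPACE` and `isbn_to_point_id` are hypothetical.

```python
import uuid

# Hypothetical namespace for this example; any fixed UUID works as long as it never changes.
BOOK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "books/isbn")


def isbn_to_point_id(isbn: str) -> str:
    """Derive a stable UUID string from an ISBN (hyphens and spaces are ignored)."""
    normalized = isbn.replace("-", "").replace(" ", "").upper()
    return str(uuid.uuid5(BOOK_NAMESPACE, normalized))


# The same ISBN always yields the same ID, regardless of hyphenation.
print(isbn_to_point_id("0684823802"))
print(isbn_to_point_id("0-684-82380-2") == isbn_to_point_id("0684823802"))
```

With IDs derived this way, re-uploading a book would overwrite its existing point instead of depending on row order.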

Standard imports for data processing, plus the Qdrant client and Sentence Transformers:

In [67]:
import pandas as pd
import numpy as np
import json
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

Load the books dataset, which includes a blurb for each book:

In [94]:
df = pd.read_csv("../../data/books_with_blurbs.csv", header=0)

In [95]:
df.head()

Unnamed: 0,ISBN,Title,Author,Year,Publisher,Blurb
0,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,"Here, for the first time in paperback, is an o..."
1,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,"The fascinating, true story of the world's dea..."
2,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,Winnie and Helen have kept each others worst s...
3,425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,Historians and inquisitive laymen alike love t...
4,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,This highly praised first novel by fiction wri...


Now that we have checked the dataset, we export the first `corpus_size` rows as JSON so that they can be loaded into Qdrant.


In [150]:
corpus_size = 500

In [151]:
df[:corpus_size].to_json("data.json", orient="records")

In [152]:
with open("data.json") as f:
    data_json = json.load(f)
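Before encoding, it can help to confirm that every record has usable text in the field to be embedded. pandas writes missing values as `null` in JSON, which load back as `None`, and passing that to the encoder would fail. The helper below is a minimal sketch; `split_encodable` is a hypothetical name, the sample records are made up, and a numeric field would need casting to `str` first.

```python
def split_encodable(records, field):
    """Separate records whose `field` is a non-empty string from those that cannot be embedded."""
    usable, skipped = [], []
    for record in records:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            usable.append(record)
        else:
            skipped.append(record)
    return usable, skipped


sample = [
    {"ISBN": "0684823802", "Blurb": "Dr. Ransom is abducted..."},
    {"ISBN": "0000000000", "Blurb": None},   # made-up record: missing blurb
    {"ISBN": "1111111111", "Blurb": "   "},  # made-up record: blank blurb
]
usable, skipped = split_encodable(sample, "Blurb")
print(len(usable), "usable,", len(skipped), "skipped")
```

In the notebook this could be applied as `usable, skipped = split_encodable(data_json, "Blurb")` before uploading.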

Next we load the encoder and create a Qdrant client. The client can run as an in-memory vector store, but for now we persist the vectors to a Qdrant instance running locally in a Docker container, outside the memory of this application.
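The two deployment options mentioned above differ only in how the client is constructed. The sketch below makes that choice explicit. The mode names `"memory"` and `"docker"` and both helper functions are hypothetical; `location=":memory:"` and `url=...` are the `QdrantClient` arguments for a local in-process store and a remote server respectively.

```python
def qdrant_client_kwargs(mode: str) -> dict:
    """Return keyword arguments for QdrantClient for a given (hypothetical) mode name."""
    if mode == "memory":
        # Local in-process store; data is lost when the process exits.
        return {"location": ":memory:"}
    if mode == "docker":
        # Qdrant server on its default REST port, as used in this notebook.
        return {"url": "http://localhost:6333"}
    raise ValueError(f"unknown mode: {mode!r}")


def make_client(mode: str):
    # Imported lazily so the helper above can be used without qdrant-client installed.
    from qdrant_client import QdrantClient

    return QdrantClient(**qdrant_client_kwargs(mode))
```

For example, `QC = make_client("docker")` would match the client created below, while `make_client("memory")` is convenient for quick experiments without a running container.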

In [153]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

In [154]:
QC = QdrantClient("http://localhost:6333")

In [155]:
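def create_new_collection_and_upload_vectors(QC, data, shards, collection_name, collection_field, encoder):
    """Recreate `collection_name` and upload one vector per record, embedded from `collection_field`."""

    # Delete any existing collection with this name so we start from a clean state
    QC.delete_collection(collection_name=collection_name)

    # Create the collection again under the same name
    QC.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by the model used
            distance=models.Distance.COSINE,
        ),
        shard_number=shards,
    )

    # Embed the chosen field of each record and upload it with the full record as payload.
    # Depending on the qdrant-client version, upload_records/models.Record may be
    # deprecated in favour of upload_points/models.PointStruct.
    QC.upload_records(
        collection_name=collection_name,
        records=[
            models.Record(
                id=idx, vector=encoder.encode(record[collection_field]).tolist(), payload=record
            )
            for idx, record in enumerate(data)  # use the `data` argument, not the global `data_json`
        ],
    )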
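The function above encodes one record at a time. `SentenceTransformer.encode` also accepts a list of texts and returns one vector per text, so for larger corpora it may be faster to embed in batches. The sketch below shows only the batching logic, with the embedding function injected as a parameter. `iter_batches`, `build_points`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without a model.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_points(records, field, embed_batch, batch_size=64):
    """Return (id, vector, payload) tuples, embedding `field` one batch at a time."""
    points = []
    next_id = 0
    for batch in iter_batches(records, batch_size):
        vectors = embed_batch([record[field] for record in batch])
        for record, vector in zip(batch, vectors):
            points.append((next_id, list(vector), record))
            next_id += 1
    return points


# Stand-in embedder for illustration only: maps each text to a 1-dimensional vector.
def fake_embed(texts):
    return [[float(len(text))] for text in texts]


demo = [{"Title": "abc"}, {"Title": "de"}, {"Title": "f"}]
print(build_points(demo, "Title", fake_embed, batch_size=2))
```

With the real model, `embed_batch` could be `lambda texts: encoder.encode(texts).tolist()`, and each tuple would then be wrapped in the record type expected by the upload call.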

In [156]:
collections = {
    "Title": "Title",
    "ISBN": "ISBN",
    "Author": "Author",
    "Publisher": "Publisher",
    "Blurb": "Blurb"
}

In [157]:
for collection_name, collection_field in collections.items():
    create_new_collection_and_upload_vectors(QC=QC, 
                                             data=data_json,
                                             shards=8,
                                             collection_name=collection_name, 
                                             collection_field=collection_field, 
                                             encoder=encoder)
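
Each collection above is created with `models.Distance.COSINE`, so the score returned for a search hit reflects the cosine similarity between the query vector and the stored vector: values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate little similarity. As a rough illustration of the underlying calculation (not Qdrant's internal implementation), a pure-Python version looks like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Because only direction matters, scaling a vector does not change its score, which is why the third pair scores essentially the same as the first.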
    

In [162]:
hits = QC.search(
    collection_name="Blurb",
    query_vector=encoder.encode("natural satellites in the solar system").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'Author': 'Ray Bradbury', 'Blurb': 'Im Januar 1999 beginnt die Kolonisation des Planeten Mars. Dort wachsen goldene Früchte an kristallenen Wänden, doch das Leben auf dem Mars ist demjenigen auf der Erde gar nicht so unähnlich...', 'ISBN': '3257208634', 'Publisher': 'Diogenes Verlag', 'Title': 'Die Mars- Chroniken. Roman in ErzÃ?Â¤hlungen.', 'Year': 1981} score: 0.2238886
{'Author': 'C.S. Lewis', 'Blurb': 'Dr. Ransom is abducted by a megalomaniacal physicist and taken via space ship to the planet Malacandra (Mars). There, Dr. Ransom finds Malacandra similar to, and yet distinct from, Earth.', 'ISBN': '0684823802', 'Publisher': 'Scribner', 'Title': 'OUT OF THE SILENT PLANET', 'Year': 1996} score: 0.22164167
{'Author': 'Neal Barrett Jr.', 'Blurb': 'Babylon 5, designed to be a place of peace in a troubled universe, has erupted into rioting as visiting cultures clash and passions explode. Security chief Garibaldi must use all his skills to quell the violence between races. But the trouble