### Workflow

![workflow](https://qdrant.tech/docs/workflow-neural-search.png)

### Prepare dataset

In [2]:
!wget https://storage.googleapis.com/generall-shared-data/startups_demo.json

--2025-02-17 14:09:55--  https://storage.googleapis.com/generall-shared-data/startups_demo.json
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.197.155, 142.250.71.187, 142.250.197.251, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.197.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22205751 (21M) [application/json]
Saving to: 'startups_demo.json'


2025-02-17 14:10:01 (3.92 MB/s) - 'startups_demo.json' saved [22205751/22205751]



In [2]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

### Embedding model
This notebook uses the pre-trained model `all-MiniLM-L6-v2`, a compact sentence-embedding model optimized for speed.

In [4]:
model = SentenceTransformer(
    "all-MiniLM-L6-v2", device="cpu"
)  # or device="cuda" if you have a GPU



In [5]:
# Load the data
df = pd.read_json("./startups_demo.json", lines=True)

In [6]:
df

| | name | images | alt | description | link | city |
|---|---|---|---|---|---|---|
| 0 | SaferCodes | https://safer.codes/img/brand/logo-icon.png | SaferCodes Logo QR codes generator system form... | QR codes systems for COVID-19.\nSimple tools f... | https://safer.codes | Chicago |
| 1 | Human Practice | https://d1qb2nb5cznatu.cloudfront.net/startups... | Human Practice - health care information tech... | Point-of-care word of mouth\nPreferral is a mo... | http://humanpractice.com | Chicago |
| 2 | StyleSeek | https://d1qb2nb5cznatu.cloudfront.net/startups... | StyleSeek - e-commerce fashion mass customiza... | Personalized e-commerce for lifestyle products... | http://styleseek.com | Chicago |
| 3 | Scout | https://d1qb2nb5cznatu.cloudfront.net/startups... | Scout - security consumer electronics interne... | Hassle-free Home Security\nScout is a self-ins... | http://www.scoutalarm.com | Chicago |
| 4 | Invitation codes | https://invitation.codes/img/inv-brand-fb3.png | Invitation App - Share referral codes community | The referral community\nInvitation App is a so... | https://invitation.codes | Chicago |
| ... | ... | ... | ... | ... | ... | ... |
| 40469 | Drunken Moose | https://d1qb2nb5cznatu.cloudfront.net/startups... | Drunken Moose - digital media advertising des... | Branding and Marketing Consultancy Agency\nHel... | http://www.drunkenmoose.com.au | Sydney |
| 40470 | AA Adonis Rubbish Removals | https://d1qb2nb5cznatu.cloudfront.net/startups... | AA Adonis Rubbish Removals - cleaning | Rubbish Removals Sydney\nAA Adonis Rubbish Rem... | http://www.aaadonisrubbishremovals.com.au/ | Sydney |
| 40471 | QualityTrade | https://d1qb2nb5cznatu.cloudfront.net/startups... | QualityTrade - B2B | Merit based wholesale trade platform. \nQualit... | https://www.qualitytrade.com/ | Sydney |
| 40472 | The Myer Family Company | https://d1qb2nb5cznatu.cloudfront.net/startups... | The Myer Family Company - | MFCo is a family office specialising in design... | http://www.mfco.com.au/ | Sydney |


Encode each startup's `alt` text and description, joined into one string, to create one embedding vector per startup. Internally, `encode` splits the input into batches, which significantly speeds up the process.

In [7]:
vectors = model.encode(
    [row.alt + ". " + row.description for row in df.itertuples()],
    show_progress_bar=True,
)

Batches:   0%|          | 0/1265 [00:00<?, ?it/s]
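
The progress-bar total of 1265 follows from the batch size. A quick check, assuming `encode` ran with its default `batch_size=32` (the call above does not set it explicitly):

```python
import math

num_texts = 40474   # number of rows in the dataset
batch_size = 32     # assumption: default batch_size of SentenceTransformer.encode

# Every batch is full except possibly the last one.
num_batches = math.ceil(num_texts / batch_size)
last_batch = num_texts - (num_batches - 1) * batch_size

print(num_batches, last_batch)
```

Passing a larger `batch_size` to `encode` reduces the number of batches, at the cost of more memory per batch.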

In [8]:
vectors.shape

(40474, 384)

There are 40474 vectors, each with 384 dimensions. This matches the size of the model's output layer.

### Save the embedding vectors

In [9]:
np.save("startup_vectors.npy", vectors, allow_pickle=False)
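
Before relying on the saved file, it can help to confirm that it round-trips. The sketch below uses a small toy array and a temporary directory instead of the real embeddings; the file name `toy_vectors.npy` is made up for illustration. It also shows `mmap_mode="r"`, which lets `np.load` read from disk on demand instead of loading the whole array into RAM.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the real (40474, 384) embedding matrix.
toy_vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.npy")  # hypothetical file name
    np.save(path, toy_vectors, allow_pickle=False)

    # Regular load: the whole array is read into memory.
    loaded = np.load(path)
    same_values = bool(np.array_equal(loaded, toy_vectors))

    # Memory-mapped load: rows are read from disk when accessed.
    mapped = np.load(path, mmap_mode="r")
    same_shape = mapped.shape == toy_vectors.shape
    first_row = np.array(mapped[0])  # copy one row out of the memmap
    del mapped  # release the file before the directory is removed
```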

### Upload to Qdrant

In [3]:
# Import client library
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient("http://localhost:6333")

#### Create a collection

In [11]:
if not client.collection_exists("startups"):
    client.create_collection(
        collection_name="startups",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
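
`Distance.COSINE` selects cosine similarity as the metric, which depends on the direction of two vectors and not on their length. A minimal pure-Python sketch of the metric on toy vectors follows; the helper `cosine_similarity` is illustrative and not part of `qdrant_client`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer, but points the same way
orthogonal = [0.0, 2.0]      # at a right angle to the query

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)
print(sim_same, sim_orth)
```

The vector pointing in the same direction scores 1 regardless of its length, while the orthogonal one scores 0.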

#### Create an iterator over the startup data and vectors

In [12]:
fd = open("./startups_demo.json")

# payload is now an iterator over startup data
payload = map(json.loads, fd)

# Load all vectors into memory; a NumPy array is itself iterable.
# Alternatively, memory-map the file (np.load with mmap_mode) to avoid loading all data into RAM
vectors = np.load("./startup_vectors.npy")
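
The `map(json.loads, fd)` pattern is lazy: each line is parsed only when the iterator is advanced, and the iterator can be consumed only once. A small sketch with an in-memory stand-in for the file (the two records are made-up values, not rows from the dataset):

```python
import io
import json

# Stand-in for the opened file: two JSON Lines records.
fake_file = io.StringIO(
    '{"name": "Alpha", "city": "Chicago"}\n'
    '{"name": "Beta", "city": "Sydney"}\n'
)

payload = map(json.loads, fake_file)  # nothing is parsed yet

first = next(payload)     # parses only the first line
rest = list(payload)      # parses the remaining lines
leftover = list(payload)  # the iterator is now exhausted
```

Because the iterator is one-shot, re-running the upload requires recreating `payload` first.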

#### Upload

In [13]:
client.upload_collection(
    collection_name="startups",
    vectors=vectors,
    payload=payload,
    ids=None,  # Vector ids will be assigned automatically
    batch_size=256,  # How many vectors will be uploaded in a single request?
)
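
Conceptually, the upload pairs the i-th vector with the i-th payload and sends the pairs in chunks of `batch_size`. The sketch below illustrates that idea only; `iter_batches` is a hypothetical helper, not how `qdrant_client` is implemented.

```python
from itertools import islice


def iter_batches(vectors, payloads, batch_size):
    """Yield lists of (vector, payload) pairs, each at most batch_size long."""
    pairs = zip(vectors, payloads)  # i-th vector goes with i-th payload
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_payloads = iter([{"name": "A"}, {"name": "B"}, {"name": "C"}])

batches = list(iter_batches(toy_vectors, toy_payloads, batch_size=2))
sizes = [len(batch) for batch in batches]
print(sizes)
```

By the same arithmetic, 40474 vectors with `batch_size=256` would be split into 159 batches, the last one holding 26 vectors.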