https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/qdrant/Getting_started_with_Qdrant_and_OpenAI.ipynb

MIT License

Copyright (c) 2025 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# 使用 Qdrant 作為 OpenAI 向量嵌入的向量資料庫

本筆記本將逐步引導你如何使用 **Qdrant** 作為 **OpenAI 向量嵌入（embeddings）** 的向量資料庫。
Qdrant 是一個高效能的向量搜尋資料庫，使用 **Rust** 編寫，並提供 **RESTful** 與 **gRPC API** 來管理你的嵌入資料。
官方也提供了 **Python 的 `qdrant-client` 套件**，可輕鬆整合至你的應用程式中。

---

## 📘 本筆記本流程簡介

本筆記本展示了一個完整的端對端流程，包括：

1. 使用 **OpenAI API** 預先計算好的向量嵌入。
2. 將這些嵌入資料儲存至本機的 **Qdrant** 實例中。
3. 使用 **OpenAI API** 將原始文字查詢轉換為嵌入。
4. 使用 **Qdrant** 在建立好的 collection 中執行 **最近鄰搜尋（nearest neighbor search）**。

---

## 🔍 Qdrant 是什麼？

**Qdrant** 是一個開源的向量資料庫，可用來儲存 **神經網路嵌入（neural embeddings）** 以及其對應的 **元資料（payload）**。

* Payload 不僅可以儲存每筆向量的額外屬性，也可用於 **查詢過濾（filtering）**。
* Qdrant 提供的 **內建過濾機制** 整合於向量搜尋階段，讓搜尋效能更佳。

本機端的 Qdrant 使用 docker compose 執行。

---

## 🚀 部署選項

你可以根據應用程式的負載需求選擇以下幾種部署方式：

* 本機或內部部署：使用 **Docker 容器**
* **Kubernetes 叢集**：透過 **Helm chart**
* 使用 **Qdrant Cloud 雲端服務**

---

## 🔗 整合方式

Qdrant 提供 **RESTful API** 和 **gRPC API**，使得無論使用哪種程式語言都能輕鬆整合。

* 官方提供多種語言的客戶端（Client SDK）
* 若你使用 **Python**，建議使用官方的 [`qdrant-client`](https://github.com/qdrant/qdrant-client) 套件，最為方便。

---

## 🚀 啟動 qdrant 伺服器

根據本次 workshop 的內容，每位學員應取得一台可以 ssh 遠端登入操作的機器。並在啟動 jupyter notebook 時，同時啟動 qdrant 伺服器。

我們可以透過執行一個簡單的 curl 指令來驗證伺服器是否成功啟動：

In [2]:
! curl http://qdrant:6333

{"title":"qdrant - vector search engine","version":"1.14.0","commit":"3617a0111fc8590c4adcc6e88882b63ca4dda9e7"}

### 📦 安裝必要套件

本筆記本需要安裝 `openai` 和 `qdrant-client` 這兩個主要套件，此外還會用到一些其他輔助的函式庫。
可以透過以下指令一次安裝所有所需套件：

```bash
pip install openai qdrant-client tqdm datasets
```

#### 套件說明：

* **`openai`**：用於與 OpenAI API 溝通，產生文字嵌入（embeddings）。
* **`qdrant-client`**：官方的 Python 客戶端，用來與 Qdrant 向量資料庫互動。
* **`tqdm`**：提供進度條，方便追蹤資料處理過程。
* **`datasets`**：來自 Hugging Face，用於下載與載入示範資料集。

安裝完成後，就可以開始進行後續的操作了。


In [3]:
! pip install openai qdrant-client pandas wget kagglehub tqdm tenacity



### 🔐 準備你的 OpenAI API 金鑰

OpenAI API 金鑰會用來將文件與查詢轉換為向量（vectorization）。

如果你還沒有 OpenAI API 金鑰，可以從這裡申請：
👉 [https://platform.openai.com/settings/organization/api-keys](https://platform.openai.com/settings/organization/api-keys)

---

好的！若你是使用 **Azure OpenAI** 而不是 OpenAI 官方 API，則需使用以下兩個環境變數來設定憑證資訊：

* `AZURE_OPENAI_API_KEY`：你的 Azure OpenAI 金鑰
* `AZURE_OPENAI_ENDPOINT`：你的 Azure OpenAI 端點網址（例如 `https://<your-resource-name>.openai.azure.com/`）

---

### 🔐 設定 Azure OpenAI 的 API 金鑰與端點

```python
os.environ["AZURE_OPENAI_API_KEY"]="" # 填上 api key
os.environ["AZURE_OPENAI_ENDPOINT"]="https://chechia-workshop.openai.azure.com/"
os.environ["OPENAI_API_VERSION"]="2024-12-01-preview" # 替換成你的 API 版本
os.environ["OPENAI_MODEL"]="text-embedding-3-large"
```


In [42]:
# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note. alternatively you can set a temporary env variable like this:
#os.environ["OPENAI_API_KEY"] = ""

os.environ["AZURE_OPENAI_API_KEY"]=""
os.environ["AZURE_OPENAI_ENDPOINT"]="https://chechia-workshop.openai.azure.com/"
os.environ["OPENAI_API_VERSION"]="2024-12-01-preview"
os.environ["OPENAI_MODEL"]="text-embedding-3-large"
#os.environ["OPENAI_MODEL"]="text-embedding-3-small"

if os.getenv("AZURE_OPENAI_API_KEY") is not None:
    print("AZURE_OPENAI_API_KEY is ready")
else:
    print("AZURE_OPENAI_API_KEY environment variable not found")

AZURE_OPENAI_API_KEY is ready


### 🔗 連接到 Qdrant

使用官方的 Python 套件 `qdrant-client`，可以輕鬆連接到正在執行的 Qdrant 伺服器實例。

---

#### ✅ 本機端 Qdrant 實例（例如透過 Docker 啟動的）

本機端的 Qdrant 使用 docker compose 執行。從 notebook 可透過 docker network 的 dns 存取 Qdrant。

```python
from qdrant_client import QdrantClient

client = QdrantClient(host="qdrant", port=6333)
```

---

### 🧪 驗證連線是否成功

你可以執行以下指令測試連線，列出目前所有的 collections：

```python
client.get_collections()
```

若成功連線，會回傳現有 collections 的 JSON 資料，目前是空的；若連線失敗，請檢查：

* Qdrant 伺服器是否有啟動
* URL/host 是否正確


### 🗑 移除 Qdrant 中的 Collection

若你想從 Qdrant 中**刪除某個 collection**（例如清除測試資料或重新初始化資料庫），可以使用官方 `qdrant-client` 提供的 `delete_collection()` 方法。

---

### ✅ Python 程式碼範例：

```python
# 刪除該 collection
client.delete_collection(collection_name=collection_name)
```

---

### ⚠ 注意事項

* 刪除 collection 是**不可逆**的操作，資料會永久移除。
* 建議先用 `client.get_collections()` 確認要刪除的 collection 是否存在。
* 若 collection 不存在，`delete_collection()` 也不會報錯，它會靜默略過。

---

如果你想一次刪除多個 collections，也可以用迴圈處理。我可以幫你補上範例。需要嗎？



In [23]:
import qdrant_client

client = qdrant_client.QdrantClient(
    host="qdrant",
    prefer_grpc=True,
)

client.get_collections()

CollectionsResponse(collections=[])

### 資料集

我們使用 Kaggle 上的資料集作為我們第一個 RAG 資料。

當你開啟這個 Kaggle 連結：
👉 [https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv](https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv)

你會看到這個資料集是來自 **[COVID-QA](https://github.com/deepset-ai/COVID-QA)** 專案的一部分，由使用者 `xhlulu` 上傳到 Kaggle，主要用來訓練與評估問答系統（Question Answering, QA）模型。

---

### 📁 資料集簡介：`covidqa`

這是一組基於 **COVID-19 文獻資料** 建立的問答資料集，目的是幫助開發者訓練自然語言理解（NLU）與問答系統，以回應與疫情相關的問題。

---

### 📄 檔案：`community.csv` 是什麼？

這個 `community.csv` 是該資料集中的一個檔案，根據名稱與用途，它大致包含從 COVID-19 社群討論（例如研究社群、論壇、問答平台等）中擷取出來的 **問答對（Question-Answer Pairs）**。

---

### 🔍 用途

* 用於訓練問答模型（如：BERT、GPT + 向量資料庫）。
* 適合建立 Retrieval-Augmented Generation (RAG) 系統。
* 可搭配向量嵌入技術（如 OpenAI Embedding + Qdrant）建立問答應用。

---

### 📌 延伸應用

你可以使用這份資料來：

* 建立 COVID-19 主題的 chatbot。
* 建構 QA pipeline，例如使用：

  * OpenAI Embedding 建立向量
  * 儲存至 Qdrant
  * 使用 LLM 查詢並生成回答

---

如果你需要這份 CSV 的欄位說明、如何處理這些資料、或是如何將它嵌入並儲存到 Qdrant，我可以一步一步幫你操作。需要的話只要說一聲！


In [14]:
import kagglehub

# Download a single file.
# https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv
kagglehub.dataset_download('xhlulu/covidqa', path='community.csv')

'/home/jovyan/.cache/kagglehub/datasets/xhlulu/covidqa/versions/6/community.csv'

### 📥 載入資料

在這一節，我們將載入預先處理好的資料，**避免你用自己的 OpenAI API 金額重新計算文章的向量嵌入（embeddings）**。這讓你可以直接進行 Qdrant 儲存與查詢的實驗。

---

### ✅ 為什麼要用預處理資料？

* 計算文字嵌入會消耗 OpenAI 的使用額度（credits）。
* 若資料量大（例如整份 Wikipedia 的段落），嵌入計算時間與成本都不小。
* 使用事先嵌入好的資料，可以快速進行向量資料庫操作的示範與測試。

---

### 🧾 資料欄位範例：

資料集中有許多欄位，我們今天會使用到的欄位可能包括：

* `title`：問題文字，通常是自然語言表達的疑問句。
* `answer`：根據 `context` 找出的簡短回答。
* `question_id`：每筆資料的唯一識別碼。

In [10]:
import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/home/jovyan/.cache/kagglehub/datasets/xhlulu/covidqa/versions/6/community.csv')
# Read vectors from strings back into a list
#article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
#article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source
0,14057,Can pets catch the cold?,Last night I was drying my cat with a towel af...,14083,Yes they can. The viruses that cause a cold in...,Accepted,"That is a Priapulid worm, also known as a ""pen...",Random,biology.stackexchange.com,biomedical
1,89709,Is the Common Cold an Immune Overreaction?,It's my understanding that the majority of sym...,89712,Can someone die of the common cold?\n\nNo. \nT...,Accepted,"The dash (""-"") does not represent a negative c...",Random,biology.stackexchange.com,biomedical
2,89886,Air purifier agains bacteria and viruses?,We would buy a mobile air purifier in our home...,89887,The aforementioned filter will filter microbes...,Accepted,"It's a bleu ray gelyfish, don't tauch is becau...",Random,biology.stackexchange.com,biomedical
3,89929,Why are bats the source of dangerous coronavir...,Why do coronaviruses come from bats?\n\nI mean...,89944,\n The preponderance of links between bat and...,Accepted,"First of, depending on your definition of life...",Random,biology.stackexchange.com,biomedical
4,89938,How do bats survive their own coronaviruses?,How do bats survive their own coronaviruses (w...,89975,It's common for the reservoir host of a zoonot...,Accepted,"I think that ""career in synthetic biology"" and...",Random,biology.stackexchange.com,biomedical


In [36]:
article_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 642 entries, 0 to 641
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   question_id        642 non-null    int64 
 1   title              642 non-null    object
 2   question           642 non-null    object
 3   answer_id          642 non-null    int64 
 4   answer             642 non-null    object
 5   answer_type        642 non-null    object
 6   wrong_answer       642 non-null    object
 7   wrong_answer_type  642 non-null    object
 8   url                642 non-null    object
 9   source             642 non-null    object
 10  title_vector       642 non-null    object
dtypes: int64(2), object(9)
memory usage: 55.3+ KB


In [35]:
article_df.describe(include="all")

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source,title_vector
count,642.0,642,642,642.0,642,642,642,642,642,642,642
unique,,642,642,,642,2,640,2,26,3,642
top,,Can pets catch the cold?,Last night I was drying my cat with a towel af...,,Yes they can. The viruses that cause a cold in...,Accepted,"Basically, the signal transduction pathway of ...",Random,travel.stackexchange.com,general,"[-0.07070205360651016, 0.008015173487365246, -..."
freq,,1,1,,1,335,2,544,119,300,1
mean,101172.771028,,,101244.221184,,,,,,,
std,99158.636971,,,99156.996497,,,,,,,
min,6235.0,,,7195.0,,,,,,,
25%,35909.25,,,35923.25,,,,,,,
50%,71596.5,,,71608.5,,,,,,,
75%,147239.5,,,147281.0,,,,,,,


In [20]:
#from openai import OpenAI
#openai_client = OpenAI()

from openai import AzureOpenAI
openai_client = AzureOpenAI()

from tqdm import tqdm
from tenacity import retry, wait_random_exponential, stop_after_attempt

@retry(
    wait=wait_random_exponential(min=1, max=60),  # backoff 等待時間
    stop=stop_after_attempt(6),  # 最多重試 6 次
)
def embedding(input: str) -> str:
    response = openai_client.embeddings.create(
        input = input,
        model= "text-embedding-3-large"
    )
    return response.data[0].embedding

tqdm.pandas(desc="Generating embeddings")
article_df['title_vector'] = article_df['title'].progress_apply(embedding)

Generating embeddings: 100%|██████████| 642/642 [03:14<00:00,  3.30it/s]


In [44]:
article_df['answer_vector'] = article_df['answer'].progress_apply(embedding)

Generating embeddings: 100%|██████████| 642/642 [03:30<00:00,  3.04it/s]


### 📇 將資料建立索引（Index Data）

在這一節，我們將把資料 **儲存到 Qdrant** 的 collection 中，也就是進行「建立向量索引」的步驟。

---

## 📌 Qdrant 的資料模型

* **Collection**：類似資料表，是向量與其 payload（額外資訊）的容器。
* **Vector**：用來進行相似度搜尋的主體（通常由 embedding 模型產生）。
* **Payload**：附加的 metadata，例如文章標題、來源、分類等。
* Qdrant **不需要預先定義 schema**，可以直接加入資料。

---

## 📂 我們要建立的 Collection

* 名稱：`Articles`
* 每筆資料包含：

  * 一組向量（由 OpenAI 產生的嵌入）
  * `title` 和 `answer` 作為 payload

---

## ✅ 建立 Collection 並寫入資料

**印出第 0 筆資料的 `title_vector` 向量長度**，通常是用來確認該欄位是否為有效的向量資料（也就是一組浮點數 list 或 array）。

你用的是 OpenAI 的 `text-embedding-3-large`，向量長度 Vector Dimension 上限是 `3072`。

把同樣的長度值**指定給變數 `vector_size`**，這樣你可以在建立 Qdrant collection 時使用它來設定向量維度。

```python
from qdrant_client.models import VectorParams, Distance

client.recreate_collection(
    collection_name="Articles",
    vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
)
```



In [49]:
from qdrant_client.http import models as rest

# vector_size = 3072
print(len(article_df.iloc[0]["title_vector"]))
vector_size = len(article_df.iloc[0]["title_vector"])

client.create_collection(
    collection_name="Articles",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "answer": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

3072


True

### 🗑 如果想要重做，可移除 Qdrant 中的 Collection

若你想從 Qdrant 中**刪除某個 collection**（例如清除測試資料或重新初始化資料庫），可以使用官方 `qdrant-client` 提供的 `delete_collection()` 方法。

---

### ✅ Python 程式碼範例：

```python
# 刪除該 collection
client.delete_collection(collection_name=collection_name)
```

---

### ⚠ 注意事項

* 刪除 collection 是**不可逆**的操作，資料會永久移除。
* 建議先用 `client.get_collections()` 確認要刪除的 collection 是否存在。
* 若 collection 不存在，`delete_collection()` 也不會報錯，它會靜默略過。

## 🧩 插入資料（用 precomputed embeddings）

包含欄位：`title`, `answer`, `title_vector`, `answer_vector`：

In [50]:
client.upsert(
    collection_name="Articles",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "answer": v["answer_vector"],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iterrows()
    ],
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [51]:
# Check the collection size to make sure all the points have been stored
client.count(collection_name="Articles")

CountResult(count=642)

### 🔍 查詢資料（Search Data）

在這一節，我們會開始對已儲存在 **Qdrant** 的向量資料進行**相似度搜尋**。
你可以查詢最接近某個輸入查詢的向量資料（例如找出與某個問題相關的文章段落）。

---

## ✅ 基本查詢流程

1. **將查詢文字轉換為向量（使用與儲存時相同的模型）**
2. **使用 Qdrant 的 `.search()` 方法來查找相近向量**
3. **可指定 `vector_name` 來切換查詢標題向量或內文向量**

---

### 🧠 嵌入模型

由於你使用的是 OpenAI 的 `text-embedding-3-large` 建立向量，所以查詢時也需要使用同一個模型：

---

### 🔍 使用 Qdrant 進行搜尋

```python
query = "What is the incubation period of COVID-19?"
query_vector = get_embedding(query)

search_result = client.search(
    collection_name="Articles",
    query_vector=query_vector,
    limit=5,  # 回傳最接近的前 5 筆資料
    with_payload=True,  # 顯示 payload（如 title、text）
    vector_name="title_vector"  # 或使用 content_vector，視你的設定而定
)
```

---

## 📌 多向量支援（optional）

如果你的 Collection 是多向量格式（例如 `title_vector` + `answer_vector`），請記得：

* 搜尋標題用 `vector_name="title"`
* 搜尋內文用 `vector_name="answer"`


In [52]:
def query_qdrant(query, collection_name, vector_name="title", top_k=20):
    # Creates embedding vector from user query
    embedded_query = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-large",
    ).data[0].embedding

    query_results = client.search(
        collection_name=collection_name,
        query_vector=(
            vector_name, embedded_query
        ),
        limit=top_k,
    )

    return query_results

### 🖨 顯示查詢結果

In [64]:
article_sample=article_df.sample()
article_sample

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source,title_vector,answer_vector
279,145638,Acknowledging local government for quarantine ...,It looks like I may completely write a paper w...,145641,I decided to elevate my comment to an answer.\...,Accepted,Your question is based on a few statements whi...,Random,academia.stackexchange.com,expert,"[-0.011870007961988449, -0.014500943943858147,...","[0.007967445068061352, -0.03274961933493614, -..."


### 🖨 embedding search 的使用情境

使用 sample 從原始資料集中抽出一行資料，並使用資料的 title 問題向 Qdrant 搜尋

```
query_results = query_qdrant("Acknowledging local government for quarantine", "Articles")
```

使用完整的 title，或是使用單字或片語查詢
使用自己想到的任何問題進行搜尋（由於目前資料集是英文，可以透過 chatgpt 先翻譯成英文）

In [67]:
print(article_sample["title"])

279    Acknowledging local government for quarantine ...
Name: title, dtype: object


In [66]:
query_results = query_qdrant("Acknowledging local government for quarantine", "Articles")

for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")

  query_results = client.search(


1. Acknowledging local government for quarantine measures (Score: 0.953)
2. Have any countries or regions announced coronavirus control plans other than quarantine/contact tracing? (Score: 0.442)
3. Is there an official assessment of the side-effects of a quarantine for COVID-19 in US or in China? (Score: 0.412)
4. Fifth Amendment and Mandatory Shelter in Place (Score: 0.404)
5. Self quarantine for travel in Europe? (Score: 0.399)
6. Leaving major cities during COVID-19 pandemic (Score: 0.39)
7. People exempt from Trump's COVID-19 proclamation, what is the quarantine/screening requirement? (Score: 0.364)
8. COVID-19: why are countries still introducing quarantines for travelers from affected regions, even though it's been proven they don't work? (Score: 0.362)
9. Are there any polls explicitly measuring "quarantine fatigue" in Western countries? (Score: 0.362)
10. What are the currently quarantined states planning to do in order to jumpstart their economy back to life? (Score: 0.355)
1

### 🖨 embedding search 的使用情境

* 搜尋內文用 `vector_name="answer"`

In [68]:
print(article_sample["answer"])

279    I decided to elevate my comment to an answer.\...
Name: answer, dtype: object


In [77]:
# This time we'll query using content vector
query_results = query_qdrant("I decided to elevate my comment to an answer", "Articles", "anwser")

for i, article in enumerate(query_results):
    print(f"{i + 1}. {article.payload['title']} (Score: {round(article.score, 3)})")
    
print(f"{query_results[0].payload['answer']}\n")

  query_results = client.search(


1. Acknowledging local government for quarantine measures (Score: 0.265)
2. Getting refunded for a flight which I'll be prevented from boarding (Score: 0.243)
3. Is there any identified policy China is doing to succesfully reduce Coronavirus the other countries aren't using? (Score: 0.221)
4. What is the most effective way to contest poorly enforced speed restrictions? (Score: 0.22)
5. Tips for transition to online classrooms given university shutdowns in response to COVID-19 (Score: 0.219)
6. Parameter Estimation for the SIRD model via Kalman Filter (Score: 0.219)
7. Ratio of bleach to water required to disinfect COVID-19? (Score: 0.208)
8. Flying into United States from Hong Kong Just After 14-Day Schengen Window Expires (Score: 0.208)
9. What to do after Air Canada cancelled flight home (Score: 0.201)
10. Can I claim on my travel insurance if a country closes its border? (Score: 0.197)
11. risk of shifting to online learning permanently (Score: 0.195)
12. Why has Russia declined OPE

In [85]:
print(query_results[0].payload["question_id"])
print(query_results[0].payload["title"])
print(query_results[0].payload["answer"])

145638
Acknowledging local government for quarantine measures
I decided to elevate my comment to an answer.


  When this is all over, and most of your readers will probably know multiple people who died during the crisis, this would not be a good [look].


This is a global crisis that is only just beginning. By the time the dust settles, millions may have died from the virus alone. It would be a rare reader who does not personally know someone affected. More will have died because of how overwhelmed the global healthcare system is. 

Even more will be affected by the economic fallout.

Viral infections can have lifelong consequences, even if you live. 

If you're in a medical field, many of  your readers will be caregivers who were worked to the bone for months, and had to make impossible triage decisions. 

For your readers, researchers are forced to halt research, often at great expense. Many studies cannot be paused for three months and picked up at the same place. 

And you're pro

### 📥 輸出資料

在這一節，我們將輸出預先處理好的資料，**避免你用自己的 OpenAI API 金額重新計算文章的向量嵌入（embeddings）**。這讓你可以直接進行 Qdrant 儲存與查詢的實驗。

---

### ✅ 為什麼要用預處理資料？

* 計算文字嵌入會消耗 OpenAI 的使用額度（credits）。
* 若資料量大（例如整份 Wikipedia 的段落），嵌入計算時間與成本都不小。
* 使用事先嵌入好的資料，可以快速進行向量資料庫操作的示範與測試。

In [97]:
filename="/home/jovyan/community_embedded_text_embedding_3_large.csv"
article_df.to_csv(filename, index=False)

In [99]:
article_df_embedded = pd.read_csv(filename)
article_df_embedded.head()

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source,title_vector,answer_vector
0,14057,Can pets catch the cold?,Last night I was drying my cat with a towel af...,14083,Yes they can. The viruses that cause a cold in...,Accepted,"That is a Priapulid worm, also known as a ""pen...",Random,biology.stackexchange.com,biomedical,"[-0.07070205360651016, 0.008015173487365246, -...","[-0.062062136828899384, 0.00963278952986002, -..."
1,89709,Is the Common Cold an Immune Overreaction?,It's my understanding that the majority of sym...,89712,Can someone die of the common cold?\n\nNo. \nT...,Accepted,"The dash (""-"") does not represent a negative c...",Random,biology.stackexchange.com,biomedical,"[-0.023646455258131027, -0.03416893631219864, ...","[-0.024439387023448944, -0.01717069186270237, ..."
2,89886,Air purifier agains bacteria and viruses?,We would buy a mobile air purifier in our home...,89887,The aforementioned filter will filter microbes...,Accepted,"It's a bleu ray gelyfish, don't tauch is becau...",Random,biology.stackexchange.com,biomedical,"[-0.04043661430478096, -0.0090789208188653, -0...","[-0.018159082159399986, -0.008939010091125965,..."
3,89929,Why are bats the source of dangerous coronavir...,Why do coronaviruses come from bats?\n\nI mean...,89944,\n The preponderance of links between bat and...,Accepted,"First of, depending on your definition of life...",Random,biology.stackexchange.com,biomedical,"[-0.020367445424199104, -0.029645273461937904,...","[-0.028410816565155983, -0.00841598305851221, ..."
4,89938,How do bats survive their own coronaviruses?,How do bats survive their own coronaviruses (w...,89975,It's common for the reservoir host of a zoonot...,Accepted,"I think that ""career in synthetic biology"" and...",Random,biology.stackexchange.com,biomedical,"[0.00902707502245903, -0.00865456834435463, -0...","[-0.04463430121541023, 0.0020316781010478735, ..."
