https://github.com/openai/openai-cookbook/blob/main/examples/vector_databases/qdrant/Getting_started_with_Qdrant_and_OpenAI.ipynb

MIT License

Copyright (c) 2025 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Using a Vector Database with OpenAI Embeddings

This notebook walks you step by step through using **OpenAI embeddings**.

---

## 📘 Notebook workflow overview

Use vector embeddings precomputed with the **OpenAI API**.

---

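## 🚀 Start the Qdrant server

For this workshop, each participant should have received a machine that can be operated remotely over ssh. The Qdrant server is started together with Jupyter Notebook.

We can verify that the server started successfully by running a simple curl command: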
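The command itself is not included in this notebook export. As a rough Python equivalent, the sketch below sends one HTTP GET and reports whether a JSON response came back. It assumes Qdrant listens on its default REST port 6333 on the same machine, which is an assumption about the workshop setup; `qdrant_is_up` and `QDRANT_URL` are hypothetical names.

```python
import json
import urllib.error
import urllib.request

# Assumption: Qdrant serves its REST API on the default port 6333 on this machine.
QDRANT_URL = "http://localhost:6333"


def qdrant_is_up(url=QDRANT_URL, timeout=3.0):
    """Return True if `url` answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))  # raises ValueError if not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, bad URL, or non-JSON body.
        return False

# Example use in a notebook cell:
#     print("Qdrant reachable:", qdrant_is_up())
```

If this returns `False`, check that the server process is running and that the port matches your environment before going further.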

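### 📦 Install the required packages

This notebook requires the main `openai` package, plus a few helper libraries.

#### Package notes:

* **`openai`**: talks to the OpenAI API to generate text embeddings.
* **`pandas`**: loads the dataset CSV into a DataFrame.
* **`kagglehub`**: downloads the demo dataset from Kaggle.
* **`tqdm`**: provides progress bars so you can track data processing.
* **`tenacity`**: retries API calls that fail.

Once installation finishes, you can move on to the following steps.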


In [2]:
! pip install openai pandas wget kagglehub tqdm tenacity



### 🔐 Prepare your OpenAI API key

The OpenAI API key is used to convert documents and queries into vectors (vectorization).

If you do not have an OpenAI API key yet, you can get one here:
👉 [https://platform.openai.com/settings/organization/api-keys](https://platform.openai.com/settings/organization/api-keys)

---

If you use **Azure OpenAI** instead of the official OpenAI API, set your credentials with the following two environment variables:

* `AZURE_OPENAI_API_KEY`: your Azure OpenAI key
* `AZURE_OPENAI_ENDPOINT`: your Azure OpenAI endpoint URL (for example `https://<your-resource-name>.openai.azure.com/`)

---

### 🔐 Set the Azure OpenAI API key and endpoint

```python
import os

os.environ["AZURE_OPENAI_API_KEY"] = ""  # fill in your API key
os.environ["AZURE_OPENAI_ENDPOINT"] = ""  # fill in your endpoint URL
os.environ["OPENAI_API_VERSION"] = "2024-12-01-preview"  # replace with your API version
os.environ["OPENAI_MODEL"] = "text-embedding-3-large"
```


In [42]:
# Test that your OpenAI API key is correctly set as an environment variable
# Note. if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note. alternatively you can set a temporary env variable like this:
#os.environ["OPENAI_API_KEY"] = ""

os.environ["AZURE_OPENAI_API_KEY"]=""
os.environ["AZURE_OPENAI_ENDPOINT"]=""
os.environ["OPENAI_API_VERSION"]="2024-12-01-preview"
os.environ["OPENAI_MODEL"]="text-embedding-3-large"
#os.environ["OPENAI_MODEL"]="text-embedding-3-small"

if os.getenv("AZURE_OPENAI_API_KEY") is not None:
    print("AZURE_OPENAI_API_KEY is ready")
else:
    print("AZURE_OPENAI_API_KEY environment variable not found")

AZURE_OPENAI_API_KEY is ready
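The check above only tests whether the variable exists, so a key left as an empty string still prints "ready". A slightly stricter sketch treats blank values as missing and covers the endpoint and API version too. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names for illustration; the variable names themselves come from the cell above.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

# Example use in a notebook cell:
#     missing = missing_env_vars()
#     print("All set" if not missing else f"Please fill in: {missing}")
```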


### Dataset

We use a dataset from Kaggle as our first RAG data.

When you open this Kaggle link:
👉 [https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv](https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv)

you will see that the dataset is part of the **[COVID-QA](https://github.com/deepset-ai/COVID-QA)** project. User `xhlulu` uploaded it to Kaggle, and it is mainly used to train and evaluate question answering (QA) models.

---

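### 📁 Dataset overview: `covidqa`

This is a question-answering dataset built on **COVID-19 literature**. It is meant to help developers train natural language understanding (NLU) and question-answering systems that respond to pandemic-related questions.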

---

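### 📄 What is the `community.csv` file?

`community.csv` is one of the files in this dataset. Judging from its name and the rows loaded later in this notebook, it contains **question-answer pairs** collected from community Q&A sites. The `url` column lists Stack Exchange sites such as `biology.stackexchange.com`.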

---

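### 🔍 Uses

* Training question-answering models (for example BERT, or GPT plus a vector database).
* Building Retrieval-Augmented Generation (RAG) systems.
* Building Q&A applications on top of vector embeddings (for example OpenAI embeddings + Qdrant).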

---

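### 📌 Further applications

You can use this data to:

* Build a COVID-19 themed chatbot.
* Build a QA pipeline, for example:

  * Create vectors with OpenAI embeddings
  * Store them in Qdrant
  * Use an LLM to query and generate answers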
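To make the retrieval step of such a pipeline concrete, here is a minimal in-memory sketch of nearest-neighbour search by cosine similarity. It is a toy stand-in, not Qdrant's implementation: the three-dimensional vectors are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helper names. A vector database performs the same kind of ranking, at scale and with indexing.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=2):
    """records: list of (text, vector) pairs; return the k texts most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional vectors standing in for real embeddings (made-up numbers).
records = [
    ("Can pets catch the cold?", [0.9, 0.1, 0.0]),
    ("How do bats survive their own coronaviruses?", [0.1, 0.9, 0.1]),
    ("Is the Common Cold an Immune Overreaction?", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], records, k=1))
```

In a full RAG system the retrieved texts would then be placed into the LLM prompt as context for the answer.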

In [3]:
import kagglehub

# Download a single file.
# https://www.kaggle.com/datasets/xhlulu/covidqa/data?select=community.csv
kagglehub.dataset_download('xhlulu/covidqa', path='community.csv')

'/home/jovyan/.cache/kagglehub/datasets/xhlulu/covidqa/versions/6/community.csv'

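### 📥 Load the data

In this section we load the preprocessed data, **so that you do not have to spend your own OpenAI API credits recomputing the embeddings**. This lets you go straight to experimenting with storage and queries in Qdrant.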

---

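### ✅ Why use preprocessed data?

* Computing text embeddings consumes OpenAI usage credits.
* With a large dataset (for example, every paragraph of Wikipedia), the time and cost of computing embeddings are significant.
* With embeddings prepared in advance, you can demo and test vector database operations quickly.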

---

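### 🧾 Example columns:

The dataset has many columns. The ones we use today may include:

* `title`: the question text, usually a question phrased in natural language.
* `answer`: the answer text for that question.
* `question_id`: the unique identifier of each record.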

In [10]:
import pandas as pd

from ast import literal_eval

article_df = pd.read_csv('/home/jovyan/.cache/kagglehub/datasets/xhlulu/covidqa/versions/6/community.csv')
# Read vectors from strings back into a list
#article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
#article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df.head()

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source
0,14057,Can pets catch the cold?,Last night I was drying my cat with a towel af...,14083,Yes they can. The viruses that cause a cold in...,Accepted,"That is a Priapulid worm, also known as a ""pen...",Random,biology.stackexchange.com,biomedical
1,89709,Is the Common Cold an Immune Overreaction?,It's my understanding that the majority of sym...,89712,Can someone die of the common cold?\n\nNo. \nT...,Accepted,"The dash (""-"") does not represent a negative c...",Random,biology.stackexchange.com,biomedical
2,89886,Air purifier agains bacteria and viruses?,We would buy a mobile air purifier in our home...,89887,The aforementioned filter will filter microbes...,Accepted,"It's a bleu ray gelyfish, don't tauch is becau...",Random,biology.stackexchange.com,biomedical
3,89929,Why are bats the source of dangerous coronavir...,Why do coronaviruses come from bats?\n\nI mean...,89944,\n The preponderance of links between bat and...,Accepted,"First of, depending on your definition of life...",Random,biology.stackexchange.com,biomedical
4,89938,How do bats survive their own coronaviruses?,How do bats survive their own coronaviruses (w...,89975,It's common for the reservoir host of a zoonot...,Accepted,"I think that ""career in synthetic biology"" and...",Random,biology.stackexchange.com,biomedical


In [36]:
article_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 642 entries, 0 to 641
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   question_id        642 non-null    int64 
 1   title              642 non-null    object
 2   question           642 non-null    object
 3   answer_id          642 non-null    int64 
 4   answer             642 non-null    object
 5   answer_type        642 non-null    object
 6   wrong_answer       642 non-null    object
 7   wrong_answer_type  642 non-null    object
 8   url                642 non-null    object
 9   source             642 non-null    object
 10  title_vector       642 non-null    object
dtypes: int64(2), object(9)
memory usage: 55.3+ KB


In [35]:
article_df.describe(include="all")

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source,title_vector
count,642.0,642,642,642.0,642,642,642,642,642,642,642
unique,,642,642,,642,2,640,2,26,3,642
top,,Can pets catch the cold?,Last night I was drying my cat with a towel af...,,Yes they can. The viruses that cause a cold in...,Accepted,"Basically, the signal transduction pathway of ...",Random,travel.stackexchange.com,general,"[-0.07070205360651016, 0.008015173487365246, -..."
freq,,1,1,,1,335,2,544,119,300,1
mean,101172.771028,,,101244.221184,,,,,,,
std,99158.636971,,,99156.996497,,,,,,,
min,6235.0,,,7195.0,,,,,,,
25%,35909.25,,,35923.25,,,,,,,
50%,71596.5,,,71608.5,,,,,,,
75%,147239.5,,,147281.0,,,,,,,


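# You do not need to run the section below!

Generating embeddings uses up the embedding API rate limit, which would get other participants throttled.

A file that already contains the embeddings is stored at `/home/jovyan/community_embedded_text_embedding_3_large.csv`

If you want to try generating embeddings yourself, you can do so during the DIY part in [4_RAG_DIY.ipynb](./4_RAG_DIY.ipynb)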

In [20]:
#from openai import OpenAI
#openai_client = OpenAI()

from openai import AzureOpenAI
openai_client = AzureOpenAI()

from tqdm import tqdm
from tenacity import retry, wait_random_exponential, stop_after_attempt

@retry(
    wait=wait_random_exponential(min=1, max=60),  # random exponential backoff between retries
    stop=stop_after_attempt(6),  # give up after 6 attempts
)
def embedding(text: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-large",
    )
    return response.data[0].embedding

tqdm.pandas(desc="Generating embeddings")
article_df['title_vector'] = article_df['title'].progress_apply(embedding)

Generating embeddings: 100%|██████████| 642/642 [03:14<00:00,  3.30it/s]
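The `@retry` decorator above comes from `tenacity`. To show the idea behind it, here is a standard-library sketch of retrying with random exponential backoff. It is not tenacity's exact algorithm, and `retry_with_backoff` and its parameters are hypothetical names for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(); on an exception, wait a random and growing time, then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with each attempt and is capped at max_wait.
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))
```

The randomness matters in a shared workshop environment: when many participants hit the rate limit at once, random waits keep them from all retrying at the same instant.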


In [44]:
article_df['answer_vector'] = article_df['answer'].progress_apply(embedding)

Generating embeddings: 100%|██████████| 642/642 [03:30<00:00,  3.04it/s]
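Each `progress_apply` call above sends one request per row, so 642 requests per column. The embeddings endpoint also accepts a list of strings as `input`, so one way to cut the number of requests is to send rows in batches. The sketch below shows only the batching logic; `embed_batch` is a hypothetical callable that you would write around the API call, and the batch size of 64 is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed `texts` in batches.

    `embed_batch` is a hypothetical callable: it takes a list of strings and
    returns one vector per string, in the same order.
    """
    vectors = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

With the client from the cell above, `embed_batch` could be something like `lambda chunk: [d.embedding for d in openai_client.embeddings.create(input=chunk, model="text-embedding-3-large").data]`. Treat that as an untested sketch and check the batch limits of your own deployment.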


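### 📥 Export the data

In this section we export the preprocessed data, **so that you do not have to spend your own OpenAI API credits recomputing the embeddings**. This lets you go straight to experimenting with storage and queries in Qdrant.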

---

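### ✅ Why use preprocessed data?

* Computing text embeddings consumes OpenAI usage credits.
* With a large dataset (for example, every passage of COVID-QA), the time and cost of computing embeddings are significant.
* With embeddings prepared in advance, you can demo and test vector database operations quickly.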

In [99]:
# `filename` is set in the export cell (In [97]) shown after this output.
article_df_embedded = pd.read_csv(filename)
article_df_embedded.head()

Unnamed: 0,question_id,title,question,answer_id,answer,answer_type,wrong_answer,wrong_answer_type,url,source,title_vector,answer_vector
0,14057,Can pets catch the cold?,Last night I was drying my cat with a towel af...,14083,Yes they can. The viruses that cause a cold in...,Accepted,"That is a Priapulid worm, also known as a ""pen...",Random,biology.stackexchange.com,biomedical,"[-0.07070205360651016, 0.008015173487365246, -...","[-0.062062136828899384, 0.00963278952986002, -..."
1,89709,Is the Common Cold an Immune Overreaction?,It's my understanding that the majority of sym...,89712,Can someone die of the common cold?\n\nNo. \nT...,Accepted,"The dash (""-"") does not represent a negative c...",Random,biology.stackexchange.com,biomedical,"[-0.023646455258131027, -0.03416893631219864, ...","[-0.024439387023448944, -0.01717069186270237, ..."
2,89886,Air purifier agains bacteria and viruses?,We would buy a mobile air purifier in our home...,89887,The aforementioned filter will filter microbes...,Accepted,"It's a bleu ray gelyfish, don't tauch is becau...",Random,biology.stackexchange.com,biomedical,"[-0.04043661430478096, -0.0090789208188653, -0...","[-0.018159082159399986, -0.008939010091125965,..."
3,89929,Why are bats the source of dangerous coronavir...,Why do coronaviruses come from bats?\n\nI mean...,89944,\n The preponderance of links between bat and...,Accepted,"First of, depending on your definition of life...",Random,biology.stackexchange.com,biomedical,"[-0.020367445424199104, -0.029645273461937904,...","[-0.028410816565155983, -0.00841598305851221, ..."
4,89938,How do bats survive their own coronaviruses?,How do bats survive their own coronaviruses (w...,89975,It's common for the reservoir host of a zoonot...,Accepted,"I think that ""career in synthetic biology"" and...",Random,biology.stackexchange.com,biomedical,"[0.00902707502245903, -0.00865456834435463, -0...","[-0.04463430121541023, 0.0020316781010478735, ..."
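After a round trip through CSV, the `title_vector` and `answer_vector` columns come back as strings such as `"[-0.07, 0.008, ...]"`, not as lists of numbers. A small sketch of the conversion uses `ast.literal_eval`, the same function the commented-out lines in the loading cell refer to. `parse_vector` is a hypothetical helper name.

```python
from ast import literal_eval


def parse_vector(cell):
    """Turn a string like "[0.1, -0.2]" back into a list of floats."""
    values = literal_eval(cell)
    if not isinstance(values, (list, tuple)):
        raise ValueError("expected a list literal, got: %r" % (cell,))
    return [float(v) for v in values]

# Example use on the DataFrame loaded above:
#     article_df_embedded["title_vector"] = article_df_embedded["title_vector"].apply(parse_vector)
#     article_df_embedded["answer_vector"] = article_df_embedded["answer_vector"].apply(parse_vector)
```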


In [97]:
filename="/home/jovyan/community_embedded_text_embedding_3_large.csv"
article_df.to_csv(filename, index=False)

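# Summary

At this point the embeddings are done.

1. We used the curated COVID-QA dataset from Kaggle
2. We embedded the dataset's `title` and `answer` columns with the OpenAI embedding API
3. We saved the embeddings as a new dataset for the RAG system in the next section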

---

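# Further questions

1. This notebook used the COVID-QA dataset, which is well curated. Real projects usually have no ready-made dataset. What do you do then?
2. How do you choose an embedding model? For example, how would you pick among these commonly used models?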

| Model                           | Dim      | Notes                                                 |
| ------------------------------- | -------- | ----------------------------------------------------- |
| OpenAI `text-embedding-3-large` | 3072     | Very powerful general-purpose embedding               |
| OpenAI `text-embedding-3-small` | 1536     | Cheaper, less powerful                                |
| `BAAI/bge-large-en-v1.5`        | 1024     | Good for English QA, high performance                 |
| `intfloat/e5-large-v2`          | 1024     | Strong for retrieval tasks; needs "query: ..." format |
| `sentence-transformers`         | 384–1024 | Great open-source family, easy to use                 |

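3. How do you choose a database or a vector database?

### This workshop has no answers to the questions above, but they are all problems you will run into in practice

Feel free to come and chat with me afterwards. I still won't have the answers, but I'm happy to talk them through with you.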
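One concrete input to questions 2 and 3 is storage size, which grows linearly with the embedding dimension. The sketch below estimates the raw size of one embedded column for the 642 rows of `community.csv`, assuming each value is stored as a 32-bit float. That storage format is an assumption, and index overhead and payload are ignored. `raw_vector_bytes` is a hypothetical helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is stored as a 32-bit float


def raw_vector_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size in bytes of `num_vectors` dense vectors of dimension `dim`."""
    return num_vectors * dim * bytes_per_value


ROWS = 642  # number of rows reported by article_df.info()
MODELS = [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("BAAI/bge-large-en-v1.5", 1024),
]
for name, dim in MODELS:
    size_mib = raw_vector_bytes(ROWS, dim) / 2**20
    print(f"{name}: {size_mib:.2f} MiB per embedded column")
```

At this dataset size the difference is small. It becomes relevant once the row count reaches the millions.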
