# Evaluating search accuracy with the CISI dataset


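## Preparing the data
We use the [CISI](https://www.kaggle.com/datasets/dmaso01dsta/cisi-a-dataset-for-information-retrieval) dataset. Because it provides the IDs of the documents relevant to each query, search accuracy can be evaluated quantitatively.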

In [1]:
import os
import pandas as pd
from tqdm import tqdm
import ast
def load_data(path):
    
    #_____________ Read data from CISI.ALL file and store in dictionary ________________
    
    with open(os.path.join(path, 'CISI.ALL')) as f:
        lines = ""
        for l in f.readlines():
            lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
        lines = lines.lstrip("\n").split("\n")
 
    doc_set = {}
    doc_id = 0
    doc_text = ""

    for l in lines:
        if l.startswith(".I"):
            doc_id = int(l.split(" ")[1].strip())
        elif l.startswith(".X"):
            doc_set[doc_id] = doc_text.lstrip(" ")
            doc_id = ""
            doc_text = ""
        else:
            doc_text += l.strip()[3:] + " " 

    print(f"Number of documents = {len(doc_set)}")
    print(doc_set[1]) 
    
    
    #_____________ Read data from CISI.QRY file and store in dictionary ________________
    
    with open(os.path.join(path, 'CISI.QRY')) as f:
        lines = ""
        for l in f.readlines():
            lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
        lines = lines.lstrip("\n").split("\n")
          
    qry_set = {}
    qry_id = 0
    for l in lines:
        if l.startswith(".I"):
            qry_id = int(l.split(" ")[1].strip())
        elif l.startswith(".W"):
            qry_set[qry_id] = l.strip()[3:]
            qry_id = ""

    print(f"\n\nNumber of queries = {len(qry_set)}")    
    print(qry_set[1]) 
    
    
    #_____________ Read data from CISI.REL file and store in dictionary ________________
    
    rel_set = {}
    with open(os.path.join(path, 'CISI.REL')) as f:
        for l in f.readlines():
            qry_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[0])
            doc_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[-1])

            if qry_id in rel_set:
                rel_set[qry_id].append(doc_id)
            else:
                rel_set[qry_id] = []
                rel_set[qry_id].append(doc_id)

    print(f"\n\nNumber of mappings = {len(rel_set)}")
    print(rel_set[1]) 
    
    return doc_set, qry_set, rel_set

In [2]:
doc_set, qry_set, rel_set = load_data('./input')

Number of documents = 1460
18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 


Number of queries = 112
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?


Number of mappings = 76
[28, 35, 38, 42, 43, 52, 65, 76, 86, 150, 189, 192, 193, 195, 215, 269, 291, 320, 429

In [3]:
# rel_set has relevance judgments for only some of the queries, so keep only the queries that appear in it
related_ids = set(rel_set.keys())
filtered_qry_set = {qid: qry_set[qid] for qid in related_ids if qid in qry_set}

print(f"Filtered Query Set: {len(filtered_qry_set.keys())}")

Filtered Query Set: 76


### Embedding

In [None]:
from sentence_transformers import SentenceTransformer

# model = SentenceTransformer('msmarco-distilbert-base-tas-b')
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [None]:
data = []

for doc_id, doc_text in tqdm(doc_set.items()):
    embeddings = model.encode(doc_text, convert_to_tensor=True)
    embeddings_list = embeddings.cpu().numpy().tolist()
    data.append({'id': doc_id, 'vector': embeddings_list})

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save as CSV
df.to_csv('document_embeddings.csv', index=False)

print("Embeddings have been saved to document_embeddings.csv")


In [None]:
data = []

for qry_id, qry_text in tqdm(filtered_qry_set.items()):
    embeddings = model.encode(qry_text, convert_to_tensor=True)
    embeddings_list = embeddings.cpu().numpy().tolist()
    data.append({'id': qry_id, 'vector': embeddings_list})

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save as CSV
df.to_csv('query_embeddings.csv', index=False)

print("Embeddings have been saved to query_embeddings.csv")

In [4]:
import numpy as np
train_df = pd.read_csv('document_embeddings.csv')
test_df = pd.read_csv('query_embeddings.csv')
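# The vectors were saved as list literals; parse them with ast.literal_eval (imported in the first cell) rather than eval
train_df['vector'] = train_df['vector'].apply(ast.literal_eval).apply(np.array)
test_df['vector'] = test_df['vector'].apply(ast.literal_eval).apply(np.array)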

In [5]:
len(train_df['vector'][0])

768

In [6]:
train_df.shape, test_df.shape

((1460, 2), (76, 2))
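
The `vector` column is written to CSV as the text form of a Python list, so it has to be parsed back into an array on load. The following is a minimal sketch of that round trip, assuming made-up three-element vectors and an in-memory buffer in place of `document_embeddings.csv`. It uses `ast.literal_eval`, which accepts only Python literals.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame standing in for document_embeddings.csv (values are illustrative).
original = pd.DataFrame(
    {"id": [1, 2], "vector": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, each vector is a string such as "[0.1, 0.2, 0.3]".
restored = pd.read_csv(buffer)

# Parse the string into a list, then convert the list to a NumPy array.
restored["vector"] = restored["vector"].apply(ast.literal_eval).apply(np.array)
```

Storing embeddings in a binary format would avoid the parsing step altogether, but the CSV route keeps the files easy to inspect.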

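## Computing accuracy
We use Mean Reciprocal Rank (MRR) to evaluate accuracy.

For each query, MRR takes the reciprocal of the rank of the first relevant document in the results, then averages those reciprocals over all queries.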

$$
\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}
$$

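Here, $\{Q_1, Q_2, \ldots, Q_N\}$ is the query set, $N$ is the number of queries, and $\text{rank}_i$ is the position at which the first relevant document appears in the search results for query $Q_i$.

A larger MRR means that relevant documents appear nearer the top of the results, that is, higher search accuracy.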
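
As a worked illustration of the formula, here is a small sketch with three hypothetical queries. The document IDs and relevance sets are made up, and a query with no relevant document in its results contributes 0, which is the convention this notebook's implementation also follows.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """Return 1/rank of the first relevant hit, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results and relevance judgments for three queries.
retrieved = {1: [7, 3, 9], 2: [4, 8, 2], 3: [5, 6, 1]}
relevant = {1: {3}, 2: {4, 2}, 3: {10}}

# Query 1: first relevant hit at rank 2 -> 0.5
# Query 2: first relevant hit at rank 1 -> 1.0
# Query 3: no relevant hit              -> 0.0
rr = [reciprocal_rank(retrieved[q], relevant[q]) for q in retrieved]
mrr = sum(rr) / len(rr)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```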

In [7]:
def calculate_mrr_byid(all_result_ids, rel_set, k=10):
    mrrs = []
    for n in all_result_ids.keys():
        query_index = n
        retrieved_ids = all_result_ids[n][:k]
        relevant_ids = set(rel_set.get(query_index, []))

        # Find the rank of the first relevant document
        first_relevant_rank = None
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_ids:
                first_relevant_rank = rank
                break

        # Calculate MRR for this query
        mrr = 1 / first_relevant_rank if first_relevant_rank else 0.0
        mrrs.append(mrr)

        # Debug print to see MRR for each query
        # print(f"MRR for query {query_index}: {mrr:.4f}")

    average_mrr = np.mean(mrrs)
    print(f"Average MRR over {len(all_result_ids.keys())} instances: {average_mrr:.4f}")
    return average_mrr

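## Measuring speed
Because this is only a simple check, we measure latency directly in the Jupyter Notebook as shown below rather than with a dedicated load-testing tool.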

In [18]:
import time
import pandas as pd
from tqdm import tqdm
import numpy as np

def perform_search(test_df, search_function, get_data_function, k=10, num_iterations=1):
    percentile_90_list = []
    percentile_99_list = []
    total_execution_times = []
    all_result_ids = {}

    for i in range(num_iterations):
        print(f"Iteration {i+1}")

        search_times = []
        for index, query_text in tqdm(filtered_qry_set.items(), desc="Searching"):
            vector = np.array(test_df[test_df['id'] == index].iloc[0]['vector'])
            start_time = time.time()
            try:
                response = search_function(vector, query_text)
            except Exception as e:
                print(f"Error occurred for index {index}: {e}")
                continue

            end_time = time.time()
            search_times.append(end_time - start_time)

            data = get_data_function(response)

            result_ids = [int(res_id) for res_id in data[:k]]  # keep the IDs of the top-k documents
            all_result_ids[index] = result_ids

        df = pd.DataFrame(search_times, columns=['search_time'])

        percentile_90 = df['search_time'].quantile(0.9) * 1000  # convert to milliseconds
        percentile_99 = df['search_time'].quantile(0.99) * 1000  # convert to milliseconds

        percentile_90_list.append(percentile_90)
        percentile_99_list.append(percentile_99)
        total_execution_times.append(sum(search_times))

        print("Total execution time for searches: {:.3f} s".format(sum(search_times)))
        print("90th percentile of search times: {:.3f} ms".format(percentile_90))
        print("99th percentile of search times: {:.3f} ms".format(percentile_99))

    average_90 = np.mean(percentile_90_list)
    std_90 = np.std(percentile_90_list)

    average_99 = np.mean(percentile_99_list)
    std_99 = np.std(percentile_99_list)

    average_total_time = np.mean(total_execution_times)
    std_total_time = np.std(total_execution_times)

    print("Average total execution time: {:.3f} s (std: {:.3f})".format(average_total_time, std_total_time))
    print("Average 90th percentile of search times: {:.3f} ms (std: {:.3f})".format(average_90, std_90))
    print("Average 99th percentile of search times: {:.3f} ms (std: {:.3f})".format(average_99, std_99))

    return all_result_ids
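
`perform_search` takes two callables: `search_function(vector, query_text)`, which runs one query, and `get_data_function(response)`, which turns the response into a ranked list of document IDs. The sketch below shows a pair of functions that fit this interface using exact cosine similarity in NumPy. It is not part of the original benchmark; the names `brute_force_search` and `brute_force_get_data`, the 100 synthetic documents, and the 8-dimensional vectors are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_ids = np.arange(1, 101)              # hypothetical document IDs 1..100
doc_vectors = rng.normal(size=(100, 8))  # synthetic 8-dimensional embeddings

# Normalize once so that a dot product equals cosine similarity.
doc_unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)


def brute_force_search(vector, _query_text, top_n=20):
    """Exact cosine-similarity search; returns document IDs, best first."""
    query_unit = vector / np.linalg.norm(vector)
    scores = doc_unit @ query_unit
    order = np.argsort(-scores)[:top_n]
    return doc_ids[order]


def brute_force_get_data(response):
    """Convert the 'response' (an array of IDs) into a plain list of ints."""
    return [int(doc_id) for doc_id in response]


# Querying with a stored vector should return that document first.
hits = brute_force_get_data(brute_force_search(doc_vectors[41], ""))
```

An exact baseline of this kind can serve as a sanity check, since an approximate index should not beat it on ranking quality by more than noise.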

# Vald

In [19]:
import grpc
from vald.v1.vald import insert_pb2_grpc
from vald.v1.vald import upsert_pb2_grpc
from vald.v1.vald import search_pb2_grpc
from vald.v1.vald import update_pb2_grpc
from vald.v1.vald import remove_pb2_grpc
from vald.v1.vald import object_pb2_grpc
from vald.v1.vald import index_pb2_grpc
from vald.v1.payload import payload_pb2

In [20]:
PORT_DEFAULT = ':8081'
host = "vald-lb-gateway.test-ns.svc.cluster.local"

options = [
    ('grpc.max_metadata_size', 32 * 1024),
]

## create a channel by passing "{host}:{port}"
channel = grpc.insecure_channel(host + PORT_DEFAULT, options=options)

## create stubs for calling RPCs
insertStub = insert_pb2_grpc.InsertStub(channel)
upsertStub = upsert_pb2_grpc.UpsertStub(channel)
updateStub = update_pb2_grpc.UpdateStub(channel)
removeStub = remove_pb2_grpc.RemoveStub(channel)
objectStub = object_pb2_grpc.ObjectStub(channel)
searchStub = search_pb2_grpc.SearchStub(channel)
indexStub = index_pb2_grpc.IndexStub(channel)

insertConfig = payload_pb2.Insert.Config(skip_strict_exist_check=True)
updateConfig = payload_pb2.Update.Config(skip_strict_exist_check=True)
removeConfig = payload_pb2.Remove.Config(skip_strict_exist_check=True)
upsertConfig = payload_pb2.Upsert.Config(skip_strict_exist_check=True)
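# num is the maximum number of results per search; with num=10, Vald returns at most 10 hits even when k=20 is requested later
searchConfig = payload_pb2.Search.Config(num=10, radius=-1.0, epsilon=0.2)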

## Insert

In [21]:
def generatorUpsertStream():
    upsert_requests = []
    for _, row in tqdm(train_df.iterrows()):
        v = payload_pb2.Object.Vector(id=str(row['id']), vector=row['vector'])
        request = payload_pb2.Upsert.Request(vector=v, config=upsertConfig)
        upsert_requests.append(request)
    return upsert_requests

In [22]:
from tqdm import tqdm
print('Start Stream Upsert')
for r in tqdm(upsertStub.StreamUpsert(iter(generatorUpsertStream()))):
    pass

Start Stream Upsert


1460it [00:00, 4114.34it/s]
1460it [00:01, 1046.15it/s]


### Confirming that indexing has completed

In [23]:
index_count = 1460
check_interval = 5 # polling interval (seconds)

start_time = time.time()

while True:
    res = indexStub.IndexInfo(payload_pb2.Empty())
    current_count = res.stored

    print(f"Current index count: {current_count}")

    if current_count >= index_count * 3: # index_replica=3
        end_time = time.time()
        print("Indexing completed in: {:.2f} seconds".format(end_time - start_time))
        break

    time.sleep(check_interval)

Current index count: 4380
Indexing completed in: 0.00 seconds
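
The polling loop above keeps running if the count never reaches the target. As a hedged alternative, the sketch below wraps the same idea in a helper with a timeout. `wait_for_count` and its parameters are hypothetical names, and the fake counter stands in for the real index statistics call.

```python
import time


def wait_for_count(get_count, target, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_count() until it reaches target; raise TimeoutError otherwise."""
    start = time.monotonic()
    while True:
        current = get_count()
        if current >= target:
            return current, time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"count stuck at {current}, expected {target}")
        sleep(interval)


# Stand-in for the index statistics call: reports a growing count on each poll.
fake_counts = iter([0, 1460, 2920, 4380])
count, elapsed = wait_for_count(
    lambda: next(fake_counts), target=4380, interval=0.0, timeout=5.0
)
```

With Vald, `get_count` could be a small function that returns the `stored` field of the index information response, as the loop above does.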


### Search

In [24]:
def vald_search_function(vector, _):
    response = searchStub.Search(payload_pb2.Search.Request(vector=vector, config=searchConfig))
    return response
def vald_get_data_from_response(response):
    return [int(result.id) for result in response.results]
    
# Vald
vald_all_result_ids = perform_search(test_df, vald_search_function, vald_get_data_from_response, k=20, num_iterations=1)

Iteration 1


Searching: 100%|██████████| 76/76 [00:00<00:00, 479.60it/s]

Total execution time for searches: 0.137 s
90th percentile of search times: 2.215 ms
99th percentile of search times: 2.519 ms
Average total execution time: 0.137 s (std: 0.000)
Average 90th percentile of search times: 2.215 ms (std: 0.000)
Average 99th percentile of search times: 2.519 ms (std: 0.000)





In [25]:
calculate_mrr_byid(vald_all_result_ids, rel_set, k=20)

Average MRR over 76 instances: 0.6157


0.6157111528822055

# OpenSearch

In [33]:
from opensearchpy import OpenSearch
from opensearchpy.exceptions import ConnectionError
import time
from tqdm import tqdm
import json

In [34]:
import urllib3
# Suppress InsecureRequestWarning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [40]:
initial_admin_password = '' # set your password

In [41]:
client = OpenSearch(
    hosts=[{'host': 'my-third-cluster.opensearch-3.svc.cluster.local', 'port': 9200}],
    http_auth=('admin', initial_admin_password),
    use_ssl=True,
    verify_certs=False
)
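
The notebook does not show how the indices were created. For k-NN queries, OpenSearch needs the vector field mapped as `knn_vector` before documents are indexed. The sketch below builds one possible index body; the helper name `build_knn_index_body` and the field types for `doc_id` and `text` are assumptions, while the dimension of 768 matches the embedding length printed earlier. Engine, method, and distance settings are left at their defaults, which may differ from what was actually used.

```python
def build_knn_index_body(dimension=768):
    """Return an index body that enables k-NN search on passage_embedding."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "doc_id": {"type": "integer"},
                "text": {"type": "text"},
                "passage_embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                },
            }
        },
    }


index_body = build_knn_index_body()

# With a live cluster, the index could then be created before indexing, e.g.:
# client.indices.create(index="cisi-text-hybrid", body=index_body)
```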

In [42]:
def index_documents(doc_set, train_df, index_name, hybrid=False):
    for doc_id, doc_text in tqdm(doc_set.items()):
        document = {
            'doc_id': doc_id,
            'text': doc_text,
        }

        if hybrid:
            document['passage_embedding'] = np.array(train_df.iloc[doc_id - 1].vector)

        response = client.index(
            index=index_name,
            id=doc_id,
            body=document
        )
        # print(f'Document {doc_id} indexed with response: {response["result"]}')

In [43]:
text_index_name = 'cisi-text'
index_documents(doc_set, train_df, text_index_name, hybrid=False)

vector_index_name = 'cisi-text-hybrid'
index_documents(doc_set, train_df, vector_index_name, hybrid=True)

100%|██████████| 1460/1460 [00:04<00:00, 349.23it/s]
100%|██████████| 1460/1460 [00:07<00:00, 194.19it/s]


### Confirming that indexing has completed

In [44]:
index_name = 'cisi-text-hybrid'
index_count = 1460
check_interval = 5 # polling interval (seconds)

def get_document_count(index_name):
    try:
        response = client.indices.stats(index=index_name)
        doc_count = response['_all']['primaries']['docs']['count']
        return doc_count
    except ConnectionError as e:
        print(f"Connection error: {e}")
        return None


start_time = time.time()

while True:
    doc_count = get_document_count(index_name)
    
    if doc_count is not None:
        print(f"Current document count: {doc_count}")

        if doc_count >= index_count:
            end_time = time.time()
            elapsed_time = end_time - start_time
            print(f"Indexing completed in {elapsed_time:.2f} seconds.")
            break

    time.sleep(check_interval)

Current document count: 1460
Indexing completed in 0.00 seconds.


In [45]:
def full_search(vector, query_text):
    search_query = {
        "size": 20,  # return at most 20 hits
        "query": {
            "match": {
                "text": {
                    "query": query_text,
                    "minimum_should_match": "30%"
                }
            }
        }
    }

    response = client.search(
        index=text_index_name,
        body=search_query,
        params={}
    )
    return response

def hybrid_search(vector, query_text):
    search_query = {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "size": 20,  # return at most 20 hits
        "query": {
            "hybrid": {
                "queries": [
                    {
                        "match": {
                            "text": {
                                "query": query_text,
                                "minimum_should_match": "30%",
                            }
                        }
                    },
                    {
                        "knn": {
                            "passage_embedding": {
                                "vector": vector,
                                "k": 20    
                            }
                        }
                    }
                ]
            }
        }
    }
    search_pipeline = 'hybrid-search-pipeline'
    params={"search_pipeline": search_pipeline}

    response = client.search(
        index=vector_index_name,
        body=search_query,
        params=params
    )
    return response

def vector_search(vector, query_text):
    search_query = {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "size": 20,  # return at most 20 hits
        "query": {
            "knn": {
                "passage_embedding": {
                    "vector": vector,
                    "k": 20    
                }
            }
        }
    }
    params = {}

    response = client.search(
        index=vector_index_name,
        body=search_query,
        params=params
    )
    return response

def opensearch_get_data_from_response(response):
    return [int(res['_id']) for res in response['hits']['hits']]
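
`hybrid_search` refers to a search pipeline named `hybrid-search-pipeline`, but the notebook does not show how that pipeline was defined. A hybrid query needs a pipeline with a `normalization-processor` to normalize and combine the scores of its sub-queries. The sketch below builds one plausible definition; the min-max normalization, arithmetic-mean combination, and equal weights are assumptions, not the settings confirmed to be behind the results in this notebook.

```python
def build_hybrid_pipeline_body(weights=(0.5, 0.5)):
    """Return a search pipeline body that normalizes and combines sub-query scores."""
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    # Rescale each sub-query's scores to a common range.
                    "normalization": {"technique": "min_max"},
                    # Combine them; weights follow the order of the sub-queries
                    # (here: match first, knn second).
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
    }


pipeline_body = build_hybrid_pipeline_body()

# With a live cluster, the pipeline could be registered through the REST API, e.g.:
# client.transport.perform_request(
#     "PUT", "/_search/pipeline/hybrid-search-pipeline", body=pipeline_body
# )
```

Changing the weights shifts the balance between keyword and vector scores, so the hybrid MRR depends on this choice.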

In [49]:
# Full-text search
opensearch_full_all_result_ids = perform_search(test_df, full_search, opensearch_get_data_from_response, k=20, num_iterations=1)
calculate_mrr_byid(opensearch_full_all_result_ids, rel_set, k=20)

Iteration 1


Searching: 100%|██████████| 76/76 [00:00<00:00, 136.17it/s]

Total execution time for searches: 0.529 s
90th percentile of search times: 10.471 ms
99th percentile of search times: 17.033 ms
Average total execution time: 0.529 s (std: 0.000)
Average 90th percentile of search times: 10.471 ms (std: 0.000)
Average 99th percentile of search times: 17.033 ms (std: 0.000)
Average MRR over 76 instances: 0.6059





0.6058757633835034

In [50]:
# Hybrid search
opensearch_hybrid_all_result_ids = perform_search(test_df, hybrid_search, opensearch_get_data_from_response, k=20, num_iterations=1)
calculate_mrr_byid(opensearch_hybrid_all_result_ids, rel_set, k=20)

Iteration 1


Searching: 100%|██████████| 76/76 [00:01<00:00, 51.47it/s]

Total execution time for searches: 1.433 s
90th percentile of search times: 23.414 ms
99th percentile of search times: 31.956 ms
Average total execution time: 1.433 s (std: 0.000)
Average 90th percentile of search times: 23.414 ms (std: 0.000)
Average 99th percentile of search times: 31.956 ms (std: 0.000)
Average MRR over 76 instances: 0.6618





0.6617831716108561

In [51]:
# Vector search
opensearch_vector_all_result_ids = perform_search(test_df, vector_search, opensearch_get_data_from_response, k=20, num_iterations=1)
calculate_mrr_byid(opensearch_vector_all_result_ids, rel_set, k=20)

Iteration 1


Searching: 100%|██████████| 76/76 [00:00<00:00, 99.70it/s] 

Total execution time for searches: 0.731 s
90th percentile of search times: 10.339 ms
99th percentile of search times: 15.210 ms
Average total execution time: 0.731 s (std: 0.000)
Average 90th percentile of search times: 10.339 ms (std: 0.000)
Average 99th percentile of search times: 15.210 ms (std: 0.000)
Average MRR over 76 instances: 0.6198





0.6198466372808479
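
For a side-by-side view, the numbers printed in the cells above can be collected into one table. The sketch below copies the MRR and latency percentiles from those outputs; the column names and row labels are chosen here for illustration. Each figure comes from a single run (`num_iterations=1`), and the Vald search config uses `num=10` while the OpenSearch queries request 20 hits, so the comparison is indicative only.

```python
import pandas as pd

# Values copied from the outputs above (MRR at k=20, latencies in milliseconds).
results = pd.DataFrame(
    [
        ("Vald (vector)", 0.6157, 2.215, 2.519),
        ("OpenSearch full-text", 0.6059, 10.471, 17.033),
        ("OpenSearch hybrid", 0.6618, 23.414, 31.956),
        ("OpenSearch vector", 0.6198, 10.339, 15.210),
    ],
    columns=["engine", "mrr", "p90_ms", "p99_ms"],
)

best_mrr = results.loc[results["mrr"].idxmax(), "engine"]
fastest = results.loc[results["p90_ms"].idxmin(), "engine"]

print(results.sort_values("mrr", ascending=False).to_string(index=False))
```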

## Acknowledgments
We thank the people at the University of Glasgow who created the dataset, and HJMason, who published it as a Kaggle dataset.

This notebook draws on Riddhi Pawar's [Notebook](https://www.kaggle.com/code/rid17pawar/semantic-search-using-mean-of-vectors). Thank you.