# Assignment for week 5 - Hybrid Search evaluation notebook

This notebook walks through the steps for evaluating vector-only, keyword-only, and hybrid (weighted-sum) search.

1) To extract and split the text from the PDF files stored in ./pdfs, and to build the SQLite3 database and the FAISS index used for keyword and vector search, run the following code:

In [None]:
import os
import sys

# Build an absolute path to the src directory next to this notebook
module_path = os.path.abspath('./src')

# Add it to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)

from build_sqlite3_db_and_faiss_index import build
build()


Pdf chunk file saved!
Sqlite3 database file saved!
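The splitting logic inside build() is not shown in this notebook. As a rough illustration only, the sketch below shows one common way to split extracted text into fixed-size, overlapping character chunks. The function name `split_into_chunks` and the `chunk_size` and `overlap` defaults are hypothetical and are not taken from build_sqlite3_db_and_faiss_index.py.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into character chunks of at most chunk_size,
    where each chunk repeats the last `overlap` characters of the previous one.

    Hypothetical helper: the sizes are illustrative assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []

    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


if __name__ == "__main__":
    for chunk in split_into_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(repr(chunk))
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both neighbouring chunks, at the cost of storing some text twice.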


2) To load the document chunks and the FAISS index, then run vector search on a list of queries, run the following code:

In [5]:
from hybrid_search import connect_to_database, load_documents_and_faiss_index, sqlite_keyword_search, faiss_search, hybrid_search
import pprint

# Define a list of queries
queries = [
    "Normative LLMs Profiling",
    "semantic similarity",
    "recommender systems",
    "Extracting Temporal Commonsense from Text",
    "Ambiguity Categories and Benchmarks",
    "knowledge graph embedding",
    "Elastic Weight Consolidation algorithm",
    "Multiple Memory Systems",
    "Knowledge Graph Completion Models",
    "Speculative decoding"
]

load_documents_and_faiss_index()

for query in queries:
    print("query = \"", query, "\"")
    result = faiss_search(query, 6, verbose=True)
    print("\n")
    # pprint.pprint(result, sort_dicts=False)


FAISS index and documents files loaded
query = " Normative LLMs Profiling "

Vector Search Results, for query: 'Normative LLMs Profiling'
Rank Trunk-ID Nor-dist Filename           page      
------------------------------------------------------------
1    5434     0.565    2508.15361v1.pdf   25        
2    1682     0.553    2508.15250v1.pdf   1         
3    5542     0.543    2508.15361v1.pdf   34        
4    5169     0.539    2508.15361v1.pdf   4         
5    174      0.532    2508.15396v1.pdf   8         
6    1775     0.530    2508.15250v1.pdf   10        


query = " semantic similarity "

Vector Search Results, for query: 'semantic similarity'
Rank Trunk-ID Nor-dist Filename           page      
------------------------------------------------------------
1    319      0.602    2508.15396v1.pdf   22        
2    5212     0.544    2508.15361v1.pdf   8         
3    7934     0.525    2508.15370v1.pdf   17        
4    175      0.522    2508.15396v1.pdf   8         
5    2307    

3) To perform keyword search on the same queries using SQLite3 FTS, run the following code:

In [6]:
connect_to_database()

for query in queries:
    result = sqlite_keyword_search(query, 6, verbose=True)
    # pprint.pprint(result, sort_dicts=False)

Successfully opened the database

Keyword Search Results, for query: 'Normative LLMs Profiling'
Rank Trunk-ID Norm-score Filename           page      
------------------------------------------------------------
1    1682     0.724      2508.15250v1.pdf   1         
2    1674     0.673      2508.15250v1.pdf   1         
3    1675     0.156      2508.15250v1.pdf   1         

Keyword Search Results, for query: 'semantic similarity'
Rank Trunk-ID Norm-score Filename           page      
------------------------------------------------------------
1    2703     0.577      2508.15274v1.pdf   4         
2    187      0.504      2508.15396v1.pdf   9         
3    2460     0.494      2508.15464v1.pdf   6         
4    7177     0.494      2508.15658v1.pdf   7         
5    759      0.465      2508.15392v1.pdf   14        
6    2459     0.465      2508.15464v1.pdf   6         

Keyword Search Results, for query: 'recommender systems'
Rank Trunk-ID Norm-score Filename           page      
------
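The body of sqlite_keyword_search is not shown in this notebook, so the following is only a sketch of how keyword search over text chunks could be done with SQLite's FTS5 extension and its built-in bm25() ranking function. The table name `chunks`, the helper names, and the sample texts are hypothetical, and the sketch assumes the local SQLite build includes FTS5. The normalization behind the Norm-score column above is not reproduced here.

```python
import sqlite3


def build_demo_index(texts):
    """Create an in-memory FTS5 table and index the given texts (rowids start at 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
    conn.executemany(
        "INSERT INTO chunks(rowid, content) VALUES (?, ?)",
        list(enumerate(texts, start=1)),
    )
    return conn


def keyword_search(conn, query, k):
    """Return up to k (rowid, bm25_score) pairs, best match first.

    FTS5's bm25() returns lower (more negative) values for better matches,
    so an ascending sort puts the best match first.
    """
    # Quote each term so punctuation is not parsed as FTS5 query syntax.
    terms = ['"' + term.replace('"', '""') + '"' for term in query.split()]
    match_expr = " OR ".join(terms)
    return conn.execute(
        "SELECT rowid, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (match_expr, k),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_index([
        "knowledge graph embedding methods",
        "speculative decoding speeds up inference",
        "graph neural networks",
    ])
    print(keyword_search(demo, "knowledge graph", 6))
```

Joining the terms with OR lets a chunk match when it contains only some of the query words. Whether the notebook's own search uses OR or AND semantics is not stated.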

4) Run hybrid search (vector + keyword) on the queries, merge the two score lists with a weighted sum, and return the top 3 matches.

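The results show that the hybrid method ranks relevant chunks higher than the vector or keyword method alone. For the query 'Normative LLMs Profiling', for example, chunk 1682, which introduces the framework by name, is ranked second by vector search but first by hybrid search, because it scores well on both signals (0.553 vector, 0.724 keyword).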
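The weighted-sum merge can be sketched as follows. The function name `weighted_sum_merge` and its dictionary inputs are hypothetical and are not the implementation in hybrid_search.py. The formula combined = alpha * vector + (1 - alpha) * keyword, with a missing score treated as 0, is inferred from the printed results for 'Normative LLMs Profiling' (alpha=0.6), which it reproduces.

```python
def weighted_sum_merge(vector_scores, keyword_scores, alpha=0.6, top_k=3):
    """Merge two {doc_id: normalized_score} dicts with a weighted sum.

    A document missing from one result list contributes 0 for that signal.
    Returns (doc_id, combined, vector, keyword) tuples, best first.
    """
    merged = []
    for doc_id in set(vector_scores) | set(keyword_scores):
        vector = vector_scores.get(doc_id, 0.0)
        keyword = keyword_scores.get(doc_id, 0.0)
        combined = alpha * vector + (1 - alpha) * keyword
        merged.append((doc_id, combined, vector, keyword))
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged[:top_k]


# Scores copied from the vector and keyword outputs for 'Normative LLMs Profiling'.
vector_scores = {5434: 0.565, 1682: 0.553, 5542: 0.543, 5169: 0.539, 174: 0.532, 1775: 0.530}
keyword_scores = {1682: 0.724, 1674: 0.673, 1675: 0.156}

if __name__ == "__main__":
    for doc_id, combined, vector, keyword in weighted_sum_merge(vector_scores, keyword_scores):
        print(f"{doc_id:<6} {combined:.3f}  {vector:.3f}  {keyword:.3f}")
```

With alpha=0.6 the vector score carries slightly more weight, so a chunk found only by keyword search (such as 1674, at 0.4 * 0.673 = 0.269) falls below chunks found only by vector search.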

In [11]:
for query in queries:
    result = hybrid_search(query, 3)
    print('\r')
    pprint.pprint(result, sort_dicts=False)
    print('\r')


Hybrid Search Results for query: 'Normative LLMs Profiling'
alpha=0.6
Rank Doc ID Combined Vector   Keyword  Filename           Page      
------------------------------------------------------------
1    1682   0.621    0.553    0.724    2508.15250v1.p     1         
2    5434   0.339    0.565    0.000    2508.15361v1.p     25        
3    5542   0.326    0.543    0.000    2508.15361v1.p     34        

[{'doc_idx': 1682,
  'filename': '2508.15250v1.pdf',
  'page': 1,
  'content': 't measurements often focus solely on eth-\n'
             'ical judgments, without considering how factors\n'
             'like professional background, language environ-\n'
             'ment (Changjiang et al., 2024), and model param-\n'
             'eters (Achiam et al., 2023) interact. To fill this gap,\n'
             'we propose the EMNLP (Educator-role Moral and\n'
             'Normative LLMs Profiling) framework for compre-\n'
             'hensive testing and analysis of LLMs’ personality\n'
  

5) A FastAPI endpoint "/hybrid-search" is implemented in hybrid_search.py. To test it, run the following code to start the server, then open http://127.0.0.1:8000 in a browser (the interactive API docs are served at /docs):

In [13]:
! python3 hybrid_search.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:     Will watch for changes in these directories: ['/home/ehan/evanhan_homework/evanhan_homework/class5']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [1705942] using WatchFiles
INFO:     Started server process [1706006]
INFO:     Waiting for application startup.
FAISS index and documents files loaded
Successfully opened the database
INFO:     Application startup complete.
INFO:     127.0.0.1:57564 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:57564 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:57580 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:57580 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     127.0.0.1:58076 - "POST /hybrid-search/ HTTP/1.1" 200 OK
INFO:     
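The endpoint can also be called from Python instead of a browser. The sketch below uses only the standard library and sends a POST request to /hybrid-search/, the path that appears in the server log. The JSON field names `query` and `k` are assumptions, since the request schema is not shown in this notebook. Check the schema listed at /docs and adjust the payload to match.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_hybrid_search_request(query, k=3, base_url=BASE_URL):
    """Build a POST request for the /hybrid-search/ endpoint.

    The payload fields "query" and "k" are hypothetical placeholders.
    """
    payload = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/hybrid-search/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_hybrid_search(query, k=3):
    """Send the request and decode the JSON response (the server must be running)."""
    request = build_hybrid_search_request(query, k)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running in another process, `call_hybrid_search("Speculative decoding")` would return the decoded JSON response, provided the assumed field names match the real schema.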

Note: The PDF files processed in this assignment were downloaded in the previous assignment by the get_latest_arxiv() function in documents_downloading.py. To download them again, run:

from documents_downloading import get_latest_arxiv
get_latest_arxiv(query="cat:cs.CL", max_results=50)
