# Assignment for week 5 - Hybrid Search evaluation notebook

This notebook presents the steps for evaluating vector-only, keyword-only, and hybrid (weighted-sum) search.

### *) To install the necessary modules, run the following code

In [9]:
! pip install -r ./src/requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
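
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a notebook cell before any code that uses `tokenizers` and before any `!` shell command; whether `"false"` or `"true"` suits this workload is a judgment call, and `"false"` is simply the value the library falls back to here.

```python
import os

# Set the variable named in the huggingface/tokenizers warning.
# Assumption: this cell runs before tokenizers is used and before any
# subprocess is forked (for example by a "!" shell command).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```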




### 1) To extract and split the text from the PDF files stored at ./pdfs, and to build the SQLite3 database and FAISS index for keyword/vector search, run the following code

In [1]:
from src.build_sqlite3_db_and_faiss_index import build
build()

  from .autonotebook import tqdm as notebook_tqdm


Pdf chunk file saved!
Sqlite3 database file saved!
Number of vectors in the FAISS index: 8390
FAISS index created, populated, and saved!


### 2) To load the document chunks and the FAISS index, then execute vector search on the queries, run the following code

In [None]:
from src.hybrid_search import connect_to_database, load_documents_and_faiss_index, sqlite_keyword_search, faiss_search, hybrid_search
import pprint

# Define a list of queries
queries = [
    "Normative LLMs Profiling", # 1674, 1675, 1682
    "Automatic Mixed Precision", # 2147, other chunk in same page
    "Sampling Number", #7397, 7398
    "knowledge distillation", # 1571, 1572, 2104, 2730, 3126, 6206, 6274, 6300, 7522, 7523, 7534, 7630, 8173, 8174  (13)
    "Knowledge Graph Completion Models", # 1103/04, 1184, 1189, 1197, 1199, 1205/06, 3846/48 (7)  
    "small language models",
    "knowledge graph embedding",
    "Elastic Weight Consolidation algorithm",
    "Multiple Memory Systems",
    "Speculative decoding"
]

load_documents_and_faiss_index()

for query in queries:
    print("query = \"", query, "\"")
    result = faiss_search(query, 10, verbose=True)
    print("\n")
    # pprint.pprint(result, sort_dicts=False)


FAISS index and documents files loaded
query = " Normative LLMs Profiling "

Vector Search Results, for query: 'Normative LLMs Profiling'
Rank Trunk-ID Nor-dist Filename           page      
------------------------------------------------------------
1    5434     0.565    2508.15361v1.pdf   25        
2    1682     0.553    2508.15250v1.pdf   1         
3    5542     0.543    2508.15361v1.pdf   34        
4    5169     0.539    2508.15361v1.pdf   4         
5    174      0.532    2508.15396v1.pdf   8         
6    1775     0.530    2508.15250v1.pdf   10        
7    1691     0.530    2508.15250v1.pdf   2         
8    6183     0.529    2508.15283v1.pdf   11        
9    4215     0.528    2508.15746v1.pdf   14        
10   7347     0.526    2508.15648v1.pdf   3         


query = " Automatic Mixed Precision "

Vector Search Results, for query: 'Automatic Mixed Precision'
Rank Trunk-ID Nor-dist Filename           page      
------------------------------------------------------------
1

### 3) To perform keyword search using SQLite3 FTS for the queries, run the following code

In [28]:
connect_to_database()

for query in queries:
    result = sqlite_keyword_search(query, 6, verbose=True)
    # pprint.pprint(result, sort_dicts=False)

Successfully opened the database

Keyword Search Results, for query: 'Normative LLMs Profiling'
Rank Trunk-ID Norm-score Filename           page      
------------------------------------------------------------
1    1682     0.724      2508.15250v1.pdf   1         
2    1674     0.673      2508.15250v1.pdf   1         
3    1675     0.156      2508.15250v1.pdf   1         

Keyword Search Results, for query: 'Automatic Mixed Precision'
Rank Trunk-ID Norm-score Filename           page      
------------------------------------------------------------
1    2153     0.740      2508.15617v1.pdf   5         
2    2147     0.260      2508.15617v1.pdf   5         

Keyword Search Results, for query: 'Sampling Number'
Rank Trunk-ID Norm-score Filename           page      
------------------------------------------------------------
1    8294     0.797      2508.15244v1.pdf   6         
2    8270     0.720      2508.15244v1.pdf   4         
3    3015     0.701      2508.15371v1.pdf   3        
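
For readers unfamiliar with SQLite full-text search, the following is a minimal sketch of a keyword lookup with the FTS5 extension and its built-in `bm25()` ranking function. It is not the implementation behind `sqlite_keyword_search`: the table name `chunks`, the helper names, the phrase-style matching, and the three demo texts are all assumptions made for illustration, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3


def build_demo_index(chunks):
    """Create an in-memory FTS5 table from {chunk_id: text} (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
    conn.executemany(
        "INSERT INTO chunks(rowid, content) VALUES (?, ?)", chunks.items()
    )
    return conn


def keyword_search(conn, query, k):
    """Return up to k (chunk_id, bm25_score) pairs, best match first."""
    # Wrap the query in double quotes so FTS5 treats it as one phrase.
    phrase = '"' + query.replace('"', '""') + '"'
    return conn.execute(
        "SELECT rowid, bm25(chunks) FROM chunks "
        "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (phrase, k),
    ).fetchall()


# Made-up demo texts; the IDs are not chunk IDs from the assignment data.
demo_chunks = {
    1: "We enable automatic mixed precision to speed up training.",
    2: "Automatic Mixed Precision (AMP) reduces memory use.",
    3: "The sampling number controls how many candidates are drawn.",
}
conn = build_demo_index(demo_chunks)
amp_hits = keyword_search(conn, "Automatic Mixed Precision", 6)
sampling_hits = keyword_search(conn, "Sampling Number", 6)
print(amp_hits)
print(sampling_hits)
```

`bm25()` returns lower values for better matches, which is why the query sorts in ascending order. The normalized scores printed in the output above come from the notebook's own code and are not reproduced by this sketch.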

### 4) Execute hybrid search (vector + keyword) for the queries, apply the weighted-sum merge logic, and get the top 3 matches

The search results suggest that the hybrid method is more accurate than vector-only search, though not consistently more accurate than keyword-only search (see the Recall@k table in section 6).
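
The weighted-sum merge can be sketched as follows. This is an illustration inferred from the printed scores, not the code of `hybrid_search`: the function name `weighted_sum_merge` and its parameters are hypothetical. It assumes both score sets are already normalized, that a document missing from one result list scores 0 there, and that the combined score is `alpha * vector + (1 - alpha) * keyword`, which is consistent with the `alpha=0.6` output below (0.6 × 0.553 + 0.4 × 0.724 ≈ 0.621 for chunk 1682).

```python
def weighted_sum_merge(vector_scores, keyword_scores, alpha=0.6, top_k=6):
    """Merge two {doc_id: normalized_score} dicts by weighted sum.

    Hypothetical helper: a document absent from one dict contributes 0.0
    for that component.
    """
    doc_ids = set(vector_scores) | set(keyword_scores)
    combined = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Scores taken from the "Normative LLMs Profiling" outputs in this notebook
# (top three vector hits and the three keyword hits only).
vector_scores = {5434: 0.565, 1682: 0.553, 5542: 0.543}
keyword_scores = {1682: 0.724, 1674: 0.673, 1675: 0.156}

ranked = weighted_sum_merge(vector_scores, keyword_scores)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Because only three vector hits are included here, the ranking below the top three differs from the notebook's full output, where further vector candidates outscore chunks 1674 and 1675.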

In [29]:
for query in queries:
    result = hybrid_search(query, 6)
    print('\r')
    pprint.pprint(result, sort_dicts=False)
    print('\r')


Hybrid Search Results for query: 'Normative LLMs Profiling'
alpha=0.6
Rank Doc ID Combined Vector   Keyword  Filename           Page      
------------------------------------------------------------
1    1682   0.621    0.553    0.724    2508.15250v1.p     1         
2    5434   0.339    0.565    0.000    2508.15361v1.p     25        
3    5542   0.326    0.543    0.000    2508.15361v1.p     34        
4    5169   0.323    0.539    0.000    2508.15361v1.p     4         
5    174    0.319    0.532    0.000    2508.15396v1.p     8         
6    1775   0.318    0.530    0.000    2508.15250v1.p     10        

[{'doc_idx': 1682,
  'filename': '2508.15250v1.pdf',
  'page': 1,
  'content': 't measurements often focus solely on eth-\n'
             'ical judgments, without considering how factors\n'
             'like professional background, language environ-\n'
             'ment (Changjiang et al., 2024), and model param-\n'
             'eters (Achiam et al., 2023) interact. To fill thi

### 5) A FastAPI endpoint "/hybrid-search" has been implemented in hybrid_search.py. To test the API, run the following code to start the server, then open http://127.0.0.1:8000 in a browser

In [6]:
! python3 ./src/hybrid_search.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:     Will watch for changes in these directories: ['/home/ehan/evanhan_homework/evanhan_homework/class5']
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [1809280] using WatchFiles
INFO:     Started server process [1809332]
INFO:     Waiting for application startup.
FAISS index and documents files loaded
Successfully opened the database
INFO:     Application startup complete.
^C
INFO:     Shutting down
INFO:     Finished server process [1809332]
INFO:     Stopping reloader process [1809280]
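
As an alternative to the browser, the endpoint can be called from a separate Python process while the server is running. The sketch below uses only the standard library. The base URL comes from the Uvicorn log above and the path `/hybrid-search` from the heading; the use of a GET request, the query-parameter names `query` and `k`, and a JSON response are assumptions that must be checked against hybrid_search.py.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # address reported by Uvicorn in the log


def build_hybrid_search_url(query, k=6, base_url=BASE_URL):
    """Build the request URL; "query" and "k" are hypothetical parameter names."""
    return f"{base_url}/hybrid-search?{urlencode({'query': query, 'k': k})}"


def call_hybrid_search(query, k=6):
    """Send a GET request (assumed method) and decode a JSON body (assumed format)."""
    with urlopen(build_hybrid_search_url(query, k), timeout=10) as response:
        return json.load(response)


# Building the URL needs no running server; call_hybrid_search() does.
url = build_hybrid_search_url("Automatic Mixed Precision")
print(url)
```

FastAPI also serves interactive documentation at `/docs` by default, which shows the actual parameters the endpoint accepts.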


**Note:** The PDF files processed in this assignment were downloaded in the previous assignment by the get_latest_arxiv() function in documents_downloading.py. To download them again, run:

    from src.documents_downloading import get_latest_arxiv
    get_latest_arxiv(query="cat:cs.CL", max_results=50)


### 6) Remarks

Recall@k was computed for the following five queries, whose relevant chunk IDs were identified from the occurrences of the search terms. With alpha = 0.6, the hybrid method matched or beat vector-only search at k = 6 on every query, but it did not outperform keyword-only search overall: it trailed on queries 1 and 2, tied on queries 3 and 4, and led only on query 5.

    "Normative LLMs Profiling", # chunk_id: 1674, 1675, 1682 (total 3)
    "Automatic Mixed Precision", # chunk_id: 2147, other chunk in same page (total 2)
    "Sampling Number", # chunk_id: 7397, 7398 (total 2)
    "knowledge distillation", # chunk_id: 1571, 1572, 2104, 2730, 3126, 6206, 6274, 6300, 7522, 7523, 7534, 7630, 8173, 8174  (total 13)
    "Knowledge Graph Completion Models", # chunk_id: 1103/04, 1184, 1189, 1197, 1199, 1205/06, 3846/48 (total 8)




| Query | Recall@6 Vector | Recall@10 Vector | Recall@6 Keyword | Recall@6 Hybrid |
|-------|-----------------|------------------|------------------|-----------------|
| 1. "Normative LLMs Profiling" | 0.33 | 0.33 | 1 | 0.33 |
| 2. "Automatic Mixed Precision" | 0 | 0 | 1 | 0.5 |
| 3. "Sampling Number" | 1 | 1 | 1 | 1 |
| 4. "knowledge distillation" | 0.23 | 0.46 | 0.38 | 0.38 |
| 5. "Knowledge Graph Completion Models" | 0.62 | 1 | 0.37 | 0.62 |
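
For reference, Recall@k is the fraction of a query's relevant chunks that appear among the top k results. The sketch below shows one way to compute it; `recall_at_k` is a hypothetical helper, not a function from this repository. The chunk IDs are copied from the "Normative LLMs Profiling" outputs earlier in the notebook, and the results agree with row 1 of the table.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant IDs found in the first k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


# Query 1: relevant chunks and ranked results as printed in the notebook.
relevant = {1674, 1675, 1682}
vector_top10 = [5434, 1682, 5542, 5169, 174, 1775, 1691, 6183, 4215, 7347]
keyword_top = [1682, 1674, 1675]
hybrid_top6 = [1682, 5434, 5542, 5169, 174, 1775]

print(round(recall_at_k(vector_top10, relevant, 6), 2))
print(round(recall_at_k(vector_top10, relevant, 10), 2))
print(round(recall_at_k(keyword_top, relevant, 6), 2))
print(round(recall_at_k(hybrid_top6, relevant, 6), 2))
```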