# Indexing BM25 on Wikipedia Corpus

Source is Pyserini's own example:

https://github.com/castorini/pyserini/blob/master/docs/usage-index.md#building-a-bm25-index-embeddable-python-implementation

Steps omitted: decompressing the `jsonl.gz` archive, first with my file manager and then with `tar -xvf`. (Either the file is compressed twice, or I used the wrong decompression method.)
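
For reference, the omitted step can be sketched in Python. This is only a sketch under assumptions: the function name `unpack_gz_then_tar` and all file names are hypothetical, and it assumes the download is a gzip stream that may or may not wrap a tar archive. It gunzips first and then checks whether the result is a tar file, which covers both explanations above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_gz_then_tar(archive: Path, out_dir: Path) -> list[Path]:
    """Gunzip `archive`; if the result is itself a tar archive, extract it too."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Gunzip into a temporary file so tar members cannot collide with it.
    tmp = out_dir / (archive.name + ".tmp")
    with gzip.open(archive, "rb") as src, open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)

    if tarfile.is_tarfile(tmp):
        # Second layer: a tar archive. Only extract archives you trust.
        with tarfile.open(tmp) as tar:
            names = [m.name for m in tar.getmembers() if m.isfile()]
            tar.extractall(out_dir)
        tmp.unlink()
        return [out_dir / name for name in names]

    # Single layer: the gunzipped file is already the payload.
    target = out_dir / archive.with_suffix("").name
    tmp.replace(target)
    return [target]
```

The returned paths can then be placed in the directory passed to `--input` in the indexing command.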

## 1. Indexing

This uses `DefaultLuceneDocumentGenerator`; with the indexer's default settings, stopword removal and stemming are already applied.

- My machine has 14 threads, so I use half of them. More might be faster, but 7 seemed safe.
- `--storePositions`, `--storeDocvectors`, and `--storeRaw` increase index size and build time, but we need the raw documents (`--storeRaw`) for reranking.
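
To make the role of the raw documents concrete, here is a small sketch of what one stored document looks like and how it is decoded later in this notebook. The record is a made-up example in the `id`/`contents` shape that `JsonCollection` reads, and `contents_from_raw` is a hypothetical helper, not a Pyserini function.

```python
import json

# Hypothetical record: one JSON object per line, with "id" and "contents".
record = {
    "id": "0",
    "contents": '"Deaf basketball"\nDeaf basketball is basketball played by deaf people.',
}
raw_line = json.dumps(record)  # what the raw field holds for this document


def contents_from_raw(raw: str) -> str:
    """Decode a stored raw JSON document and return its passage text."""
    return json.loads(raw)["contents"]


# In this corpus the quoted title sits on the first line of "contents".
title, _, body = contents_from_raw(raw_line).partition("\n")
```

Without the raw documents, the searcher would return only document IDs and scores, and the reranker would have no text to score.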

In [1]:
# !python -m pyserini.index.lucene \
#   --collection JsonCollection \
#   --input data/data00/jiajie_jin/flashrag_indexes/wiki_dpr_100w \
#   --index index/wiki_dump \
#   --generator DefaultLuceneDocumentGenerator \
#   --threads 7 \
#   --storePositions --storeDocvectors --storeRaw

In [2]:
from pyserini.search.lucene import LuceneSearcher
import polars as pl
import json

query = 'Deaf Basketball'

searcher = LuceneSearcher('index/wiki_dump')
hits = searcher.search(query, k=1000)

# --storeRaw keeps each document's original JSON; "contents" holds the passage text.
docs = [json.loads(hit.lucene_document.get('raw'))["contents"] for hit in hits]
docs[:2]

Nov 14, 2025 4:02:47 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false


['"Deaf basketball"\nDeaf basketball Deaf basketball is basketball played by deaf people. Sign language is used to communicate whistle blows and communication between players. The game played by deaf people is organized with national and international associations including Deaf Basketball Australia, Deaf Basketball UK and United States of America Deaf Basketball. Deaf basketball has gained great visibility because of athlete like Lance Allred who played basketball with the National Basketball Association\'s (NBA) Cleveland Cavaliers. Allred is Hard of Hearing, with a 75-80% hearing loss wearing a hearing aid. He later on continued to play basketball professionally in the European basketball leagues. Another',
 '"Deaf basketball"\ndeaf basketball. DIBF encourages the growth and development of deaf basketball in all nationals of the world through an organized program of education and instruction. The Federation schedules and conducts all international contests and championships in deaf 

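### Two RAG variants (https://docs.langchain.com/oss/python/langchain/rag#rag-chains)

The first variant exposes retrieval as a tool that the agent can decide to call; the second always injects the retrieved context into the system prompt.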


In [3]:
from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
from IPython.display import display, Markdown
from sentence_transformers import CrossEncoder


qwen = init_chat_model(model="ollama:qwen2.5:7B").bind(logprobs=True)
bm25 = LuceneSearcher('index/wiki_dump')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')
query = "what is deaf basketball and what countries have this?"

def get_top_3_rerank(query: str) -> str:
    """Retrieve BM25 candidates for the query and return the three best after cross-encoder reranking.

    Args:
        query (str): the user query

    Returns:
        str: the top three documents, joined by newlines.
    """
    # Stage 1: wide BM25 recall.
    hits = bm25.search(query, k=1000)
    docs = [json.loads(hit.lucene_document.get('raw'))["contents"] for hit in hits]

    # Stage 2: the cross-encoder scores every (query, doc) pair and keeps the best three.
    top_k = pl.DataFrame(cross_encoder.rank(query, docs, top_k=3, return_documents=True))

    return "\n".join(top_k.get_column('text').to_list())

agent = create_agent(
    model=qwen,
    tools=[get_top_3_rerank],
    system_prompt="You are a helpful assistant",
)

# Run the agent
response = agent.invoke(
    {"messages": [{"role": "user", "content": query}]}
)
display(Markdown(response["messages"][-1].content))

Deaf basketball is a form of basketball played by individuals who are deaf or hard of hearing. Communication on the court often involves using sign language to convey whistle blows and other important information. There are several national organizations dedicated to promoting deaf basketball, such as Deaf Basketball Australia, Deaf Basketball UK, and United States of America Deaf Basketball.

The sport has gained significant recognition due to athletes like Lance Allred, who played for the National Basketball Association's (NBA) Cleveland Cavaliers despite his 75-80% hearing loss. After his NBA career, he continued playing professionally in European basketball leagues.

Internationally, there is an organization called the Deaf International Basketball Federation (DIBF), which was officially founded to organize and promote deaf basketball globally. The first DIBF Central Board consisted of representatives from countries like Finland, Australia, Sweden, the United States, China, Ukraine, Greece, Lithuania, Estonia, and Sweden.

Some notable countries where deaf basketball is practiced include:

- **Australia**
- **Finland**
- **United States**
- **Sweden**
- **China**
- **Ukraine**
- **Greece**
- **Lithuania**
- **Estonia**

This sport not only brings together players from different backgrounds but also promotes the visibility and inclusion of deaf athletes in mainstream sports.
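
The tool in this cell follows a retrieve-then-rerank pattern: a cheap first stage returns many candidates, and a more expensive scorer picks the few that reach the model. The sketch below isolates the second stage so it runs without an index or a model. `rerank_top_k` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for the cross-encoder, not an approximation of it.

```python
from typing import Callable


def rerank_top_k(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    k: int = 3,
) -> list[str]:
    """Return the k candidates with the highest score for the query."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also occur in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    "Basketball is a team sport",
    "Deaf basketball is played in many countries",
    "Cooking pasta",
]
top_two = rerank_top_k("deaf basketball countries", candidates, overlap_score, k=2)
```

Swapping `overlap_score` for a function that wraps the cross-encoder gives the same shape as `get_top_3_rerank`.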

In [4]:
from langchain.agents.middleware import dynamic_prompt, ModelRequest


@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Build a system prompt that embeds the retrieved context.

    Args:
        request (ModelRequest): The model request; the last message in its state is taken as the user query.

    Returns:
        str: a system prompt containing the top three reranked documents.
    """
    last_query = request.state["messages"][-1]

    retrieved_docs = get_top_3_rerank(last_query.content)

    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        f"\n\n{retrieved_docs}"
    )

    return system_message


agent = create_agent(qwen, tools=[], middleware=[prompt_with_context])
response = agent.invoke(
    {"messages": [{"role": "user", "content": query}]}
)
display(Markdown(response["messages"][-1].content))

Deaf basketball refers to basketball played by individuals who are deaf or hard of hearing. Communication during games and practices relies heavily on sign language, with whistle blows often accompanied by visual signals for clarity. This form of the sport has a growing global presence and is organized through various national and international associations.

Several countries have active deaf basketball programs:

1. **Australia**: Deaf Basketball Australia organizes local competitions and participates in international events.
2. **United Kingdom (UK)**: Deaf Basketball UK manages and promotes deaf basketball at both amateur and competitive levels.
3. **United States**: United States of America Deaf Basketball is involved in organizing tournaments and events for deaf players.
4. **Slovenia**: Slovenian player Miha Zupan has gained international recognition, showcasing the level of skill that can be achieved within this community.
5. **Sweden**: The sport has a presence here as evidenced by its recognition during the Winter Deaflympics in 2003.

These organizations contribute to promoting and supporting deaf basketball through training camps, tournaments, and other activities. Additionally, the Deaf International Basketball Federation (DIBF), recognized by FIBA, plays a crucial role in governing international competitions and ensuring that deaf basketball remains an integral part of global sports events like the Deaflympics.

### Using Hugging Face (for white-box access :()

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Code mostly taken directly from huggingface: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def prompt_with_context_from_string(query: str) -> str:
    """Build a system prompt that embeds the retrieved context.

    Args:
        query (str): The human query as a string.

    Returns:
        str: a system prompt containing the top three reranked documents.
    """
    retrieved_docs = get_top_3_rerank(query)

    system_message = (
        "You are a helpful assistant. You answer in markdown. Use the following context in your response:"
        f"\n\n{retrieved_docs}"
    )

    return system_message

messages = [
    {"role": "system", "content": prompt_with_context_from_string(query)},
    {"role": "user", "content": query}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
display(Markdown(response))

Deaf basketball is a sport played by deaf individuals using sign language to communicate. It involves physical skills such as dribbling, shooting, and passing. There are several national and international associations for deaf basketball, including Deaf Basketball Australia, Deaf Basketball UK, and the United States of America.

Some notable deaf basketball players include Lance Allred, who played for the Cleveland Cavaliers of the National Basketball Association, and Miha Zupan, who was born with a hearing impairment but played power forward for the Slovenian National Basketball Team. These players have helped promote deaf basketball worldwide through their advocacy efforts.
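
One detail in the cell above is easy to miss: for a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the list comprehension slices off the prompt before decoding. The sketch below shows the same slicing on plain lists. The token IDs are made up and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(input_ids: list[list[int]], output_ids: list[list[int]]) -> list[list[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Made-up token IDs: each output starts with a copy of its prompt.
prompt_batch = [[101, 7, 8], [101, 9]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

continuations = strip_prompt(prompt_batch, generated_batch)
```

Decoding without this step would repeat the system prompt and the retrieved context in the displayed answer.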

In [9]:
import TruthTorchLM as ttlm
import torch

In [10]:
# Define truth methods
mars = ttlm.truth_methods.MARS()
eccentricity = ttlm.truth_methods.EccentricityUncertainty()
truth_methods = [mars, eccentricity]


Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



There are 2 methods for similarity: semantic similarity and jaccard score. The default method is semantic similarity. If you want to use jaccard score, please set method_for_similarity="jaccard". Please refer to https://arxiv.org/pdf/2305.19187 for more information.
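
For intuition about the Jaccard option mentioned in that message, here is a minimal sketch of Jaccard similarity between two generations, computed over lowercase whitespace-separated words. The tokenization is an assumption made for illustration; TruthTorchLM may compute its score differently.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


similarity = jaccard(
    "Deaf basketball is a sport",
    "Deaf basketball is played worldwide",
)
```

Here the two sentences share three of seven distinct words. Unlike semantic similarity, this score only sees surface word overlap, so paraphrases with different wording score low.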


In [11]:
output_hf_model = ttlm.generate_with_truth_value(
    model=model,
    tokenizer=tokenizer,
    messages=messages,
    truth_methods=truth_methods,
    max_new_tokens=1024
)

In [12]:
output_hf_model

{'generated_text': "Deaf basketball is a sport played by individuals who are deaf or hard-of-hearing. It uses sign language to communicate and involves whistle blows from coaches or officials. Players use their hands to signal when they want to shoot the ball. Deaf basketball has gained popularity due to the presence of notable players such as Lance Allred, a former NBA player who played for the Cleveland Cavaliers.\n\nThe Deaf International Basketball Federation (DIBF) is a world governing body for international deaf basketball. It includes representatives from around the globe, supporting organizations like the International Basketball Federation (FIBA), Deaflympics, and other confederations dedicated to deaf sports. The DFB was established in 2003 to recognize and represent deaf athletes globally.\n\nSeveral international and national basketball competitions have also been held in collaboration with the Deaflympics. These events showcase deaf athletes' talents and provide opportunit