# 🤝 Contribute to **ImpliRet**

Thank you for contributing!  
You can help in **two** ways:

1. **Add a new retriever implementation** and open a PR.  
2. **Submit ready‑made retrieval results** so we can include them in the leaderboard.

> If you’re only interested in option 2, **skip Step 1** below.


## Step 1 — Add a custom retriever  
*(skip if you only want to upload results)*

Your retriever should be a **Python class** with two methods:

```python
class MyRetriever:
    def __init__(self, corpus: list[str], k: int):
        """Store `corpus` and set the top‑k you’ll retrieve."""

    def retrieve_data(self, query: str) -> tuple[list[str], list[float]]:
        """Return *k* documents and their scores for the given `query`."""
```
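Before wiring a class into the pipeline, it can help to exercise it on a toy corpus. The sketch below is one way to do that; `check_retriever` and `WordOverlapRetriever` are hypothetical names introduced here for illustration and are not part of ImpliRet, and the length check assumes a retriever returns `min(k, len(corpus))` documents.

```python
from typing import List, Tuple


class WordOverlapRetriever:
    """Toy retriever (hypothetical) used only to exercise the check below."""

    def __init__(self, corpus: List[str], k: int):
        self.corpus = corpus
        self.k = k

    def retrieve_data(self, query: str) -> Tuple[List[str], List[float]]:
        # Score each document by the number of query words it contains.
        words = set(query.lower().split())
        scored = [(len(words & set(doc.lower().split())), doc) for doc in self.corpus]
        # Stable sort: ties keep their original corpus order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = scored[: self.k]
        return [doc for _, doc in top], [float(score) for score, _ in top]


def check_retriever(retriever_cls, corpus: List[str], k: int, query: str) -> None:
    """Raise AssertionError if `retriever_cls` does not match the two-method interface."""
    retriever = retriever_cls(corpus, k)
    docs, scores = retriever.retrieve_data(query)
    assert len(docs) == len(scores), "docs and scores must have the same length"
    # Assumption: k results are expected unless the corpus is smaller than k.
    assert len(docs) == min(k, len(corpus)), "unexpected number of results"
    assert all(isinstance(d, str) for d in docs), "documents must be strings"
    assert all(isinstance(s, float) for s in scores), "scores must be floats"


toy_corpus = ["red apple", "green pear", "red pear"]
check_retriever(WordOverlapRetriever, toy_corpus, k=2, query="red pear")
```

Passing your own class in place of `WordOverlapRetriever` gives a quick signal that the constructor and `retrieve_data` have the expected shapes.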

Below is a minimal example that wraps `rank_bm25`.  
Feel free to swap in any library or algorithm!


```python
# 🔧 Example skeleton: adapt it to your own retriever
from typing import List, Tuple

import numpy as np
from rank_bm25 import BM25Okapi


class MyBM25Retriever:
    def __init__(self, corpus: List[str], k: int = 10):
        self.k = k
        self.corpus = corpus
        # Whitespace tokenization keeps the example short.
        tokenized = [doc.split() for doc in corpus]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve_data(self, query: str) -> Tuple[List[str], List[float]]:
        tokenized_q = query.split()
        scores = np.array(self.bm25.get_scores(tokenized_q))
        # Indices of the k highest-scoring documents, best first.
        ranked_idx = scores.argsort()[::-1][:self.k]
        return [self.corpus[i] for i in ranked_idx.tolist()], scores[ranked_idx].tolist()


# --- minimal working example ---
if __name__ == "__main__":
    corpus = [
        "Apple unveils the new iPhone today",
        "Bananas are an excellent source of potassium",
        "Python is a popular programming language",
        "The iPhone features an improved camera system",
        "Oranges are rich in vitamin C",
    ]

    retriever = MyBM25Retriever(corpus, k=2)
    # retrieve_data returns the documents themselves, not their indices.
    docs, scores = retriever.retrieve_data("iPhone camera")

    print("Top-2 results:")
    for rank, (doc, score) in enumerate(zip(docs, scores), 1):
        print(f"{rank}. doc= {doc} · score={score:.4f}")
```

```
Top-2 results:
1. doc= The iPhone features an improved camera system · score=1.3770
2. doc= Apple unveils the new iPhone today · score=0.3462
```


### Where to put your code

1. Save your file (for example, the contents of the example above) as `Retrieval/retrievals/MY_BM_retriever.py` (or under another name).  
2. Add an import branch to `Retrieval/retrieve_indexing.py` **lines 28–37**:

```python
if retriever_name.lower() == "my_bm25":
    try:
        from retrievals.MY_BM_retriever import MyBM25Retriever
    except ImportError:
        # Fall back to the package-qualified import path.
        try:
            from Retrieval.retrievals.MY_BM_retriever import MyBM25Retriever
        except ImportError:
            raise ImportError("MyBM25Retriever not found")
    retriever_module = MyBM25Retriever

3. Run the pipeline:

```bash
python Retrieval/retrieve_indexing.py \
       --output_folder Retrieval/results/ \
       --category arithmetic \
       --discourse multispeaker \
       --retriever_name my_bm25
```
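Since results are needed for several pools, the command above can be repeated from a small script. This is only a sketch: `build_command` and `run_pools` are hypothetical helpers, the flags are the ones shown in the command above, and the list of `(category, discourse)` pairs is left for you to fill in with the pools you want to run.

```python
import subprocess
from typing import Iterable, List, Tuple


def build_command(category: str, discourse: str, retriever_name: str,
                  output_folder: str = "Retrieval/results/") -> List[str]:
    """Assemble the argument list for one pipeline run."""
    return [
        "python", "Retrieval/retrieve_indexing.py",
        "--output_folder", output_folder,
        "--category", category,
        "--discourse", discourse,
        "--retriever_name", retriever_name,
    ]


def run_pools(pools: Iterable[Tuple[str, str]], retriever_name: str) -> None:
    """Run the pipeline once per (category, discourse) pair, stopping on the first failure."""
    for category, discourse in pools:
        subprocess.run(build_command(category, discourse, retriever_name), check=True)


# Example (run from the repository root; extend the list with the remaining pools):
# run_pools([("arithmetic", "multispeaker")], "my_bm25")
```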


## Step 2 — Provide results only

For **each** of the six pools you must create a JSONL file named

```
{category}_{discourse}_{retriever_name}_index.jsonl
# e.g. arithmetic_multispeaker_bm25_index.jsonl
```

* 1 500 lines — exactly one per query  
* Keys per line:

| key | type | description |
|-----|------|-------------|
| `question` | str | The query text |
| `gold_index` | int | Index of the ground‑truth document, which is always the row index |
| `index_score_tuple_list` | list[[int, float]] | *k ≥ 10* tuples of `(doc_index, score)` sorted by score (desc) |

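Example record (wrapped here for readability; in the file each record must sit on a single line, and `...` stands for the remaining pairs):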

```json
{
  "question": "What is the 2024 model price?",
  "gold_index": 0,
  "index_score_tuple_list": [[273, 4.97], [102, 1.23], ...]
}
```
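If you produce the files from your own code, the naming pattern and record layout described above could be generated along these lines. This is a minimal sketch: `result_filename` and `make_record` are hypothetical helpers, not part of the ImpliRet codebase, and they assume you already have document indices and scores for each query.

```python
import json
from typing import List


def result_filename(category: str, discourse: str, retriever_name: str) -> str:
    """Build the file name following {category}_{discourse}_{retriever_name}_index.jsonl."""
    return f"{category}_{discourse}_{retriever_name}_index.jsonl"


def make_record(question: str, gold_index: int,
                doc_indices: List[int], scores: List[float]) -> str:
    """Serialize one query's results as a single JSONL line."""
    # Sort (doc_index, score) pairs by score, highest first.
    pairs = sorted(zip(doc_indices, scores), key=lambda pair: pair[1], reverse=True)
    record = {
        "question": question,
        "gold_index": gold_index,
        "index_score_tuple_list": [[int(i), float(s)] for i, s in pairs],
    }
    # json.dumps emits no newlines by default, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)


line = make_record("What is the 2024 model price?", 0, [102, 273], [1.23, 4.97])
```

Writing one such `line` per query, each followed by `"\n"`, yields a file in the expected shape.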

```python
# 🔎 Validation helper
import json
import pathlib


def validate_jsonl(path, k_min=10, n_rows=1500):
    path = pathlib.Path(path)
    rows = path.read_text(encoding="utf-8").splitlines()
    assert len(rows) == n_rows, f"{path}: expected {n_rows} rows, got {len(rows)}"
    for i, line in enumerate(rows):
        data = json.loads(line)
        # Every row needs the three required keys.
        assert set(data) >= {'question', 'gold_index', 'index_score_tuple_list'}, f"{path}: missing keys on row {i}"
        lst = data['index_score_tuple_list']
        assert len(lst) >= k_min, f"row {i}: fewer than {k_min} retrieved docs"
        assert all(isinstance(t, list) and len(t) == 2 for t in lst), f"row {i}: each item must be [idx, score]"
    print(f"{path.name}: ✔ format looks good")


# Example usage (prints the line shown below when the file is valid):
# validate_jsonl('./Retrieval/results/arithmetic_multispeaker_bm25_index.jsonl')
```

```
arithmetic_multispeaker_bm25_index.jsonl: ✔ format looks good
```
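
The helper above checks row counts, keys, and pair shapes, but not value types or the descending score order that the table asks for. A sketch of a stricter per-row check follows; `check_row` is a hypothetical addition and is not the validation script the maintainers run.

```python
import json


def check_row(line: str, row_number: int) -> None:
    """Raise AssertionError if one JSONL row has wrong types or unsorted scores."""
    data = json.loads(line)
    assert isinstance(data["question"], str), f"row {row_number}: question must be a string"
    assert isinstance(data["gold_index"], int), f"row {row_number}: gold_index must be an int"
    pairs = data["index_score_tuple_list"]
    for idx, score in pairs:
        assert isinstance(idx, int), f"row {row_number}: doc index must be an int"
        assert isinstance(score, (int, float)), f"row {row_number}: score must be a number"
    # Each score must be at least as large as the one after it.
    scores = [score for _, score in pairs]
    assert all(a >= b for a, b in zip(scores, scores[1:])), \
        f"row {row_number}: scores are not sorted in descending order"


check_row('{"question": "q", "gold_index": 0, "index_score_tuple_list": [[273, 4.97], [102, 1.23]]}', 0)
```

It can be called inside the loop of `validate_jsonl`, or on its own while iterating over the lines of a file.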


### Submit

* **Pull request** — fork the repo and add your code / JSONL files under `Experiments/evaluation/results`.  
* **Email** — send the six JSONL files (and optionally your retriever code + `requirements.txt`) to **zeinabtaghavi1377@gmail.com**.

We’ll run the validation script, merge, and your numbers will appear in the README leaderboard. 🎉
