##### Task 2.1: Implement Classical IR Models
○ Action: Implement a straightforward TF-IDF vector space model, either from scratch or with scikit-learn, to demonstrate foundational knowledge.

=> Here we use the scikit-learn library, which provides an efficient, well-tested implementation.
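
For comparison, the "from scratch" option named in the task can be sketched in a few lines. This is a minimal illustration and not part of the notebook's pipeline: it assumes whitespace-tokenized input and the same weighting that scikit-learn's `TfidfVectorizer` uses by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization). The names `tfidf_from_scratch`, `cosine`, and `toy_docs` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_from_scratch(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document (L2-normalized TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Toy corpus for illustration only (not the notebook's SAMPLE_DOCS).
toy_docs = ["ceasefire syria corridor", "sanction arm trade", "syria aid corridor"]
toy_vectors = tfidf_from_scratch(toy_docs)
print(round(cosine(toy_vectors[0], toy_vectors[2]), 3))
```

Because every vector is normalized to unit length, cosine similarity reduces to a dot product over the shared terms, which is why documents with no terms in common score exactly zero.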

#### TF‑IDF Vector Space (scikit‑learn)

In [None]:
# Assumes the sample corpus has already been pre-processed (to be replaced later)
SAMPLE_DOCS = [
    "security council discuss peacekeeper mandate west africa focus training logistics",
    "resolution stipulation address political process ceasefire syria include humanitarian corridor",
    "sanction arm trade tighten reduce illicit flow destabilize region",
    "nuclear nonproliferation discussion emphasize verification international cooperation framework",
    "counterterrorism committee mandate state reporting requirement national measure",
    "blue helmet deployment rule update include better equipment medical support",
    "human right reporting requirement include regular briefing independent monitoring mission",
    "ceasefire monitoring mission extend presence conflict zone ensure compliance",
    "women peace security agenda highlight participation protection peace process",
    "humanitarian corridor syria coordinate allow aid delivery besieged area",
]

# Sample queries (also pre-processed)
SAMPLE_QUERIES = [
    "peacekeeper mandate west africa",
    "humanitarian corridor syria",
]

In [None]:
# Install the required packages and libraries

!pip install scikit-learn pandas



In [None]:
from typing import List, Tuple
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Build TF‑IDF Index

In [None]:
# Vectorize documents (since they are already preprocessed, use a simple whitespace tokenizer)
vectorizer = TfidfVectorizer(lowercase=False, tokenizer=str.split, preprocessor=None, token_pattern=None)
X = vectorizer.fit_transform(SAMPLE_DOCS)  # (n_docs x n_terms)
feature_names = np.array(vectorizer.get_feature_names_out())

print(f"TF-IDF matrix shape: {X.shape}")

TF-IDF matrix shape: (10, 78)


#### Query → Rank (Cosine Similarity)

In [None]:
def rank_tfidf(queries: List[str], top_k: int = 5) -> pd.DataFrame:
    """Rank SAMPLE_DOCS against each query by cosine similarity in TF-IDF space."""
    rows = []
    for q in queries:
        # Project the query into the fitted TF-IDF space
        q_vec = vectorizer.transform([q])
        sims = cosine_similarity(q_vec, X)[0]  # similarity to each doc
        # Highest scores first; the order of tied scores is not meaningful
        ranked = sims.argsort()[::-1][:top_k]
        for rank, idx in enumerate(ranked, start=1):
            rows.append({
                "query": q,
                "doc_id": int(idx),
                "rank": rank,
                "score": float(sims[idx]),
                "doc_preview": SAMPLE_DOCS[idx][:160],
            })
    return pd.DataFrame(rows)

tfidf_results = rank_tfidf(SAMPLE_QUERIES, top_k=5)
tfidf_results

   query                            doc_id  rank  score     doc_preview
0  peacekeeper mandate west africa  0       1     0.627796  security council discuss peacekeeper mandate w...
1  peacekeeper mandate west africa  4       2     0.139896  counterterrorism committee mandate state repor...
2  peacekeeper mandate west africa  8       3     0.0       women peace security agenda highlight particip...
3  peacekeeper mandate west africa  9       4     0.0       humanitarian corridor syria coordinate allow a...
4  peacekeeper mandate west africa  6       5     0.0       human right reporting requirement include regu...
5  humanitarian corridor syria      1       1     0.515241  resolution stipulation address political proce...
6  humanitarian corridor syria      9       2     0.515192  humanitarian corridor syria coordinate allow a...
7  humanitarian corridor syria      8       3     0.0       women peace security agenda highlight particip...
8  humanitarian corridor syria      7       4     0.0       ceasefire monitoring mission extend presence c...
9  humanitarian corridor syria      5       5     0.0       blue helmet deployment rule update include bet...
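
The rows with a score of 0.0 share no term with the query, and their relative order is just whatever `argsort` returns for tied values. One possible refinement, sketched here under the assumption that non-matching documents should be dropped from the ranking, keeps only positive scores and breaks ties by document index. The helper `top_k_nonzero` and the array `example_scores` are hypothetical and are not used elsewhere in the notebook.

```python
import numpy as np


def top_k_nonzero(scores: np.ndarray, top_k: int = 5) -> list:
    """Indices of the top_k highest positive scores, ties broken by lower index."""
    scores = np.asarray(scores, dtype=float)
    # A stable sort on the negated scores keeps equal scores in index order.
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if scores[i] > 0][:top_k]


# Illustrative scores: two matching documents, four non-matching ones.
example_scores = np.array([0.627796, 0.0, 0.0, 0.0, 0.139896, 0.0])
print(top_k_nonzero(example_scores))
```

Swapping this in for `sims.argsort()[::-1][:top_k]` would make the result table shorter for queries that match few documents, which may or may not be desirable for later evaluation.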


#### Inspect top TF-IDF terms for a document (Optional)

In [None]:
doc_idx = 0
row = X[doc_idx].toarray().ravel()
top_idx = row.argsort()[::-1][:10]
list(zip(feature_names[top_idx], row[top_idx]))

[('training', np.float64(0.32538076593964255)),
 ('west', np.float64(0.32538076593964255)),
 ('focus', np.float64(0.32538076593964255)),
 ('discuss', np.float64(0.32538076593964255)),
 ('logistics', np.float64(0.32538076593964255)),
 ('peacekeeper', np.float64(0.32538076593964255)),
 ('council', np.float64(0.32538076593964255)),
 ('africa', np.float64(0.32538076593964255)),
 ('security', np.float64(0.27660337782848204)),
 ('mandate', np.float64(0.27660337782848204))]
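
The repeated values in this output follow from the weighting scheme: every term occurs exactly once in document 0, so a term's weight depends only on how many documents contain it. The sketch below reproduces the two distinct values by hand, assuming scikit-learn's default smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. The document frequencies are read off `SAMPLE_DOCS`: `security` and `mandate` each appear in two documents, the other eight terms in one.

```python
import math

n_docs = 10  # size of SAMPLE_DOCS

idf_df1 = math.log((1 + n_docs) / (1 + 1)) + 1  # terms found only in document 0
idf_df2 = math.log((1 + n_docs) / (1 + 2)) + 1  # 'security' and 'mandate'

# Document 0 has eight terms with df=1 and two with df=2, each with tf=1.
norm = math.sqrt(8 * idf_df1**2 + 2 * idf_df2**2)

weight_df1 = idf_df1 / norm
weight_df2 = idf_df2 / norm
print(f"{weight_df1:.6f} {weight_df2:.6f}")
```

The two printed weights should agree with the values listed above up to floating-point rounding.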

#### BM25 (rank_bm25)

In [None]:
# Install the required packages and libraries

!pip install rank_bm25 pandas

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
from typing import List, Tuple
import pandas as pd
from rank_bm25 import BM25Okapi

In [None]:
# Tokenize (Since the texts are pre-processed, a simple split is fine)
tokenized_docs = [doc.split() for doc in SAMPLE_DOCS]

# Build BM25
bm25 = BM25Okapi(tokenized_docs)

#### Query → Rank (BM25 score)

In [None]:
def rank_bm25(queries: List[str], top_k: int = 5) -> pd.DataFrame:
    """Rank SAMPLE_DOCS against each query by BM25 score."""
    rows = []
    for q in queries:
        q_tokens = q.split()
        scores = bm25.get_scores(q_tokens)  # score for each doc
        # Highest scores first; the order of tied scores is not meaningful
        ranked = scores.argsort()[::-1][:top_k]
        for rank, idx in enumerate(ranked, start=1):
            rows.append({
                "query": q,
                "doc_id": int(idx),
                "rank": rank,
                "score": float(scores[idx]),
                "doc_preview": SAMPLE_DOCS[idx][:160],
            })
    return pd.DataFrame(rows)

bm25_results = rank_bm25(SAMPLE_QUERIES, top_k=5)
bm25_results

   query                            doc_id  rank  score     doc_preview
0  peacekeeper mandate west africa  0       1     6.506648  security council discuss peacekeeper mandate w...
1  peacekeeper mandate west africa  4       2     1.300085  counterterrorism committee mandate state repor...
2  peacekeeper mandate west africa  8       3     0.0       women peace security agenda highlight particip...
3  peacekeeper mandate west africa  9       4     0.0       humanitarian corridor syria coordinate allow a...
4  peacekeeper mandate west africa  6       5     0.0       human right reporting requirement include regu...
5  humanitarian corridor syria      9       1     3.707596  humanitarian corridor syria coordinate allow a...
6  humanitarian corridor syria      1       2     3.533076  resolution stipulation address political proce...
7  humanitarian corridor syria      8       3     0.0       women peace security agenda highlight particip...
8  humanitarian corridor syria      7       4     0.0       ceasefire monitoring mission extend presence c...
9  humanitarian corridor syria      5       5     0.0       blue helmet deployment rule update include bet...
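
To make these scores less opaque, here is a minimal sketch of the Okapi BM25 weight as `BM25Okapi` in `rank_bm25` appears to compute it, using the defaults `k1=1.5` and `b=0.75`. It leaves out the library's `epsilon` floor for negative IDF values, which does not come into play for this query. The function name `bm25_score` is hypothetical, and the corpus statistics (10 documents, 92 tokens in total, 10 tokens in document 0) were counted by hand from `SAMPLE_DOCS`.

```python
import math
from typing import Dict


def bm25_score(term_freqs: Dict[str, int], doc_freqs: Dict[str, int],
               doc_len: int, avg_doc_len: float, n_docs: int,
               k1: float = 1.5, b: float = 0.75) -> float:
    """Sum of BM25 term weights for the query terms present in one document."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    score = 0.0
    for term, f in term_freqs.items():
        df = doc_freqs[term]
        # IDF variant assumed to match rank_bm25's BM25Okapi.
        idf = math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        score += idf * f * (k1 + 1) / (f + k1 * length_norm)
    return score


# Query "peacekeeper mandate west africa" scored against document 0.
query_tf_in_doc0 = {"peacekeeper": 1, "mandate": 1, "west": 1, "africa": 1}
query_df = {"peacekeeper": 1, "mandate": 2, "west": 1, "africa": 1}
doc0_score = bm25_score(query_tf_in_doc0, query_df,
                        doc_len=10, avg_doc_len=92 / 10, n_docs=10)
print(round(doc0_score, 4))
```

The same function applied to document 4, which contains only `mandate` and has 8 tokens, should reproduce the much lower second-place score: a single match on a term that occurs in two documents contributes far less than three matches on terms unique to one document.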


#### Tuning BM25 parameters (Optional)

In [None]:
# You can re-initialize with custom k1 and b (rank_bm25 defaults: k1=1.5, b=0.75)
# from rank_bm25 import BM25Okapi
# bm25 = BM25Okapi(tokenized_docs, k1=1.6, b=0.7)
# bm25_results = rank_bm25(SAMPLE_QUERIES, top_k=5)
# bm25_results.head()
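
As a rough illustration of what the two parameters control, the sketch below evaluates only the term-frequency factor of the BM25 weight (IDF left out) for a few settings. `k1` governs how quickly repeated occurrences of a term saturate, and `b` governs how strongly longer documents are penalized; with `b=0` document length has no effect. The function `tf_component` is a hypothetical helper and not part of `rank_bm25`.

```python
def tf_component(f: int, doc_len: int, avg_doc_len: float,
                 k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency factor of BM25, without the IDF multiplier."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return f * (k1 + 1) / (f + k1 * length_norm)


avg_len = 9.2  # average token count of SAMPLE_DOCS, counted by hand
for k1, b in [(1.5, 0.75), (1.6, 0.7), (1.5, 0.0)]:
    short = tf_component(1, doc_len=8, avg_doc_len=avg_len, k1=k1, b=b)
    long_ = tf_component(1, doc_len=10, avg_doc_len=avg_len, k1=k1, b=b)
    print(f"k1={k1}, b={b}: short doc {short:.3f}, long doc {long_:.3f}")
```

On a corpus this small and uniform in length the differences are slight, so any tuning is better judged on the real collection once evaluation metrics are in place.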

Evaluation metrics such as P@k, MAP, and nDCG will be added later.
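
Until those metrics are added, the following sketch shows one common way to define them for binary relevance. It is not the notebook's eventual implementation: the function names are hypothetical, and the relevance judgments in `relevant_docs` are made up for illustration, since the sample corpus has no labelled judgments.

```python
import math
from typing import List, Set


def precision_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked: List[int], relevant: Set[int]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(1 / math.log2(r + 1)
              for r, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(r + 1)
                for r in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0


# Hypothetical judgments for the query "humanitarian corridor syria".
relevant_docs = {1, 9}
ranking = [9, 1, 8, 7, 5]  # BM25 order reported earlier for this query

print(precision_at_k(ranking, relevant_docs, k=5))
print(average_precision(ranking, relevant_docs))
print(ndcg_at_k(ranking, relevant_docs, k=5))
```

MAP is then the mean of `average_precision` over all queries in the evaluation set.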