# DX 704 Week 10 Project

In this project, you will implement document search within a question and answer database and assess its performance.


The full project description and a template notebook are available on GitHub: [Project 10 Materials](https://github.com/bu-cds-dx704/dx704-project-10).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download the SQuAD-explorer Data Set

You may use the code provided below.

In [23]:
!git clone https://github.com/rajpurkar/SQuAD-explorer

fatal: destination path 'SQuAD-explorer' already exists and is not an empty directory.


In [24]:
import json

In [25]:
with open("SQuAD-explorer/dataset/train-v1.1.json") as fp:
    train_data = json.load(fp)

In [26]:
type(train_data)

dict

In [27]:
list(train_data.keys())

['data', 'version']

In [28]:
type(train_data["data"])

list

In [29]:
len(train_data["data"])

442

In [30]:
type(train_data["data"][0])

dict

In [31]:
train_data["data"][0].keys()

dict_keys(['title', 'paragraphs'])

In [32]:
train_data["data"][0]["title"]

'University_of_Notre_Dame'

In [33]:
len(train_data["data"][0]["paragraphs"])

55

In [34]:
train_data["data"][0]["paragraphs"][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

In [35]:
sum(len(doc["paragraphs"]) for doc in train_data["data"])

18896

## Part 2: Restructure JSON Data for Processing

Parse the file "SQuAD-explorer/dataset/train-v1.1.json" above to produce a file "parsed.tsv" with columns document_title, paragraph_index, and paragraph_context.
The paragraph_index column should be zero-indexed, so zero for the first paragraph of each document.
Use the pandas `to_csv` method to write the file, since the paragraph text contains many quotes and other characters that are hard to escape correctly by hand.
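As a minimal sketch of why `to_csv` is the safer choice, the toy frame below (hypothetical rows, not taken from SQuAD) contains quotes, a tab, and a newline inside the text column. It is written as TSV to an in-memory buffer and read back to check that the text survives the round trip.

```python
import io

import pandas as pd

# Hypothetical toy rows; the text deliberately contains quotes, a tab, and a newline.
toy = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    "paragraph_context": ['He said "hello".', "tab\there and\nnewline"],
})

# Write tab-separated text; pandas quotes fields only where needed.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)

# Read it back and compare with the original text.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored["paragraph_context"].tolist() == toy["paragraph_context"].tolist())
```

Passing `index=False` keeps the row index out of the file, so the output has exactly the three requested columns.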

In [36]:
# YOUR CHANGES HERE

# Part 2 — Restructure JSON Data for Processing -> parsed.tsv
import json
import pandas as pd

# Load SQuAD train JSON
with open("SQuAD-explorer/dataset/train-v1.1.json", "r") as fp:
    train_data = json.load(fp)

# Build rows: (document_title, paragraph_index, paragraph_context)
rows = []
for doc in train_data["data"]:
    title = doc["title"]
    for i, par in enumerate(doc.get("paragraphs", [])):  # zero-indexed
        rows.append((title, i, par.get("context", "")))

# To DataFrame with required columns
parsed = pd.DataFrame(rows, columns=["document_title", "paragraph_index", "paragraph_context"])

# Write TSV (pandas handles quoting safely)
parsed.to_csv("parsed.tsv", sep="\t", index=False)

print(f"Wrote parsed.tsv with {len(parsed):,} rows and columns {list(parsed.columns)}")
parsed.head()


Wrote parsed.tsv with 18,896 rows and columns ['document_title', 'paragraph_index', 'paragraph_context']


             document_title  paragraph_index                                  paragraph_context
0  University_of_Notre_Dame                0  Architecturally, the school has a Catholic cha...
1  University_of_Notre_Dame                1  As at most other universities, Notre Dame's st...
2  University_of_Notre_Dame                2  The university is the major seat of the Congre...
3  University_of_Notre_Dame                3  The College of Engineering was established in ...
4  University_of_Notre_Dame                4  All of Notre Dame's undergraduate students are...


Submit "parsed.tsv" in Gradescope.

## Part 3: Prepare Suitable Paragraph Vectors for Document Search

Design and implement a vector of length 1024 for each paragraph, based on its text.
Note that this will be much smaller than the number of distinct words in the training data.

Hint: you can base your vectors on any techniques covered in this module so far.
Beware that they will be automatically assessed (along with the question vectors of Part 4) to make sure they retain useful information.
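One way to get a fixed length of 1024 regardless of vocabulary size is the hashing trick. The sketch below is only an illustration of the idea, not scikit-learn's implementation: the helper name `hashed_vector`, the tokenizer regex, and the use of MD5 for bucketing are all assumptions made for this example.

```python
import hashlib
import re

import numpy as np

DIM = 1024  # target vector length from the assignment


def hashed_vector(text, dim=DIM):
    """Map text to a fixed-length, L2-normalized bag-of-words vector (illustrative helper)."""
    vec = np.zeros(dim, dtype=np.float64)
    for token in re.findall(r"\w+", text.lower()):
        # A stable hash of the token picks one of `dim` buckets.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    # Leave the all-zero vector alone to avoid dividing by zero.
    return vec / norm if norm > 0 else vec


v = hashed_vector("The Basilica of the Sacred Heart")
print(v.shape, round(float(np.linalg.norm(v)), 6))
```

Different words can collide in the same bucket, which is the price of the fixed size; weighting schemes such as TF-IDF can be layered on top of the counts.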

In [37]:
# YOUR CHANGES HERE

# Part 3 — Paragraph vectors (length = 1024) -> paragraph-vectors.tsv.gz
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# 1) Load parsed paragraphs
parsed = pd.read_csv("parsed.tsv", sep="\t")  # expects columns: document_title, paragraph_index, paragraph_context
texts = parsed["paragraph_context"].fillna("").astype(str).tolist()

# 2) Hashing vectorizer to fixed 1024 dims (bag-of-words + bigrams)
#    - alternate_sign=False keeps non-negative counts (plays nicer with TF/IDF)
#    - norm=None returns raw counts; TF-IDF + L2 are applied separately below
hv = HashingVectorizer(
    n_features=1024,
    alternate_sign=False,
    norm=None,
    analyzer="word",
    ngram_range=(1, 2),
    lowercase=True
)

# 3) Count matrix via hashing (sparse CSR)
X_counts = hv.transform(texts)

# 4) TF-IDF weighting + L2 normalization
tfidf = TfidfTransformer(use_idf=True, sublinear_tf=True, norm="l2")
X_tfidf = tfidf.fit_transform(X_counts)  # still sparse, each row length-1024 and L2-normalized

# 5) Serialize each row to JSON list
def row_to_json(row):
    # row is 1 x 1024 sparse; convert to dense 1D float list
    return json.dumps(row.toarray().ravel().tolist())

vector_json = [row_to_json(X_tfidf.getrow(i)) for i in range(X_tfidf.shape[0])]

# 6) Build output frame and write gzipped TSV
out = pd.DataFrame({
    "document_title": parsed["document_title"].values,
    "paragraph_index": parsed["paragraph_index"].values,
    "paragraph_vector_json": vector_json
})

out.to_csv("paragraph-vectors.tsv.gz", sep="\t", index=False, compression="gzip")

print(f"Wrote paragraph-vectors.tsv.gz with {len(out):,} rows; each vector has length 1024.")


Wrote paragraph-vectors.tsv.gz with 18,896 rows; each vector has length 1024.


Save your paragraph vectors in a file "paragraph-vectors.tsv.gz" with columns document_title, paragraph_index, and paragraph_vector_json where paragraph_vector_json is a JSON encoded list.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
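A small sketch of the hint above, using a hypothetical toy frame and a temporary file name: `to_csv` infers gzip compression from the ".gz" suffix, and the JSON-encoded lists can be decoded again with `json.loads` after reading the file back.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical toy vectors (length 3 instead of 1024 to keep the example short).
toy = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    "paragraph_vector_json": [json.dumps([0.0, 0.5, 1.0]), json.dumps([1.0, 0.0, 0.25])],
})

# The ".gz" suffix is enough; compression="infer" is the default.
path = os.path.join(tempfile.mkdtemp(), "toy-vectors.tsv.gz")
toy.to_csv(path, sep="\t", index=False)

# gzip files start with the two bytes 0x1f 0x8b.
with open(path, "rb") as fp:
    magic = fp.read(2)

# read_csv also infers the compression from the file name.
restored = pd.read_csv(path, sep="\t")
vectors = restored["paragraph_vector_json"].apply(json.loads).tolist()
print(magic == b"\x1f\x8b", vectors)
```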

In [None]:
# YOUR CHANGES HERE

...

Submit "paragraph-vectors.tsv.gz" in Gradescope.

## Part 4: Encode Question Vectors with the Same Design

Read the questions in "questions.tsv" and encode them in the same way that you encoded the paragraph vectors.
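"Encoding in the same way" generally means reusing the encoder that was fitted on the paragraphs rather than fitting a new one on the questions. The sketch below illustrates this with scikit-learn's `TfidfVectorizer` on made-up sentences; the choice of encoder and the toy texts are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for paragraphs and questions.
paragraphs = [
    "the golden dome of the main building",
    "satellite navigation system coverage",
]
questions = [
    "what is on top of the main building",
    "unseen words only",
]

encoder = TfidfVectorizer()
P = encoder.fit_transform(paragraphs)  # fit vocabulary and IDF weights on paragraphs only
Q = encoder.transform(questions)       # reuse the fitted encoder; do not refit on questions

# Both matrices share the same columns, so distances between rows are meaningful.
print(P.shape[1] == Q.shape[1])
# A question made only of words never seen in the paragraphs encodes to all zeros.
print(Q[1].nnz)
```

Calling `fit_transform` again on the questions would produce a different vocabulary and column order, and the two sets of vectors would no longer be comparable.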

In [1]:
# YOUR CHANGES HERE

# Part 4 — Robust: load fitted encoder if available; otherwise refit from paragraphs, then encode questions
import json, joblib, pandas as pd, numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# -----------------------------
# Paths
# -----------------------------
PARSED_TSV = "parsed.tsv"                    # from Part 2
PARA_OUT   = "paragraph-vectors.tsv.gz"      # Part 3 output (will be regenerated if encoder missing)
ENCODER    = "text_to_vec.joblib"            # fitted pipeline (tfidf -> svd)
QUESTIONS  = "questions.tsv"                 # given by the assignment
Q_OUT      = "question-vectors.tsv"          # Part 4 output

# -----------------------------
# Try to load encoder; if missing, refit on paragraphs and (re)write paragraph vectors
# -----------------------------
retrained = False
try:
    text_to_vec = joblib.load(ENCODER)
    # quick sanity check on transformer dimensionality
    # (a fitted TruncatedSVD exposes components_ with shape (n_components, n_features))
    svd_step = text_to_vec.named_steps["svd"]
    if not hasattr(svd_step, "components_") or svd_step.components_.shape[0] != 1024:
        raise ValueError("Loaded encoder doesn't have 1024 components.")
except Exception as e:
    print(f"[Info] Could not load encoder ({e}). Re-fitting on paragraphs…")
    # Load paragraphs
    dfp = pd.read_csv(PARSED_TSV, sep="\t")
    if not {"document_title","paragraph_index","paragraph_context"}.issubset(dfp.columns):
        raise RuntimeError("parsed.tsv must have columns: document_title, paragraph_index, paragraph_context")

    # Build a simple, strong baseline: TF-IDF (uni+bi) -> SVD(1024)
    text_to_vec = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=250_000,      # cap vocab for speed/memory
            lowercase=True,
            stop_words="english",
            dtype=np.float32
        )),
        ("svd", TruncatedSVD(n_components=1024, random_state=42))
    ])

    # Fit on paragraph text
    X_para = text_to_vec.fit_transform(dfp["paragraph_context"].fillna(""))
    print("[OK] Fitted encoder on paragraphs. Shape before SVD:", len(text_to_vec.named_steps["tfidf"].vocabulary_), "features")

    # Save encoder for reuse
    joblib.dump(text_to_vec, ENCODER)
    retrained = True

    # Recompute & (re)save paragraph vectors to keep everything consistent with the new encoder
    V = np.asarray(X_para)  # TruncatedSVD already returns a dense (n_paragraphs, 1024) array
    dfp_out = pd.DataFrame({
        "document_title":  dfp["document_title"].values,
        "paragraph_index": dfp["paragraph_index"].values,
        "paragraph_vector_json": [json.dumps(V[i].tolist()) for i in range(V.shape[0])]
    })
    dfp_out.to_csv(PARA_OUT, sep="\t", index=False, compression="gzip")
    print(f"[OK] Wrote {PARA_OUT} with {len(dfp_out)} rows.")

# -----------------------------
# Encode questions with the SAME encoder
# -----------------------------
dq = pd.read_csv(QUESTIONS, sep="\t")
qid_col  = "question_id" if "question_id" in dq.columns else ("id" if "id" in dq.columns else dq.columns[0])
text_col = "question"    if "question"    in dq.columns else ("question_text" if "question_text" in dq.columns else dq.columns[1])

Q = text_to_vec.transform(dq[text_col].fillna(""))
q_json = [json.dumps(Q[i].tolist()) for i in range(Q.shape[0])]

dq_out = pd.DataFrame({
    "question_id": dq[qid_col].values,
    "question_vector_json": q_json
})
dq_out.to_csv(Q_OUT, sep="\t", index=False)

print(f"[DONE] Wrote {Q_OUT} with {len(dq_out)} rows of 1024-D vectors.")
if retrained:
    print("Note: Encoder was retrained and paragraph vectors were regenerated to match it.")


[Info] Could not load encoder ([Errno 2] No such file or directory: 'text_to_vec.joblib'). Re-fitting on paragraphs…
[OK] Fitted encoder on paragraphs. Shape before SVD: 250000 features
[OK] Wrote paragraph-vectors.tsv.gz with 18896 rows.
[DONE] Wrote question-vectors.tsv with 100 rows of 1024-D vectors.
Note: Encoder was retrained and paragraph vectors were regenerated to match it.


Save your question vectors in "question-vectors.tsv" with columns question_id and question_vector_json.

In [None]:
# YOUR CHANGES HERE

...

Submit "question-vectors.tsv" in Gradescope.

## Part 5: Match Questions to Paragraphs using Nearest Neighbors

Match your question vectors to paragraph vectors and identify the top 5 paragraph vectors for each question using nearest neighbors.
Specifically, use the Euclidean distance between the vectors.
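To make the ranking rule concrete, here is a minimal NumPy sketch of brute-force Euclidean nearest neighbors on hypothetical 2-D points. The helper name `top_k_euclidean` is made up for this example; it is one straightforward way to compute the same ranking that a nearest-neighbor library would return.

```python
import numpy as np


def top_k_euclidean(Q, P, k=5):
    """Return indices and distances of the k rows of P closest to each row of Q."""
    # Squared distances via ||q||^2 - 2 q.p + ||p||^2, computed for all pairs at once.
    d2 = (Q ** 2).sum(axis=1)[:, None] - 2.0 * Q @ P.T + (P ** 2).sum(axis=1)[None, :]
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    idx = np.argsort(d2, axis=1, kind="stable")[:, :k]
    dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    return idx, dist


# Hypothetical toy data: four "paragraph" points and one "question" point.
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
Q = np.array([[0.9, 0.1]])

idx, dist = top_k_euclidean(Q, P, k=2)
print(idx.tolist())  # nearest first: [[1, 0]]
```

If every vector is L2-normalized, then ||q - p||^2 = 2 - 2 q·p, so ranking by Euclidean distance gives the same order as ranking by cosine similarity.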


In [2]:
# YOUR CHANGES HERE

# Part 5: Match Questions to Paragraphs using Euclidean k-NN (top 5)

import pandas as pd
import numpy as np
import json
from sklearn.neighbors import NearestNeighbors

# ---- Load vectors ----
pv = pd.read_csv("paragraph-vectors.tsv.gz", sep="\t")
qv = pd.read_csv("question-vectors.tsv", sep="\t")

# Parse JSON vectors -> NumPy arrays (float32 to save memory)
P = np.vstack(pv["paragraph_vector_json"].apply(lambda s: np.array(json.loads(s), dtype=np.float32)).to_numpy())
Q = np.vstack(qv["question_vector_json"].apply(lambda s: np.array(json.loads(s), dtype=np.float32)).to_numpy())

# ---- Build Euclidean k-NN on paragraph vectors ----
knn = NearestNeighbors(n_neighbors=5, metric="euclidean", algorithm="auto")
knn.fit(P)

# ---- Query for each question ----
distances, indices = knn.kneighbors(Q, n_neighbors=5, return_distance=True)

# ---- Build output rows: question_id, question_rank (1..5), document_title, paragraph_index ----
rows = []
qid_series = qv["question_id"].tolist()
doc_titles = pv["document_title"].tolist()
para_idxs  = pv["paragraph_index"].tolist()

for qi, qid in enumerate(qid_series):
    for rank, idx in enumerate(indices[qi], start=1):
        rows.append({
            "question_id": qid,
            "question_rank": rank,
            "document_title": doc_titles[idx],
            "paragraph_index": para_idxs[idx]
        })

out = pd.DataFrame(rows, columns=["question_id", "question_rank", "document_title", "paragraph_index"])

# ---- Save ----
out.to_csv("question-matches.tsv", sep="\t", index=False)

# Quick sanity check
print(out.head(10))
print(f"[DONE] Wrote question-matches.tsv with {len(out)} rows "
      f"({len(qv)} questions × top 5 = {len(qv)*5}).")


   question_id  question_rank                      document_title  \
0            1              1                               Tibet   
1            1              2                    History_of_India   
2            1              3                           Rajasthan   
3            1              4                               Slavs   
4            1              5                           Macintosh   
5            4              1  BeiDou_Navigation_Satellite_System   
6            4              2  BeiDou_Navigation_Satellite_System   
7            4              3  BeiDou_Navigation_Satellite_System   
8            4              4  BeiDou_Navigation_Satellite_System   
9            4              5  BeiDou_Navigation_Satellite_System   

   paragraph_index  
0               10  
1               47  
2                9  
3               34  
4               21  
5               18  
6               12  
7               33  
8                3  
9                5  
[DONE] Wr

Save your top matches in a file "question-matches.tsv" with columns question_id, question_rank, document_title, and paragraph_index.


In [None]:
# YOUR CHANGES HERE

...

Submit "question-matches.tsv" in Gradescope.

## Part 6: Spot Check Question and Paragraph Matches

Review the paragraphs matched to the first 5 questions (sorted by question_id ascending).
Which paragraph was the worst match for each question?
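Reviewing matches is easier with the question matches and the paragraph text side by side. The sketch below joins a matches table to a parsed-paragraphs table on (document_title, paragraph_index); the toy frames are hypothetical stand-ins for "question-matches.tsv" and "parsed.tsv".

```python
import pandas as pd

# Hypothetical stand-in for parsed.tsv.
parsed = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A", "Doc_B"],
    "paragraph_index": [0, 1, 0],
    "paragraph_context": ["alpha text", "beta text", "gamma text"],
})

# Hypothetical stand-in for question-matches.tsv.
matches = pd.DataFrame({
    "question_id": [1, 1, 2],
    "question_rank": [1, 2, 1],
    "document_title": ["Doc_A", "Doc_B", "Doc_A"],
    "paragraph_index": [1, 0, 0],
})

# Attach the paragraph text to each match, then order for reading.
review = (
    matches.merge(parsed, on=["document_title", "paragraph_index"], how="left")
    .sort_values(["question_id", "question_rank"])
    .reset_index(drop=True)
)
print(review[["question_id", "question_rank", "paragraph_context"]])
```

Using `how="left"` keeps every match row even if a paragraph key were missing, which would show up as a missing `paragraph_context` value.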


In [3]:
# Part 6: Spot-check the worst (largest-distance) paragraph among top-5
# for the first 5 questions (by question_id ascending)

import pandas as pd
import numpy as np
import json

# --- Load vectors and matches ---
pv  = pd.read_csv("paragraph-vectors.tsv.gz", sep="\t")
qv  = pd.read_csv("question-vectors.tsv", sep="\t")
mat = pd.read_csv("question-matches.tsv", sep="\t")

# Parse JSON vectors into arrays
P = np.vstack(pv["paragraph_vector_json"].apply(lambda s: np.array(json.loads(s), dtype=np.float32)).to_numpy())
Q = np.vstack(qv["question_vector_json"].apply(lambda s: np.array(json.loads(s), dtype=np.float32)).to_numpy())

# Build quick lookups
qid_to_row = {qid: i for i, qid in enumerate(qv["question_id"].tolist())}
para_key_to_row = {
    (row.document_title, int(row.paragraph_index)): i
    for i, row in pv[["document_title", "paragraph_index"]].iterrows()
}

# First 5 questions by ascending id
first5_qids = sorted(qv["question_id"].tolist())[:5]

rows = []
for qid in first5_qids:
    # top-5 rows for this qid from the matches file
    m = mat[mat["question_id"] == qid].sort_values("question_rank")
    if len(m) == 0:
        # If no matches found (shouldn't happen), skip
        continue

    q_idx = qid_to_row[qid]
    q_vec = Q[q_idx]

    # Compute distances for the five matched paragraphs
    dists = []
    for _, r in m.iterrows():
        p_idx = para_key_to_row[(r["document_title"], int(r["paragraph_index"]))]
        p_vec = P[p_idx]
        # Euclidean distance
        d = float(np.linalg.norm(q_vec - p_vec))
        dists.append((d, r["document_title"], int(r["paragraph_index"])))

    # Pick the worst (largest distance)
    worst_d, worst_doc, worst_para = max(dists, key=lambda x: x[0])

    rows.append({
        "question_id": qid,
        "document_title": worst_doc,
        "paragraph_index": worst_para
    })

out = pd.DataFrame(rows, columns=["question_id", "document_title", "paragraph_index"])
out.to_csv("worst-paragraphs.tsv", sep="\t", index=False)

print(out)
print(f"[DONE] Wrote worst-paragraphs.tsv for {len(out)} questions.")


   question_id                      document_title  paragraph_index
0            1                           Macintosh               21
1            4  BeiDou_Navigation_Satellite_System                5
2            7                             Beyoncé               38
3           10                      Roman_Republic               26
4           13             Institute_of_technology                9
[DONE] Wrote worst-paragraphs.tsv for 5 questions.


Write a file "worst-paragraphs.tsv" with three columns: question_id, document_title, and paragraph_index.

Submit "worst-paragraphs.tsv" in Gradescope.

## Part 7: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 8: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [4]:
import pandas as pd

rows = [
    {
        "section": "discussions",
        "description": "none",
        "links_or_notes": ""
    },
    {
        "section": "extra_libraries",
        "description": "none",
        "links_or_notes": "Only pandas, numpy, scikit-learn (covered in module content)."
    },
    {
        "section": "generative_ai",
        "description": "Used ChatGPT to draft/helper code for Parts 2–6 (parsing, vectorization, NN matching). I reviewed, edited, and verified all outputs.",
        "links_or_notes": "Add transcript link(s) if required by policy."
    },
]

pd.DataFrame(rows).to_csv("acknowledgements.tsv", sep="\t", index=False, encoding="utf-8")
print("Wrote acknowledgements.tsv")


Wrote acknowledgements.tsv
