# Using vectorized representations of text for semantic text similarity

This notebook allows students of the TNLP 25/26 course to complete their assignment on vectorized representations of text. It contains only the minimal information needed to start working on the assignment. Students must follow the assignment [instructions](https://mespla.github.io/tpln2526/assignment-searchinvectorialspace/) and record the work done in this notebook.

The first step is to install the `scikit-learn` and `sentence-transformers` Python libraries:

In [None]:
!pip install sentence-transformers scikit-learn



Once this is done, students must download and read the dataset file, which contains a list of scientific papers, each with a title and an abstract.


In [None]:
# =========================
# Section 2 — Dataset Acquisition & Loading (EMNLP 2016–2018 JSON)
# =========================

import hashlib
import json
import os

import requests
from collections import Counter

DATA_URL = "https://sbert.net/datasets/emnlp2016-2018.json"
DATA_PATH = "emnlp2016-2018.json"   # Feel free to rename, but keep it consistent across the notebook
FORCE_DOWNLOAD = False             # Set True if you want to re-download even if the file exists

def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def download_json(url: str, out_path: str, timeout: int = 120) -> None:
    r = requests.get(url, stream=True, timeout=timeout)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)

# --- (1) Acquire the dataset file (download if missing) ---
if FORCE_DOWNLOAD or (not os.path.exists(DATA_PATH)):
    print(f"[Download] Fetching dataset from: {DATA_URL}")
    download_json(DATA_URL, DATA_PATH)
    print(f"[Download] Saved to: {DATA_PATH}")
else:
    print(f"[Cache] Using existing file: {DATA_PATH}")

# --- (2) Load JSON into memory ---
with open(DATA_PATH, "r", encoding="utf-8") as f:
    papers = json.load(f)

# --- (3) Validate expected JSON structure: root must be a list of dicts ---
if not isinstance(papers, list):
    raise TypeError(f"Expected JSON root to be a list, got: {type(papers)}")

if len(papers) == 0:
    raise ValueError("Dataset loaded but it is empty (N=0). File may be corrupted or not the expected dataset.")

if not isinstance(papers[0], dict):
    raise TypeError(f"Expected each item to be a dict, got first item type: {type(papers[0])}")

# --- (4) Provenance info (practical traceability) ---
file_size_mb = os.path.getsize(DATA_PATH) / (1024 * 1024)
print(f"\n[Provenance]")
print(f" - Source URL: {DATA_URL}")
print(f" - Local path: {DATA_PATH}")
print(f" - File size : {file_size_mb:.2f} MB")
print(f" - SHA-256   : {sha256_of_file(DATA_PATH)}")

# --- (5) Mandatory sanity checks ---
N = len(papers)
print(f"\n[Sanity checks]")
print(f" - N papers loaded: {N}")

required_keys = {"title", "abstract", "url", "venue", "year"}
missing_key_counts = Counter()

empty_title = 0
empty_abstract = 0
non_string_title = 0
non_string_abstract = 0

years = []
venues = []
urls = []

for p in papers:
    # Key presence
    for k in required_keys:
        if k not in p:
            missing_key_counts[k] += 1

    # Title / abstract existence + type + emptiness
    t = p.get("title", None)
    a = p.get("abstract", None)

    if not isinstance(t, str):
        non_string_title += 1
        t = "" if t is None else str(t)
    if not isinstance(a, str):
        non_string_abstract += 1
        a = "" if a is None else str(a)

    if len(t.strip()) == 0:
        empty_title += 1
    if len(a.strip()) == 0:
        empty_abstract += 1

    # Metadata distributions (for plausibility checks)
    venues.append(str(p.get("venue", "")).strip())
    urls.append(str(p.get("url", "")).strip())

    y = p.get("year", None)
    try:
        years.append(int(y))
    except (TypeError, ValueError, OverflowError):  # missing or non-numeric year
        years.append(None)

# Report key presence
if sum(missing_key_counts.values()) == 0:
    print(" - Key presence: OK (all required keys found in all records)")
else:
    print(" - Key presence issues:")
    for k in sorted(required_keys):
        if missing_key_counts[k] > 0:
            print(f"   * Missing '{k}': {missing_key_counts[k]} records")

# Report title/abstract health
print(f" - Empty titles   : {empty_title}")
print(f" - Empty abstracts: {empty_abstract}")
print(f" - Non-string titles   : {non_string_title}")
print(f" - Non-string abstracts: {non_string_abstract}")

# Year plausibility
valid_years = [y for y in years if isinstance(y, int)]
year_counts = Counter(valid_years)
print(f" - Year distribution (top): {year_counts.most_common(5)}")

outside = [y for y in valid_years if y < 2016 or y > 2018]
print(f" - Years outside 2016–2018: {len(outside)}")

# Venue plausibility
venue_counts = Counter([v for v in venues if v])
print(f" - Venue distribution (top): {venue_counts.most_common(5)}")

# Duplicate URLs (very practical duplicate detector)
url_counts = Counter([u for u in urls if u])
dupe_urls = sum(1 for u, c in url_counts.items() if c > 1)
print(f" - Duplicate URL keys: {dupe_urls} (unique non-empty URLs: {len(url_counts)})")

# --- (6) Access check (prove we can read title/abstract) ---
print(f"\n[Access check — first record]")
print(f" - title   : {papers[0].get('title','')[:120]}{'...' if len(papers[0].get('title',''))>120 else ''}")
print(f" - abstract: {papers[0].get('abstract','')[:200]}{'...' if len(papers[0].get('abstract',''))>200 else ''}")
print(f" - url     : {papers[0].get('url','')}")
print(f" - venue   : {papers[0].get('venue','')}")
print(f" - year    : {papers[0].get('year','')}")

# At this point, you have a usable in-memory structure:
# - papers: List[Dict] with keys title/abstract/url/venue/year


[Download] Fetching dataset from: https://sbert.net/datasets/emnlp2016-2018.json
[Download] Saved to: emnlp2016-2018.json

[Provenance]
 - Source URL: https://sbert.net/datasets/emnlp2016-2018.json
 - Local path: emnlp2016-2018.json
 - File size : 1.05 MB
 - SHA-256   : 9e6020503e5f0dd0e91dbb970d47f43c7621f67321249daed5990057e159961c

[Sanity checks]
 - N papers loaded: 974
 - Key presence: OK (all required keys found in all records)
 - Empty titles   : 0
 - Empty abstracts: 0
 - Non-string titles   : 0
 - Non-string abstracts: 0
 - Year distribution (top): [(2018, 549), (2017, 230), (2016, 195)]
 - Years outside 2016–2018: 0
 - Venue distribution (top): [('EMNLP', 974)]
 - Duplicate URL keys: 0 (unique non-empty URLs: 974)

[Access check — first record]
 - title   : Rule Extraction for Tree-to-Tree Transducers by Cost Minimization
 - abstract: Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restricti

## Part 1:
From this point, students should be able to build BoW and TF-IDF representations of the dataset and to retrieve the most similar papers for the new scientific paper titles given in the assignment instructions. Include here the code used to build the representations, as well as a discussion of the results obtained.

In [None]:
!pip install numpy

### Step 1: Preprocessing the Data
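
As a starting point, the sketch below shows one possible way to build the list of concatenated titles and abstracts used later for vectorization. The helper name `build_paper_texts` and the toy records are assumptions made for illustration; the only thing taken from the dataset is that each record has `title` and `abstract` keys.

```python
from typing import Dict, List


def build_paper_texts(papers: List[Dict], sep: str = " ") -> List[str]:
    """Join each record's title and abstract into one whitespace-normalized string."""
    texts = []
    for p in papers:
        # Treat missing or None fields as empty strings.
        title = str(p.get("title") or "").strip()
        abstract = str(p.get("abstract") or "").strip()
        combined = sep.join(part for part in (title, abstract) if part)
        # Collapse line breaks and repeated spaces into single spaces.
        texts.append(" ".join(combined.split()))
    return texts


# Toy records (hypothetical), only to show the expected input shape.
sample_papers = [
    {"title": "A  Title ", "abstract": "An abstract\nwith a line break."},
    {"title": "Only a title", "abstract": ""},
]
sample_texts = build_paper_texts(sample_papers)
print(sample_texts)
```

With the real dataset, `paper_texts = build_paper_texts(papers)` would give one string per paper. Further preprocessing (lowercasing, stop-word removal, and so on) can also be delegated to the vectorizer.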



### Step 2: Building the BoW and TF-IDF Representations
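
The following is a minimal sketch of how both representations could be obtained with `scikit-learn`, not a reference solution. The toy texts and the vectorizer options (`lowercase`, `stop_words="english"`) are assumptions; the key point is that each vectorizer is fitted on the papers and then reused to transform the queries, so both share the same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy data (hypothetical); replace with the real paper texts and query titles.
toy_papers = [
    "Neural machine translation with attention",
    "Dependency parsing with neural networks",
    "Sentiment analysis of product reviews",
]
toy_queries = ["attention for translation"]

# Bag of words: raw term counts per document.
bow_vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow_papers = bow_vectorizer.fit_transform(toy_papers)   # learn vocabulary on papers
bow_queries = bow_vectorizer.transform(toy_queries)     # reuse the same vocabulary

# TF-IDF: counts reweighted by inverse document frequency.
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_papers = tfidf_vectorizer.fit_transform(toy_papers)
tfidf_queries = tfidf_vectorizer.transform(toy_queries)

print("BoW shapes   :", bow_papers.shape, bow_queries.shape)
print("TF-IDF shapes:", tfidf_papers.shape, tfidf_queries.shape)
```

Query words that never occur in the papers are silently dropped by `transform`, which is worth keeping in mind when analysing the results.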

### Step 3: Similarity Search

Use the following code snippet to compute the pairwise cosine similarity and to rank the top 3 candidates for each query. In this example, `query_vectors` and `paper_vectors` are the vectorized collections of queries and papers, respectively (the papers being those stored in the JSON file previously downloaded). The variable `paper_texts` contains the list of concatenated titles and abstracts of the papers loaded from the JSON file.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

#...

similarity_matrix = cosine_similarity(query_vectors, paper_vectors)

for nquery in [0, 1, 2]:
  similarity_scores = similarity_matrix[nquery]
  top_indices = np.argsort(similarity_scores)[::-1][:3]  # Indices of top 3 scores
  for i, index in enumerate(top_indices, 1):
    print(f"{i}. Text: '{paper_texts[index]}' (Score: {similarity_scores[index]:.4f})")
  print()
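
To make explicit what the cosine similarity computes, here is a small sketch that reimplements it with NumPy on dense toy vectors: each row is scaled to unit length and the similarity is the dot product of the scaled rows. The function name `cosine_similarity_matrix` and the toy arrays are assumptions for illustration. For the sparse matrices produced by the `scikit-learn` vectorizers, the library function above remains the more practical choice.

```python
import numpy as np


def cosine_similarity_matrix(queries, papers):
    """Cosine similarity between every query row and every paper row (dense input)."""
    q = np.asarray(queries, dtype=float)
    p = np.asarray(papers, dtype=float)
    q_norm = np.linalg.norm(q, axis=1, keepdims=True)
    p_norm = np.linalg.norm(p, axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows (their similarity stays 0).
    q_unit = q / np.where(q_norm == 0, 1.0, q_norm)
    p_unit = p / np.where(p_norm == 0, 1.0, p_norm)
    return q_unit @ p_unit.T


# Toy vectors (hypothetical): 2 queries and 3 papers in a 2-dimensional space.
toy_query_vectors = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_paper_vectors = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

toy_sims = cosine_similarity_matrix(toy_query_vectors, toy_paper_vectors)
toy_ranking = np.argsort(-toy_sims, axis=1)  # paper indices, best match first

print(np.round(toy_sims, 4))
print(toy_ranking)
```

Because only the direction of each vector matters, the first paper scores 1.0 against the first query even though its magnitude differs.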

### Step 4: Analysis of the results obtained

## Part 2:

In this section, students should use `sentence-transformers` to obtain sentence-embedding representations of the dataset and to perform searches for the best matches for the examples proposed in the assignment instructions.

### Step 1: Trying a general purpose small monolingual model
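
The sketch below outlines how an embedding-based search could be organised. The function `semantic_search` is a hypothetical helper, not part of any library: it accepts any object with an `encode` method that returns a 2-D array, which is how a `SentenceTransformer` instance behaves by default. The model name in the commented usage is only an assumed example of a small English model.

```python
import numpy as np


def semantic_search(model, queries, paper_texts, top_k=3):
    """Return, for each query, a list of (paper_index, cosine_score), best first."""
    paper_emb = np.asarray(model.encode(paper_texts), dtype=float)
    query_emb = np.asarray(model.encode(queries), dtype=float)

    # Scale embeddings to unit length so the dot product equals cosine similarity.
    paper_emb /= np.clip(np.linalg.norm(paper_emb, axis=1, keepdims=True), 1e-12, None)
    query_emb /= np.clip(np.linalg.norm(query_emb, axis=1, keepdims=True), 1e-12, None)
    sims = query_emb @ paper_emb.T

    results = []
    for row in sims:
        top = np.argsort(-row)[:top_k]
        results.append([(int(i), float(row[i])) for i in top])
    return results


# Possible usage (assumption: the model name is only an example):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# results = semantic_search(model, queries, paper_texts)
```

Keeping the model as a parameter makes it straightforward to rerun the same search with other models in the following steps.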

### Step 2: Comparing other models

### Step 3: Moving to a multilingual environment

When trying multilingual models, you will need a multilingual collection of papers. To build it, extend the collection of EMNLP papers used in the first part of this notebook with an additional collection of papers from the SEPLN conference, which are written in both English and Spanish. Create a new dataset that concatenates both collections and use it to try a multilingual search. The collection of SEPLN papers can be downloaded from [https://www.dlsi.ua.es/~mespla/sepln.json](https://www.dlsi.ua.es/~mespla/sepln.json)
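
One way the concatenation could be done is sketched below. It assumes, without having verified it, that the SEPLN file is also a list of records with `title` and `abstract` keys; check the actual structure after downloading. The helper `merge_collections` and the added `source` field are hypothetical names introduced only for this example.

```python
from typing import Dict, List


def merge_collections(emnlp: List[Dict], sepln: List[Dict]) -> List[Dict]:
    """Concatenate both collections, keeping track of where each record came from."""
    merged = []
    for source, collection in (("EMNLP", emnlp), ("SEPLN", sepln)):
        for record in collection:
            merged.append({
                # Assumption: both collections use the same key names.
                "title": str(record.get("title") or ""),
                "abstract": str(record.get("abstract") or ""),
                "source": source,
            })
    return merged


# Toy records (hypothetical) standing in for the two loaded JSON files.
toy_emnlp = [{"title": "Neural parsing", "abstract": "We parse sentences."}]
toy_sepln = [{"title": "Análisis de sentimientos", "abstract": "Analizamos opiniones."}]

toy_merged = merge_collections(toy_emnlp, toy_sepln)
print(len(toy_merged), [r["source"] for r in toy_merged])
```

Recording the source of each record makes it easier to check later whether a multilingual model retrieves papers across both collections.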

## Concluding remarks