# Using vectorized representations of text for semantic text similarity

This notebook lets students of the TNLP 25/26 course complete their assignment on vectorized representations of text. It contains only the minimal information needed to start working on the assignment. Students must follow the assignment [instructions](https://mespla.github.io/tpln2526/assignment-searchinvectorialspace/) and record the work done in this notebook.

The starting point is to install the `scikit-learn` and `sentence-transformers` Python libraries:

In [1]:
!pip install sentence-transformers scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting sentence-transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp313-cp313-win_amd64.whl.metadata (11 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading sentence_transformers-5.1.2-py3-none-any.whl (488 kB)
Downloading scikit_learn-1.7.2-cp313-cp313-win_amd64.whl (8.7 MB)


[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


Once this is done, students must download and read the dataset file, which contains a list of scientific papers with their titles and abstracts.


In [4]:
# =========================
# Section 2 — Dataset Acquisition & Loading (EMNLP 2016–2018 JSON)
# =========================

import os, json, hashlib
import requests
from collections import Counter

DATA_URL = "https://sbert.net/datasets/emnlp2016-2018.json"
DATA_PATH = "emnlp2016-2018.json"   # Feel free to rename, but keep it consistent across the notebook
FORCE_DOWNLOAD = False             # Set True if you want to re-download even if the file exists

def sha256_of_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def download_json(url: str, out_path: str, timeout: int = 120) -> None:
    r = requests.get(url, stream=True, timeout=timeout)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)

# --- (1) Acquire the dataset file (download if missing) ---
if FORCE_DOWNLOAD or (not os.path.exists(DATA_PATH)):
    print(f"[Download] Fetching dataset from: {DATA_URL}")
    download_json(DATA_URL, DATA_PATH)
    print(f"[Download] Saved to: {DATA_PATH}")
else:
    print(f"[Cache] Using existing file: {DATA_PATH}")

# --- (2) Load JSON into memory ---
with open(DATA_PATH, "r", encoding="utf-8") as f:
    papers = json.load(f)

# Alias for compatibility with the starter template cells
data = papers

# --- (3) Validate expected JSON structure: root must be a list of dicts ---
if not isinstance(papers, list):
    raise TypeError(f"Expected JSON root to be a list, got: {type(papers)}")

if len(papers) == 0:
    raise ValueError("Dataset loaded but it is empty (N=0). File may be corrupted or not the expected dataset.")

if not isinstance(papers[0], dict):
    raise TypeError(f"Expected each item to be a dict, got first item type: {type(papers[0])}")

# --- (4) Provenance info (practical traceability) ---
file_size_mb = os.path.getsize(DATA_PATH) / (1024 * 1024)
print(f"\n[Provenance]")
print(f" - Source URL: {DATA_URL}")
print(f" - Local path: {DATA_PATH}")
print(f" - File size : {file_size_mb:.2f} MB")
print(f" - SHA-256   : {sha256_of_file(DATA_PATH)}")

# --- (5) Mandatory sanity checks ---
N = len(papers)
print(f"\n[Sanity checks]")
print(f" - N papers loaded: {N}")

required_keys = {"title", "abstract", "url", "venue", "year"}
missing_key_counts = Counter()

empty_title = 0
empty_abstract = 0
non_string_title = 0
non_string_abstract = 0

years = []
venues = []
urls = []

for p in papers:
    # Key presence
    for k in required_keys:
        if k not in p:
            missing_key_counts[k] += 1

    # Title / abstract existence + type + emptiness
    t = p.get("title", None)
    a = p.get("abstract", None)

    if not isinstance(t, str):
        non_string_title += 1
        t = "" if t is None else str(t)
    if not isinstance(a, str):
        non_string_abstract += 1
        a = "" if a is None else str(a)

    if len(t.strip()) == 0:
        empty_title += 1
    if len(a.strip()) == 0:
        empty_abstract += 1

    # Metadata distributions (for plausibility checks)
    venues.append(str(p.get("venue", "")).strip())
    urls.append(str(p.get("url", "")).strip())

    y = p.get("year", None)
    try:
        years.append(int(y))
    except Exception:
        years.append(None)

# Report key presence
if sum(missing_key_counts.values()) == 0:
    print(" - Key presence: OK (all required keys found in all records)")
else:
    print(" - Key presence issues:")
    for k in sorted(required_keys):
        if missing_key_counts[k] > 0:
            print(f"   * Missing '{k}': {missing_key_counts[k]} records")

# Report title/abstract health
print(f" - Empty titles   : {empty_title}")
print(f" - Empty abstracts: {empty_abstract}")
print(f" - Non-string titles   : {non_string_title}")
print(f" - Non-string abstracts: {non_string_abstract}")

# Year plausibility
valid_years = [y for y in years if isinstance(y, int)]
year_counts = Counter(valid_years)
print(f" - Year distribution (top): {year_counts.most_common(5)}")

outside = [y for y in valid_years if y < 2016 or y > 2018]
print(f" - Years outside 2016–2018: {len(outside)}")

# Venue plausibility
venue_counts = Counter([v for v in venues if v])
print(f" - Venue distribution (top): {venue_counts.most_common(5)}")

# Duplicate URLs (very practical duplicate detector)
url_counts = Counter([u for u in urls if u])
dupe_urls = sum(1 for u, c in url_counts.items() if c > 1)
print(f" - Duplicate URL keys: {dupe_urls} (unique non-empty URLs: {len(url_counts)})")

# --- (6) Access check (prove we can read title/abstract) ---
print(f"\n[Access check — first record]")
print(f" - title   : {papers[0].get('title','')[:120]}{'...' if len(papers[0].get('title',''))>120 else ''}")
print(f" - abstract: {papers[0].get('abstract','')[:200]}{'...' if len(papers[0].get('abstract',''))>200 else ''}")
print(f" - url     : {papers[0].get('url','')}")
print(f" - venue   : {papers[0].get('venue','')}")
print(f" - year    : {papers[0].get('year','')}")


[Cache] Using existing file: emnlp2016-2018.json

[Provenance]
 - Source URL: https://sbert.net/datasets/emnlp2016-2018.json
 - Local path: emnlp2016-2018.json
 - File size : 1.05 MB
 - SHA-256   : 9e6020503e5f0dd0e91dbb970d47f43c7621f67321249daed5990057e159961c

[Sanity checks]
 - N papers loaded: 974
 - Key presence: OK (all required keys found in all records)
 - Empty titles   : 0
 - Empty abstracts: 0
 - Non-string titles   : 0
 - Non-string abstracts: 0
 - Year distribution (top): [(2018, 549), (2017, 230), (2016, 195)]
 - Years outside 2016–2018: 0
 - Venue distribution (top): [('EMNLP', 974)]
 - Duplicate URL keys: 0 (unique non-empty URLs: 974)

[Access check — first record]
 - title   : Rule Extraction for Tree-to-Tree Transducers by Cost Minimization
 - abstract: Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restrictions encountered, without involving the use of a large set...
 - url     :

## Part 1:
From this point, students should be able to obtain BoW and TF-IDF representations of the dataset and to retrieve the most similar papers for the new scientific paper titles given in the instructions of the exercise. Include here the code used to build the representations, as well as a discussion of the results obtained.

In [None]:
!pip install numpy

### Step 1: Preprocessing the Data



In [2]:
%%writefile section3_preprocessing.py
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence, Tuple


try:
    import pandas as pd
    _HAS_PANDAS = True
except Exception:
    pd = None  # type: ignore
    _HAS_PANDAS = False


@dataclass(frozen=True)
class PrepConfig:
    title_key: str = "title"
    abstract_key: str = "abstract"
    joiner: str = " "
    make_dataframe: bool = True


def _as_text(x: Any) -> str:
    """Minimal, non-aggressive coercion: keep text as-is (no casing/punct stripping)."""
    if x is None:
        return ""
    if isinstance(x, str):
        return x
    return str(x)


def build_paper_texts(
    records: Sequence[Dict[str, Any]],
    config: PrepConfig = PrepConfig(),
) -> Tuple[List[str], Optional["pd.DataFrame"]]:
    """
    Build the working 'text' field exactly as required:
        text = title + " " + abstract

    Document unit: one paper = one document.
    Returns:
      - paper_texts: list[str] with one concatenated document per paper
      - df (optional): pandas DataFrame including title/abstract/url/venue/year/text
    """
    if not isinstance(records, (list, tuple)):
        raise TypeError(f"records must be a sequence (list/tuple) of dicts, got {type(records)}")

    paper_texts: List[str] = []
    rows_for_df: List[Dict[str, Any]] = []

    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            raise TypeError(f"Each record must be a dict. Found {type(rec)} at index {i}.")

        title = _as_text(rec.get(config.title_key, ""))
        abstract = _as_text(rec.get(config.abstract_key, ""))

        # Exact construction rule (single space joiner)
        text = title + config.joiner + abstract

        paper_texts.append(text)

        if config.make_dataframe:
            rows_for_df.append(
                {
                    "title": title,
                    "abstract": abstract,
                    "url": rec.get("url", ""),
                    "venue": rec.get("venue", ""),
                    "year": rec.get("year", ""),
                    "text": text,
                }
            )

    df_out = None
    if config.make_dataframe:
        if not _HAS_PANDAS:
            raise ImportError("pandas is required for make_dataframe=True, but it is not available.")
        df_out = pd.DataFrame(rows_for_df)

    return paper_texts, df_out


def validate_paper_texts(paper_texts: Sequence[str], expected_n: int) -> None:
    """Sanity checks for Section 3 outputs."""
    if len(paper_texts) != expected_n:
        raise AssertionError(f"len(paper_texts)={len(paper_texts)} != expected_n={expected_n}")
    if any(not isinstance(t, str) for t in paper_texts):
        raise AssertionError("paper_texts must contain only strings.")
    if any(len(t) == 0 for t in paper_texts):
        raise AssertionError("paper_texts contains empty strings (unexpected for this dataset).")


Writing section3_preprocessing.py


In [5]:
# =========================
# Section 3 — Data Preparation for Vectorization (Part 1 – Step 1)
# =========================

from section3_preprocessing import PrepConfig, build_paper_texts, validate_paper_texts

# Use the variable created in Section 2 (you already set: data = papers)
records = data  # list of dicts, one paper per record

config = PrepConfig(
    title_key="title",
    abstract_key="abstract",
    joiner=" ",          # MUST be exactly one space
    make_dataframe=True  # optional, but very useful in a notebook
)

paper_texts, papers_df = build_paper_texts(records, config=config)

# Robust sanity checks
N = len(records)
validate_paper_texts(paper_texts, expected_n=N)

print("[Section 3] Preprocessing complete.")
print(f" - Document unit          : 1 paper = 1 document")
print(f" - N documents (paper_texts): {len(paper_texts)}")
print(f" - DataFrame created      : {papers_df is not None}")
print("\n[Section 3] Example document (first record):")
print(paper_texts[0][:400] + ("..." if len(paper_texts[0]) > 400 else ""))

# Optional: quick inspection (keeps later steps easier)
if papers_df is not None:
    display(papers_df.head(3))


[Section 3] Preprocessing complete.
 - Document unit          : 1 paper = 1 document
 - N documents (paper_texts): 974
 - DataFrame created      : True

[Section 3] Example document (first record):
Rule Extraction for Tree-to-Tree Transducers by Cost Minimization Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restrictions encountered, without involving the use of a large set of complex rules difficult to analyze. We here show that these representations can be made very compact, indicate how to perform the ...


| | title | abstract | url | venue | year | text |
|---|---|---|---|---|---|---|
| 0 | Rule Extraction for Tree-to-Tree Transducers b... | Finite-state transducers give efficient repres... | http://aclweb.org/anthology/D16-1002 | EMNLP | 2016 | Rule Extraction for Tree-to-Tree Transducers b... |
| 1 | A Neural Network for Coordination Boundary Pre... | We propose a neural-network based model for co... | http://aclweb.org/anthology/D16-1003 | EMNLP | 2016 | A Neural Network for Coordination Boundary Pre... |
| 2 | Distinguishing Past, On-going, and Future Even... | The tremendous amount of user generated data t... | http://aclweb.org/anthology/D16-1005 | EMNLP | 2016 | Distinguishing Past, On-going, and Future Even... |


In [6]:
%%writefile test_section3_preprocessing.py
import unittest

from section3_preprocessing import PrepConfig, build_paper_texts, validate_paper_texts


class TestSection3Preprocessing(unittest.TestCase):
    def test_basic_concatenation_rule(self):
        records = [
            {"title": "Hello", "abstract": "World", "url": "u", "venue": "v", "year": 2016},
            {"title": "A", "abstract": "B", "url": "u2", "venue": "v2", "year": 2017},
        ]
        cfg = PrepConfig(make_dataframe=False)
        paper_texts, df = build_paper_texts(records, cfg)
        self.assertIsNone(df)
        self.assertEqual(paper_texts, ["Hello World", "A B"])

    def test_stable_length_and_validation(self):
        records = [{"title": "T", "abstract": "X"} for _ in range(5)]
        cfg = PrepConfig(make_dataframe=False)
        paper_texts, _ = build_paper_texts(records, cfg)
        validate_paper_texts(paper_texts, expected_n=5)  # should not raise

    def test_type_coercion_is_minimal_and_safe(self):
        records = [{"title": 123, "abstract": None}]
        cfg = PrepConfig(make_dataframe=False)
        paper_texts, _ = build_paper_texts(records, cfg)
        # title becomes "123", abstract becomes ""
        self.assertEqual(paper_texts[0], "123 ")

    def test_rejects_non_dict_records(self):
        records = [{"title": "OK", "abstract": "OK"}, "not_a_dict"]
        cfg = PrepConfig(make_dataframe=False)
        with self.assertRaises(TypeError):
            build_paper_texts(records, cfg)

    def test_dataframe_creation_if_enabled(self):
        records = [{"title": "T", "abstract": "A", "url": "u", "venue": "EMNLP", "year": 2016}]
        cfg = PrepConfig(make_dataframe=True)
        paper_texts, df = build_paper_texts(records, cfg)
        self.assertIsNotNone(df)
        self.assertIn("text", df.columns)
        self.assertEqual(df.loc[0, "text"], paper_texts[0])


if __name__ == "__main__":
    unittest.main()


Writing test_section3_preprocessing.py


In [7]:
!python -m unittest -v test_section3_preprocessing.py


test_basic_concatenation_rule (test_section3_preprocessing.TestSection3Preprocessing.test_basic_concatenation_rule) ... ok
test_dataframe_creation_if_enabled (test_section3_preprocessing.TestSection3Preprocessing.test_dataframe_creation_if_enabled) ... ok
test_rejects_non_dict_records (test_section3_preprocessing.TestSection3Preprocessing.test_rejects_non_dict_records) ... ok
test_stable_length_and_validation (test_section3_preprocessing.TestSection3Preprocessing.test_stable_length_and_validation) ... ok
test_type_coercion_is_minimal_and_safe (test_section3_preprocessing.TestSection3Preprocessing.test_type_coercion_is_minimal_and_safe) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.002s

OK


### Step 2: Building the BoW and TF-IDF Representations
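
As a starting point, the following is a minimal sketch of how BoW and TF-IDF representations could be built with scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The toy lists `toy_paper_texts` and `toy_queries` are illustrative assumptions that stand in for the notebook's `paper_texts` and for the query titles given in the assignment instructions; the default vectorizer settings are used here, which may not be the best choice for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins (assumptions) for `paper_texts` and for the assignment's query titles.
toy_paper_texts = [
    "Neural machine translation with attention",
    "Dependency parsing with neural networks",
    "Sentiment analysis of product reviews",
]
toy_queries = ["Attention for neural translation"]

# Bag of words: raw term counts. The vocabulary is learned from the papers only.
bow_vectorizer = CountVectorizer()
bow_paper_vectors = bow_vectorizer.fit_transform(toy_paper_texts)
# Queries are projected onto the same vocabulary with transform(), not fit_transform();
# words never seen in the papers (here "for") are simply ignored.
bow_query_vectors = bow_vectorizer.transform(toy_queries)

# TF-IDF: counts reweighted by inverse document frequency; rows are L2-normalised by default.
tfidf_vectorizer = TfidfVectorizer()
tfidf_paper_vectors = tfidf_vectorizer.fit_transform(toy_paper_texts)
tfidf_query_vectors = tfidf_vectorizer.transform(toy_queries)

print("BoW    papers/queries:", bow_paper_vectors.shape, bow_query_vectors.shape)
print("TF-IDF papers/queries:", tfidf_paper_vectors.shape, tfidf_query_vectors.shape)
```

Fitting on the papers and only transforming the queries keeps both collections in the same vector space, which is what the cosine similarity in the next step requires.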

### Step 3: Similarity Search

Use the following code snippet to compute the pairwise cosine similarity and to list the top 3 candidates for each query. In this example, `query_vectors` and `paper_vectors` are the vectorized collections of queries and papers (the papers being those stored in the JSON file downloaded previously), respectively. The variable `paper_texts` contains the list of concatenated titles and abstracts of the papers loaded from the JSON file.

In [1]:
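import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# ... build `query_vectors` and `paper_vectors` (Step 2) before running the lines below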

similarity_matrix = cosine_similarity(query_vectors, paper_vectors)

for nquery in [0, 1, 2]:
  similarity_scores = similarity_matrix[nquery]
  top_indices = np.argsort(similarity_scores)[::-1][:3]  # Indices of top 3 scores
  for i, index in enumerate(top_indices, 1):
    print(f"{i}. Text: '{paper_texts[index]}' (Score: {similarity_scores[index]:.4f})")
  print()

NameError: name 'query_vectors' is not defined
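
The snippet above only runs once `query_vectors` and `paper_vectors` exist. To see the ranking logic in isolation, here is a small self-contained sketch on hand-made vectors. The helper name `top_k_matches` and the toy arrays are assumptions introduced for illustration; they are not part of the assignment template.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_k_matches(query_vectors, paper_vectors, k=3):
    """Return, per query, the k (paper_index, score) pairs with the highest cosine similarity."""
    similarity_matrix = cosine_similarity(query_vectors, paper_vectors)
    results = []
    for scores in similarity_matrix:
        # argsort is ascending, so reverse it and keep the first k indices.
        top_indices = np.argsort(scores)[::-1][:k]
        results.append([(int(i), float(scores[i])) for i in top_indices])
    return results


# Toy vectors (assumptions): three "papers" and one "query" in a 3-term vocabulary.
toy_paper_vectors = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
toy_query_vectors = np.array([[1.0, 0.0, 1.0]])

toy_matches = top_k_matches(toy_query_vectors, toy_paper_vectors, k=2)
for rank, (index, score) in enumerate(toy_matches[0], 1):
    print(f"{rank}. paper #{index} (score: {score:.4f})")
```

The same helper should also accept the sparse matrices produced by scikit-learn's vectorizers, since `cosine_similarity` handles both dense and sparse input.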

### Step 4: Analysis of the results obtained

## Part 2:

In this section, students should use `sentence-transformers` to obtain sentence-embedding representations of the dataset and to perform searches for the best matches for the examples proposed in the instructions of the assignment.

### Step 1: Trying a general purpose small monolingual model
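
One possible way to organise this step is sketched below. The model name `all-MiniLM-L6-v2` is an assumption (it is one small English model distributed for `sentence-transformers`; the assignment may suggest a different one), and the helper names `embed_texts` and `rank_by_cosine` are hypothetical. The model is only loaded inside `embed_texts`, so the ranking logic can be checked on hand-made vectors without downloading anything.

```python
import numpy as np


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into L2-normalised sentence embeddings (downloads the model on first use)."""
    # Imported lazily so the rest of this cell works without the library or network access.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), convert_to_numpy=True, normalize_embeddings=True)


def rank_by_cosine(query_embeddings, paper_embeddings, k=3):
    """Return (top-k paper indices per query, full cosine score matrix)."""
    q = np.asarray(query_embeddings, dtype=float)
    p = np.asarray(paper_embeddings, dtype=float)
    # Normalise defensively so that the dot product equals the cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = q @ p.T
    # Negate the scores so that argsort yields the highest similarities first.
    return np.argsort(-scores, axis=1)[:, :k], scores


# Hand-made 2-D "embeddings" (assumptions) standing in for real model output.
toy_papers = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_queries = np.array([[2.0, 0.1]])
toy_top, toy_scores = rank_by_cosine(toy_queries, toy_papers, k=2)
print(toy_top)
```

With the notebook's data, the intended usage would be along the lines of `paper_embeddings = embed_texts(paper_texts)` followed by `rank_by_cosine(embed_texts(queries), paper_embeddings)`, where `queries` is the list of titles from the assignment instructions.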

### Step 2: Comparing other models
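
When comparing models, it can help to quantify how much their result lists agree in addition to reading them side by side. The sketch below uses the Jaccard overlap of the top-k paper indices as one simple option; this metric and the helper name `top_k_overlap` are suggestions rather than requirements of the assignment, and the index lists are made-up placeholders, not real results.

```python
def top_k_overlap(ranking_a, ranking_b, k=3):
    """Jaccard overlap between the top-k items of two rankings (1.0 = same set, 0.0 = disjoint)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty rankings are treated as identical
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical top-3 paper indices returned by two different models for the same query.
ranking_model_a = [12, 40, 7]
ranking_model_b = [40, 12, 99]

overlap = top_k_overlap(ranking_model_a, ranking_model_b, k=3)
print(f"Top-3 overlap: {overlap:.2f}")
```

Note that this measure ignores the order within the top k, so two models that return the same papers in a different order still score 1.0.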

### Step 3: Moving to a multilingual environment

When trying multilingual models, you will have to build a multilingual collection of papers. To do so, extend the collection of papers from the EMNLP conference used in the first part of this notebook with an extra collection of papers from the SEPLN conference, which includes papers in both English and Spanish. Create a new dataset that concatenates both collections to try a multilingual search. The collection of SEPLN papers can be downloaded from [https://www.dlsi.ua.es/~mespla/sepln.json](https://www.dlsi.ua.es/~mespla/sepln.json).
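
The concatenation described above could be sketched as follows. This assumes that the SEPLN file has the same structure as the EMNLP one (a JSON list of dicts with `title` and `abstract` keys), which should be checked after downloading it. The helper names `merge_collections` and `to_texts` and the toy records are illustrative assumptions.

```python
def merge_collections(named_collections):
    """Concatenate several lists of paper dicts, remembering where each record came from."""
    merged = []
    for source, records in named_collections.items():
        for rec in records:
            # Copy the record and add a 'source' field so results can be traced back.
            merged.append({**rec, "source": source})
    return merged


def to_texts(records):
    """Same rule as in Part 1: text = title + " " + abstract (missing fields become "")."""
    return [f"{rec.get('title') or ''} {rec.get('abstract') or ''}" for rec in records]


# Toy records (assumptions); the real lists would come from json.load() on the two files.
toy_emnlp = [{"title": "Neural parsing", "abstract": "We parse sentences."}]
toy_sepln = [
    {"title": "Análisis de sentimientos", "abstract": "Presentamos un sistema."},
    {"title": "Named entity recognition", "abstract": "We tag entities."},
]

multilingual_records = merge_collections({"EMNLP": toy_emnlp, "SEPLN": toy_sepln})
multilingual_texts = to_texts(multilingual_records)
print(len(multilingual_texts), [rec["source"] for rec in multilingual_records])
```

Keeping the `source` field makes it easy to see, for each query, whether a multilingual model retrieves papers from both collections or favours one language.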

## Concluding remarks