# Subtask 4.5.2 Geotagging of texts

This subtask is part of the [ATRIUM](https://atrium-research.eu/) project. The data we use is from the [Digital Periegesis](https://www.periegesis.org).

### Step 0: Download and import libraries

For Ollama to work, you need to download the client and keep it running in the background.

* Download [Ollama client](https://ollama.com/download).
* After installing Ollama, run `ollama pull qwen3:14b` in your terminal.
* You can find this model [here](https://ollama.com/library/qwen3).
* If you want better results and have the computational capacity to support larger models, then you can choose one from this [list](https://ollama.com/search).

* For the embedding model we used [Qwen3-Embedding-8B](https://ollama.com/library/qwen3-embedding:8b).
* Run `ollama pull qwen3-embedding:8b` in your terminal.


To use [torch](https://pytorch.org/get-started/locally/), first check whether you have an Nvidia GPU with CUDA 12.6+ or only a CPU.

* A GPU with CUDA 12.6+ is highly recommended.
* This notebook runs on the CPU.
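
The GPU-or-CPU check can be sketched as below. This is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper name, and it falls back to the CPU when PyTorch is not installed at all.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so nothing can run on a GPU
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If this prints `cuda`, the GPU variant of the model-loading line in Step 4.1 is the one to use.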

In [None]:
import srsly
import ollama  # needed for ollama.embed in Step 4
from ollama import chat, ChatResponse
from tqdm import tqdm
import requests
from bs4 import BeautifulSoup
import re
import time
import numpy as np
import pickle
import faiss
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
from lxml import etree

In [None]:
# Create the folders
os.makedirs("files", exist_ok=True)
os.makedirs("data", exist_ok=True)
os.makedirs("books", exist_ok=True)
os.makedirs("books_chapters", exist_ok=True)

### Step 1: Extract the texts from Digital Periegesis

In [None]:
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"}

In [None]:
final_list = []

for number in tqdm(range(1, 358)):
    response = requests.get(f"https://www.periegesis.org/en/reports.php?reportid={number}", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")

    x = soup.find("div", class_="text_max_width")

    # Drop bold markup so it does not end up in the paragraph text
    for tag in soup.find_all(["b", "strong"]):
        tag.decompose()

    # The headings end with the chapter and book number
    chapter = int(soup.find("h2", class_="head").text.split()[-1])
    book = int(soup.find("h5", class_="head align_center").text.split()[-1])

    for i in x.find_all("p"):
        metadata = []
        clean_text = re.split(r"BOOK\s+\d+", i.text)[-1].strip()
        for pl in i.find_all("pl"):
            if (pl.get("id") or "").startswith("Q"):
                metadata.append({"chapter": chapter, "book": book})

        # Step 5 reads "book" and "chapter" from the top level of each record
        final_list.append({"text": clean_text, "book": book, "chapter": chapter, "metadata": metadata})

    time.sleep(1)  # be polite to the server

srsly.write_jsonl("files/1_pausanias.jsonl", final_list)
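
The two text-cleaning steps in the loop above can be tried in isolation. This is a minimal sketch: the helper names `clean_paragraph` and `trailing_number` are hypothetical, and the sample strings are made up for illustration rather than taken from the site.

```python
import re


def clean_paragraph(raw: str) -> str:
    """Keep only the text after the last "BOOK <n>" marker, if there is one."""
    return re.split(r"BOOK\s+\d+", raw)[-1].strip()


def trailing_number(heading: str) -> int:
    """Read the number at the end of a heading such as "Chapter 12"."""
    return int(heading.split()[-1])


# Made-up inputs, assuming headings end with a number
print(clean_paragraph("BOOK 1  Sample paragraph text. "))
print(clean_paragraph("No marker here"))
print(trailing_number("Chapter 12"))
```

A paragraph without a `BOOK <n>` marker passes through unchanged apart from whitespace trimming, because `re.split` then returns a single-element list.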

### Step 2: Perform Named Entity Recognition (NER) using the [NameTag 3](https://lindat.mff.cuni.cz/services/nametag/) service

In [None]:
data = list(srsly.read_jsonl("files/1_pausanias.jsonl"))

In [None]:
for element in tqdm(data):
    # Let requests URL-encode the text instead of pasting it raw into the URL
    params = {"data": element.get("text"), "model": "nametag3-multilingual-conll-250203", "output": "vertical"}
    x = requests.get("https://lindat.mff.cuni.cz/services/nametag/api/recognize", params=params, headers=headers).json()
    # Each result line is tab-separated; keep the entity text of everything that is not a person
    new_element = [i.split("\t")[-1].replace("\xad", "") for i in x.get("result").splitlines() if i.strip() and i.split("\t")[1] != "PER"]

    element.update({"nametag_mentions": new_element})

    srsly.write_jsonl("files/2_pausanias_nametag.jsonl", [element], append=True, append_new_line=False)
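
The filtering of the NameTag response can be sketched without calling the service. This sketch assumes, as the loop above does, that each line of the vertical output is tab-separated with the entity type in the second field and the entity text in the last one. The helper name `non_person_mentions` and the sample string are made up for illustration.

```python
def non_person_mentions(vertical: str) -> list[str]:
    """Return the entity text of every non-PER line in a vertical-format result."""
    mentions = []
    for line in vertical.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entity_type, entity_text = fields[1], fields[-1]
        if entity_type != "PER":
            # Remove soft hyphens so the mention can be found in the text later
            mentions.append(entity_text.replace("\xad", ""))
    return mentions


# Hand-written sample, not real service output
sample = "1\tLOC\tAthens\n3,4\tPER\tThemistocles son\n\n7\tLOC\tPi\xadraeus"
print(non_person_mentions(sample))
```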

### Step 3: Recontextualize the NER predictions using a Large Language Model (LLM)

In [None]:
# These are the instructions the LLM will use; change them to suit your task

# prompt = "Your task is to..."

prompt = 'You are a professional ancient Greek historian. ' \
'You will be given: A mention (e.g., “stoa,” “temple,” “grave,” “harbor”) along with a text from Pausanias. ' \
'Your task: Determine exactly what the mention refers to in context, paying attention to qualifiers such as “largest,” “near,” “old,” ' \
'as well as the subject matter of decorations, inscriptions, or dedications that can uniquely identify the place, structure, or object. ' \
'Write a concise, 2-sentence Wikipedia-style entry describing the place, including historical and archaeological context. ' \
'If multiple candidates exist (e.g., several harbors), specify the particular one the text refers to. ' \
'Output format example with three cases: ' \
'Example 1 Mention: building Text: “On entering the city there is a building for the preparation of the processions, ' \
'which are held in some cases every year, in others at longer intervals. Hard by is a temple of Demeter, with images of the goddess herself ' \
'and of her daughter, and of Iacchus holding a torch. On the wall, in Attic characters, is written that they are works of Praxiteles. ' \
'Not far from the temple is Poseidon on horseback, hurling a spear against the giant Polybotes … ' \
'From the gate to the Cerameicus there are stoas, and in front of them bronze statues of such as had some title to fame, both men and women.” ' \
'Output: Building (Pompeion): The building is the Pompeion, located in ancient Athens between the Dipylon and the Sacred Gate, ' \
'west of the Ancient Agora and the first building of the inner Kerameikos. ' \
'It was used for the preparation of processions, including those of the Panathenaia and other festivals. ' \
'Construction appears to have begun in the 5th century BCE and was largely completed by the early 4th century BCE. ' \
'Example 2 Mention: shipsheds Text: “Even up to my time there were shipsheds there, and near the largest harbor is the grave of Themistocles.” ' \
'Output: shipsheds: The shipsheds of Piraeus were covered naval storage structures used to house and maintain the Athenian fleet, ' \
'protecting triremes from weather and decay. They were built during the early 5th century BCE as part of Themistocles’ expansion of Piraeus' \
' and were located near Kantharos, the largest harbor. These shipsheds remained standing into Pausanias’ time, symbolizing the city’s ' \
'enduring naval infrastructure. ' \
'Example 3 Mention: harbor Text: “Even up to my time there were shipsheds there, and near the largest harbor is the grave of Themistocles.” ' \
'Output: harbor: Kantharos was the largest of the three harbors of Piraeus and served as the main port of ancient ' \
'Athens after Themistocles’ expansion around 493 BCE. It provided docking, loading, and maintenance facilities for the Athenian ' \
'fleet and was central to the city’s naval power and trade. Pausanias notes that near Kantharos were the shipsheds and' \
' the grave of Themistocles, highlighting its historical and maritime significance.'


In [None]:
data = list(srsly.read_jsonl("files/2_pausanias_nametag.jsonl"))

In [None]:
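model = "qwen3:14b"  # change this to your preferred model from Ollama (https://ollama.com/search)

for i in tqdm(data):
    lista = []
    text = i.get("text")
    search = 0  # position after the previous match, so repeated mentions map to successive occurrences
    for mention in i.get("nametag_mentions"):
        response: ChatResponse = chat(
            model=model,
            messages=[
                {"role": "system", "content": prompt},
                # "Text" matches the label used in the prompt examples
                {"role": "user", "content": f"Mention: {mention}, Text: {text}"},
            ],
            think=False,
        )

        start = text.find(mention, search)
        if start == -1:
            # Not found after the previous match: retry from the beginning of the text
            start = text.find(mention)
        if start == -1:
            end = -1  # the mention does not occur verbatim in the text
        else:
            end = start + len(mention)
            search = end

        lista.append({"name": mention, "start": start, "end": end, "recontext": response.message.content})

    i.update({"mentions_tagged": lista})

    srsly.write_jsonl("files/3_pausanias_ner.jsonl", [i], append=True, append_new_line=False)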
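
The character offsets stored with each mention come from a left-to-right scan of the paragraph. The sketch below shows that idea without the LLM call. `locate_mentions` is a hypothetical helper and the sentence is made up; it reports `None` offsets when a mention does not occur verbatim.

```python
def locate_mentions(text, mentions):
    """Map each mention to (mention, start, end), scanning the text left to right."""
    spans = []
    cursor = 0  # position just after the previous match
    for mention in mentions:
        start = text.find(mention, cursor)
        if start == -1:
            start = text.find(mention)  # retry from the beginning
        if start == -1:
            spans.append((mention, None, None))  # not present verbatim
            continue
        end = start + len(mention)
        cursor = max(cursor, end)
        spans.append((mention, start, end))
    return spans


text = "The harbor of Piraeus is the largest harbor."
spans = locate_mentions(text, ["harbor", "Piraeus", "harbor", "Athens"])
for name, start, end in spans:
    print(name, start, end)
```

Moving the cursor forward after each match is what lets the second `harbor` resolve to the second occurrence instead of the first.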

### Step 4: Use a [FAISS index](https://github.com/facebookresearch/faiss) for fast approximate retrieval [(HNSW)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)

* The FAISS index was built from the [ToposText](https://topostext.org/) database.

In [None]:
pausanias_data = list(srsly.read_jsonl("files/3_pausanias_ner.jsonl"))
data = srsly.read_json("data/ToposText_gazetteer.json")

index = faiss.read_index("data/topostext.index")
with open("data/topostext_meta.pkl", "rb") as f:
    metadata = pickle.load(f)

In [None]:
for i in tqdm(pausanias_data):
    for mention in i.get("mentions_tagged"):
        lista = []
        query_text = f"{mention.get('name')}: {mention.get('recontext')}"
        x_query = ollama.embed(model='qwen3-embedding:8b', input=query_text)  # the FAISS index was built with qwen3-embedding:8b
        query_vec = np.array(x_query["embeddings"], dtype=np.float32)

        # Retrieve the top 100 candidates for the reranker; comment this out if you skip Step 4.1
        distances, indices = index.search(query_vec, 100)

        # Uncomment this to take the top 1 result directly, without running Step 4.1
        #distances, indices = index.search(query_vec, 1)

        result_ids = metadata.get("ids")[indices]
        for idx, distance in zip(result_ids[0], distances[0]):
            lista.append([data["features"][idx].get("@id").split("/")[-1], str(distance)])

        mention.update({"vector_db": lista})
    srsly.write_jsonl("files/4_pausanias_faiss.jsonl", [i], append=True, append_new_line=False)
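
Conceptually, `index.search(query_vec, k)` returns the `k` stored vectors closest to the query, together with their distances. The sketch below reproduces that idea with an exhaustive search in plain Python. It is not FAISS and not HNSW, and it assumes Euclidean distance, since the notebook does not state which metric the index uses. The vectors are toy values.

```python
import math


def top_k(query, vectors, k):
    """Exhaustive nearest-neighbour search: return [(distance, position), ...]."""
    scored = [(math.dist(query, vector), position) for position, vector in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ones have thousands of dimensions
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
result = top_k([0.9, 0.1], vectors, k=2)
print(result)
```

HNSW trades this exact scan for an approximate graph search, which is why it stays fast on a large gazetteer.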

### (OPTIONAL) Step 4.1: Run a Reranker for better results

* We are using the [Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B).
* If you have the resources, you can use the larger [Qwen3-Reranker-8B](https://huggingface.co/Qwen/Qwen3-Reranker-8B) instead.
* If your resources are limited, use the smaller [Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B).

In [None]:
rerank_model = "Qwen/Qwen3-Reranker-4B"  # switch to 'Qwen/Qwen3-Reranker-8B' or 'Qwen/Qwen3-Reranker-0.6B' depending on your resources

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = "Determine whether the Document describes or identifies the same historical place, group, or location referred to in the Query. Answer 'yes' only if it clearly refers to that exact entity, not to a nearby site or people associated with it. Otherwise, answer 'no'."
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction,query=query, doc=doc)
    return output

def process_inputs(pairs):
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained(rerank_model, padding_side='left')

# FOR GPU - We recommend enabling flash_attention_2 or sdpa for better acceleration and memory saving.
#model = AutoModelForCausalLM.from_pretrained(rerank_model, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda').eval()

# FOR CPU
model = AutoModelForCausalLM.from_pretrained(rerank_model).eval()

token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = (
    "<|im_start|>system\n"
    "You are a factual judge. Using the Instruct and the Query, decide whether the Document "
    "clearly describes the same historical place, group, or location. "
    "Answer only with 'yes' or 'no'.\n"
    "<|im_end|>\n"
    "<|im_start|>user\n"
)
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
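
The score returned by `compute_logits` is the probability of "yes" after a softmax over just the two logits for "no" and "yes". The arithmetic can be checked on its own with the sketch below, which uses made-up logit values and a hypothetical helper name.

```python
import math


def yes_probability(logit_no: float, logit_yes: float) -> float:
    """Softmax over the pair (no, yes), returning the share assigned to "yes"."""
    shift = max(logit_no, logit_yes)  # subtract the maximum for numerical stability
    exp_no = math.exp(logit_no - shift)
    exp_yes = math.exp(logit_yes - shift)
    return exp_yes / (exp_no + exp_yes)


print(yes_probability(0.0, 0.0))  # equal logits: the model is undecided
print(yes_probability(0.0, 2.0))  # "yes" logit higher: score above 0.5
```

Because only two tokens enter the softmax, the score depends only on the difference between the two logits.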

In [None]:
topos_data = list(srsly.read_jsonl("files/4_pausanias_faiss.jsonl"))
data = srsly.read_json("data/ToposText_gazetteer.json")
data = data["features"]

task = "Determine whether the Document describes or identifies the same historical place, group, or location referred to in the Query. Answer 'yes' only if it clearly refers to that exact entity, not to a nearby site or people associated with it. Otherwise, answer 'no'."

In [None]:
# Map each ToposText ID to its "title: description" string once,
# instead of scanning the whole gazetteer for every candidate
documents = {}
for d in data:
    props = d.get("properties")
    documents.setdefault(d.get("@id").split("/")[-1], f"{props.get('title')}: {props.get('description')}")

for i in tqdm(topos_data):
    for mention in i.get("mentions_tagged"):
        # Rerank only the 30 nearest FAISS candidates
        topos_text_ids = [candidate[0] for candidate in mention.get("vector_db")][:30]
        query = f"The mention '{mention.get('name')}' appears in the context: {mention.get('recontext')}"
        scored = []

        for topos_id in topos_text_ids:
            if topos_id not in documents:
                continue
            pairs = [format_instruction(task, query, documents[topos_id])]

            inputs = process_inputs(pairs)
            scores = compute_logits(inputs)

            scored.append({topos_id: scores[0]})

        # Keep only the candidate with the highest "yes" probability
        best = max(scored, key=lambda x: list(x.values())[0])
        mention.update({"reranker": [best]})

    srsly.write_jsonl("files/4_1_pausanias_rerank.jsonl", [i], append=True, append_new_line=False)

### Step 5: Create the input for the [Recogito Studio](https://recogitostudio.org/)

In [None]:
# We assume that we executed the 4.1 optional step, comment this if you didn't execute it
data = list(srsly.read_jsonl("files/4_1_pausanias_rerank.jsonl"))

# Uncomment this if you didn't execute the 4.1 optional step
#data = list(srsly.read_jsonl("files/4_pausanias_faiss.jsonl"))

In [None]:
NS_TEI = "http://www.tei-c.org/ns/1.0"
NS_XML = "http://www.w3.org/XML/1998/namespace"
NSMAP = {None: NS_TEI}

In [None]:
counter = 1
uid_counter = 0
chapter = 0
current_book = None
current_chapter = None

In [None]:
# Run this cell if you want one XML file per book-chapter pair
tei = None  # no document has been started yet

for i in data:
    book = i.get("book")
    chapter = i.get("chapter")

    if tei is None or (book, chapter) != (current_book, current_chapter):
        # Write the finished chapter before starting the next one
        if tei is not None:
            tree = etree.ElementTree(tei)
            tree.write(
                f"books_chapters/pausanias_book_{current_book}_chapter_{current_chapter}.xml",
                xml_declaration=True,
                encoding="utf-8",
                pretty_print=True
            )

        tei = etree.Element("TEI", nsmap=NSMAP, version="3.3.0")
        standoff = etree.SubElement(tei, "standOff", type="recogito_studio_annotations")
        listannotation = etree.SubElement(standoff, "listAnnotation")
        text = etree.SubElement(tei, "text")
        body = etree.SubElement(text, "body")

        counter = 1  # paragraph numbering restarts in every file

        head = etree.SubElement(body, "head")
        head.text = f"Book {book}, Chapter {chapter}"

    current_book = book
    current_chapter = chapter

    p = etree.SubElement(body, "p")
    p.text = i.get("text")

    for mention in i.get("mentions_tagged"):
        # The target holds the start and end character offsets inside paragraph number `counter`
        p_path = f"/TEI[1]/text[1]/body[1]/p[{counter}]"
        annotation = etree.SubElement(listannotation, "annotation", target=f"{p_path}::{mention.get('start')} {p_path}::{mention.get('end')}")
        annotation.set(f"{{{NS_XML}}}id", f"UID-FAKE-{uid_counter}")

        # Comment this if you didn't run the optional step
        topos_id = list(mention.get("reranker")[0].keys())[0]

        # Uncomment this if you didn't run the optional step
        #topos_id = mention.get("vector_db")[0][0]

        rs = etree.SubElement(annotation, "rs", ana=f"https://topostext.org/place/{topos_id}")
        uid_counter += 1

    counter += 1

tree = etree.ElementTree(tei)
tree.write(
    f"books_chapters/pausanias_book_{current_book}_chapter_{current_chapter}.xml",
    xml_declaration=True,
    encoding="utf-8",
    pretty_print=True
)
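
Each annotation's `target` attribute pairs two pointers into the same paragraph: the start offset and the end offset, separated by a space. A small helper makes that format explicit. `annotation_target` is a hypothetical name used only for this sketch, and the offsets are made-up values.

```python
def annotation_target(paragraph_index: int, start: int, end: int) -> str:
    """Build the start/end pointer pair for a mention inside paragraph `paragraph_index`."""
    base = f"/TEI[1]/text[1]/body[1]/p[{paragraph_index}]"
    return f"{base}::{start} {base}::{end}"


target = annotation_target(2, 14, 21)
print(target)
```

The paragraph index is 1-based, matching XPath positions, which is why `counter` starts at 1 in every new file.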

In [None]:
# Run this cell if you want one XML file per book
tei = None  # reset so this cell also works on its own or after the per-chapter cell

for i in data:
    book = i.get("book")

    if tei is None or book != current_book:
        # Write the finished book before starting the next one
        if tei is not None:
            tree = etree.ElementTree(tei)
            tree.write(
                f"books/pausanias_book_{current_book}.xml",
                xml_declaration=True,
                encoding="utf-8",
                pretty_print=True
            )

        tei = etree.Element("TEI", nsmap=NSMAP, version="3.3.0")
        standoff = etree.SubElement(tei, "standOff", type="recogito_studio_annotations")
        listannotation = etree.SubElement(standoff, "listAnnotation")
        text = etree.SubElement(tei, "text")
        body = etree.SubElement(text, "body")

        counter = 1
        chapter = None  # forces a chapter heading at the start of every book

        head = etree.SubElement(body, "head")
        head.text = f"Book {book}"

    current_book = book

    if i.get("chapter") != chapter:
        head = etree.SubElement(body, "head")
        head.text = f"Chapter {i.get('chapter')}"
        chapter = i.get("chapter")

    p = etree.SubElement(body, "p")
    p.text = i.get("text")
    for mention in i.get("mentions_tagged"):
        # The target holds the start and end character offsets inside paragraph number `counter`
        p_path = f"/TEI[1]/text[1]/body[1]/p[{counter}]"
        annotation = etree.SubElement(listannotation, "annotation", target=f"{p_path}::{mention.get('start')} {p_path}::{mention.get('end')}")
        annotation.set(f"{{{NS_XML}}}id", f"UID-FAKE-{uid_counter}")

        # Comment this if you didn't run the optional step
        topos_id = list(mention.get("reranker")[0].keys())[0]

        # Uncomment this if you didn't run the optional step
        #topos_id = mention.get("vector_db")[0][0]

        rs = etree.SubElement(annotation, "rs", ana=f"https://topostext.org/place/{topos_id}")
        uid_counter += 1

    counter += 1

tree = etree.ElementTree(tei)
tree.write(
    f"books/pausanias_book_{current_book}.xml",
    xml_declaration=True,
    encoding="utf-8",
    pretty_print=True
)