# Citation Tagging End to End

This notebook tests `src/perscit_model/xml_processing/tagger.py` as an application that takes in an XML file and inserts citation-relevant tags into it. We take XML files with reasonably accurate citation tags, strip those tags, reinsert them with the application, and then measure the edit distance between the original version and the version whose citations were identified by the application.

In [1]:
%load_ext autoreload

In [6]:
from pathlib import Path

from lxml import etree
from rapidfuzz.distance import Levenshtein

from perscit_model.shared.xml_utils import strip_spec_elems, strip_spec_elem_attrs
from perscit_model.xml_processing.tagger import CitationTagger

## Prepare files

First, make a copy of each XML file with the attributes stripped from citation-relevant tags; this copy is the comparison target. Then make a second copy with the citation-relevant tags removed entirely; this copy is the input to the tagger.

In [3]:
src_dir = Path("xml_original")
cmp_dir = src_dir.parent / "xml_attr_stripped"
tagging_dir = src_dir.parent / "xml_cit_stripped"

def prep_xml(path: Path) -> None:
    """Write an attribute-stripped copy to cmp_dir and a tag-stripped copy to tagging_dir."""
    cmp_path = cmp_dir / path.name
    tagging_path = tagging_dir / path.name
    cit_tags = ("cit", "bibl", "quote")
    
    tree = etree.parse(path)
    root = tree.getroot()
    strip_spec_elem_attrs(root, cit_tags)
    tree.write(
        cmp_path, 
        encoding=tree.docinfo.encoding, 
        xml_declaration=tree.docinfo.xml_version is not None, 
        standalone=tree.docinfo.standalone
    )
    
    tree = etree.parse(path)
    root = tree.getroot()
    strip_spec_elems(root, cit_tags)
    tree.write(
        tagging_path, 
        encoding=tree.docinfo.encoding,
        xml_declaration=tree.docinfo.xml_version is not None, 
        standalone=tree.docinfo.standalone)

cmp_dir.mkdir(exist_ok=True)
tagging_dir.mkdir(exist_ok=True)
src_paths = sorted(src_dir.glob("*.xml"))

print(src_paths)

[PosixPath('xml_original/campbell-sophlanguage-2.xml'), PosixPath('xml_original/viaf2603144.viaf001.perseus-eng1.xml')]


In [4]:
for p in src_paths:
    prep_xml(p)

assert len(list(cmp_dir.glob("*.xml"))) == len(src_paths)
assert len(list(tagging_dir.glob("*.xml"))) == len(src_paths)
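
The helpers `strip_spec_elem_attrs` and `strip_spec_elems` live in `perscit_model.shared.xml_utils` and are not shown in this notebook. As a rough illustration of the two transformations described above, here is a minimal sketch built on `lxml` alone. The function names `strip_attrs_sketch` and `strip_elems_sketch` and the sample string are hypothetical, and the sketch ignores XML namespaces, so the real helpers may well behave differently.

```python
from lxml import etree

CIT_TAGS = ("cit", "bibl", "quote")


def strip_attrs_sketch(root, tags):
    """Hypothetical stand-in: clear every attribute on elements named in `tags`."""
    for elem in root.iter(*tags):
        elem.attrib.clear()


def strip_elems_sketch(root, tags):
    """Hypothetical stand-in: drop the tags but keep their text and children in place."""
    etree.strip_tags(root, *tags)


# Made-up sample, not taken from the files in xml_original.
sample = (
    '<p>See <cit><quote type="verse">arma virumque</quote> '
    '<bibl n="Verg. A. 1.1">Aen. 1.1</bibl></cit>.</p>'
)

# Comparison copy: tags kept, attributes removed.
attr_stripped = etree.fromstring(sample)
strip_attrs_sketch(attr_stripped, CIT_TAGS)
attr_stripped_xml = etree.tostring(attr_stripped, encoding="unicode")

# Tagger input: citation tags removed, text preserved.
cit_stripped = etree.fromstring(sample)
strip_elems_sketch(cit_stripped, CIT_TAGS)
cit_stripped_xml = etree.tostring(cit_stripped, encoding="unicode")

print(attr_stripped_xml)
print(cit_stripped_xml)
```

The second copy differs from the first only by the missing tag markup, which is what the distance metric later in the notebook is meant to pick up.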

## Inference

Now we use the `CitationTagger` class to infer and insert `<cit>`, `<quote>`, and `<bibl>` tags into the XML files in `tagging_dir`. This should simply copy any existing citation tags and ignore inferences that would overlap with them.

In [5]:
model_path = Path("../outputs/models/extraction/")
tagger = CitationTagger(model_path)

tagger.process_xml(tagging_dir, preserve_existing=True, overwrite=False)

XML validation failed for xml_cit_stripped/campbell-sophlanguage-2.xml, attempting to fix...
Successfully recovered malformed XML using recovering parser in strip_citation_tags
Successfully recovered malformed XML using recovering parser in strip_citation_tags
XML validation failed for xml_cit_stripped/viaf2603144.viaf001.perseus-eng1.xml, attempting to fix...
Successfully recovered malformed XML using recovering parser in strip_citation_tags
Successfully recovered malformed XML using recovering parser in strip_citation_tags
Base XML (without citations) is malformed for xml_cit_stripped/viaf2603144.viaf001.perseus-eng1.xml: Unescaped '<' not allowed in attributes values, line 1540, column 107 (<string>, line 1540)
XML parsing failed for file xml_cit_stripped/viaf2603144.viaf001.perseus-eng1.xml, attempting recovery: Unescaped '<' not allowed in attributes values, line 1540, column 107 (<string>, line 1540)
Successfully recovered malformed XML for file xml_cit_stripped/viaf2603144.viaf0
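
The rule stated at the top of this section, "ignore inferences that would overlap with existing tags", can be pictured as a filter over character spans. The internals of `CitationTagger` are not shown in this notebook, so the sketch below is only an assumption about what such a rule could look like; `Span`, `overlaps`, and `filter_predictions` are hypothetical names and are not part of the `perscit_model` API.

```python
from typing import List, Tuple

# (start, end) character offsets, end exclusive. Hypothetical representation.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Two half-open spans overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def filter_predictions(existing: List[Span], predicted: List[Span]) -> List[Span]:
    """Keep only predicted spans that do not overlap any existing span."""
    return [p for p in predicted if not any(overlaps(p, e) for e in existing)]


existing_spans = [(10, 25)]                     # a citation already tagged in the file
predicted_spans = [(0, 8), (20, 30), (25, 40)]  # made-up model output
kept = filter_predictions(existing_spans, predicted_spans)
print(kept)
```

With half-open spans, a prediction that starts exactly where an existing tag ends is kept, while one that begins inside the existing tag is dropped.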

## Metrics

Compute the normalized Levenshtein distance between each file in `tagging_dir/processed` and its attribute-stripped original in `cmp_dir`. As a baseline, compute the same distance between the citation-stripped files in `tagging_dir` and the attribute-stripped originals.

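We can use the `rapidfuzz` library to do this efficiently. To keep memory bounded on large files, the function below compares the files chunk by chunk, so its result is an approximation of the true normalized distance rather than the exact value.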

In [12]:
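def get_levenshtein_large(a: Path, b: Path, chunk_size: int = 10000) -> float:
    """Approximate the normalized Levenshtein distance between two text files.

    Both files are read in parallel in fixed-size chunks. The per-chunk
    distances are summed and divided by the summed length of the longer
    chunk in each pair.
    """
    total_dist = 0.0
    weight = 0
    with open(a, "r") as f1, open(b, "r") as f2:
        while True:
            text_a = f1.read(chunk_size)
            text_b = f2.read(chunk_size)

            # Stop once both files are exhausted.
            if (not text_a) and (not text_b):
                break
            total_dist += Levenshtein.distance(text_a, text_b)
            weight += max(len(text_a), len(text_b))

    # Guard against division by zero when both files are empty.
    return total_dist / weight if weight else 0.0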

# Sort each listing so files are paired consistently; glob order is not guaranteed.
cmp_paths = sorted(cmp_dir.glob("*.xml"))
prediction_distances = [
    get_levenshtein_large(path_a, path_b) for path_a, path_b in zip(
        sorted((tagging_dir / "processed").glob("*.xml")), cmp_paths
    )
]
base_distances = [
    get_levenshtein_large(path_a, path_b) for path_a, path_b in zip(
        sorted(tagging_dir.glob("*.xml")), cmp_paths
    )
]

In [13]:
prediction_distances

[0.7721622497569245, 0.7368869962778417]

In [14]:
base_distances

[0.776722096183092, 0.7356525483875351]
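
One caveat about chunk-wise comparison is worth illustrating. When one file contains extra markup, every later chunk of the two files starts at a different point in the text, so chunks that hold essentially the same content can still score as very different. The sketch below reproduces the chunking logic of `get_levenshtein_large` on in-memory strings, with a plain dynamic-programming edit distance swapped in for `rapidfuzz` so that it runs without dependencies. The toy strings are made up, and whether this effect accounts for the values above is an untested hypothesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (unit cost for each edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def normalized(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def chunked_normalized(a: str, b: str, chunk_size: int) -> float:
    """Same chunking scheme as get_levenshtein_large, applied to strings."""
    total, weight = 0, 0
    for start in range(0, max(len(a), len(b)), chunk_size):
        chunk_a = a[start:start + chunk_size]
        chunk_b = b[start:start + chunk_size]
        total += levenshtein(chunk_a, chunk_b)
        weight += max(len(chunk_a), len(chunk_b))
    return total / weight if weight else 0.0


base = "abcdefghij" * 4    # toy text without tags
tagged = "<cit>" + base    # same text with one opening tag inserted at the start

whole = normalized(tagged, base)
chunked = chunked_normalized(tagged, base, chunk_size=10)
print(whole, chunked)
```

In this toy case the whole-string distance reflects only the five inserted characters, while the chunked figure is larger because every chunk after the insertion is shifted. Comparing whole files where memory allows, or aligning chunks on shared boundaries, would avoid that inflation.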