# Inference for Citation Extraction

This notebook sanity-checks inference for the citation extraction model.

In [1]:
%load_ext autoreload
# Reload edited modules automatically before each cell runs
%autoreload 2

## Imports

Import whatever we need for running inference. For now, this is just the function
to output raw labels from a string input. It's also worth checking what the default
loading location is for the model.

In [2]:
from transformers import AutoModelForTokenClassification
from perscit_model.extraction.inference import predict
from perscit_model.extraction.inference import MODEL_SAVE_DIR
print(MODEL_SAVE_DIR)

## Expected input text without `<bibl>`, `<cit>`, and `<quote>` tags

The cell below tests inference on text in the expected input format, with all XML tags that explicitly give
the desired labels removed. The output looks reasonable: the model tags a `BIBL` span followed directly by a
`QUOTE` span. This text is from the training data itself, so we should also look at text from the test data.

TODO: reconstruct XML tags based on labels and reconstruct string. Give metrics.

In [19]:
text = "direct or reported, is a regular Greek idiom. Cf. Plato, Rep. 332 D<foreign lang=\"greek\">h( ti/si ti/ a)podidou=sa te/xnh</foreign>; Soph. <title>El.</title> 751 oi(=' e)/rga dra/sas oi(=a lagxa/nei kaka/. In instances like the present"
labels = predict(text)
print(labels)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'I-BIBL', 'B-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'I-QUOTE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'
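A flat label list like the one above is hard to read. As a possible first step toward the TODO, here is a minimal sketch that collapses BIO labels into `(type, start, end)` spans over label positions. The helper name `bio_to_spans` is hypothetical and not part of `perscit_model`. It assumes one label per token, and mapping spans back to the input string would additionally need the token offsets, which `predict` does not return here.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO labels into (type, start, end) spans; end is exclusive.

    Hypothetical helper, not part of perscit_model. A stray I- label with no
    matching open span is treated leniently as the start of a new span.
    """
    spans = []
    current = None  # (type, start index) of the span being built, if any
    for i, label in enumerate(labels):
        prefix, _, kind = label.partition("-")
        # An I- label of the same type extends the open span.
        if prefix == "I" and current is not None and current[0] == kind:
            continue
        # Anything else closes the open span first.
        if current is not None:
            spans.append((current[0], current[1], i))
            current = None
        # B- (or a stray I-) opens a new span; O opens nothing.
        if prefix in ("B", "I"):
            current = (kind, i)
    if current is not None:
        spans.append((current[0], current[1], len(labels)))
    return spans


example = ["O", "B-BIBL", "I-BIBL", "B-QUOTE", "I-QUOTE", "I-QUOTE", "O"]
print(bio_to_spans(example))
```

Calling `bio_to_spans(labels)` on the output of `predict` should give a much shorter summary than the raw list, which makes it easier to compare predictions against the original tags by eye.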

## Text with `<bibl>`, `<cit>`, and `<quote>` tags

This tests the behaviour of model inference where the labels are present in the data
as XML tags. This is mostly just to diagnose data leakage issues: it's not in itself very important
whether the model identifies these tokens correctly or not.

What we see here is that, in the output shown, the model labels every token `O` when `<bibl>`, `<cit>`, and `<quote>` tags are present, so it overlooks the citations and quotes they enclose.
This does suggest that the model is looking for a pretty precise format for identifying these elements:
including explicit labels as XML tags is enough to throw off inference.
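If tagged text ever needs to go through inference, one option is to remove the label-bearing tags first. The sketch below uses a regular expression to drop `<bibl>`, `<cit>`, and `<quote>` tags while keeping their contents and all other markup such as `<foreign>` and `<title>`. The name `strip_label_tags` is hypothetical, and this is not necessarily how the training data was prepared. It does not normalise whitespace, and it assumes attribute values never contain `>`.

```python
import re

# Matches opening and closing <bibl>, <cit>, and <quote> tags, with or
# without attributes. Assumes attribute values never contain ">".
LABEL_TAG_RE = re.compile(r"</?(?:bibl|cit|quote)\b[^>]*>")


def strip_label_tags(text: str) -> str:
    """Remove label-bearing tags but keep the text they enclose.

    Hypothetical helper, not part of perscit_model.
    """
    return LABEL_TAG_RE.sub("", text)


tagged = (
    'Cf. <bibl n="Plat. Rep. 332D">Plato, Rep. 332 D</bibl> '
    '<cit><quote lang="greek">kaka/</quote></cit> <title>El.</title>'
)
print(strip_label_tags(tagged))
```

Because whitespace around the removed tags is left untouched, the result can differ slightly in spacing from the hand-prepared untagged text, which may matter if the model is sensitive to exact formatting.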

In [18]:
text = "direct or reported, is a regular Greek idiom. Cf. <bibl n=\"Plat. Rep. 332D\">Plato, Rep. 332 D</bibl> <foreign lang=\"greek\">h( ti/si ti/ a)podidou=sa te/xnh</foreign>;  <cit><bibl n=\"Soph. El. 751\">Soph. <title>El.</title> 751</bibl> <quote lang=\"greek\">oi(=' e)/rga dra/sas oi(=a lagxa/nei kaka/</quote></cit>. In instances like the present"
labels = predict(text)
print(labels)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',