# Prompt engineering for CoNLL-U

Here's an example API query. To run it yourself, first run `export OPENAI_API_KEY_TREEBANKS="your api key"` in the terminal, using an API key from your own OpenAI account.
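Before starting a multi-minute run, it can be worth checking that the variable is actually visible to Python. This is a minimal sketch: `get_api_key` is a hypothetical helper and not part of `src.pipeline`, and it only assumes that the pipeline reads the variable named in the export command above.

```python
import os

# Variable name taken from the export command above.
ENV_VAR = "OPENAI_API_KEY_TREEBANKS"


def get_api_key(env=os.environ):
    """Return the API key, or raise a readable error if it is unset or empty."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; run `export {ENV_VAR}=...` first")
    return key
```

Calling `get_api_key()` at the top of the notebook fails immediately if the export was done in a different shell session than the one that launched Jupyter.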

In [None]:
import os
from src.pipeline import pipeline

input_sentence = "Thetta är inte begynnilsen aff Jesu Christi gudz sons euangelio."
conllu = pipeline(input_sentence, model="gpt-5-mini-2025-08-07")
print(conllu)

os.makedirs("output", exist_ok=True)
outname = f"output/parsed_{input_sentence.split()[0]}.conllu"
with open(outname, "w", encoding="utf-8") as f:
    f.write(conllu + "\n")

5it [03:00, 36.13s/it]

1	Thetta	thetta	PRON	_	Case=Nom|Gender=Neut|Number=Sing|PronType=Dem	4	nsubj	4:nsubj	_
2	är	vara	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	4:cop	_
3	inte	inte	PART	_	Polarity=Neg	2	neg	2:neg	_
4	begynnilsen	begynnilsen	NOUN	_	Case=Nom|Definite=Def|Gender=Com|Number=Sing	0	root	0:root	_
5	aff	aff	ADP	_	AdpType=Prep	10	case	10:case	_
6	Jesu	Jesu	PROPN	_	Case=Gen|Gender=Masc|Number=Sing	7	flat	7:flat|10:nmod:poss	_
7	Christi	Christi	PROPN	_	Case=Gen|Gender=Masc|Number=Sing	10	nmod:poss	10:nmod:poss	_
8	gudz	gudz	PROPN	_	Case=Gen|Gender=Masc|Number=Sing	9	nmod:poss	9:nmod:poss|10:nmod:poss	_
9	sons	sons	NOUN	_	Case=Gen|Gender=Masc|Number=Sing	10	nmod:poss	10:nmod:poss	_
10	euangelio	euangelio	NOUN	_	Case=Nom|Gender=Neut|Number=Sing	4	nmod	4:nmod	_
11	.	.	PUNCT	_	_	4	punct	4:punct	_





We can quickly check validity using the Python `conllu` package.

In [5]:
from src.pipeline import is_valid_conllu

# outname = f"output/parsed_{input_sentence.split()[0]}.conllu"
outname = "output/parsed.conllu"
validity = is_valid_conllu(outname)
if validity:
    print("CoNLL-U valid 🥳")

CoNLL-U valid 🥳
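The internals of `is_valid_conllu` are not shown here. To illustrate the kind of structural properties such a check is concerned with, here is a standard-library sketch that does not use the `conllu` package. `check_conllu_block` is a hypothetical helper, and it only looks at column count, token IDs, and heads, which is far less than a full validator does.

```python
def check_conllu_block(block):
    """Return a list of problems found in one CoNLL-U sentence block (empty if none)."""
    problems = []
    ids = []
    heads = []
    for line_no, line in enumerate(block.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"line {line_no}: expected 10 columns, got {len(cols)}")
            continue
        tok_id, head = cols[0], cols[6]
        if not tok_id.isdigit():
            continue  # multiword ranges (1-2) and empty nodes (1.1) are not checked
        ids.append(int(tok_id))
        if head.isdigit():
            heads.append((line_no, int(head)))
        else:
            problems.append(f"line {line_no}: HEAD is not an integer")
    if ids != list(range(1, len(ids) + 1)):
        problems.append("token IDs are not consecutive from 1")
    for line_no, head in heads:
        if head > len(ids):
            problems.append(f"line {line_no}: HEAD {head} points past the last token")
    if sum(1 for _, head in heads if head == 0) != 1:
        problems.append("expected exactly one root (HEAD = 0)")
    return problems


sample = (
    "# text = Hej .\n"
    "1\tHej\thej\tINTJ\t_\t_\t0\troot\t0:root\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t1:punct\t_\n"
)
print(check_conllu_block(sample))  # prints []
```

LLM output tends to break in exactly these ways (dropped columns, heads pointing at nonexistent tokens), so a cheap check like this can be useful before handing a file to a stricter validator.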


## The Batch API

Before running on a large file from Svensk diakronisk korpus, you can get a rough cost estimate. The estimate assumes the Batch API, which is 50% cheaper. Caveat emptor, though: your credit card is on its own! To be absolutely safe, set a spending limit in your project settings.

The algorithm counts tokens from all `# text =` fields in the input.

In [1]:
from src.count_tokens import count_total_tokens_and_cost

path = "data/svediakorp-rel108-Mar26SLundversion.tsv"
tokens, input_cost, output_cost = count_total_tokens_and_cost(path)
print(f"Total input tokens: {tokens}")
print(f"Approximate input cost: ${input_cost:.6f}")
print(f"Approximate output cost: ${output_cost:.6f}")
print(f"Approximate total cost: ${input_cost + output_cost:.6f}")

Total input tokens: 79705
Approximate input cost: $0.009963
Approximate output cost: $0.034200
Approximate total cost: $0.044163
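The arithmetic behind the input estimate can be sketched as follows. The code of `count_total_tokens_and_cost` is not shown, so everything here is an assumption: the helper names are hypothetical, whitespace splitting is a crude stand-in for a real model tokenizer, and the rate of $0.125 per million input tokens is simply what the printed figures imply (0.009963 / 79705), not a value read from the source.

```python
def extract_texts(conllu_text):
    """Collect the sentence text from every '# text = ...' comment line."""
    prefix = "# text = "
    return [
        line[len(prefix):]
        for line in conllu_text.splitlines()
        if line.startswith(prefix)
    ]


def count_tokens(texts, tokenize=str.split):
    """Count tokens over all texts; the default tokenizer is only a rough proxy."""
    return sum(len(tokenize(text)) for text in texts)


def estimate_cost(n_tokens, usd_per_million):
    """Convert a token count into US dollars at a per-million-token rate."""
    return n_tokens * usd_per_million / 1_000_000


# Reproduce the printed input cost from the printed token count.
print(f"${estimate_cost(79705, 0.125):.6f}")  # prints $0.009963
```

Passing a real tokenizer as `tokenize` would bring the count closer to what the API bills. The estimate also covers only the sentence text, so any prompt instructions sent along with each sentence would add to the input total.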


Irritatingly enough, we have to make one batch JSONL file per task, submit it, and then incorporate the results into the JSONL file of the next task.

For most sentences, the CoNLL-U table output does not seem to be much more than 2k tokens, but there are of course obscene exceptions, especially in premodern texts.

In [1]:
from src.batching import prepare_task1_responses_batch_jsonl

prepare_task1_responses_batch_jsonl(
    "data/svediakorp-rel108-Mar26SLundversion.conllu",
    "batches/batch_task1.jsonl",
    model="gpt-5-mini-2025-08-07",
    max_output_tokens=8192,
)

[prepare_task1_responses_batch_jsonl] wrote 171 requests -> batches/batch_task1.jsonl


171
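The per-task hand-off described in this section could look roughly like the sketch below. It is not the code in `src.batching`: all function names are hypothetical, the prompts are placeholders, and what each task actually asks for is not shown here. It assumes the Batch API line format (`custom_id`, `method`, `url`, `body`) with requests sent to the Responses endpoint, and a result file whose lines carry the same `custom_id` together with a `response` object.

```python
import json


def make_request(custom_id, prompt, model, max_output_tokens):
    """Build one batch request line targeting the Responses endpoint."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/responses",
        "body": {
            "model": model,
            "input": prompt,
            "max_output_tokens": max_output_tokens,
        },
    }


def to_jsonl(requests):
    """Serialize requests as JSONL, one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in requests)


def output_text(body):
    """Concatenate the text parts of a Responses body (assumed layout)."""
    parts = []
    for item in body.get("output", []):
        if item.get("type") == "message":
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    parts.append(part["text"])
    return "".join(parts)


def index_results(result_lines):
    """Map custom_id -> output text for every successful result line."""
    results = {}
    for line in result_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            continue  # failed requests are skipped and need a retry elsewhere
        results[record["custom_id"]] = output_text(response["body"])
    return results


def next_task_requests(previous, make_prompt, model, max_output_tokens):
    """Turn the previous task's outputs into the next task's requests."""
    return [
        make_request(custom_id, make_prompt(text), model, max_output_tokens)
        for custom_id, text in previous.items()
    ]
```

Reusing the same `custom_id` across tasks is the design choice that makes the chaining workable: batch results are not guaranteed to come back in input order, so the ID is the only reliable way to match a sentence with its intermediate output.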