nerdpool-client

A Python client for downloading data from https://nerdpool-api.acdh-dev.oeaw.ac.at

install

pip install nerdpool_client

usage

list data set titles

from nerdpool_client import NerdPoolClient

client = NerdPoolClient()
print(client.data_sets)
# ['RTA', 'RITA', 'MRP', 'Chronik Aldersbach', 'DIPKO']

download samples as .jsonl file

  • Go to nerdpool-api and create/filter your preferred data sample, e.g. all samples from MRP:
from nerdpool_client import NerdPoolClient

url = "https://nerdpool-api.acdh-dev.oeaw.ac.at/api/ner-sample/?format=json&ner_ent_type__contains=&ner_source__title=MRP"
client = NerdPoolClient()
client.dump_to_jsonl(url)
# 'out.jsonl'
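
Once the file is written, it can be loaded back with the standard library. The following is a minimal sketch, not part of nerdpool_client: `read_jsonl` is a hypothetical helper, and it only assumes that the file holds one JSON object per line, as the .jsonl extension suggests. It makes no assumption about which fields each sample contains.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of a .jsonl file.

    Hypothetical helper for illustration; not part of nerdpool_client.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("out.jsonl")` after the download above should then return the samples as a list of Python objects.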

download samples as train.jsonl and eval.jsonl files

  • With file_name_prefix you can add a custom prefix to the default file names train.jsonl and eval.jsonl
  • The split parameter sets how often a sample is saved into eval.jsonl instead of train.jsonl: every split-th sample goes to eval.jsonl (for example, every tenth sample when split=10)
from nerdpool_client import NerdPoolClient

url = "https://nerdpool-api.acdh-dev.oeaw.ac.at/api/ner-sample/?format=json&ner_ent_type__contains=&ner_source__title=MRP"
client = NerdPoolClient()
client.dump_to_train_eval(url, file_name_prefix="mrp__", split=10)
# ['mrp__train.jsonl', 'mrp__eval.jsonl']
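
To make the effect of split concrete, here is a small sketch of an every-n-th split. It is an illustration under assumptions, not the library's actual implementation: `split_train_eval` is a hypothetical function, and counting positions from 1 is a guess, so the exact samples that nerdpool_client routes to eval.jsonl may differ.

```python
def split_train_eval(samples, split):
    """Send every `split`-th sample to eval and all others to train.

    Hypothetical illustration; positions are counted from 1 (an assumption).
    """
    if split < 1:
        raise ValueError("split must be a positive integer")
    train, eval_ = [], []
    for position, sample in enumerate(samples, start=1):
        if position % split == 0:
            eval_.append(sample)  # every split-th sample
        else:
            train.append(sample)
    return train, eval_
```

Under these assumptions, 20 samples with split=10 would give 18 training samples and 2 evaluation samples, i.e. roughly a 90/10 split.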