Merge branch 'develop' into feature/json-corpus-interface
JeltevanBoheemen committed Jun 20, 2024
2 parents 7f9555c + 4166fc0 commit 968ac83
Showing 103 changed files with 4,204 additions and 3,601 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/backend-test.yml
@@ -0,0 +1,27 @@
# This workflow will run backend tests on the Python version defined in the Dockerfiles

name: Backend unit tests

on:
  workflow_dispatch:
  push:
    branches:
      - 'develop'
      - 'master'
      - 'feature/**'
      - 'bugfix/**'
      - 'hotfix/**'
      - 'release/**'
      - 'dependabot/**'
    paths-ignore:
      - 'frontend/**'
      - '**.md'

jobs:
  backend-test:
    name: Test Backend
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run backend tests
        run: sudo mkdir -p /ci-data && sudo docker-compose --env-file .env-ci run backend pytest
27 changes: 27 additions & 0 deletions .github/workflows/frontend-test.yml
@@ -0,0 +1,27 @@
# This workflow will run frontend tests on the Node version defined in the Dockerfiles

name: Frontend unit tests

on:
  workflow_dispatch:
  push:
    branches:
      - 'develop'
      - 'master'
      - 'feature/**'
      - 'bugfix/**'
      - 'hotfix/**'
      - 'release/**'
      - 'dependabot/**'
    paths-ignore:
      - 'backend/**'
      - '**.md'

jobs:
  frontend-test:
    name: Test Frontend
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run frontend tests
        run: sudo docker-compose --env-file .env-ci run frontend yarn test
22 changes: 0 additions & 22 deletions .github/workflows/release.yaml

This file was deleted.

25 changes: 25 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,25 @@
# This action will update the CITATION.cff file for new release or hotfix branches

name: Release

on:
  push:
    branches:
      - 'release/**'
      - 'hotfix/**'

jobs:
  citation-update:
    name: Update CITATION.cff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Autoformat CITATION.cff
        run: |
          version=`grep -o '\d\+\.\d\+\.\d\+' package.json`
          today=`date +"%Y-%m-%d"`
          sed -i "s/^version: [[:digit:]]\{1,\}\.[[:digit:]]\{1,\}\.[[:digit:]]\{1,\}/version: $version/" CITATION.cff
          sed -i "s/[[:digit:]]\{4\}-[[:digit:]]\{2\}-[[:digit:]]\{2\}/$today/" CITATION.cff
          bash ./update-citation.sh
          git commit -a -m "update version and date in CITATION.cff"
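
Reviewer note: the `grep -o` call in this workflow uses basic regular expressions, where `\d` is a Perl-style shorthand rather than a digit class, so the version lookup may not match as intended. The sketch below shows the same update step in Python, where `\d` is supported. It is an illustration and not the project's `update-citation.sh`; the function name `update_citation`, the sample strings, and the assumption that `package.json` has a top-level `"version"` field are all hypothetical.

```python
import json
import re
from datetime import date


def update_citation(citation_text: str, package_json_text: str, today: date) -> str:
    """Return citation_text with its version and release date refreshed."""
    # Assumption: package.json has a top-level "version" field.
    version = json.loads(package_json_text)["version"]
    # Rewrite the `version:` line, mirroring the first sed call.
    citation_text = re.sub(
        r"^version: \d+\.\d+\.\d+",
        f"version: {version}",
        citation_text,
        flags=re.MULTILINE,
    )
    # Rewrite only the `date-released:` value. The workflow's second sed call
    # is broader: it replaces the first date it finds on every line.
    return re.sub(
        r"^(date-released: '?)\d{4}-\d{2}-\d{2}",
        rf"\g<1>{today.isoformat()}",
        citation_text,
        flags=re.MULTILINE,
    )


# Hypothetical sample inputs, shaped like the CITATION.cff hunk in this commit.
sample_cff = "license: MIT\nversion: 5.6.2\ndate-released: '2024-05-06'\n"
sample_package = '{"name": "example", "version": "5.8.0"}'
print(update_citation(sample_cff, sample_package, date(2024, 6, 19)))
```

Reading the version through `json.loads` instead of a pattern match avoids picking up dependency version numbers elsewhere in `package.json`.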
32 changes: 0 additions & 32 deletions .github/workflows/test.yml

This file was deleted.

1 change: 1 addition & 0 deletions .nvmrc
@@ -0,0 +1 @@
18.17.1
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -35,5 +35,5 @@ keywords:
- elasticsearch
- natural language processing
license: MIT
-version: 5.6.2
-date-released: '2024-05-06'
+version: 5.8.0
+date-released: '2024-06-19'
3 changes: 3 additions & 0 deletions DockerfileElastic
@@ -0,0 +1,3 @@
FROM docker.elastic.co/elasticsearch/elasticsearch:8.10.2

RUN bin/elasticsearch-plugin install mapper-annotated-text
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ For corpora included in I-analyzer, the backend includes a definition file that

## Usage

-If you are interested in using I-analyzer, the most straightforward way to get started is to make an account at [ianalyzer.hum.uu.nl](https://ianalyzer.hum.uu.nl/). This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at [PEACE portal](https://peace.sites.uu.nl/epigraphy/search/) and [People & Parliament (not publicly accessible)](https://people-and-parliament.hum.uu.nl/).
+If you are interested in using I-analyzer, the most straightforward way to get started is to visit [ianalyzer.hum.uu.nl](https://ianalyzer.hum.uu.nl/). This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at [PEACE portal](https://peace.sites.uu.nl/epigraphy/search/) and [People & Parliament](https://people-and-parliament.hum.uu.nl/).

I-analyzer does not have an "upload data" option (yet!). If you are interested in using I-analyzer as a way to publish your dataset, or to make it easier to search and analyse, you can go about this two ways:

11 changes: 11 additions & 0 deletions backend/corpora/dbnl/citation/citation.md
@@ -12,6 +12,9 @@ I-analyzer presents the [DBNL-dataset](https://www.kb.nl/onderzoeken-vinden/data

> KB, Nationale Bibliotheek. "DBNL-dataset". *I-analyzer*, 2023, {{ frontend_url }}/search/dbnl
### Chicago "notes and bibliography" style
> KB, Nationale Bibliotheek, "DBNL-dataset", distributed by I-analyzer, 2023. {{ frontend_url }}/search/dbnl.

## Citing a specific work

@@ -34,3 +37,11 @@ This describes the query to view all chapters of the book on I-analyzer.
### MLA style

> Porjeere, Olivier. *Zanglievende uitspanningen*. Martinus de Bruijn, 1788. {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc
### Chicago "notes and bibliography" style
#### First note
> Olivier Porjeere, *Zanglievende uitspanningen* (Alkmaar: Martinus de Bruijn, 1788) {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc.
#### Shortened note
> Porjeere, *Zanglievende uitspanningen*
#### Bibliography entry
> Porjeere, Olivier. *Zanglievende uitspanningen*. Alkmaar: Martinus de Bruijn, 1788. {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc.
2 changes: 1 addition & 1 deletion backend/corpora/dutchannualreports/dutchannualreports.py
@@ -72,7 +72,7 @@ def sources(self, start=min_date, end=max_date):
full_path = op.join(directory, filename)
file_path = op.join(rel_dir, filename)
image_path = op.join(
-rel_dir, name + '.' + self.scan_image_type)
+rel_dir, name + '.pdf')
if extension != '.xml':
logger.debug(self.non_xml_msg.format(full_path))
continue
2 changes: 1 addition & 1 deletion backend/corpora/dutchnewspapers/dutchnewspapers_public.py
@@ -74,7 +74,7 @@ def sources(self, start=min_date, end=max_date):
self.definition_pattern.search(filename)), None)
if not definition_file:
continue
-meta_dict = self.metadata_from_xml(definition_file, tags=[
+meta_dict = self._metadata_from_xml(definition_file, tags=[
"title",
"date",
"publisher",
2 changes: 1 addition & 1 deletion backend/corpora/ecco/ecco.py
@@ -91,7 +91,7 @@ def sources(self, start=min_date, end=max_date):
'Volume'
]

-meta_dict = self.metadata_from_xml(
+meta_dict = self._metadata_from_xml(
full_path, tags=meta_tags)
meta_dict['id'] = record_id
meta_dict['category'] = category
3 changes: 2 additions & 1 deletion backend/corpora/guardianobserver/guardianobserver.py
@@ -86,7 +86,8 @@ def sources(self, start=datetime.min, end=datetime.max):
extractor=extract.XML(
tag='NumericPubDate', toplevel=True,
transform=lambda x: '{y}-{m}-{d}'.format(y=x[:4],m=x[4:6],d=x[6:])
-)
+),
+sortable=True,
),
FieldDefinition(
name='date-pub',
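
Reviewer note: the `transform` lambda in this hunk reshapes a compact `YYYYMMDD` string into a hyphenated date by slicing. A minimal standalone sketch of the same slicing follows, assuming the `NumericPubDate` value is always eight digits; the name `numeric_to_iso` and the sample value are hypothetical.

```python
def numeric_to_iso(value: str) -> str:
    """Reshape 'YYYYMMDD' into 'YYYY-MM-DD' using the same slices as the lambda."""
    return '{y}-{m}-{d}'.format(y=value[:4], m=value[4:6], d=value[6:])


print(numeric_to_iso('19010523'))  # prints 1901-05-23
```

Because the output is zero-padded and ordered year, month, day, its string order matches chronological order, which fits a field marked `sortable=True`.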
44 changes: 44 additions & 0 deletions backend/corpora/parliament/citation/netherlands.md
@@ -0,0 +1,44 @@
## Citing the entire corpus

People & Parliament presents the *Dutch parliamentary data* corpus, which is a combination of the following:
- Dutch parliamentary proceedings from 1814-2013, harvested and enriched in the [Political Mashup project](https://ssh.datastations.nl/dataset.xhtml?persistentId=doi:10.17026/dans-xk5-dw3s), retrieved 2020
- Dutch parliamentary proceedings from 2014-2022, harvested and enriched by [ParlaMINT](https://www.clarin.eu/parlamint), first retrieved 2020 and updated 2023

### Chicago "notes and bibliography" style
> University of Jyväskylä and Utrecht University, "Dutch Parliamentary data", distributed by People & Parliament, 2023. {{ frontend_url }}/search/parliament-netherlands.
### APA style

> University of Jyväskylä and Utrecht University (2023). *Dutch Parliamentary data* [data set]. People & Parliament. {{ frontend_url }}/search/parliament-netherlands
### MLA style

[MLA guidelines](https://style.mla.org/) recommend against citing a database, and recommend [citing each individual work you use](https://style.mla.org/separate-entries-database-works/). If you want to cite the entire corpus nonetheless, we recommend the following format:

> University of Jyväskylä and Utrecht University. "Dutch Parliamentary data". People & Parliament, 2023. {{ frontend_url }}/search/parliament-netherlands
## Referring to a debate
To get a URL for an entire debate, you can use the *view debate* link for a speech. This will get you a link like this:

{{ frontend_url }}/search/parliament-netherlands?debate_id=ParlaMint-NL_2021-12-21-eerstekamer-4&sort=sequence,asc

## Citing a specific speech

To cite a speech in the *Dutch Parliamentary data* corpus, you can retrieve a link by clicking the *link* icon underneath the speech's document tile. This should give you a URL like the following:
{{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-eerstekamer-4.u1

### Chicago "notes and bibliography" style
#### First note
> Mark Rutte in *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225.
#### Shortened note
> Rutte, *Meeting 37, Session 2 (2021-12-21)*
#### Bibliography entry
> Rutte, Mark. In *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225.
### APA style

> Rutte, M. (2021). In *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225
### MLA style

> Rutte, Mark. *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225
2 changes: 1 addition & 1 deletion backend/corpora/parliament/description/netherlands.md
@@ -1 +1 @@
-The debates of the First and Second Chamber of the bicameral parliament, enriched until the early 2010s by Maarten Marx for the Political Mashup project, and 2014-2020 by ParlaMINT. Metadata is provided.
+The debates of the First and Second Chamber of the bicameral parliament, enriched until the early 2010s by Maarten Marx for the Political Mashup project, and 2014-2023 by ParlaMINT. Metadata is provided.
3 changes: 3 additions & 0 deletions backend/corpora/parliament/netherlands.py
@@ -11,6 +11,7 @@
from corpora.parliament.utils.parlamint import extract_all_party_data, extract_people_data, extract_role_data, party_attribute_extractor, person_attribute_extractor
from corpora.utils.formatting import format_page_numbers
from corpora.parliament.parliament import Parliament
from corpora.utils.constants import document_context
import corpora.parliament.utils.field_defaults as field_defaults
import re

@@ -132,11 +133,13 @@ class ParliamentNetherlands(Parliament, XMLCorpusDefinition):
es_index = getattr(settings, 'PP_NL_INDEX', 'parliament-netherlands')
image = 'netherlands.jpg'
description_page = 'netherlands.md'
citation_page = 'netherlands.md'
tag_toplevel = lambda _, metadata: 'root' if is_old(metadata) else 'TEI'
tag_entry = lambda _, metadata: 'speech' if is_old(metadata) else 'u'
languages = ['nl']

category = 'parliament'
document_context = document_context()

def sources(self, start, end):
logger = logging.getLogger(__name__)
7 changes: 7 additions & 0 deletions backend/corpora/ublad/description/ublad.md
@@ -0,0 +1,7 @@
On 5 September 1969, Utrecht University got its first independent magazine: _U utrechtse universitaire reflexen_. This magazine arose from a merger of two other periodicals: _Sol Iustitiae_, which was aimed mainly at students, and _Solaire Reflexen_, which was intended more for staff. U utrechtse universitaire reflexen was intended for all sections of the university.

In 1974 the name changed to the _Ublad_. It stayed that way until the university decided to make the paper Ublad digital. Amid loud protest, the paper Ublad disappeared and, in April 2010, _DUB_, the digital university magazine, was created.

To make all information from the past accessible, the Centre for Digital Humanities, together with the University Library, has digitised the old volumes. In I-analyzer you can find and search all volumes of U utrechtse universitaire reflexen and the Ublad.

The independent Ublad gives a colourful account of what was going on at the university, in the city and in student life through articles, photos and cartoons. The image used for OCR is included for each page, so you can always consult the original source material.
Binary file added backend/corpora/ublad/images/ublad.jpg
14 changes: 14 additions & 0 deletions backend/corpora/ublad/tests/test_ublad.py
@@ -0,0 +1,14 @@
import locale
import pytest
from corpora.ublad.ublad import transform_date
import datetime


def test_transform_date():
    datestring = '6 september 2007'
    goal_date = datetime.date(2007, 9, 6)
    try:
        date = transform_date(datestring)
    except locale.Error:
        pytest.skip('Dutch Locale not installed in environment')
    assert date == str(goal_date)
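
Reviewer note: the test skips when the Dutch locale is missing, which suggests that `transform_date` depends on the system locale (its implementation is not shown in this diff). A locale-independent alternative could use a month lookup table, as sketched below. This is not the project's code; `DUTCH_MONTHS` and `parse_dutch_date` are hypothetical names, and the sketch assumes input of the form `day month year`.

```python
import datetime

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    'januari': 1, 'februari': 2, 'maart': 3, 'april': 4,
    'mei': 5, 'juni': 6, 'juli': 7, 'augustus': 8,
    'september': 9, 'oktober': 10, 'november': 11, 'december': 12,
}


def parse_dutch_date(datestring: str) -> str:
    """Convert e.g. '6 september 2007' to an ISO date string."""
    day, month_name, year = datestring.split()
    month = DUTCH_MONTHS[month_name.lower()]
    # str() of a datetime.date yields the ISO format YYYY-MM-DD.
    return str(datetime.date(int(year), month, int(day)))


print(parse_dutch_date('6 september 2007'))  # prints 2007-09-06
```

With a lookup table the test would not need to skip on machines without the Dutch locale, at the cost of maintaining the month names by hand.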