Merge branch 'develop' into feature/json-corpus-interface

UUDigitalHumanitieslab · Jun 20, 2024 · 968ac83
2 parents 7f9555c + 4166fc0
Showing 103 changed files with 4,204 additions and 3,601 deletions.
diff --git a/.github/workflows/backend-test.yml b/.github/workflows/backend-test.yml
@@ -0,0 +1,27 @@
+# This workflow will run backend tests on the Python version defined in the Dockerfiles
+
+name: Backend unit tests
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - 'develop'
+      - 'master'
+      - 'feature/**'
+      - 'bugfix/**'
+      - 'hotfix/**'
+      - 'release/**'
+      - 'dependabot/**'
+    paths-ignore:
+      - 'frontend/**'
+      - '**.md'
+
+jobs:
+  backend-test:
+    name: Test Backend
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v3
+    - name: Run backend tests
+      run: sudo mkdir -p /ci-data && sudo docker-compose --env-file .env-ci run backend pytest
diff --git a/.github/workflows/frontend-test.yml b/.github/workflows/frontend-test.yml
@@ -0,0 +1,27 @@
+# This workflow will run frontend tests on the Node version defined in the Dockerfiles
+
+name: Frontend unit tests
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - 'develop'
+      - 'master'
+      - 'feature/**'
+      - 'bugfix/**'
+      - 'hotfix/**'
+      - 'release/**'
+      - 'dependabot/**'
+    paths-ignore:
+      - 'backend/**'
+      - '**.md'
+
+jobs:
+  frontend-test:
+    name: Test Frontend
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v3
+    - name: Run frontend tests
+      run: sudo docker-compose --env-file .env-ci run frontend yarn test
diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -0,0 +1,25 @@
+# This action will update the CITATION.cff file for new release or hotfix branches
+
+name: Release
+
+on:
+  push:
+    branches:
+      - 'release/**'
+      - 'hotfix/**'
+
+jobs:
+  citation-update:
+    name: Update CITATION.cff
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Autoformat CITATION.cff
+        run: |
+          version=`grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+' package.json`
+          today=`date +"%Y-%m-%d"`
+          sed -i "s/^version: [[:digit:]]\{1,\}\.[[:digit:]]\{1,\}\.[[:digit:]]\{1,\}/version: $version/" CITATION.cff
+          sed -i "s/[[:digit:]]\{4\}-[[:digit:]]\{2\}-[[:digit:]]\{2\}/$today/" CITATION.cff
+          bash ./update-citation.sh
+          git commit -a -m "update version and date in CITATION.cff"
+
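
The two `sed` substitutions in the workflow step above can be sketched in Python to make their effect easier to check. This is only an illustration under assumptions: the function name `update_citation` and the sample text are hypothetical and not part of the repository, and the sketch does not cover whatever `update-citation.sh` does. Like `sed` without the `g` flag, it replaces at most one match per line.

```python
import re


def update_citation(text: str, version: str, today: str) -> str:
    """Apply the same two substitutions as the sed commands, line by line."""
    updated = []
    for line in text.splitlines(keepends=True):
        # Equivalent of: s/^version: X.Y.Z/version: $version/
        line = re.sub(
            r"^version: [0-9]+\.[0-9]+\.[0-9]+",
            lambda _: f"version: {version}",
            line,
            count=1,
        )
        # Equivalent of: s/YYYY-MM-DD/$today/ (first date on the line only)
        line = re.sub(
            r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
            lambda _: today,
            line,
            count=1,
        )
        updated.append(line)
    return "".join(updated)


# Hypothetical sample mirroring the tail of CITATION.cff
sample = "license: MIT\nversion: 5.6.2\ndate-released: '2024-05-06'\n"
print(update_citation(sample, "5.8.0", "2024-06-19"))
```

Note that the date pattern is not anchored to `date-released:`, so any other ISO date in the file would also be rewritten; that matches the behaviour of the `sed` command as written.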
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
diff --git a/.nvmrc b/.nvmrc
@@ -0,0 +1 @@
+18.17.1
diff --git a/CITATION.cff b/CITATION.cff
@@ -35,5 +35,5 @@ keywords:
   - elasticsearch
   - natural language processing
 license: MIT
-version: 5.6.2
-date-released: '2024-05-06'
+version: 5.8.0
+date-released: '2024-06-19'
diff --git a/DockerfileElastic b/DockerfileElastic
@@ -0,0 +1,3 @@
+FROM docker.elastic.co/elasticsearch/elasticsearch:8.10.2
+
+RUN bin/elasticsearch-plugin install mapper-annotated-text
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ For corpora included in I-analyzer, the backend includes a definition file that
 
 ## Usage
 
-If you are interested in using I-analyzer, the most straightforward way to get started is to make an account at [ianalyzer.hum.uu.nl](https://ianalyzer.hum.uu.nl/). This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at [PEACE portal](https://peace.sites.uu.nl/epigraphy/search/) and [People & Parliament  (not publicly accessible)](https://people-and-parliament.hum.uu.nl/).
+If you are interested in using I-analyzer, the most straightforward way to get started is to visit [ianalyzer.hum.uu.nl](https://ianalyzer.hum.uu.nl/). This server is maintained by the Research Software Lab and contains corpora focused on a variety of fields. We also maintain more specialised collections at [PEACE portal](https://peace.sites.uu.nl/epigraphy/search/) and [People & Parliament](https://people-and-parliament.hum.uu.nl/).
 
 I-analyzer does not have an "upload data" option (yet!). If you are interested in using I-analyzer as a way to publish your dataset, or to make it easier to search and analyse, you can go about this two ways:
 

diff --git a/backend/corpora/dbnl/citation/citation.md b/backend/corpora/dbnl/citation/citation.md
@@ -12,6 +12,9 @@ I-analyzer presents the [DBNL-dataset](https://www.kb.nl/onderzoeken-vinden/data
 
 > KB, Nationale Bibliotheek. "DBNL-dataset". *I-analyzer*, 2023, {{ frontend_url }}/search/dbnl
 
+### Chicago "notes and bibliography" style
+> KB, Nationale Bibliotheek, "DBNL-dataset", distributed by I-analyzer, 2023. {{ frontend_url }}/search/dbnl.
+
 
 ## Citing a specific work
 
@@ -34,3 +37,11 @@ This describes the query to view all chapters of the book on I-analyzer.
 ### MLA style
 
 > Porjeere, Olivier. *Zanglievende uitspanningen*. Martinus de Bruijn, 1788. {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc
+
+### Chicago "notes and bibliography" style
+#### First note
+> Olivier Porjeere, *Zanglievende uitspanningen* (Alkmaar: Martinus de Bruijn, 1788) {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc.
+#### Shortened note
+> Porjeere, *Zanglievende uitspanningen*
+#### Bibliography entry
+> Porjeere, Olivier. *Zanglievende uitspanningen*. Alkmaar: Martinus de Bruijn, 1788. {{ frontend_url }}/search/dbnl?title_id=porj001zang01_01&sort=chapter_index,asc.
diff --git a/backend/corpora/dutchannualreports/dutchannualreports.py b/backend/corpora/dutchannualreports/dutchannualreports.py
@@ -72,7 +72,7 @@ def sources(self, start=min_date, end=max_date):
                 full_path = op.join(directory, filename)
                 file_path = op.join(rel_dir, filename)
                 image_path = op.join(
-                    rel_dir, name + '.' + self.scan_image_type)
+                    rel_dir, name + '.pdf')
                 if extension != '.xml':
                     logger.debug(self.non_xml_msg.format(full_path))
                     continue

diff --git a/backend/corpora/dutchnewspapers/dutchnewspapers_public.py b/backend/corpora/dutchnewspapers/dutchnewspapers_public.py
@@ -74,7 +74,7 @@ def sources(self, start=min_date, end=max_date):
                                 self.definition_pattern.search(filename)), None)
             if not definition_file:
                 continue
-            meta_dict = self.metadata_from_xml(definition_file, tags=[
+            meta_dict = self._metadata_from_xml(definition_file, tags=[
                     "title",
                     "date",
                     "publisher",

diff --git a/backend/corpora/ecco/ecco.py b/backend/corpora/ecco/ecco.py
@@ -91,7 +91,7 @@ def sources(self, start=min_date, end=max_date):
                             'Volume'
                         ]
 
-                        meta_dict = self.metadata_from_xml(
+                        meta_dict = self._metadata_from_xml(
                             full_path, tags=meta_tags)
                         meta_dict['id'] = record_id
                         meta_dict['category'] = category

diff --git a/backend/corpora/guardianobserver/guardianobserver.py b/backend/corpora/guardianobserver/guardianobserver.py
@@ -86,7 +86,8 @@ def sources(self, start=datetime.min, end=datetime.max):
             extractor=extract.XML(
                 tag='NumericPubDate', toplevel=True,
                 transform=lambda x: '{y}-{m}-{d}'.format(y=x[:4],m=x[4:6],d=x[6:])
-                )
+            ),
+            sortable=True,
         ),
         FieldDefinition(
             name='date-pub',
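
The `transform` lambda in this hunk reshapes a compact `YYYYMMDD` string into an ISO-style date by slicing. A minimal sketch of that behaviour follows; the function names `format_numeric_pub_date` and `format_numeric_pub_date_strict` are hypothetical, and the stricter variant is an assumption about how one might validate the input, not something the corpus definition does.

```python
from datetime import datetime


def format_numeric_pub_date(x: str) -> str:
    """Same slicing as the lambda in the diff: 'YYYYMMDD' -> 'YYYY-MM-DD'."""
    return '{y}-{m}-{d}'.format(y=x[:4], m=x[4:6], d=x[6:])


def format_numeric_pub_date_strict(x: str) -> str:
    """Hypothetical variant that raises ValueError on impossible dates."""
    return datetime.strptime(x, '%Y%m%d').date().isoformat()


print(format_numeric_pub_date('20240620'))         # 2024-06-20
print(format_numeric_pub_date_strict('20240620'))  # 2024-06-20
```

The slicing version performs no validation, so a malformed value such as `'20241340'` would pass through as `'2024-13-40'`, whereas the `strptime` variant would raise.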

diff --git a/backend/corpora/parliament/citation/netherlands.md b/backend/corpora/parliament/citation/netherlands.md
@@ -0,0 +1,44 @@
+## Citing the entire corpus
+
+People & Parliament presents the *Dutch parliamentary data* corpus, which is a combination of the following:
+- Dutch parliamentary proceedings from 1814-2013, harvested and enriched in the [Political Mashup project](https://ssh.datastations.nl/dataset.xhtml?persistentId=doi:10.17026/dans-xk5-dw3s), retrieved 2020
+- Dutch parliamentary proceedings from 2014-2022, harvested and enriched by [ParlaMINT](https://www.clarin.eu/parlamint), first retrieved 2020 and updated 2023
+
+### Chicago "notes and bibliography" style
+> University of Jyväskylä and Utrecht University, "Dutch Parliamentary data", distributed by People & Parliament, 2023. {{ frontend_url }}/search/parliament-netherlands.
+
+### APA style
+
+> University of Jyväskylä and Utrecht University (2023). *Dutch Parliamentary data* [data set]. People & Parliament. {{ frontend_url }}/search/parliament-netherlands
+
+### MLA style
+
+[MLA guidelines](https://style.mla.org/) recommend against citing a database, and recommend [citing each individual work you use](https://style.mla.org/separate-entries-database-works/). If you want to cite the entire corpus nonetheless, we recommend the following format:
+
+> University of Jyväskylä and Utrecht University. "Dutch Parliamentary data". People & Parliament, 2023. {{ frontend_url }}/search/parliament-netherlands
+
+## Referring to a debate
+To get a URL for an entire debate, you can use the *view debate* link for a speech. This will get you a link like this:
+
+    {{ frontend_url }}/search/parliament-netherlands?debate_id=ParlaMint-NL_2021-12-21-eerstekamer-4&sort=sequence,asc
+
+## Citing a specific speech
+
+To cite a speech in the *Dutch Parliamentary data* corpus, you can retrieve a link by clicking the *link* icon underneath the speech's document tile. This should give you a URL as follows:
+{{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-eerstekamer-4.u1
+
+### Chicago "notes and bibliography" style
+#### First note
+> Mark Rutte in *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225.
+#### Shortened note
+> Rutte, *Meeting 37, Session 2 (2021-12-21)*
+#### Bibliography entry
+> Rutte, Mark. In *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225.
+
+### APA style
+
+> Rutte, M. (2021). In *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225
+
+### MLA style
+
+> Rutte, Mark. *Report of the meeting of the Dutch Lower House, Meeting 37, Session 2 (2021-12-21)*, 2021. {{ frontend_url }}/document/parliament-netherlands/ParlaMint-NL_2021-12-21-tweedekamer-2.u225
diff --git a/backend/corpora/parliament/description/netherlands.md b/backend/corpora/parliament/description/netherlands.md
@@ -1 +1 @@
-The debates of the First and Second Chamber of the bicameral parliament, enriched until the early 2010s by Maarten Marx for the Political Mashup project, and 2014-2020 by ParlaMINT. Metadata is provided.
+The debates of the First and Second Chamber of the bicameral parliament, enriched until the early 2010s by Maarten Marx for the Political Mashup project, and 2014-2023 by ParlaMINT. Metadata is provided.
diff --git a/backend/corpora/parliament/netherlands.py b/backend/corpora/parliament/netherlands.py
@@ -11,6 +11,7 @@
 from corpora.parliament.utils.parlamint import extract_all_party_data, extract_people_data, extract_role_data, party_attribute_extractor, person_attribute_extractor
 from corpora.utils.formatting import format_page_numbers
 from corpora.parliament.parliament import Parliament
+from corpora.utils.constants import document_context
 import corpora.parliament.utils.field_defaults as field_defaults
 import re
 
@@ -132,11 +133,13 @@ class ParliamentNetherlands(Parliament, XMLCorpusDefinition):
     es_index = getattr(settings, 'PP_NL_INDEX', 'parliament-netherlands')
     image = 'netherlands.jpg'
     description_page = 'netherlands.md'
+    citation_page = 'netherlands.md'
     tag_toplevel = lambda _, metadata: 'root' if is_old(metadata) else 'TEI'
     tag_entry = lambda _, metadata: 'speech' if is_old(metadata) else 'u'
     languages = ['nl']
 
     category = 'parliament'
+    document_context = document_context()
 
     def sources(self, start, end):
         logger = logging.getLogger(__name__)

diff --git a/backend/corpora/ublad/description/ublad.md b/backend/corpora/ublad/description/ublad.md
@@ -0,0 +1,7 @@
+On 5 September 1969, Utrecht University got an independent magazine for the first time: _U utrechtse universitaire reflexen_. This magazine arose from a merger of two other periodicals: _Sol Iustitiae_, which was aimed mainly at students, and _Solaire Reflexen_, which was intended more for staff. U utrechtse universitaire reflexen was intended for all sections of the university.
+
+In 1974 the name changed to the _Ublad_. It stayed that way until the university decided to make the paper Ublad digital. Amid loud protest, the paper Ublad disappeared, and in April 2010 _DUB_, the digital university magazine, came into being.
+
+To make all information from the past accessible, the Centre for Digital Humanities, together with the University Library, digitised the old volumes. In I-analyzer you can find and search all volumes of U utrechtse universitaire reflexen and the Ublad.
+
+The independent Ublad gives a colourful account of what was going on at the university, in the city and in student life through articles, photos and cartoons. The image used for OCR is included for each page, so that you can always consult the original source material.
diff --git a/backend/corpora/ublad/images/ublad.jpg b/backend/corpora/ublad/images/ublad.jpg
diff --git a/backend/corpora/ublad/tests/test_ublad.py b/backend/corpora/ublad/tests/test_ublad.py
@@ -0,0 +1,14 @@
+import locale
+import pytest
+from corpora.ublad.ublad import transform_date
+import datetime
+
+
+def test_transform_date():
+    datestring = '6 september 2007'
+    goal_date = datetime.date(2007, 9, 6)
+    try:
+        date = transform_date(datestring)
+    except locale.Error:
+        pytest.skip('Dutch Locale not installed in environment')
+    assert date == str(goal_date)
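
The test above skips when the Dutch locale is missing, which suggests that `transform_date` relies on locale-aware parsing; its implementation is not shown in this diff. As a rough, locale-independent sketch of the same input/output contract, one could map Dutch month names explicitly. The names `DUTCH_MONTHS` and `transform_date_sketch` are hypothetical and are not the function in `corpora.ublad.ublad`.

```python
import datetime

# Dutch month names in calendar order, mapped to month numbers 1..12
DUTCH_MONTHS = {
    name: number
    for number, name in enumerate(
        [
            'januari', 'februari', 'maart', 'april', 'mei', 'juni',
            'juli', 'augustus', 'september', 'oktober', 'november', 'december',
        ],
        start=1,
    )
}


def transform_date_sketch(datestring: str) -> str:
    """Parse a string like '6 september 2007' without needing a system locale."""
    day, month_name, year = datestring.split()
    month = DUTCH_MONTHS[month_name.lower()]
    return str(datetime.date(int(year), month, int(day)))


print(transform_date_sketch('6 september 2007'))  # 2007-09-06
```

A lookup table like this avoids the `locale.Error` path entirely, at the cost of handling only the exact `day month year` layout used in the test.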