feat(excel2xml): support French BC dates in find_date_in_string() (DE…
jnussbaum committed Dec 13, 2023
1 parent 6de6af7 commit 610f064
Showing 3 changed files with 103 additions and 25 deletions.
45 changes: 24 additions & 21 deletions docs/excel2xml-module.md
@@ -6,10 +6,10 @@

There are two kinds of Excel files that can be transformed into an XML file:

| structure | provenance | tool | example screenshot |
|------------------|-------------|--------------------------|----------------------------------------------------------|
| custom structure | customer | module `excel2xml` | ![](./assets/images/img-excel2xml-raw-data-category.png) |
| DSP structure | DSP server | CLI command `excel2xml` | ![](./assets/images/img-excel2xml-closeup.png) |
| structure | provenance | tool | example screenshot |
| ---------------- | ---------- | ----------------------- | -------------------------------------------------------- |
| custom structure | customer | module `excel2xml` | ![](./assets/images/img-excel2xml-raw-data-category.png) |
| DSP structure | DSP server | CLI command `excel2xml` | ![](./assets/images/img-excel2xml-closeup.png) |

The first use case is the most frequent: The DaSCH receives a data export from a research project. Every project uses
different software, so every project will deliver their data in a different structure. The screenshot is just a
@@ -182,7 +182,7 @@ With the help of Pandas, you can then iterate through the rows of your Excel/CSV
There are four kinds of resources that can be created:

| super | tag | method |
|--------------|----------------|---------------------|
| ------------ | -------------- | ------------------- |
| `Resource` | `<resource>` | `make_resource()` |
| `Annotation` | `<annotation>` | `make_annotation()` |
| `Region` | `<region>` | `make_region()` |
@@ -337,7 +337,7 @@ There are many problems that can occur with this simple approach! Often, a cell
might expect:

| cell content | return value of `bool(cell)` | You might have expected... |
|--------------|------------------------------|------------------------------------------------------------------|
| ------------ | ---------------------------- | ---------------------------------------------------------------- |
| 0 | False | True, because 0 is a valid integer for your integer property |
| " " | True | False, because an empty string is not usable for a text property |
| `numpy.nan` | True | False, because N/A is not a usable value |
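
The truthiness values in the table above can be reproduced in a plain Python session. The following is a minimal sketch, not part of `excel2xml`: `float("nan")` stands in for `numpy.nan` (an assumption that avoids the NumPy dependency), and `is_usable()` is a hypothetical helper that only illustrates what a stricter check might look like.

```python
import math

# The three cell values from the table.
# float("nan") stands in for numpy.nan (assumption: same float NaN semantics).
cells = [0, " ", float("nan")]

# Plain truthiness, as in the middle column of the table.
truthiness = [bool(cell) for cell in cells]
print(truthiness)  # [False, True, True]


def is_usable(cell: object) -> bool:
    """Hypothetical helper: True if the cell holds a value worth exporting."""
    if isinstance(cell, (int, float)):
        return not math.isnan(cell)  # 0 is usable, NaN is not
    if isinstance(cell, str):
        return cell.strip() != ""  # whitespace-only strings are not usable
    return False


usable = [is_usable(cell) for cell in cells]
print(usable)  # [True, False, False]
```

The second list matches the "You might have expected..." column: the check looks at the type of the value instead of relying on `bool()`.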
@@ -364,18 +364,21 @@ Notes:

Supported date formats:

| Input | Output |
|-------------------|---------------------------------------|
| 0476_09_04 | GREGORIAN:CE:0476-09-04:CE:0476-09-04 |
| 0476-09-04 | GREGORIAN:CE:0476-09-04:CE:0476-09-04 |
| 30.4.2021 | GREGORIAN:CE:2021-04-30:CE:2021-04-30 |
| 5/11/2021 | GREGORIAN:CE:2021-11-05:CE:2021-11-05 |
| Jan 26, 1993 | GREGORIAN:CE:1993-01-26:CE:1993-01-26 |
| February26,2051 | GREGORIAN:CE:2051-02-26:CE:2051-02-26 |
| 28.2.-1.12.1515 | GREGORIAN:CE:1515-02-28:CE:1515-12-01 |
| 25.-26.2.0800 | GREGORIAN:CE:0800-02-25:CE:0800-02-26 |
| 1.9.2022-3.1.2024 | GREGORIAN:CE:2022-09-01:CE:2024-01-03 |
| 1848 | GREGORIAN:CE:1848:CE:1848 |
| 1849/1850 | GREGORIAN:CE:1849:CE:1850 |
| 1849/50 | GREGORIAN:CE:1849:CE:1850 |
| 1845-50 | GREGORIAN:CE:1845:CE:1850 |
| Input | Output |
| ------------------ | ------------------------------------- |
| 0476_09_04 | GREGORIAN:CE:0476-09-04:CE:0476-09-04 |
| 0476-09-04 | GREGORIAN:CE:0476-09-04:CE:0476-09-04 |
| 30.4.2021 | GREGORIAN:CE:2021-04-30:CE:2021-04-30 |
| 5/11/2021 | GREGORIAN:CE:2021-11-05:CE:2021-11-05 |
| Jan 26, 1993 | GREGORIAN:CE:1993-01-26:CE:1993-01-26 |
| 28.2.-1.12.1515 | GREGORIAN:CE:1515-02-28:CE:1515-12-01 |
| 25.-26.2.0800 | GREGORIAN:CE:0800-02-25:CE:0800-02-26 |
| 1.9.2022-3.1.2024 | GREGORIAN:CE:2022-09-01:CE:2024-01-03 |
| 1848 | GREGORIAN:CE:1848:CE:1848 |
| 1849/1850 | GREGORIAN:CE:1849:CE:1850 |
| 1849/50 | GREGORIAN:CE:1849:CE:1850 |
| 1845-50 | GREGORIAN:CE:1845:CE:1850 |
| 840-850 | GREGORIAN:CE:840:CE:850 |
| 840-1 | GREGORIAN:CE:840:CE:841 |
| 1000-900 av. J.-C. | GREGORIAN:BC:1000:BC:900 |
| 45 av. J.-C. | GREGORIAN:BC:45:BC:45 |
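
Two rows of the table can be reproduced with the standard library to show the day-first reading of `5/11/2021` and the shape of the output string. This is only a sketch under stated assumptions: `to_dsp_date()` is a hypothetical helper, it takes an explicit `strptime` format, and it does not reflect how `find_date_in_string()` is implemented.

```python
from datetime import datetime


def to_dsp_date(day_first: str, fmt: str) -> str:
    """Hypothetical helper: format one day-first date as a DSP date string."""
    parsed = datetime.strptime(day_first, fmt)
    iso = f"{parsed.year:04d}-{parsed.month:02d}-{parsed.day:02d}"
    # A single date is written as a range whose start and end coincide.
    return f"GREGORIAN:CE:{iso}:CE:{iso}"


# Day comes first, so 5/11/2021 is the 5th of November.
print(to_dsp_date("5/11/2021", "%d/%m/%Y"))  # GREGORIAN:CE:2021-11-05:CE:2021-11-05
print(to_dsp_date("30.4.2021", "%d.%m.%Y"))  # GREGORIAN:CE:2021-04-30:CE:2021-04-30
```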
42 changes: 38 additions & 4 deletions src/dsp_tools/commands/excel2xml/excel2xml_lib.py
@@ -61,14 +61,44 @@ def make_xsd_id_compatible(string: str) -> str:
    return res


def _find_french_bc_date(
    string: str,
    lookbehind: str,
    lookahead: str,
) -> Optional[str]:
    french_bc_regex = r"av(?:\. |\.| )J\.?-?C\.?"
    if not regex.search(french_bc_regex, string):
        return None

    year_regex = r"\d{1,5}"
    sep_regex = r" ?- ?"

    year_range_regex = rf"{lookbehind}({year_regex}){sep_regex}({year_regex}) {french_bc_regex}{lookahead}"
    year_range = regex.search(year_range_regex, string)
    if year_range:
        start_year = int(year_range.group(1))
        end_year = int(year_range.group(2))
        if end_year > start_year:
            return None
        return f"GREGORIAN:BC:{start_year}:BC:{end_year}"

    single_year_regex = rf"{lookbehind}({year_regex}) {french_bc_regex}{lookahead}"
    single_year = regex.search(single_year_regex, string)
    if single_year:
        start_year = int(single_year.group(1))
        return f"GREGORIAN:BC:{start_year}:BC:{start_year}"

    return None


def find_date_in_string(string: str) -> Optional[str]:
    """
    Checks if a string contains a date value (single date, or date range), and returns the first found date as
    DSP-formatted string. Returns None if no date was found.
    Notes:
        - All dates are interpreted in the Christian era and the Gregorian calendar. There is no support for BC dates or
          non-Gregorian calendars.
        - All dates are interpreted in the Christian era and the Gregorian calendar.
        - BC dates are only supported in French notation (e.g. 1000-900 av. J.-C.).
        - The years 0000-2999 are supported, in 3/4-digit form.
        - Dates written with slashes are always interpreted in a European manner: 5/11/2021 is the 5th of November.
@@ -78,16 +108,17 @@ def find_date_in_string(string: str) -> Optional[str]:
        - 30.4.2021 -> GREGORIAN:CE:2021-04-30:CE:2021-04-30
        - 5/11/2021 -> GREGORIAN:CE:2021-11-05:CE:2021-11-05
        - Jan 26, 1993 -> GREGORIAN:CE:1993-01-26:CE:1993-01-26
        - February26,2051 -> GREGORIAN:CE:2051-02-26:CE:2051-02-26
        - 28.2.-1.12.1515 -> GREGORIAN:CE:1515-02-28:CE:1515-12-01
        - 25.-26.2.0800 -> GREGORIAN:CE:0800-02-25:CE:0800-02-26
        - 1.9.2022-3.1.2024 -> GREGORIAN:CE:2022-09-01:CE:2024-01-03
        - 800 -> GREGORIAN:CE:800:CE:800
        - 1848 -> GREGORIAN:CE:1848:CE:1848
        - 1849/1850 -> GREGORIAN:CE:1849:CE:1850
        - 1849/50 -> GREGORIAN:CE:1849:CE:1850
        - 1845-50 -> GREGORIAN:CE:1845:CE:1850
        - 840-50 -> GREGORIAN:CE:840:CE:850
        - 840-1 -> GREGORIAN:CE:840:CE:841
        - 1000-900 av. J.-C. -> GREGORIAN:BC:1000:BC:900
        - 45 av. J.-C. -> GREGORIAN:BC:45:BC:45
    Args:
        string: string to check
@@ -144,6 +175,9 @@ def find_date_in_string(string: str) -> Optional[str]:
    lookbehind = r"(?<![0-9A-Za-z])"
    lookahead = r"(?![0-9A-Za-z])"

    if french_bc_date := _find_french_bc_date(string=string, lookbehind=lookbehind, lookahead=lookahead):
        return french_bc_date

    # template: 2021-01-01 | 2015_01_02
    iso_date = regex.search(rf"{lookbehind}{year_regex}[_-]([0-1][0-9])[_-]([0-3][0-9]){lookahead}", string)
    # template: 6.-8.3.1948 | 6/2/1947 - 24.03.1948
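
The matching logic added in this file can be tried in isolation. The sketch below is a simplified stand-in, not the committed code: it uses the standard-library `re` module instead of the third-party `regex` package used in the commit, hard-codes the two boundary patterns, and the name `french_bc_sketch()` is hypothetical. Unlike the real `find_date_in_string()`, it does not fall through to the other date templates when no BC date is found.

```python
import re
from typing import Optional

# Boundaries: the match must not touch a neighbouring letter or digit.
BOUNDARY_BEFORE = r"(?<![0-9A-Za-z])"
BOUNDARY_AFTER = r"(?![0-9A-Za-z])"
# "av. J.-C." and its orthographical variants (av J.-C., av.JC, ...).
FRENCH_BC = r"av(?:\. |\.| )J\.?-?C\.?"
YEAR = r"\d{1,5}"


def french_bc_sketch(text: str) -> Optional[str]:
    """Hypothetical stand-in: return a DSP date string for a French BC date."""
    # Try the year range first, so "1000-900 av. J.-C." is not read as "900".
    range_match = re.search(
        rf"{BOUNDARY_BEFORE}({YEAR}) ?- ?({YEAR}) {FRENCH_BC}{BOUNDARY_AFTER}", text
    )
    if range_match:
        start, end = int(range_match.group(1)), int(range_match.group(2))
        if end > start:
            return None  # BC years count down, so 12-20 is not a valid range
        return f"GREGORIAN:BC:{start}:BC:{end}"

    single = re.search(rf"{BOUNDARY_BEFORE}({YEAR}) {FRENCH_BC}{BOUNDARY_AFTER}", text)
    if single:
        year = int(single.group(1))
        return f"GREGORIAN:BC:{year}:BC:{year}"
    return None


print(french_bc_sketch("Text 1000-900 av. J.-C. text"))  # GREGORIAN:BC:1000:BC:900
print(french_bc_sketch("Text 45 av JC text"))  # GREGORIAN:BC:45:BC:45
print(french_bc_sketch("Text 12-20 av. J.-C. text"))  # None
print(french_bc_sketch("Text12 av. J.-C. text"))  # None
```

The last call returns `None` because the negative lookbehind rejects a year that directly follows a letter, which is the same boundary rule the commit passes in as `lookbehind`.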
41 changes: 41 additions & 0 deletions test/unittests/commands/excel2xml/test_excel2xml_lib.py
@@ -278,6 +278,47 @@ def test_find_date_in_string(self) -> None:
        for testcase, expected in testcases.items():
            self.assertEqual(excel2xml.find_date_in_string(testcase), expected, msg=f"Failed with '{testcase}'")

    def test_find_date_in_string_french_bc(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text 12345 av. J.-C. text"), "GREGORIAN:BC:12345:BC:12345")
        self.assertEqual(excel2xml.find_date_in_string("Text 2000 av. J.-C. text"), "GREGORIAN:BC:2000:BC:2000")
        self.assertEqual(excel2xml.find_date_in_string("Text 250 av. J.-C. text"), "GREGORIAN:BC:250:BC:250")
        self.assertEqual(excel2xml.find_date_in_string("Text 33 av. J.-C. text"), "GREGORIAN:BC:33:BC:33")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av. J.-C. text"), "GREGORIAN:BC:1:BC:1")

    def test_find_date_in_string_french_bc_ranges(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text 99999-1000 av. J.-C. text"), "GREGORIAN:BC:99999:BC:1000")
        self.assertEqual(excel2xml.find_date_in_string("Text 1125-1050 av. J.-C. text"), "GREGORIAN:BC:1125:BC:1050")
        self.assertEqual(excel2xml.find_date_in_string("Text 1234-987 av. J.-C. text"), "GREGORIAN:BC:1234:BC:987")
        self.assertEqual(excel2xml.find_date_in_string("Text 350-340 av. J.-C. text"), "GREGORIAN:BC:350:BC:340")
        self.assertEqual(excel2xml.find_date_in_string("Text 842-98 av. J.-C. text"), "GREGORIAN:BC:842:BC:98")
        self.assertEqual(excel2xml.find_date_in_string("Text 45-26 av. J.-C. text"), "GREGORIAN:BC:45:BC:26")
        self.assertEqual(excel2xml.find_date_in_string("Text 53-7 av. J.-C. text"), "GREGORIAN:BC:53:BC:7")
        self.assertEqual(excel2xml.find_date_in_string("Text 6-5 av. J.-C. text"), "GREGORIAN:BC:6:BC:5")

    def test_find_date_in_string_french_bc_orthographical_variants(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av. J.-C. text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av J.-C. text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av.J.-C. text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av. J.C. text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av. J-C text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av.JC text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av JC text"), "GREGORIAN:BC:1:BC:1")
        self.assertEqual(excel2xml.find_date_in_string("Text 1 av. J.-C.text"), "GREGORIAN:BC:1:BC:1")

    def test_find_date_in_string_french_bc_dash_variants(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text 2000-1000 av. J.-C. text"), "GREGORIAN:BC:2000:BC:1000")
        self.assertEqual(excel2xml.find_date_in_string("Text 2000- 1000 av. J.-C. text"), "GREGORIAN:BC:2000:BC:1000")
        self.assertEqual(excel2xml.find_date_in_string("Text 2000 -1000 av. J.-C. text"), "GREGORIAN:BC:2000:BC:1000")
        self.assertEqual(excel2xml.find_date_in_string("Text 2000 - 1000 av. J.-C. text"), "GREGORIAN:BC:2000:BC:1000")

    def test_find_date_in_string_french_bc_invalid_syntax(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text12 av. J.-C. text"), None)
        self.assertEqual(excel2xml.find_date_in_string("Text 12 av. J.-Ctext"), None)
        self.assertEqual(excel2xml.find_date_in_string("Text 1 avJC text"), None)

    def test_find_date_in_string_french_bc_invalid_range(self) -> None:
        self.assertEqual(excel2xml.find_date_in_string("Text 12-20 av. J.-C. text"), None)

    def test_prepare_value(self) -> None:
        identical_values = ["Test", "Test", "Test"]
        different_values: list[Union[str, int, float]] = [1, 1.0, "1", "1.0", " 1 "]