Commit

feat(xmlupload)!: allow both IDs and IRIs, remove --incremental flag (D…
jnussbaum committed Sep 1, 2023
1 parent 03e19d7 commit df1cf13
Showing 13 changed files with 60 additions and 41 deletions.
5 changes: 2 additions & 3 deletions docs/cli-commands.md
@@ -135,16 +135,15 @@ The following options are available:
- `-u` | `--user` (optional, default: `root@example.com`): username (e-mail) used for authentication with the DSP-API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP-API
- `-i` | `--imgdir` (optional, default: `.`): folder from where the paths in the `<bitstream>` tags are evaluated
-- `-I` | `--incremental` (optional) : The links in the XML file point to IRIs (on the server)
-instead of IDs (in the same XML file).
- `-V` | `--validate` (optional): validate the XML file without uploading it
- `-v` | `--verbose` (optional): print more information about the progress to the console
- `-m` | `--metrics` (optional): write metrics into a 'metrics' folder

Output:

- A file named `id2iri_mapping_[timestamp].json` is written to the current working directory.
-This file should be kept if data is later added with the [`--incremental` option](./incremental-xmlupload.md)
+This file should be kept if a second data delivery is added at a later point of time
+[see here](./incremental-xmlupload.md).

The defaults are intended for local testing:

7 changes: 3 additions & 4 deletions docs/file-formats/xml-data-file.md
@@ -9,8 +9,8 @@ After a successful upload of the data,
an output file is written (called `id2iri_mapping_[timestamp].json`)
with the mapping from the internal IDs used inside the XML
to their corresponding IRIs which uniquely identify them inside DSP.
-This file should be kept if data is later added with the
-`--incremental` [option](../incremental-xmlupload.md).
+This file should be kept if a second data delivery is added at a later point of time
+[see here](../incremental-xmlupload.md).

The import file must start with the standard XML header:

@@ -627,8 +627,7 @@ Attributes:
#### `<resptr>`

The `<resptr>` element contains either the internal ID of another resource inside the XML or the IRI of an already
-existing resource on DSP. Inside the same XML file, a mixture of the two is not possible. If referencing existing
-resources, `xmlupload --incremental` has to be used.
+existing resource on DSP.

Attributes:

14 changes: 2 additions & 12 deletions docs/incremental-xmlupload.md
@@ -44,7 +44,7 @@ The file `additional_data.xml` contains references like `<resptr>http://rdfh.ch/
Such a file can be uploaded with

```bash
-dsp-tools xmlupload --incremental additional_data.xml
+dsp-tools xmlupload additional_data.xml
```


@@ -61,9 +61,6 @@ its internal IDs must be replaced by their respective IRIs.
That's where the JSON mapping file comes in:
It contains a mapping from `book_1` to `http://rdfh.ch/4123/nyOODvYySV2nJ5RWRdmOdQ`.

-
-### id2iri
-
As a first step,
a new file must be generated
with the [`id2iri` command](./cli-commands.md#id2iri),
@@ -74,19 +71,12 @@ dsp-tools id2iri additional_data.xml id2iri_mapping_[timestamp].json
```



-### incremental xmlupload

As second step, the newly generated XML file can be uploaded to DSP:

```bash
-dsp-tools xmlupload --incremental additional_data_replaced_[timestamp].xml
+dsp-tools xmlupload additional_data_replaced_[timestamp].xml
```

| <center>Important</center> |
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| Internal IDs and IRIs cannot be mixed within the same file. An XML file uploaded with the incremental option must not contain any internal IDs. |



## 4. Continue an interruped xmlupload
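
The `id2iri` step described in `docs/incremental-xmlupload.md` can be pictured with a small sketch. This is not the dsp-tools implementation: the function name `replace_ids_with_iris` is hypothetical, the sample XML omits the namespace that real data files declare, and only `<resptr>` elements are handled (not `salsah-link` references in rich text). IDs that are missing from the mapping are left unchanged, in line with the `links_to_same_file_wont_be_replaced` test resource added in this commit.

```python
import xml.etree.ElementTree as ET


def replace_ids_with_iris(xml_text: str, mapping: dict[str, str]) -> str:
    """Replace the text of every <resptr> whose ID is in the mapping (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for resptr in root.iter("resptr"):
        internal_id = (resptr.text or "").strip()
        # IDs that are not in the mapping point to resources in the same file
        # and are deliberately left as they are.
        if internal_id in mapping:
            resptr.text = mapping[internal_id]
    return ET.tostring(root, encoding="unicode")


# Assumed sample data: one ID from an earlier upload, one ID from the same file.
mapping = {"book_1": "http://rdfh.ch/4123/nyOODvYySV2nJ5RWRdmOdQ"}
xml_text = (
    "<knora>"
    "<resptr-prop name=':isPartOf'><resptr>book_1</resptr></resptr-prop>"
    "<resptr-prop name=':hasTestThing2'><resptr>test_thing_2_in_same_file</resptr></resptr-prop>"
    "</knora>"
)
print(replace_ids_with_iris(xml_text, mapping))
```

Since this commit allows IDs and IRIs in the same file, the result of such a replacement can be passed to `dsp-tools xmlupload` without any extra flag.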
7 changes: 0 additions & 7 deletions src/dsp_tools/cli.py
@@ -112,12 +112,6 @@ def _make_parser(
parser_upload.add_argument(
"-i", "--imgdir", default=".", help="folder from where the paths in the <bitstream> tags are evaluated"
)
-parser_upload.add_argument(
-"-I",
-"--incremental",
-action="store_true",
-help="The links in the XML file point to IRIs (on the server) instead of IDs (in the same XML file).",
-)
parser_upload.add_argument(
"-V", "--validate-only", action="store_true", help="validate the XML file without uploading it"
)
@@ -466,7 +460,6 @@ def _call_requested_action(args: argparse.Namespace) -> bool:
imgdir=args.imgdir,
sipi=args.sipi_url,
verbose=args.verbose,
-incremental=args.incremental,
save_metrics=args.metrics,
preprocessing_done=False,
)
1 change: 0 additions & 1 deletion src/dsp_tools/fast_xmlupload/upload_xml.py
@@ -118,7 +118,6 @@ def fast_xmlupload(
imgdir=".",
sipi=sipi_url,
verbose=False,
-incremental=False,
save_metrics=False,
preprocessing_done=True,
)
2 changes: 1 addition & 1 deletion src/dsp_tools/models/xmlresource.py
@@ -93,7 +93,7 @@ def get_props_with_links(self) -> list[XMLProperty]:

def get_resptrs(self) -> list[str]:
"""
-Get a list of all resource id's that are referenced by this resource
+Get a list of all resource IDs/IRIs that are referenced by this resource.
Returns:
List of resources identified by their unique id's (as given in the XML)
1 change: 0 additions & 1 deletion src/dsp_tools/utils/rosetta.py
@@ -97,7 +97,6 @@ def _upload_xml(rosetta_folder: Path) -> bool:
imgdir=str(rosetta_folder),
sipi="http://0.0.0.0:1024",
verbose=False,
-incremental=False,
save_metrics=False,
preprocessing_done=False,
)
11 changes: 3 additions & 8 deletions src/dsp_tools/utils/xml_upload.py
@@ -178,6 +178,7 @@ def _remove_circular_references(
while len(resources) > 0 and cnt < 10000:
for resource in resources:
resptrs = resource.get_resptrs()
+resptrs = [x for x in resptrs if not regex.search(r"https?://rdfh.ch/[a-fA-F0-9]{4}/\w{22}", x)]
if len(resptrs) == 0:
ok_resources.append(resource)
ok_res_ids.append(resource.id)
@@ -518,7 +519,6 @@ def xml_upload(
imgdir: str,
sipi: str,
verbose: bool = False,
-incremental: bool = False,
save_metrics: bool = False,
preprocessing_done: bool = False,
) -> bool:
@@ -533,7 +533,6 @@ def xml_upload(
imgdir: the image directory
sipi: the sipi instance to be used
verbose: verbose option for the command, if used more output is given to the user
-incremental: if set, IRIs instead of internal IDs are expected as resource pointers
save_metrics: if true, saves time measurements into a "metrics" folder in the current working directory
preprocessing_done: if set, all multimedia files referenced in the XML file must already be on the server
@@ -607,12 +606,8 @@ def xml_upload(
verbose=verbose,
)

-# temporarily remove circular references, but only if not an incremental upload
-if not incremental:
-resources, stashed_xml_texts, stashed_resptr_props = _remove_circular_references(resources, verbose)
-else:
-stashed_xml_texts = dict()
-stashed_resptr_props = dict()
+# temporarily remove circular references
+resources, stashed_xml_texts, stashed_resptr_props = _remove_circular_references(resources, verbose)

preparation_duration = datetime.now() - preparation_start
preparation_duration_ms = preparation_duration.seconds * 1000 + int(preparation_duration.microseconds / 1000)
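
The line added to `_remove_circular_references` drops every resptr that looks like a DSP IRI, so only internal IDs take part in the circular-reference check, which now runs for every upload. A minimal sketch of that filter follows. It uses the standard-library `re` module rather than the `regex` package from the source, and the helper name `keep_internal_ids` is hypothetical.

```python
import re

# Same pattern as in the diff. The dot in "rdfh.ch" is unescaped there,
# so it matches any single character, not only a literal dot.
IRI_PATTERN = re.compile(r"https?://rdfh.ch/[a-fA-F0-9]{4}/\w{22}")


def keep_internal_ids(resptrs: list[str]) -> list[str]:
    """Return only the resptrs that do not look like DSP resource IRIs (hypothetical helper)."""
    return [x for x in resptrs if not IRI_PATTERN.search(x)]


resptrs = [
    "book_1",
    "http://rdfh.ch/4123/nyOODvYySV2nJ5RWRdmOdQ",
    "test_thing_2_in_same_file",
]
print(keep_internal_ids(resptrs))
# → ['book_1', 'test_thing_2_in_same_file']
```

A resource whose only links are IRIs therefore ends up with an empty list and is treated as free of unresolved references.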
1 change: 0 additions & 1 deletion test/e2e/test_00A1_import_scripts.py
@@ -66,7 +66,6 @@ def test_import_scripts(self) -> None:
imgdir="src/dsp_tools/import_scripts/",
sipi="http://0.0.0.0:1024",
verbose=False,
-incremental=False,
save_metrics=False,
preprocessing_done=False,
)
2 changes: 1 addition & 1 deletion test/e2e/test_cli.py
@@ -212,7 +212,7 @@ def test_xml_upload_incremental(self) -> None:
mapping_file.unlink()

second_xml_file_replaced = get_most_recent_glob_match(self.cwd / f"{second_xml_file_orig.stem}_replaced_*.xml")
-self._make_cli_call(f"dsp-tools xmlupload --incremental -v {second_xml_file_replaced.absolute()}")
+self._make_cli_call(f"dsp-tools xmlupload -v {second_xml_file_replaced.absolute()}")
second_xml_file_replaced.unlink()
self.assertListEqual(list(Path(self.cwd).glob("stashed_*_properties_*.txt")), [])

2 changes: 0 additions & 2 deletions test/e2e/test_xmlupload.py
@@ -25,7 +25,6 @@ def test_xml_upload(self) -> None:
imgdir=self.imgdir,
sipi=self.sipi,
verbose=False,
-incremental=False,
save_metrics=False,
preprocessing_done=False,
)
@@ -43,7 +42,6 @@ def test_xml_upload(self) -> None:
imgdir=self.imgdir,
sipi=self.sipi,
verbose=False,
incremental=False,
save_metrics=False,
preprocessing_done=False,
)
24 changes: 24 additions & 0 deletions testdata/id2iri/test-id2iri-data.xml
@@ -80,4 +80,28 @@
</boolean-prop>
</resource>

+<resource label="test_thing_2_in_same_file" restype=":TestThing2" id="test_thing_2_in_same_file">
+<text-prop name=":hasSimpleText">
+<text encoding="utf8">Text</text>
+</text-prop>
+</resource>
+
+<resource label="links_to_same_file_wont_be_replaced" restype=":TestThing" id="links_to_same_file_wont_be_replaced">
+<text-prop name=":hasSimpleText">
+<text encoding="utf8">Text</text>
+</text-prop>
+<text-prop name=":hasRichtext">
+<text encoding="xml">
+Text with a <a class="salsah-link" href="IRI:no_replacements:IRI">link to no_replacements</a>
+and a <a class="salsah-link" href="IRI:resptr_only:IRI">link to resptr_only</a>
+</text>
+</text-prop>
+<resptr-prop name=":hasTestThing2">
+<resptr>test_thing_2_in_same_file</resptr>
+</resptr-prop>
+<boolean-prop name=":hasBoolean">
+<boolean>true</boolean>
+</boolean-prop>
+</resource>
+
</knora>
24 changes: 24 additions & 0 deletions testdata/id2iri/test-id2iri-output-expected.xml
@@ -75,4 +75,28 @@
</boolean-prop>
</resource>

+<resource label="test_thing_2_in_same_file" restype=":TestThing2" id="test_thing_2_in_same_file">
+<text-prop name=":hasSimpleText">
+<text encoding="utf8">Text</text>
+</text-prop>
+</resource>
+
+<resource label="links_to_same_file_wont_be_replaced" restype=":TestThing" id="links_to_same_file_wont_be_replaced">
+<text-prop name=":hasSimpleText">
+<text encoding="utf8">Text</text>
+</text-prop>
+<text-prop name=":hasRichtext">
+<text encoding="xml">
+Text with a <a class="salsah-link" href="IRI:no_replacements:IRI">link to no_replacements</a>
+and a <a class="salsah-link" href="IRI:resptr_only:IRI">link to resptr_only</a>
+</text>
+</text-prop>
+<resptr-prop name=":hasTestThing2">
+<resptr>test_thing_2_in_same_file</resptr>
+</resptr-prop>
+<boolean-prop name=":hasBoolean">
+<boolean>true</boolean>
+</boolean-prop>
+</resource>
+
</knora>
