Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(id2iri): add flag to remove created resources from XML (DEV-2571) #491

Merged
merged 9 commits into from Aug 31, 2023
3 changes: 3 additions & 0 deletions README.md
Expand Up @@ -31,9 +31,12 @@ To get started quickly, without reading the details, just execute these commands
- `curl -sSL https://install.python-poetry.org | python3 -`
- `poetry self add poetry-exec-plugin`
- `poetry install`
- `poetry shell`
- `pre-commit install`
- `brew install imagemagick ffmpeg`

To learn more about the meaning of these commands, read the remainder of this README.



## Using poetry for dependency management
Expand Down
11 changes: 11 additions & 0 deletions docs/cli-commands.md
Expand Up @@ -265,8 +265,19 @@ by IRIs provided in a mapping file.
dsp-tools id2iri xmlfile.xml mapping.json
```

The following options are available:

- `-r` | `--remove-resources` (optional): remove resources if their ID is in the mapping
(this prevents doubled resources on the DSP server,
because the resources occurring in the mapping already exist on the DSP server)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add here that there can't be further statements with the resource in subject position. Because to me that would not be obvious.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we speak of 2 different things: You talk about triples (subject - predicate - object) in the database, and I talk about resources that can be uploaded (=<resource> tags in the XML file --> a resource like "Iliad Prooem" that can be linked at with a link like https://ark.dasch.swiss/ark:/72163/1/082E/40kW9f9=SzOnQiyvhBNSqw=.20220414T072754555597Z).

In the context of an xmlupload, we don't care about what kind of triples can be in the database. We only care about <resource>s that are uploaded. And if "Iliad Prooem" has been uploaded already, I don't want to upload it a second time.

The output file is written to `[original name]_replaced_[timestamp].xml`.

If the flag `--remove-resources` is set,
all resources of which the ID is in the mapping are removed from the XML file.
This prevents doubled resources on the DSP server,
because normally, the resources occurring in the mapping already exist on the DSP server.

jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
This command cannot be used isolated,
because it is part of a bigger procedure
that is documented [here](./incremental-xmlupload.md).
Expand Down
28 changes: 27 additions & 1 deletion docs/incremental-xmlupload.md
Expand Up @@ -16,11 +16,12 @@ What is this mapping used for?
It can happen that at a later point of time,
additional data is uploaded.
Depending on what kind of references the additional data contains,
there are 3 cases how this can happen:
there are 4 cases how this can happen:

1. no references to existing resources: normal xmlupload
2. references to existing resources via IRIs: incremental xmlupload
3. references to existing resources via internal IDs: first id2iri, then incremental xmlupload
4. continue an interruped xmlupload: first id2iri, then incremental xmlupload



Expand Down Expand Up @@ -85,3 +86,28 @@ dsp-tools xmlupload --incremental additional_data_replaced_[timestamp].xml
| <center>Important</center> |
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| Internal IDs and IRIs cannot be mixed within the same file. An XML file uploaded with the incremental option must not contain any internal IDs. |



## 4. Continue an interruped xmlupload

If an xmlupload didn't finish successfully,
some resources have already been created, but others not.
jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
If the remaining resources have references to created ones,
these references must be made with IRIs,
and the created resources must be removed from the XML file
(otherwise they would be created a second time).

jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
In such a case, proceed as follows:

```bash
dsp-tools xmlupload data.xml
# crash: some resources have been uploaded, and a id2iri_mapping_[timestamp].json file has been written
# fix the reason for the crash

# replace the IDs and remove the created resources with:
dsp-tools id2iri data.xml --remove-resources id2iri_mapping_[timestamp].json

# upload the outputted XML file with
dsp-tools xmlupload data_replaced_[timestamp].xml
```
jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
3 changes: 3 additions & 0 deletions src/dsp_tools/cli.py
Expand Up @@ -223,6 +223,9 @@ def _make_parser(
help="Replace internal IDs of an XML file (resptr tags or salsah-links) by IRIs provided in a mapping file.",
)
parser_id2iri.set_defaults(action="id2iri")
parser_id2iri.add_argument(
"-r", "--remove-resources", action="store_true", help="remove resources if their ID is in the mapping"
)
parser_id2iri.add_argument("xmlfile", help="path to the XML file containing the data to be replaced")
parser_id2iri.add_argument("mapping", help="path to the JSON file containing the mapping of IDs to IRIs")

Expand Down
47 changes: 42 additions & 5 deletions src/dsp_tools/utils/id_to_iri.py
Expand Up @@ -75,7 +75,8 @@ def _replace_resptrs(
Returns:
a tuple of the modified XML tree and the set of the IDs that have been replaced
"""
resptr_elems = tree.xpath("/knora/resource/resptr-prop/resptr")
resptr_xpath = "|".join([f"/knora/{x}/resptr-prop/resptr" for x in ["resource", "annotation", "link", "region"]])
resptr_elems = tree.xpath(resptr_xpath)
resptr_elems_replaced = 0
for resptr_elem in resptr_elems:
value_before = resptr_elem.text
Expand Down Expand Up @@ -106,9 +107,8 @@ def _replace_salsah_links(
Returns:
a tuple of the modified XML tree and the set of the IDs that have been replaced
"""
salsah_links = [
x for x in tree.xpath("/knora/resource/text-prop/text//a") if x.attrib.get("class") == "salsah-link"
]
salsah_xpath = "|".join([f"/knora/{x}/text-prop/text//a" for x in ["resource", "annotation", "link", "region"]])
salsah_links = [x for x in tree.xpath(salsah_xpath) if x.attrib.get("class") == "salsah-link"]
salsah_links_replaced = 0
for salsah_link in salsah_links:
value_before = regex.sub("IRI:|:IRI", "", salsah_link.attrib.get("href", ""))
Expand Down Expand Up @@ -137,7 +137,7 @@ def _replace_ids_by_iris(
mapping: mapping of internal IDs to IRIs

Returns:
modified XML tree
a tuple of the modified XML tree and the success status
"""
success = True
used_mapping_entries: set[str] = set()
Expand All @@ -160,6 +160,36 @@ def _replace_ids_by_iris(
return tree, success


def _remove_resources_if_id_in_mapping(
tree: etree._Element,
mapping: dict[str, str],
) -> tuple[etree._Element, bool]:
"""
Remove all resources from the XML file if their ID is in the mapping.

Args:
tree: parsed XML file
mapping: mapping of internal IDs to IRIs

Returns:
a tuple of the modified XML tree and the success status
"""
success = True
jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
resources = tree.xpath("|".join([f"/knora/{x}" for x in ["resource", "annotation", "link", "region"]]))
resources_to_remove = [x for x in resources if x.attrib.get("id") in mapping]
for resource in resources_to_remove:
jnussbaum marked this conversation as resolved.
Show resolved Hide resolved
resource.getparent().remove(resource)

msg = (
f"Removed {len(resources_to_remove)}/{len(resources)} resources from the XML file, "
"because their ID was in the mapping"
)
logger.info(msg)
print(msg)

return tree, success


def _write_output_file(
orig_xml_file: Path,
tree: etree._Element,
Expand All @@ -182,6 +212,7 @@ def _write_output_file(
def id_to_iri(
xml_file: str,
json_file: str,
remove_resource_if_id_in_mapping: bool = False,
) -> bool:
"""
Replace internal IDs of an XML file
Expand All @@ -193,6 +224,7 @@ def id_to_iri(
Args:
xml_file: the XML file with the data to be replaced
json_file: the JSON file with the mapping (dict) of internal IDs to IRIs
remove_resource_if_id_in_mapping: if True, remove all resources from the XML file if their ID is in the mapping

Raises:
BaseError: if one of the two input files is not a valid file
Expand All @@ -207,5 +239,10 @@ def id_to_iri(
tree=tree,
mapping=mapping,
)
if remove_resource_if_id_in_mapping:
tree, success = _remove_resources_if_id_in_mapping(
tree=tree,
mapping=mapping,
)
_write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
return success
Copy link
Contributor

@gNahcab gNahcab Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns:
      True if everything went well, False otherwise

description of methode 'id_to_iri' is misleading in my opinion. There is no possibility that success returns 'False' if I see correctly 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is part of a overarching design pattern that I often use in DSP-TOOLS: every top-level-function (like xmlupload, create, id2iri, ...) always returns True if successful, and False if not. The file src/dsp_tools/cli.py is designed in this way, so I really need to return a boolean.

Now the question is: Does it make sense to define success = True at the beginning of the function, and then at the end of the function return the unmodified success variable? Or shall I just return True at the end of the function?

Pros and cons of the current code:

  • Con: For the current use case, there's some superfluous code that "inflates" the code. (e.g. the functions have 2 return values, namely the actual one and the success variable)
  • Pro: if in the future, one of the lower-level functions introduces a "unhappy path" (where success=False), everything is already in place, and I don't have to modify the signatures of the functions (e.g. introducing a second return value)

@BalduinLandolt Have you got an opinion on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me too it is unclear. If I have a variable with a bool which you return then I expect there to be a possibility for it to change. In this case I would just return True. If we need to implement a change, creating a new variable is not too difficult.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an issue with exceptions in general: They behave like an additional, implicit and untyped return channel - something we despise in the world of functional programming 🙂

The clean way to solve this is to either return something useful (some aggregation of error messages?) that is consumed by cli.py instead of throwing exceptions; or more pragmatically, remove all boolean return values that can not be false, because they are meaningless, and remove the success variable from cli.py because it can only be true as well and is therefor just as meaningless.
Then the distinction between success and failure would be determined exclusively by wether or not an exception is thrown.

In any case, this is probably out of scope for this PR and should be addressed separately

Copy link
Contributor

@gNahcab gNahcab Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe to avoid a possible bug in the future: just remove success from the methods that get called from within id_to_iri? since the latter could - if changes were made later in the first method that gets called - overwrite the success message of the first one
something like this:

 success = True
 xml_file_as_path, json_file_as_path = _check_input_parameters(xml_file=xml_file, json_file=json_file)
    mapping = _parse_json_file(json_file_as_path)
    tree = parse_xml_file(xml_file_as_path)
    tree = _replace_ids_by_iris(
        tree=tree,
        mapping=mapping,
    )
    if remove_resource_if_id_in_mapping:
        tree = _remove_resources_if_id_in_mapping(
            tree=tree,
            mapping=mapping,
        )
    _write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
    return success

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: cli.py is designed in a way that every function must return a success status. I did that (back then) because for the create and xmlupload commands, there are 3 possibilites:

  • return with success=True and no errors raised: everything perfect
  • return with success=False and no errors raised: some minor abnormalities occurred, but the process could finish nevertheless
  • error raised: a big abnormality occurred, the process could not finish

At least for the xmlupload command, this distinction is important.

This means for id2iri: Currently, it has to return a success status to cli.py. But of course, internally, the many success variables that are never modified are questionable. That's why I have removed them, and only at the very end, I return True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thank you @gNahcab, you noticed that in my previous code, the 2nd function overwrote the success status from the 1st function, which was a bug, of course.