feat(id2iri): add flag to remove created resources from XML (DEV-2571) #491

Merged
merged 9 commits into from Aug 31, 2023
3 changes: 3 additions & 0 deletions README.md
@@ -31,9 +31,12 @@ To get started quickly, without reading the details, just execute these commands
- `curl -sSL https://install.python-poetry.org | python3 -`
- `poetry self add poetry-exec-plugin`
- `poetry install`
- `poetry shell`
- `pre-commit install`
- `brew install imagemagick ffmpeg`

To learn more about the meaning of these commands, read the remainder of this README.



## Using poetry for dependency management
9 changes: 9 additions & 0 deletions docs/cli-commands.md
@@ -265,8 +265,17 @@ by IRIs provided in a mapping file.
dsp-tools id2iri xmlfile.xml mapping.json
```

The following options are available:

- `-r` | `--remove-resources` (optional): remove resources if their ID is in the mapping

Collaborator

Maybe add here that there can't be further statements with the resource in subject position. Because to me that would not be obvious.

Collaborator Author

I think we are talking about two different things: you mean triples (subject - predicate - object) in the database, and I mean resources that can be uploaded (= <resource> tags in the XML file, i.e. a resource like "Iliad Prooem" that can be linked to with a link like https://ark.dasch.swiss/ark:/72163/1/082E/40kW9f9=SzOnQiyvhBNSqw=.20220414T072754555597Z).

In the context of an xmlupload, we don't care about what kind of triples can be in the database. We only care about <resource>s that are uploaded. And if "Iliad Prooem" has been uploaded already, I don't want to upload it a second time.

The output file is written to `[original name]_replaced_[timestamp].xml`.

If the flag `--remove-resources` is set,
all resources whose ID is in the mapping are removed from the XML file.
This prevents duplicate resources on the DSP server,
because the resources listed in the mapping normally already exist on the DSP server.
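
A minimal sketch of what `--remove-resources` amounts to, assuming a namespace-free XML file. It uses the standard library's `xml.etree.ElementTree` rather than the `lxml` calls in this PR, and the sample IDs, the placeholder IRI, and the helper name `remove_mapped_resources` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: res_1 is assumed to exist on the server already.
XML = """<knora>
  <resource id="res_1" label="Already uploaded"/>
  <resource id="res_2" label="Not yet uploaded"/>
</knora>"""
MAPPING = {"res_1": "http://rdfh.ch/0000/placeholder-iri"}

# Tag names taken from the XPath expressions in this PR.
RESOURCE_TAGS = {"resource", "annotation", "link", "region"}


def remove_mapped_resources(root, mapping):
    """Remove direct children of the root whose id is a key of the mapping."""
    removed = 0
    for child in list(root):  # iterate over a copy, because root is modified
        if child.tag in RESOURCE_TAGS and child.get("id") in mapping:
            root.remove(child)
            removed += 1
    return removed


root = ET.fromstring(XML)
removed = remove_mapped_resources(root, MAPPING)
remaining_ids = [child.get("id") for child in root]
print(f"Removed {removed} resource(s); remaining IDs: {remaining_ids}")
```

Only the resource that is not in the mapping stays in the tree, so a later upload would not create `res_1` a second time.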

This command cannot be used in isolation,
because it is part of a bigger procedure
that is documented [here](./incremental-xmlupload.md).
24 changes: 23 additions & 1 deletion docs/incremental-xmlupload.md
@@ -16,11 +16,12 @@ What is this mapping used for?
It can happen that at a later point of time,
additional data is uploaded.
Depending on what kind of references the additional data contains,
there are 3 cases how this can happen:
there are 4 cases how this can happen:

1. no references to existing resources: normal xmlupload
2. references to existing resources via IRIs: incremental xmlupload
3. references to existing resources via internal IDs: first id2iri, then incremental xmlupload
4. continue an interrupted xmlupload: first id2iri, then incremental xmlupload



@@ -85,3 +86,24 @@ dsp-tools xmlupload --incremental additional_data_replaced_[timestamp].xml
| <center>Important</center> |
|-------------------------------------------------------------------------------------------------------------------------------------------------|
| Internal IDs and IRIs cannot be mixed within the same file. An XML file uploaded with the incremental option must not contain any internal IDs. |



## 4. Continue an interrupted xmlupload

If an xmlupload didn't finish successfully,
some resources have already been created, but others have not.
If one of the remaining resources references a created resource by its ID,
this ID must be replaced by the IRI of the created resource.

In addition, the created resources must be removed from the XML file,
otherwise they would be created a second time.

In such a case, proceed as follows:

1. Initial xmlupload: `dsp-tools xmlupload data.xml`
2. A crash happens. Some resources have been uploaded, and an `id2iri_mapping_[timestamp].json` file has been written
3. Fix the reason for the crash
4. Replace the IDs and remove the created resources with:
`dsp-tools id2iri data.xml --remove-resources id2iri_mapping_[timestamp].json`
5. Upload the resulting XML file with `dsp-tools xmlupload data_replaced_[timestamp].xml`
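
The ID replacement in step 4 can be pictured with the following sketch. It assumes a namespace-free XML file and that IDs missing from the mapping are left untouched, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml`. The sample data, the placeholder IRI, and the helper name `replace_resptr_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical situation after a crash: res_1 was created (and is in the mapping),
# res_2 and res_3 were not created yet.
XML = """<knora>
  <resource id="res_2" label="Not yet uploaded">
    <resptr-prop name=":linksTo">
      <resptr>res_1</resptr>
      <resptr>res_3</resptr>
    </resptr-prop>
  </resource>
  <resource id="res_3" label="Also not yet uploaded"/>
</knora>"""
MAPPING = {"res_1": "http://rdfh.ch/0000/placeholder-iri"}


def replace_resptr_ids(root, mapping):
    """Replace the text of <resptr> elements by the mapped IRI, if there is one."""
    used = set()
    for resptr in root.findall("./resource/resptr-prop/resptr"):
        target_id = (resptr.text or "").strip()
        if target_id in mapping:
            resptr.text = mapping[target_id]
            used.add(target_id)
    return used


root = ET.fromstring(XML)
used = replace_resptr_ids(root, MAPPING)
targets = [r.text for r in root.iter("resptr")]
print(f"Used {len(used)}/{len(MAPPING)} entries from the mapping")
print(targets)
```

The reference to the already created `res_1` becomes an IRI, while the reference to `res_3` stays an internal ID, because `res_3` is still part of the file and will be created by the next upload.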
3 changes: 3 additions & 0 deletions src/dsp_tools/cli.py
@@ -223,6 +223,9 @@ def _make_parser(
help="Replace internal IDs of an XML file (resptr tags or salsah-links) by IRIs provided in a mapping file.",
)
parser_id2iri.set_defaults(action="id2iri")
parser_id2iri.add_argument(
"-r", "--remove-resources", action="store_true", help="remove resources if their ID is in the mapping"
)
parser_id2iri.add_argument("xmlfile", help="path to the XML file containing the data to be replaced")
parser_id2iri.add_argument("mapping", help="path to the JSON file containing the mapping of IDs to IRIs")
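
As a rough illustration of how the new flag behaves, here is a stripped-down parser. It is a sketch, not the real `_make_parser`: the subparser setup is an assumption, and only the three `id2iri` arguments from this hunk are reproduced.

```python
import argparse

# Simplified stand-in for the dsp-tools parser (assumed structure).
parser = argparse.ArgumentParser(prog="dsp-tools")
subparsers = parser.add_subparsers(dest="action")
parser_id2iri = subparsers.add_parser("id2iri")
parser_id2iri.add_argument(
    "-r", "--remove-resources", action="store_true", help="remove resources if their ID is in the mapping"
)
parser_id2iri.add_argument("xmlfile", help="path to the XML file containing the data to be replaced")
parser_id2iri.add_argument("mapping", help="path to the JSON file containing the mapping of IDs to IRIs")

# The flag may stand between the two positional arguments, as in the docs example.
with_flag = parser.parse_args(["id2iri", "data.xml", "--remove-resources", "mapping.json"])
without_flag = parser.parse_args(["id2iri", "data.xml", "mapping.json"])
print(with_flag.remove_resources, without_flag.remove_resources)
```

Because the argument uses `action="store_true"`, the attribute `remove_resources` defaults to `False` when the flag is omitted, which matches the default of the new `id_to_iri` parameter.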

57 changes: 46 additions & 11 deletions src/dsp_tools/utils/id_to_iri.py
@@ -74,7 +74,8 @@ def _replace_resptrs(
Returns:
a tuple of the modified XML tree and the set of the IDs that have been replaced
"""
resptr_elems = tree.xpath("/knora/resource/resptr-prop/resptr")
resptr_xpath = "|".join([f"/knora/{x}/resptr-prop/resptr" for x in ["resource", "annotation", "link", "region"]])
resptr_elems = tree.xpath(resptr_xpath)
resptr_elems_replaced = 0
for resptr_elem in resptr_elems:
value_before = resptr_elem.text
@@ -105,9 +106,8 @@ def _replace_salsah_links(
Returns:
a tuple of the modified XML tree and the set of the IDs that have been replaced
"""
salsah_links = [
x for x in tree.xpath("/knora/resource/text-prop/text//a") if x.attrib.get("class") == "salsah-link"
]
salsah_xpath = "|".join([f"/knora/{x}/text-prop/text//a" for x in ["resource", "annotation", "link", "region"]])
salsah_links = [x for x in tree.xpath(salsah_xpath) if x.attrib.get("class") == "salsah-link"]
salsah_links_replaced = 0
for salsah_link in salsah_links:
value_before = regex.sub("IRI:|:IRI", "", salsah_link.attrib.get("href", ""))
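
The `href` handling on the line above can be tried out in isolation. This sketch uses the standard library's `re` instead of the third-party `regex` module, and it assumes that an `href` whose ID is not in the mapping is left unchanged, which the visible part of the hunk does not show. The helper name `resolve_href` and the sample values are hypothetical.

```python
import re

# Hypothetical mapping of one internal ID to a placeholder IRI.
MAPPING = {"res_1": "http://rdfh.ch/0000/placeholder-iri"}


def resolve_href(href, mapping):
    """Strip the IRI: / :IRI markers and return the mapped IRI, or the href unchanged."""
    internal_id = re.sub("IRI:|:IRI", "", href)
    return mapping.get(internal_id, href)


print(resolve_href("IRI:res_1:IRI", MAPPING))
print(resolve_href("IRI:res_9:IRI", MAPPING))
```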
@@ -125,7 +125,7 @@ def _replace_salsah_links(
def _replace_ids_by_iris(
tree: etree._Element,
mapping: dict[str, str],
) -> tuple[etree._Element, bool]:
) -> etree._Element:
"""
Iterate over the <resptr> tags and the salsah-links of the <text> tags,
and replace the internal IDs by IRIs.
@@ -136,9 +136,8 @@ def _replace_ids_by_iris(
mapping: mapping of internal IDs to IRIs

Returns:
modified XML tree
the modified XML tree
"""
success = True
used_mapping_entries: set[str] = set()

tree, used_mapping_entries = _replace_resptrs(
@@ -156,7 +155,36 @@ def _replace_ids_by_iris(
logger.info(f"Used {len(used_mapping_entries)}/{len(mapping)} entries from the mapping file")
print(f"Used {len(used_mapping_entries)}/{len(mapping)} entries from the mapping file")

return tree, success
return tree


def _remove_resources_if_id_in_mapping(
tree: etree._Element,
mapping: dict[str, str],
) -> etree._Element:
"""
Remove all resources from the XML file if their ID is in the mapping.

Args:
tree: parsed XML file
mapping: mapping of internal IDs to IRIs

Returns:
the modified XML tree
"""
resources = tree.xpath("|".join([f"/knora/{x}" for x in ["resource", "annotation", "link", "region"]]))
resources_to_remove = [x for x in resources if x.attrib.get("id") in mapping]
for resource in resources_to_remove:
resource.getparent().remove(resource)

msg = (
f"Removed {len(resources_to_remove)}/{len(resources)} resources from the XML file, "
"because their ID was in the mapping"
)
logger.info(msg)
print(msg)

return tree


def _write_output_file(
@@ -181,6 +209,7 @@ def _write_output_file(
def id_to_iri(
xml_file: str,
json_file: str,
remove_resource_if_id_in_mapping: bool = False,
) -> bool:
"""
Replace internal IDs of an XML file
@@ -192,19 +221,25 @@ def id_to_iri(
Args:
xml_file: the XML file with the data to be replaced
json_file: the JSON file with the mapping (dict) of internal IDs to IRIs
remove_resource_if_id_in_mapping: if True, remove all resources from the XML file if their ID is in the mapping

Raises:
BaseError: if one of the two input files is not a valid file

Returns:
True if everything went well, False otherwise
success status
"""
xml_file_as_path, json_file_as_path = _check_input_parameters(xml_file=xml_file, json_file=json_file)
mapping = _parse_json_file(json_file_as_path)
tree = parse_xml_file(xml_file_as_path)
tree, success = _replace_ids_by_iris(
tree = _replace_ids_by_iris(
tree=tree,
mapping=mapping,
)
if remove_resource_if_id_in_mapping:
tree = _remove_resources_if_id_in_mapping(
tree=tree,
mapping=mapping,
)
_write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
return success
return True