Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(id2iri): add flag to remove created resources from XML (DEV-2571) #491

Merged
merged 9 commits into from Aug 31, 2023

Conversation

jnussbaum
Copy link
Collaborator

No description provided.

@jnussbaum jnussbaum self-assigned this Aug 28, 2023
@linear
Copy link

linear bot commented Aug 28, 2023

DEV-2571 id2iri should remove created resources from XML

Currently, the id2iri command replaces IDs in resptr links by IRIs. If the use case is a later data delivery, this is enough.

But if the use case is to continue a crashed xmlupload, this is not enough: The resource definitions that have an ID that is found in the id2iri mapping must be deleted, because this resource shouldn't be created a 2nd time.

tree, success = _remove_resources_if_id_in_mapping(
tree=tree,
mapping=mapping,
)
_write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
return success
Copy link
Contributor

@gNahcab gNahcab Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns:
      True if everything went well, False otherwise

description of methode 'id_to_iri' is misleading in my opinion. There is no possibility that success returns 'False' if I see correctly 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is part of a overarching design pattern that I often use in DSP-TOOLS: every top-level-function (like xmlupload, create, id2iri, ...) always returns True if successful, and False if not. The file src/dsp_tools/cli.py is designed in this way, so I really need to return a boolean.

Now the question is: Does it make sense to define success = True at the beginning of the function, and then at the end of the function return the unmodified success variable? Or shall I just return True at the end of the function?

Pros and cons of the current code:

  • Con: For the current use case, there's some superfluous code that "inflates" the code. (e.g. the functions have 2 return values, namely the actual one and the success variable)
  • Pro: if in the future, one of the lower-level functions introduces a "unhappy path" (where success=False), everything is already in place, and I don't have to modify the signatures of the functions (e.g. introducing a second return value)

@BalduinLandolt Have you got an opinion on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me too it is unclear. If I have a variable with a bool which you return then I expect there to be a possibility for it to change. In this case I would just return True. If we need to implement a change, creating a new variable is not too difficult.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an issue with exceptions in general: They behave like an additional, implicit and untyped return channel - something we despise in the world of functional programming 🙂

The clean way to solve this is to either return something useful (some aggregation of error messages?) that is consumed by cli.py instead of throwing exceptions; or more pragmatically, remove all boolean return values that can not be false, because they are meaningless, and remove the success variable from cli.py because it can only be true as well and is therefor just as meaningless.
Then the distinction between success and failure would be determined exclusively by wether or not an exception is thrown.

In any case, this is probably out of scope for this PR and should be addressed separately

Copy link
Contributor

@gNahcab gNahcab Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe to avoid a possible bug in the future: just remove success from the methods that get called from within id_to_iri? since the latter could - if changes were made later in the first method that gets called - overwrite the success message of the first one
something like this:

 success = True
 xml_file_as_path, json_file_as_path = _check_input_parameters(xml_file=xml_file, json_file=json_file)
    mapping = _parse_json_file(json_file_as_path)
    tree = parse_xml_file(xml_file_as_path)
    tree = _replace_ids_by_iris(
        tree=tree,
        mapping=mapping,
    )
    if remove_resource_if_id_in_mapping:
        tree = _remove_resources_if_id_in_mapping(
            tree=tree,
            mapping=mapping,
        )
    _write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
    return success

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: cli.py is designed in a way that every function must return a success status. I did that (back then) because for the create and xmlupload commands, there are 3 possibilites:

  • return with success=True and no errors raised: everything perfect
  • return with success=False and no errors raised: some minor abnormalities occurred, but the process could finish nevertheless
  • error raised: a big abnormality occurred, the process could not finish

At least for the xmlupload command, this distinction is important.

This means for id2iri: Currently, it has to return a success status to cli.py. But of course, internally, the many success variables that are never modified are questionable. That's why I have removed them, and only at the very end, I return True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thank you @gNahcab, you noticed that in my previous code, the 2nd function overwrote the success status from the 1st function, which was a bug, of course.

- `-r` | `--remove-resources` (optional): remove resources if their ID is in the mapping
(this prevents doubled resources on the DSP server,
because the resources occurring in the mapping already exist on the DSP server)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add here that there can't be further statements with the resource in subject position. Because to me that would not be obvious.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we speak of 2 different things: You talk about triples (subject - predicate - object) in the database, and I talk about resources that can be uploaded (=<resource> tags in the XML file --> a resource like "Iliad Prooem" that can be linked at with a link like https://ark.dasch.swiss/ark:/72163/1/082E/40kW9f9=SzOnQiyvhBNSqw=.20220414T072754555597Z).

In the context of an xmlupload, we don't care about what kind of triples can be in the database. We only care about <resource>s that are uploaded. And if "Iliad Prooem" has been uploaded already, I don't want to upload it a second time.

docs/cli-commands.md Show resolved Hide resolved
docs/incremental-xmlupload.md Outdated Show resolved Hide resolved
docs/incremental-xmlupload.md Show resolved Hide resolved
docs/incremental-xmlupload.md Outdated Show resolved Hide resolved
src/dsp_tools/utils/id_to_iri.py Show resolved Hide resolved
src/dsp_tools/utils/id_to_iri.py Outdated Show resolved Hide resolved
tree, success = _remove_resources_if_id_in_mapping(
tree=tree,
mapping=mapping,
)
_write_output_file(orig_xml_file=xml_file_as_path, tree=tree)
return success
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me too it is unclear. If I have a variable with a bool which you return then I expect there to be a possibility for it to change. In this case I would just return True. If we need to implement a change, creating a new variable is not too difficult.

@jnussbaum jnussbaum merged commit bf25cb7 into main Aug 31, 2023
4 checks passed
@jnussbaum jnussbaum deleted the wip/dev-2571-remove-resources branch August 31, 2023 09:12
@daschbot daschbot mentioned this pull request Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants