Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attaching files with spaces in the file name causes a ValueError when writing the AASX #236

Open
WelliSolutions opened this issue Feb 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@WelliSolutions
Copy link

I was adding a file to an AASX which contained spaces. When writing the AASX, I got the following traceback:

Traceback (most recent call last):
[...]
  File "[...]\venv\Lib\site-packages\basyx\aas\adapter\aasx.py", line 415, in write_aas
    self.write_all_aas_objects("/aasx/data.{}".format("json" if write_json else "xml"),
  File "[...]\venv\Lib\site-packages\basyx\aas\adapter\aasx.py", line 552, in write_all_aas_objects
    with self.writer.open_part(file_name, content_type) as p:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]\venv\Lib\site-packages\pyecma376_2\package_model.py", line 366, in open_part
    check_part_name(name)
  File "[...]\venv\Lib\site-packages\pyecma376_2\package_model.py", line 658, in check_part_name
    raise ValueError("{} is not an URI path with multiple segments (each not empty and not starting with '.') "
ValueError: '/aasx/attachments/file with space.txt' is not an URI path with multiple segments (each not empty and not starting with '.') or not starting with '/' or ending wit '/'

Note that the error message is a bit misleading. It's not because of the / and not because of the .. I double checked and it's the space that causes the problem.

My expectation is that the Basyx library handles the file name at the level of DictSupplementaryFileContainer and the add_file() method. As written in the tutorial, the method shall return the actual file name. If spaces are not allowed, please return a file name without spaces.

@WelliSolutions
Copy link
Author

More details:
In package_model.py (of pyecma376_2), the check is made:

def check_part_name(part_name: str) -> None:
    """ Check if `part_name` is a valid OPC part name in URI (not IRI) representation.

    :raises ValueError: if it is not.
    """
    if not RE_PART_NAME.match(part_name):
        raise ValueError("{} is not an URI path with multiple segments (each not empty and not starting with '.') "
                         "or not starting with '/' or ending wit '/'".format(repr(part_name)))

And RE_PART_NAME does not allow spaces:

RE_PART_NAME = re.compile(r'^(/[A-Za-z0-9\-\._~%:@!$&\'()*+,;=]*[A-Za-z0-9\-_~%:@!$&\'()*+,;=])+$')

Let's first conclude that \. and \' are equivalent to . and ', because it's inside a bracket expression and those don't need to be escaped there.

The first bracket expression contains the ., the second doesn't. Also, the first bracket expression is optional, the second isn't. So, to me, the Regex is checking whether something does not end with a dot, rather than "not starting with '.'" as mentioned in the error message. So, at least this part may be a bug in the pyecma library (either the error message or the Regex).

Other than that, I don't have enough knowledge to judge whether ECMA 376 allows spaces in file names or not.

@WelliSolutions
Copy link
Author

OPC Specification IMHO is here: https://standards.iso.org/ittf/PubliclyAvailableStandards/c077818_ISO_IEC_29500-2_2021(E).zip
(download link suggested by Wikipedia does not work).

The relevant chapter seems to be 6.2.2 Part Names. And 6.2.2.2 Syntax specifies:

No I18N segments shall end with a dot (“.”) character.

So this is really about ending with a dot.

Furthermore:

[...] where an I18N segment is a Unicode string that
matches the non-terminal isegment-nz and percent-encoding represents a character by the percent
character "%" followed by two hexadecimal digits, as specified in RFC 3986

So it seems to me that the Basyx LIbrar should percent-encode the space to %20.

@jkhsjdhjs
Copy link
Contributor

I agree, perhaps we should already call check_part_name() within DictSupplementaryFileContainer.add_file(), and if that raises an exception, percent encode the part name? Or just always percent encode?

@jkhsjdhjs
Copy link
Contributor

Alright, I did some digging:

When writing supplementary files, the AASXWriter does the following:

logger.debug("Writing supplementary file {} to AASX package ...".format(file_name))
with self.writer.open_part(file_name, content_type) as p:
file_store.write_file(file_name, p)
supplementary_file_names.append(pyecma376_2.package_model.normalize_part_name(file_name))
self._supplementary_part_names[file_name] = hash

For each supplementary file, it calls OPCPackageReader.open_part() with the filename supplied by the user, the way the file was added to the SupplementaryFileContainer. This determines the name the file will have in the AASX file.
Afterwards, it adds the normalized filename to the list supplementary_file_names, which will later be used to write the relationship files.

Normalization percent-encodes the string and converts it to lower case:
https://github.com/rwth-iat/PyECMA376-2/blob/e7c28ca9c9fbebfae3d3ab9a8abc104aa57f7913/pyecma376_2/package_model.py#L642-L645

Thus, it seems to me, that only the relationship files are expected to contain normalized, percent-encoded URIs, and the filenames are actually not expected to follow this format. If the filenames would be percent-encoded and converted to lower case, it would also be impossible to determine the original filename before packaging, and the assertion at the end of tutorial_aasx.py would never be true:

assert actual_file_name in new_file_store

Yet, OPCPackageWriter.open_part() and OPCPackageWriter.create_fragmented_part() both call check_part_name() on the passed filename, which validates the filename against the regular expression for part URIs.

So, I had a look at the specification you linked and it doesn't confirm my thoughts (well, partially):

7.3.4 Mapping logical item names to ZIP item names
For each logical item, the process of mapping of logical item names to ZIP item names shall involve the following steps, in order:

  1. Remove the leading forward slash (“/”) from the logical item name or, in the case of interleaved parts, from each of the logical item names within the complete sequence.
    Remove the leading forward slash (“/”) from the logical item name or, parts, from each of the logical item names within the complete sequence.
  2. Percent-encode every item non-ASCII character.

Validating the filenames via check_part_name() is wrong, as spaces are ASCII characters and thus allowed in filenames.
However, non-ASCII characters still need to be percent-encoded. Thus, the correct way would be to replace the call to check_part_name() by a call to a function that percent-encodes the passed filename, or at least to change the current validation.

@s-heppner
Copy link
Contributor

Did I understand correctly, that this is a bug in the check_part_name() function of the pyecma376_2 library?

@s-heppner s-heppner added the bug Something isn't working label Feb 18, 2024
@jkhsjdhjs
Copy link
Contributor

jkhsjdhjs commented Feb 18, 2024

I think the check_part_name() function is correct, what's incorrect is that this function is used to validate filenames for the zip package. IMO, a function to correctly percent-encode filenames for AASX packages should be implemented in pyecma376_2.

pyecma376_2 already has a function to percent-encode and convert to lower-case part names. However, the way I understand it is that this percent-encoding for part-names is different from the percent-encoding for filenames, as it

  1. converts the entire string to lower-case and
  2. also encodes characters that would be allowed in filenames, such as spaces.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants