
Ignore duplicate input archive entries #119

Closed
scottmarlow opened this issue Nov 2, 2020 · 8 comments
Labels: enhancement (New feature or request)

Comments

@scottmarlow
Contributor

https://issues.redhat.com/browse/WFLY-14014 is for a Jakarta EE 9 Platform TCK test archive that contains duplicate classes. I'm thinking that we could catch, log, and ignore duplicate classes instead of failing with java.util.zip.ZipException: duplicate entry: com/sun/ts/tests/servlet/api/jakarta_servlet/singlethreadmodel/STMClientServlet$ThreadClient$TestThread.class

@scottmarlow
Contributor Author

Or we could track already-added class names in a HashSet; I will create a PR for that.
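
A minimal sketch of that idea, assuming a plain copy (the class and method names here are hypothetical, and a real transformer would rewrite each entry's bytes rather than copy them verbatim):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DuplicateSkippingCopy {

	// Copies entries from 'in' to 'out', skipping any entry whose name has
	// already been written. ZipOutputStream would otherwise throw
	// "java.util.zip.ZipException: duplicate entry: ..." on the second occurrence.
	public static void copySkippingDuplicates(InputStream in, OutputStream out) throws IOException {
		Set<String> seenNames = new HashSet<>();
		try (ZipInputStream zin = new ZipInputStream(in);
				ZipOutputStream zout = new ZipOutputStream(out)) {
			ZipEntry entry;
			while ((entry = zin.getNextEntry()) != null) {
				if (!seenNames.add(entry.getName())) {
					// Later occurrence of a name we already wrote: log and ignore.
					System.err.println("Ignoring duplicate entry: " + entry.getName());
					continue;
				}
				zout.putNextEntry(new ZipEntry(entry.getName()));
				zin.transferTo(zout);
				zout.closeEntry();
			}
		}
	}
}
```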

@bjhargrave
Member

This seems like garbage-in, garbage-out. That is, the input jar file is malformed. It should not have duplicate entries. It seems the best solution is to fix the input jar so that it is not malformed.

I don't think it is valid for the transformer to ignore duplicate entries in the input. Which of the duplicate entries is the correct entry? First, last, middle?

The proper solution is to fix the malformed Jakarta EE 9 Platform TCK test archive.

@scottmarlow
Contributor Author

> This seems like garbage-in, garbage-out. That is, the input jar file is malformed. It should not have duplicate entries. It seems the best solution is to fix the input jar so that it is not malformed.
>
> I don't think it is valid for the transformer to ignore duplicate entries in the input. Which of the duplicate entries is the correct entry? First, last, middle?

I agree that ignoring hides the case where the entries are not true duplicates; however, this has been an under-specified area of the Jakarta EE requirements for a very long time. There also seems to be a mismatch between what jar files allow and what java.util.zip.ZipOutputStream allows, which in turn may differ from what a Jakarta EE implementation's classloader does.

> The proper solution is to fix the malformed Jakarta EE 9 Platform TCK test archive.

It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes; however, that is only my current use case.

If someone is trying to transform a large Jakarta EE 8 application to EE 9, they may not have the ability to update their application archive (or may not want to).

https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.

@bjhargrave
Member

While the Zip File specification may allow duplicate entries, I don't think there is any support for this in the Java JAR File specification. Individual sections of the manifest do not permit duplicate entries.

> It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes

Perhaps, but it seems fairly easy to delete the duplicate entry from the archive.

Also, I would have assumed the Jakarta EE 9 TCK would be using the jakarta.* package names already, no? So why would this archive need transformation?

> https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.

This just silently ignores later duplicate entries. If the solution is to ignore later duplicate entries, wouldn't it be better to elide the duplicate entries up front before transformation to avoid other potential failures due to errors in transforming the later duplicate entries that will end up being ignored?

I would think we should have an option to enable this ignoring so that users must opt-in to this behavior. That is, the default behavior should be the current behavior where duplicate entries are a failure.

@scottmarlow
Contributor Author

> While the Zip File specification may allow duplicate entries, I don't think there is any support for this in the Java JAR File specification. Individual sections of the manifest do not permit duplicate entries.
>
> > It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes
>
> Perhaps, but it seems fairly easy to delete the duplicate entry from the archive.

Yes, that is definitely one of the workarounds/solutions.

> Also, I would have assumed the Jakarta EE 9 TCK would be using the jakarta.* package names already, no? So why would this archive need transformation?

I agree that transforming an (already) EE 9 application to EE 9 is a no-op.

> > https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.
>
> This just silently ignores later duplicate entries. If the solution is to ignore later duplicate entries, wouldn't it be better to elide the duplicate entries up front before transformation to avoid other potential failures due to errors in transforming the later duplicate entries that will end up being ignored?

Yes, I agree.

Perhaps they could first run their application through a filter that eliminates duplicate classes.

I think it is good that we are discussing this case here, as many users will hit it and wonder whether they are hitting a bug or whether there is really something wrong with their application.

The current error message is java.util.zip.ZipException: duplicate entry: com/sun/ts/tests/servlet/api/jakarta_servlet/singlethreadmodel/STMClientServlet$ThreadClient$TestThread.class, which could probably be improved to include the archive name and possibly the top-level archive name as well.

> I would think we should have an option to enable this ignoring so that users must opt-in to this behavior. That is, the default behavior should be the current behavior where duplicate entries are a failure.

👍

@scottmarlow
Contributor Author

scottmarlow commented Nov 3, 2020

Updated the branch to fail by default unless the system property "FAIL_DUPLICATE_ARCHIVE_ENTRIES" is true.
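
A sketch of what such a check might look like; the property name comes from the comment above, while the helper class and the exact semantics in the linked branch are assumptions:

```java
import java.util.Set;
import java.util.zip.ZipException;

// Hypothetical helper: decides whether an archive entry should be written,
// based on whether its name was already seen and on the system property above.
final class DuplicateEntryCheck {
	// Per the comment above: duplicates are a failure by default; setting the
	// property to "true" switches to ignoring them.
	private static final boolean FAIL_ON_DUPLICATES =
		!Boolean.getBoolean("FAIL_DUPLICATE_ARCHIVE_ENTRIES");

	// Returns true for the first occurrence of a name; on a later occurrence,
	// either throws (fail mode) or logs and returns false (ignore mode).
	static boolean shouldWrite(Set<String> seenNames, String entryName) throws ZipException {
		if (seenNames.add(entryName)) {
			return true;
		}
		if (FAIL_ON_DUPLICATES) {
			throw new ZipException("duplicate entry: " + entryName);
		}
		System.err.println("Ignoring duplicate entry: " + entryName);
		return false;
	}
}
```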

@tbitonti changed the title from "Ignore duplicate classes in the input stream" to "Ignore duplicate input archive entries" on Aug 17, 2021
@tbitonti added the enhancement (New feature or request) label on Aug 17, 2021
@tbitonti
Contributor

Duplicated by #156. I'm cancelling that issue and have modified this one to cover any duplication within an archive; the problem is not specific to class entries.

My tendency is to add an option that controls the behavior, for example (sketched below):

-duplicatePolicy ignore | warning | error
-duplicateSelection first | last

My preference is either warning or error, plus last; warning plus last seems to match the most typical behavior of other tools. (I don't like those particular option names.)
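
A hypothetical sketch of those two options as enums, under the placeholder names above; nothing here reflects an actual transformer API:

```java
// Placeholder names, mirroring the proposed options above.
enum DuplicatePolicy { IGNORE, WARNING, ERROR }

enum DuplicateSelection { FIRST, LAST }

final class DuplicateOptions {
	// Suggested defaults from the comment above: warn, and keep the last entry.
	static final DuplicatePolicy DEFAULT_POLICY = DuplicatePolicy.WARNING;
	static final DuplicateSelection DEFAULT_SELECTION = DuplicateSelection.LAST;
}
```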

I agree that the best fix is to correct the original archive to remove the duplicate entry. I understand, however, that this is not always easy to make happen.

My understanding is that the error stems from how various archive builders work: they stream out archive entries while building a list of entries to write to the central directory (the table at the end of the archive), and many tools do not check for duplicates.

Within the archive format, the entries themselves tolerate duplication: each entry is a more-or-less self-contained region of the archive with sufficient data to be extracted on its own. The central directory can also tolerate duplication.

However, in either case there are problems. Streaming through the entries, which is usually stateless except for the current entry cursor, causes the first duplicate to be overwritten by a later one during extraction, often producing an unexpected file-overwrite warning. A random access API (for example, WinZip) reads the central directory and usually builds a table that keeps only the last entry (for example, using a sorted dictionary data structure).
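
A small inspection sketch illustrating the two access styles; the archive path and entry name are hypothetical:

```java
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

public class DuplicateInspection {
	public static void main(String[] args) throws Exception {
		File archive = new File("application.war"); // hypothetical archive

		// Streaming read: walks the local entries in order, so every duplicate
		// occurrence is visited.
		try (ZipInputStream zin = new ZipInputStream(new FileInputStream(archive))) {
			ZipEntry entry;
			while ((entry = zin.getNextEntry()) != null) {
				System.out.println("stream sees: " + entry.getName());
			}
		}

		// Random-access read: a name lookup goes through the central directory
		// and returns a single entry; which duplicate it resolves to is
		// implementation dependent.
		try (ZipFile zip = new ZipFile(archive)) {
			ZipEntry entry = zip.getEntry("com/example/Duplicated.class"); // hypothetical name
			System.out.println("lookup returns: " + (entry == null ? "none" : entry.getName()));
		}
	}
}
```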

@bjhargrave
Member

Java's ZipOutputStream implementation does not tolerate duplicate entries, and it seems to me that a zip with duplicate entries is not well formed. What do duplicate entries even mean? A simple unzip would have the last duplicate overwriting the earlier duplicates. But in transforming a ZipInputStream into a ZipOutputStream, we would have to let the first duplicate entry take precedence, as we could not reverse the stream to replace an earlier duplicate entry.
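
For illustration, a minimal demonstration of that ZipOutputStream behavior (the entry name is invented):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

public class DuplicateEntryDemo {
	public static void main(String[] args) throws Exception {
		try (ZipOutputStream zout = new ZipOutputStream(new ByteArrayOutputStream())) {
			zout.putNextEntry(new ZipEntry("com/example/Dup.class"));
			zout.closeEntry();
			try {
				// A second entry with the same name is rejected immediately.
				zout.putNextEntry(new ZipEntry("com/example/Dup.class"));
			} catch (ZipException expected) {
				System.out.println(expected.getMessage()); // duplicate entry: com/example/Dup.class
			}
		}
	}
}
```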
