
Ignore duplicate input archive entries #119

Closed
scottmarlow opened this issue Nov 2, 2020 · 8 comments
Labels: enhancement (New feature or request)

Comments

@scottmarlow
Contributor

https://issues.redhat.com/browse/WFLY-14014 is for a Jakarta EE 9 Platform TCK test archive that contains duplicate classes. I'm thinking that we could catch, log, and ignore duplicate classes instead of failing with java.util.zip.ZipException: duplicate entry: com/sun/ts/tests/servlet/api/jakarta_servlet/singlethreadmodel/STMClientServlet$ThreadClient$TestThread.class

@scottmarlow
Contributor Author

Or we could track already-added class names in a HashSet; I will create a PR for that.
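
A minimal sketch of that idea, assuming a plain copy (the class and method names here are hypothetical, and a real transformer would rewrite each entry's bytes rather than copy them verbatim):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DuplicateSkippingCopy {

	// Copies entries from 'in' to 'out', skipping any entry whose name has
	// already been written. ZipOutputStream would otherwise throw
	// "java.util.zip.ZipException: duplicate entry: ..." on the second occurrence.
	public static void copySkippingDuplicates(InputStream in, OutputStream out) throws IOException {
		Set<String> seenNames = new HashSet<>();
		try (ZipInputStream zin = new ZipInputStream(in);
				ZipOutputStream zout = new ZipOutputStream(out)) {
			ZipEntry entry;
			while ((entry = zin.getNextEntry()) != null) {
				if (!seenNames.add(entry.getName())) {
					// Later occurrence of a name we already wrote: log and ignore.
					System.err.println("Ignoring duplicate entry: " + entry.getName());
					continue;
				}
				zout.putNextEntry(new ZipEntry(entry.getName()));
				zin.transferTo(zout);
				zout.closeEntry();
			}
		}
	}
}
```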

@bjhargrave
Member

This seems like garbage-in, garbage-out. That is, the input jar file is malformed. It should not have duplicate entries. It seems the best solution is to fix the input jar so that it is not malformed.

I don't think it is valid for the transformer to ignore duplicate entries in the input. Which of the duplicate entries is the correct entry? First, last, middle?

The proper solution is to fix the malformed Jakarta EE 9 Platform TCK test archive.

@scottmarlow
Contributor Author

> This seems like garbage-in, garbage-out. That is, the input jar file is malformed. It should not have duplicate entries. It seems the best solution is to fix the input jar so that it is not malformed.
>
> I don't think it is valid for the transformer to ignore duplicate entries in the input. Which of the duplicate entries is the correct entry? First, last, middle?

I agree that ignoring hides the case where the entries are not true duplicates; however, this has been an under-specified area of the Jakarta EE requirements for a very long time. There also seems to be a mismatch between what jar files allow and what java.util.zip.ZipOutputStream allows, which in turn may differ from what a Jakarta EE implementation's classloader does.

> The proper solution is to fix the malformed Jakarta EE 9 Platform TCK test archive.

It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes; however, that is only my current use case.

If someone is trying to transform a large Jakarta EE 8 application to EE 9, they may not have the ability to update their application archive (or may not want to).

https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.

@bjhargrave
Member

While the Zip File specification may allow duplicate entries, I don't think there is any support for this in the Java JAR File specification. Individual sections of the manifest do not permit duplicate entries.

> It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes

Perhaps, but it seems fairly easy to delete the duplicate entry from the archive.

Also, I would have assumed the Jakarta EE 9 TCK would be using the jakarta.* package names already, no? So why would this archive need transformation?

> https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.

This just silently ignores later duplicate entries. If the solution is to ignore later duplicate entries, wouldn't it be better to elide the duplicate entries up front before transformation to avoid other potential failures due to errors in transforming the later duplicate entries that will end up being ignored?

I would think we should have an option to enable this ignoring so that users must opt-in to this behavior. That is, the default behavior should be the current behavior where duplicate entries are a failure.

@scottmarlow
Contributor Author

> While the Zip File specification may allow duplicate entries, I don't think there is any support for this in the Java JAR File specification. Individual sections of the manifest do not permit duplicate entries.
>
> > It is too late in the Jakarta EE 9 Platform TCK schedule to make those changes
>
> Perhaps, but it seems fairly easy to delete the duplicate entry from the archive.

Yes, that is definitely one of the workarounds/solutions.

> Also, I would have assumed the Jakarta EE 9 TCK would be using the jakarta.* package names already, no? So why would this archive need transformation?

I agree that transforming an (already) EE 9 application to EE 9 is a no-op.

> > https://github.com/scottmarlow/transformer-1/tree/issues_119_ignoreduplicate is the potential workaround.
>
> This just silently ignores later duplicate entries. If the solution is to ignore later duplicate entries, wouldn't it be better to elide the duplicate entries up front before transformation to avoid other potential failures due to errors in transforming the later duplicate entries that will end up being ignored?

Yes, I agree.

Perhaps they could first run their application through a filter that eliminates duplicate classes.

I think it is good that we are discussing this case here, as many users will hit it and wonder whether they are hitting a bug or whether there is really something wrong with their application.

The current error message is java.util.zip.ZipException: duplicate entry: com/sun/ts/tests/servlet/api/jakarta_servlet/singlethreadmodel/STMClientServlet$ThreadClient$TestThread.class, which could probably be improved to include the archive name and possibly the top-level archive name as well.

> I would think we should have an option to enable this ignoring so that users must opt-in to this behavior. That is, the default behavior should be the current behavior where duplicate entries are a failure.

👍

@scottmarlow
Contributor Author

scottmarlow commented Nov 3, 2020

Updated the branch to fail by default unless the system property "FAIL_DUPLICATE_ARCHIVE_ENTRIES" is true.
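
A sketch of what such a check might look like; the property name comes from the comment above, while the helper class and the exact semantics in the linked branch are assumptions:

```java
import java.util.Set;
import java.util.zip.ZipException;

// Hypothetical helper: decides whether an archive entry should be written,
// based on whether its name was already seen and on the system property above.
final class DuplicateEntryCheck {
	// Per the comment above: duplicates are a failure by default; setting the
	// property to "true" switches to ignoring them.
	private static final boolean FAIL_ON_DUPLICATES =
		!Boolean.getBoolean("FAIL_DUPLICATE_ARCHIVE_ENTRIES");

	// Returns true for the first occurrence of a name; on a later occurrence,
	// either throws (fail mode) or logs and returns false (ignore mode).
	static boolean shouldWrite(Set<String> seenNames, String entryName) throws ZipException {
		if (seenNames.add(entryName)) {
			return true;
		}
		if (FAIL_ON_DUPLICATES) {
			throw new ZipException("duplicate entry: " + entryName);
		}
		System.err.println("Ignoring duplicate entry: " + entryName);
		return false;
	}
}
```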

@tbitonti changed the title from "Ignore duplicate classes in the input stream" to "Ignore duplicate input archive entries" on Aug 17, 2021
@tbitonti added the enhancement (New feature or request) label on Aug 17, 2021
@tbitonti
Contributor

Duplicated by #156. I'm cancelling that issue and have modified this one to cover any duplication within an archive; the problem is not specific to class entries.

My tendency is to add an option that controls the behavior, for example (sketched below):

-duplicatePolicy ignore | warning | error
-duplicateSelection first | last

My preference is either warning or error, plus last; warning plus last seems to match the most typical behavior of other tools. (I don't like those particular option names.)
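
A hypothetical sketch of those two options as enums, under the placeholder names above; nothing here reflects an actual transformer API:

```java
// Placeholder names, mirroring the proposed options above.
enum DuplicatePolicy { IGNORE, WARNING, ERROR }

enum DuplicateSelection { FIRST, LAST }

final class DuplicateOptions {
	// Suggested defaults from the comment above: warn, and keep the last entry.
	static final DuplicatePolicy DEFAULT_POLICY = DuplicatePolicy.WARNING;
	static final DuplicateSelection DEFAULT_SELECTION = DuplicateSelection.LAST;
}
```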

I agree that the best fix is to correct the original archive to remove the duplicate entry. I understand, however, that this is not always easy to make happen.

My understanding is that the error stems from how various archive builders work: they stream out archive entries while building a list of entries to write to the central directory (the table at the end of the archive), and many tools do not check for duplicates.

Within the archive format, the entries themselves tolerate duplication: each entry is a more-or-less self-contained region of the archive with sufficient data to be extracted on its own. The central directory can also tolerate duplication.

However, in either case there are problems. Streaming through the entries, which is usually stateless except for the current entry cursor, causes the first duplicate to be overwritten by a later one during extraction, often producing an unexpected file-overwrite warning. A random access API (for example, WinZip) reads the central directory and usually builds a table that keeps only the last entry (for example, using a sorted dictionary data structure).
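
A small inspection sketch illustrating the two access styles; the archive path and entry name are hypothetical:

```java
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

public class DuplicateInspection {
	public static void main(String[] args) throws Exception {
		File archive = new File("application.war"); // hypothetical archive

		// Streaming read: walks the local entries in order, so every duplicate
		// occurrence is visited.
		try (ZipInputStream zin = new ZipInputStream(new FileInputStream(archive))) {
			ZipEntry entry;
			while ((entry = zin.getNextEntry()) != null) {
				System.out.println("stream sees: " + entry.getName());
			}
		}

		// Random-access read: a name lookup goes through the central directory
		// and returns a single entry; which duplicate it resolves to is
		// implementation dependent.
		try (ZipFile zip = new ZipFile(archive)) {
			ZipEntry entry = zip.getEntry("com/example/Duplicated.class"); // hypothetical name
			System.out.println("lookup returns: " + (entry == null ? "none" : entry.getName()));
		}
	}
}
```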

@bjhargrave
Member

Java's ZipOutputStream implementation does not tolerate duplicate entries, and it seems to me that a zip with duplicate entries is not well formed. What do duplicate entries even mean? A simple unzip would have the last duplicate overwriting the earlier duplicates. But in transforming a ZipInputStream into a ZipOutputStream, we would have to let the first duplicate entry take precedence, as we could not reverse the stream to replace an earlier duplicate entry.
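
For illustration, a minimal demonstration of that ZipOutputStream behavior (the entry name is invented):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

public class DuplicateEntryDemo {
	public static void main(String[] args) throws Exception {
		try (ZipOutputStream zout = new ZipOutputStream(new ByteArrayOutputStream())) {
			zout.putNextEntry(new ZipEntry("com/example/Dup.class"));
			zout.closeEntry();
			try {
				// A second entry with the same name is rejected immediately.
				zout.putNextEntry(new ZipEntry("com/example/Dup.class"));
			} catch (ZipException expected) {
				System.out.println(expected.getMessage()); // duplicate entry: com/example/Dup.class
			}
		}
	}
}
```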
