NIFI-12708 UnpackContent should allow the user to specify a character set to apply in reading paths and filenames #8350

Closed
umarhussain15 wants to merge 5 commits into apache:main from
umarhussain15:NIFI-12708-zip-encoding-option-in-unpackcontent

@umarhussain15
Contributor

Summary

NIFI-12708: add option in UnpackContent to specify encoding charset for filenames in zip unpacking

The processor now accepts a filename encoding property and passes it to zip unpacking. This lets
users unpack zip files whose entry names were written in a specific encoding and get correct filenames in the output.

For example, this helps with zip files created on Windows, which by default uses Cp437 for filename encoding.
If a filename contains special characters such as the German letters ä and ü, decoding it with Linux's
default encoding (usually UTF-8) produces ? in the output. When the same file is processed with the property
set to Cp437, the processor outputs correct filenames with the special characters preserved.
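
To illustrate the effect described above, here is a minimal sketch that is not taken from this PR. It uses the JDK's `java.util.zip` streams rather than whatever zip library UnpackContent uses internally, and the class and method names (`ZipFilenameCharsetSketch`, `createZip`, `listEntryNames`) are hypothetical. It assumes the JDK in use provides the Cp437 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipFilenameCharsetSketch {

    // Build a small in-memory zip whose entry name is encoded with the given charset.
    public static byte[] createZip(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("content".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Read the entry names back, decoding them with the given charset.
    public static List<String> listEntryNames(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset cp437 = Charset.forName("Cp437");
        // "B<u-umlaut>cher-<a-umlaut>.txt", written with escapes to avoid source-encoding issues.
        String name = "B\u00fccher-\u00e4.txt";

        // Decoding with the same charset used at creation time preserves the name.
        byte[] zipBytes = createZip(name, cp437);
        System.out.println(listEntryNames(zipBytes, cp437).get(0).equals(name));

        // Decoding the raw Cp437 name bytes as UTF-8 yields replacement characters instead.
        String misdecoded = new String(name.getBytes(cp437), StandardCharsets.UTF_8);
        System.out.println(misdecoded.contains("\uFFFD"));
    }
}
```

The second check decodes the raw name bytes directly rather than reading the archive with a mismatched charset, because how a zip reader reacts to undecodable names (replacement characters or an exception) depends on the library and version.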

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Signed-off-by: Umar Hussain <umarhussain.work@gmail.com>
Signed-off-by: Umar Hussain <umarhussain.work@gmail.com>
@joewitt
Contributor

joewitt commented Feb 2, 2024

@umarhussain15 Great start here. Left you a series of comments plus Dan just left one to consider as well. Thanks!

Signed-off-by: Umar Hussain <umarhussain.work@gmail.com>
The check for unknown special characters has been changed: the test now puts a Unicode special character into the filename when creating the zip and then checks for it in the processor's output.

Signed-off-by: Umar Hussain <umarhussain.work@gmail.com>
Signed-off-by: Umar Hussain <umarhussain.work@gmail.com>
@asfgit asfgit closed this in e00d2b6 Feb 7, 2024
@joewitt
Contributor

joewitt commented Feb 7, 2024

Thanks @umarhussain15 - good work and good discussion

@umarhussain15 umarhussain15 deleted the NIFI-12708-zip-encoding-option-in-unpackcontent branch February 9, 2024 09:21