Syft seems unable to parse non UTF-8 pom.xml files #2044

Closed
westonsteimel opened this issue Aug 21, 2023 · 1 comment · Fixed by #2047
Labels: bug (Something isn't working), good-first-issue (Good for newcomers)

Comments

@westonsteimel
Contributor

westonsteimel commented Aug 21, 2023

What happened:

Running syft against the jar from https://repo1.maven.org/maven2/com/alogient/cameleon/java/sdk/cameleon4java-sdk/1.12.2/cameleon4java-sdk-1.12.2.jar gives the following warning:

[0000]  WARN failed to parse pom.xml: unable to unmarshal pom.xml: XML syntax error on line 52: invalid UTF-8 contents-path=META-INF/maven/com.alogient.cameleon.java.sdk/cameleon4java-sdk/pom.xml location=/camel

In this case the specific field causing the issue is an author name:

<name>J�r�me Mirc</name>

Syft appears to decode the document as UTF-8; however, the file command reports the encoding as ISO-8859:

file META-INF/maven/com.alogient.cameleon.java.sdk/cameleon4java-sdk/pom.xml
META-INF/maven/com.alogient.cameleon.java.sdk/cameleon4java-sdk/pom.xml: ISO-8859 text, with CRLF line terminators

What you expected to happen:

Syft should be able to decode these documents and at least extract the groupId/artifactId. A large number of Maven artifacts end up with incorrect identifiers because syft cannot extract this information from their pom files.

Steps to reproduce the issue:

syft cameleon4java-sdk-1.12.2.jar

Anything else we need to know?:

Environment:

  • Output of syft version:
Application:        syft
Version:            0.87.1
JsonSchemaVersion:  10.0.0
BuildDate:          2023-08-17T18:57:49Z
GitCommit:          4762ba0943785fe778276893388e839e01787b45
GitDescription:     v0.87.1
Platform:           darwin/arm64
GoVersion:          go1.20.7
Compiler:           gc
  • OS (e.g: cat /etc/os-release or similar):
westonsteimel added the bug label Aug 21, 2023
@willmurphyscode
Contributor

Looks like a change to how character sets are handled is probably needed here:

func decodePomXML(content io.Reader) (project gopom.Project, err error) {
	decoder := xml.NewDecoder(content)
	// prevent against warnings for "xml: encoding "iso-8859-1" declared but Decoder.CharsetReader is nil"
	decoder.CharsetReader = charset.NewReaderLabel
	if err := decoder.Decode(&project); err != nil {
		return project, fmt.Errorf("unable to unmarshal pom.xml: %w", err)
	}
	return project, nil
}
