Improve MIME detection for sitemaps #200

Merged


sebastian-nagel
Contributor

  • avoid NPE if no MIME type has been detected
  • allow optional leading white space before MIME patterns (after optional BOM)

@kkrugler
Contributor

Hi @sebastian-nagel - thanks for the PR. I took an initial look at the changes. The logic for handling leading whitespace seemed a bit overly complex at first glance, but I'd like to review more (pull the branch and look at it in Eclipse) before commenting.

@sebastian-nagel
Contributor Author

  • improved the log messages for gzip-embedded sitemaps. The message
    Can't parse a sitemap with MediaType 'application/gzip' from '...'
    now becomes
    Failed to detect embedded MediaType of gzipped sitemap '...'
  • add support for application/rdf+xml (wrapping RSS feed)
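To illustrate the gzip case described above, here is a minimal sketch of where the new message could be produced. It is not the code from this PR: the class `GzipEmbeddedSketch`, its methods, and the toy detector (which only recognizes an XML declaration) are hypothetical, and only the wording of the failure message is taken from the comment above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; names and structure are assumptions, not the PR's code. */
final class GzipEmbeddedSketch {

    /** Compresses raw bytes so the example has gzipped input to work with. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Toy detector: decompresses the first bytes and looks for "<?xml" only. */
    static String detectEmbedded(byte[] gzipped) throws IOException {
        byte[] head = new byte[5];
        int total = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of the decompressed stream
                }
                total += n;
            }
        }
        String prefix = new String(head, 0, total, StandardCharsets.US_ASCII);
        return prefix.equals("<?xml") ? "text/xml" : null;
    }

    /** Reports the embedded type, or the failure message quoted in the PR. */
    static String describe(byte[] gzipped, String url) throws IOException {
        String embedded = detectEmbedded(gzipped);
        if (embedded == null) {
            return "Failed to detect embedded MediaType of gzipped sitemap '" + url + "'";
        }
        return "Embedded MediaType " + embedded;
    }
}
```

The point of the distinction is that the outer type (application/gzip) was detected correctly; what failed is detection of the content inside the archive, so the message names that step.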

private static class MimeTypeEntry {
    private String mimeType;
    private byte[] pattern;
    private boolean allowBOM;
Contributor

I think we just need an "isText" flag, which implies both an optional BOM and leading whitespace.

Contributor Author

Agreed.

for (MimeTypeEntry entry : mimeTypes) {
    if (patternMatches(entry.getPattern(), content, offset, length)) {
        return entry.getMimeType();
    }
    if (entry.allowBOM) {
Contributor

BOM has to be at the beginning. I also think we should simplify and assume offset == 0 always (nobody calls detect with non-zero offset). So just have an isBOM check (which advances the offset by BOM length), and then an isWhitespace check (same thing) and then the previous logic.
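The suggestion above (a BOM check that advances the offset, then a whitespace check that does the same, then the regular pattern check) could look roughly like the following. This is a minimal sketch, not the patch itself: the class `TextPrefixSkipSketch` and all method names are hypothetical, only the UTF-8 BOM is handled, and "whitespace" is assumed to mean ASCII space, tab, CR, and LF.

```java
/** Hypothetical sketch of "skip BOM, skip whitespace, then match"; not the PR's code. */
final class TextPrefixSkipSketch {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns the offset just past a leading UTF-8 BOM, or 0 if there is none. */
    static int skipBom(byte[] content, int length) {
        if (length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (content[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    /** Advances the offset past any run of ASCII whitespace. */
    static int skipWhitespace(byte[] content, int offset, int length) {
        while (offset < length && isAsciiWhitespace(content[offset])) {
            offset++;
        }
        return offset;
    }

    static boolean isAsciiWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    /** The "previous logic": a plain byte-for-byte comparison at the given offset. */
    static boolean patternMatches(byte[] pattern, byte[] content, int offset, int length) {
        if (length - offset < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (content[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Text pattern check: offset starts at 0 and is only ever advanced. */
    static boolean startsWithText(byte[] pattern, byte[] content, int length) {
        int offset = skipBom(content, length);
        offset = skipWhitespace(content, offset, length);
        return patternMatches(pattern, content, offset, length);
    }
}
```

Because the offset always starts at 0, no offset parameter is needed on the public entry point, which matches the simplification proposed in the comment.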

Contributor Author

> BOM has to be at the beginning.

Of course, but in my test set there are sitemaps with two BOMs. We would also need to change the BOMInputStream (so that it skips white space as well), but detection comes first.

> assume offset == 0 always (nobody calls detect with non-zero offset)

Agreed.


mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], "\037\213"));
mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], 0x1F, 0x8B));
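The two lines above give the gzip magic number once as a string with octal escapes and once as explicit byte values. The thread does not say why both forms appear or how the string form was converted to bytes, so the following is only an illustration of one general pitfall with string literals for binary patterns: the result depends on the charset used for conversion. The class `GzipMagicSketch` and its helper `fromInts` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not the MimeTypeEntry constructor from this PR. */
final class GzipMagicSketch {

    /** Builds a byte pattern from int values such as 0x1F, 0x8B. */
    static byte[] fromInts(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }

    public static void main(String[] args) {
        // "\037\213" is two chars: U+001F and U+008B.
        String literal = "\037\213";

        // In UTF-8, U+008B is encoded as the two bytes 0xC2 0x8B.
        System.out.println(literal.getBytes(StandardCharsets.UTF_8).length);      // prints 3

        // In ISO-8859-1 each char maps to one byte, giving 0x1F 0x8B.
        System.out.println(literal.getBytes(StandardCharsets.ISO_8859_1).length); // prints 2

        // Explicit byte values do not depend on any charset.
        System.out.println(fromInts(0x1F, 0x8B).length);                          // prints 2
    }
}
```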

maxPatternLength = 0;
for (MimeTypeEntry entry : mimeTypes) {
maxPatternLength = Math.max(maxPatternLength, entry.getPattern().length);
int length = entry.getPattern().length;
Contributor

As per comment below, I think we can skip this by just treating BOM and leading whitespace checks as special, and adjust the offset for those, then fall into the regular pattern check.

Contributor Author

The method detect(InputStream is) requires the maximum pattern length in order to size the byte[] array. The length must be longer for "text" patterns. But with an isTextPattern flag, this could be simplified by just adding LEADING_WHITESPACE_MAX_SKIP.
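A rough sketch of the simplification described in this comment: the read buffer for the stream variant is sized from the longest pattern, and text patterns get extra headroom. Everything here is an assumption made for illustration: the class `MaxPatternLengthSketch`, the `Entry` type, the helper `readHead`, and the value 16 for `LEADING_WHITESPACE_MAX_SKIP` (the thread names the constant but not its value, nor whether it also covers the BOM).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of sizing the detection buffer; not the PR's code. */
final class MaxPatternLengthSketch {

    /** Assumed value; the real constant's value is not given in the thread. */
    static final int LEADING_WHITESPACE_MAX_SKIP = 16;

    static final class Entry {
        final String mimeType;
        final byte[] pattern;
        final boolean isTextPattern;

        Entry(String mimeType, byte[] pattern, boolean isTextPattern) {
            this.mimeType = mimeType;
            this.pattern = pattern;
            this.isTextPattern = isTextPattern;
        }
    }

    /** Longest number of bytes any entry may need to inspect. */
    static int maxPatternLength(List<Entry> entries) {
        int max = 0;
        for (Entry entry : entries) {
            int length = entry.pattern.length;
            if (entry.isTextPattern) {
                // Text patterns may be preceded by skippable bytes.
                length += LEADING_WHITESPACE_MAX_SKIP;
            }
            max = Math.max(max, length);
        }
        return max;
    }

    /** Reads up to maxLength bytes from the stream, fewer if it ends early. */
    static byte[] readHead(InputStream is, int maxLength) throws IOException {
        byte[] buffer = new byte[maxLength];
        int total = 0;
        while (total < maxLength) {
            int n = is.read(buffer, total, maxLength - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }
}
```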

Contributor

@kkrugler left a comment

Hi Sebastian - I think the detect() logic could be simpler; I've added comments about that.

@sebastian-nagel
Contributor Author

Thanks, @kkrugler. I'll update the PR tomorrow.

- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset
int offsetBOM = -1;
int offsetSpace = -1;
public String detect(byte[] content, int length) {
int offsetText = -1;
Contributor

I don't understand why this is initialized here, and changed in the per-mimetype loop below (more comments there).

Contributor

OK, I get it; you only want to do the advance once.

Contributor Author

Yes, I want to do it only once for all patterns and only if another non-text pattern didn't match before.
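The "advance once" idea from this exchange can be sketched as follows. The text prefix (BOMs and white space) is scanned lazily, the first time a text pattern fails to match at offset 0, and the resulting offset is reused for every later text pattern. This is an assumption-laden illustration rather than the merged code: the class `LazyTextOffsetSketch`, its `Entry` type, and `skipBomsAndWhitespace` are hypothetical, only the UTF-8 BOM is recognized, and repeated BOMs are skipped because the author mentions sitemaps with two BOMs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "advance once" detection; not the code from this PR. */
final class LazyTextOffsetSketch {

    private static final class Entry {
        final String mimeType;
        final byte[] pattern;
        final boolean isText;

        Entry(String mimeType, byte[] pattern, boolean isText) {
            this.mimeType = mimeType;
            this.pattern = pattern;
            this.isText = isText;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String mimeType, byte[] pattern, boolean isText) {
        entries.add(new Entry(mimeType, pattern, isText));
    }

    /** Returns the first matching MIME type, or null if nothing matches. */
    String detect(byte[] content, int length) {
        int offsetText = -1; // -1 means the prefix has not been scanned yet
        for (Entry entry : entries) {
            if (matches(entry.pattern, content, 0, length)) {
                return entry.mimeType;
            }
            if (entry.isText) {
                if (offsetText == -1) {
                    offsetText = skipBomsAndWhitespace(content, length); // done at most once
                }
                if (offsetText > 0 && matches(entry.pattern, content, offsetText, length)) {
                    return entry.mimeType;
                }
            }
        }
        return null;
    }

    /** Skips any mix of UTF-8 BOMs and ASCII whitespace at the start. */
    static int skipBomsAndWhitespace(byte[] content, int length) {
        int offset = 0;
        while (offset < length) {
            byte b = content[offset];
            if (length - offset >= 3 && b == (byte) 0xEF
                    && content[offset + 1] == (byte) 0xBB && content[offset + 2] == (byte) 0xBF) {
                offset += 3;
            } else if (b == ' ' || b == '\t' || b == '\n' || b == '\r') {
                offset++;
            } else {
                break;
            }
        }
        return offset;
    }

    private static boolean matches(byte[] pattern, byte[] content, int offset, int length) {
        if (length - offset < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (content[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch `detect` returns null when nothing matches, so callers have to check the result before using it; that is the kind of situation the "avoid NPE" item in the PR description refers to.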

Contributor

@kkrugler left a comment

lgtm

3 participants