-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MIME detection for sitemaps #200
Improve MIME detection for sitemaps #200
Conversation
sebastian-nagel
commented
Apr 10, 2018
- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns (after optional BOM)
- avoid NPE if no MIME type has been detected - allow optional leading white space before MIME patterns (after optional BOM)
Hi @sebastian-nagel - thanks for the PR. I took an initial look at the changes. The logic for handling leading whitespace seemed a bit overly complex at first glance, but I'd like to review more (pull the branch and look at it in Eclipse) before commenting. |
|
private static class MimeTypeEntry { | ||
private String mimeType; | ||
private byte[] pattern; | ||
private boolean allowBOM; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just a "isText" flag, which implies both BOM and leading whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
for (MimeTypeEntry entry : mimeTypes) { | ||
if (patternMatches(entry.getPattern(), content, offset, length)) { | ||
return entry.getMimeType(); | ||
} | ||
if (entry.allowBOM) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOM has to be at the beginning. I also think we should simplify and assume offset == 0 always (nobody calls detect with non-zero offset). So just have an isBOM check (which advances the offset by BOM length), and then an isWhitespace check (same thing) and then the previous logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOM has to be at the beginning.
Of course, but in my test set there are sitemaps with two BOMs. We would need also to change the BOMInputStream (also to skip white space), but detection comes first.
assume offset == 0 always (nobody calls detect with non-zero offset)
Agreed.
|
||
mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], "\037\213")); | ||
mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], 0x1F, 0x8B)); | ||
|
||
maxPatternLength = 0; | ||
for (MimeTypeEntry entry : mimeTypes) { | ||
maxPatternLength = Math.max(maxPatternLength, entry.getPattern().length); | ||
int length = entry.getPattern().length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per comment below, I think we can skip this by just treating BOM and leading whitespace checks as special, and adjust the offset for those, then fall into the regular pattern check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method detect(InputStream is)
requires the max. pattern length to create the byte[] array. The length must be longer for "text" patterns. But with a isTextPattern flag, this could be simplified by just adding LEADING_WHITESPACE_MAX_SKIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Sebastian - I think the detect() logic could be simpler, added comments re that.
Thanks, @kkrugler. I'll update the PR tomorrow. |
- handle BOM and leading white space together - remove parameter to detect patterns at a specific offset
int offsetBOM = -1; | ||
int offsetSpace = -1; | ||
public String detect(byte[] content, int length) { | ||
int offsetText = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this is initialized here, and changed in the per-mimetype loop below (more comments there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I get it; you only want to do the advance once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I want to do it only once for all patterns and only if another non-text pattern didn't match before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm