Improve MIME detection for sitemaps #200

sebastian-nagel · 2018-04-10T20:02:36Z

avoid NPE if no MIME type has been detected
allow optional leading white space before MIME patterns (after optional BOM)

kkrugler · 2018-04-13T04:10:01Z

Hi @sebastian-nagel - thanks for the PR. I took an initial look at the changes. The logic for handling leading whitespace seemed a bit overly complex at first glance, but I'd like to review more (pull the branch and look at it in Eclipse) before commenting.

sebastian-nagel · 2018-04-18T12:51:41Z

- improved the log messages for gzip-embedded sitemaps. The message
  Can't parse a sitemap with MediaType 'application/gzip' from '...'
  now becomes
  Failed to detect embedded MediaType of gzipped sitemap '...'
- add support for application/rdf+xml (wrapping RSS feed)
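
To illustrate the new message, here is a minimal sketch of how a gzip-wrapped sitemap could be inflated, sniffed, and reported on failure. It is not the crawler-commons code: the class `GzipSniffSketch`, the methods `sniff`, `detectEmbedded` and `gzip`, the 64-byte read limit, the `<rdf:RDF` pattern, and the use of `IllegalStateException` are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffSketch {

    /** Hypothetical stand-in for content-based detection; returns null if nothing matches. */
    static String sniff(byte[] head, int length) {
        String s = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        if (s.startsWith("<?xml")) {
            return "application/xml";
        }
        if (s.startsWith("<rdf:RDF")) {
            return "application/rdf+xml"; // RDF wrapper around an RSS feed
        }
        return null;
    }

    /** Inflates the first bytes of a gzipped sitemap and reports the embedded type. */
    static String detectEmbedded(byte[] gzipped, String url) {
        byte[] head = new byte[64]; // assumed to be enough for the patterns above
        int filled = 0;
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            int n;
            while (filled < head.length && (n = in.read(head, filled, head.length - filled)) > 0) {
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String type = sniff(head, filled);
        if (type == null) {
            // Name the real problem instead of complaining about 'application/gzip'.
            throw new IllegalStateException(
                    String.format("Failed to detect embedded MediaType of gzipped sitemap '%s'", url));
        }
        return type;
    }

    /** Helper for trying the sketch: gzip a short ASCII string in memory. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Calling `GzipSniffSketch.detectEmbedded(GzipSniffSketch.gzip("hello"), url)` fails with the "Failed to detect embedded MediaType" text, because the inner content matches no pattern, while the outer gzip type itself was detected fine.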

kkrugler · 2018-04-23T15:45:33Z

src/main/java/crawlercommons/mimetypes/MimeTypeDetector.java

    private static class MimeTypeEntry {
        private String mimeType;
        private byte[] pattern;
+        private boolean allowBOM;


I think just an "isText" flag, which implies both BOM and leading whitespace, would be enough.
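
A rough sketch of what such an entry could look like with a single flag. The class name `MimeTypeEntrySketch` and the split into a text constructor and a binary varargs constructor are assumptions for illustration; the real `MimeTypeEntry` may be shaped differently.

```java
import java.nio.charset.StandardCharsets;

final class MimeTypeEntrySketch {
    private final String mimeType;
    private final byte[] pattern;
    // One flag for text patterns: implies optional BOM and optional leading white space.
    private final boolean isText;

    /** Text pattern, e.g. "<?xml". */
    MimeTypeEntrySketch(String mimeType, String pattern) {
        this(mimeType, pattern.getBytes(StandardCharsets.US_ASCII), true);
    }

    /** Binary pattern given as unsigned byte values, e.g. 0x1F, 0x8B for gzip. */
    MimeTypeEntrySketch(String mimeType, int... unsignedBytes) {
        this(mimeType, toBytes(unsignedBytes), false);
    }

    private MimeTypeEntrySketch(String mimeType, byte[] pattern, boolean isText) {
        this.mimeType = mimeType;
        this.pattern = pattern;
        this.isText = isText;
    }

    private static byte[] toBytes(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return bytes;
    }

    String getMimeType() {
        return mimeType;
    }

    byte[] getPattern() {
        return pattern;
    }

    boolean isText() {
        return isText;
    }
}
```

With this shape, `new MimeTypeEntrySketch("application/gzip", 0x1F, 0x8B)` is never matched behind a BOM or white space, while a text entry such as `new MimeTypeEntrySketch("text/xml", "<?xml")` is.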

kkrugler · 2018-04-23T15:48:40Z

src/main/java/crawlercommons/mimetypes/MimeTypeDetector.java

        for (MimeTypeEntry entry : mimeTypes) {
            if (patternMatches(entry.getPattern(), content, offset, length)) {
                return entry.getMimeType();
            }
+            if (entry.allowBOM) {


BOM has to be at the beginning. I also think we should simplify and assume offset == 0 always (nobody calls detect with non-zero offset). So just have an isBOM check (which advances the offset by BOM length), and then an isWhitespace check (same thing) and then the previous logic.

BOM has to be at the beginning.

Of course, but my test set contains sitemaps with two BOMs. We would also need to change the BOMInputStream (so that it skips white space as well), but detection comes first.

assume offset == 0 always (nobody calls detect with non-zero offset)

Agreed.
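
The "BOM check, then whitespace check, then the previous logic" idea can be sketched as a single helper that returns the offset at which pattern matching should start. This is only an illustration under assumptions: it handles the UTF-8 BOM only, tolerates repeated BOMs (as seen in the test set mentioned above) by looping, and the names `TextPrefixSketch`, `skipBomAndWhitespace` and the limit `MAX_WHITESPACE_SKIP = 16` are hypothetical.

```java
final class TextPrefixSketch {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    static final int MAX_WHITESPACE_SKIP = 16; // assumed limit, not taken from the PR

    /** Returns the offset of the first byte after any BOM(s) and leading white space. */
    static int skipBomAndWhitespace(byte[] content, int length) {
        int offset = 0;
        // A BOM is only valid at the very start; the loop tolerates a doubled BOM.
        while (startsWith(content, offset, length, UTF8_BOM)) {
            offset += UTF8_BOM.length;
        }
        int limit = Math.min(length, offset + MAX_WHITESPACE_SKIP);
        while (offset < limit && isWhitespace(content[offset])) {
            offset++;
        }
        return offset;
    }

    /** True if the pattern occurs in content at the given offset, within length bytes. */
    static boolean startsWith(byte[] content, int offset, int length, byte[] pattern) {
        if (length - offset < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (content[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }
}
```

Because the offset always starts at 0, the caller no longer needs an offset parameter: it calls `skipBomAndWhitespace(content, length)` and then runs the regular pattern check at the returned position.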

kkrugler · 2018-04-23T15:49:31Z

src/main/java/crawlercommons/mimetypes/MimeTypeDetector.java


        mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], "\037\213"));
        mimeTypes.add(new MimeTypeEntry(GZIP_MIMETYPES[0], 0x1F, 0x8B));

        maxPatternLength = 0;
        for (MimeTypeEntry entry : mimeTypes) {
-            maxPatternLength = Math.max(maxPatternLength, entry.getPattern().length);
+            int length = entry.getPattern().length;


As per comment below, I think we can skip this by just treating BOM and leading whitespace checks as special, and adjust the offset for those, then fall into the regular pattern check.

The method detect(InputStream is) needs the maximum pattern length to size the byte[] array, and that length must be longer for "text" patterns. With an isTextPattern flag, this could be simplified by just adding LEADING_WHITESPACE_MAX_SKIP.
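
A small sketch of that buffer-size calculation, under assumptions: the value `16` for `LEADING_WHITESPACE_MAX_SKIP` is made up (the thread names the constant but not its value), and `PatternBufferSketch`, `maxPatternLength` and `readHead` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class PatternBufferSketch {
    static final int LEADING_WHITESPACE_MAX_SKIP = 16; // assumed value

    /** Longest number of bytes any pattern may need, including the skip allowance for text patterns. */
    static int maxPatternLength(byte[][] patterns, boolean[] isTextPattern) {
        int max = 0;
        for (int i = 0; i < patterns.length; i++) {
            int length = patterns[i].length;
            if (isTextPattern[i]) {
                length += LEADING_WHITESPACE_MAX_SKIP;
            }
            max = Math.max(max, length);
        }
        return max;
    }

    /** Reads at most maxPatternLength bytes from the stream for detection. */
    static byte[] readHead(InputStream is, int maxPatternLength) {
        byte[] buffer = new byte[maxPatternLength];
        int filled = 0;
        try {
            int n;
            while (filled < buffer.length && (n = is.read(buffer, filled, buffer.length - filled)) > 0) {
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buffer, filled);
    }
}
```

The allowance is added only to text patterns, so a two-byte binary signature such as gzip's does not inflate the buffer on its own.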

kkrugler

Hi Sebastian - I think the detect() logic could be simpler; I added comments about that.

sebastian-nagel · 2018-04-23T19:40:08Z

Thanks, @kkrugler. I'll update the PR tomorrow.

- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset

kkrugler · 2018-04-24T15:53:32Z

src/main/java/crawlercommons/mimetypes/MimeTypeDetector.java

-        int offsetBOM = -1;
-        int offsetSpace = -1;
+    public String detect(byte[] content, int length) {
+        int offsetText = -1;


I don't understand why this is initialized here, and changed in the per-mimetype loop below (more comments there).

OK, I get it; you only want to do the advance once.

Yes, I want to do it only once for all patterns and only if another non-text pattern didn't match before.
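
The "advance only once" idea could look roughly like the following. This is a sketch, not the merged code: the class `LazyOffsetDetectSketch`, its `Entry` type, the registered patterns and MIME type strings, and the `skipCalls` counter (added purely to make the laziness observable) are all assumptions. Only the `detect(byte[] content, int length)` signature and the `offsetText = -1` sentinel come from the diff above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LazyOffsetDetectSketch {
    static final class Entry {
        final String mimeType;
        final byte[] pattern;
        final boolean isText;

        Entry(String mimeType, byte[] pattern, boolean isText) {
            this.mimeType = mimeType;
            this.pattern = pattern;
            this.isText = isText;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    int skipCalls = 0; // instrumentation for this sketch only

    LazyOffsetDetectSketch() {
        entries.add(new Entry("application/gzip", new byte[] {0x1F, (byte) 0x8B}, false));
        entries.add(new Entry("text/xml", "<?xml".getBytes(StandardCharsets.US_ASCII), true));
        entries.add(new Entry("application/rss+xml", "<rss".getBytes(StandardCharsets.US_ASCII), true));
    }

    /** Returns the detected MIME type, or null if nothing matched (callers must check for null). */
    String detect(byte[] content, int length) {
        int offsetText = -1; // -1 means "not computed yet"
        for (Entry entry : entries) {
            if (matches(entry.pattern, content, 0, length)) {
                return entry.mimeType;
            }
            if (entry.isText) {
                if (offsetText == -1) {
                    // Computed at most once, and only after a pattern failed at offset 0.
                    offsetText = skipTextPrefix(content, length);
                }
                if (offsetText > 0 && matches(entry.pattern, content, offsetText, length)) {
                    return entry.mimeType;
                }
            }
        }
        return null;
    }

    private int skipTextPrefix(byte[] content, int length) {
        skipCalls++;
        int offset = 0;
        if (length >= 3 && content[0] == (byte) 0xEF && content[1] == (byte) 0xBB && content[2] == (byte) 0xBF) {
            offset = 3; // UTF-8 BOM
        }
        while (offset < length && (content[offset] == ' ' || content[offset] == '\t'
                || content[offset] == '\n' || content[offset] == '\r')) {
            offset++;
        }
        return offset;
    }

    private static boolean matches(byte[] pattern, byte[] content, int offset, int length) {
        if (offset + pattern.length > length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (content[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Content that starts with the gzip signature returns before any text entry is reached, so the prefix scan never runs for it. Returning `null` when nothing matches leaves it to the caller to check the result, which relates to the NPE mentioned in the PR description.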

kkrugler

lgtm

Fix MIME detection for sitemaps:

34c19d8

- avoid NPE if no MIME type has been detected
- allow optional leading white space before MIME patterns (after optional BOM)

sebastian-nagel mentioned this pull request Apr 10, 2018

Remove Tika dependency #198

Merged

sebastian-nagel requested a review from kkrugler April 10, 2018 20:05

sebastian-nagel mentioned this pull request Apr 11, 2018

Improve sitemap parsing #205

Merged

sebastian-nagel added 2 commits April 18, 2018 14:43

Improve logging of content type detection for gzip-compressed sitemaps

4780678

RDF-based RSS feeds: map MIME type, detect from content

72aa177

Fix error message format string

6714ea5

kkrugler reviewed Apr 23, 2018

Simplify MIME detection:

a6b3178

- handle BOM and leading white space together
- remove parameter to detect patterns at a specific offset

kkrugler reviewed Apr 24, 2018

kkrugler approved these changes Apr 24, 2018

sebastian-nagel merged commit a9277ac into crawler-commons:master Apr 25, 2018

sebastian-nagel deleted the cc-198-fix-regressions branch April 25, 2018 07:21

jnioche added this to the 0.10 milestone Apr 25, 2018

jnioche added enhancement sitemaps labels Apr 25, 2018

sebastian-nagel mentioned this pull request Mar 14, 2019

Detection and parsing of XML sitemaps fails with whitespace before XML declaration #144

Closed
