[Sitemaps] support some extensions to the sitemaps protocol #162

tuxnco · 2017-04-25T16:21:56Z

Add support for the alternate links extension (#149), image extension (#36), video extension (#35) as well as news extension.

add .idea to git ignored files
add data model for sitemap's video extension
add data model for sitemap's news extension
add data model for sitemap's link (xhtml) extension
add data model for sitemap's image extension
add videos, news, images and links attributes to SiteMapURL
add parsing for video extension
add UT for Video extension, implement equality methods
add parsing of image sitemaps, unit tests and sample sitemap test resource
plug support for XHTML Links extraction from sitemap
add parser for news sitemaps, with some UT
add missing license, cleanup
add missing licenses, updated javadoc
implement hashCode where required, improve equals, cleanup
VideoAttributes.VideoPrice: fix equals and implement hashCode
add validation of media sitemaps
add dependency on org.apache.commons:commons-lang3@3.5

lewismc · 2017-04-25T18:22:44Z

Hi @tuxnco the build is currently failing due to the following issues

[ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapExtensionParser (SiteMapExtensionParser.java:249)

[ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapExtensionParser (SiteMapExtensionParser.java:265)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:817)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:823)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:831)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:846)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:854)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:862)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.VideoAttributes (VideoAttributes.java:444)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.VideoAttributes$VideoPrice (VideoAttributes.java:391)

[ERROR] Scanned 43 (and 321 related) class file(s) for forbidden API invocations (in 0.20s), 10 error(s).

Can you please push a small commit to fix these issues? They are triggered by the forbidden API checker plugin.
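For readers unfamiliar with this check: the forbidden API checker rejects the overloads of String#toLowerCase and String#format that silently use the JVM's default locale. A minimal sketch of the usual fix follows, assuming Locale.ROOT is an acceptable choice here; the class and method names are hypothetical and not taken from the PR.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LocaleSafeStrings {

    // Lower-case with an explicit locale so the result does not depend on the
    // JVM default (under a Turkish default locale, "TITLE" would otherwise
    // lower-case to a string containing a dotless i).
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Format with an explicit locale so the decimal separator stays a dot.
    static String price(String currency, double amount) {
        return String.format(Locale.ROOT, "%s %.2f", currency, amount);
    }

    public static void main(String[] args) {
        System.out.println(lower("TITLE"));
        System.out.println(price("EUR", 1.5));
    }
}
```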

tuxnco · 2017-04-25T20:02:55Z

Hi @lewismc, thank you very much for your quick look at this!
I fixed the issues spotted by the forbidden API checker plugin, let me know if you want to discuss this PR more in depth, comments are welcome!

For example, in order to provide a portable way to extract extended attributes from sitemaps (e.g. from <video_namespace:video> or <news_namespace:news> nodes), I had to turn on XML namespace awareness at the DOM DocumentBuilderFactory level, see: https://github.com/cogniteev/crawler-commons/blob/e21228bd8d4da7efe7ed7d80442d55693f87e84a/src/main/java/crawlercommons/sitemaps/SiteMapParser.java#L360
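To illustrate why namespace awareness matters, here is a minimal sketch using only the JDK's DOM API. It is not the code from the PR; the class and method names are hypothetical. With setNamespaceAware(true), elements can be looked up by namespace URI and local name, so the prefix chosen by the sitemap author no longer matters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, for illustration only.
final class NamespaceAwareCount {
    static final String VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1";

    // Counts <video> elements in the video namespace, whatever prefix is used.
    static int countVideoElements(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this call the parser does not resolve namespaces and
            // lookups by namespace URI find nothing.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagNameNS(VIDEO_NS, "video").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sitemap", e);
        }
    }
}
```

The lookup works even if the document binds the namespace to a prefix other than "video", for example "v".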

lewismc · 2017-04-25T20:03:34Z

Yes I'll certainly review, this is a nice patch.

sebastian-nagel · 2017-04-25T20:47:57Z

Hi, thanks looks promising...

I've started to test the PR on a set of sitemaps (see here). I'll continue testing later but got stuck at this NPE:

% java ... crawlercommons.sitemaps.SiteMapTester https://shinpaideshou.wordpress.com/news-sitemap.xml
17/04/25 22:19:20 INFO sitemaps.SiteMapTester:64 - Parsing https://shinpaideshou.wordpress.com/news-sitemap.xml  using DOM parser
Exception in thread "main" java.lang.NullPointerException
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseNewsNode(SiteMapExtensionParser.java:134)
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseNews(SiteMapExtensionParser.java:91)
        at crawlercommons.sitemaps.SiteMapParser.parseXmlSitemap(SiteMapParser.java:445)

With "Optional SAX Parser for Sitemaps" (#150) we have an alternative SAX-based parser which was intended as a replacement for the DOM parser. Porting this PR to the SAX parser looks like a lot of work, but maybe we now have to accept two competing implementations: a feature-rich DOM parser and a robust, fast SAX parser.
The {Image,Video,News,Link}Attributes classes share fields. Just as a suggestion: would it help to organize the classes by inheritance? If they implement an Attribute interface, we could also hold them in a single attribute list.

tuxnco · 2017-04-25T21:33:08Z

@sebastian-nagel Thanks a lot for your feedback. Indeed, the lack of "keywords" is already handled correctly; only the stock_tickers attribute was affected by the NPE here.

Regarding your previous comment, I agree with the second point. I clearly think there exist two different use cases:

find as many links as possible, as fast as possible, under possibly difficult circumstances, which is what the SAX-based parser aims to do (failsafe)
a feature-rich DOM-based parser, aimed at providing compliance checking tools and extracting as much metadata as possible, aware of the underlying objects it is handling...

These last words bring me to the third and last point. I acknowledge that the {Image,Video,News,Link}Attributes classes share fields. But an <image:title> does not carry exactly the same semantics as a <video:title> (the latter is required while the former is not, as per the "specs", if we can call them that).
This is why I introduced no inheritance in the first place.
One possible refactoring could be as follows: an ExtensionParser abstract class, responsible for handling a given extension, would add ExtensionMetadata (TBD) descendants to the originating SiteMapURL instance. It could, for example, hold the helpers used to extract DOM nodes and attributes.
It could be extended by others willing to provide implementations for other protocol extensions.
However, I cannot say much about what the contract of ExtensionMetadata should look like, as this seems closely tied to what each extension provides. Maybe at least equals, hashCode and isValid?
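The refactoring idea could look roughly like the sketch below. It is purely illustrative: ExtensionParser and ExtensionMetadata are the names floated in this thread, but their shape here (the generic parameter, childText, ImageMetadata, ImageExtensionParser, firstImage) is an assumption and not code from the PR.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Assumed contract: equals and hashCode come from Object, isValid is added.
interface ExtensionMetadata {
    boolean isValid();
}

// Assumed base class holding the DOM helpers shared by all extensions.
abstract class ExtensionParser<T extends ExtensionMetadata> {
    private final String namespaceUri;

    protected ExtensionParser(String namespaceUri) {
        this.namespaceUri = namespaceUri;
    }

    // Text of the first descendant with this local name in the extension
    // namespace, or null if there is none.
    protected String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS(namespaceUri, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    abstract T parse(Element extensionElement);
}

// Each extension decides for itself which fields are required.
final class ImageMetadata implements ExtensionMetadata {
    final String loc;   // required for images
    final String title; // optional for images

    ImageMetadata(String loc, String title) {
        this.loc = loc;
        this.title = title;
    }

    @Override
    public boolean isValid() {
        return loc != null && !loc.isEmpty();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ImageMetadata)) return false;
        ImageMetadata that = (ImageMetadata) other;
        return Objects.equals(loc, that.loc) && Objects.equals(title, that.title);
    }

    @Override
    public int hashCode() {
        return Objects.hash(loc, title);
    }
}

final class ImageExtensionParser extends ExtensionParser<ImageMetadata> {
    static final String NS = "http://www.google.com/schemas/sitemap-image/1.1";

    ImageExtensionParser() {
        super(NS);
    }

    @Override
    ImageMetadata parse(Element image) {
        return new ImageMetadata(childText(image, "loc"), childText(image, "title"));
    }

    // Usage helper: parses the first image element of a document, or returns null.
    static ImageMetadata firstImage(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList images = doc.getElementsByTagNameNS(NS, "image");
            if (images.getLength() == 0) return null;
            return new ImageExtensionParser().parse((Element) images.item(0));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sitemap", e);
        }
    }
}
```

Keeping isValid on the metadata type lets a video implementation require a title while the image implementation does not, which is the semantic difference described above.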

sebastian-nagel · 2017-04-25T22:17:01Z

@tuxnco never mind about the "keywords" (I didn't look at the code), but looks like there are definitely some more attributes to check:

17/04/26 00:05:07 INFO sitemaps.SiteMapTester:64 - Parsing http://www.hebdenbridgetimes.co.uk/sitemap-article-2015-18.xml  using DOM parser
Exception in thread "main" java.lang.NullPointerException
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseVideoNode(SiteMapExtensionParser.java:206)
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseVideos(SiteMapExtensionParser.java:81)
        at crawlercommons.sitemaps.SiteMapParser.parseXmlSitemap(SiteMapParser.java:443)

I'll answer tomorrow about the refactoring...

tuxnco · 2017-04-25T22:46:57Z

Thanks again Sebastian, let's continue the discussion tomorrow.
BTW, your commonscrawl-seed sitemaps dataset looks like a great stress test. Any chance I could access it?
I looked around at https://commoncrawl.s3.amazonaws.com/ but I think that is not the same bucket.

sebastian-nagel · 2017-04-26T12:55:30Z

It's only used internally and is not an officially released data set. Sitemap URLs can easily be mined from the robots.txt data set, e.g. by running https://github.com/commoncrawl/cc-mrjob/blob/master/sitemaps_from_robotstxt.py. But I'll send you the location of the test data.

I like your draft with ExtensionMetadata and ExtensionParser.

sebastian-nagel · 2017-05-04T16:34:46Z

src/main/java/crawlercommons/sitemaps/SiteMapExtensionParser.java

+        NamedNodeMap attributes = elem.getAttributes();
+        URI href = null;
+        Map<String, String> params = new HashMap<>(attributes.getLength()-1);
+        if (attributes != null) {


Swap with the preceding line? attributes.getLength() may already cause an NPE.
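A minimal sketch of the reordering suggested here, assuming the map is meant to collect every attribute except href (that intent is a guess from the diff context, and LinkParams and paramsOf are hypothetical names). The null check runs before any call on attributes, and the map is created without a computed capacity, so it can never receive a negative one.

```java
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper class, for illustration only.
final class LinkParams {

    // Returns all attributes except href as a name/value map.
    static Map<String, String> paramsOf(NamedNodeMap attributes) {
        // Default capacity: new HashMap<>(n - 1) would throw
        // IllegalArgumentException when n is 0.
        Map<String, String> params = new HashMap<>();
        if (attributes == null) {
            return params; // nothing to read, and no NPE from getLength()
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attr = attributes.item(i);
            if (!"href".equals(attr.getNodeName())) {
                params.put(attr.getNodeName(), attr.getNodeValue());
            }
        }
        return params;
    }
}
```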

sebastian-nagel · 2017-05-04T16:46:01Z

Hi @tuxnco, great! I can confirm that the latest version (5281a6e) is robust and does not fail with uncaught exceptions on any of the sitemaps contained in sitemap-test-2017-03-03.warc.gz and sitemap-test-2017-03-04.warc.gz.

Anyone to continue the review? Shall we improve the way attributes are added and accessed by inheriting from ExtensionMetadata and ExtensionParser as proposed?

jnioche · 2017-05-05T13:42:34Z

Thanks for this contrib @tuxnco and making concrete progress on a long-debated functionality.

Isn't there a way we could have a generic way of storing the additional data, e.g. a Map<String, Object>? I am not super comfortable with having extension-specific code in the core classes e.g. SiteMapURL. We could have extension specific code to simplify the access to the info stored in the Map.

Or is it what you were suggesting with ExtensionMetadata and ExtensionParser?
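One way to read this suggestion is sketched below: the core class only holds an untyped map, and a separate extension-specific class provides typed access. All names here (SiteMapUrlSketch, NewsExtension, the "news.title" key) are hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the core class: it knows nothing about specific extensions.
final class SiteMapUrlSketch {
    private final Map<String, Object> attributes = new HashMap<>();

    void setAttribute(String key, Object value) {
        attributes.put(key, value);
    }

    Object getAttribute(String key) {
        return attributes.get(key);
    }
}

// Extension-specific helper that hides the map keys and the casts.
final class NewsExtension {
    private static final String TITLE_KEY = "news.title";

    private NewsExtension() {
    }

    static void setTitle(SiteMapUrlSketch url, String title) {
        url.setAttribute(TITLE_KEY, title);
    }

    // Returns the news title, or null if none was stored.
    static String getTitle(SiteMapUrlSketch url) {
        Object value = url.getAttribute(TITLE_KEY);
        return value instanceof String ? (String) value : null;
    }
}
```

The trade-off is that type safety moves from the core class into the helpers.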

MichealKum · 2017-06-21T06:57:03Z

src/main/java/crawlercommons/sitemaps/ImageAttributes.java

+            return false;
+        }
+        ImageAttributes that = (ImageAttributes) other;
+        if (!Objects.equals(loc, that.loc)) {


Would it be odd to write each equals as a single return statement?
return Objects.equals(this.loc, that.loc)
    && Objects.equals(this.caption, that.caption)
    ...

MichealKum · 2017-06-21T06:58:25Z

src/main/java/crawlercommons/sitemaps/ImageAttributes.java

+
+    @Override
+    public int hashCode() {
+        int result = 37;


Use Objects.hash(Object...) method as well.
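Taken together, the two suggestions on equals and hashCode could look like the following sketch. The class is a reduced stand-in with only the loc, caption and license fields visible in the diffs above; it is not the actual ImageAttributes class.

```java
import java.util.Objects;

// Reduced stand-in for ImageAttributes, for illustration only.
final class ImageAttributesSketch {
    private final String loc;
    private final String caption;
    private final String license;

    ImageAttributesSketch(String loc, String caption, String license) {
        this.loc = loc;
        this.caption = caption;
        this.license = license;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ImageAttributesSketch)) {
            return false;
        }
        ImageAttributesSketch that = (ImageAttributesSketch) other;
        // Objects.equals is null-safe, so a single chained return is enough.
        return Objects.equals(loc, that.loc)
                && Objects.equals(caption, that.caption)
                && Objects.equals(license, that.license);
    }

    @Override
    public int hashCode() {
        // Null-safe, and uses the same fields as equals.
        return Objects.hash(loc, caption, license);
    }
}
```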

MichealKum · 2017-06-21T07:00:16Z

src/main/java/crawlercommons/sitemaps/ImageAttributes.java

+        result = 31 * result + (license == null ? 0 : license.hashCode());
+        return result;
+    }
+}


No newline at the end of the file. Is this covered by the code formatter? (The same applies to the .gitignore file.)

MichealKum · 2017-06-21T07:05:16Z

src/main/java/crawlercommons/sitemaps/LinkAttributes.java

+    }
+
+    public Map<String, String> getParams() {
+        return params;


Usually a map is a mutable object. This means the field could be changed from outside the instance, e.g. via linkAttr.getParams().clear(). How about implementing put and remove methods and returning an unmodifiableMap?
But maybe that is too much for such a simple class.
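A minimal sketch of this suggestion, using Collections.unmodifiableMap from the JDK. The class name and the putParam and removeParam methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Reduced stand-in for LinkAttributes, for illustration only.
final class LinkAttributesSketch {
    private final Map<String, String> params = new HashMap<>();

    // Mutation goes through the owning instance only.
    void putParam(String name, String value) {
        params.put(name, value);
    }

    String removeParam(String name) {
        return params.remove(name);
    }

    // Read-only view: callers see later changes but cannot modify the map;
    // mutators such as clear() throw UnsupportedOperationException.
    Map<String, String> getParams() {
        return Collections.unmodifiableMap(params);
    }
}
```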

MichealKum · 2017-06-21T07:06:05Z

src/main/java/crawlercommons/sitemaps/LinkAttributes.java

+    }
+
+    @Override
+    public boolean equals(Object other) {


See my comments on ImageAttributes regarding the equals and hashCode implementations.

MichealKum · 2017-06-21T07:09:35Z

src/main/java/crawlercommons/sitemaps/SiteMapExtensionParser.java

+ * Moreover, only Google' video, images, links and news extensions are supported.
+ */
+public class SiteMapExtensionParser {
+    private final static Logger LOGGER = LoggerFactory.getLogger("sitemaps.parser.extension");


Use the SiteMapExtensionParser class itself to create the logger, and rename the field to LOG. See the existing code for examples.

- update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)

jnioche · 2017-12-07T21:10:43Z

Closing as progress on this has stalled and DOM parser has been removed.

- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves crawler-commons#35)
  - Google image sitemaps (resolves crawler-commons#36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves crawler-commons#149)
- the code is taken from Tanguy Moal's (@tuxnco) PR crawler-commons#162 with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes

tuxnco added 3 commits April 25, 2017 21:35

Specify locale when invoking String.lowercase or String.format

3a8eeff

Specify locale when getting Calendar default instance in tests

cd128ae

Specify timezone when getting calendar default instance

e21228b

Fix NPE around stock tickers spotted by @sebastian-nagel

36eef6e

Add checks around more attributes

10076db

tuxnco added 2 commits April 30, 2017 17:38

href property of a Link attribute is an URI not an URL

9d743a7

Change visibility of VideoPrice from package-private to public

5281a6e

tuxnco force-pushed the master branch from ea14a0c to 5281a6e Compare May 1, 2017 11:21

sebastian-nagel reviewed May 4, 2017

Fix HashMap init with initial capacity that could have been negative

9acf3cf

MichealKum reviewed Aug 21, 2017

sebastian-nagel mentioned this pull request Oct 10, 2017

Remove DOM-based sitemap parser? #177

Closed

jnioche closed this Dec 7, 2017

sebastian-nagel mentioned this pull request Mar 21, 2018

Full support for sitemap extensions and namespaces commoncrawl/news-crawl#25

Closed

sebastian-nagel mentioned this pull request Sep 28, 2018

Support sitemap extensions #218

Merged
