Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sitemaps] support some extensions to the sitemaps protocol #162

Closed
wants to merge 9 commits into from

Conversation

tuxnco
Copy link

@tuxnco tuxnco commented Apr 25, 2017

Add support for the alternate links extension (#149), image extension (#36), video extension (#35) as well as news extension.

  • add .idea to git ignored files
  • add data model for sitemap's video extension
  • add data model for sitemap's news extension
  • add data model for sitemap's link (xhtml) extension
  • add data model for sitemap's image extension
  • add videos, news, images and links attributes to SiteMapURL
  • add parsing for video extension
  • add UT for Video extension, implement equality methods
  • add parsing of image sitemaps, unit tests and sample sitemap test resource
  • plug support for XHTML Links extraction from sitemap
  • add parser for news sitemaps, with some UT
  • add missing license, cleanup
  • add missing licenses, updated javadoc
  • implement hashCode where required, improve equals, cleanup
  • VideoAttributes.VideoPrice: fix equals and implement hashCode
  • add validation of media sitemaps
  • add dependency on org.apache.commons:commons-lang3@3.5

* add .idea to git ignored files
* add data model for sitemap's video extension
* add data model for sitemap's news extension
* add data model for sitemap's link (xhtml) extension
* add data model for sitemap's image extension
* add videos, news, images and links attributes to SiteMapURL
* add parsing for video extension
* add UT for Video extension, implement equality methods
* add parsing of image sitemaps, unit tests and sample sitemap test resource
* plug support for XHTML Links extraction from sitemap
* add parser for news sitemaps, with some UT
* add missing license, cleanup
* add missing licenses, updated javadoc
* implement hashCode where required, improve equals, cleanup
* VideoAttributes.VideoPrice: fix equals and implement hashCode
* add validation of media sitemaps
* add dependency on org.apache.commons:commons-lang3@3.5
@lewismc
Copy link
Member

lewismc commented Apr 25, 2017

Hi @tuxnco the build is currently failing due to the following issues

[ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapExtensionParser (SiteMapExtensionParser.java:249)

[ERROR] Forbidden method invocation: java.lang.String#toLowerCase() [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapExtensionParser (SiteMapExtensionParser.java:265)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:817)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:823)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:831)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:846)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:854)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.SiteMapParser (SiteMapParser.java:862)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.VideoAttributes (VideoAttributes.java:444)

[ERROR] Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]

[ERROR]   in crawlercommons.sitemaps.VideoAttributes$VideoPrice (VideoAttributes.java:391)

[ERROR] Scanned 43 (and 321 related) class file(s) for forbidden API invocations (in 0.20s), 10 error(s).

Can you please push a small commit to fix these issues? They are triggered by the forbidden API checker plugin.

@tuxnco
Copy link
Author

tuxnco commented Apr 25, 2017

Hi @lewismc, thank you very much for your quick look at this!
I fixed the issues spotted by the forbidden API checker plugin, let me know if you want to discuss this PR more in depth, comments are welcome!

For example, in order to provide a portable way to extract extended attributes from sitemaps (e.g. from <video_namespace:video> or <news_namespace:news> nodes) I had to turn on XML namespace awareness at the DOM DocumentBuilderFactory level, see there : https://github.com/cogniteev/crawler-commons/blob/e21228bd8d4da7efe7ed7d80442d55693f87e84a/src/main/java/crawlercommons/sitemaps/SiteMapParser.java#L360 )

@lewismc
Copy link
Member

lewismc commented Apr 25, 2017

Yes I'll certainly review, this is a nice patch.

@sebastian-nagel
Copy link
Contributor

Hi, thanks looks promising...

  • I've started to test the PR on a set of sitemaps (see here). I'll continue testing later but got stuck at this NPE:
% java ... crawlercommons.sitemaps.SiteMapTester https://shinpaideshou.wordpress.com/news-sitemap.xml
17/04/25 22:19:20 INFO sitemaps.SiteMapTester:64 - Parsing https://shinpaideshou.wordpress.com/news-sitemap.xml  using DOM parser
Exception in thread "main" java.lang.NullPointerException
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseNewsNode(SiteMapExtensionParser.java:134)
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseNews(SiteMapExtensionParser.java:91)
        at crawlercommons.sitemaps.SiteMapParser.parseXmlSitemap(SiteMapParser.java:445)
  • with Optional SAX Parser for Sitemaps #150 we have an alternative SAX-based parser which was aimed as a replacement for the DOM parser. Porting this PR to the SAX parser looks a lot of work, but maybe we now have to accept two competing implementations: a feature-rich DOM and a robust and fast SAX parser.
  • the {Image,Video,News,Link}Attributes classes share fields. Just as a suggestion: would it help to organize the classes by inheritance? If they implement an Attribute interface we could also hold them in a single attribute list.

@tuxnco
Copy link
Author

tuxnco commented Apr 25, 2017

@sebastian-nagel Thanks a lot for your feedback. Indeed the lack of "keywords" is actually well handled, only the stock_tickers attribute was impacted by the NPE issue here.

Regarding your previous comment, I agree with the second point. I clearly think there exists two different use cases:

  1. find as much links as possible as fast as possible under possibly difficult circumstances, which is what the SAX-based parser aims to do (failsafe)
  2. feature-rich DOM-based parser, aimed at providing compliance checking tools and extract as much metadata as possible, aware of the underlying objects it is handling...

These last words drive me the third and last point. I reckon the {Image,Video,News,Link}Attributes classes share fields. But an <image:title> does not bear the exact same semantic of a <video:title> (the later is required while the first is not, as per the "specs" if we can say).
Which is why I introduced no inheritance in the first place.
One possible refactoring could be as follow : Having an ExtensionParser abstract class, responsible for handling a given extension, that would add ExtensionMetada (TBD) descendants to the originating SiteMapURL instance. It could for example have the helpers used to extract DOM nodes and attributes.
That could be extended by others willing to provide implementation for other protocol extension.
However, I can not tell much about what the contract of ExtensionMetadata should look like, as this seems pretty much tied to what the extension provides. May be at least equals, hashCode and isValid ?

@sebastian-nagel
Copy link
Contributor

@tuxnco never mind about the "keywords" (I didn't look at the code), but looks like there are definitely some more attributes to check:

17/04/26 00:05:07 INFO sitemaps.SiteMapTester:64 - Parsing http://www.hebdenbridgetimes.co.uk/sitemap-article-2015-18.xml  using DOM parser
Exception in thread "main" java.lang.NullPointerException
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseVideoNode(SiteMapExtensionParser.java:206)
        at crawlercommons.sitemaps.SiteMapExtensionParser.parseVideos(SiteMapExtensionParser.java:81)
        at crawlercommons.sitemaps.SiteMapParser.parseXmlSitemap(SiteMapParser.java:443)

I'll answer tomorrow about the refactoring...

@tuxnco
Copy link
Author

tuxnco commented Apr 25, 2017

Thanks again Sebastian, let's continue the discussion tomorrow.
BTW, your commonscrawl-seed sitemaps dataset looks like a great stress test, any chance for me to access it ?
I looked around there : https://commoncrawl.s3.amazonaws.com/ but I think that is not the same bucket.

@sebastian-nagel
Copy link
Contributor

It's only used internally, not a officially released data set. Sitemap URLs can be easily mined from the robots.txt data set e.g. by running https://github.com/commoncrawl/cc-mrjob/blob/master/sitemaps_from_robotstxt.py. But I'll send you the location of the test data.

I like your draft with ExtensionMetadata and ExtensionParser.

NamedNodeMap attributes = elem.getAttributes();
URI href = null;
Map<String, String> params = new HashMap<>(attributes.getLength()-1);
if (attributes != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swap with preceding line? attributes.getLength() may already cause a NPE.

@sebastian-nagel
Copy link
Contributor

Hi @tuxnco, great! I can confirm that the latest version (5281a6e) is robust and does not fail with uncaught exceptions on any of the sitemaps contained in sitemap-test-2017-03-03.warc.gz and sitemap-test-2017-03-04.warc.gz.

Anyone to continue the review? Shall we improve the way attributes are added and accessed by inheriting from ExtensionMetadata and ExtensionParser as proposed?

@jnioche
Copy link
Contributor

jnioche commented May 5, 2017

Thanks for this contrib @tuxnco and making concrete progress on a long-debated functionality.

Isn't there a way we could have a generic way of storing the additional data, e.g. a Map<String, Object>? I am not super comfortable with having extension-specific code in the core classes e.g. SiteMapURL. We could have extension specific code to simplify the access to the info stored in the Map.

Or is it what you were suggesting with ExtensionMetadata and ExtensionParser?

return false;
}
ImageAttributes that = (ImageAttributes) other;
if (!Objects.equals(loc, that.loc)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be odd if write all equals as a single return:
return Objects.equals(this.loc, that.loc)
&& Objects.equals(this.caption, that.caption)
...


@Override
public int hashCode() {
int result = 37;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Objects.hash(Object...) method as well.

result = 31 * result + (license == null ? 0 : license.hashCode());
return result;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new line at the end of the file. Does it covered by code formatter? (the same for .gitignore file)

}

public Map<String, String> getParams() {
return params;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually a map is a mutable object. It means that the field could be changed outside of the instance, e.g. linkAttr.getParams().clean(). How about implement put and remove methods and return unmodifiebleMap?
But maybe it is to much for such a simple class.

}

@Override
public boolean equals(Object other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments for ImageAttirbutes regarding equals and hash implementation.

* Moreover, only Google' video, images, links and news extensions are supported.
*/
public class SiteMapExtensionParser {
private final static Logger LOGGER = LoggerFactory.getLogger("sitemaps.parser.extension");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Class instance of SiteMapExtensionParser to create the logger. And rename it to LOG. See existing code for examples.

sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this pull request Oct 17, 2017
- update public suffix list to recent version of
      https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid
      effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
@jnioche
Copy link
Contributor

jnioche commented Dec 7, 2017

Closing as progress on this has stalled and DOM parser has been removed.

@jnioche jnioche closed this Dec 7, 2017
sebastian-nagel added a commit to sebastian-nagel/crawler-commons that referenced this pull request Sep 28, 2018
- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves crawler-commons#35)
  - Google image sitemaps (resolves crawler-commons#36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves crawler-commons#149)
- the code is taken from Tanguy Moal's (@tuxnco) PR crawler-commons#162
  with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants