[Sitemaps] support some extensions to the sitemaps protocol #162
* add .idea to git ignored files
* add data model for sitemap's video extension
* add data model for sitemap's news extension
* add data model for sitemap's link (xhtml) extension
* add data model for sitemap's image extension
* add videos, news, images and links attributes to SiteMapURL
* add parsing for video extension
* add UT for Video extension, implement equality methods
* add parsing of image sitemaps, unit tests and sample sitemap test resource
* plug support for XHTML Links extraction from sitemap
* add parser for news sitemaps, with some UT
* add missing license, cleanup
* add missing licenses, updated javadoc
* implement hashCode where required, improve equals, cleanup
* VideoAttributes.VideoPrice: fix equals and implement hashCode
* add validation of media sitemaps
* add dependency on org.apache.commons:commons-lang3@3.5
Hi @tuxnco the build is currently failing due to the following issues
Can you please push a small commit to fix these issues? They are triggered by the forbidden API checker plugin.
Hi @lewismc, thank you very much for your quick look at this! For example, in order to provide a portable way to extract extended attributes from sitemaps (e.g. from
Yes, I'll certainly review; this is a nice patch.
Hi, thanks, this looks promising...
@sebastian-nagel Thanks a lot for your feedback. Indeed the lack of "keywords" is actually well handled, only the Regarding your previous comment, I agree with the second point. I clearly think there exist two different use cases:
These last words bring me to the third and last point. I reckon the {Image,Video,News,Link}Attributes classes share fields. But an
@tuxnco never mind about the "keywords" (I didn't look at the code), but it looks like there are definitely some more attributes to check:
I'll answer tomorrow about the refactoring...
Thanks again Sebastian, let's continue the discussion tomorrow.
It's only used internally, not an officially released data set. Sitemap URLs can easily be mined from the robots.txt data set, e.g. by running https://github.com/commoncrawl/cc-mrjob/blob/master/sitemaps_from_robotstxt.py. But I'll send you the location of the test data. I like your draft with
NamedNodeMap attributes = elem.getAttributes();
URI href = null;
Map<String, String> params = new HashMap<>(attributes.getLength()-1);
if (attributes != null) {
Swap with the preceding line? attributes.getLength() may already cause an NPE.
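A minimal sketch of the reordering suggested above, assuming the attributes come from an org.w3c.dom.Node (whose getAttributes() returns null for anything that is not an element). The class and method names (LinkParamsSketch, extractParams, parseRoot) are made up for illustration and are not taken from the patch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class LinkParamsSketch {

    /** Collects every attribute except href; the null check comes before getLength(). */
    static Map<String, String> extractParams(Node node) {
        Map<String, String> params = new HashMap<>();
        NamedNodeMap attributes = (node == null) ? null : node.getAttributes();
        if (attributes == null) {
            // Non-element nodes (e.g. text nodes) have no attribute map at all.
            return params;
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attr = attributes.item(i);
            if (!"href".equals(attr.getNodeName())) {
                params.put(attr.getNodeName(), attr.getNodeValue());
            }
        }
        return params;
    }

    /** Hypothetical helper: parses an XML snippet and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Element link = parseRoot(
                "<link rel=\"alternate\" hreflang=\"de\" href=\"http://example.com/de\"/>");
        System.out.println(extractParams(link));
    }
}
```

Sizing the HashMap only after the null check avoids the dereference the comment points at; the initial capacity is dropped here because it is only an optimization.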
Hi @tuxnco, great! I can confirm that the latest version (5281a6e) is robust and does not fail with uncaught exceptions on any of the sitemaps contained in sitemap-test-2017-03-03.warc.gz and sitemap-test-2017-03-04.warc.gz. Would anyone like to continue the review? Shall we improve the way attributes are added and accessed by inheriting from
Thanks for this contribution @tuxnco, and for making concrete progress on a long-debated piece of functionality. Isn't there a generic way we could store the additional data, e.g. a Map<String, Object>? I am not super comfortable with having extension-specific code in the core classes, e.g. SiteMapURL. We could have extension-specific code to simplify access to the info stored in the Map. Or is that what you were suggesting with ExtensionMetadata and ExtensionParser?
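To make the idea of generic storage plus extension-specific accessors concrete, here is a rough sketch. Everything in it (ExtensionStoreSketch, the Extension enum, UrlEntry, imageLocations) is hypothetical and only illustrates the shape of the proposal; it is not the design from the patch or from the library:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class ExtensionStoreSketch {

    /** Hypothetical set of supported extensions. */
    enum Extension { NEWS, IMAGE, VIDEO, LINKS }

    /** Core URL entry: knows nothing about what a given extension stores. */
    static final class UrlEntry {
        private final String url;
        private final Map<Extension, List<Map<String, String>>> attributes =
                new EnumMap<>(Extension.class);

        UrlEntry(String url) {
            this.url = url;
        }

        String getUrl() {
            return url;
        }

        void addAttributes(Extension ext, Map<String, String> attrs) {
            attributes.computeIfAbsent(ext, k -> new ArrayList<>()).add(attrs);
        }

        List<Map<String, String>> getAttributes(Extension ext) {
            List<Map<String, String>> found =
                    attributes.getOrDefault(ext, Collections.emptyList());
            return Collections.unmodifiableList(found);
        }
    }

    /** Extension-specific helper kept outside the core class. */
    static List<String> imageLocations(UrlEntry entry) {
        List<String> locations = new ArrayList<>();
        for (Map<String, String> image : entry.getAttributes(Extension.IMAGE)) {
            String loc = image.get("loc");
            if (loc != null) {
                locations.add(loc);
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        UrlEntry entry = new UrlEntry("http://example.com/");
        entry.addAttributes(Extension.IMAGE,
                Collections.singletonMap("loc", "http://example.com/a.png"));
        System.out.println(imageLocations(entry));
    }
}
```

The trade-off is the usual one: the core class stays free of extension code, at the cost of stringly-typed keys that the helper methods have to know about.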
return false;
}
ImageAttributes that = (ImageAttributes) other;
if (!Objects.equals(loc, that.loc)) {
Would it be odd to write each equals as a single return:
return Objects.equals(this.loc, that.loc)
&& Objects.equals(this.caption, that.caption)
...
@Override
public int hashCode() {
int result = 37;
Use the Objects.hash(Object...) method here as well.
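For reference, a sketch of what the two suggestions (single-return equals, Objects.hash) could look like together. The field names loc, caption and license appear in the reviewed code; their String types and the class name ImageAttributesSketch are assumptions made for this example:

```java
import java.util.Objects;

final class ImageAttributesSketch {
    // Types are assumed; the real class may use URL or other types.
    private final String loc;
    private final String caption;
    private final String license;

    ImageAttributesSketch(String loc, String caption, String license) {
        this.loc = loc;
        this.caption = caption;
        this.license = license;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ImageAttributesSketch)) {
            return false;
        }
        ImageAttributesSketch that = (ImageAttributesSketch) other;
        // Objects.equals is null-safe, so no per-field null checks are needed.
        return Objects.equals(loc, that.loc)
                && Objects.equals(caption, that.caption)
                && Objects.equals(license, that.license);
    }

    @Override
    public int hashCode() {
        // Must use the same fields as equals to keep the contract consistent.
        return Objects.hash(loc, caption, license);
    }

    public static void main(String[] args) {
        ImageAttributesSketch a = new ImageAttributesSketch("http://example.com/a.png", null, null);
        ImageAttributesSketch b = new ImageAttributesSketch("http://example.com/a.png", null, null);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

Objects.hash allocates a varargs array on each call, which is normally irrelevant for small value classes like this one.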
result = 31 * result + (license == null ? 0 : license.hashCode());
return result;
}
}
No newline at the end of the file. Is that covered by the code formatter? (The same applies to the .gitignore file.)
}

public Map<String, String> getParams() {
return params;
Usually a map is a mutable object. This means the field could be changed from outside the instance, e.g. linkAttr.getParams().clear(). How about implementing put and remove methods and returning an unmodifiableMap?
But maybe that is too much for such a simple class.
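A small sketch of that suggestion, using Collections.unmodifiableMap from the JDK. The class name LinkAttributesSketch and the method names putParam and removeParam are placeholders chosen for this example:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class LinkAttributesSketch {
    private final Map<String, String> params = new HashMap<>();

    /** Mutation goes through the owning instance only. */
    void putParam(String name, String value) {
        params.put(name, value);
    }

    String removeParam(String name) {
        return params.remove(name);
    }

    /** Read-only view: callers see later changes but cannot modify the map. */
    Map<String, String> getParams() {
        return Collections.unmodifiableMap(params);
    }

    public static void main(String[] args) {
        LinkAttributesSketch linkAttr = new LinkAttributesSketch();
        linkAttr.putParam("hreflang", "de");
        try {
            linkAttr.getParams().clear();
        } catch (UnsupportedOperationException e) {
            System.out.println("getParams() is read-only");
        }
    }
}
```

The returned map is a live view rather than a copy, so it is cheap to hand out; callers who need a snapshot can copy it themselves.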
}

@Override
public boolean equals(Object other) {
See my comments on ImageAttributes regarding the equals and hashCode implementations.
* Moreover, only Google' video, images, links and news extensions are supported.
*/
public class SiteMapExtensionParser {
private final static Logger LOGGER = LoggerFactory.getLogger("sitemaps.parser.extension");
Use the Class instance of SiteMapExtensionParser to create the logger, and rename it to LOG. See the existing code for examples.
- update public suffix list to recent version of https://publicsuffix.org/list/public_suffix_list.dat
- add method flag to force a check whether the domain has a valid effective TLD listed in the public suffix list
- fix mixed case hostnames (wWW.eXample.com)
Closing as progress on this has stalled and the DOM parser has been removed.
- optionally parse elements in the namespace of sitemap extensions:
  - Google video sitemaps (resolves crawler-commons#35)
  - Google image sitemaps (resolves crawler-commons#36)
  - Google news sitemaps
  - alternate links in sitemaps (resolves crawler-commons#149)
- the code is taken from Tanguy Moal's (@tuxnco) PR crawler-commons#162 with the following modifications:
  - port from DOM to SAX parser
  - keep specific extensions separate from the "core" sitemap classes
Add support for the alternate links extension (#149), the image extension (#36), the video extension (#35), and the news extension.