-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: NUTCH-1129 microdata for Nutch 1.x #205
WIP: NUTCH-1129 microdata for Nutch 1.x #205
Conversation
da44932
to
70741f7
Compare
Hi @thilohaas this patch is too large for us to merge into Nutch master branch... |
Will try this patch while waiting for it to be merged into the official repo... thanks, man! :) |
Hi @simoncpu , there is no way we can merge this code into master branch of Nutch... it is simply too much of a change. |
a2ca570
to
a8533dd
Compare
c6e7b61
to
0a3d5cb
Compare
Sorry, I didn't accidentally added changes from another local test-branch. Should be cleaned up now and only contain any23 plugin relevant changes. |
0a3d5cb
to
bb92ab7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider my comments @thilohaas thank you for updating the patch...
Metadata metadata = parse.getData().getParseMeta(); | ||
|
||
for (String triple : triples) { | ||
sb.append(triple); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we discussed and agreed that this was not an optional solution for associating triples with the Metadata. I still agree with that. We need to think of a more efficient manner for persisting triples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being, I would move forward with implementing this. We can work on improving an appropriate storage mechanisms/location for Any23 extractions later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lewismc taking over this PR, going to adapt the format of the triplets to an array of objects if that's ok.
(example data taken from http://dbpedia.org/data/Z%C3%BCrich.ntriples)
triples: [
{
key: 'http://www.w3.org/2002/07/owl#sameAs',
short_key: 'sameAs', // for convenience
value: http://rdf.freebase.com/ns/m.08966'',
},
{
key: 'http://dbpedia.org/property/yearHumidity',
short_key: 'yearHumidity',
value: '77',
}
]
* uses the <a href="http://any23.apache.org">Apache Any23</a> library | ||
* for parsing and extracting structured data in RDF format from a | ||
* variety of Web documents. Currently it supports the following | ||
* input formats:</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest the comment, including a list of the supported formats is not really necessary. You can just link back to the any23.apache.org homepage for a list of supported formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the list. Just link to the Any23 Webpage.
I tried building using the updated patch but got this:
|
@simoncpu this may be intermittent... please report back here if it does not resolve itself. I am aware that this SNAPSHOT dependency has given us problems in the past. We may need to push a fix somewhere in Any23 e.g. upgrade the commons-csv library. |
@lewismc It still didn't work, so I just grabbed the jar file at: http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/. :) |
OK this is an issue. The solution is to address https://issues.apache.org/jira/browse/ANY23-264 |
@thilohaas I tested this on a website with Microdata, but it can't index anything... EDIT: The error is: |
@thilohaas can you consider the comments above please? @simoncpu thank you for trying out the patch... please keep providing feedback. Did you manage to debug the source of the ParseException? The URL you provide is not actually available... have you tried it on anything else? An example would be https://www.w3.org |
@lewismc Here's one of the URLs that I've tried: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/ BTW, the previous patch was able to parse the Microdata without problems. :) EDIT, here's the full output:
|
I get a parser error using the Any23 Webservice |
Sadly I'm currently too busy, but will definitely look into it as soon as possible. btw the any23 webservice seems to be broken, as it's failing on all websites I've tried. For example google as well: http://any23.org/any23/?format=best&uri=https%3A%2F%2Fgoogle.com&validation-mode=none |
Is it planned to have this patch available also in the Nutch 2.x branch? |
@hostingnuggets I don't see why not. If you feel like submitting a PR then I will review it. |
@lewismc sorry I am no Java dev here but I would nevertheless like to help if you can assist me here a bit. Do I understand correctly that the first step for that would be to take the 16 files which @thilohaas modified (total of two commits) and apply them to the master branch of nutch, see if it works and if yes create a new branch and submit a pull request? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smartive please take a look at my comments.
build.xml
Outdated
@@ -1031,7 +1031,8 @@ | |||
|
|||
<source path="${basedir}/src/java/" /> | |||
<source path="${basedir}/src/test/" output="build/test/classes" /> | |||
|
|||
<source path="${plugins.dir}/any23/src/java/" /> | |||
<source path="${plugins.dir}/any23/src/test/" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add entries for the javadoc
and eclipse
targets as well.
Additionally, please add entries to default.properties
as appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ivy/ivy.xml
Outdated
@@ -66,6 +66,7 @@ | |||
<!-- End of Hadoop Dependencies --> | |||
|
|||
<dependency org="org.apache.tika" name="tika-core" rev="1.17" /> | |||
<dependency org="org.apache.any23" name="apache-any23-core" rev="2.0"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upgrade this to 2.1 and correct formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,8 @@ | |||
1. Upgrade Any23 dependency in trunk/ivy/ivy.xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can go ahead and upgrade this when you make the update to the any23 2.1 dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import java.io.IOException; | ||
import java.net.URISyntaxException; | ||
import java.nio.charset.Charset; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove thi wildcard and use explicit imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import org.apache.any23.writer.TripleHandlerException; | ||
import org.apache.hadoop.conf.Configuration; | ||
import org.apache.nutch.metadata.Metadata; | ||
import org.apache.nutch.parse.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use explicit imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* uses the <a href="http://any23.apache.org">Apache Any23</a> library | ||
* for parsing and extracting structured data in RDF format from a | ||
* variety of Web documents. Currently it supports the following | ||
* input formats:</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the list. Just link to the Any23 Webpage.
private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException { | ||
Any23 any23 = new Any23(); | ||
ByteArrayOutputStream baos = new ByteArrayOutputStream(); | ||
TripleHandler tHandler = new NTriplesWriter(baos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write the triples as Turtle, it is easier to read and hence debug if this breaks in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore the comment.
Metadata metadata = parse.getData().getParseMeta(); | ||
|
||
for (String triple : triples) { | ||
sb.append(triple); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being, I would move forward with implementing this. We can work on improving an appropriate storage mechanisms/location for Any23 extractions later on.
* limitations under the License. | ||
*/ | ||
/** | ||
* @author lewismc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove lewismc and add comment similar to that present within Any23ParseFilter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*/ | ||
package org.apache.nutch.any23; | ||
|
||
import static org.junit.Assert.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove wildcard and use explicit imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments.
ivy/ivy.xml
Outdated
@@ -66,6 +66,13 @@ | |||
<!-- End of Hadoop Dependencies --> | |||
|
|||
<dependency org="org.apache.tika" name="tika-core" rev="1.17" /> | |||
<dependency org="net.sourceforge.owlapi" name="owlapi-api" rev="5.1.0" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dependencies need to go into the plugin ivy.xml. They do not belong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/plugin/parse-tika/ivy.xml
Outdated
@@ -44,6 +44,10 @@ | |||
<exclude org="org.slf4j" name="slf4j-api" /> | |||
<exclude org="commons-lang" name="commons-lang" /> | |||
<exclude org="com.google.protobuf" name="protobuf-java" /> | |||
<exclude org="org.apache.poi" name="poi" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these being excluded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent @nmaro |
@lewismc quick question: what is your favorite way to run single plugin tests? We are trying with |
@nmaro I run them in Eclipse, by the way, I have been working hard on Any23 to improve microdata extraction and a bunch of other stuff. We will be releasing Any23 2.2 reasonably soon so we can make the Any23 upgrade here in Nutch as well. |
@lewismc ok. I added an ant command that allows one to run single plugin tests like |
@nmaro yes please submit a PR, thanks |
@lewismc Requested changes done - please note that
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks overall much much better. We are nearly getting there :)
Can you please address the comments I've made.
ivy/ivysettings.xml
Outdated
<property name="any23.svn.apache.org" | ||
value="http://svn.apache.org/repos/asf/any23/repo-ext/" | ||
override="false"/> | ||
<property name="ebi.ac.uk" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed... is there any compelling reason to have it included?
ivy/ivysettings.xml
Outdated
@@ -47,6 +53,18 @@ | |||
pattern="${maven2.pattern.ext}" | |||
m2compatible="true" | |||
/> | |||
<ibiblio name="ebi" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed... is there any compelling reason to have it included?
ivy/ivysettings.xml
Outdated
@@ -64,6 +82,8 @@ | |||
<resolver ref="maven2"/> | |||
<resolver ref="apache-snapshot"/> | |||
<resolver ref="sonatype"/> | |||
<resolver ref="any23-svn-repository"/> | |||
<resolver ref="ebi"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed... is there any compelling reason to have it included?
ivy/ivysettings.xml
Outdated
@@ -75,6 +95,8 @@ | |||
<resolver ref="maven2"/> | |||
<resolver ref="apache-snapshot"/> | |||
<resolver ref="sonatype"/> | |||
<resolver ref="any23-svn-repository"/> | |||
<resolver ref="ebi"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed... is there any compelling reason to have it included?
src/bin/crawl
Outdated
# -sm Path to sitemap URL file(s) | ||
# CrawlDir Directory where the crawl/link/segments dirs are saved | ||
# NumRounds The number of rounds to run this crawl for | ||
# Usage: crawl [options] <crawl_dir> <num_rounds> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, i don't think that this content should be included in this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/plugin/any23/ivy.xml
Outdated
|
||
<dependencies> | ||
<dependency org="org.semanticweb.owlapi" name="owlapi" rev="3.2.4" conf="*->default"/> | ||
<dependency org="org.apache.commons" name="commons-rdf-api" rev="0.5.0" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is both commons-rdf-api and owlapi included here?
*/ | ||
package org.apache.nutch.any23; | ||
|
||
import java.io.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use explicit imports
any23.setMIMETypeDetector(null); | ||
|
||
try { | ||
// Fix input to avoid extraction error (https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this code should be ported over to Any23's BaseRDFParser implementations... i think the RDFa11 parser is an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to ignore this within the scope of this merge request as we already invested many hours into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will happily do it over at Any23 in the future.
*/ | ||
public final static String ANY23_TRIPLES = "Any23-Triples"; | ||
|
||
public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to introduce a new option we need to actually include it in nutch-default.xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Thank you @mfeltscher |
@lewismc all comments addressed. |
On a Mac with jdk 8 installed, I ran into failure on the javadoc task complaining about the java version. Upon deeper inspection I determined the failure condition was tripping up on ant.java.version equals 1.6 - running Ant -v and it said my are in the JAVA_HOME (jdk 8) is 1.6! Super strange... I removed the ant.java.version checks in java doc task and reran... ant zip-bin with java 8 finished successfully!! However, the reason I'm posting here is, I noticed 19 errors and 106 warnings in the java doc task. Here is the first few errors it encountered: [javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:30: error: package org.apache.any23 does not exist |
Hi @ferrerod thank you for posting. I think this merely has to do with the Javadoc links not being available |
@nmaro @lewismc I'm new to nutch and stumbled upon this merged PR after realizing I wanted much better crawling results of sites leveraging schema.org. This is likely the wrong place to pose my questions, but it's the first exposure to Any23 within Nutch 1.x. Do please redirect me to a better place to post questions. And,...I may not even be asking the right questions but here goes: Q: How do I gain Any23 microdata parsing / indexing capabilities introduced with this PR? Do I replace How do I expose the discovered microdata items to end-user such as Solr? For example, what are the microdata items and how should I map them to Solr in |
Hi @ferrerod this is a good question and I would like to answer it on the mailing list. Please ask it on user@ |
Thank you Lewis, I just sent an email: |
No description provided.