WIP: NUTCH-1129 microdata for Nutch 1.x #205

thilohaas · 2017-07-20T06:25:01Z

No description provided.

lewismc · 2017-07-26T15:23:27Z

Hi @thilohaas this patch is too large for us to merge into Nutch master branch...
Can you please separate out your code to implement Microdata support? We can then review that patch alone.

simoncpu · 2017-07-26T21:35:26Z

Will try this patch while waiting for it to be merged into the official repo... thanks, man! :)

lewismc · 2017-07-27T08:13:46Z

Hi @simoncpu , there is no way we can merge this code into master branch of Nutch... it is simply too much of a change.
This patch needs to be reduced in size to be considered.
Thank you for all contributions to Nutch. We welcome them all, but we need to make sure that the software is high quality and stable.

thilohaas · 2017-08-21T17:40:43Z

Sorry, I accidentally added changes from another local test branch. It should be cleaned up now and only contain changes relevant to the any23 plugin.

lewismc

Please consider my comments @thilohaas thank you for updating the patch...

lewismc · 2017-08-21T17:57:20Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+    Metadata metadata = parse.getData().getParseMeta();
+
+    for (String triple : triples) {
+      sb.append(triple);


Previously we discussed and agreed that this was not an optimal solution for associating triples with the Metadata. I still agree with that. We need to think of a more efficient manner for persisting triples.

For the time being, I would move forward with implementing this. We can work on improving the storage mechanism/location for Any23 extractions later on.

Hi @lewismc taking over this PR, going to adapt the format of the triplets to an array of objects if that's ok.

(example data taken from http://dbpedia.org/data/Z%C3%BCrich.ntriples)

triples: [
  {
    key: 'http://www.w3.org/2002/07/owl#sameAs',
    short_key: 'sameAs', // for convenience
    value: 'http://rdf.freebase.com/ns/m.08966',
  },
  {
    key: 'http://dbpedia.org/property/yearHumidity',
    short_key: 'yearHumidity',
    value: '77',
  }
]
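
For illustration, here is a minimal sketch of how one element of that array could be built from a predicate URI and a value. The class and method names (`TripleEntry`, `shortKey`, `toEntry`) are hypothetical and not part of the patch, and the rule for deriving `short_key` (the text after the last `#` or `/`) is an assumption inferred from the two examples above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch.
final class TripleEntry {

  // Assumption: short_key is whatever follows the last '#' or '/' in the predicate URI.
  // If neither character is present, the whole string is returned.
  static String shortKey(String predicateUri) {
    int cut = Math.max(predicateUri.lastIndexOf('#'), predicateUri.lastIndexOf('/'));
    return predicateUri.substring(cut + 1);
  }

  // Builds one element of the proposed "triples" array, keeping insertion order.
  static Map<String, String> toEntry(String predicateUri, String value) {
    Map<String, String> entry = new LinkedHashMap<>();
    entry.put("key", predicateUri);
    entry.put("short_key", shortKey(predicateUri));
    entry.put("value", value);
    return entry;
  }

  public static void main(String[] args) {
    System.out.println(toEntry("http://www.w3.org/2002/07/owl#sameAs",
        "http://rdf.freebase.com/ns/m.08966"));
    System.out.println(toEntry("http://dbpedia.org/property/yearHumidity", "77"));
  }
}
```

Keeping the full URI in `key` avoids collisions between vocabularies that share a local name, while `short_key` stays available for convenient querying.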

lewismc · 2017-08-21T17:58:29Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:</p>


To be honest, the list of supported formats in the comment is not really necessary. You can just link back to the any23.apache.org homepage for a list of supported formats.

Please remove the list. Just link to the Any23 Webpage.

simoncpu · 2017-08-30T20:11:09Z

I tried building using the updated patch but got this:

[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] WARN:     ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] WARN:     :: org.apache.commons#commons-csv;1.0-SNAPSHOT-rev1148315: not found
[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::

lewismc · 2017-08-30T20:23:15Z

@simoncpu this may be intermittent... please report back here if it does not resolve itself. I am aware that this SNAPSHOT dependency has given us problems in the past. We may need to push a fix somewhere in Any23 e.g. upgrade the commons-csv library.

simoncpu · 2017-08-31T14:10:25Z

@lewismc It still didn't work, so I just grabbed the jar file at: http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/. :)

lewismc · 2017-08-31T14:18:32Z

OK this is an issue. The solution is to address https://issues.apache.org/jira/browse/ANY23-264

simoncpu · 2017-09-05T03:34:04Z

@thilohaas I tested this on a website with Microdata, but it can't index anything...

EDIT: The error is:
Error parsing: http://example.org/website-with-microdata: org.apache.nutch.parse.ParseException: Unable to successfully parse content

lewismc · 2017-09-05T16:26:03Z

@thilohaas can you consider the comments above please?

@simoncpu thank you for trying out the patch... please keep providing feedback. Did you manage to debug the source of the ParseException? The URL you provide is not actually available... have you tried it on anything else? An example would be https://www.w3.org

simoncpu · 2017-09-05T17:22:44Z

@lewismc Here's one of the URLs that I've tried:

http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/

BTW, the previous patch was able to parse the Microdata without problems. :)

EDIT, here's the full output:

Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Thread FetcherThread has no more work available
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
fetcher.maxNum.threads can't be < than 50 : using 50 instead
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
Parsing : 20170905172529
/home/simoncpu/nutch/runtime/local/bin/nutch parse -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
ParseSegment: starting at 2017-09-05 17:25:45
ParseSegment: segment: crawl-dir/segments/20170905172529
Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
CrawlDB update
/home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl-dir/crawldb crawl-dir/segments/20170905172529
CrawlDb update: starting at 2017-09-05 17:25:53
CrawlDb update: db: crawl-dir/crawldb
CrawlDb update: segments: [crawl-dir/segments/20170905172529]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
Link inversion
/home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb crawl-dir/segments/20170905172529
LinkDb: starting at 2017-09-05 17:26:01
LinkDb: linkdb: crawl-dir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl-dir/segments/20170905172529
LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
Dedup on crawldb
/home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
DeduplicationJob: starting at 2017-09-05 17:26:07
Deduplication: 0 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
Indexing 20170905172529 to index
/home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb crawl-dir/linkdb crawl-dir/segments/20170905172529
Segment dir is complete: crawl-dir/segments/20170905172529.
Indexer: starting at 2017-09-05 17:26:17
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticRestIndexWriter
        elastic.rest.host : hostname
        elastic.rest.port : port
        elastic.rest.index : elastic index command
        elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
        elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


Indexer: number of documents indexed, deleted, or skipped:
Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
Cleaning up index if possible
/home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations

lewismc · 2017-09-05T18:15:17Z

I get a parser error using the Any23 Webservice

thilohaas · 2017-09-05T18:29:35Z

Sadly I'm currently too busy, but will definitely look into it as soon as possible.
Do you maybe have an idea of how to pass an array or hash of strings to the filter (see my comment on the PR)? So I would be able to simplify the process and come up with an alternative way of storing triples on the documents.

btw the any23 webservice seems to be broken, as it's failing on all websites I've tried. For example google as well: http://any23.org/any23/?format=best&uri=https%3A%2F%2Fgoogle.com&validation-mode=none

hostingnuggets · 2017-11-20T17:02:08Z

Is it planned to have this patch available also in the Nutch 2.x branch?

lewismc · 2017-12-02T17:07:18Z

@hostingnuggets I don't see why not. If you feel like submitting a PR then I will review it.

hostingnuggets · 2017-12-02T20:40:25Z

@lewismc sorry I am no Java dev here but I would nevertheless like to help if you can assist me here a bit. Do I understand correctly that the first step for that would be to take the 16 files which @thilohaas modified (total of two commits) and apply them to the master branch of nutch, see if it works and if yes create a new branch and submit a pull request?

lewismc

Thanks @smartive please take a look at my comments.

lewismc · 2017-12-20T08:46:43Z

build.xml

@@ -1031,7 +1031,8 @@

        <source path="${basedir}/src/java/" />
        <source path="${basedir}/src/test/" output="build/test/classes" />
-
+        <source path="${plugins.dir}/any23/src/java/" />
+        <source path="${plugins.dir}/any23/src/test/" />


Please add entries for the javadoc and eclipse targets as well.

Additionally, please add entries to default.properties as appropriate

lewismc · 2017-12-20T08:47:08Z

ivy/ivy.xml

@@ -66,6 +66,7 @@
 		<!-- End of Hadoop Dependencies -->

 		<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
+        <dependency org="org.apache.any23" name="apache-any23-core" rev="2.0"/>


Upgrade this to 2.1 and correct formatting.

lewismc · 2017-12-20T08:48:11Z

src/plugin/any23/howto_upgrade_any23.txt

@@ -0,0 +1,8 @@
+1. Upgrade Any23 dependency in trunk/ivy/ivy.xml


You can go ahead and upgrade this when you make the update to the any23 2.1 dependency

lewismc · 2017-12-20T08:49:03Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.charset.Charset;
+import java.util.*;


Please remove this wildcard and use explicit imports.
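
As a sketch of what is being asked for, the wildcard is replaced by one import per type actually used. The class below is hypothetical and the imported types are only examples; the real list depends on what Any23ParseFilter references.

```java
// Instead of: import java.util.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class, used only to show the import style.
final class ExplicitImportsExample {

  // Returns the input triples as a sorted copy, leaving the argument untouched.
  static List<String> sortedTriples(List<String> triples) {
    List<String> copy = new ArrayList<>(triples);
    Collections.sort(copy);
    return copy;
  }
}
```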

lewismc · 2017-12-20T08:49:16Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+import org.apache.any23.writer.TripleHandlerException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.parse.*;


Please use explicit imports.

lewismc · 2017-12-20T08:49:49Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+ * uses the <a href="http://any23.apache.org">Apache Any23</a> library
+ * for parsing and extracting structured data in RDF format from a
+ * variety of Web documents. Currently it supports the following
+ * input formats:</p>


Please remove the list. Just link to the Any23 Webpage.

lewismc · 2017-12-20T08:51:01Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+    private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
+      Any23 any23 = new Any23();
+      ByteArrayOutputStream baos = new ByteArrayOutputStream();
+      TripleHandler tHandler = new NTriplesWriter(baos);


Please write the triples as Turtle, it is easier to read and hence debug if this breaks in the future.

http://any23.apache.org/apidocs/index.html?org/apache/any23/writer/TurtleWriter.html

Ignore the comment.

lewismc · 2017-12-20T08:52:23Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+    Metadata metadata = parse.getData().getParseMeta();
+
+    for (String triple : triples) {
+      sb.append(triple);


For the time being, I would move forward with implementing this. We can work on improving the storage mechanism/location for Any23 extractions later on.

lewismc · 2017-12-20T08:53:33Z

src/plugin/any23/src/java/org/apache/nutch/any23/package-info.java

+ * limitations under the License.
+ */
+/**
+ * @author lewismc


Remove lewismc and add comment similar to that present within Any23ParseFilter

c866518 and 747d939

lewismc · 2017-12-20T08:53:54Z

src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23IndexingFilter.java

+ */
+package org.apache.nutch.any23;
+
+import static org.junit.Assert.*;


Remove wildcard and use explicit imports

lewismc

Please see my comments.

lewismc · 2018-01-03T20:07:04Z

ivy/ivy.xml

@@ -66,6 +66,13 @@
 		<!-- End of Hadoop Dependencies -->

 		<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
+		<dependency org="net.sourceforge.owlapi" name="owlapi-api" rev="5.1.0" />


The dependencies need to go into the plugin ivy.xml. They do not belong here.

lewismc · 2018-01-03T20:08:42Z

src/plugin/parse-tika/ivy.xml

@@ -44,6 +44,10 @@
      <exclude org="org.slf4j" name="slf4j-api" />
      <exclude org="commons-lang" name="commons-lang" />
      <exclude org="com.google.protobuf" name="protobuf-java" />
+      <exclude org="org.apache.poi" name="poi" />


Why are these being excluded?

lewismc · 2018-01-08T13:07:26Z

Excellent @nmaro

nickredmark · 2018-01-08T15:46:27Z

@lewismc quick question: what is your favorite way to run single plugin tests?

We are trying with ant compile-core-test && ant runtime && ./runtime/local/bin/nutch junit org.apache.nutch.parse.metatags.TestAny23ParseFilter but we are encountering some problems. Is this the recommended way?

lewismc · 2018-01-08T16:06:34Z

@nmaro I run them in Eclipse. By the way, I have been working hard on Any23 to improve microdata extraction and a bunch of other stuff. We will be releasing Any23 2.2 reasonably soon, so we can make the Any23 upgrade here in Nutch as well.

nickredmark · 2018-01-08T16:19:15Z

@lewismc OK. I added an ant command that allows one to run single plugin tests, like ant -Dplugin=any23 test-plugin (and it works). Can I check that in?

lewismc · 2018-01-08T20:47:02Z

@nmaro yes please submit a PR, thanks

nickredmark · 2018-01-10T10:49:26Z

@lewismc Requested changes done. Please note that:

- I had to extend the elastic http plugin to handle lists of Map objects that it previously just stringified.
- Any23 couldn't detect as many triples as you expected in your tests, so I had to lower the number. It's good enough for us for now; people can still expand the any23 scope if they find out what the problem is.
- Data is now indexed as follows (example after crawling https://smartive.ch/jobs):

          "structured_data": [
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"IE-edge,chrome=1\"@de",
              "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
              "short_key": "X-UA-Compatible"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
              "key": "<http://vocab.sindice.net/any23#description>",
              "short_key": "description"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width,initial-scale=1\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"ie=edge\"@de",
              "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
              "short_key": "x-ua-compatible"
            }
          ],
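
To illustrate the note about lists of Map objects, here is a minimal sketch of the difference between stringifying such a list and emitting it as nested JSON. This is not the code from the elastic plugin; `JsonSketch` and its methods are hypothetical, and only strings, maps and lists are handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the real plugin relies on its own serialization.
final class JsonSketch {

  // Renders maps and lists as JSON; anything else falls back to a quoted toString().
  static String toJson(Object value) {
    if (value instanceof Map) {
      StringBuilder sb = new StringBuilder("{");
      String sep = "";
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        sb.append(sep).append(quote(String.valueOf(e.getKey())))
          .append(':').append(toJson(e.getValue()));
        sep = ",";
      }
      return sb.append('}').toString();
    }
    if (value instanceof List) {
      StringBuilder sb = new StringBuilder("[");
      String sep = "";
      for (Object item : (List<?>) value) {
        sb.append(sep).append(toJson(item));
        sep = ",";
      }
      return sb.append(']').toString();
    }
    return quote(String.valueOf(value));
  }

  // Escapes backslashes and double quotes only; enough for this sketch.
  static String quote(String s) {
    return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
  }

  public static void main(String[] args) {
    Map<String, String> entry = new LinkedHashMap<>();
    entry.put("short_key", "viewport");
    entry.put("value", "width=device-width,initial-scale=1");
    List<Map<String, String>> data = Arrays.asList(entry);
    System.out.println(String.valueOf(data)); // flat, map-style text
    System.out.println(toJson(data));         // nested JSON objects
  }
}
```

With nested objects, each triple's `key`, `short_key` and `value` remain separately addressable in the index instead of being buried in one string field.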

lewismc

This looks overall much much better. We are nearly getting there :)
Can you please address the comments I've made.

lewismc · 2018-01-10T16:29:13Z

ivy/ivysettings.xml

+  <property name="any23.svn.apache.org"
+    value="http://svn.apache.org/repos/asf/any23/repo-ext/"
+    override="false"/>
+  <property name="ebi.ac.uk"


This can be removed... is there any compelling reason to have it included?

lewismc · 2018-01-10T16:29:23Z

ivy/ivysettings.xml

@@ -47,6 +53,18 @@
      pattern="${maven2.pattern.ext}"
      m2compatible="true"
      />
+    <ibiblio name="ebi"


This can be removed... is there any compelling reason to have it included?

lewismc · 2018-01-10T16:29:28Z

ivy/ivysettings.xml

@@ -64,6 +82,8 @@
      <resolver ref="maven2"/>
      <resolver ref="apache-snapshot"/>
      <resolver ref="sonatype"/>
+      <resolver ref="any23-svn-repository"/>
+      <resolver ref="ebi"/>


This can be removed... is there any compelling reason to have it included?

lewismc · 2018-01-10T16:29:33Z

ivy/ivysettings.xml

@@ -75,6 +95,8 @@
      <resolver ref="maven2"/>
      <resolver ref="apache-snapshot"/>
      <resolver ref="sonatype"/>
+      <resolver ref="any23-svn-repository"/>
+      <resolver ref="ebi"/>


This can be removed... is there any compelling reason to have it included?

lewismc · 2018-01-10T16:30:04Z

src/bin/crawl

-#    -sm             Path to sitemap URL file(s)
-#    CrawlDir        Directory where the crawl/link/segments dirs are saved
-#    NumRounds       The number of rounds to run this crawl for
+# Usage: crawl [options] <crawl_dir> <num_rounds>


Unfortunately, I don't think that this content should be included in this patch.

@lewismc I went ahead and merged master back into this feature branch so this is solved - see 1e2c848

lewismc · 2018-01-10T16:31:01Z

src/plugin/any23/ivy.xml

+
+  <dependencies>
+    <dependency org="org.semanticweb.owlapi" name="owlapi" rev="3.2.4" conf="*->default"/>
+    <dependency org="org.apache.commons" name="commons-rdf-api" rev="0.5.0" />


Why are both commons-rdf-api and owlapi included here?

lewismc · 2018-01-10T16:31:59Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+ */
+package org.apache.nutch.any23;
+
+import java.io.*;


Please use explicit imports

lewismc · 2018-01-10T16:34:00Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+      any23.setMIMETypeDetector(null);
+
+      try {
+        // Fix input to avoid extraction error (https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281)


Ideally this code should be ported over to Any23's BaseRDFParser implementations... I think the RDFa11 parser is an example.

I'm going to ignore this within the scope of this merge request as we already invested many hours into it.

I will happily do it over at Any23 in the future.

lewismc · 2018-01-10T16:35:39Z

src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java

+   */
+  public final static String ANY23_TRIPLES = "Any23-Triples";
+
+  public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";


If we are going to introduce a new option we need to actually include it in nutch-default.xml
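
A rough sketch of how a comma-separated `any23.extractors` value could be turned into extractor names, with a fallback when the property is unset. It uses `java.util.Properties` as a stand-in for the Hadoop `Configuration` the plugin actually receives; the class name, the method, and any default value passed in are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for reading the option; Nutch itself uses Hadoop's Configuration.
final class ExtractorConfigSketch {

  static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";

  // Splits the configured value on commas, trimming blanks and skipping empty items.
  // Assumes defaultValue is non-null; it is used when the property is not set.
  static List<String> extractorNames(Properties conf, String defaultValue) {
    String raw = conf.getProperty(ANY_23_EXTRACTORS_CONF, defaultValue);
    List<String> names = new ArrayList<>();
    for (String part : raw.split(",")) {
      String name = part.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }
}
```

Documenting the option in nutch-default.xml gives users a single place to see the property name, its default and what it controls.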

lewismc · 2018-01-10T18:12:20Z

Thank you @mfeltscher

nickredmark · 2018-01-11T09:30:31Z

@lewismc all comments addressed.

ferrerod · 2018-02-08T05:30:45Z

On a Mac with JDK 8 installed, I ran into a failure in the javadoc task complaining about the Java version. Upon deeper inspection I determined that the failure condition was tripping on ant.java.version being equal to 1.6: running ant -v, it said the JRE in my JAVA_HOME (JDK 8) is 1.6! Super strange...

I removed the ant.java.version checks in the javadoc task and reran...

ant zip-bin with Java 8 finished successfully!! However, the reason I'm posting here is that I noticed 19 errors and 106 warnings in the javadoc task. Here are the first few errors it encountered:

[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:30: error: package org.apache.any23 does not exist
[javadoc] import org.apache.any23.Any23;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:31: error: package org.apache.any23.extractor does not exist
[javadoc] import org.apache.any23.extractor.ExtractionException;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:32: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.BenchmarkTripleHandler;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:33: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.NTriplesWriter;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:34: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.TripleHandler;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:35: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.TripleHandlerException;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:43: error: package org.ccil.cowan.tagsoup does not exist
[javadoc] import org.ccil.cowan.tagsoup.XMLWriter;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:44: error: package org.ccil.cowan.tagsoup.jaxp does not exist
[javadoc] import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:87: error: cannot find symbol
[javadoc] Any23Parser(String url, String htmlContent, String contentType, String... extractorNames) throws TripleHandlerException {

lewismc · 2018-02-08T06:03:50Z

Hi @ferrerod thank you for posting. I think this merely has to do with the Javadoc links not being available
https://github.com/apache/nutch/blob/master/default.properties#L43-L51
If you are able to fix it, then please by all means open a PR and we can review. Thank you

ferrerod · 2018-02-08T06:50:12Z

@nmaro @lewismc I'm new to Nutch and stumbled upon this merged PR after realizing I wanted much better crawling results for sites leveraging schema.org. This is likely the wrong place to pose my questions, but it's my first exposure to Any23 within Nutch 1.x. Please do redirect me to a better place to post questions. I may not even be asking the right questions, but here goes:

How do I expose the discovered microdata items to end-user such as Solr? For example, what are the microdata items and how should I map them to Solr in solrindex-mapping.xml ?

lewismc · 2018-02-08T08:14:48Z

Hi @ferrerod this is a good question and I would like to answer it on the mailing list. Please ask it on user@
http://nutch.apache.org/mailing_lists.html

ferrerod · 2018-02-08T17:21:24Z

Thank you Lewis, I just sent an email:
https://www.mail-archive.com/user@nutch.apache.org/msg15974.html

thilohaas force-pushed the feature/NUTCH-1129-microdata branch from da44932 to 70741f7 · July 21, 2017 15:24

thilohaas force-pushed the feature/NUTCH-1129-microdata branch 2 times, most recently from a2ca570 to a8533dd · August 21, 2017 17:07

Add patch from NUTCH-1129

6a9d082

thilohaas force-pushed the feature/NUTCH-1129-microdata branch 2 times, most recently from c6e7b61 to 0a3d5cb · August 21, 2017 17:39

update to nutch 1.*

bb92ab7

thilohaas force-pushed the feature/NUTCH-1129-microdata branch from 0a3d5cb to bb92ab7 · August 21, 2017 17:42

lewismc requested changes Aug 21, 2017

lewismc mentioned this pull request Sep 5, 2017

Any23 elastic rest5 #220

Closed

Merge branch 'master' into feature/NUTCH-1129-microdata

b62c2c8

lewismc requested changes Dec 20, 2017

Dominique Wirz added 2 commits December 20, 2017 18:47

add any23 to javadoc and eclipse build targets

a1bcad5

update any23 dependency to 2.1

c380e2d

Merge branch 'master' into feature/NUTCH-1129-microdata

1a2f1b8

lewismc requested changes Jan 3, 2018

Dominique Wirz and others added 4 commits January 4, 2018 11:30

undo remove exclude from tika plugin

d0329a5

fix/move any23 ivy dependencies

f02fb23

fix input to avoid extraction error

2c7b230

Merge branch 'master' into feature/NUTCH-1129-microdata

b40eed4

nickredmark added 2 commits January 10, 2018 11:38

store structured_data as an array of objects

1607580

fix assert wildcard import

3d940d7

lewismc requested changes Jan 10, 2018

Merge branch 'master' into feature/NUTCH-1129-microdata

1e2c848

nickredmark added 5 commits January 11, 2018 09:54

add any23 config

811a623

remove unneeded dependencies

f9981d4

use explicit imports

e1ef4ab

remove other unneeded dependency

4d9b197

Merge branch 'feature/NUTCH-1129-microdata' of https://github.com/sma…

dd17b7b

…rtive/nutch into feature/NUTCH-1129-microdata

lewismc approved these changes Jan 11, 2018

lewismc merged commit f82959d into apache:master Jan 11, 2018
