
WIP: NUTCH-1129 microdata for Nutch 1.x #205

Merged
merged 25 commits into apache:master on Jan 11, 2018

Conversation

thilohaas

No description provided.

@thilohaas thilohaas force-pushed the feature/NUTCH-1129-microdata branch from da44932 to 70741f7 on July 21, 2017 15:24
@lewismc
Member

lewismc commented Jul 26, 2017

Hi @thilohaas this patch is too large for us to merge into Nutch master branch...
Can you please separate out your code to implement Microdata support? We can then review that patch alone.

@simoncpu

Will try this patch while waiting for it to be merged into the official repo... thanks, man! :)

@lewismc
Member

lewismc commented Jul 27, 2017

Hi @simoncpu , there is no way we can merge this code into master branch of Nutch... it is simply too much of a change.
This patch needs to be reduced in size to be considered.
Thank you for all contributions to Nutch; we welcome them all, but we need to make sure that the software is high quality and stable.

@thilohaas thilohaas force-pushed the feature/NUTCH-1129-microdata branch 2 times, most recently from a2ca570 to a8533dd on August 21, 2017 17:07
@thilohaas thilohaas force-pushed the feature/NUTCH-1129-microdata branch 2 times, most recently from c6e7b61 to 0a3d5cb on August 21, 2017 17:39
@thilohaas
Author

Sorry, I accidentally added changes from another local test-branch. It should be cleaned up now and only contain any23-plugin-relevant changes.

Member

@lewismc lewismc left a comment


Please consider my comments @thilohaas, and thank you for updating the patch...

Metadata metadata = parse.getData().getParseMeta();

for (String triple : triples) {
sb.append(triple);
Member

Previously we discussed and agreed that this was not an optimal solution for associating triples with the Metadata. I still agree with that. We need to think of a more efficient manner of persisting triples.

Member

For the time being, I would move forward with implementing this. We can work on an appropriate storage mechanism/location for Any23 extractions later on.

Contributor

Hi @lewismc, taking over this PR; going to adapt the format of the triples to an array of objects, if that's OK.

(example data taken from http://dbpedia.org/data/Z%C3%BCrich.ntriples)

triples: [
  {
    key: 'http://www.w3.org/2002/07/owl#sameAs',
    short_key: 'sameAs', // for convenience
    value: 'http://rdf.freebase.com/ns/m.08966',
  },
  {
    key: 'http://dbpedia.org/property/yearHumidity',
    short_key: 'yearHumidity',
    value: '77',
  }
]

* uses the <a href="http://any23.apache.org">Apache Any23</a> library
* for parsing and extracting structured data in RDF format from a
* variety of Web documents. Currently it supports the following
* input formats:</p>
Member

To be honest, the comment, including the list of supported formats, is not really necessary. You can just link back to the any23.apache.org homepage for a list of supported formats.

Member

Please remove the list. Just link to the Any23 Webpage.

@simoncpu

I tried building using the updated patch but got this:

[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] WARN:     ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] WARN:     :: org.apache.commons#commons-csv;1.0-SNAPSHOT-rev1148315: not found
[ivy:resolve] WARN:     ::::::::::::::::::::::::::::::::::::::::::::::

@lewismc
Member

lewismc commented Aug 30, 2017

@simoncpu this may be intermittent... please report back here if it does not resolve itself. I am aware that this SNAPSHOT dependency has given us problems in the past. We may need to push a fix somewhere in Any23 e.g. upgrade the commons-csv library.

@simoncpu

@lewismc It still didn't work, so I just grabbed the jar file at: http://svn.apache.org/repos/asf/any23/repo-ext/org/apache/commons/commons-csv/1.0-SNAPSHOT-rev1148315/. :)

@lewismc
Member

lewismc commented Aug 31, 2017

OK this is an issue. The solution is to address https://issues.apache.org/jira/browse/ANY23-264

@simoncpu

simoncpu commented Sep 5, 2017

@thilohaas I tested this on a website with Microdata, but it can't index anything...

EDIT: The error is:
Error parsing: http://example.org/website-with-microdata: org.apache.nutch.parse.ParseException: Unable to successfully parse content

@lewismc lewismc mentioned this pull request Sep 5, 2017
@lewismc
Member

lewismc commented Sep 5, 2017

@thilohaas can you consider the comments above please?

@simoncpu thank you for trying out the patch... please keep providing feedback. Did you manage to debug the source of the ParseException? The URL you provide is not actually available... have you tried it on anything else? An example would be https://www.w3.org

@simoncpu

simoncpu commented Sep 5, 2017

@lewismc Here's one of the URLs that I've tried:

http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/

BTW, the previous patch was able to parse the Microdata without problems. :)

EDIT, here's the full output:

Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Thread FetcherThread has no more work available
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
fetcher.maxNum.threads can't be < than 50 : using 50 instead
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
Fetcher: finished at 2017-09-05 17:25:43, elapsed: 00:00:08
Parsing : 20170905172529
/home/simoncpu/nutch/runtime/local/bin/nutch parse -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1 crawl-dir/segments/20170905172529
ParseSegment: starting at 2017-09-05 17:25:45
ParseSegment: segment: crawl-dir/segments/20170905172529
Error parsing: http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Parsed (225ms):http://mcdonalds.jobs/salt-lake-city-ut/general-manager/2947B6E7B04147FFBEE1445E66D7EA67/job/
ParseSegment: finished at 2017-09-05 17:25:51, elapsed: 00:00:06
CrawlDB update
/home/simoncpu/nutch/runtime/local/bin/nutch updatedb -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl-dir/crawldb crawl-dir/segments/20170905172529
CrawlDb update: starting at 2017-09-05 17:25:53
CrawlDb update: db: crawl-dir/crawldb
CrawlDb update: segments: [crawl-dir/segments/20170905172529]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2017-09-05 17:25:59, elapsed: 00:00:05
Link inversion
/home/simoncpu/nutch/runtime/local/bin/nutch invertlinks crawl-dir/linkdb crawl-dir/segments/20170905172529
LinkDb: starting at 2017-09-05 17:26:01
LinkDb: linkdb: crawl-dir/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl-dir/segments/20170905172529
LinkDb: finished at 2017-09-05 17:26:06, elapsed: 00:00:04
Dedup on crawldb
/home/simoncpu/nutch/runtime/local/bin/nutch dedup crawl-dir/crawldb
DeduplicationJob: starting at 2017-09-05 17:26:07
Deduplication: 0 documents marked as duplicates
Deduplication: Updating status of duplicate urls into crawl db.
Deduplication finished at 2017-09-05 17:26:15, elapsed: 00:00:07
Indexing 20170905172529 to index
/home/simoncpu/nutch/runtime/local/bin/nutch index crawl-dir/crawldb -linkdb crawl-dir/linkdb crawl-dir/segments/20170905172529
Segment dir is complete: crawl-dir/segments/20170905172529.
Indexer: starting at 2017-09-05 17:26:17
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticRestIndexWriter
        elastic.rest.host : hostname
        elastic.rest.port : port
        elastic.rest.index : elastic index command
        elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
        elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


Indexer: number of documents indexed, deleted, or skipped:
Indexer: finished at 2017-09-05 17:26:23, elapsed: 00:00:05
Cleaning up index if possible
/home/simoncpu/nutch/runtime/local/bin/nutch clean crawl-dir/crawldb
Wed Sep 6 01:26:28 DST 2017 : Finished loop with 1 iterations

@lewismc
Member

lewismc commented Sep 5, 2017

I get a parser error using the Any23 Webservice

@thilohaas
Author

Sadly I'm currently too busy, but will definitely look into it as soon as possible.
Do you maybe have an idea of how to pass an array or hash of strings to the filter (see my comment on the PR)? That way I could simplify the process and come up with an alternative way of storing triples on the documents.

btw the any23 webservice seems to be broken, as it's failing on all websites I've tried. For example google as well: http://any23.org/any23/?format=best&uri=https%3A%2F%2Fgoogle.com&validation-mode=none

@hostingnuggets

Is it planned to have this patch available also in the Nutch 2.x branch?

@lewismc
Member

lewismc commented Dec 2, 2017

@hostingnuggets I don't see why not. If you feel like submitting a PR then I will review it.

@hostingnuggets

@lewismc sorry, I am no Java dev, but I would nevertheless like to help if you can assist me a bit. Do I understand correctly that the first step would be to take the 16 files which @thilohaas modified (a total of two commits), apply them to the master branch of Nutch, see if it works, and if so create a new branch and submit a pull request?

Member

@lewismc lewismc left a comment


Thanks @smartive, please take a look at my comments.

build.xml Outdated
@@ -1031,7 +1031,8 @@

<source path="${basedir}/src/java/" />
<source path="${basedir}/src/test/" output="build/test/classes" />

<source path="${plugins.dir}/any23/src/java/" />
<source path="${plugins.dir}/any23/src/test/" />
Member

Please add entries for the javadoc and eclipse targets as well.

Additionally, please add entries to default.properties as appropriate


ivy/ivy.xml Outdated
@@ -66,6 +66,7 @@
<!-- End of Hadoop Dependencies -->

<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
<dependency org="org.apache.any23" name="apache-any23-core" rev="2.0"/>
Member

Upgrade this to 2.1 and correct formatting.


@@ -0,0 +1,8 @@
1. Upgrade Any23 dependency in trunk/ivy/ivy.xml
Member

You can go ahead and upgrade this when you make the update to the any23 2.1 dependency


import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.charset.Charset;
import java.util.*;
Member

Please remove this wildcard and use explicit imports.


import org.apache.any23.writer.TripleHandlerException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.*;
Member

Please use explicit imports.


* uses the <a href="http://any23.apache.org">Apache Any23</a> library
* for parsing and extracting structured data in RDF format from a
* variety of Web documents. Currently it supports the following
* input formats:</p>
Member

Please remove the list. Just link to the Any23 Webpage.

private void parse(String url, String htmlContent) throws URISyntaxException, IOException, TripleHandlerException {
Any23 any23 = new Any23();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
TripleHandler tHandler = new NTriplesWriter(baos);
Member

Please write the triples as Turtle; it is easier to read, and hence to debug if this breaks in the future.
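For illustration (this is a sketch, not output from the patch; the IRIs reuse the DBpedia Zürich example earlier in the thread), the same statement in both serializations:

```turtle
# N-Triples: every IRI spelled out in full, one statement per line
<http://dbpedia.org/resource/Z%C3%BCrich> <http://www.w3.org/2002/07/owl#sameAs> <http://rdf.freebase.com/ns/m.08966> .

# Turtle: prefixes shorten IRIs, so statements are easier to scan when debugging
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<http://dbpedia.org/resource/Z%C3%BCrich> owl:sameAs <http://rdf.freebase.com/ns/m.08966> .
```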


Member

Ignore the comment.

Metadata metadata = parse.getData().getParseMeta();

for (String triple : triples) {
sb.append(triple);
Member

For the time being, I would move forward with implementing this. We can work on an appropriate storage mechanism/location for Any23 extractions later on.

* limitations under the License.
*/
/**
* @author lewismc
Member

Remove lewismc and add comment similar to that present within Any23ParseFilter


*/
package org.apache.nutch.any23;

import static org.junit.Assert.*;
Member

Remove wildcard and use explicit imports

Member

@lewismc lewismc left a comment


Please see my comments.

ivy/ivy.xml Outdated
@@ -66,6 +66,13 @@
<!-- End of Hadoop Dependencies -->

<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
<dependency org="net.sourceforge.owlapi" name="owlapi-api" rev="5.1.0" />
Member

The dependencies need to go into the plugin ivy.xml. They do not belong here.


@@ -44,6 +44,10 @@
<exclude org="org.slf4j" name="slf4j-api" />
<exclude org="commons-lang" name="commons-lang" />
<exclude org="com.google.protobuf" name="protobuf-java" />
<exclude org="org.apache.poi" name="poi" />
Member

Why are these being excluded?


@lewismc
Member

lewismc commented Jan 8, 2018

Excellent @nmaro

@nickredmark
Contributor

@lewismc quick question: what is your favorite way to run single plugin tests?

We are trying with ant compile-core-test && ant runtime && ./runtime/local/bin/nutch junit org.apache.nutch.parse.metatags.TestAny23ParseFilter but we are encountering some problems. Is this the recommended way?

@lewismc
Member

lewismc commented Jan 8, 2018

@nmaro I run them in Eclipse. By the way, I have been working hard on Any23 to improve microdata extraction and a bunch of other stuff. We will be releasing Any23 2.2 reasonably soon, so we can make the Any23 upgrade here in Nutch as well.

@nickredmark
Contributor

@lewismc OK. I added an ant command that allows one to run single plugin tests, like ant -Dplugin=any23 test-plugin (and it works). Can I check that in?
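A minimal sketch of what such an Ant target could look like, matching the command line above (the target body and property names here are an assumption for illustration, not the committed code):

```xml
<!-- Sketch: run the test target of a single plugin,
     invoked as: ant -Dplugin=any23 test-plugin -->
<target name="test-plugin" description="Run the tests of one plugin">
  <!-- ${plugin} is supplied on the command line via -Dplugin=... -->
  <ant dir="${plugins.dir}/${plugin}" target="test" inheritall="false"/>
</target>
```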

@lewismc
Member

lewismc commented Jan 8, 2018

@nmaro yes please submit a PR, thanks

@nickredmark
Contributor

nickredmark commented Jan 10, 2018

@lewismc Requested changes done - please note that

  • I had to extend the elastic http plugin to handle lists of Map objects that it previously just stringified
  • Any23 couldn't detect as many triples as you expected in your tests, so I had to lower the number; it's good enough for us for now, and people can still expand the any23 scope if they find out what the problem is
  • Data is now indexed as follows (example after crawling https://smartive.ch/jobs):
          "structured_data": [
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"IE-edge,chrome=1\"@de",
              "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
              "short_key": "X-UA-Compatible"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
              "key": "<http://vocab.sindice.net/any23#description>",
              "short_key": "description"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width,initial-scale=1\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"ie=edge\"@de",
              "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
              "short_key": "x-ua-compatible"
            }
          ],

Member

@lewismc lewismc left a comment


This looks much, much better overall. We are nearly there :)
Can you please address the comments I've made?

<property name="any23.svn.apache.org"
value="http://svn.apache.org/repos/asf/any23/repo-ext/"
override="false"/>
<property name="ebi.ac.uk"
Member

This can be removed... is there any compelling reason to have it included?

@@ -47,6 +53,18 @@
pattern="${maven2.pattern.ext}"
m2compatible="true"
/>
<ibiblio name="ebi"
Member

This can be removed... is there any compelling reason to have it included?

@@ -64,6 +82,8 @@
<resolver ref="maven2"/>
<resolver ref="apache-snapshot"/>
<resolver ref="sonatype"/>
<resolver ref="any23-svn-repository"/>
<resolver ref="ebi"/>
Member

This can be removed... is there any compelling reason to have it included?

@@ -75,6 +95,8 @@
<resolver ref="maven2"/>
<resolver ref="apache-snapshot"/>
<resolver ref="sonatype"/>
<resolver ref="any23-svn-repository"/>
<resolver ref="ebi"/>
Member

This can be removed... is there any compelling reason to have it included?

src/bin/crawl Outdated
# -sm Path to sitemap URL file(s)
# CrawlDir Directory where the crawl/link/segments dirs are saved
# NumRounds The number of rounds to run this crawl for
# Usage: crawl [options] <crawl_dir> <num_rounds>
Member

Unfortunately, I don't think that this content should be included in this patch.

Contributor

@lewismc I went ahead and merged master back into this feature branch so this is solved - see 1e2c848


<dependencies>
<dependency org="org.semanticweb.owlapi" name="owlapi" rev="3.2.4" conf="*->default"/>
<dependency org="org.apache.commons" name="commons-rdf-api" rev="0.5.0" />
Member

Why are both commons-rdf-api and owlapi included here?

*/
package org.apache.nutch.any23;

import java.io.*;
Member

Please use explicit imports

any23.setMIMETypeDetector(null);

try {
// Fix input to avoid extraction error (https://github.com/semarglproject/semargl/issues/37#issuecomment-69381281)
Member

Ideally this code should be ported over to Any23's BaseRDFParser implementations... I think the RDFa11 parser is an example.

Contributor

I'm going to ignore this within the scope of this merge request as we already invested many hours into it.

Member

I will happily do it over at Any23 in the future.

*/
public final static String ANY23_TRIPLES = "Any23-Triples";

public static final String ANY_23_EXTRACTORS_CONF = "any23.extractors";
Member

If we are going to introduce a new option we need to actually include it in nutch-default.xml
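For illustration, such an entry in nutch-default.xml could look like the following (the default value html-microdata and the extractor names in the description are assumptions based on Any23's extractor naming; use whatever list the plugin actually defaults to):

```xml
<property>
  <name>any23.extractors</name>
  <value>html-microdata</value>
  <description>Comma-separated list of Any23 extractors to run against
  fetched documents, e.g. html-microdata,html-rdfa11.</description>
</property>
```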

Contributor

Done

@lewismc
Member

lewismc commented Jan 10, 2018

Thank you @mfeltscher

@nickredmark
Contributor

@lewismc all comments addressed.

@lewismc lewismc merged commit f82959d into apache:master Jan 11, 2018
@ferrerod

ferrerod commented Feb 8, 2018

On a Mac with JDK 8 installed, I ran into a failure on the javadoc task complaining about the Java version. Upon deeper inspection I determined the failure condition was tripping on ant.java.version equals 1.6; running ant -v said the Java version in my JAVA_HOME (JDK 8) is 1.6! Super strange...

I removed the ant.java.version checks in the javadoc task and reran...

ant zip-bin with Java 8 finished successfully!! However, the reason I'm posting here is that I noticed 19 errors and 106 warnings in the javadoc task. Here are the first few errors it encountered:

[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:30: error: package org.apache.any23 does not exist
[javadoc] import org.apache.any23.Any23;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:31: error: package org.apache.any23.extractor does not exist
[javadoc] import org.apache.any23.extractor.ExtractionException;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:32: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.BenchmarkTripleHandler;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:33: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.NTriplesWriter;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:34: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.TripleHandler;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:35: error: package org.apache.any23.writer does not exist
[javadoc] import org.apache.any23.writer.TripleHandlerException;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:43: error: package org.ccil.cowan.tagsoup does not exist
[javadoc] import org.ccil.cowan.tagsoup.XMLWriter;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:44: error: package org.ccil.cowan.tagsoup.jaxp does not exist
[javadoc] import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
[javadoc] ^
[javadoc] /nutch/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java:87: error: cannot find symbol
[javadoc] Any23Parser(String url, String htmlContent, String contentType, String... extractorNames) throws TripleHandlerException {

@lewismc
Member

lewismc commented Feb 8, 2018

Hi @ferrerod thank you for posting. I think this merely has to do with the Javadoc links not being available
https://github.com/apache/nutch/blob/master/default.properties#L43-L51
If you are able to fix it, then please by all means open a PR and we can review. Thank you

@ferrerod

ferrerod commented Feb 8, 2018

@nmaro @lewismc I'm new to Nutch and stumbled upon this merged PR after realizing I wanted much better crawling results for sites leveraging schema.org. This is likely the wrong place to pose my questions, but it's my first exposure to Any23 within Nutch 1.x; please do redirect me to a better place to post questions. And... I may not even be asking the right questions, but here goes:

Q: How do I gain the Any23 microdata parsing/indexing capabilities introduced with this PR? Do I replace parse-(html|tika)|index-(basic|anchor) in plugin.includes with something like
parse-(html|tika|any23)|index-(basic|anchor|any23)?

How do I expose the discovered microdata items to an end-user system such as Solr? For example, what are the microdata items, and how should I map them to Solr in solrindex-mapping.xml?
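As a sketch of the first question above (assuming the plugin id is any23 and that it registers both a parse filter and an indexing filter; check the plugin's plugin.xml for the exact ids), the plugin.includes override in nutch-site.xml might look like:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|any23)|index-(basic|anchor|any23)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```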

@lewismc
Member

lewismc commented Feb 8, 2018

Hi @ferrerod this is a good question and I would like to answer it on the mailing list. Please ask it on user@
http://nutch.apache.org/mailing_lists.html

@ferrerod

ferrerod commented Feb 8, 2018

Thank you Lewis, I just sent an email:
https://www.mail-archive.com/user@nutch.apache.org/msg15974.html
