fix for NUTCH-1480 contributed by r0ann3l #218

r0ann3l · 2017-08-28T17:20:39Z

With this patch we can have many instances of the same IndexWriter class, each with a different configuration. We can also copy, rename or remove document fields for every index writer individually. In addition, the parameters needed by the index writers are moved into separate XML files, so they are no longer in nutch-site.xml.

jorgelbg · 2017-08-29T15:50:32Z

This PR includes more changes than the original ticket and breaks backward compatibility (BC) with custom indexers. @r0ann3l could you squash all the changes into a single commit? That would help the review process, since this PR has a lot of changes.

lewismc

This looks very good. Please consider my comments, @r0ann3l. Thank you.

lewismc · 2017-09-05T21:29:44Z

src/java/org/apache/nutch/indexer/IndexWriterConfig.java

@@ -0,0 +1,95 @@
+package org.apache.nutch.indexer;


Please add license header

lewismc · 2017-09-05T21:30:01Z

src/java/org/apache/nutch/indexer/IndexWriterConfig.java

+import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
+
+import java.util.*;


Please use explicit imports instead of wildcard

lewismc · 2017-09-05T21:30:41Z

src/java/org/apache/nutch/indexer/IndexWriters.java

+
+          HashMap<String, Extension> extensionMap = new HashMap<>();
+          for (Extension extension : extensions) {
+            LOG.info("Index writer " + extension.getClazz() + " identified.");


Please use parameterized logging

lewismc · 2017-09-05T21:31:21Z

src/java/org/apache/nutch/indexer/MappingReader.java

+import org.w3c.dom.Node;
+import org.w3c.dom.NodeList;
+
+import java.util.*;


Do not use wildcard

lewismc · 2017-09-05T21:32:50Z

src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java

@@ -17,28 +17,26 @@
 package org.apache.nutch.indexwriter.rabbit;

 interface RabbitMQConstants {
-    String RABBIT_PREFIX = "rabbitmq.indexer";


Why did you remove all of these?

r0ann3l · 2017-09-13

Hi @lewismc. The prefix is no longer necessary. The new structure lets several index writers use the same parameter key without ambiguity or confusion. The prefix only makes the parameter key longer, and I really do not believe it is needed.

lewismc · 2017-09-05T21:33:16Z

src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java

@@ -19,19 +19,23 @@
 public interface SolrConstants {
  public static final String SOLR_PREFIX = "solr.";


Any reason to remove all of these as well?

sebastian-nagel · 2018-06-01 (edited)

It's still used for ZOOKEEPER_HOSTS = SOLR_PREFIX + "zookeeper.hosts"

lewismc · 2017-09-05T21:34:44Z

src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java

+          solrClients.add(sc);
+        }
+        break;
+      case "concurrent":


Can you throw an unsupported exception at this stage? And also add a default case?
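
A minimal sketch of what this request could look like. Only the "concurrent" case appears in the diff above; the class name, the other case labels and the returned strings are placeholders for illustration, not the actual SolrIndexWriter code.

```java
import java.util.Locale;

/** Hypothetical stand-in for the client selection switch; not the real SolrIndexWriter. */
final class SolrClientTypeSketch {

  /** Returns a label for the client that would be created for the given type. */
  static String select(String type) {
    switch (type.toLowerCase(Locale.ROOT)) {
      case "http":   // assumed case label
        return "http client";
      case "cloud":  // assumed case label
        return "cloud client";
      case "concurrent":
        // Known type without an implementation: fail loudly instead of doing nothing.
        throw new UnsupportedOperationException(
            "Solr client type 'concurrent' is not supported yet");
      default:
        // Any other value is a configuration error.
        throw new IllegalArgumentException("Unknown Solr client type: " + type);
    }
  }
}
```

With a default case, a typo in the configured type surfaces as an exception when the writer is opened rather than as a writer that silently has no clients.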

jorgelbg · 2017-09-05T23:41:00Z

src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java

+            "\t" + RabbitMQConstants.SERVER_USERNAME + " : Username for authentication\n" +
+            "\t" + RabbitMQConstants.SERVER_PASSWORD + " : Password for authentication\n" +
+            "\t" + RabbitMQConstants.COMMIT_SIZE + " : Buffer size when sending to RabbitMQ (default 250)\n";
+  }


For consistency probably use the StringBuilder pattern here?
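
For illustration, the same description built with a StringBuilder might look roughly like this. The constant names come from the diff above; their values and the header line are assumptions, since RabbitMQConstants is not shown in full in this thread.

```java
/** Hypothetical describe() using a StringBuilder; constant values are assumed. */
final class RabbitDescribeSketch {
  // Assumed values: the real RabbitMQConstants values are not visible in the thread.
  static final String SERVER_USERNAME = "server.username";
  static final String SERVER_PASSWORD = "server.password";
  static final String COMMIT_SIZE = "commit.size";

  /** Builds the same text as the string concatenation in the diff, one append chain per line. */
  static String describe() {
    StringBuilder sb = new StringBuilder("RabbitIndexWriter\n");
    sb.append('\t').append(SERVER_USERNAME).append(" : Username for authentication\n");
    sb.append('\t').append(SERVER_PASSWORD).append(" : Password for authentication\n");
    sb.append('\t').append(COMMIT_SIZE)
        .append(" : Buffer size when sending to RabbitMQ (default 250)\n");
    return sb.toString();
  }
}
```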

jorgelbg · 2017-09-05T23:46:55Z

conf/index-writers.xsd

+        <xs:element name="writer" type="writerType" maxOccurs="unbounded" minOccurs="1">
+          <xs:annotation>
+            <xs:documentation>
+              Contains the all configuration of a particular index writer.


Typo? Also, it would be a good idea to have empty lines at the end of this file and of conf/index-writers.xml.template for git/diff compatibility.

r0ann3l · 2017-09-13T16:17:30Z

Thanks @lewismc and @jorgelbg for your reviews. All your comments have been addressed.

lewismc · 2017-09-14T18:05:28Z

Tested this on Solr 6 and works well... any comments folks?

sebastian-nagel · 2017-09-20T08:36:15Z

Looks good! I've tried to use indexer-dummy with this PR applied. It took a long time to configure index-writers.xml properly, so we should definitely add "stub" sections for all index writers which are (still) based on configuration properties. All index writers should work out-of-the-box!

sebastian-nagel · 2017-09-20T08:43:01Z

conf/index-writers.xml.template

+        <field source="boost"/>
+      </remove>
+    </mapping>
+  </writer>


Add stub sections for all indexer-* plugins so that they work out-of-the-box without modifications of the index-writers.xml required, e.g. for indexer-dummy:

<writer id="indexer_dummy_1" class="org.apache.nutch.indexwriter.dummy.DummyIndexWriter">
  <parameters>
    <param name="dummy-dummy" value="dummy-dummy"/>
  </parameters>
  <mapping>
    <copy>
      <field source="dummy-dummy" dest="dummy-dummy"/>
    </copy>
    <rename/>
    <remove/>
  </mapping>
</writer>

That's long for a dummy section, but the schema (index-writers.xsd) and the IndexWriters class require all the elements and attributes. Maybe it's better to "relax" the schema, make elements/attributes optional and make IndexWriters not fail with NPEs.

r0ann3l · 2017-11-08T14:23:44Z

Thanks @sebastian-nagel for your review. Sections for all indexer-* plugins were added, so they work out-of-the-box as you requested. Also, it is no longer mandatory to specify fields for the actions (the schema is relaxed).

I also included a new change to avoid duplicate values in a field when someone tries to copy a field onto itself, for example:

<copy>
	<field source="title" dest="title"/>
</copy>
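
A rough sketch of that guard, assuming a document is modelled as a map from field name to a list of values. The class and method names are made up for illustration and are not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical copy action over a document modelled as field name -> values. */
final class CopyFieldSketch {

  /** Copies all values of source into dest; copying a field onto itself is ignored. */
  static void copy(Map<String, List<String>> doc, String source, String dest) {
    if (source.equals(dest)) {
      return; // would only duplicate the existing values
    }
    List<String> values = doc.get(source);
    if (values == null) {
      return; // nothing to copy
    }
    doc.computeIfAbsent(dest, k -> new ArrayList<>()).addAll(values);
  }
}
```

Without the first check, the mapping above would leave the title field with every value listed twice.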

In addition, I added a new class (IndexWriterParams) to facilitate the process of obtaining and parsing values from the index-writers.xml file. Now, an instance of IndexWriterParams is passed to each IndexWriter instead of a HashMap.
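
The thread does not show IndexWriterParams itself, so the following is only a guess at the kind of helper described: a thin wrapper over the name/value pairs of one writer that parses values and falls back to defaults. The class name, method names and parameter keys here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical typed view over the parameters of a single index writer. */
final class WriterParamsSketch {
  private final Map<String, String> values;

  WriterParamsSketch(Map<String, String> values) {
    this.values = new HashMap<>(values); // defensive copy
  }

  /** Returns the raw value, or defaultValue when the parameter is absent. */
  String get(String name, String defaultValue) {
    String v = values.get(name);
    return v == null ? defaultValue : v;
  }

  /** Parses the value as an int, or returns defaultValue when absent. */
  int getInt(String name, int defaultValue) {
    String v = values.get(name);
    return v == null ? defaultValue : Integer.parseInt(v.trim());
  }

  /** Parses the value as a boolean, or returns defaultValue when absent. */
  boolean getBoolean(String name, boolean defaultValue) {
    String v = values.get(name);
    return v == null ? defaultValue : Boolean.parseBoolean(v.trim());
  }
}
```

A writer could then read a setting with something like params.getInt("commit.size", 250) instead of parsing strings out of a HashMap by hand.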

sebastian-nagel · 2017-11-17T14:15:58Z

Hi @r0ann3l, thanks! I've continued testing, and was able to feed two Solr indexes in parallel. Great! Afaics, all requested changes have been made (also that of @lewismc).

To make the configuration work out of the box, I would suggest 3 changes:

- use only field names defined in the default schema.xml:
  ERROR: [doc=http://nutch.apache.org/] unknown field 'search'
- default Solr core name should be "nutch" as described in the tutorial

I've tried to fix these issues in "a fork of NUTCH-1480". Feel free to cherry pick it from there.

I've also tried to make indexer-dummy work, without success: the file is created but then overwritten.

There are two instances of IndexWriters active, each having a separate instance of DummyIndexWriter:
- the instance created from IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39) writes into the file
- but later on the instance created from IndexWriters.open(IndexWriters.java:187) opens the file anew, so in the end the file is empty. Because these are two separate instances, there is no way to check whether the file writer is already instantiated.

I see two potential solutions:

1. The IndexWriter interface method open(job, name) was defined with file indexers in mind (cf. NUTCH-1541/CSVIndexWriter); an index writer can then decide to do nothing when called with the name "commit".
2. Do not call the commit() method explicitly (and possibly also remove it from the interface): it does not work safely in distributed mode because it is not run in the reducers (see the comment in RabbitIndexWriter).

I lean towards the second solution. It would also solve the problem of having two IndexWriters instances active. What do you think?

- Logs for IndexerOutputFormat class to show the description of writers on the terminal.
- IndexWriters instance, describe() method call and commit() call, moved from IndexingJob to IndexerOutputFormat.
- The key of CACHE on ObjectCache.java:32 is now the UUID.

r0ann3l · 2017-11-24T13:59:48Z

Hi @sebastian-nagel, thank you very much for your comments!!! I agree with your suggestions and I included the changes you propose from your fork.

About indexer-dummy: I also tried to make it work, but it was not possible. In theory, you can build as many instances of IndexWriters as you want and you will always get the same instance, because it is fetched from the cache. The first issue I found was that ObjectCache uses the Configuration object itself as the key, and this object is not the same in each call. As a result, two instances of IndexWriters write to the same file, as you say. So I replaced the key of ObjectCache with the UUID of the Configuration object.

Now we have only one instance of IndexWriters, but there is another problem: when you try to commit the writers in IndexingJob.index(IndexingJob.java:151), they are already closed from IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44). Therefore, I moved the commit() call from IndexingJob to IndexerOutputFormat, just before the close() method is called.

I also moved the indexers' description from IndexingJob to IndexerOutputFormat, to avoid building the IndexWriters instance twice.

Thanks

sebastian-nagel · 2017-11-29T21:08:19Z

Hi @r0ann3l,
did you verify that switching to NutchConfiguration.getUUID(conf) changes the behavior? Cf. NUTCH-2407, which makes me doubt it, as it is only a random UUID for each Configuration object.

Using the UUID in the ObjectCache makes the unit tests fail (TestGenerator): in fact the ObjectCache now returns the same object even if the configuration is different. We actually need to implement a real hash value for Configuration objects.
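
One way to read "a hash value for Configuration objects" is a cache key derived from the property names and values rather than from object identity or a random UUID. The sketch below is an assumption about how that could be done, not something proposed in this PR; it takes any iterable of string entries (a Hadoop Configuration can be iterated as Map.Entry<String, String>) so that it stays self-contained.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical content-based cache key for a set of configuration properties. */
final class ConfigKeySketch {

  /**
   * Snapshots the properties into an unmodifiable sorted map. Map equals() and
   * hashCode() depend only on the entries, so two configurations with the same
   * content yield equal keys, and any differing property yields unequal keys.
   */
  static Map<String, String> contentKey(Iterable<Map.Entry<String, String>> properties) {
    Map<String, String> snapshot = new TreeMap<>();
    for (Map.Entry<String, String> e : properties) {
      snapshot.put(e.getKey(), e.getValue());
    }
    return Collections.unmodifiableMap(snapshot);
  }
}
```

Using the snapshot itself as the key, rather than only its hash code, avoids returning a cached object for a different configuration that happens to collide on the hash. The cost is copying every property on each lookup, which may or may not be acceptable here.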

lewismc · 2018-01-03T17:36:37Z

@r0ann3l can you please update this PR in line with master?

r0ann3l · 2018-01-17T16:15:43Z

Hi @sebastian-nagel:
In this case I propose to use an internal CACHE object, as the PluginRepository does, to store the IndexWriters object. The code could be something like this:

private static final WeakHashMap<String, IndexWriters> CACHE = new WeakHashMap<>();
public static synchronized IndexWriters get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    uuid = "nonNutchConf@" + conf.hashCode();
  }
  return CACHE.computeIfAbsent(uuid, k -> new IndexWriters(conf));
}

What do you think?

# Conflicts:
#   src/java/org/apache/nutch/indexer/IndexWriter.java
#   src/java/org/apache/nutch/indexer/IndexWriters.java
#   src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
#   src/java/org/apache/nutch/indexer/IndexingJob.java
#   src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
#   src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
#   src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
#   src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
#   src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
#   src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
#   src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
#   src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
#   src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java

r0ann3l · 2018-05-20T23:51:05Z

Hi @sebastian-nagel,

I changed the use of ObjectCache.class to an internal CACHE object, to avoid changing the behavior of other functionality. In this case it is not necessary to have a different instance of IndexWriters.class for each Configuration.class, because the index writers' configuration is handled in a separate file.

Also, I fixed an issue in TestElasticIndexWriter.class (associated with the use of IndexWriterParams.class) which caused the unit test to fail.

sebastian-nagel · 2018-05-31T15:05:47Z

Thanks, @r0ann3l! +1 - I've tested the solution in local and pseudo-distributed mode and was able to index into Solr (a single index). If there are no objections I'll commit/merge soon.

r0ann3l force-pushed the NUTCH-1480 branch 20 times, most recently from a849eb1 to e4a7f87 (August 31, 2017 15:23)

Fixes for NUTCH-1480: Multiple index writer instances with different configurations.

e4a7f87

lewismc requested changes Sep 5, 2017

jorgelbg reviewed Sep 5, 2017

Fixes for NUTCH-1480: Some improvements based on reviewers feedback.

86cd375

sebastian-nagel reviewed Sep 20, 2017

Fixes for NUTCH-1480: Sections for all indexer-* plugins, relaxed schema, copy to the same field is not allowed and IndexWriterParams class to facilitate the process of obtaining values from index-writers.xml file.

84246a9

r0ann3l and others added 4 commits November 8, 2017 09:28

Merge branch 'master' into NUTCH-1480

a674df5

Merge branch 'master' into NUTCH-1480

e125379

Adapt defaults of index-writers.xml to work out-of-the-box:

5cd48d8

- Solr:
  - do not copy fields to target not contained in default schema.xml
  - use "nutch" as default core name
- indexer-dummy: file in working directory (write permissions should be granted)

Improve log messages of indexer-dummy

ca5d242

r0ann3l force-pushed the NUTCH-1480 branch from cf819af to 7e9d1df (November 24, 2017 13:55)

r0ann3l added 2 commits January 12, 2018 11:08

Merge branch 'master' into NUTCH-1480

54b7fa8

Fixes for NUTCH-1480: Support for NUTCH-2484 and NUTCH-2380.

d45510c

r0ann3l added 3 commits May 18, 2018 22:23

Fixes for NUTCH-1480: Merge branch 'master' into NUTCH-1480

b4e5393

Fixes for NUTCH-1480: Changes:

041927a

- Internal cache.
- Fixed the unit test of the Elastic index writer.

sebastian-nagel merged commit 02afd5b into apache:master Jun 1, 2018

sebastian-nagel mentioned this pull request Jun 1, 2018

fix for NUTCH-2580 contributed by r0ann3l #335

Merged

r0ann3l deleted the NUTCH-1480 branch March 27, 2019 16:50
