Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for NUTCH-1480 contributed by r0ann3l #218

Merged
merged 13 commits into from Jun 1, 2018

Conversation

r0ann3l
Copy link
Contributor

@r0ann3l r0ann3l commented Aug 28, 2017

With this patch now we can have many instances of the same IndexWriter class, but with different configurations. Also, we can copy, rename or remove fields of documents for every index writer individually. Besides, the parameters needed by the index writers will be into separated XML files, so them will be not into nutch-site.xml anymore.

@jorgelbg
Copy link
Member

This PR includes more changes than the original ticket and breaks BC with custom indexers. @r0ann3l could you squash all the changes into a single commit? that would help in the review process since this PR has a lot of changes.

@r0ann3l r0ann3l force-pushed the NUTCH-1480 branch 20 times, most recently from a849eb1 to e4a7f87 Compare August 31, 2017 15:23
Copy link
Member

@lewismc lewismc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good. Please consider my comments @r0ann3l thank you

@@ -0,0 +1,95 @@
package org.apache.nutch.indexer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add license header

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import java.util.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use explicit imports instead of wildcard


HashMap<String, Extension> extensionMap = new HashMap<>();
for (Extension extension : extensions) {
LOG.info("Index writer " + extension.getClazz() + " identified.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use parameterized logging

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import java.util.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use wildcard

@@ -17,28 +17,26 @@
package org.apache.nutch.indexwriter.rabbit;

interface RabbitMQConstants {
String RABBIT_PREFIX = "rabbitmq.indexer";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove all of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lewismc. The prefix is not necessary anymore. The new structure allows us to have the same key of a parameter to many index writers without ambiguity or confusion. The prefix makes a parameter key larger and really I do not believe that is necessary.

@@ -19,19 +19,23 @@
public interface SolrConstants {
public static final String SOLR_PREFIX = "solr.";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to remove all of these as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still used for ZOOKEEPER_HOSTS = SOLR_PREFIX + "zookeeper.hosts"

solrClients.add(sc);
}
break;
case "concurrent":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you throw unsupported Exception at this stage? and also a default case?

"\t" + RabbitMQConstants.SERVER_USERNAME + " : Username for authentication\n" +
"\t" + RabbitMQConstants.SERVER_PASSWORD + " : Password for authentication\n" +
"\t" + RabbitMQConstants.COMMIT_SIZE + " : Buffer size when sending to RabbitMQ (default 250)\n";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency probably use the StringBuilder pattern here?

<xs:element name="writer" type="writerType" maxOccurs="unbounded" minOccurs="1">
<xs:annotation>
<xs:documentation>
Contains the all configuration of a particular index writer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? Also it would be a good idea to have empty lines at the end of this file and the conf/index-writers.xml.template for git/diff compatiblity.

@r0ann3l
Copy link
Contributor Author

r0ann3l commented Sep 13, 2017

Thanks @lewismc and @jorgelbg for your reviews. All your comments have been fixed.

@lewismc
Copy link
Member

lewismc commented Sep 14, 2017

Tested this on Solr 6 and works well... any comments folks?

@sebastian-nagel
Copy link
Contributor

Looks good! I've tried to use indexer-dummy with this PR applied - it took long to configure the index-writers.xml properly, so we should definitely add "stub" sections for all index writers which are (still) based on configuration properties. All index writers should work out-of-the-box!

<field source="boost"/>
</remove>
</mapping>
</writer>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add stub sections for all indexer-* plugins so that they work out-of-the-box without modifications of the index-writers.xml required, e.g. for indexer-dummy:

  <writer id="indexer_dummy_1" class="org.apache.nutch.indexwriter.dummy.DummyIndexWriter">
    <parameters>
      <param name="dummy-dummy" value="dummy-dummy"/>
    </parameters>
    <mapping>
      <copy>
        <field source="dummy-dummy" dest="dummy-dummy"/>
      </copy>
      <rename/>
      <remove/>
    </mapping>
  </writer>

That's long for a dummy section, but the schema (index-writers.xsd) and the IndexWriters class requires all the elements and attributes. Maybe it's better to "relax" the schema, make elements/attributes optional and make IndexWriters not fail with NPEs.

…ema, copy to the same field is not allowed and IndexWriterParams class to facilitate the process of obtaining values from index-writers.xml file.
@r0ann3l
Copy link
Contributor Author

r0ann3l commented Nov 8, 2017

Thanks @sebastian-nagel for your review. Sections for all indexer-* plugins were added, so they work out-of-the-box as you required in your comments. Also, it is not mandatory to specify fields for the actions (the schema is relaxed).

I included a new change, to avoid duplicate values in a field when someone tries to copy to the same field, like:

<copy>
	<field source="title" dest="title"/>
</copy>

In addition, I added a new class (IndexWriterParams) to facilitate the process of obtaining and parsing values from the index-writers.xml file. Now, an instance of IndexWriterParams is passed to each IndexWriter instead of HashMap.

r0ann3l and others added 4 commits November 8, 2017 09:28
- Solr:
  - do not copy fields to target not contained in default schema.xml
  - use "nutch" as default core name
- indexer-dummy: file in working directory (write permissions should be granted)
@sebastian-nagel
Copy link
Contributor

Hi @r0ann3l, thanks! I've continued testing, and was able to feed two Solr indexes in parallel. Great! Afaics, all requested changes have been made (also that of @lewismc).

To make the configuration work out of the box, I would suggest 3 changes:

  • use only field names defined in the default schema.xml
    `ERROR: [doc=http://nutch.apache.org/] unknown field 'search'
  • default Solr core name should be "nutch" as described in the tutorial

I've tried to fix these issues in "a fork of NUTCH-1480". Feel free to cherry pick it from there.

I've also tried to make indexer-dummy work. Without success, the file is created but then overwritten:

  • there are two instances of IndexWriters active, each having a separate instance of DummyIndexWriter.
    • the instance created from IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39) writes into the file
    • but later on the instance created from IndexWriters.open(IndexWriters.java:187) opens the file anew, at the end there is an empty file. Because it's two instances there is no possibility to check whether the file writer is already instantiated.

I see two potential solutions:

  1. the IndexWriter interface method open(job, name) was defined with file indexers in mind (cf. NUTCH-1541/CSVIndexWriter), an index writer can then decide to do nothing when called with name "commit".
  2. do not call the commit() method explicitly (ev. also remove it from the interface: it does not safely work in distributed mode because it's not run in the reducers (see the comment in RabbitIndexWriter).

I tend to the second solution. It would also solve the problem of having two IndexWriters instances active. What do you think?

- Logs for IndexerOutputFormat class to show the description of writers on the terminal.
- IndexWriters instance, describe() method call and commit() call, moved from IndexingJob to IndexerOutputFormat.
- The key of CACHE on ObjectCache.java:32 is now the UUID.
@r0ann3l
Copy link
Contributor Author

r0ann3l commented Nov 24, 2017

Hi @sebastian-nagel, thank you very much for your comments!!! I agree with your suggestions and I included the changes you propose from your fork.

About indexer-dummy, I also tried to make it work, but it was not possible. In theory, you can build as many instances of IndexWriters as you want, that you will always get the same instance, because it gotten from cache. So, the first issue I found was the ObjectCache uses the Configuration object itself as the key, and this object is not the same in each call. This causes that there are two instances of IndexWriters writing to same file, as you say. So, I replaced the key of ObjectCache with the UUID of the Configuration object.

Now, we have only one instance of IndexWriters, but there is another problem: when you try to commit the writers in IndexingJob.index(IndexingJob.java151) it is already closed from IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:44). Therefore, I moved the commit() call from IndexingJob to IndexerOutputFormat, just before the close() method is called.

I also, moved the indexers description from IndexingJob to IndexerOutputFormat, to avoid to build IndexWriters instance twice.

Thanks

@sebastian-nagel
Copy link
Contributor

Hi @r0ann3l,
did you verify that the change using NutchConfiguration.getUUID(conf) changes the behavior. Cf. NUTCH-2407 which let me doubt it as it's a only random UUID for each Configuration object.

Using the UUID in the ObjectCache makes the unit tests fail (TestGenerator): in fact the ObjectCache now returns the same object even if the configuration is different. We need actually really implement a hash value for Configuration objects.

@lewismc
Copy link
Member

lewismc commented Jan 3, 2018

@r0ann3l can you please update this PR inline with master?

@r0ann3l
Copy link
Contributor Author

r0ann3l commented Jan 17, 2018

Hi @sebastian-nagel:
In this case I propose to use an internal CACHE object, as the PluginRepository does, to store the IndexWriters object. The code could be something like this:

private static final WeakHashMap<String, IndexWriters> CACHE = new WeakHashMap<>();
public static synchronized IndexWriters get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    uuid = "nonNutchConf@" + conf.hashCode();
  }
  return CACHE.computeIfAbsent(uuid, k -> new IndexWriters(conf));
}

What do you think?

# Conflicts:
#	src/java/org/apache/nutch/indexer/IndexWriter.java
#	src/java/org/apache/nutch/indexer/IndexWriters.java
#	src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
#	src/java/org/apache/nutch/indexer/IndexingJob.java
#	src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
#	src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
#	src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
#	src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
#	src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
#	src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
#	src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
#	src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
#	src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
- Internal cache.
- Fixed the unit test of the Elastic index writer.
@r0ann3l
Copy link
Contributor Author

r0ann3l commented May 20, 2018

Hi @sebastian-nagel,

I changed the use of ObjectCache.class to an internal CACHE object, to avoid changing the behavior of other functionalities. In this case it is not necessary to have a different instance of IndexWriters.class for each Configuration.class. This is beacuse the index writer's configuration is handled in other individual file.

Also, I fixed an issue in TestElasticIndexWriter.class (associated with the use of IndexWriterParams.class), which causes the unit test fail.

@sebastian-nagel
Copy link
Contributor

Thanks, @r0ann3l! +1 - I've tested the solution in local and pseudo-distributed mode and was able to index into Solr (a single index). If there are no objections I'll commit/merge soon.

@sebastian-nagel sebastian-nagel merged commit 02afd5b into apache:master Jun 1, 2018
@r0ann3l r0ann3l deleted the NUTCH-1480 branch March 27, 2019 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants