fix for NUTCH-1480 contributed by r0ann3l #218
This PR includes more changes than the original ticket and breaks backward compatibility (BC) with custom indexers. @r0ann3l, could you squash all the changes into a single commit? That would help the review process, since this PR has a lot of changes.
a849eb1 to e4a7f87
This looks very good. Please consider my comments, @r0ann3l. Thank you.
@@ -0,0 +1,95 @@
package org.apache.nutch.indexer;
Please add license header
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

import java.util.*;
Please use explicit imports instead of wildcard
HashMap<String, Extension> extensionMap = new HashMap<>();
for (Extension extension : extensions) {
  LOG.info("Index writer " + extension.getClazz() + " identified.");
Please use parameterized logging
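For reference, a minimal sketch of what parameterized logging looks like. With SLF4J the quoted line would become `LOG.info("Index writer {} identified.", extension.getClazz());`. To keep the sketch free of dependencies, the runnable version below uses `java.util.logging` as a stand-in, where the placeholder is written `{0}`; the class name, logger name, and collecting handler are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Hypothetical sketch, not code from this PR.
final class ParameterizedLoggingSketch {

  /** Collects formatted messages so the result can be inspected. */
  static final class CollectingHandler extends Handler {
    final List<String> messages = new ArrayList<>();
    private final SimpleFormatter formatter = new SimpleFormatter();

    @Override
    public void publish(LogRecord record) {
      // The {0} placeholder is substituted here, only for records that are published.
      messages.add(formatter.formatMessage(record));
    }

    @Override
    public void flush() {
    }

    @Override
    public void close() {
    }
  }

  /** Logs one parameterized message per writer class and returns the formatted messages. */
  static List<String> logWriters(List<String> writerClasses) {
    Logger log = Logger.getLogger("index-writers-sketch");
    log.setUseParentHandlers(false); // keep the console quiet
    CollectingHandler handler = new CollectingHandler();
    log.addHandler(handler);
    try {
      for (String clazz : writerClasses) {
        // No string concatenation: the message template and the argument are passed separately.
        log.log(Level.INFO, "Index writer {0} identified.", clazz);
      }
    } finally {
      log.removeHandler(handler);
    }
    return handler.messages;
  }
}
```

The point of the pattern is that the message is only assembled when the log level is enabled, and the template stays readable.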
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import java.util.*;
Do not use wildcard
@@ -17,28 +17,26 @@
package org.apache.nutch.indexwriter.rabbit;

interface RabbitMQConstants {
  String RABBIT_PREFIX = "rabbitmq.indexer";
Why did you remove all of these?
Hi @lewismc. The prefix is no longer necessary. The new structure lets several index writers use the same parameter key without ambiguity or confusion. The prefix only makes the parameter keys longer, and I really do not believe it is needed.
@@ -19,19 +19,23 @@
public interface SolrConstants {
  public static final String SOLR_PREFIX = "solr.";
Any reason to remove all of these as well?
It's still used for ZOOKEEPER_HOSTS = SOLR_PREFIX + "zookeeper.hosts"
solrClients.add(sc);
}
break;
case "concurrent":
Can you throw an unsupported exception at this stage? And also add a default case?
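A minimal sketch of the requested shape, assuming the client type arrives as a string. Only `"concurrent"` appears in the quoted hunk; the other case labels, the return values, and the class name are hypothetical placeholders, and the choice of exception types is an assumption.

```java
import java.util.Objects;

// Hypothetical sketch, not code from this PR.
final class SolrClientTypeSketch {

  /** Maps a configured client type to a description, failing loudly on anything unexpected. */
  static String describeClient(String type) {
    Objects.requireNonNull(type, "client type must be set");
    switch (type) {
      case "http":  // assumed label
        return "plain HTTP client";
      case "cloud": // assumed label
        return "ZooKeeper-aware cloud client";
      case "concurrent":
        // Known value, but not implemented at this stage.
        throw new UnsupportedOperationException(
            "Client type 'concurrent' is not supported yet");
      default:
        // Typos in the configuration surface immediately instead of being silently ignored.
        throw new IllegalArgumentException("Unknown client type: " + type);
    }
  }
}
```

Using two different exception types lets callers tell "recognized but not available" apart from "misconfigured".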
"\t" + RabbitMQConstants.SERVER_USERNAME + " : Username for authentication\n" + | ||
"\t" + RabbitMQConstants.SERVER_PASSWORD + " : Password for authentication\n" + | ||
"\t" + RabbitMQConstants.COMMIT_SIZE + " : Buffer size when sending to RabbitMQ (default 250)\n"; | ||
} |
For consistency, probably use the StringBuilder pattern here?
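A minimal sketch of a `describe()`-style method built with `StringBuilder`, producing the same tab-indented `key : description` layout as the concatenated string in the quoted hunk. The class name, method signature, and the parameter map are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch, not code from this PR.
final class DescribeSketch {

  /** Builds a description: the writer name, then one tab-indented "key : text" line per parameter. */
  static String describe(String writerName, Map<String, String> paramDescriptions) {
    StringBuilder sb = new StringBuilder(writerName).append('\n');
    for (Map.Entry<String, String> entry : paramDescriptions.entrySet()) {
      sb.append('\t')
          .append(entry.getKey())
          .append(" : ")
          .append(entry.getValue())
          .append('\n');
    }
    return sb.toString();
  }
}
```

Passing a `LinkedHashMap` keeps the parameters in insertion order, which makes the output stable across runs.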
conf/index-writers.xsd
<xs:element name="writer" type="writerType" maxOccurs="unbounded" minOccurs="1">
  <xs:annotation>
    <xs:documentation>
      Contains the all configuration of a particular index writer.
Typo? Also, it would be a good idea to have empty lines at the end of this file and of conf/index-writers.xml.template for git/diff compatibility.
Tested this on Solr 6 and works well... any comments folks?
Looks good! I've tried to use indexer-dummy with this PR applied - it took long to configure the index-writers.xml properly, so we should definitely add "stub" sections for all index writers which are (still) based on configuration properties. All index writers should work out-of-the-box!
      <field source="boost"/>
    </remove>
  </mapping>
</writer>
Add stub sections for all indexer-* plugins so that they work out-of-the-box without modifications of the index-writers.xml required, e.g. for indexer-dummy:
<writer id="indexer_dummy_1" class="org.apache.nutch.indexwriter.dummy.DummyIndexWriter">
  <parameters>
    <param name="dummy-dummy" value="dummy-dummy"/>
  </parameters>
  <mapping>
    <copy>
      <field source="dummy-dummy" dest="dummy-dummy"/>
    </copy>
    <rename/>
    <remove/>
  </mapping>
</writer>
That's long for a dummy section, but the schema (index-writers.xsd) and the IndexWriters class require all the elements and attributes. Maybe it's better to "relax" the schema, make elements/attributes optional, and make IndexWriters not fail with NPEs.
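A minimal sketch of what tolerating optional elements could look like on the parsing side, using the JDK DOM API. It assumes the `<mapping>`/`<copy>`/`<field source dest>` layout shown in the stub section; the class and method names are hypothetical and this is not the actual IndexWriters code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not code from this PR.
final class LenientWriterConfigSketch {

  /** Returns the source-to-dest copy rules, or an empty map when <mapping> or <copy> is absent. */
  static Map<String, String> copyFields(Element writer) {
    Map<String, String> rules = new LinkedHashMap<>();
    // item(0) returns null when no <copy> element exists; guard instead of dereferencing.
    Element copy = (Element) writer.getElementsByTagName("copy").item(0);
    if (copy == null) {
      return rules;
    }
    NodeList fields = copy.getElementsByTagName("field"); // never null, may be empty
    for (int i = 0; i < fields.getLength(); i++) {
      Element field = (Element) fields.item(i);
      rules.put(field.getAttribute("source"), field.getAttribute("dest"));
    }
    return rules;
  }

  /** Parses a single <writer> element from a string; wraps checked exceptions for brevity. */
  static Element parseWriter(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse writer configuration", e);
    }
  }
}
```

With a guard like this, a stub section can shrink to a bare `<writer id="..." class="..."/>` element, provided the schema also marks the child elements as optional.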
…ema, copy to the same field is not allowed and IndexWriterParams class to facilitate the process of obtaining values from index-writers.xml file.
Thanks @sebastian-nagel for your review. Sections for all indexer-* plugins were added, so they work out-of-the-box as you requested in your comments. Also, it is no longer mandatory to specify fields for the actions (the schema is relaxed). I included a new change to avoid duplicate values in a field when someone tries to copy a field to itself, like:
In addition, I added a new class (IndexWriterParams) to facilitate obtaining and parsing values from the index-writers.xml file. Now an instance of IndexWriterParams is passed to each IndexWriter instead of a HashMap.
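To illustrate the idea of a typed parameter holder, here is a minimal sketch of a wrapper around string key/value pairs with defaults. It is deliberately named `WriterParamsSketch` because it is not the IndexWriterParams class from this PR; its method names and the `commit.size` key are assumptions. The default of 250 is the commit-size default mentioned in the RabbitMQ description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the IndexWriterParams class from this PR.
final class WriterParamsSketch {
  private final Map<String, String> values;

  WriterParamsSketch(Map<String, String> values) {
    this.values = new HashMap<>(values); // defensive copy
  }

  /** Returns the trimmed value, or the default when the key is missing. */
  String get(String key, String defaultValue) {
    String value = values.get(key);
    return value == null ? defaultValue : value.trim();
  }

  /** Parses the value as an int, or returns the default when the key is missing. */
  int getInt(String key, int defaultValue) {
    String value = values.get(key);
    return value == null ? defaultValue : Integer.parseInt(value.trim());
  }

  /** Parses the value as a boolean, or returns the default when the key is missing. */
  boolean getBoolean(String key, boolean defaultValue) {
    String value = values.get(key);
    return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
  }

  /** Splits a comma-separated value into trimmed parts; empty array when the key is missing. */
  String[] getStrings(String key) {
    String value = values.get(key);
    return value == null ? new String[0] : value.trim().split("\\s*,\\s*");
  }
}
```

A writer could then read `params.getInt("commit.size", 250)` instead of parsing strings itself, which keeps the parsing and default handling in one place.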
- Solr:
  - do not copy fields to target not contained in default schema.xml
  - use "nutch" as default core name
- indexer-dummy: file in working directory (write permissions should be granted)
Hi @r0ann3l, thanks! I've continued testing, and was able to feed two Solr indexes in parallel. Great! Afaics, all requested changes have been made (also that of @lewismc). To make the configuration work out of the box, I would suggest 3 changes:
I've tried to fix these issues in "a fork of NUTCH-1480". Feel free to cherry pick it from there. I've also tried to make indexer-dummy work. Without success, the file is created but then overwritten:
I see two potential solutions:
I lean toward the second solution. It would also solve the problem of having two IndexWriters instances active. What do you think?
- Logs for IndexerOutputFormat class to show the description of writers on the terminal.
- IndexWriters instance, describe() method call and commit() call, moved from IndexingJob to IndexerOutputFormat.
- The key of CACHE on ObjectCache.java:32 is now the UUID.
Hi @sebastian-nagel, thank you very much for your comments! I agree with your suggestions and have included the changes you proposed from your fork. About indexer-dummy: I also tried to make it work, but it was not possible. In theory, you can build as many instances of Now, we have only one instance of I also moved the indexers description from Thanks
Hi @r0ann3l, using the UUID in the ObjectCache makes the unit tests fail (TestGenerator): the ObjectCache now returns the same object even if the configuration is different. We actually need to implement a real hash value for Configuration objects.
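One way to picture a content-based key is sketched below, under the assumption that the configuration's properties have first been copied into a plain `Map<String, String>`. The cache is keyed by an immutable snapshot of that map, so two configurations with equal properties share a cache while differing ones do not. All names are hypothetical; this is neither the ObjectCache from Nutch nor the approach this PR ended up with.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the ObjectCache from Nutch.
final class ContentKeyedCacheSketch {
  // Map.equals/hashCode compare entries, so the snapshot itself works as a content-based key.
  private final Map<Map<String, String>, Map<String, Object>> caches = new HashMap<>();

  /** Returns the object cache belonging to configurations with exactly these properties. */
  synchronized Map<String, Object> cacheFor(Map<String, String> properties) {
    // Snapshot the properties so later changes to the caller's map cannot corrupt the key.
    Map<String, String> key = Collections.unmodifiableMap(new HashMap<>(properties));
    return caches.computeIfAbsent(key, k -> new HashMap<>());
  }
}
```

Keying by the full snapshot avoids hash collisions at the cost of copying the properties on every lookup; a precomputed digest of the sorted entries would be a cheaper variant of the same idea.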
@r0ann3l, can you please bring this PR in line with master?
Hi @sebastian-nagel:
What do you think?
# Conflicts:
#   src/java/org/apache/nutch/indexer/IndexWriter.java
#   src/java/org/apache/nutch/indexer/IndexWriters.java
#   src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
#   src/java/org/apache/nutch/indexer/IndexingJob.java
#   src/plugin/indexer-cloudsearch/src/java/org/apache/nutch/indexwriter/cloudsearch/CloudSearchIndexWriter.java
#   src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
#   src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
#   src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticConstants.java
#   src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
#   src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitIndexWriter.java
#   src/plugin/indexer-rabbit/src/java/org/apache/nutch/indexwriter/rabbit/RabbitMQConstants.java
#   src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
#   src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
- Internal cache.
- Fixed the unit test of the Elastic index writer.
Hi @sebastian-nagel, I changed the use of Also, I fixed an issue in
Thanks, @r0ann3l! +1 - I've tested the solution in local and pseudo-distributed mode and was able to index into Solr (a single index). If there are no objections I'll commit/merge soon.
With this patch we can have many instances of the same IndexWriter class, each with a different configuration. We can also copy, rename, or remove document fields for every index writer individually. In addition, the parameters needed by the index writers will live in separate XML files, so they will no longer be in nutch-site.xml.
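To make the per-writer field mapping concrete, here is a minimal sketch that applies copy, rename, and remove rules to a document modelled as a map from field name to values. The document model, the class name, and the order in which the three actions run are assumptions for illustration; the guard that skips copying a field onto itself mirrors the duplicate-value fix discussed in the review.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not code from this PR.
final class FieldMappingSketch {

  /** Returns a new document with the copy, rename, and remove rules applied, in that order. */
  static Map<String, List<String>> apply(Map<String, List<String>> doc,
      Map<String, String> copy, Map<String, String> rename, List<String> remove) {
    // Work on a deep copy so the caller's document stays untouched.
    Map<String, List<String>> out = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> field : doc.entrySet()) {
      out.put(field.getKey(), new ArrayList<>(field.getValue()));
    }

    for (Map.Entry<String, String> rule : copy.entrySet()) {
      String source = rule.getKey();
      String dest = rule.getValue();
      // Copying a field onto itself would only duplicate its values, so skip it.
      if (source.equals(dest) || !out.containsKey(source)) {
        continue;
      }
      List<String> values = new ArrayList<>(out.get(source));
      out.computeIfAbsent(dest, k -> new ArrayList<>()).addAll(values);
    }

    for (Map.Entry<String, String> rule : rename.entrySet()) {
      String source = rule.getKey();
      String dest = rule.getValue();
      if (source.equals(dest) || !out.containsKey(source)) {
        continue;
      }
      List<String> values = out.remove(source);
      out.computeIfAbsent(dest, k -> new ArrayList<>()).addAll(values);
    }

    for (String field : remove) {
      out.remove(field);
    }
    return out;
  }
}
```

Because each writer would carry its own rule set, the same source document can end up with different fields in each target index.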