New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-4035 Implement record-based Solr processors #2561
Conversation
NIFI-4035 Adding a PutSolrRecord Processor that reads NiFi Records and indexes them into Solr as SolrDocuments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far. Just wanted to give it some preliminary feedback.
.defaultValue("/update") | ||
.build(); | ||
|
||
public static final PropertyDescriptor FIELDS_TO_INDEX = new PropertyDescriptor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could get dicey if a user does embedded records. I know ElasticSearch supports that, and it's my understanding that Solr can do them too. Might want to think about this because a user might say they want to do these 3 top level and these 2 embedded fields and nothing else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solr can do child documents which if I understand correctly is what you meant by embedded records, but that is something that I haven't implemented in this processor as I was expecting this processor to be indexing a single solr document for a single record, in which case all the fields would be at the top level. I can look into implementing child documents but then i'm unsure how would the child docs be represented in a record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added an example in the additionalDetails.html on how to specify fields for nested records.
.description("FlowFiles that failed for any reason other than Solr being unreachable") | ||
.build(); | ||
|
||
public static final Relationship REL_CONNECTION_FAILURE = new Relationship.Builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this is covered under FAILURE by other processors. You can always put a failure reason attribute on there and users can use RouteOnAttribute to cover this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other Solr processors have a connection failure relationship and it makes it easier to route connection failure back to self, and then send other failure relationship somewhere else because the other failures are likely permanent failures that will never work
@@ -0,0 +1,105 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See TestPutHBaseRecord
for an example of how to use org.apache.nifi.serialization.record.MockRecordParser
which should be able to replace this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PutMongoRecordIT.testInsertNestedRecords
is another good example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll look into these tests.
final RecordReader reader = readerFactory.createRecordReader(flowFile, in, getLogger())) { | ||
|
||
Record record; | ||
List<SolrInputDocument> inputDocumentList = new LinkedList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A big benefit of record-processing in NiFi is to pass around flow files with many records in them (thousands or millions) and thus avoid having many small flow file... given that, we probably want to avoid reading all of the records into a list here, as that would mean the entire content of the flow file would essentially be read into memory.
I would suggest some kind of batch size property like 500 or 1,000, and every time the batch size is reached then send a batch to Solr and start a new batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add a parameter to accept a batch size and modify the processor to index solr documents in batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a parameter to index documents in batches, although now the whole flowfile would fail even if one of the records of the flowfile threw an exception. I understand it is ideal in case of a Solr connection failure but in case of some other exception specific to a record would it be better if we process the records that do not throw an exception? If yes, would we in that case route the flowfile to failure as some records still threw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do more complicated things like try to keep inserting records and write out all the failed records to a new flow file, but in my experience this usually becomes complex and error prone.
In this case there should be very few scenarios where specific records are failing because in order to be read by the record reader they already have to conform to the schema. So every SolrDocument being created will already have the same field names with the same types of values. If there is chance of invalid data that doesn't conform to the schema, then this can be handled before this processor using ValidateRecord.
So long story short, I think just keep it simple, and if anything fails for a reason other than connection exception, than just route the whole flow file to failure.
inputDocument.addField(fieldName,stringValue); | ||
} | ||
break; | ||
case RECORD: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do nested records end up being represented in the Solr document? Not saying anything is wrong here, just asking to understand how it works.
Lets say we have a person schema with top-level fields for "firstName" and "lastName", and "address", and the address field is of type record and then has it's own fields "street", "city", "zip"...
Does the resulting Solr document contain "firstName", "lastName", "street", "city", "zip"?
Would it make sense to have an option to include the parent field in the field names, so it ends up being "address_street", "address_city", and "address_zip" so that you know where those fields came from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've implemented it to flatten the nested record into a single solr document, although i was sort of unsure if there would be a need to consider nested records. If there is one, I'll make changes to the field name as you suggested so that one knows where the field came from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to handle it since someone can specify a field name in "fields to index" that could be of type record.
I think it makes sense to have a property like "Nested Field Names" with choices for "Fully Qualified" and "Child Only" (or something like that).
This lines up with how Solr's JSON update works:
The part that shows....
The default behavior is to use the fully qualified name (FQN) of the node. So, if we don’t define any field mappings, like this:
curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
The indexed documents would be added to the index with fields that look like this:
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Maths",
"exams.test":"term1",
"exams.marks":90},
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Biology",
"exams.test":"term1",
"exams.marks":86}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also brings up another scenario... what do we do if there is an array field, and the type of the elements in the array is a record?
That would be similar to the "exams" array in the above example. With Solr's JSON update handler you would have to say split=/exams and this produces a Solr document for each exam.
I'm actually not sure what Solr does if you left off the split param because then you would have multiple fields with the same name in the same document, like exams.subject would be there twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to replicate the top level field in N nested record of the array and convert into into N solr documents, but this might get complicated if there are more than one levels of nested records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current code, it writes a single solr document for a record and flattens all the nested records in that single solr document.
So if there is an array of nested records it would create multiple fields with the same key in the solr document which would eventually mean that the field would be indexed as multivalued in solr with the assumption that the schema has defined the field to be multivalued else it would fail to index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach that can be taken is to consider the nested records as child documents. Which would mean that every nested record in the array is a child document for the main record in solr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current approach is probably fine for now, maybe we can make a note in the @CapabilityDescription that mentions how nested records are handled, or in the additionalDetails.html you could provide example input and show it would be handled.
Someone can always use ConvertRecord to convert whatever format to JSON, and then use the existing PutSolrContentStream with Solr's splitting if they want to do something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll then just change the field names as discussed before and I'll leave the implementation as it is and give some examples as part of the additionalDetails.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added examples around the behavior of nested records in the additionalDetails.html and I've modified the field names to include the parent field name
.description("Comma-separated list of field names to write") | ||
.required(false) | ||
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since a record reader can dynamically obtain a schema based on the incoming flow file using the schema.name attribute, I think "Fields to Index" should support expression language so each flow file could potentially supply a different set of fields.
final String contentStreamPath = context.getProperty(UPDATE_PATH).evaluateAttributeExpressions(flowFile).getValue(); | ||
final MultiMapSolrParams requestParams = new MultiMapSolrParams(getRequestParams(context, flowFile)); | ||
final RecordReaderFactory readerFactory = context.getProperty(RECORD_READER).asControllerService(RecordReaderFactory.class); | ||
final String fieldsToIndex = context.getProperty(FIELDS_TO_INDEX).getValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on above comment this would need .evaluateAttributeExpressions(flowFile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll make that change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the change to support nifi expressions for the field
@abhinavrohatgi30 Looks like your latest push grabbed a bunch of other folks' commits. Unless @bbende disagrees, I think you're going to need to rebase and repush. |
Yea when you update your branch you should be doing something like the following...
This assumes "upstream" points to either Apache NiFi git repo or Apache NiFi Github. Using rebase will apply all the incoming commits from upstream/master to your branch and then put your commits back on top of that so it looks like yours are always the latest. You then need to force push to your remote branch |
Sorry, instead of doing the force push i resolved conflicts and did a push, can i now do the rebase again on the current commit or will i have to add a new commit inorder to rebase from the master branch? |
I would try doing a rebase against master to see what happens. In the worst case situation you would have to create another branch off latest master, and then individually cherry-pick your commits from this branch over to the new branch, to get rid of those other commits that are in between yours, but only do that if you can't get this branch straightened out. |
76cca2f
to
a04400f
Compare
I'm done with the changes suggested, any other changes that you have in mind? |
I tried to re-base this against master so I could squash it down to a single commit, but the re-base is encountering a lot of conflicts, which really shouldn't be happening because its conflicting with itself. Can you work through getting it down to a single commit? Normally it should just be: Then in the list of commits you choose "s" for all the commits except the top-one, which squashes them all into the top one. Then force push. |
a04400f
to
2f4ea05
Compare
Hi @bbende , I've brought it down to a single commit, can you have a look at it now? |
Thanks, will try to take a look in a few days, unless someone gets to it first. |
Nested records and arrays of nested records are not working correctly... Scenario 1 - Nested Record Schema:
Input:
Result:
Scenario 2 - Array of Records Schema:
Input:
Result: Solr Document with multi-valued field exams where the values are the toString of a MapRecord:
Should have created fields like exams_marks, exams_test, exams_subject. Here is a full template for the two scenarios: There needs to be unit tests that cover both these cases. |
public static final String COLLECTION_PARAM_NAME = "collection"; | ||
public static final String COMMIT_WITHIN_PARAM_NAME = "commitWithin"; | ||
public static final String REPEATING_PARAM_PATTERN = "\\w+\\.\\d+"; | ||
public final ComponentLog logger = getLogger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove this and use getLogger() later in the code, the logger could still be null here
connectionError.set(e); | ||
} | ||
} catch (final IOException | SchemaNotFoundException | MalformedRecordException e) { | ||
logger.error("Could not parse incoming data", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to getLogger().error
|
||
final List<String> fieldList = new ArrayList<>(); | ||
if (!StringUtils.isBlank(fieldsToIndex)) { | ||
fieldList.addAll(Arrays.asList(fieldsToIndex.trim().split("[,]"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only trimming leading/trailing spaces on the entire value, but we should trim each field too in case someone enters "field1, field2" instead of "field1,field2".
</p> | ||
<p> | ||
Example: | ||
<strong>For a record created from the following json:</strong> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we wrap all the JSON examples in pre elements so that they display like code-blocks when viewing the documentation?
/** | ||
* Test for PutSolrRecord Processor | ||
*/ | ||
public class TestPutSolrRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be 3 new tests added...
-
Test what happens when the recordParser throws an exception, just set failAfter on the record parser:
recordParser.failAfter(0);
And ensure that it routes to failure. -
Test a nested record
-
Test an array of nested records
@bbende I'll have a look at this and write test cases accordingly. |
e0dfd21
to
d48ef4b
Compare
…em into Solr as SolrDocuments Adding Test Cases for PutSolrRecord Processor Adding PutSolrRecord Processor in the list of Processors Resolving checkstyle errors Resolving checkstyle errors in test classes Adding License information and additional information about the processor 1. Implementing Batch Indexing 2. Changes for nested records 3. Removing MockRecordParser Fixing bugs with nested records Updating version of dependencies Setting Expression Language Scope
d48ef4b
to
618528d
Compare
Hi, I've looked at the comments and I've made the following changes as part of the latest commit that cover all the comments :
I hope the processor now works as expected, let me know if any further changes are to be made Thanks |
@abhinavrohatgi30 You have a merge conflict in this branch. If you resolve it, I'll help @bbende finish the review. |
@MikeThomsen I'm really sorry, it might take a while, I'm on a vacation and away from my workstation. I'll keep you updated as soon as I am back. Thanks |
@abhinavrohatgi30 While you were away, I merged another Solr-related commit and that's the reason you now have conflicts. |
I was able to resolve the conflicts and everything looks good now, going to merge, thanks! |
@bbende @MikeThomsen Thanks for reviewing the pull request |
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.