
NIFI-4035 Implement record-based Solr processors #2561

Closed
wants to merge 1 commit

Conversation

abhinavrohatgi30
Contributor

@abhinavrohatgi30 abhinavrohatgi30 commented Mar 17, 2018

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@abhinavrohatgi30
Contributor Author

NIFI-4035 Adding a PutSolrRecord Processor that reads NiFi Records and indexes them into Solr as SolrDocuments.

@MikeThomsen MikeThomsen left a comment

Looking good so far. Just wanted to give it some preliminary feedback.

.defaultValue("/update")
.build();

public static final PropertyDescriptor FIELDS_TO_INDEX = new PropertyDescriptor
Contributor

This could get dicey if a user uses embedded records. I know Elasticsearch supports them, and it's my understanding that Solr can too. Might want to think about this, because a user might say they want these 3 top-level fields and these 2 embedded fields and nothing else.

Contributor Author

Solr can do child documents, which, if I understand correctly, is what you meant by embedded records. That is something I haven't implemented in this processor, since I was expecting it to index a single Solr document per record, in which case all the fields would be at the top level. I can look into implementing child documents, but I'm unsure how the child docs would be represented in a record.

Contributor Author

I have added an example in the additionalDetails.html on how to specify fields for nested records.

.description("FlowFiles that failed for any reason other than Solr being unreachable")
.build();

public static final Relationship REL_CONNECTION_FAILURE = new Relationship.Builder()
Contributor

Generally this is covered under FAILURE by other processors. You can always put a failure reason attribute on there and users can use RouteOnAttribute to cover this.

Contributor

The other Solr processors have a connection failure relationship, and it makes it easier to route connection failures back to self and send the other failure relationship somewhere else, because the other failures are likely permanent failures that will never succeed.

@@ -0,0 +1,105 @@
/*
Contributor

See TestPutHBaseRecord for an example of how to use org.apache.nifi.serialization.record.MockRecordParser, which should be able to replace this class.

Contributor

PutMongoRecordIT.testInsertNestedRecords is another good example.

Contributor Author

Sure, I'll look into these tests.

final RecordReader reader = readerFactory.createRecordReader(flowFile, in, getLogger())) {

Record record;
List<SolrInputDocument> inputDocumentList = new LinkedList<>();
Contributor

A big benefit of record processing in NiFi is to pass around flow files with many records in them (thousands or millions) and thus avoid having many small flow files. Given that, we probably want to avoid reading all of the records into a list here, as that would mean the entire content of the flow file would essentially be read into memory.

I would suggest some kind of batch size property, like 500 or 1,000, and every time the batch size is reached, send the batch to Solr and start a new batch.
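A minimal sketch of the batching approach being suggested, assuming a hypothetical BATCH_SIZE property, an already-created SolrClient (solrClient), a target collection name, and a helper toSolrInputDocument() that converts a NiFi Record into a SolrInputDocument (none of these names are from the PR):

    final int batchSize = context.getProperty(BATCH_SIZE).asInteger();
    final List<SolrInputDocument> batch = new ArrayList<>();

    Record record;
    while ((record = reader.nextRecord()) != null) {
        batch.add(toSolrInputDocument(record));    // hypothetical Record -> SolrInputDocument conversion
        if (batch.size() >= batchSize) {
            solrClient.add(collection, batch);     // send the full batch to Solr
            batch.clear();                         // start a new batch
        }
    }
    if (!batch.isEmpty()) {
        solrClient.add(collection, batch);         // flush the final partial batch
    }

This keeps at most one batch of SolrInputDocuments in memory, regardless of how many records the flow file contains.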

Contributor Author

Sure, I'll add a parameter to accept a batch size and modify the processor to index Solr documents in batches.

Contributor Author

I've added a parameter to index documents in batches, although now the whole flow file fails even if just one of its records throws an exception. I understand that is ideal in the case of a Solr connection failure, but in the case of some other exception specific to a record, would it be better to process the records that do not throw an exception? If so, would we still route the flow file to failure since some records threw an exception?

@bbende bbende Mar 22, 2018

We could do more complicated things like try to keep inserting records and write out all the failed records to a new flow file, but in my experience this usually becomes complex and error prone.

In this case there should be very few scenarios where specific records are failing because in order to be read by the record reader they already have to conform to the schema. So every SolrDocument being created will already have the same field names with the same types of values. If there is chance of invalid data that doesn't conform to the schema, then this can be handled before this processor using ValidateRecord.

So long story short, I think just keep it simple, and if anything fails for a reason other than a connection exception, then just route the whole flow file to failure.
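A rough sketch of the routing this settles on, using the relationship names from the other Solr processors (the sendBatchesToSolr() helper is illustrative, not the PR's code):

    try {
        sendBatchesToSolr(context, session, flowFile);       // illustrative helper
        session.transfer(flowFile, REL_SUCCESS);
    } catch (final SolrServerException | IOException e) {
        getLogger().error("Failed to reach Solr", e);
        session.transfer(flowFile, REL_CONNECTION_FAILURE);  // retryable: can be routed back to self
    } catch (final Exception e) {
        getLogger().error("Failed to index records", e);
        session.transfer(flowFile, REL_FAILURE);             // likely a permanent failure
    }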

inputDocument.addField(fieldName,stringValue);
}
break;
case RECORD: {
Contributor

How do nested records end up being represented in the Solr document? Not saying anything is wrong here, just asking to understand how it works.

Let's say we have a person schema with top-level fields for "firstName", "lastName", and "address", and the address field is of type record and has its own fields "street", "city", "zip"...

Does the resulting Solr document contain "firstName", "lastName", "street", "city", "zip"?

Would it make sense to have an option to include the parent field in the field names, so it ends up being "address_street", "address_city", and "address_zip" so that you know where those fields came from?

Contributor Author

Yes, I've implemented it to flatten the nested record into a single Solr document, although I was somewhat unsure whether there would be a need to consider nested records. If there is one, I'll change the field names as you suggested so that one knows where each field came from.
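A rough sketch of that flattening (not the actual SolrUtils code), using the parent-prefixed field names discussed above:

    private static void flatten(final Record record, final String prefix, final SolrInputDocument doc) {
        for (final RecordField field : record.getSchema().getFields()) {
            final String name = prefix.isEmpty()
                    ? field.getFieldName()
                    : prefix + "_" + field.getFieldName();   // e.g. "address_street"
            final Object value = record.getValue(field);
            if (value instanceof Record) {
                flatten((Record) value, name, doc);          // recurse into nested records
            } else {
                doc.addField(name, value);
            }
        }
    }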

Contributor

I think we have to handle it since someone can specify a field name in "fields to index" that could be of type record.

I think it makes sense to have a property like "Nested Field Names" with choices for "Fully Qualified" and "Child Only" (or something like that).

This lines up with how Solr's JSON update works:

https://lucene.apache.org/solr/guide/6_6/transforming-and-indexing-custom-json.html#transforming-and-indexing-custom-json

The part that shows....

The default behavior is to use the fully qualified name (FQN) of the node. So, if we don’t define any field mappings, like this:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'

The indexed documents would be added to the index with fields that look like this:

{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams.subject": "Maths",
  "exams.test": "term1",
  "exams.marks": 90
},
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams.subject": "Biology",
  "exams.test": "term1",
  "exams.marks": 86
}

Contributor

This also brings up another scenario... what do we do if there is an array field, and the type of the elements in the array is a record?

That would be similar to the "exams" array in the above example. With Solr's JSON update handler you would have to say split=/exams and this produces a Solr document for each exam.

I'm actually not sure what Solr does if you left off the split param because then you would have multiple fields with the same name in the same document, like exams.subject would be there twice.

Contributor Author

Do you want me to replicate the top-level fields in each of the N nested records of the array and convert them into N Solr documents? This might get complicated if there is more than one level of nested records.

Contributor Author

With the current code, it writes a single Solr document per record and flattens all the nested records into that single Solr document.
So if there is an array of nested records, it would create multiple fields with the same key in the Solr document, which would mean the field is indexed as multi-valued in Solr, assuming the schema defines the field as multi-valued; otherwise it would fail to index.
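Extending the earlier flatten() sketch, an array whose elements are records could be handled like this (again an illustration, not the PR's code); repeated calls to addField() with the same name are what make the field multi-valued in Solr:

    if (value instanceof Object[]) {
        for (final Object element : (Object[]) value) {
            if (element instanceof Record) {
                flatten((Record) element, name, doc);   // e.g. repeated "exams_marks" values
            } else {
                doc.addField(name, element);            // plain array -> multi-valued field
            }
        }
    }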

Contributor Author

Another approach that could be taken is to treat the nested records as child documents, which would mean that every nested record in the array becomes a child document of the main record in Solr.

Contributor

I think the current approach is probably fine for now. Maybe we can make a note in the @CapabilityDescription that mentions how nested records are handled, or in the additionalDetails.html you could provide example input and show how it would be handled.

Someone can always use ConvertRecord to convert whatever format to JSON, and then use the existing PutSolrContentStream with Solr's splitting if they want to do something like that.

Contributor Author

Ok, I'll then just change the field names as discussed before, leave the implementation as it is, and give some examples as part of the additionalDetails.html.

Contributor Author

I've added examples around the behavior of nested records in the additionalDetails.html, and I've modified the field names to include the parent field name.

.description("Comma-separated list of field names to write")
.required(false)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
Contributor

Since a record reader can dynamically obtain a schema based on the incoming flow file using the schema.name attribute, I think "Fields to Index" should support expression language so each flow file could potentially supply a different set of fields.
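A sketch of what that could look like on the descriptor; depending on the NiFi version this is either expressionLanguageSupported(true) or the ExpressionLanguageScope form shown here (the .name value is an assumption):

    public static final PropertyDescriptor FIELDS_TO_INDEX = new PropertyDescriptor.Builder()
            .name("Fields To Index")
            .description("Comma-separated list of field names to write")
            .required(false)
            .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();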

final String contentStreamPath = context.getProperty(UPDATE_PATH).evaluateAttributeExpressions(flowFile).getValue();
final MultiMapSolrParams requestParams = new MultiMapSolrParams(getRequestParams(context, flowFile));
final RecordReaderFactory readerFactory = context.getProperty(RECORD_READER).asControllerService(RecordReaderFactory.class);
final String fieldsToIndex = context.getProperty(FIELDS_TO_INDEX).getValue();
Contributor

Based on the above comment, this would need .evaluateAttributeExpressions(flowFile).
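That is, the property would be read along these lines:

    final String fieldsToIndex = context.getProperty(FIELDS_TO_INDEX)
            .evaluateAttributeExpressions(flowFile)
            .getValue();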

Contributor Author

Sure, I'll make that change

Contributor Author

I've made the change to support NiFi expression language for the field.

@MikeThomsen
Contributor

@abhinavrohatgi30 Looks like your latest push grabbed a bunch of other folks' commits. Unless @bbende disagrees, I think you're going to need to rebase and repush.

@bbende
Contributor

bbende commented Mar 21, 2018

Yea when you update your branch you should be doing something like the following...

git fetch upstream
git rebase upstream/master

This assumes "upstream" points to either Apache NiFi git repo or Apache NiFi Github.

Using rebase will apply all the incoming commits from upstream/master to your branch and then put your commits back on top of that so it looks like yours are always the latest.

You then need to force push to your remote branch: git push origin your-branch --force

@abhinavrohatgi30
Contributor Author

Sorry, instead of doing the force push I resolved conflicts and did a push. Can I now do the rebase again on the current commit, or will I have to add a new commit in order to rebase from the master branch?

@bbende
Contributor

bbende commented Mar 21, 2018

I would try doing a rebase against master to see what happens. In the worst case situation you would have to create another branch off latest master, and then individually cherry-pick your commits from this branch over to the new branch, to get rid of those other commits that are in between yours, but only do that if you can't get this branch straightened out.

@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Mar 22, 2018

I'm done with the changes suggested, any other changes that you have in mind?

@bbende
Contributor

bbende commented Mar 23, 2018

I tried to rebase this against master so I could squash it down to a single commit, but the rebase is encountering a lot of conflicts, which really shouldn't be happening because it's conflicting with itself. Can you work through getting it down to a single commit?

Normally it should just be:
git rebase -i upstream/master

Then in the list of commits you choose "s" for all the commits except the top one, which squashes them all into the top one. Then force push.

@abhinavrohatgi30
Contributor Author

Hi @bbende , I've brought it down to a single commit, can you have a look at it now?

@bbende
Contributor

bbende commented Mar 28, 2018

Thanks, will try to take a look in a few days, unless someone gets to it first.

@bbende
Contributor

bbende commented Apr 2, 2018

Nested records and arrays of nested records are not working correctly...

Scenario 1 - Nested Record

Schema:

{
    "type": "record",
    "name": "exams",
    "fields" : [
      { "name": "first", "type": "string" },
      { "name": "last", "type": "string" },
      { "name": "grade", "type": "int" },
      {
        "name": "exam",
        "type": {
          "name" : "exam",
          "type" : "record",
          "fields" : [
            { "name": "subject", "type": "string" },
            { "name": "test", "type": "string" },
            { "name": "marks", "type": "int" }
          ]
        }
      }
    ]
}

Input:

{
  "first": "Abhi",
  "last": "R",
  "grade": 8,
  "exam": {
    "subject": "Maths",
    "test" : "term1",
    "marks" : 90
  }
}

Result:

java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at org.apache.nifi.processors.solr.SolrUtils.writeRecord(SolrUtils.java:313)
	at org.apache.nifi.processors.solr.SolrUtils.writeValue(SolrUtils.java:384)
	at org.apache.nifi.processors.solr.SolrUtils.writeRecord(SolrUtils.java:314)
	at org.apache.nifi.processors.solr.PutSolrRecord.onTrigger(PutSolrRecord.java:247)
	at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)

Scenario 2 - Array of Records

Schema:

{
    "type": "record",
    "name": "exams",
    "fields" : [
      { "name": "first", "type": "string" },
      { "name": "last", "type": "string" },
      { "name": "grade", "type": "int" },
      {
        "name": "exams",
        "type": {
          "type" : "array",
          "items" : {
            "name" : "exam",
            "type" : "record",
            "fields" : [
                { "name": "subject", "type": "string" },
                { "name": "test", "type": "string" },
                { "name": "marks", "type": "int" }
            ]
          }
        }
      }
    ]
}

Input:

{
  "first": "Abhi",
  "last": "R",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Physics",
      "test": "term1",
      "marks": 95
    }
  ]
}

Result:

Solr Document with multi-valued field exams where the values are the toString of a MapRecord:

 "exams":["org.apache.nifi.serialization.record.MapRecord:MapRecord[{marks=90, test=term1, subject=Maths}]",
          "org.apache.nifi.serialization.record.MapRecord:MapRecord[{marks=95, test=term1, subject=Physics}]"],

Should have created fields like exams_marks, exams_test, exams_subject.

Here is a full template for the two scenarios:

https://gist.githubusercontent.com/bbende/edc2e7d61db83b29533ac3fc520de30f/raw/8764d50ed5e14d876c53a0b84b3af5741d910b3b/PutSolrRecordTesting.xml

There need to be unit tests that cover both of these cases.

public static final String COLLECTION_PARAM_NAME = "collection";
public static final String COMMIT_WITHIN_PARAM_NAME = "commitWithin";
public static final String REPEATING_PARAM_PATTERN = "\\w+\\.\\d+";
public final ComponentLog logger = getLogger();
Contributor

Need to remove this and use getLogger() later in the code; the logger could still be null here.

connectionError.set(e);
}
} catch (final IOException | SchemaNotFoundException | MalformedRecordException e) {
logger.error("Could not parse incoming data", e);
Contributor

Change to getLogger().error


final List<String> fieldList = new ArrayList<>();
if (!StringUtils.isBlank(fieldsToIndex)) {
fieldList.addAll(Arrays.asList(fieldsToIndex.trim().split("[,]")));
Contributor

This is only trimming leading/trailing spaces on the entire value, but we should trim each field too in case someone enters "field1, field2" instead of "field1,field2".
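One possible way to trim each field name individually (a sketch, not the PR's code):

    final List<String> fieldList = Arrays.stream(fieldsToIndex.split(","))
            .map(String::trim)
            .filter(field -> !field.isEmpty())
            .collect(Collectors.toList());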

</p>
<p>
Example:
<strong>For a record created from the following json:</strong>
Contributor

Can we wrap all the JSON examples in pre elements so that they display like code-blocks when viewing the documentation?

/**
* Test for PutSolrRecord Processor
*/
public class TestPutSolrRecord {
Contributor

I think there should be 3 new tests added...

  1. Test what happens when the recordParser throws an exception, just set failAfter on the record parser:
    recordParser.failAfter(0);
    And ensure that it routes to failure.

  2. Test a nested record

  3. Test an array of nested records
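For the first of those, a rough sketch following the MockRecordParser/TestRunner pattern used in TestPutHBaseRecord, assuming a TestRunner for PutSolrRecord is already set up against a test Solr instance (the RECORD_READER and REL_FAILURE names are assumptions about the processor's constants):

    final MockRecordParser recordParser = new MockRecordParser();
    recordParser.addSchemaField("id", RecordFieldType.INT);
    recordParser.addRecord(1);
    recordParser.failAfter(0);                       // make the reader fail on the first record

    runner.addControllerService("parser", recordParser);
    runner.enableControllerService(recordParser);
    runner.setProperty(PutSolrRecord.RECORD_READER, "parser");

    runner.enqueue(new byte[0]);
    runner.run();
    runner.assertAllFlowFilesTransferred(PutSolrRecord.REL_FAILURE, 1);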

@abhinavrohatgi30
Contributor Author

@bbende I'll have a look at this and write test cases accordingly.

@abhinavrohatgi30 abhinavrohatgi30 force-pushed the nifi-4035 branch 2 times, most recently from e0dfd21 to d48ef4b on April 9, 2018 21:11
NIFI-4035 Adding a PutSolrRecord Processor that reads NiFi Records and indexes them into Solr as SolrDocuments

Adding Test Cases for PutSolrRecord Processor

Adding PutSolrRecord Processor in the list of Processors

Resolving checkstyle errors

Resolving checkstyle errors in test classes

Adding License information and additional information about the processor

1. Implementing Batch Indexing 2. Changes for nested records 3. Removing MockRecordParser

Fixing bugs with nested records

Updating version of dependencies

Setting Expression Language Scope
@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Apr 9, 2018

Hi, I've looked at the comments and I've made the following changes as part of the latest commit, which cover all of the comments:

  1. Fixed the issue with Nested Records (The issue came up because of the change in field names in the previous commit)

  2. Fixed the issue with Array of Records (It was generating an Object[] as opposed to a Record[] that I was expecting and as a result was storing the string representation of a Record)

  3. Trimming field names individually

  4. Adding test cases for nested records, arrays of records, and record parser failure

  5. Using getLogger() later in the code

  6. Wrapping the JSON examples in the additionalDetails.html in <pre> tags

I hope the processor now works as expected; let me know if any further changes need to be made.

Thanks

@MikeThomsen
Contributor

@abhinavrohatgi30 You have a merge conflict in this branch. If you resolve it, I'll help @bbende finish the review.

@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Apr 20, 2018

@MikeThomsen I'm really sorry, it might take a while; I'm on vacation and away from my workstation. I'll keep you updated as soon as I am back.
The last I remember is that when I pushed my changes there weren't any conflicts, so this seems to have come up between the day I pushed and today. I'll try to resolve it as soon as I am back.

Thanks

@MikeThomsen
Contributor

@abhinavrohatgi30 While you were away, I merged another Solr-related commit and that's the reason you now have conflicts.

@bbende
Contributor

bbende commented Apr 24, 2018

I was able to resolve the conflicts and everything looks good now, going to merge, thanks!

@asfgit asfgit closed this in e3f4720 Apr 24, 2018
@abhinavrohatgi30
Contributor Author

@bbende @MikeThomsen Thanks for reviewing the pull request
