
NIFI-4035 Implement record-based Solr processors #2561

Closed
wants to merge 1 commit

Conversation

abhinavrohatgi30
Contributor

@abhinavrohatgi30 abhinavrohatgi30 commented Mar 17, 2018

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@abhinavrohatgi30
Contributor Author

NIFI-4035 Adding a PutSolrRecord Processor that reads NiFi Records and indexes them into Solr as SolrDocuments.

@MikeThomsen MikeThomsen left a comment

Looking good so far. Just wanted to give it some preliminary feedback.

.defaultValue("/update")
.build();

public static final PropertyDescriptor FIELDS_TO_INDEX = new PropertyDescriptor
Contributor

This could get dicey if a user uses embedded records. I know Elasticsearch supports them, and it's my understanding that Solr can too. Might want to think about this, because a user might say they want these 3 top-level fields and these 2 embedded fields and nothing else.

Contributor Author

Solr can do child documents, which, if I understand correctly, is what you meant by embedded records. That is something I haven't implemented in this processor, since I was expecting it to index a single Solr document per record, in which case all the fields would be at the top level. I can look into implementing child documents, but I'm unsure how the child docs would be represented in a record.

Contributor Author

I have added an example in the additionalDetails.html on how to specify fields for nested records.

.description("FlowFiles that failed for any reason other than Solr being unreachable")
.build();

public static final Relationship REL_CONNECTION_FAILURE = new Relationship.Builder()
Contributor

Generally this is covered under FAILURE by other processors. You can always put a failure reason attribute on there and users can use RouteOnAttribute to cover this.

Contributor

The other Solr processors have a connection failure relationship, and it makes it easier to route connection failures back to self and send the other failure relationship somewhere else, because the other failures are likely permanent failures that will never succeed.

@@ -0,0 +1,105 @@
/*
Contributor

See TestPutHBaseRecord for an example of how to use org.apache.nifi.serialization.record.MockRecordParser, which should be able to replace this class.

Contributor

PutMongoRecordIT.testInsertNestedRecords is another good example.

Contributor Author

Sure, I'll look into these tests.

final RecordReader reader = readerFactory.createRecordReader(flowFile, in, getLogger())) {

Record record;
List<SolrInputDocument> inputDocumentList = new LinkedList<>();
Contributor

A big benefit of record processing in NiFi is to pass around flow files with many records in them (thousands or millions) and thus avoid having many small flow files. Given that, we probably want to avoid reading all of the records into a list here, as that would mean the entire content of the flow file would essentially be read into memory.

I would suggest some kind of batch size property, like 500 or 1,000, and every time the batch size is reached, send the batch to Solr and start a new batch.
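A minimal sketch of the batching approach being suggested, assuming a hypothetical BATCH_SIZE property, an already-created SolrClient (solrClient), a target collection name, and a helper toSolrInputDocument() that converts a NiFi Record into a SolrInputDocument (none of these names are from the PR):

    final int batchSize = context.getProperty(BATCH_SIZE).asInteger();
    final List<SolrInputDocument> batch = new ArrayList<>();

    Record record;
    while ((record = reader.nextRecord()) != null) {
        batch.add(toSolrInputDocument(record));    // hypothetical Record -> SolrInputDocument conversion
        if (batch.size() >= batchSize) {
            solrClient.add(collection, batch);     // send the full batch to Solr
            batch.clear();                         // start a new batch
        }
    }
    if (!batch.isEmpty()) {
        solrClient.add(collection, batch);         // flush the final partial batch
    }

This keeps at most one batch of SolrInputDocuments in memory, regardless of how many records the flow file contains.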

Contributor Author

Sure, I'll add a parameter to accept a batch size and modify the processor to index Solr documents in batches.

Contributor Author

I've added a parameter to index documents in batches, although now the whole flow file fails even if just one of its records throws an exception. I understand that is ideal in the case of a Solr connection failure, but in the case of some other exception specific to a record, would it be better to process the records that do not throw an exception? If so, would we still route the flow file to failure since some records threw an exception?

@bbende bbende Mar 22, 2018

We could do more complicated things like try to keep inserting records and write out all the failed records to a new flow file, but in my experience this usually becomes complex and error prone.

In this case there should be very few scenarios where specific records are failing because in order to be read by the record reader they already have to conform to the schema. So every SolrDocument being created will already have the same field names with the same types of values. If there is chance of invalid data that doesn't conform to the schema, then this can be handled before this processor using ValidateRecord.

So long story short, I think just keep it simple, and if anything fails for a reason other than a connection exception, then just route the whole flow file to failure.
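A rough sketch of the routing this settles on, using the relationship names from the other Solr processors (the sendBatchesToSolr() helper is illustrative, not the PR's code):

    try {
        sendBatchesToSolr(context, session, flowFile);       // illustrative helper
        session.transfer(flowFile, REL_SUCCESS);
    } catch (final SolrServerException | IOException e) {
        getLogger().error("Failed to reach Solr", e);
        session.transfer(flowFile, REL_CONNECTION_FAILURE);  // retryable: can be routed back to self
    } catch (final Exception e) {
        getLogger().error("Failed to index records", e);
        session.transfer(flowFile, REL_FAILURE);             // likely a permanent failure
    }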

inputDocument.addField(fieldName,stringValue);
}
break;
case RECORD: {
Contributor

How do nested records end up being represented in the Solr document? Not saying anything is wrong here, just asking to understand how it works.

Let's say we have a person schema with top-level fields for "firstName", "lastName", and "address", and the address field is of type record and has its own fields "street", "city", "zip"...

Does the resulting Solr document contain "firstName", "lastName", "street", "city", "zip"?

Would it make sense to have an option to include the parent field in the field names, so it ends up being "address_street", "address_city", and "address_zip" so that you know where those fields came from?

Contributor Author

Yes, I've implemented it to flatten the nested record into a single Solr document, although I was somewhat unsure whether there would be a need to consider nested records. If there is one, I'll change the field names as you suggested so that one knows where each field came from.
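A rough sketch of that flattening (not the actual SolrUtils code), using the parent-prefixed field names discussed above:

    private static void flatten(final Record record, final String prefix, final SolrInputDocument doc) {
        for (final RecordField field : record.getSchema().getFields()) {
            final String name = prefix.isEmpty()
                    ? field.getFieldName()
                    : prefix + "_" + field.getFieldName();   // e.g. "address_street"
            final Object value = record.getValue(field);
            if (value instanceof Record) {
                flatten((Record) value, name, doc);          // recurse into nested records
            } else {
                doc.addField(name, value);
            }
        }
    }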

Contributor

I think we have to handle it since someone can specify a field name in "fields to index" that could be of type record.

I think it makes sense to have a property like "Nested Field Names" with choices for "Fully Qualified" and "Child Only" (or something like that).

This lines up with how Solr's JSON update works:

https://lucene.apache.org/solr/guide/6_6/transforming-and-indexing-custom-json.html#transforming-and-indexing-custom-json

The part that shows....

The default behavior is to use the fully qualified name (FQN) of the node. So, if we don’t define any field mappings, like this:

curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'

The indexed documents would be added to the index with fields that look like this:

{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams.subject": "Maths",
  "exams.test": "term1",
  "exams.marks": 90
},
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams.subject": "Biology",
  "exams.test": "term1",
  "exams.marks": 86
}

Contributor

This also brings up another scenario... what do we do if there is an array field, and the type of the elements in the array is a record?

That would be similar to the "exams" array in the above example. With Solr's JSON update handler you would have to say split=/exams and this produces a Solr document for each exam.

I'm actually not sure what Solr does if you left off the split param because then you would have multiple fields with the same name in the same document, like exams.subject would be there twice.

Contributor Author

Do you want me to replicate the top-level fields in each of the N nested records of the array and convert them into N Solr documents? This might get complicated if there is more than one level of nested records.

Contributor Author

With the current code, it writes a single Solr document per record and flattens all the nested records into that single Solr document.
So if there is an array of nested records, it would create multiple fields with the same key in the Solr document, which would mean the field is indexed as multi-valued in Solr, assuming the schema defines the field as multi-valued; otherwise it would fail to index.
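Extending the earlier flatten() sketch, an array whose elements are records could be handled like this (again an illustration, not the PR's code); repeated calls to addField() with the same name are what make the field multi-valued in Solr:

    if (value instanceof Object[]) {
        for (final Object element : (Object[]) value) {
            if (element instanceof Record) {
                flatten((Record) element, name, doc);   // e.g. repeated "exams_marks" values
            } else {
                doc.addField(name, element);            // plain array -> multi-valued field
            }
        }
    }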

Contributor Author

Another approach that could be taken is to treat the nested records as child documents, which would mean that every nested record in the array becomes a child document of the main record in Solr.

Contributor

I think the current approach is probably fine for now. Maybe we can make a note in the @CapabilityDescription that mentions how nested records are handled, or in the additionalDetails.html you could provide example input and show how it would be handled.

Someone can always use ConvertRecord to convert whatever format to JSON, and then use the existing PutSolrContentStream with Solr's splitting if they want to do something like that.

Contributor Author

Ok, I'll then just change the field names as discussed before, leave the implementation as it is, and give some examples as part of the additionalDetails.html.

Contributor Author

I've added examples around the behavior of nested records in the additionalDetails.html, and I've modified the field names to include the parent field name.

.description("Comma-separated list of field names to write")
.required(false)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.build();
Contributor

Since a record reader can dynamically obtain a schema based on the incoming flow file using the schema.name attribute, I think "Fields to Index" should support expression language so each flow file could potentially supply a different set of fields.
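A sketch of what that could look like on the descriptor; depending on the NiFi version this is either expressionLanguageSupported(true) or the ExpressionLanguageScope form shown here (the .name value is an assumption):

    public static final PropertyDescriptor FIELDS_TO_INDEX = new PropertyDescriptor.Builder()
            .name("Fields To Index")
            .description("Comma-separated list of field names to write")
            .required(false)
            .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();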

final String contentStreamPath = context.getProperty(UPDATE_PATH).evaluateAttributeExpressions(flowFile).getValue();
final MultiMapSolrParams requestParams = new MultiMapSolrParams(getRequestParams(context, flowFile));
final RecordReaderFactory readerFactory = context.getProperty(RECORD_READER).asControllerService(RecordReaderFactory.class);
final String fieldsToIndex = context.getProperty(FIELDS_TO_INDEX).getValue();
Contributor

Based on the above comment, this would need .evaluateAttributeExpressions(flowFile).
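That is, the property would be read along these lines:

    final String fieldsToIndex = context.getProperty(FIELDS_TO_INDEX)
            .evaluateAttributeExpressions(flowFile)
            .getValue();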

Contributor Author

Sure, I'll make that change

Contributor Author

I've made the change to support NiFi expression language for the field.

@MikeThomsen
Contributor

@abhinavrohatgi30 Looks like your latest push grabbed a bunch of other folks' commits. Unless @bbende disagrees, I think you're going to need to rebase and repush.

@bbende
Contributor

bbende commented Mar 21, 2018

Yea when you update your branch you should be doing something like the following...

git fetch upstream
git rebase upstream/master

This assumes "upstream" points to either Apache NiFi git repo or Apache NiFi Github.

Using rebase will apply all the incoming commits from upstream/master to your branch and then put your commits back on top of that so it looks like yours are always the latest.

You then need to force push to your remote branch: git push origin your-branch --force

@abhinavrohatgi30
Contributor Author

Sorry, instead of doing the force push I resolved conflicts and did a push. Can I now do the rebase again on the current commit, or will I have to add a new commit in order to rebase from the master branch?

@bbende
Contributor

bbende commented Mar 21, 2018

I would try doing a rebase against master to see what happens. In the worst case situation you would have to create another branch off latest master, and then individually cherry-pick your commits from this branch over to the new branch, to get rid of those other commits that are in between yours, but only do that if you can't get this branch straightened out.

@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Mar 22, 2018

I'm done with the changes suggested, any other changes that you have in mind?

@bbende
Contributor

bbende commented Mar 23, 2018

I tried to rebase this against master so I could squash it down to a single commit, but the rebase is encountering a lot of conflicts, which really shouldn't be happening because it's conflicting with itself. Can you work through getting it down to a single commit?

Normally it should just be:
git rebase -i upstream/master

Then in the list of commits you choose "s" for all the commits except the top one, which squashes them all into the top one. Then force push.

@abhinavrohatgi30
Contributor Author

Hi @bbende , I've brought it down to a single commit, can you have a look at it now?

@bbende
Contributor

bbende commented Mar 28, 2018

Thanks, will try to take a look in a few days, unless someone gets to it first.

@bbende
Contributor

bbende commented Apr 2, 2018

Nested records and arrays of nested records are not working correctly...

Scenario 1 - Nested Record

Schema:

{
    "type": "record",
    "name": "exams",
    "fields" : [
      { "name": "first", "type": "string" },
      { "name": "last", "type": "string" },
      { "name": "grade", "type": "int" },
      {
        "name": "exam",
        "type": {
          "name" : "exam",
          "type" : "record",
          "fields" : [
            { "name": "subject", "type": "string" },
            { "name": "test", "type": "string" },
            { "name": "marks", "type": "int" }
          ]
        }
      }
    ]
}

Input:

{
  "first": "Abhi",
  "last": "R",
  "grade": 8,
  "exam": {
    "subject": "Maths",
    "test" : "term1",
    "marks" : 90
  }
}

Result:

java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at org.apache.nifi.processors.solr.SolrUtils.writeRecord(SolrUtils.java:313)
	at org.apache.nifi.processors.solr.SolrUtils.writeValue(SolrUtils.java:384)
	at org.apache.nifi.processors.solr.SolrUtils.writeRecord(SolrUtils.java:314)
	at org.apache.nifi.processors.solr.PutSolrRecord.onTrigger(PutSolrRecord.java:247)
	at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)

Scenario 2 - Array of Records

Schema:

{
    "type": "record",
    "name": "exams",
    "fields" : [
      { "name": "first", "type": "string" },
      { "name": "last", "type": "string" },
      { "name": "grade", "type": "int" },
      {
        "name": "exams",
        "type": {
          "type" : "array",
          "items" : {
            "name" : "exam",
            "type" : "record",
            "fields" : [
                { "name": "subject", "type": "string" },
                { "name": "test", "type": "string" },
                { "name": "marks", "type": "int" }
            ]
          }
        }
      }
    ]
}

Input:

{
  "first": "Abhi",
  "last": "R",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Physics",
      "test": "term1",
      "marks": 95
    }
  ]
}

Result:

Solr Document with multi-valued field exams where the values are the toString of a MapRecord:

 "exams":["org.apache.nifi.serialization.record.MapRecord:MapRecord[{marks=90, test=term1, subject=Maths}]",
          "org.apache.nifi.serialization.record.MapRecord:MapRecord[{marks=95, test=term1, subject=Physics}]"],

Should have created fields like exams_marks, exams_test, exams_subject.

Here is a full template for the two scenarios:

https://gist.githubusercontent.com/bbende/edc2e7d61db83b29533ac3fc520de30f/raw/8764d50ed5e14d876c53a0b84b3af5741d910b3b/PutSolrRecordTesting.xml

There need to be unit tests that cover both of these cases.

public static final String COLLECTION_PARAM_NAME = "collection";
public static final String COMMIT_WITHIN_PARAM_NAME = "commitWithin";
public static final String REPEATING_PARAM_PATTERN = "\\w+\\.\\d+";
public final ComponentLog logger = getLogger();
Contributor

Need to remove this and use getLogger() later in the code; the logger could still be null here.

connectionError.set(e);
}
} catch (final IOException | SchemaNotFoundException | MalformedRecordException e) {
logger.error("Could not parse incoming data", e);
Contributor

Change to getLogger().error


final List<String> fieldList = new ArrayList<>();
if (!StringUtils.isBlank(fieldsToIndex)) {
fieldList.addAll(Arrays.asList(fieldsToIndex.trim().split("[,]")));
Contributor

This is only trimming leading/trailing spaces on the entire value, but we should trim each field too in case someone enters "field1, field2" instead of "field1,field2".
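One possible way to trim each field name individually (a sketch, not the PR's code):

    final List<String> fieldList = Arrays.stream(fieldsToIndex.split(","))
            .map(String::trim)
            .filter(field -> !field.isEmpty())
            .collect(Collectors.toList());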

</p>
<p>
Example:
<strong>For a record created from the following json:</strong>
Contributor

Can we wrap all the JSON examples in pre elements so that they display like code-blocks when viewing the documentation?

/**
* Test for PutSolrRecord Processor
*/
public class TestPutSolrRecord {
Contributor

I think there should be 3 new tests added...

  1. Test what happens when the recordParser throws an exception, just set failAfter on the record parser:
    recordParser.failAfter(0);
    And ensure that it routes to failure.

  2. Test a nested record

  3. Test an array of nested records
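For the first of those, a rough sketch following the MockRecordParser/TestRunner pattern used in TestPutHBaseRecord, assuming a TestRunner for PutSolrRecord is already set up against a test Solr instance (the RECORD_READER and REL_FAILURE names are assumptions about the processor's constants):

    final MockRecordParser recordParser = new MockRecordParser();
    recordParser.addSchemaField("id", RecordFieldType.INT);
    recordParser.addRecord(1);
    recordParser.failAfter(0);                       // make the reader fail on the first record

    runner.addControllerService("parser", recordParser);
    runner.enableControllerService(recordParser);
    runner.setProperty(PutSolrRecord.RECORD_READER, "parser");

    runner.enqueue(new byte[0]);
    runner.run();
    runner.assertAllFlowFilesTransferred(PutSolrRecord.REL_FAILURE, 1);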

@abhinavrohatgi30
Contributor Author

@bbende I'll have a look at this and write test cases accordingly.

@abhinavrohatgi30 abhinavrohatgi30 force-pushed the nifi-4035 branch 2 times, most recently from e0dfd21 to d48ef4b on April 9, 2018 21:11
NIFI-4035 Adding a PutSolrRecord Processor that reads NiFi Records and indexes them into Solr as SolrDocuments

Adding Test Cases for PutSolrRecord Processor

Adding PutSolrRecord Processor in the list of Processors

Resolving checkstyle errors

Resolving checkstyle errors in test classes

Adding License information and additional information about the processor

1. Implementing Batch Indexing 2. Changes for nested records 3. Removing MockRecordParser

Fixing bugs with nested records

Updating version of dependencies

Setting Expression Language Scope
@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Apr 9, 2018

Hi, I've looked at the comments and I've made the following changes as part of the latest commit, which cover all of the comments:

  1. Fixed the issue with Nested Records (The issue came up because of the change in field names in the previous commit)

  2. Fixed the issue with Array of Records (It was generating an Object[] as opposed to a Record[] that I was expecting and as a result was storing the string representation of a Record)

  3. Trimming field names individually

  4. Adding test cases for nested records, arrays of records, and record parser failure

  5. Using getLogger() later in the code

  6. Wrapping the JSON examples in the additionalDetails.html in <pre> tags

I hope the processor now works as expected; let me know if any further changes need to be made.

Thanks

@MikeThomsen
Contributor

@abhinavrohatgi30 You have a merge conflict in this branch. If you resolve it, I'll help @bbende finish the review.

@abhinavrohatgi30
Contributor Author

abhinavrohatgi30 commented Apr 20, 2018

@MikeThomsen I'm really sorry, it might take a while; I'm on vacation and away from my workstation. I'll keep you updated as soon as I am back.
The last I remember is that when I pushed my changes there weren't any conflicts, so this seems to have come up between the day I pushed and today. I'll try to resolve it as soon as I am back.

Thanks

@MikeThomsen
Contributor

@abhinavrohatgi30 While you were away, I merged another Solr-related commit and that's the reason you now have conflicts.

@bbende
Contributor

bbende commented Apr 24, 2018

I was able to resolve the conflicts and everything looks good now, going to merge, thanks!

@asfgit asfgit closed this in e3f4720 Apr 24, 2018
@abhinavrohatgi30
Contributor Author

@bbende @MikeThomsen Thanks for reviewing the pull request
