NIFI-5084 Added GenerateRecord processor. #2813
MikeThomsen wants to merge 5 commits into apache:master from
Conversation
|
@joewitt can you review the L&N? I added the new stuff, but wasn't sure if the existing dependencies that are covered elsewhere (e.g. Jackson and Bouncy Castle) had to also get added to the L&N here.
|
@zenfenan can you review?
|
As mentioned in the docs, the Confluent lib that I use for this uses extensions to Avro syntax to define the rules. |
|
Sure @MikeThomsen, I'll take a look.
|
Mike - I'll look when able. In the meantime, anything present in the resulting NAR(s) needs to be accounted for in the L&N of that NAR. The NAR should probably not be added to the assembly for now, since we're space constrained and it is for test purposes; someone could easily add it when needed. Therefore, no need to update the nifi-assembly L&N. Thanks.
|
@joewitt so here's a thought... do you think it would be feasible to write a deployment process that pushes the extra NARs (e.g. atlas, hive3, and this) to our GitHub repo's "releases" list? Or are there restrictions on space, etc.?
|
Not sure. But we should just organize a proper effort to get an extension registry in play. It will take some time, but since we're not able to make the download larger, we should be properly motivated to sort that out.
|
@MikeThomsen this is really great. A couple of questions:
|
|
static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
        .name("generate-record-schema")
        .displayName("Schema")
        .expressionLanguageSupported(ExpressionLanguageScope.NONE)

Any reason not to allow EL here?

I'd rather start small and expand out, because there is a lot that can go wrong here.
        .name("generate-record-fixed-size")
        .displayName("Fixed Size")
        .expressionLanguageSupported(ExpressionLanguageScope.NONE)
        .description("")

This should be populated with the description of the property. It looks like it is used to determine whether the limit is the number of records produced or (if false) some random number less than or equal to the limit. If so, is it easier for the user to have two properties, or to document a "Number of Records" property with an example of using the random() function in Expression Language?

I'd rather keep it simple and let users explicitly control it.
    limit = context.getProperty(LIMIT).evaluateAttributeExpressions(input).asInteger();
} else {
    int ceiling = context.getProperty(LIMIT).evaluateAttributeExpressions(input).asInteger();
    limit = new Random().nextInt(ceiling);

This makes it possible to generate flow files with 0 records and impossible to generate "LIMIT" records, since nextInt(ceiling) returns a value in the range [0, ceiling). Should it be +1?
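To illustrate the off-by-one concern, here is a standalone sketch (not the processor's code; the method name and values are made up for illustration). java.util.Random.nextInt(bound) draws uniformly from [0, bound), so adding 1 shifts the range to [1, bound]:

```java
import java.util.Random;

public class RandomLimit {
    // nextInt(ceiling) is uniform over [0, ceiling): it can return 0 and can
    // never return ceiling itself. Adding 1 shifts the range to [1, ceiling],
    // matching "at least one record, up to LIMIT records".
    static int randomLimit(Random random, int ceiling) {
        return random.nextInt(ceiling) + 1;
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(randomLimit(random, 10)); // always 1..10
        }
    }
}
```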
    .required(false)
    .addValidator(Validator.VALID)
    .build();

It would be cool to have a property for "Additional schema resources" or something, so they can refer to files containing words, as in their adjectives-list.json example. Not sure if this is possible, as the library appears to expect the file locations to be relative to the current working directory or something, but worth a look.

The library it uses claims to provide that capability baked in, both as a file reference and as an inline JSON list of acceptable values. I should expand the scope of testing for that.
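For context, a sketch of what that might look like, based on my reading of the confluentinc/avro-random-generator README (the "arg.properties"/"options" annotation names should be double-checked against the library's docs; the field names here are made up):

```json
{
  "type": "record",
  "name": "example",
  "fields": [
    {
      "name": "adjective",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": { "file": "adjectives-list.json", "encoding": "json" }
        }
      }
    },
    {
      "name": "color",
      "type": {
        "type": "string",
        "arg.properties": {
          "options": ["red", "green", "blue"]
        }
      }
    }
  ]
}
```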
    .required(true)
    .build();

static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
I appreciate that this library has a DSL that is a modified version of an Avro schema definition, since users may need the actual Avro schema defined in the writer as well. Does the writer support "Inherit Record Schema", so it can just get the one generated by the library without having to specify it?
I also appreciate the flexibility of the DSL in being able to generate data of different types, lengths, patterns, etc. On the downside, it appears to be more about generating data in the desired structure rather than generating desired data and putting it into the structure, meaning there is no direct support for generating emails, SSNs, phone numbers, etc. Those would have to be done by regex and/or provided values.
Also, the library generates Avro, which we are currently converting in every case; that seems like an unnecessary step. At the least we may want to call getMimeType() on the writer: if it is Avro and we are inheriting the schema (versus defining it explicitly in the writer), we might be able to skip the "conversion" and write directly to the flow file. Not sure how much of that is available via the API; I'm just saying it's a bummer to have to convert the generated records. What kind of throughput are you seeing when it runs at full speed?
Did you vet other Java libraries for data generation? avro-mocker uses the actual output schema, then the CLI asks questions about the strategy for each field; I wonder if we could leverage that via a separate DSL or user-defined properties. JFairy is more concerned with semantic datatypes, but does not provide a DSL, so we would likely have to do something similar to this library in terms of an Avro-schema-based DSL or something even simpler (if possible and prudent). All the libraries I looked at had similar pros/cons, so if we stick with this one I'm fine with that. It would be nice to have more examples in the additional details, though, for email addresses, IPs, phone numbers, etc.
I'll be honest, I didn't deep dive into too many options, so I didn't know about avro-mocker, for instance. Might be worth checking out.
I did check out JFairy and Faker (the Java version), and neither of them has particularly good performance. I am kind of thinking about working on that library from Confluent to let it integrate with them optionally.
FWIW, you can use the regexes to generate IP strings, etc. AFAIK, this would do IPv4 addresses (note the quantifier should be {1,3}, not {,3}):
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
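As a quick sanity check of that pattern (a standalone sketch, not part of the processor), java.util.regex confirms that the {1,3} quantifier matches IPv4-shaped strings. Note it does not reject octets above 255:

```java
import java.util.regex.Pattern;

public class IpPatternCheck {
    // Matches IPv4-shaped strings; does NOT validate octet ranges,
    // so "999.0.0.1" would still match.
    static final Pattern IPV4_SHAPE =
            Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");

    static boolean looksLikeIpv4(String s) {
        return IPV4_SHAPE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIpv4("192.168.0.1")); // true
        System.out.println(looksLikeIpv4("not-an-ip"));   // false
    }
}
```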
for (int x = 0; x < records.size(); x++) {
    writer.write(records.get(x));
}
writer.finishRecordSet();

As a record-based processor, the WriteResult returned from writer.finishRecordSet() should be used to store an attribute called "record.count", and writer.getMimeType() should be used to set the "mime.type" attribute (and these should be added as WritesAttribute annotations on the processor class definition at the top).
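A minimal sketch of the suggested pattern, with hypothetical stand-ins for NiFi's WriteResult and RecordSetWriter (the real interfaces live in org.apache.nifi.serialization; the stub values here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RecordAttributesSketch {
    // Hypothetical stubs for the NiFi API types:
    static class WriteResult {
        private final int recordCount;
        WriteResult(int recordCount) { this.recordCount = recordCount; }
        int getRecordCount() { return recordCount; }
    }
    static class StubWriter {
        WriteResult finishRecordSet() { return new WriteResult(42); }
        String getMimeType() { return "application/json"; }
    }

    // The suggested pattern: derive flow file attributes from the
    // WriteResult and the writer's MIME type, then put them on the flow
    // file (session.putAllAttributes in the real processor).
    static Map<String, String> buildAttributes(WriteResult result, String mimeType) {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("record.count", String.valueOf(result.getRecordCount()));
        attributes.put("mime.type", mimeType);
        return attributes;
    }

    public static void main(String[] args) {
        StubWriter writer = new StubWriter();
        Map<String, String> attrs = buildAttributes(writer.finishRecordSet(), writer.getMimeType());
        System.out.println(attrs);
    }
}
```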
out = session.write(out, outputStream -> {
    try {
        RecordSetWriter writer = writerFactory.createWriter(getLogger(), schema, outputStream);
        writer.beginRecordSet();

If you create a ListRecordSet from records, you can do a single writer.write() and the default implementation will do the loop for you.
|
Per the discussion on the mailing list, I am moving the code here and closing this PR.
|
@MikeThomsen I'm looking for the discussion on the mailing list on why this was decided to not be merged in, and I can't find it. I only see the "How would we handle this?" thread [1]. Could you link to it? [1] http://apache-nifi-developer-list.39713.n7.nabble.com/How-would-we-handle-this-td19151.html
|
Here you go: http://apache-nifi-developer-list.39713.n7.nabble.com/Should-I-withdraw-this-PR-td19475.html I'm fine with reopening this if you want to take over the code review. Folks have been pretty busy and it was starting to get stale, so I pulled the trigger to include it in a larger bundle of testing tools for users.
|
Also, the space issue is becoming a serious concern for our binary distributions. So now things that aren't really important for core have to be weighed against that consideration if you want normal users to have an easy time getting to them.
|
Is there a reason this processor wasn't just part of the normal record bundle instead of being thought of as part of a larger reprocessing bundle? If it's just another file in that same NAR, it wouldn't be a lot of space. For reference, I have a use case where I have attributes on a flow file and the content is a binary file. After putting the file to disk, I want to create a nested JSON object to continue processing. I essentially want to create a new record just from EL (same as UpdateRecord but with no reader). I believe GenerateRecord would fill this need?
|
I don't think it's compatible with your use case. What I built, based on how I read the ticket, was a processor that uses a Confluent library to generate random data based on an Avro schema. So maybe I'm missing something, but I think that's different from what you need.
|
Ah yup, totally different lol. I was thinking GenerateFlowFile but with just a writer + dynamic properties to set fields. Sorry for the confusion!
|
NP.
Thank you for submitting a contribution to Apache NiFi. In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.