
NIFI-5084 Added GenerateRecord processor. #2813

Closed

MikeThomsen wants to merge 5 commits into apache:master from MikeThomsen:NIFI-5084

Conversation

@MikeThomsen
Contributor

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@MikeThomsen
Contributor Author

@joewitt can you review the L&N? I added the new stuff, but wasn't sure whether the existing dependencies that are covered elsewhere (e.g. Jackson and Bouncy Castle) also had to be added to the L&N here.

@MikeThomsen
Contributor Author

@zenfenan can you review?

@MikeThomsen
Contributor Author

As mentioned in the docs, the Confluent lib that I use for this uses extensions to Avro syntax to define the rules.

@zenfenan
Contributor

Sure @MikeThomsen I'll take a look

@joewitt
Contributor

joewitt commented Jun 26, 2018

Mike - I'll look when able. In the meantime, anything present in the resulting NAR(s) needs to be accounted for in the L&N of that NAR. The NAR should probably not be added to the assembly for now since we're space constrained and it is for test purposes; someone could easily add it when needed. Therefore, no need to update the nifi-assembly L&N.

thanks

@MikeThomsen
Contributor Author

@joewitt so here's a thought... do you think it would be feasible to write a deployment process that pushes the extra NARs (e.g. atlas, hive3, and this) to our GitHub repo's "releases" list? Or are there restrictions on space, etc.?

@joewitt
Contributor

joewitt commented Jun 26, 2018

Not sure. But we should just organize a proper effort to get an extension registry in play. It will take some time but with us not being able to make the download larger we should be properly motivated to sort that out.

@ottobackwards
Contributor

@MikeThomsen this is really great. A couple of questions:

  • Why wouldn't you want to have the same schema selection capabilities for both the processor (virtual reader) and the writer?
  • Did you consider creating a Generating Reader Service that could be re-used by this processor and possibly many others at the same time?

@MikeThomsen
Contributor Author

@ottobackwards

  1. I wanted to keep it simple. You can't really wire up the writer without going through most of that, and I wanted to have a simple configuration option for just dumping a schema in for less experienced users who just want to get stuff done.

  2. I think that would make a good follow-on ticket. Some thought would be needed to determine how to control the size of a record set. I could see that causing issues for processors.

static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
.name("generate-record-schema")
.displayName("Schema")
.expressionLanguageSupported(ExpressionLanguageScope.NONE)
Contributor

Any reason not to allow EL here?

Contributor Author

I'd rather start small and expand out because there is a lot that can go wrong here.

.name("generate-record-fixed-size")
.displayName("Fixed Size")
.expressionLanguageSupported(ExpressionLanguageScope.NONE)
.description("")
Contributor

This should be populated with the description of the property. It looks like it is used to determine whether the limit is the exact number of records produced, or (if false) some random number less than or equal to the limit? If so, is it easier for the user to have two properties, or to document a "Number of Records" property with an example of using the random() function in Expression Language?

Contributor Author

I'd rather keep it simple and let users explicitly control it.

limit = context.getProperty(LIMIT).evaluateAttributeExpressions(input).asInteger();
} else {
int ceiling = context.getProperty(LIMIT).evaluateAttributeExpressions(input).asInteger();
limit = new Random().nextInt(ceiling);
Contributor

This makes it possible to generate 0-record files and impossible to generate "LIMIT" number of files. Should it be +1?
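As a standalone illustration of why the +1 matters (plain JDK, not NiFi code; `ceiling` is a stand-in for the evaluated LIMIT property): `Random.nextInt(bound)` returns a value in [0, bound), so without an offset a zero-record file is possible and the configured limit itself can never be produced.

```java
import java.util.Random;

public class LimitDemo {
    public static void main(String[] args) {
        Random random = new Random();
        int ceiling = 5; // stand-in for the evaluated LIMIT property value
        for (int i = 0; i < 10_000; i++) {
            // nextInt(ceiling) yields 0..ceiling-1: zero is possible, ceiling is not
            int broken = random.nextInt(ceiling);
            if (broken < 0 || broken >= ceiling) throw new AssertionError();
            // nextInt(ceiling) + 1 yields 1..ceiling, matching the documented limit
            int fixed = random.nextInt(ceiling) + 1;
            if (fixed < 1 || fixed > ceiling) throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```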

Contributor Author

Fixed.

.required(false)
.addValidator(Validator.VALID)
.build();

Contributor

It would be cool to have a property for "Additional schema resources" or something, so users can refer to files containing words, as in their adjectives-list.json example. Not sure if this is possible, as the library appears to expect the file locations to be relative to the current working directory or something, but it's worth a look.

Contributor Author

The library it uses claims to provide that capability baked in, both as a file reference and as an inline JSON list of acceptable values. I should expand the scope of testing for that.

.required(true)
.build();

static final PropertyDescriptor SCHEMA = new PropertyDescriptor.Builder()
Contributor

I appreciate that this library has a DSL that is a modified version of an Avro schema definition, since users may need the actual Avro schema defined in the writer as well. Does the writer support "Inherit Record Schema" so it can just get the one generated by the library without having to specify it?

Also I appreciate the flexibility of the DSL to be able to generate data of different types, lengths, patterns, etc. On the downside it appears to be more about generating data in the desired structure rather than generating desired data and putting it into the structure, meaning there is no direct support for generating emails, SSNs, phone numbers, etc. Those would have to be done by regex and/or using provided values.

Also, the library generates Avro, which we are currently converting in every case; that seems like an unnecessary step. At the least we may want to call getMimeType() on the writer: if it is Avro and we are inheriting the schema (versus defining it explicitly in the writer), we might be able to skip the "conversion" and write directly to the flow file. Not sure how much of that is available via the API; I'm just saying it's a bummer to have to convert the generated records. What kind of throughput are you seeing when it runs at full speed?

Did you vet other Java libraries for data generation? avro-mocker uses the actual output schema, then the CLI asks questions about the strategy for each field, I wonder if we could leverage that via a separate DSL or user-defined properties. JFairy is more concerned with semantic datatypes, but does not provide a DSL so we would likely have to do something similar to this library in terms of an Avro-schema-based DSL or something even simpler (if possible and prudent). All the libraries I looked at had similar pros/cons, so if we stick with this one I'm fine with that. Would be nice to have more examples in the additional details though, for email addresses, IPs, phone numbers, etc.

Contributor Author

I'll be honest, I didn't deep dive into too many options. So I didn't know about avro-mocker, for instance. Might be worth checking out.

I did check out JFairy and Faker (Java version), and neither of them have particularly good performance. I am kinda thinking about working on that library from Confluent to let it integrate with them optionally.

FWIW, you can use the regexes to generate IP strings, etc. AFAIK, this would do IPv4 addresses (note the quantifier needs an explicit lower bound, i.e. {1,3} rather than {,3}, which Java regex treats as a literal):

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
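As a quick standalone sanity check (plain JDK, not NiFi or library code), a pattern along those lines matches dotted-quad strings, though it does not reject out-of-range octets such as 999:

```java
import java.util.regex.Pattern;

public class IpRegexDemo {
    public static void main(String[] args) {
        // 1-3 digits per octet: accepts syntactically valid dotted quads,
        // but does not enforce the 0-255 range per octet
        Pattern ipv4ish = Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");
        System.out.println(ipv4ish.matcher("192.168.0.1").matches()); // true
        System.out.println(ipv4ish.matcher("999.1.1.1").matches());   // true (out of range, still matches)
        System.out.println(ipv4ish.matcher("192.168.1").matches());   // false (only three octets)
    }
}
```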

for (int x = 0; x < records.size(); x++) {
writer.write(records.get(x));
}
writer.finishRecordSet();
Contributor

As a record-based processor, the WriteResult returned from writer.finishRecordSet() should be used to store an attribute called "record.count", and writer.getMimeType() should be used to set the "mime.type" attribute (and these should be added as WritesAttribute annotations at the processor class definition at the top).
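A rough sketch of that convention, as I understand the NiFi record API (not a drop-in snippet; surrounding session/flow-file handling omitted):

```java
// On the processor class, document the attributes being written:
@WritesAttribute(attribute = "record.count", description = "Number of records written to the flow file")
@WritesAttribute(attribute = "mime.type", description = "MIME type reported by the configured writer")

// After writing, use the WriteResult and the writer's MIME type:
WriteResult writeResult = writer.finishRecordSet();
Map<String, String> attributes = new HashMap<>();
attributes.put("record.count", String.valueOf(writeResult.getRecordCount()));
attributes.put("mime.type", writer.getMimeType());
flowFile = session.putAllAttributes(flowFile, attributes);
```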

out = session.write(out, outputStream -> {
try {
RecordSetWriter writer = writerFactory.createWriter(getLogger(), schema, outputStream);
writer.beginRecordSet();
Contributor Author

If you create a ListRecordSet from records, you can do a single writer.write() and the default implementation will do the loop for you.
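If I'm reading the record utilities right, that suggestion would look roughly like the following (sketch only; `schema` and `records` come from the surrounding processor code):

```java
// Wrap the generated records in a RecordSet and let the writer's
// default implementation handle the iteration.
RecordSet recordSet = new ListRecordSet(schema, records);
writer.write(recordSet);
WriteResult writeResult = writer.finishRecordSet();
```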

@MikeThomsen
Contributor Author

Per the discussion on the mailing list, I am moving the code here and closing this PR.

@MikeThomsen MikeThomsen closed this Sep 1, 2018
@JPercivall
Contributor

@MikeThomsen I'm looking for the discussion on the mailing list on why this was decided to not be merged in and I can't find it. I only see the "How would we handle this?" thread[1]. Could you link to it?

[1] http://apache-nifi-developer-list.39713.n7.nabble.com/How-would-we-handle-this-td19151.html

@MikeThomsen
Contributor Author

@JPercivall

Here you go:

http://apache-nifi-developer-list.39713.n7.nabble.com/Should-I-withdraw-this-PR-td19475.html

I'm fine with reopening this if you want to take over the code review. Folks have been pretty busy and it was starting to get stale, so I pulled the trigger to include it in a larger bundle of testing tools for users.

@MikeThomsen
Contributor Author

Also, the space issue is becoming a serious concern for our binary distributions. So now things that aren't really important for core have to be weighed against that consideration if you want normal users to have an easy time getting to them.

@JPercivall
Contributor

Is there a reason this processor wasn't just a part of the normal record bundle instead of being thought of as part of a larger reprocessing bundle? If it's just another file in that same nar, it wouldn't be a lot of space.

For reference, I have the use case where I have attributes on a flowfile and the content is a binary file. After putting the file to disk, I want to create a nested JSON object to continue processing. I essentially want to create a new record just from EL (same as UpdateRecord but with no reader). I believe GenerateRecord would fill this need?

@MikeThomsen
Contributor Author

I don't think it's compatible with your use case. What I built, based on how I read the ticket, was a processor that uses a Confluent library to generate random data based on an Avro schema. So maybe I'm missing something, but I think that's different from what you need.

@JPercivall
Contributor

Ah yup, totally different lol. I was thinking GenerateFlowFile but with just a writer + dynamic properties to set fields. Sorry for the confusion!

@MikeThomsen
Contributor Author

NP.

@MikeThomsen MikeThomsen deleted the NIFI-5084 branch August 14, 2024 21:14