
NIFI-3724 - Add Put/Fetch Parquet Processors #1712

Closed · wants to merge 2 commits

Conversation

@bbende (Contributor) commented Apr 27, 2017

This PR adds a new nifi-parquet-bundle with PutParquet and FetchParquet processors. These work similarly to PutHDFS and FetchHDFS, but read and write Records instead.

While working on this I needed to reuse portions of the record reader/writer code, and thus refactored some of the project structure, which caused many files to move around.

Summary of changes:

  • Created nifi-parquet-bundle
  • Created nifi-commons/nifi-record to hold domain/API related to records
  • Created nifi-nar-bundles/nifi-extension-utils as a place for utility code specific to extensions
  • Moved nifi-commons/nifi-processor-utils under nifi-extension-utils
  • Moved nifi-commons/nifi-hadoop-utils under nifi-extension-utils
  • Created nifi-extension-utils/nifi-record-utils for utility code related to records

To test the Parquet processors you can create a core-site.xml that points at the local file system and read/write Parquet to local directories:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>file:///</value>
    </property>
</configuration>
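
As a sanity check, here is a minimal sketch (not part of the PR; the file path is hypothetical) of how Hadoop's client API resolves this configuration to the local file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsCheck {
    public static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // hypothetical location of the core-site.xml shown above
        conf.addResource(new Path("/tmp/nifi-parquet/core-site.xml"));
        // with fs.defaultFS set to file:///, the default FileSystem is the local disk
        final FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri()); // prints file:///
    }
}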

…hParquet

- Creating nifi-records-utils to share utility code from record services
- Refactoring Parquet tests to use MockRecordParser and MockRecordWriter
- Refactoring AbstractPutHDFSRecord to use schema access strategy
- Adding custom validate to AbstractPutHDFSRecord and adding handling of UNION types when writing Records as Avro
- Refactoring project structure to get CS API references out of nifi-commons, introducing nifi-extension-utils under nifi-nar-bundles
- Updating abstract put/fetch processors to obtain the WriteResult and update flow file attributes
@alopresto (Contributor) commented:

Reviewing...


// properties
public static final PropertyDescriptor HADOOP_CONFIGURATION_RESOURCES = new PropertyDescriptor.Builder()
        .name("Hadoop Configuration Resources")
        .description("A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop "
                + "will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration.")
        .required(false)
-       .addValidator(createMultipleFilesExistValidator())
+       .addValidator(HadoopValidators.MULTIPLE_FILE_EXISTS_VALIDATOR)

Contributor:
Minor comment -- until I read the source code for this, my interpretation was that this validator ensured that multiple files existed -- i.e. one file provided would fail. Perhaps we can rename this ONE_OR_MORE_FILES_EXIST_VALIDATOR? Not a giant issue but potentially confusing.

public static final String AVRO_SCHEMA_FORMAT = "avro";

public static Schema extractAvroSchema(final RecordSchema recordSchema) throws SchemaNotFoundException {
    final Optional<String> schemaFormatOption = recordSchema.getSchemaFormat();

Contributor:
Should we guard against null here?
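
A guard along those lines might look like this (a sketch, not code from the PR):

public static Schema extractAvroSchema(final RecordSchema recordSchema) throws SchemaNotFoundException {
    if (recordSchema == null) {
        // fail fast with a clear message rather than an NPE on the call below
        throw new IllegalArgumentException("RecordSchema cannot be null");
    }
    final Optional<String> schemaFormatOption = recordSchema.getSchemaFormat();
    // ... remainder unchanged
}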

}

public static DataType determineDataType(final Schema avroSchema) {
    final Type avroType = avroSchema.getType();

Contributor:
Same comment for null check.

"The FlowFile contains 3 Attributes that will be used to lookup a Schema from the configured Schema Registry: 'schema.identifier', 'schema.version', and 'schema.protocol.version'");

public static final PropertyDescriptor SCHEMA_REGISTRY = new PropertyDescriptor.Builder()
        .name("Schema Registry")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Registry").
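
Something like the following, for example (a sketch of the suggested pattern; the exact machine-readable name value is an assumption):

public static final PropertyDescriptor SCHEMA_REGISTRY = new PropertyDescriptor.Builder()
        .name("schema-registry")         // stable identifier used when the flow is serialized
        .displayName("Schema Registry")  // human-facing label shown in the UI
        // ... remaining builder calls unchanged
        .build();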

.build();

public static final PropertyDescriptor SCHEMA_ACCESS_STRATEGY = new PropertyDescriptor.Builder()
        .name("Schema Access Strategy")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Access Strategy").


public static final PropertyDescriptor SCHEMA_NAME = new PropertyDescriptor.Builder()
        .name("Schema Name")
        .description("Specifies the name of the schema to lookup in the Schema Registry property")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Name").

Contributor:
Just to be clear, the only thing that needs to be truly human readable is displayName. However, the name doesn't have to look like something only a computer would like :)

Contributor:
I don't mean it has to be difficult to read, but as this is used as the value identifier when the flow is serialized to XML, some formatters/etc. could break lines on the space or otherwise manipulate it. I think it's safer to avoid spaces (and most of the other examples are formatted-like-this).

recordWriter = createHDFSRecordWriter(context, flowFile, configuration, tempFile, destRecordSchema);

// if we fail to create the RecordReader then we want to route to failure, so we need to
// handle this separately from the other IOExceptions which normally rout to retry

Contributor:
rout -> route.

successFlowFile = session.putAllAttributes(successFlowFile, attributes);

final URI uri = path.toUri();
getLogger().info("Successfully received content from {} for {} in {}", new Object[] {uri, successFlowFile, stopWatch.getDuration()});

Contributor:
Add unit of milliseconds to the log output of the duration.
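
For example (a sketch; NiFi's StopWatch exposes getDuration(TimeUnit), so the unit can be made explicit):

getLogger().info("Successfully received content from {} for {} in {} milliseconds",
        new Object[] {uri, successFlowFile, stopWatch.getDuration(TimeUnit.MILLISECONDS)});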

}
Thread.sleep(200L); // try waiting to let whatever might cause rename failure to resolve
}
if (!renamed) {

Contributor:
My read of this is that if the rename operation fails 10x, the source file is deleted. Is that captured anywhere in the docs/Javadocs, etc.? Would be a little confusing for a user unless the only context for this method is renaming the temporary file to the persistent one.

Contributor Author:
The behavior is that it will try up to 10 times to rename the dot file to the final name, and then if it still hasn't been renamed it will delete the dot file and route the flow file to failure. This behavior came from the existing PutHDFS so I kept the same behavior for consistency, but I can add something to the capability description of PutParquet describing this behavior.

Contributor:
I think a method Javadoc comment explaining that is fine.
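
Such a comment might read (a sketch, not the committed text):

/**
 * Attempts to rename the temporary dot file to its final name, retrying up to 10 times.
 * If the rename still fails after all attempts, the dot file is deleted and the
 * FlowFile is routed to failure.
 */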

@alopresto (Contributor) commented:

I ran contrib-check and all tests; both passed fine. I have minor comments on the code above but nothing serious.

I loaded a template provided by Bryan which generated flowfiles, merged them, and wrote them to Parquet format (on local disk using the core-site.xml referenced above), then fetched those files and wrote them out as CSV.

hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 1s @ 14:59:53 $ ll
total 24
drwxr-xr-x    6 alopresto  staff   204B May  1 14:59 ./
drwxr-xr-x  105 alopresto  staff   3.5K May  1 14:45 ../
-rw-r--r--@   1 alopresto  staff   6.0K May  1 14:55 .DS_Store
-rw-r--r--    1 alopresto  staff   129B May  1 14:41 core-site.xml
drwxr-xr-x    2 alopresto  staff    68B May  1 14:59 csv/
drwxr-xr-x    2 alopresto  staff    68B May  1 14:59 parquet/
hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 9s @ 15:00:03 $ tl
.
├── [6.0K]  .DS_Store
├── [ 129]  core-site.xml
├── [ 238]  csv/
│   ├── [  54]  257951968574779
│   ├── [3.5M]  257962982705055
│   ├── [  54]  257981981063720
│   ├── [3.7M]  257986105785832
│   └── [  54]  258011981257869
└── [ 476]  parquet/
    ├── [  16]  .257951968574779.crc
    ├── [6.5K]  .257962982705055.crc
    ├── [  16]  .257981981063720.crc
    ├── [6.6K]  .257986105785832.crc
    ├── [  16]  .258011981257869.crc
    ├── [6.5K]  .258013234789061.crc
    ├── [ 758]  257951968574779
    ├── [829K]  257962982705055
    ├── [ 758]  257981981063720
    ├── [842K]  257986105785832
    ├── [ 758]  258011981257869
    └── [833K]  258013234789061

2 directories, 19 files
hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 86s @ 15:01:30 $ more csv/258011981257869
name,favorite_number,favorite_color
Bryan,693421,blue

If the displayName comments are fixed, I am +1 and ready to merge. Thanks Bryan.

One minor issue:

  • On template import, the processors which referenced a controller service were invalid. Configuring each (they showed "Incompatible Controller Service Configured") by selecting the same option from the list fixed the issue. This doesn't seem like an issue introduced by any code in this PR, however.

@bbende (Contributor, Author) commented May 1, 2017

Thanks Andy, I just pushed a commit that addresses your comments; we should be good to go. I am going to look into the template issue, but I agree that it is not caused by the changes in this PR.

@alopresto (Contributor) commented:
Thanks. Merging.

@asfgit closed this in 60d88b5 on May 1, 2017
@masterlittle commented:
Hi, does this enable writing Parquet files to an S3 bucket? If not, is there any way I can achieve the same?

@nellashapiro123 commented:
Has anybody been able to use the FetchParquet processor successfully? I am getting a SchemaNotFound exception. I created the file with PutParquet, and Spark can read this Parquet file.

@bbende (Contributor, Author) commented Nov 2, 2017

@nellashapiro123 it would probably be best to ask this on the mailing lists:
https://nifi.apache.org/mailing_lists.html

If you send an email, please provide more info about your flow: which reader and writer FetchParquet is using, what schema access strategy each reader and writer is using, and, if using schema access by name, what the value of the schema.name attribute coming into FetchParquet is.
