
NIFI-3724 - Add Put/Fetch Parquet Processors #1712

Closed · wants to merge 2 commits

Conversation

@bbende (Contributor) commented Apr 27, 2017

This PR adds a new nifi-parquet-bundle with PutParquet and FetchParquet processors. These work similarly to PutHDFS and FetchHDFS, but read and write Records instead.

While working on this I needed to reuse portions of the record reader/writer code, and thus refactored some of the project structure, which caused many files to move around.

Summary of changes:

  • Created nifi-parquet-bundle
  • Created nifi-commons/nifi-record to hold domain/API related to records
  • Created nifi-nar-bundles/nifi-extension-utils as a place for utility code specific to extensions
  • Moved nifi-commons/nifi-processor-utils under nifi-extension-utils
  • Moved nifi-commons/nifi-hadoop-utils under nifi-extension-utils
  • Created nifi-extension-utils/nifi-record-utils for utility code related to records

To test the Parquet processors you can create a core-site.xml that points at the local file system and read/write Parquet to local directories:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>file:///</value>
    </property>
</configuration>
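
As a sanity check, here is a minimal sketch (not part of the PR; the file path is hypothetical) of how Hadoop's client API resolves this configuration to the local file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsCheck {
    public static void main(final String[] args) throws Exception {
        final Configuration conf = new Configuration();
        // hypothetical location of the core-site.xml shown above
        conf.addResource(new Path("/tmp/nifi-parquet/core-site.xml"));
        // with fs.defaultFS set to file:///, the default FileSystem is the local disk
        final FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri()); // prints file:///
    }
}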

…hParquet

- Creating nifi-records-utils to share utility code from record services
- Refactoring Parquet tests to use MockRecordParser and MockRecordWriter
- Refactoring AbstractPutHDFSRecord to use schema access strategy
- Adding custom validate to AbstractPutHDFSRecord and adding handling of UNION types when writing Records as Avro
- Refactoring project structure to get CS API references out of nifi-commons, introducing nifi-extension-utils under nifi-nar-bundles
- Updating abstract put/fetch processors to obtain the WriteResult and update flow file attributes
@alopresto (Contributor) commented:

Reviewing...


// properties
public static final PropertyDescriptor HADOOP_CONFIGURATION_RESOURCES = new PropertyDescriptor.Builder()
        .name("Hadoop Configuration Resources")
        .description("A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop "
                + "will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration.")
        .required(false)
-       .addValidator(createMultipleFilesExistValidator())
+       .addValidator(HadoopValidators.MULTIPLE_FILE_EXISTS_VALIDATOR)

Contributor:
Minor comment -- until I read the source code for this, my interpretation was that this validator ensured that multiple files existed -- i.e. one file provided would fail. Perhaps we can rename this ONE_OR_MORE_FILES_EXIST_VALIDATOR? Not a giant issue but potentially confusing.

public static final String AVRO_SCHEMA_FORMAT = "avro";

public static Schema extractAvroSchema(final RecordSchema recordSchema) throws SchemaNotFoundException {
    final Optional<String> schemaFormatOption = recordSchema.getSchemaFormat();

Contributor:
Should we guard against null here?
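
A guard along those lines might look like this (a sketch, not code from the PR):

public static Schema extractAvroSchema(final RecordSchema recordSchema) throws SchemaNotFoundException {
    if (recordSchema == null) {
        // fail fast with a clear message rather than an NPE on the call below
        throw new IllegalArgumentException("RecordSchema cannot be null");
    }
    final Optional<String> schemaFormatOption = recordSchema.getSchemaFormat();
    // ... remainder unchanged
}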

}

public static DataType determineDataType(final Schema avroSchema) {
    final Type avroType = avroSchema.getType();

Contributor:
Same comment for null check.

"The FlowFile contains 3 Attributes that will be used to lookup a Schema from the configured Schema Registry: 'schema.identifier', 'schema.version', and 'schema.protocol.version'");

public static final PropertyDescriptor SCHEMA_REGISTRY = new PropertyDescriptor.Builder()
        .name("Schema Registry")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Registry").
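
Something like the following, for example (a sketch of the suggested pattern; the exact machine-readable name value is an assumption):

public static final PropertyDescriptor SCHEMA_REGISTRY = new PropertyDescriptor.Builder()
        .name("schema-registry")         // stable identifier used when the flow is serialized
        .displayName("Schema Registry")  // human-facing label shown in the UI
        // ... remaining builder calls unchanged
        .build();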

.build();

public static final PropertyDescriptor SCHEMA_ACCESS_STRATEGY = new PropertyDescriptor.Builder()
        .name("Schema Access Strategy")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Access Strategy").


public static final PropertyDescriptor SCHEMA_NAME = new PropertyDescriptor.Builder()
        .name("Schema Name")
        .description("Specifies the name of the schema to lookup in the Schema Registry property")

Contributor:
Please use a name value without a space and provide a displayName field with human-facing value (i.e. "Schema Name").

Contributor:
Just to be clear, the only thing that needs to be truly human readable is displayName. However, the name doesn't have to look like something only a computer would like :)

Contributor:
I don't mean it has to be difficult to read, but as this is used as the value identifier when the flow is serialized to XML, some formatters/etc. could break lines on the space or otherwise manipulate it. I think it's safer to avoid spaces (and most of the other examples are formatted-like-this).

recordWriter = createHDFSRecordWriter(context, flowFile, configuration, tempFile, destRecordSchema);

// if we fail to create the RecordReader then we want to route to failure, so we need to
// handle this separately from the other IOExceptions which normally rout to retry

Contributor:
rout -> route.

successFlowFile = session.putAllAttributes(successFlowFile, attributes);

final URI uri = path.toUri();
getLogger().info("Successfully received content from {} for {} in {}", new Object[] {uri, successFlowFile, stopWatch.getDuration()});

Contributor:
Add unit of milliseconds to the log output of the duration.
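
For example (a sketch; NiFi's StopWatch exposes getDuration(TimeUnit), so the unit can be made explicit):

getLogger().info("Successfully received content from {} for {} in {} milliseconds",
        new Object[] {uri, successFlowFile, stopWatch.getDuration(TimeUnit.MILLISECONDS)});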

}
Thread.sleep(200L); // try waiting to let whatever might cause rename failure to resolve
}
if (!renamed) {

Contributor:
My read of this is that if the rename operation fails 10x, the source file is deleted. Is that captured anywhere in the docs/Javadocs, etc.? Would be a little confusing for a user unless the only context for this method is renaming the temporary file to the persistent one.

Contributor Author:
The behavior is that it will try up to 10 times to rename the dot file to the final name, and then if it still hasn't been renamed it will delete the dot file and route the flow file to failure. This behavior came from the existing PutHDFS so I kept the same behavior for consistency, but I can add something to the capability description of PutParquet describing this behavior.

Contributor:
I think a method Javadoc comment explaining that is fine.
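
Such a comment might read (a sketch, not the committed text):

/**
 * Attempts to rename the temporary dot file to its final name, retrying up to 10 times.
 * If the rename still fails after all attempts, the dot file is deleted and the
 * FlowFile is routed to failure.
 */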

@alopresto (Contributor) commented:

I ran contrib-check and all tests; both passed fine. I have minor comments on the code above but nothing serious.

I loaded a template provided by Bryan which generated flowfiles, merged them, and wrote them to Parquet format (on local disk using the core-site.xml referenced above), then fetched those files and wrote them out as CSV.

hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 1s @ 14:59:53 $ ll
total 24
drwxr-xr-x    6 alopresto  staff   204B May  1 14:59 ./
drwxr-xr-x  105 alopresto  staff   3.5K May  1 14:45 ../
-rw-r--r--@   1 alopresto  staff   6.0K May  1 14:55 .DS_Store
-rw-r--r--    1 alopresto  staff   129B May  1 14:41 core-site.xml
drwxr-xr-x    2 alopresto  staff    68B May  1 14:59 csv/
drwxr-xr-x    2 alopresto  staff    68B May  1 14:59 parquet/
hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 9s @ 15:00:03 $ tl
.
├── [6.0K]  .DS_Store
├── [ 129]  core-site.xml
├── [ 238]  csv/
│   ├── [  54]  257951968574779
│   ├── [3.5M]  257962982705055
│   ├── [  54]  257981981063720
│   ├── [3.7M]  257986105785832
│   └── [  54]  258011981257869
└── [ 476]  parquet/
    ├── [  16]  .257951968574779.crc
    ├── [6.5K]  .257962982705055.crc
    ├── [  16]  .257981981063720.crc
    ├── [6.6K]  .257986105785832.crc
    ├── [  16]  .258011981257869.crc
    ├── [6.5K]  .258013234789061.crc
    ├── [ 758]  257951968574779
    ├── [829K]  257962982705055
    ├── [ 758]  257981981063720
    ├── [842K]  257986105785832
    ├── [ 758]  258011981257869
    └── [833K]  258013234789061

2 directories, 19 files
hw12203:/Users/alopresto/Workspace/scratch/NIFI-3724 (master) alopresto
🔓 86s @ 15:01:30 $ more csv/258011981257869
name,favorite_number,favorite_color
Bryan,693421,blue

If the displayName comments are fixed, I am +1 and ready to merge. Thanks Bryan.

One minor issue:

  • On template import, the processors which referenced a controller service were invalid. Configuring each (they showed "Incompatible Controller Service Configured") by selecting the same option from the list fixed the issue. This doesn't seem like an issue introduced by any code in this PR, however.

@bbende (Contributor, Author) commented May 1, 2017

Thanks Andy, I just pushed a commit that addresses your comments; we should be good to go. I am going to look into the template issue, but I agree that it is not caused by the changes in this PR.

@alopresto (Contributor) commented:
Thanks. Merging.

@asfgit closed this in 60d88b5 on May 1, 2017
@masterlittle commented:
Hi, does this enable writing Parquet files to an S3 bucket? If not, is there any way I can achieve the same?

@nellashapiro123 commented:
Has anybody been able to use the FetchParquet processor successfully? I am getting a SchemaNotFound exception. I created the file with PutParquet, and Spark can read this Parquet file.

@bbende (Contributor, Author) commented Nov 2, 2017

@nellashapiro123 it would probably be best to ask this on the mailing lists:
https://nifi.apache.org/mailing_lists.html

If you send an email, please provide more info about your flow: which reader and writer FetchParquet is using, what schema access strategy each reader and writer is using, and, if using schema access by name, what the value of the schema.name attribute coming into FetchParquet is.
