Based on example code snippet ParquetReaderWriterWithAvro.java
located on GitHub at:
Original example code author: Max Konstantinov MaxNevermind
Extensively refactored by: Roger Voss roger-dv, Tideworks Technology, May 2018
-
Original example wrote 2 Avro dummy test data items to a Parquet file.
-
The refactored implementation uses a loop to write 10 Avro dummy test data items by default and accepts a different count as a command line argument.
-
The test data strings are now generated by the RandomString class, each 64 characters long.
-
Still uses the original avroToParquet.avsc schema to describe the Avro dummy test data.
-
The most significant enhancement is that the code now calls these two methods:
nioPathToOutputFile()
nioPathToInputFile()
-
`nioPathToOutputFile()` accepts a Java nio `Path` for a file in a standard file system and returns an `org.apache.parquet.io.OutputFile` (which is accepted by the `AvroParquetWriter` builder).
-
`nioPathToInputFile()` accepts a Java nio `Path` for a file in a standard file system and returns an `org.apache.parquet.io.InputFile` (which is accepted by the `AvroParquetReader` builder).
These methods provide implementations of the `OutputFile` and `InputFile` adaptors, which make it possible to write Avro data to a Parquet-formatted file residing in a conventional file system (i.e., a plain file system instead of the Hadoop HDFS file system) and then read it back. The use case is a big data solution stack that is not predicated on Hadoop and HDFS.
- It is an easy matter to adapt this approach to work with JSON input data: synthesize an appropriate Avro schema to describe the JSON data, put the JSON data into an Avro `GenericData.Record`, and write it out.
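
The adaptors described above have to hand Parquet a stream that can report its current byte offset (Parquet's `PositionOutputStream` declares `getPos()` for this purpose). As a rough, JDK-only sketch of that one idea, and not the repository's actual `nioPathToOutputFile()` code, the hypothetical `PositionTrackingOutputStream` below wraps a nio `Path` and counts the bytes written to it. The class name and the temp-file demo are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of the repository or of the Parquet API):
 * an OutputStream over a nio Path that tracks how many bytes were written.
 * A real adaptor would extend Parquet's PositionOutputStream instead.
 */
final class PositionTrackingOutputStream extends OutputStream {
    private final OutputStream out;
    private long pos = 0;

    PositionTrackingOutputStream(Path path) throws IOException {
        // Creates the file, or truncates it if it already exists.
        this.out = Files.newOutputStream(path);
    }

    /** Number of bytes written through this stream so far. */
    long getPos() {
        return pos;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        pos += len;
    }

    @Override
    public void flush() throws IOException {
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("position-demo", ".bin");
        try (PositionTrackingOutputStream s = new PositionTrackingOutputStream(tmp)) {
            s.write(new byte[] {1, 2, 3});
            System.out.println("position after write: " + s.getPos());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Counting in the two `write` overrides is enough, because the inherited `write(byte[])` delegates to `write(byte[], int, int)`.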
-
Build:
mvn install
-
The `HADOOP_HOME` environment variable should be defined to prevent an exception from being thrown. The code continues to execute properly without it, but defining the variable suppresses the exception. The exception originates deep in the Hadoop/Parquet library implementation, not in the application code.
-
The `HOME` environment variable may be defined. The program will look for `logback.xml` there and will write the Parquet file it generates there. Otherwise the program uses the current working directory.
-
In `logback.xml`, the filters on the `ConsoleAppender` and `RollingFileAppender` should be adjusted to modify the verbosity level of logging. The defaults are set to `INFO` level. The intent is to allow, say, setting the file appender to `DEBUG` while the console is set to `INFO`.
-
The only command line argument accepted is the number of Avro records to write; the default is 10.
-
The shell script `run.sh` can be used to invoke the program from the Maven `target/` directory.
-
Logging will go into a `logs/` directory as the file `avro2parquet.log`.
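
The run-time defaults in the notes above (a record count of 10 unless an argument is given, and `HOME` falling back to the current working directory) can be sketched roughly as follows. `RunSettings` and its method names are hypothetical, and rejecting a non-positive count is an assumption of this sketch; the program's actual handling may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating the documented defaults; not the repository's code. */
final class RunSettings {
    static final int DEFAULT_COUNT = 10;

    /** Returns the count given as the first argument, or 10 when no argument is passed. */
    static int parseCount(String[] args) {
        if (args.length == 0) {
            return DEFAULT_COUNT;
        }
        int count = Integer.parseInt(args[0].trim());
        if (count < 1) {
            // Assumption: a non-positive count is rejected rather than silently replaced.
            throw new IllegalArgumentException("count must be positive: " + count);
        }
        return count;
    }

    /** Returns the HOME directory when it is set and non-empty, else the current working directory. */
    static Path baseDir(String homeEnv) {
        if (homeEnv != null && !homeEnv.isEmpty()) {
            return Paths.get(homeEnv);
        }
        return Paths.get("").toAbsolutePath();
    }

    public static void main(String[] args) {
        int count = parseCount(args);
        Path dir = baseDir(System.getenv("HOME"));
        System.out.println("records=" + count + ", logback.xml expected at " + dir.resolve("logback.xml"));
    }
}
```

Passing the environment value in as a parameter, instead of reading it inside `baseDir`, keeps the fallback logic easy to exercise without changing the process environment.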