From 3ce5b7e7d978dea8e73354084920be61b287bb20 Mon Sep 17 00:00:00 2001
From: Owen O'Malley
Date: Tue, 28 Feb 2017 09:53:17 -0800
Subject: [PATCH] Add documentation for the Java tools jar.

---
 site/_docs/tools.md | 71 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/site/_docs/tools.md b/site/_docs/tools.md
index d02daee9e5..fa911367ea 100644
--- a/site/_docs/tools.md
+++ b/site/_docs/tools.md
@@ -81,15 +81,29 @@ string,struct>>",
 }
 ~~~
 
-## Java Metadata
+## Java ORC Tools
 
-The org.apache.orc.tools.FileDump Java class, which is available via Hive as:
+In addition to the C++ tools above, there is an ORC tools jar that
+packages several useful utilities and the necessary Java dependencies
+(including Hadoop) into a single package. The Java ORC tool jar
+supports both the local file system and HDFS.
+The subcommands for the tools are:
+  * meta - print the metadata of an ORC file
+  * data - print the data of an ORC file
+  * scan (since ORC 1.3) - scan the data for benchmarking
+  * convert (since ORC 1.4) - convert JSON files to ORC
+  * json-schema (since ORC 1.4) - determine the schema of JSON documents
+
 
 ~~~ shell
-% java -jar orc-tools-*.jar meta [-j] [-p] [-t] [--rowindex ]
-    [--recover] [--skip-dump] [--backup-path ]
+% java -jar orc-tools-X.Y.Z-uber.jar
 ~~~
 
+### Java Meta
+
+The meta command prints the metadata about the given ORC file and is
+equivalent to the Hive ORC File Dump command.
+
 -j
   : format the output in JSON
 
@@ -114,7 +128,7 @@ The org.apache.orc.tools.FileDump Java class, which is available via Hive as:
 An example of the output is given below:
 
 ~~~ shell
-% java -jar orc-tools-*.jar meta examples/TestOrcFile.test1.orc
+% java -jar orc-tools-X.Y.Z-uber.jar meta examples/TestOrcFile.test1.orc
 Processing data file examples/TestOrcFile.test1.orc [length: 1711]
 Structure for examples/TestOrcFile.test1.orc
 File Version: 0.12 with HIVE_8732
@@ -261,4 +275,49 @@ File length: 1711 bytes
 Padding length: 0 bytes
 Padding ratio: 0%
 ______________________________________________________________________
-~~~
\ No newline at end of file
+~~~
+
+### Java Data
+
+The data command prints the data in an ORC file as a JSON document. Each
+record is printed as a JSON object on a line. Each record is annotated with
+the field names and a JSON representation that depends on the field's type.
+
+### Java Scan
+
+The scan command reads the contents of the file without printing anything. It
+is primarily intended for benchmarking the Java reader without including the
+cost of printing the data out.
+
+### Java Convert
+
+The convert command reads several JSON files and converts them into a
+single ORC file.
+
+-o
+  : Sets the output ORC filename, which defaults to output.orc
+
+-s
+  : Sets the schema for the ORC file. By default, the schema is automatically discovered.
+
+-h
+  : Print help
+
+The automatic JSON schema discovery is equivalent to the json-schema tool
+below.
+
+### Java JSON Schema
+
+The JSON Schema discovery tool processes a set of JSON documents and
+produces a schema that encompasses all of the records in all of the
+documents. It works by computing the enclosing type and promoting it
+to include all of the observed values.
+
+-f
+  : Print the schema as a list of flat types for each subfield
+
+-t
+  : Print the schema as a Hive table declaration
+
+-h
+  : Print help
\ No newline at end of file
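
The subcommands documented in this patch could be chained roughly as in the sketch below. It is a dry run by default: it only prints the command lines, so it does not need the jar to be present. The file names `events.json` and `events.orc`, the `orc_tool` helper, and the `ORC_TOOLS_JAR` and `ORC_RUN` variables are hypothetical names introduced here for illustration. Passing input files as positional arguments to `data`, `scan`, `convert`, and `json-schema` is an assumption based on the descriptions above; only the `meta` invocation form appears in the patch itself.

```shell
#!/bin/sh
# Hypothetical jar location; X.Y.Z stands for whichever ORC release is in use.
ORC_TOOLS_JAR="${ORC_TOOLS_JAR:-orc-tools-X.Y.Z-uber.jar}"

# Print the command line for a subcommand; execute it only when ORC_RUN=1.
orc_tool() {
  if [ "${ORC_RUN:-0}" = "1" ]; then
    java -jar "$ORC_TOOLS_JAR" "$@"
  else
    echo "java -jar $ORC_TOOLS_JAR $*"
  fi
}

# Discover a schema from JSON and print it as a Hive table declaration (-t).
orc_tool json-schema -t events.json
# Convert the JSON into an ORC file, overriding the default output.orc (-o).
orc_tool convert -o events.orc events.json
# Inspect the result: metadata, then rows as JSON, then a print-free read.
orc_tool meta events.orc
orc_tool data events.orc
orc_tool scan events.orc
```

Setting `ORC_RUN=1` and pointing `ORC_TOOLS_JAR` at a real uber jar would run the commands instead of printing them.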