HAWQ-1394 document HiveORC profile (closes #104)
lisakowen authored and dyozie committed Mar 20, 2017
1 parent ce67ada commit 8c37d043ab8b6ef59315a7bd92acb2c9b45dcedb
Showing 1 changed file with 175 additions and 8 deletions.
@@ -50,7 +50,7 @@ The PXF Hive plug-in supports several file formats and profiles for accessing th
| TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | Hive, HiveText |
| SequenceFile | Flat file consisting of binary key/value pairs. | Hive |
| RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | Hive, HiveRC |
| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive |
| ORCFile | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | Hive, HiveORC |
| Parquet | Compressed columnar data representation. | Hive |
| Avro | JSON-defined, schema-based data serialization format. | Hive |

@@ -82,7 +82,7 @@ The following table summarizes external mapping rules for Hive primitive types.

Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to `text`. While HAWQ does not natively support these types, you can create HAWQ functions or application code to extract subcomponents of these complex data types.

An example using complex data types with the `Hive` profile is provided later in this topic.
Examples using complex data types with the `Hive` and `HiveORC` profiles are provided later in this topic.


## <a id="hive_sampledataset"></a>Sample Data Set
@@ -316,15 +316,15 @@ HCatalog integration has the following limitations:

In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.

The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, `HiveRC`, and `HiveORC`. The `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `HiveORC` profile is optimized for the ORC file format. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.

Use the following syntax to create a HAWQ external table representing Hive data:

``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>'])
?PROFILE=Hive|HiveText|HiveRC|HiveORC[&DELIMITER=<delim>]')
FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')
```

@@ -336,9 +336,9 @@ Hive-plug-in-specific keywords and values used in the [CREATE EXTERNAL TABLE](..
| \<port\> | The PXF port. If \<port\> is omitted, PXF assumes \<host\> identifies a High Availability HDFS Nameservice and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. |
| \<hive-db-name\> | The name of the Hive database. If omitted, defaults to the Hive database named `default`. |
| \<hive-table-name\> | The name of the Hive table. |
| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, or `HiveRC`. |
| PROFILE | The `PROFILE` keyword must specify one of the values `Hive`, `HiveText`, `HiveRC`, or `HiveORC`. |
| DELIMITER | The `DELIMITER` clause is required for both the `HiveText` and `HiveRC` profiles and identifies the field delimiter used in the Hive data set. \<delim\> must be a single ASCII character or specified in hexadecimal representation. |
| FORMAT (`Hive` profile) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. |
| FORMAT (`Hive` and `HiveORC` profiles) | The `FORMAT` clause must specify `CUSTOM`. The `CUSTOM` format supports only the built-in `pxfwritable_import` `formatter`. |
| FORMAT (`HiveText` and `HiveRC` profiles) | The `FORMAT` clause must specify `TEXT`. The `delimiter` must be specified a second time in '\<delim\>'. |


@@ -401,7 +401,7 @@ Use the PXF `HiveText` profile to create a queryable HAWQ external table from th
2. Query the external table:

``` sql
postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location="Beijing";
postgres=# SELECT * FROM salesinfo_hivetextprofile WHERE location='Beijing';
```

``` shell
@@ -475,6 +475,96 @@ Use the `HiveRC` profile to query RCFile-formatted data in Hive tables.
...
```

## <a id="hiveorc-intro"></a>HiveORC Profile

The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. HAWQ/PXF supports ORC version 1.2.1.

ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both the type and encoding information for the data in the file. All columns within a single group of row data (also known as a stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format enables read projection, helping avoid accessing unnecessary columns during a query.

ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
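
For example (a sketch using the `salesinfo_hiveORCprofile` external table created later in this topic), a filter on a `float8` column with one of the comparison operators listed below is a pushdown candidate:

``` sql
-- With the HiveORC profile, this predicate can be evaluated during data
-- loading; ORC stripes whose min/max indexes rule out total_sales > 5000
-- can be skipped rather than read and transferred.
postgres=> SELECT location, total_sales FROM salesinfo_hiveORCprofile
           WHERE total_sales > 5000;
```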

Refer to the [Apache ORC](https://orc.apache.org/docs/) documentation and the Apache Hive [LanguageManual ORC](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) website for detailed information about the ORC file format.

Use the `HiveORC` profile to access ORC format data. The `HiveORC` profile provides:

- Enhanced query performance - Column projection information is leveraged to reduce disk I/O and data payload.

- Optimized predicate pushdown - Predicate pushdown is optimized for:
    - `int2`, `int4`, `int8`, `float8`, `text`, `bpchar`, and `boolean` data types and `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operator combinations
- `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the integer types
- `=`, `>`, `<`, `>=`, `<=`, `IS NULL`, and `IS NOT NULL` operators and comparisons between the `float8` and `float4` types
- `IN` operator on arrays of `int2`, `int4`, `int8`, `boolean`, and `text`

- Complex type support - You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to `text`.

**Note**: The `HiveORC` profile currently supports access to data stored in ORC format only through a Hive-mapped table.

### <a id="using-hiveorc-profile"></a>Example: Using the HiveORC Profile

In the following example, you will create a Hive table stored in ORC format and use the `HiveORC` profile to query this Hive table.

1. Create a Hive table with ORC file format:

``` shell
$ HADOOP_USER_NAME=hdfs hive
```

``` sql
hive> CREATE TABLE sales_info_ORC (location string, month string,
number_of_orders int, total_sales double)
STORED AS ORC;
```

2. Insert the data from the `sales_info` table into `sales_info_ORC`:

``` sql
hive> INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
```

A copy of the sample data set is now stored in ORC format in `sales_info_ORC`.
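
Optionally, you can confirm how the data was written by examining the underlying ORC file with Hive's ORC file dump utility. This is a sketch; the warehouse path below is an assumption based on a default Hive configuration, so locate the actual file first:

``` shell
$ # Hypothetical path -- find the real one with: hdfs dfs -ls /apps/hive/warehouse/sales_info_orc
$ hive --orcfiledump /apps/hive/warehouse/sales_info_orc/000000_0
```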

3. Perform a Hive query on `sales_info_ORC` to verify that the data was loaded successfully:

``` sql
hive> SELECT * FROM sales_info_ORC;
```
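
The output should list the rows of the sample data set, for example:

``` pre
OK
Prague      Jan   101   4875.33
Rome        Mar   87    1557.39
Bangalore   May   317   8936.99
...
```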

4. Start the `psql` subsystem and turn on timing:

``` shell
$ psql -d postgres
```

``` sql
postgres=> \timing
Timing is on.
```

5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `sales_info_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.

``` sql
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveORCprofile(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=HiveORC')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```

6. Query the external table:

``` sql
postgres=> SELECT * FROM salesinfo_hiveORCprofile;
```

``` pre
  location  | month | num_orders | total_sales
------------+-------+------------+-------------
 Prague     | Jan   |        101 |     4875.33
 Rome       | Mar   |         87 |     1557.39
 Bangalore  | May   |        317 |     8936.99
...

Time: 425.416 ms
```
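
Because timing is on, you can compare the `HiveORC` profile against the generic `Hive` profile on the same underlying data. A minimal sketch (the external table name here is illustrative, not part of the example above):

``` sql
-- Hypothetical comparison table: same Hive table, generic Hive profile.
postgres=> CREATE EXTERNAL TABLE salesinfo_hiveprofile_orc(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://namenode:51200/default.sales_info_ORC?PROFILE=Hive')
            FORMAT 'CUSTOM' (formatter='pxfwritable_import');
postgres=> SELECT * FROM salesinfo_hiveprofile_orc;
```

Comparing the reported `Time:` values for the two queries can show the effect of the ORC-optimized access path.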

## <a id="topic_dbb_nz3_ts"></a>Accessing Parquet-Format Hive Tables

The PXF `Hive` profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format in HDFS. Simply map the table columns using equivalent HAWQ data types. For example, if a Hive table is created using:
@@ -498,8 +588,9 @@ And query the HAWQ external table using:
postgres=# SELECT fname,lname FROM pxf_parquet_table;
```

## <a id="complex_dt_example"></a> Complex Data Types

## <a id="complex_dt_example"></a>Hive Profile Complex Data Type Example
### <a id="complex_dt_example_hive"></a>Example: Using the Hive Profile with Complex Data Types

This example employs the `Hive` profile and the array and map complex types, specifically an array of integers and a string key/value pair map.

@@ -607,6 +698,82 @@ When specifying an array field in a Hive table, you must identify the terminator

`intarray` and `propmap` are each serialized as text strings.

### <a id="using-hiveorc-profile-complex"></a>Example: Using the HiveORC Profile with Complex Data Types

In the following example, you will create a Hive table stored in ORC format and populate it from the `table_complextypes` Hive table you created in the previous exercise. You will then use the `HiveORC` profile to query the complex types in the new table.

1. Create a Hive table with ORC file format:

``` shell
$ HADOOP_USER_NAME=hdfs hive
```

``` sql
hive> CREATE TABLE table_complextypes_ORC( index int, name string, intarray ARRAY<int>, propmap MAP<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '%'
MAP KEYS TERMINATED BY ':'
STORED AS ORC;
```

2. Insert the data from the `table_complextypes` table into `table_complextypes_ORC`:

``` sql
hive> INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;
```

A copy of the sample data set is now stored in ORC format in `table_complextypes_ORC`.

3. Perform a Hive query on `table_complextypes_ORC` to verify that the data was loaded successfully:

``` sql
hive> SELECT * FROM table_complextypes_ORC;
```

``` pre
OK
3 Prague [1,2,3] {"zone":"euro","status":"up"}
89 Rome [4,5,6] {"zone":"euro"}
400 Bangalore [7,8,9] {"zone":"apac","status":"pending"}
...
```

4. Start the `psql` subsystem:

``` shell
$ psql -d postgres
```

5. Use the PXF `HiveORC` profile to create a queryable HAWQ external table from the Hive table named `table_complextypes_ORC` you created in Step 1. The `FORMAT` clause must specify `'CUSTOM'`. The `HiveORC` `CUSTOM` format supports only the built-in `'pxfwritable_import'` `formatter`.

``` sql
postgres=> CREATE EXTERNAL TABLE complextypes_hiveorc(index int, name text, intarray text, propmap text)
LOCATION ('pxf://namenode:51200/default.table_complextypes_ORC?PROFILE=HiveORC')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```

Notice that the integer array and map complex types are mapped to type `text`.

6. Query the external table:

``` sql
postgres=> SELECT * FROM complextypes_hiveorc;
```

``` pre
index | name | intarray | propmap
-------+------------+----------+------------------------------------
3 | Prague | [1,2,3] | {"zone":"euro","status":"up"}
89 | Rome | [4,5,6] | {"zone":"euro"}
400 | Bangalore | [7,8,9] | {"zone":"apac","status":"pending"}
...
```

`intarray` and `propmap` are each serialized as text strings.
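
If you need subcomponents of these serialized values, you can extract them with HAWQ string functions. A minimal sketch, assuming the `complextypes_hiveorc` external table created above:

``` sql
-- substring() with a POSIX pattern returns the parenthesized match,
-- pulling the "zone" value out of the text-serialized propmap column.
postgres=> SELECT name, substring(propmap FROM '"zone":"([^"]*)"') AS zone
FROM complextypes_hiveorc;
```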



## <a id="partitionfiltering"></a>Partition Filtering

The PXF Hive plug-in supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising the Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a PXF query using a `WHERE` clause that refers to a specific partition in the partitioned Hive table.
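
For example (hypothetical; assumes a Hive table partitioned on a `sale_month` column and a corresponding HAWQ external table named `pxf_sales_partitioned`):

``` sql
-- The WHERE clause names a partition column, so PXF can exclude the
-- files of all other partitions from the scan.
postgres=# SELECT * FROM pxf_sales_partitioned WHERE sale_month = 'Jan';
```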
