Merge branch 'feature/newregister' of https://github.com/janebeckman/…
dyozie committed Oct 27, 2016
2 parents 5673447 + 8b65086 commit 01f3f8e9d9314edc8d26d23c51894aaf9ac77613
Showing 3 changed files with 53 additions and 20 deletions.
@@ -22,15 +22,14 @@ Requirements for running `hawq register` on the server are:

Files or folders in HDFS can be registered into an existing table, allowing them to be managed as a HAWQ internal table. When registering files, you can optionally specify the maximum amount of data to be loaded, in bytes, using the `--eof` option. If registering a folder, the actual file sizes are used.
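
For example, a hypothetical invocation that registers at most the first 1000000 bytes of a file (the path and table name are illustrative, reused from the example later in this section) might look like:

``` pre
$ hawq register -d postgres -e 1000000 -f hdfs://localhost:8020/temp/hive.paq parquet_table
```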

Only HAWQ or Hive-generated Parquet tables are supported. Only single-level partitioned tables are supported; registering partitioned tables with more than one level will result in an error.

Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so data must be mapped. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`. Not all Hive data types can be mapped to HAWQ equivalents. The currently supported Hive data types are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.

As a best practice, create a copy of the Parquet file to be registered before running `hawq register`. You can then run `hawq register` on the copy, leaving the original file available for additional Hive queries or in case a data mapping error is encountered.
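
For instance (the paths are illustrative), you might copy the file with `hdfs dfs -cp` and register the copy:

``` pre
$ hdfs dfs -cp hdfs://localhost:8020/temp/hive.paq hdfs://localhost:8020/temp/hive_copy.paq
$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive_copy.paq parquet_table
```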

### Limitations for Registering Hive Tables to HAWQ

The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.

@@ -40,33 +39,37 @@ This example shows how to register a HIVE-generated parquet file in HDFS into th

In this example, the location of the database is `hdfs://localhost:8020/hawq_default`, the tablespace id is 16385, the database id is 16387, the table filenode id is 77160, and the last file under the filenode is numbered 7.

Run the `hawq register` command for the file location `hdfs://localhost:8020/temp/hive.paq`:

``` pre
$ hawq register -d postgres -f hdfs://localhost:8020/temp/hive.paq parquet_table
```

After running the `hawq register` command, the corresponding new location of the file in HDFS is: `hdfs://localhost:8020/hawq_default/16385/16387/77160/8`.
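
To confirm the relocation, you could list the table's data directory in HDFS (the path components are the example's tablespace id, database id, and filenode id):

``` pre
$ hdfs dfs -ls hdfs://localhost:8020/hawq_default/16385/16387/77160
```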

The command updates the metadata of the table `parquet_table` in HAWQ, which is contained in the table `pg_aoseg.pg_paqseg_77160`. The pg\_aoseg table is a fixed schema for row-oriented and Parquet AO tables. For row-oriented tables, the table name prefix is pg\_aoseg. For Parquet tables, the table name prefix is pg\_paqseg. 77160 is the relation id of the table.

You can locate the table either by relation ID or by table name.

To find the relation ID, run the following command on the catalog table pg\_class:

```
select oid from pg_class where relname=$relname
```
To find the table name, run the command:

```
select segrelid from pg_appendonly where relid = $relid
```
then run:

```
select relname from pg_class where oid = segrelid
```
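
Putting the two lookups together for the `parquet_table` example (the relation id 77160 is this example's assumed value):

```
-- find the relation id of the example table (77160 in this example)
select oid from pg_class where relname = 'parquet_table';

-- find the segment table name for that relation id in one step
select relname from pg_class
where oid = (select segrelid from pg_appendonly where relid = 77160);
```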

## <a id="topic1__section3"></a>Registering Data Using Information from a YAML Configuration File

The `hawq register` command can register HDFS files using metadata loaded from a YAML configuration file, specified with the `--config <yaml_config>` option. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered. In disaster recovery, information in a YAML-format file created by the `hawq extract` command can re-create HAWQ tables by using metadata from a backup checkpoint.

You can also use a YAML configuration file to append HDFS files to an existing HAWQ table or to create a table and register it into HAWQ.
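
For example, assuming a YAML file `rank.yml` previously produced by `hawq extract` (the file and table names are illustrative), a new table can be created and registered in one step:

``` pre
$ hawq register -d postgres -c rank.yml public.rank_restored
```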

@@ -78,10 +81,9 @@ Data is registered according to the following conditions:
- If a table does not exist, it is created and registered into HAWQ. The catalog table will be updated with the file size specified by the YAML file.
- If the `--force` option is used, the data in existing catalog tables is erased and re-registered, as shown in the sketch below. All HDFS-related catalog contents in `pg_aoseg.pg_paqseg_$relid` are cleared. The original files on HDFS are retained.
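
A minimal sketch of re-registering with `--force`, reusing the illustrative names above:

``` pre
$ hawq register -d postgres -c rank.yml --force public.rank_restored
```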

Tables using random distribution are preferred for registering into HAWQ.

There are additional restrictions when registering hash tables. When registering hash-distributed tables using a YAML file, the distribution policy in the YAML file must match that of the table being registered into, and the order of the files in the YAML file should reflect the hash distribution. The number of files to be registered should be identical to or a multiple of the hash table's bucket number.
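
The following hypothetical YAML fragment sketches these restrictions for a hash-distributed table: the file count matches `Bucketnum`, the file order reflects the hash distribution, and the distribution policy names the hash key (all values, including the `DISTRIBUTED BY` clause, are illustrative):

```
Bucketnum: 2
Distribution_policy: DISTRIBUTED BY (id)
Parquet_FileLocations:
  Files:
  - path: /hawq_default/16385/16387/77160/1
    size: 4096
  - path: /hawq_default/16385/16387/77160/2
    size: 4096
```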

### Example: Registration Using a YAML Configuration File

@@ -114,7 +116,7 @@ Select the new table and check to verify that the content has been registered.

## <a id="topic1__section4"></a>Data Type Mapping

Hive and Parquet tables use different data types than HAWQ tables and must be mapped for metadata compatibility. You are responsible for making sure your implementation is mapped to the appropriate data type before running `hawq register`. The tables below show equivalent data types, if available.

<span class="tablecap">Table 1. HAWQ to Parquet Mapping</span>

@@ -204,5 +206,9 @@ group {
| varchar | varchar |


### Extracting Metadata

For more information on extracting metadata to a YAML file and the output content of the YAML file, refer to the reference page for [hawq extract](../../reference/cli/admin_utilities/hawqextract.html#topic1).
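
For example, a hypothetical extraction of the `public.rank` table that appears in the sample YAML above:

``` pre
$ hawq extract -d postgres -o rank.yml public.rank
```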



@@ -87,6 +87,8 @@ Encoding: UTF8
AO_Schema:
- name: string
type: string
Bucketnum: 6
Distribution_policy: DISTRIBUTED RANDOMLY

AO_FileLocations:
Blocksize: int
@@ -96,7 +98,7 @@ AO_FileLocations:
PartitionBy: string ('PARTITION BY ...')
Files:
- path: string (/gpseg0/16385/35469/35470.1)
size: long

Partitions:
- Blocksize: int
@@ -109,7 +111,10 @@ AO_FileLocations:
- path: string
size: long


Parquet_Schema:
- name: string
type: string

Parquet_FileLocations:
RowGroupSize: long
PageSize: long
@@ -203,6 +208,7 @@ AO_FileLocations:
- name: count
type: int4
DFS_URL: hdfs://127.0.0.1:9000
Distribution_policy: DISTRIBUTED RANDOMLY
Encoding: UTF8
FileFormat: AO
TableName: public.rank
@@ -284,6 +290,26 @@ Parquet_FileLocations:
PageSize: 1048576
RowGroupSize: 8388608
Parquet_Schema:
- name: o_orderkey
type: int8
- name: o_custkey
type: int4
- name: o_orderstatus
type: bpchar
- name: o_totalprice
type: numeric
- name: o_orderdate
type: date
- name: o_orderpriority
type: bpchar
- name: o_clerk
type: bpchar
- name: o_shippriority
type: int4
- name: o_comment
type: varchar
Distribution_policy: DISTRIBUTED RANDOMLY
```

## See Also
@@ -22,7 +22,7 @@ Connection Options:
Misc. Options:
[-f <filepath>]
[-e <eof>]
[--force]
[-c <yml_config>]
hawq register help | -?
hawq register --version
@@ -55,8 +55,8 @@ Two usage models are available.
Metadata for the Parquet file(s) and the destination table must be consistent. Different data types are used by HAWQ tables and Parquet files, so the data is mapped. Refer to the section [Data Type Mapping](hawqregister.html#topic1__section7) below. You must verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.

#### Limitations
Only HAWQ or Hive-generated Parquet tables are supported. Partitioned tables are supported, but only single-level partitioned tables can be registered into HAWQ.
Hash tables are not supported in this use model.

### Usage Model 2: Use information from a YAML configuration file to register data

@@ -70,6 +70,7 @@ The register process behaves differently, according to different conditions.
- If a table does not exist, it is created and registered into HAWQ.
- If the `--force` option is used, the data in existing catalog tables is erased and re-registered.


### Limitations for Registering Hive Tables to HAWQ
The currently supported data types for registering Hive tables into HAWQ are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
