---
title: hawq register
---
Loads and registers AO- or Parquet-formatted data files in HDFS into a corresponding table in HAWQ.
## <a id="topic1__section2"></a>Synopsis
``` pre
Usage 1:
hawq register [<connection_options>] [-f <hdfsfilepath>] [-e <eof>] <tablename>
Usage 2:
hawq register [<connection_options>] [-c <configfilepath>][-F] <tablename>
Connection Options:
[-h | --host <hostname>]
[-p | --port <port>]
[-U | --user <username>]
[-d | --database <database>]
Misc. Options:
[-f | --filepath <filepath>]
[-e | --eof <eof>]
[-F | --force ]
[-c | --config <yml_config>]
hawq register help | -?
hawq register --version
```
## <a id="topic1__section3"></a>Prerequisites
The client machine on which `hawq register` is executed must meet the following conditions:
- All hosts in your HAWQ cluster (master and segments) must have network access to the hosts containing the data to be loaded.
- The Hadoop client must be configured and able to reach the HDFS filepath specified.
- The files to be registered and the HAWQ table must be located in the same HDFS cluster.
- The target table DDL must be configured with the correct data type mapping.
## <a id="topic1__section4"></a>Description
`hawq register` is a utility that loads and registers existing data files or folders in HDFS into HAWQ internal tables, allowing HAWQ to read the data directly and apply internal table processing, such as transaction support and high-performance access, without needing to load or copy the data. Data from the file or directory specified by \<hdfsfilepath\> is loaded into the appropriate HAWQ table directory in HDFS, and the utility updates the corresponding HAWQ metadata for the files.
You can use `hawq register` to:
- Load and register external Parquet-formatted file data generated by an external system such as Hive or Spark.
- Recover cluster data from a backup cluster.
Two usage models are available.
### Usage Model 1: Register file data to an existing table
`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-f filepath] [-e eof] <tablename>`
Metadata for the Parquet file(s) and the destination table must be consistent. Because HAWQ tables and Parquet files use different data types, the data must be mapped; refer to the section [Data Type Mapping](hawqregister.html#topic1__section7) below. Verify that the structure of the Parquet files and the HAWQ table are compatible before running `hawq register`.
#### Limitations
- Only HAWQ- or Hive-generated Parquet tables are supported.
- Hash tables and partitioned tables are not supported in this usage model.
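A minimal sketch of this usage model, assuming a compatible, randomly distributed HAWQ table `parquet_table` already exists and a Parquet file has been written to HDFS (the host, port, user, and HDFS path below are placeholders for your environment):

```
hawq register -h localhost -p 5432 -U gpadmin -d postgres \
    -f hdfs://localhost:8020/user/hive/warehouse/parquet_table/000000_0 parquet_table
```

Passing a directory as the `-f` argument registers the files that the directory contains.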
### Usage Model 2: Use information from a YAML configuration file to register data
`hawq register [-h hostname] [-p port] [-U username] [-d databasename] [-c configfile] [--force] <tablename>`
Files generated by the `hawq extract` command are registered using the metadata in a YAML configuration file. Both AO and Parquet tables can be registered. Tables need not exist in HAWQ before being registered.
The register process behaves differently depending on the state of the target table:
- If the table already exists, the files are appended to it.
- If the table does not exist, it is created and registered into HAWQ.
- If the -\\\-force option is used, the data in existing catalog tables is erased and re-registered.
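For orientation only, a configuration file for a Parquet table has roughly the following shape. Treat every key and value here as an assumption; the authoritative file is whatever `hawq extract` produces for your table and release:

```
DFS_URL: hdfs://localhost:8020
Distribution_Policy: DISTRIBUTED RANDOMLY
Encoding: UTF8
FileFormat: Parquet
Parquet_FileLocations:
  Files:
  - path: /hawq_default/16385/16387/16393/1
    size: 493
Parquet_Schema:
- name: a
  type: int4
TableName: public.paq1
```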
### Limitations for Registering Hive Tables to HAWQ
The data types currently supported when registering Hive-generated tables into HAWQ tables are: boolean, int, smallint, tinyint, bigint, float, double, string, binary, char, and varchar.
The following Hive data types cannot be converted to HAWQ equivalents: timestamp, decimal, array, struct, map, and union.
Only single-level partitioned tables are supported.
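As an illustration of a supported conversion (the table and column names are hypothetical), a Hive-generated Parquet table and a compatible HAWQ target table might be defined as follows, using the equivalences in Table 2 below:

```
-- Hive DDL (hypothetical example):
CREATE TABLE hive_orders (id bigint, price double, note string, flag boolean)
  STORED AS PARQUET;

-- A compatible HAWQ target table:
CREATE TABLE hawq_orders (id int8, price float8, note varchar, flag bool)
  WITH (appendonly=true, orientation=parquet);
```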
### Data Type Mapping<a id="topic1__section7"></a>
HAWQ tables and Parquet files, and Hive tables and HAWQ tables, use different data types, so mapping is required for compatibility. You are responsible for making sure your implementation maps to the appropriate data types before running `hawq register`. The tables below show equivalent data types, where available.
<span class="tablecap">Table 1. HAWQ to Parquet Mapping</span>
|HAWQ Data Type | Parquet Data Type |
| :------------| :---------------|
| bool | boolean |
| int2/int4/date | int32 |
| int8/money | int64 |
| time/timestamptz/timestamp | int64 |
| float4 | float |
| float8 | double |
| bit/varbit/bytea/numeric | Byte array |
| char/bpchar/varchar/name | Byte array |
| text/xml/interval/timetz | Byte array |
| macaddr/inet/cidr | Byte array |
**Additional HAWQ-to-Parquet Mapping**
**point**:
```
group {
required int x;
required int y;
}
```
**circle:**
```
group {
required int x;
required int y;
required int r;
}
```
**box:**
```
group {
required int x1;
required int y1;
required int x2;
required int y2;
}
```
**iseg:**
```
group {
required int x1;
required int y1;
required int x2;
required int y2;
}
```
**path**:
```
group {
repeated group {
required int x;
required int y;
}
}
```
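For example, a HAWQ Parquet table touching several of the mappings above (the table and column names are hypothetical):

```
CREATE TABLE geo_demo (id int4, created date, amount numeric, location point)
  WITH (appendonly=true, orientation=parquet);
-- In the Parquet schema: id and created map to int32, amount to a byte array,
-- and location to the two-field point group shown above.
```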
<span class="tablecap">Table 2. Hive to HAWQ Mapping</span>
| Hive Data Type | HAWQ Data Type |
| :------------| :---------------|
| boolean | bool |
| tinyint | int2 |
| smallint | int2/smallint |
| int | int4 / int |
| bigint | int8 / bigint |
| float | float4 |
| double | float8 |
| string | varchar |
| binary | bytea |
| char | char |
| varchar | varchar |
## <a id="topic1__section5"></a>Options
**General Options**
<dt>-? (show help) </dt>
<dd>Show help, then exit.</dd>
<dt>-\\\-version </dt>
<dd>Show the version of this utility, then exit.</dd>
**Connection Options**
<dt>-h , -\\\-host \<hostname\> </dt>
<dd>Specifies the host name of the machine on which the HAWQ master database server is running. If not specified, reads from the environment variable `$PGHOST` or defaults to `localhost`.</dd>
<dt> -p , -\\\-port \<port\> </dt>
<dd>Specifies the TCP port on which the HAWQ master database server is listening for connections. If not specified, reads from the environment variable `$PGPORT` or defaults to 5432.</dd>
<dt>-U , -\\\-user \<username\> </dt>
<dd>The database role name to connect as. If not specified, reads from the environment variable `$PGUSER` or defaults to the current system user name.</dd>
<dt>-d , -\\\-database \<databasename\> </dt>
<dd>The database to register the Parquet HDFS data into. The default is `postgres`.</dd>
<dt>-f , -\\\-filepath \<hdfspath\></dt>
<dd>The path of the file or directory in HDFS containing the files to be registered.</dd>
<dt>\<tablename\> </dt>
<dd>The HAWQ table that will store the data to be registered. If the -\\\-config option is not supplied, the table cannot use hash distribution; random table distribution is strongly preferred. If hash distribution must be used, make sure that the distribution policy for the data files described in the YAML file is consistent with that of the table being registered into.</dd>
#### Miscellaneous Options
The following options are used with specific use models.
<dt>-e , -\\\-eof \<eof\></dt>
<dd>Specifies the end of the file to be registered. \<eof\> represents the valid content length of the file in bytes, a value between 0 and the actual size of the file. If this option is not included, the actual file size, or the size of the files within a folder, is used. Used with Usage Model 1.</dd>
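For instance, to register only the first 4096 bytes of a file (the path, size, and table name are hypothetical):

```
hawq register -d postgres -f hdfs://localhost:8020/temp/data/file.paq -e 4096 mytable
```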
<dt>-F , -\\\-force</dt>
<dd>Used for disaster recovery of a cluster. Clears all HDFS-related catalog contents in `pg_aoseg.pg_paqseg_$relid` and re-registers files to the specified table. The HDFS files themselves are not removed or modified. To use this option for recovery, data is assumed to be periodically imported to the cluster being recovered. Used with Usage Model 2.</dd>
<dt>-c , -\\\-config \<yml_config\> </dt>
<dd>Registers files specified by YAML-format configuration files into HAWQ. Used with Usage Model 2.</dd>
## <a id="topic1__section6"></a>Example: Usage Model 2
This example shows how to register files using a YAML configuration file. This file is usually generated by the `hawq extract` command.
Create a table and insert data into the table:
```
=> create table paq1(a int, b varchar(10)) with (appendonly=true, orientation=parquet);
=> insert into paq1 values(generate_series(1,1000), 'abcde');
```
Extract the table's metadata.
```
hawq extract -o paq1.yml paq1
```
Use the YAML file to register the new table paq2:
```
hawq register --config paq1.yml paq2
```
Query the new table to verify that the data was registered:
```
=> select count(*) from paq2;
```
The result should return 1000.
## See Also
[hawq extract](hawqextract.html#topic1)