Merge remote-tracking branch 'origin/develop' into feature/pxfhdfs-enhance
lisakowen committed Oct 20, 2016
2 parents 86d13b3 + 00a2a36 commit a6e8c71dec711a8199808c705e947dc671d4926a
Showing 33 changed files with 271 additions and 78 deletions.
@@ -12,7 +12,7 @@ This topic provides some guidelines around expanding your HAWQ cluster.

There are several recommendations to keep in mind when modifying the size of your running HAWQ cluster:

- When you add a new node, install both a DataNode and a physical segment on the new node.
- When you add a new node, install both a DataNode and a physical segment on the new node. If you are using YARN to manage HAWQ resources, you must also configure a YARN NodeManager on the new node.
- After adding a new node, you should always rebalance HDFS data to maintain cluster performance.
- Adding or removing a node also necessitates an update to the HDFS metadata cache. This update will happen eventually, but can take some time. To speed the update of the metadata cache, execute **`select gp_metadata_cache_clear();`**.
- Note that for hash distributed tables, expanding the cluster will not immediately improve performance since hash distributed tables use a fixed number of virtual segments. In order to obtain better performance with hash distributed tables, you must redistribute the table to the updated cluster by either the [ALTER TABLE](../reference/sql/ALTER-TABLE.html) or [CREATE TABLE AS](../reference/sql/CREATE-TABLE-AS.html) command.
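
For a manually managed cluster, the cache refresh suggested above can be run from the master host with `psql`. A minimal sketch, in which the database name is an assumption:

```shell
# Run on the HAWQ master host; substitute your own database name
$ psql -d postgres -c "SELECT gp_metadata_cache_clear();"
```
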
@@ -153,7 +153,7 @@ This topic provides some guidelines around expanding your HAWQ cluster.

There are several recommendations to keep in mind when modifying the size of your running HAWQ cluster:

- When you add a new node, install both a DataNode and a HAWQ segment on the new node.
- When you add a new node, install both a DataNode and a HAWQ segment on the new node. If you are using YARN to manage HAWQ resources, you must also configure a YARN NodeManager on the new node.
- After adding a new node, you should always rebalance HDFS data to maintain cluster performance.
- Adding or removing a node also necessitates an update to the HDFS metadata cache. This update will happen eventually, but can take some time. To speed the update of the metadata cache, select the **Service Actions > Clear HAWQ's HDFS Metadata Cache** option in Ambari.
- Note that for hash distributed tables, expanding the cluster will not immediately improve performance since hash distributed tables use a fixed number of virtual segments. In order to obtain better performance with hash distributed tables, you must redistribute the table to the updated cluster by either the [ALTER TABLE](../reference/sql/ALTER-TABLE.html) or [CREATE TABLE AS](../reference/sql/CREATE-TABLE-AS.html) command.
@@ -6,49 +6,26 @@ In a HAWQ DBMS, the database server instances \(the master and all segments\) ar

Because a HAWQ system is distributed across many machines, the process for starting and stopping a HAWQ system is different than the process for starting and stopping a regular PostgreSQL DBMS.

Use the `hawq start `*`object`* and `hawq stop `*`object`* commands to start and stop HAWQ, respectively. These management tools are located in the $GPHOME/bin directory on your HAWQ master host. Initializing a HAWQ system also starts it.
Use the `hawq start `*`object`* and `hawq stop `*`object`* commands to start and stop HAWQ, respectively. These management tools are located in the $GPHOME/bin directory on your HAWQ master host.

Initializing a HAWQ system also starts it.

**Important:**

Do not issue a `KILL` command to end any Postgres process. Instead, use the database command `pg_cancel_backend()`.
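
As a minimal sketch of this alternative, look up the offending backend's process ID (for example, in the `pg_stat_activity` view) and pass it to `pg_cancel_backend()`; the PID and database name below are placeholders:

```shell
# 12345 is a placeholder backend process ID; substitute your own database name
$ psql -d postgres -c "SELECT pg_cancel_backend(12345);"
```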

For information about [hawq start](../reference/cli/admin_utilities/hawqstart.html) and [hawq stop](../reference/cli/admin_utilities/hawqstop.html), see the appropriate pages in the HAWQ Management Utility Reference or enter `hawq start -h` or `hawq stop -h` on the command line.

## <a id="task_g1y_xtm_s5"></a>Initialize HAWQ

Initialize and start the HAWQ system using configuration parameters defined in `$GPHOME/etc/hawq-site.xml`.

The `hawq init` command, used with the appropriate cluster or node object, initializes and starts a HAWQ cluster. You can initialize the master and segment nodes individually using the `hawq init master` and `hawq init segment` commands, respectively. Format options can also be specified at this time.

The `hawq init <object>` utility creates a HAWQ instance using configuration parameters defined in `$GPHOME/etc/hawq-site.xml`. A single-node cluster can be started without any user-defined changes to the default `hawq-site.xml` file. Use the `template-hawq-site.xml` file to specify the configuration for larger clusters.

When using the template to initialize a new cluster configuration, replace the items contained within the `%` markers. For example, in `<value>%master.host%</value>`, replace `%master.host%` with the master host name. After modification, rename the file to the name of the default configuration file, `hawq-site.xml`.
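
One way to prepare the template, assuming it lives alongside the default configuration in `$GPHOME/etc/` and that the master host is named `mdw` (both assumptions for illustration), is:

```shell
# Copy the template, substitute the %...% markers, and install it under the default name
$ cp $GPHOME/etc/template-hawq-site.xml /tmp/hawq-site.xml
$ sed -i 's/%master.host%/mdw/g' /tmp/hawq-site.xml
# ...replace the remaining %...% markers similarly, then:
$ cp /tmp/hawq-site.xml $GPHOME/etc/hawq-site.xml
```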

- Before initializing HAWQ, set the `$GPHOME` environment variable to point to the location of your HAWQ installation on the master host and exchange SSH keys between all host addresses in the array, using `hawq ssh-exkeys`.
- To initialize and start a HAWQ cluster, enter the following command on the master host:

```shell
$ hawq init cluster
```
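
A sketch of the full sequence on the master host follows; the installation path and host file name are assumptions made for illustration:

```shell
# Set up the environment; greenplum_path.sh exports GPHOME (the install path shown is an assumption)
$ source /usr/local/hawq/greenplum_path.sh
# Exchange SSH keys with every host in the cluster; hawq_hosts is an assumed file listing one host per line
$ hawq ssh-exkeys -f hawq_hosts
# ...then run the `hawq init cluster` command shown above
```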


## <a id="task_hkd_gzv_fp"></a>Starting HAWQ

Start an initialized HAWQ system by running the `hawq start` command on the master instance.
When a HAWQ system is first initialized, it is also started. For more information about initializing HAWQ, see [hawq init](../reference/cli/admin_utilities/hawqinit.html).

Use the `hawq start cluster` command to start a HAWQ system that has already been initialized by the `hawq init cluster` command, but has been stopped by the `hawq stop cluster` command. The `hawq start cluster` command starts HAWQ by starting all the segments on the HAWQ cluster. `hawq start cluster` orchestrates this process and performs the process in parallel.
To start a stopped HAWQ system that was previously initialized, run the `hawq start` command on the master instance.

You can also use the `hawq start master` command to start only the HAWQ master, without segment nodes, and then add these later using `hawq start segment`. If you want HAWQ to ignore hosts that fail ssh validation, use the `hawq start --ignore-bad-hosts` option, as sketched below.

- Run `hawq start cluster` on the master host to start a HAWQ system:

```shell
$ hawq start cluster
```
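
A sketch of the master-first approach described above; note that `hawq start segment` runs on each segment host rather than on the master:

```shell
# On the master host: bring up the master only
$ hawq start master
# Later, on each segment host: bring that segment online
$ hawq start segment
# Or start everything at once, skipping hosts that fail ssh validation
$ hawq start cluster --ignore-bad-hosts
```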

**Note:**

When the HAWQ system is first initialized with the `hawq init` command, it is automatically started.
Use the `hawq start cluster` command to start a HAWQ system that has already been initialized by the `hawq init cluster` command, but has been stopped by the `hawq stop cluster` command. The `hawq start cluster` command starts a HAWQ system on the master host and starts all its segments. The command orchestrates this process and performs the process in parallel.


## <a id="task_gpdb_restart"></a>Restarting HAWQ
@@ -16,10 +16,14 @@ If a query performs poorly, examine its query plan and ask the following questio
If the plan is not choosing the optimal join order, set `join_collapse_limit=1` and use explicit `JOIN` syntax in your SQL statement to force the legacy query optimizer (planner) to the specified join order. You can also collect more statistics on the relevant join columns.

- **Does the optimizer selectively scan partitioned tables?** If you use table partitioning, is the optimizer selectively scanning only the child tables required to satisfy the query predicates? Scans of the parent tables should return 0 rows since the parent tables do not contain any data. See [Verifying Your Partition Strategy](../ddl/ddl-partition.html#topic74) for an example of a query plan that shows a selective partition scan.
- **Does the optimizer choose hash aggregate and hash join operations where applicable?** Hash operations are typically much faster than other types of joins or aggregations. Row comparison and sorting is done in memory rather than reading/writing from disk. To enable the query optimizer to choose hash operations, there must be sufficient memory available to hold the estimated number of rows. Try increasing work memory to improve performance for a query. If possible, run an `EXPLAIN ANALYZE` for the query to show which plan operations spilled to disk, how much work memory they used, and how much memory was required to avoid spilling to disk. For example:

`Work_mem used: 23430K bytes avg, 23430K bytes max (seg0). Work_mem wanted: 33649K bytes avg, 33649K bytes max (seg0) to lessen workfile I/O affecting 2 workers.`

**Note:**
The *work\_mem* property is not configurable. Use resource queues to manage memory use. For more information on resource queues, see [Configuring Resource Management](../resourcemgmt/ConfigureResourceManagement.html) and [Working with Hierarchical Resource Queues](../resourcemgmt/ResourceQueues.html).


The "bytes wanted" message from `EXPLAIN ANALYZE` is based on the amount of data written to work files and is not exact. The minimum `work_mem` needed can differ from the suggested value.


@@ -4,7 +4,7 @@ title: Disabling Kerberos Security

Follow these steps to disable Kerberos security for HAWQ and PXF for manual installations.

**Note:** If you install or manager your cluster using Ambari, then the HAWQ Ambari plug-in automatically disables security for HAWQ and PXF when you disable security for Hadoop. The following instructions are only necessary for manual installations, or when Hadoop security is disabled outside of Ambari.
**Note:** If you install or manage your cluster using Ambari, then the HAWQ Ambari plug-in automatically disables security for HAWQ and PXF when you disable security for Hadoop. The following instructions are only necessary for manual installations, or when Hadoop security is disabled outside of Ambari.

1. Disable Kerberos on the Hadoop cluster on which you use HAWQ.
2. Disable security for HAWQ:
@@ -1,8 +1,96 @@
---
title: ODBC/JDBC Application Interfaces
title: HAWQ Database Drivers and APIs
---

You may want to connect your existing Business Intelligence (BI) or Analytics applications with HAWQ. The database application programming interfaces most commonly used with HAWQ are the Postgres and ODBC and JDBC APIs.

You may want to deploy your existing Business Intelligence (BI) or Analytics applications with HAWQ. The most commonly used database application programming interfaces with HAWQ are the ODBC and JDBC APIs.
HAWQ provides the following connectivity tools for connecting to the database:

- ODBC driver
- JDBC driver
- `libpq` - PostgreSQL C API

## <a id="dbdriver"></a>HAWQ Drivers

ODBC and JDBC drivers for HAWQ are available as a separate download from [Pivotal Network](https://network.pivotal.io/products/pivotal-hdb).

### <a id="odbc_driver"></a>ODBC Driver

The ODBC API specifies a standard set of C interfaces for accessing database management systems. For additional information on using the ODBC API, refer to the [ODBC Programmer's Reference](https://msdn.microsoft.com/en-us/library/ms714177(v=vs.85).aspx) documentation.

HAWQ supports the DataDirect ODBC Driver. Installation instructions for this driver are provided on the Pivotal Network driver download page. Refer to [HAWQ ODBC Driver](http://media.datadirect.com/download/docs/odbc/allodbc/#page/odbc%2Fthe-greenplum-wire-protocol-driver.html%23) for HAWQ-specific ODBC driver information.

#### <a id="odbc_driver_connurl"></a>Connection Data Source
The information required by the HAWQ ODBC driver to connect to a database is typically stored in a named data source. Depending on your platform, you may use [GUI](http://media.datadirect.com/download/docs/odbc/allodbc/index.html#page/odbc%2FData_Source_Configuration_through_a_GUI_14.html%23) or [command line](http://media.datadirect.com/download/docs/odbc/allodbc/index.html#page/odbc%2FData_Source_Configuration_in_the_UNIX_2fLinux_odbc_13.html%23) tools to create your data source definition. On Linux, ODBC data sources are typically defined in a file named `odbc.ini`.

Commonly-specified HAWQ ODBC data source connection properties include:

| Property Name | Value Description |
|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Database | Name of the database to which you want to connect. |
| Driver | Full path to the ODBC driver library file. |
| HostName | HAWQ master host name. |
| MaxLongVarcharSize | Maximum size of columns of type long varchar. |
| Password | Password used to connect to the specified database. |
| PortNumber | HAWQ master database port number. |

Refer to [Connection Option Descriptions](http://media.datadirect.com/download/docs/odbc/allodbc/#page/odbc%2Fgreenplum-connection-option-descriptions.html%23) for a list of ODBC connection properties supported by the HAWQ DataDirect ODBC driver.

Example HAWQ DataDirect ODBC driver data source definition:

``` shell
[HAWQ-201]
Driver=/usr/local/hawq_drivers/odbc/lib/ddgplm27.so
Description=DataDirect 7.1 Greenplum Wire Protocol - for HAWQ
Database=getstartdb
HostName=hdm1
PortNumber=5432
Password=changeme
MaxLongVarcharSize=8192
```

The first line, `[HAWQ-201]`, identifies the name of the data source.
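
To sanity-check the data source, one option is the `isql` tool from unixODBC; the tool, the file path, and the user name below are assumptions, while the DSN and password come from the definition above:

```shell
# Point the driver manager at the data source definitions
$ export ODBCINI=/usr/local/hawq_drivers/odbc/odbc.ini
# Connect by DSN name
$ isql HAWQ-201 gpadmin changeme
```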

ODBC connection properties may also be specified in a connection string identifying either a data source name, the name of a file data source, or the name of a driver. A HAWQ ODBC connection string has the following format:

``` shell
([DSN=<data_source_name>]|[FILEDSN=<filename.dsn>]|[DRIVER=<driver_name>])[;<attribute=<value>[;...]]
```

For additional information on specifying a HAWQ ODBC connection string, refer to [Using a Connection String](http://media.datadirect.com/download/docs/odbc/allodbc/index.html#page/odbc%2FUsing_a_Connection_String_16.html%23).
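
For example, a DSN-based connection string for the data source defined earlier might look like the following; `UID` and `PWD` are generic ODBC attribute names used here as assumptions, and the user is a placeholder:

```shell
DSN=HAWQ-201;UID=gpadmin;PWD=changeme
```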

### <a id="jdbc_driver"></a>JDBC Driver
The JDBC API specifies a standard set of Java interfaces to SQL-compliant databases. For additional information on using the JDBC API, refer to the [Java JDBC API](https://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/) documentation.

HAWQ supports the DataDirect JDBC Driver. Installation instructions for this driver are provided on the Pivotal Network driver download page. Refer to [HAWQ JDBC Driver](http://media.datadirect.com/download/docs/jdbc/alljdbc/help.html#page/jdbcconnect%2Fgreenplum-driver.html%23) for HAWQ-specific JDBC driver information.

#### <a id="jdbc_driver_connurl"></a>Connection URL
Connection URLs for accessing the HAWQ DataDirect JDBC driver must be in the following format:

``` shell
jdbc:pivotal:greenplum://host:port[;<property>=<value>[;...]]
```

Commonly-specified HAWQ JDBC connection properties include:

| Property Name | Value Description |
|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| DatabaseName | Name of the database to which you want to connect. |
| User | Username used to connect to the specified database. |
| Password | Password used to connect to the specified database. |

Refer to [Connection Properties](http://media.datadirect.com/download/docs/jdbc/alljdbc/help.html#page/jdbcconnect%2FConnection_Properties_10.html%23) for a list of JDBC connection properties supported by the HAWQ DataDirect JDBC driver.

Example HAWQ JDBC connection string:

``` shell
jdbc:pivotal:greenplum://hdm1:5432;DatabaseName=getstartdb;User=hdbuser;Password=hdbpass
```
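
A minimal sketch of supplying this URL to a Java client from the shell; the driver JAR name, its install path, and the client class are hypothetical placeholders rather than documented values:

```shell
# Hypothetical JAR location -- use the path and file name from your driver download
$ export CLASSPATH=/usr/local/hawq_drivers/jdbc/greenplum.jar:.
$ java MyHawqClient "jdbc:pivotal:greenplum://hdm1:5432;DatabaseName=getstartdb;User=hdbuser;Password=hdbpass"
```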

## <a id="libpq_api"></a>libpq API
`libpq` is the C API to PostgreSQL/HAWQ. This API provides a set of library functions enabling client programs to pass queries to the PostgreSQL backend server and to receive the results of those queries.

`libpq` is installed in the `lib/` directory of your HAWQ distribution. `libpq-fe.h`, the header file required for developing front-end PostgreSQL applications, can be found in the `include/` directory.

For additional information on using the `libpq` API, refer to [libpq - C Library](https://www.postgresql.org/docs/8.2/static/libpq.html) in the PostgreSQL documentation.
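
A typical compile line for a `libpq` client, assuming `$GPHOME` points at the HAWQ installation and `my_hawq_client.c` is your (hypothetical) source file:

```shell
$ gcc my_hawq_client.c -I$GPHOME/include -L$GPHOME/lib -lpq -o my_hawq_client
```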

ODBC/JDBC drivers are available as a separate download via Pivotal Network [Pivotal Network](https://network.pivotal.io/products/pivotal-hdb).
@@ -4,5 +4,5 @@ title: Supported Client Applications

Users can connect to HAWQ using various client applications:

- A number of [HAWQ Client Applications](g-greenplum-database-client-applications.html) are provided with your HAWQ installation. The `psql` client application provides an interactive command-line interface to HAWQ.
- A number of [HAWQ Client Applications](g-hawq-database-client-applications.html) are provided with your HAWQ installation. The `psql` client application provides an interactive command-line interface to HAWQ.
- Using standard database application programming interfaces, such as ODBC and JDBC, users can connect their client applications to HAWQ.
