
HAWQ-1304 - multiple doc changes for PXF and Hive Plugin #94

Closed

Conversation

lisakowen
Contributor

doc changes for HAWQ-1228:

  • outputformat class
  • hcatalog use of optimal Hive* profile
  • hive profile use of optimal Hive* profile
  • enabling logging to see Hive* profile actually used
  • remove performance statements

related changes:

  • metadata class
  • clarify Hive plug-in prerequisites
  • add example PXF profile definition
  • misc editing and formatting changes


This example will employ the array and map complex types, specifically an array of integers and a string key/value pair map.
This example will employ the `Hive` profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
Contributor

Let's change "will employ" to "employs" here.

| host | The HDFS NameNode. |
| port | Connection port for the PXF service. If the port is omitted, PXF assumes that High Availability (HA) is enabled and connects to the HA name service port, 51200, by default. The HA name service port can be changed by setting the `pxf_service_port` configuration parameter. |
| \<path\-to\-data\> | A directory, file name, wildcard pattern, table name, etc. |
| PROFILE | The profile PXF should use to access the data. PXF supports multiple plug-ins that currently expose profiles named `HBase`, `Hive`, `HiveRC`, `HiveText`, `HiveORC`, `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `SequenceWritable`, and `Json`. |
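
For illustration, a readable external table definition using the `PROFILE` parameter might look like the following sketch (the table name, Hive table `default.sales`, and host `namenode` are hypothetical placeholders):

```sql
-- Hypothetical example: access Hive table "default.sales" via the Hive profile.
-- Replace namenode:51200 with your PXF service host and port.
CREATE EXTERNAL TABLE salesinfo_hiveprofile (location text, month text, num_orders int)
LOCATION ('pxf://namenode:51200/default.sales?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
```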
Contributor

Change "should use" to "uses"

The `LOCATION` string in a PXF `CREATE EXTERNAL TABLE` statement is a URI that specifies the host and port of an external data source and the path to the data in the external data source. The query portion of the URI, introduced by the question mark (?), must include the required parameters `FRAGMENTER` (readable tables only), `ACCESSOR`, and `RESOLVER`, which specify Java class names that extend the base PXF API plug-in classes. Alternatively, the required parameters can be replaced with a `PROFILE` parameter with the name of a profile defined in the `/etc/conf/pxf-profiles.xml` that defines the required classes.
The `LOCATION` string in a PXF `CREATE EXTERNAL TABLE` statement is a URI that specifies the host and port of an external data source and the path to the data in the external data source. The query portion of the URI, introduced by the question mark (?), must include the PXF profile name or the plug-in's `FRAGMENTER` (readable tables only), `ACCESSOR`, and `RESOLVER` class names.

PXF profiles are defined in the `/etc/pxf/conf/pxf-profiles.xml` file. Profile definitions include plug-in class names. For example, the `HdfsTextSimple` profile definition follows:
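
As a sketch, a profile entry in `pxf-profiles.xml` maps the profile name to its plug-in classes (class names here are taken from the Apache HAWQ HDFS plug-in; verify against the `pxf-profiles.xml` shipped with your installation):

```xml
<profile>
    <name>HdfsTextSimple</name>
    <description>Read or write delimited single-line records from plain text files on HDFS</description>
    <plugins>
        <fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</accessor>
        <resolver>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</resolver>
    </plugins>
</profile>
```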
Contributor

Change "follows" to "is"

@@ -232,23 +250,23 @@ public class InputData {

### <a id="fragmenter"></a>Fragmenter

**Note:** The Fragmenter Plugin reads data into HAWQ readable external tables. The Fragmenter Plugin cannot write data out of HAWQ into writable external tables.
**Note:** The Fragmenter class reads data into HAWQ readable external tables. The Fragmenter class cannot write data out of HAWQ into writable external tables.
Contributor

It seems like "reads" in this sentence should be changed to "formats" or "loads", but maybe neither of those are correct. You decide.


The `ANALYZE` command now retrieves advanced statistics for PXF readable tables by estimating the number of tuples in a table, creating a sample table from the external table, and running advanced statistics queries on the sample table in the same way statistics are collected for native HAWQ tables.

The configuration parameter `pxf_enable_stat_collection` controls collection of advanced statistics. If `pxf_enable_stat_collection` is set to false, no analysis is performed on PXF tables. An additional parameter, `pxf_stat_max_fragments`, controls the number of fragments sampled to build a sample table. By default `pxf_stat_max_fragments` is set to 100, which means that even if there are more than 100 fragments, only this number of fragments will be used in `ANALYZE` to sample the data. Increasing this number will result in better sampling, but can also impact performance.

When a PXF table is analyzed and `pxf_enable_stat_collection` is set to off, or an error occurs because the table is not defined correctly, the PXF service is down, or `getFragmentsStats` is not implemented, a warning message is shown and no statistics are gathered for that table. If `ANALYZE` is running over all tables in the database, the next table will be processed – a failure processing one table does not stop the command.
When a PXF table is analyzed and `pxf_enable_stat_collection` is set to off, or an error occurs because the table is not defined correctly, the PXF service is down, or `getFragmentsStats()` is not implemented, a warning message is shown and no statistics are gathered for that table. If `ANALYZE` is running over all tables in the database, the next table will be processed – a failure processing one table does not stop the command.
Contributor

This sentence really needs to be unpacked. My best take at it is:

When a PXF table is analyzed, any of the following conditions might result in a warning message with no statistics gathered for the table:

  • pxf_enable_stat_collection is set to off, or
  • an error occurs because the table is not defined correctly, or
  • the PXF service is down, or
  • getFragmentsStats() is not implemented
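
A quick sketch of the configuration parameters involved (parameter names as in the text above; the table name is hypothetical):

```sql
-- Enable advanced statistics collection for PXF tables.
SET pxf_enable_stat_collection = true;
-- Sample at most 200 fragments when building the sample table
-- (default is 100; higher values sample better but cost more).
SET pxf_stat_max_fragments = 200;
-- Analyze a (hypothetical) PXF external table.
ANALYZE pxf_sales;
```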

@@ -663,8 +695,8 @@ public interface WriteResolver {

**Note:**

- getFields should return a List&lt;OneField&gt;, each OneField representing a single field.
- `setFields `should return a single `OneRow `object, given a List&lt;OneField&gt;.
- `getFields()` should return a `List<OneField>`, each `OneField` representing a single field.
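
As a sketch of that contract (the package and interface names follow the `org.apache.hawq.pxf.api` classes quoted elsewhere in this diff; the field-extraction logic and `DemoResolver` name are hypothetical):

```java
import java.util.LinkedList;
import java.util.List;

import org.apache.hawq.pxf.api.OneField;
import org.apache.hawq.pxf.api.OneRow;
import org.apache.hawq.pxf.api.ReadResolver;
import org.apache.hawq.pxf.api.io.DataType;

// Hypothetical resolver: deserializes a comma-delimited row into two fields.
public class DemoResolver implements ReadResolver {
    @Override
    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new LinkedList<>();
        String[] parts = row.getData().toString().split(",");
        // Each OneField pairs a DataType with the matching Java value.
        fields.add(new OneField(DataType.INTEGER.getOID(), Integer.parseInt(parts[0])));
        fields.add(new OneField(DataType.TEXT.getOID(), parts[1]));
        return fields;
    }
}
```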
Contributor

Change "each" to "with each"

@@ -687,7 +719,7 @@ public class OneField {
}
```

The value of `type` should follow the org.apache.hawq.pxf.api.io.DataType `enums`. `val` is the appropriate Java class. Supported types are as follows:
The value of `type` should follow the `org.apache.hawq.pxf.api.io.DataType` `enums`. `val` is the appropriate Java class. Supported types are as follows:
Contributor

Remove "as follows" here.

@@ -885,7 +923,7 @@ public class Constant

#### <a id="filterobject"></a>Filter Object

Filter Objects can be internal, such as those you define; or external, those that the remote system uses. For example, for HBase, you define the HBase `Filter` class (`org.apache.hadoop.hbase.filter.Filter`), while for Hive, you use an internal default representation created by the PXF framework, called `BasicFilter`. You can decide the filter object to use, including writing a new one. `BasicFilter` is the most common:
Filter Objects can be internal, such as those you define; or external, those that the remote system uses. For example, for HBase, you define the HBase `Filter` class (`org.apache.hadoop.hbase.filter.Filter`), while for Hive, you use an internal default representation created by the PXF framework, called `BasicFilter`. You can choose the filter object to use, including writing a new one. `BasicFilter` is the most common:
Contributor

Change the ; to a , (or use dashes as instead like -such as those you define-)

Also remove the commas after HBase, and Hive, in the second sentence.

@@ -34,7 +34,7 @@ PXF comes with a number of built-in profiles that group together a collection o
- HBase (Read only)
- JSON (Read only)

You can specify a built-in profile when you want to read data that exists inside HDFS files, Hive tables, HBase tables, and JSON files and for writing data into HDFS files.
You can specify a built-in profile when you want to read data that exists inside HDFS files, Hive tables, HBase tables, and JSON files and when you want to write data into HDFS files.
Contributor

Add a comma after "JSON files," Also not sure but it seems like the "and" should change to "or" in this section.

@@ -195,6 +195,8 @@ Examine/collect the log messages from `pxf-service.log`.

### <a id="pxfdblogmsg"></a>Database-Level Logging

Database-level logging may provide insight into internal PXF service operations. Additionally, when accessing Hive tables using `hcatalog` or the `Hive*` profiles, log messages will identify the underlying `Hive*` profile(s) employed to access the data.
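
For example, at session level (`DEBUG2` is verbose, so scope it to one session; the queried table name is hypothetical):

```sql
-- Raise client logging verbosity for this session only, then query the
-- Hive-backed table; the DEBUG2 messages identify which Hive* profile
-- PXF selected for each fragment.
SET client_min_messages = DEBUG2;
SELECT * FROM hcatalog.default.sales_info;
SET client_min_messages = NOTICE;
```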
Contributor

Change "accessing" to "you access" and remove "will"

@lisakowen
Contributor Author

thanks for reviewing, david. i've incorporated your review comments.

@dyozie
Contributor

dyozie commented Feb 3, 2017

Thank you! Is this ready to go, then, or are you looking for other reviewers to comment?

@lisakowen
Contributor Author

waiting on engineering review.

@shivzone

shivzone commented Feb 8, 2017

@lisakowen with the recent fixes using hcatalog, we should make hcatalog-based access the primary example for the PXF Hive Plugin. We can continue having examples for the Hive profile to demonstrate the ability to access a table with multiple partitions/storage types.

@@ -131,6 +149,8 @@ Note: The <code class="ph codeph">DELIMITER</code> parameter is mandatory.
</tbody>
</table>

**Notes**: Metadata identifies the Java class that provides field definitions in the relation. OutputFormat identifies the file format for which a specific profile is optimized. While the built-in `Hive*` profiles provide Metadata and OutputFormat classes, most profiles will have no need to implement or specify these classes.

I don't think this description about OutputFormat is the most accurate one. Oleks can you provide a more accurate description for this property


We can probably mention that the PXF service can produce data in different formats (TEXT, GPDBWritable), and that the outputFormat property means a given profile is optimized for a particular output format.

Contributor Author

the statement is kind of vague, yes. i will make it clearer.

it seems to me that the outputFormat property is more admin- or developer-focused. what i mean here is that a typical end-user using a Hive* profile probably does not need to know about the setting. an admin configuring custom profiles or a developer creating a custom plug-in may be interested. the current admin- and developer- focused PXF docs need some attention. can we address a more detailed discussion of this property as part of that rework at a later time?


@lisakowen you are right. outputFormat is a property intended for developers creating custom plugins/profiles. End users who only wish to use any of the existing profiles don't have to know about this.

@lisakowen
Contributor Author

@shivzone - i will move the hcatalog section above the external tables section. i can also integrate the example that uses the Hive profile with multiple partitions of differing file formats from PR #90 that has not yet been reviewed. are you asking that this page include only Hive profile examples, or do you want to keep the existing HiveRC/HiveText examples? thanks.

@shivzone

shivzone commented Feb 8, 2017

We can continue having the other examples. We can retire them over time.

The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`.
In the previous section, you used HCatalog integration to query a Hive table. You can also create a PXF/HAWQ external table to access Hive table data. This Hive table access mechanism requires that you identify an appropriate Hive profile.

The PXF Hive plug-in supports several Hive-related profiles. These include `Hive`, `HiveText`, and `HiveRC`. `HiveText` and `HiveRC` profiles are specifically optimized for text and RC file formats, respectively. The `Hive` profile is optimized for all file storage types; use the `Hive` profile when the underlying Hive table is composed of multiple partitions with differing file formats.
Contributor

Third sentence here should start with "The"

@dyozie
Contributor

dyozie commented Feb 8, 2017

Thanks, Lisa! The changes are in.

@lisakowen
Contributor Author

merged. closing.

@lisakowen lisakowen closed this Feb 9, 2017
@lisakowen lisakowen deleted the feature/HAWQ-1304-pxf-n-hive-chgs branch March 10, 2017 20:06