DRILL-8204: Allow Provided Schema for HTTP Plugin in JSON Mode #2526

Merged
merged 18 commits into from
May 3, 2022

Conversation

cgivre
Contributor

@cgivre cgivre commented May 1, 2022

DRILL-8204: Allow Provided Schema for HTTP Plugin in JSON Mode

Description

See below. 👇

Documentation

Schema Provisioning

One of the challenges of querying APIs is inconsistent data. Drill allows you to provide a schema for individual endpoints. You can do this in one of two ways:

  1. By providing a schema inline. See: Specifying Schema as Table Function Parameter.
  2. By providing a schema in the configuration for the endpoint.

Schema provisioning currently supports the complex types Array and Map at any nesting level.

Example Schema Provisioning:

"jsonOptions": {
  "schema": {
    "columns": [
      {
        "name": "outer_map",
        "type": "ARRAY<STRUCT<`bigint_col` BIGINT, `boolean_col` BOOLEAN, `date_col` DATE, `double_col` DOUBLE, `interval_col` INTERVAL, `int_col` BIGINT, `timestamp_col` TIMESTAMP, `time_col` TIME, `varchar_col` VARCHAR>>",
        "mode": "REPEATED"
      }, {
        "name": "field1",
        "type": "VARCHAR",
        "mode": "OPTIONAL"
      }
    ]
  }
}
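
Hand-written schema blocks like the one above are easy to break with a stray comma or an unbalanced bracket. As a minimal sketch, and not part of Drill itself, the following Python generates the same structure and lets `json.dumps` handle the syntax. The helper names `column` and `json_options` are hypothetical, and the struct is shortened to three fields for brevity.

```python
import json

def column(name, type_, mode):
    """Build one column entry in the shape used by the example above."""
    return {"name": name, "type": type_, "mode": mode}

def json_options(columns):
    """Wrap column entries in the jsonOptions/schema structure."""
    return {"jsonOptions": {"schema": {"columns": list(columns)}}}

# Shortened field list (assumption: any subset of the example's fields).
struct_fields = [
    ("bigint_col", "BIGINT"),
    ("boolean_col", "BOOLEAN"),
    ("varchar_col", "VARCHAR"),
]
struct_type = (
    "ARRAY<STRUCT<"
    + ", ".join(f"`{name}` {sql_type}" for name, sql_type in struct_fields)
    + ">>"
)

config = json_options([
    column("outer_map", struct_type, "REPEATED"),
    column("field1", "VARCHAR", "OPTIONAL"),
])

# Serialize; the result is guaranteed to be syntactically valid JSON.
text = json.dumps(config, indent=2)
print(text)
```

The output is a complete JSON object, so drop the outermost braces when pasting the `jsonOptions` member into an existing endpoint configuration.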

Dealing With Inconsistent Schemas

One of the major challenges of working with JSON data is an inconsistent schema. Drill has a UNION data type, but it is marked as experimental. At the time of writing, the HTTP plugin does not support UNION; however, supplying a schema can resolve many of these issues.

JSON Mode

Drill offers the option of reading all JSON values as strings. While this can complicate downstream analytics, it can also be a more memory-efficient way of reading data with an inconsistent schema. Unfortunately, at the time of writing, JSON mode is only available with a provided schema. However, future work will allow this mode to be enabled for any JSON data.

Enabling JSON Mode:

You can enable JSON mode simply by adding the drill.json-mode property with a value of json to a field, as shown below:

"jsonOptions": {
  "readNumbersAsDouble": true,
  "schema": {
    "type": "tuple_schema",
    "columns": [
      {
        "name": "custom_fields",
        "type": "ARRAY<STRUCT<`value` VARCHAR PROPERTIES { 'drill.json-mode' = 'json' }>>",
        "mode": "REPEATED"
      }
    ]
  }
}
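
To illustrate the idea behind JSON mode, here is a rough Python sketch of what "reading a value as a string" means for a field whose type changes from record to record. This is only a conceptual illustration and not Drill's implementation; the sample records are made up, and the exact text Drill emits for each value may differ.

```python
import json

# Hypothetical API records: `value` changes type from row to row.
records = [
    {"value": 42},
    {"value": "forty-two"},
    {"value": [4, 2]},
    {"value": {"n": 42}},
]

def as_json_text(value):
    """Return the JSON text of a value, so every row has the same type (str)."""
    return json.dumps(value, separators=(",", ":"))

# Every entry is now a string, regardless of the original JSON type.
column = [as_json_text(record["value"]) for record in records]
print(column)
```

Because every row ends up as a string, the column has a single consistent type, and any parsing of the nested content is deferred to downstream queries.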

Testing

Added unit tests.

@cgivre cgivre added the bug and enhancement labels May 1, 2022
@cgivre cgivre requested a review from jnturton May 1, 2022 13:30
@cgivre cgivre self-assigned this May 1, 2022
@cgivre cgivre requested review from vvysotskyi and removed request for jnturton May 2, 2022 12:33
Member

@vvysotskyi vvysotskyi left a comment

We should also support Specifying the Schema as Table Function Parameter, and resolve the target schema when both are specified, so that users can customize data types without accessing storage configs.

@jnturton
Contributor

jnturton commented May 2, 2022

This comment is similar to the last one from @vvysotskyi except that I was experimenting with the metastore. While I cannot see that CREATE SCHEMA could work for the http storage plugin, I think it is still possible in theory to store a schema in the metastore something like the following.

ANALYZE TABLE table(http.tpch_test_svc.`/nation.json` (type=>'json',
    schema=>'inline=(
        `n_nationkey` INT not null,
        `n_name` VARCHAR not null,
        `n_regionkey` DOUBLE not null,
        `n_comment` VARCHAR not null)'
    )) REFRESH METADATA;

Note that the above command works for local CSV but I got a Calcite error for a local JSON file. I did not test it with the HTTP plugin.

Some information about provided schema priority, should it be of interest: https://drill.apache.org/docs/using-drill-metastore/#schema-priority.

@vvysotskyi
Member

Metastore with JSON should also work fine, here is the unit test that checks it: TestMetastoreWithEasyFormatPlugin.testAnalyzeOnJsonTable.
For now, drill metastore supports only easy file formats and parquet, but in the future, it could handle the HTTP plugin.

@jnturton
Contributor

jnturton commented May 2, 2022

For now, drill metastore supports only easy file formats and parquet, but in the future, it could handle the HTTP plugin.

@vvysotskyi I think that the HTTP plugin makes use of the same readers as the easy format plugins (CSV, JSON, XML)? Does that mean that metastore might already work with HTTP, or are there likely to be pieces missing?

@cgivre
Contributor Author

cgivre commented May 2, 2022

@vvysotskyi Thanks for the review. I added the logic to the HttpBatchReader to support inline schema; however, in doing so, I seem to have found a bug in the SchemaNegotiator: the schema always seems to be null with the inline schema.

I created DRILL-8205 to address this. I left the logic and a unit test so once we have a fix for DRILL-8205 it should all work as expected.

@cgivre
Contributor Author

cgivre commented May 2, 2022

@vvysotskyi Thanks for your comments and assistance! I got this to work and the HTTP plugin now supports inline schema!

Member

@vvysotskyi vvysotskyi left a comment

Thanks for making changes, +1

@cgivre cgivre merged commit b5ddf88 into apache:master May 3, 2022
jnturton pushed a commit to jnturton/drill that referenced this pull request Jul 11, 2022
…e#2526)

* Initial commit

* Map working

* WIP

* Added builder

* Lists in maps working

* Add documentation

* Cleaned up UT

* Final Revision

* Fix checkstyle

* Minor tweak

* removed extra test file

* Removed unused import

* Added inline schema support

* Addressed review comments

* Removed unused import

* Removed json string

* Final Revisions

* Fixed unit test