Skip to content

embulk/embulk-filter-expand_json

main
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
src
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Expand Json filter plugin for Embulk

Release Status Build Status

expand columns having json into multiple columns

Overview

  • Plugin type: filter

Configuration

  • json_column_name: a column name having json to be expanded (string, required)
  • root: root property to start fetching each entries, specify in JsonPath style (string, default: "$.")
  • expanded_columns: columns expanded into multiple columns (array of hash, required)
    • name: name of the column. you can define JsonPath style.
    • type: type of the column (see below)
    • format: format of the timestamp if type is timestamp
    • timezone: Time zone of each timestamp columns if values don’t include time zone description (UTC by default)
  • keep_expanding_json_column: Not remove the expanding json column from input schema if it's true (false by default)
  • default_timezone: Time zone of timestamp columns if values don’t include time zone description (UTC by default)
  • stop_on_invalid_record: Stop bulk load transaction if an invalid record is included (false by default)
  • cache_provider: Cache provider name for JsonPath. "LRU" and "NOOP" are built-in. You can specify user defined class. (string, default: "LRU")
    • "NOOP" becomes default in the future.

type of the column

name description
boolean true or false
long 64-bit signed integers
timestamp Date and time with nano-seconds precision
double 64-bit floating point numbers
string Strings

Example

filters:
  - type: expand_json
    json_column_name: json_payload
    root: "$."
    expanded_columns:
      - {name: "phone_numbers", type: string}
      - {name: "app_id", type: long}
      - {name: "point", type: double}
      - {name: "created_at", type: timestamp, format: "%Y-%m-%d", timezone: "UTC"}
      - {name: "profile.anniversary.et", type: string}
      - {name: "profile.anniversary.voluptatem", type: string}
      - {name: "profile.like_words[1]", type: string}
      - {name: "profile.like_words[2]", type: string}
      - {name: "profile.like_words[0]", type: string}

Note

  • If the value evaluated by JsonPath is Array or Hash, the value is written as JSON.

Dependencies

Development

Run Example

$ ./gradlew gem
$ embulk run -Ibuild/gemContents/lib ./example/config.yml

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Benchmark for cache_provider option

In some cases, cache_provider: NOOP improves the performance of this plugin by 3 times (#41). So we do a benchmark about cache_provider. In our case, cache_provider: noop improves the performance by 1.5 times.

use expand_json filter cache_provider Time took records/s
false none 7.62s 1,325,459/s
true "LRU" 2m9s 78,025/s
true "NOOP" 1m25s 118,476/s

You can reproduce the bench by the below way.

./gradlew gem
./bench/run.sh

For Maintainers

Release

Modify version in build.gradle at a detached commit, and then tag the commit with an annotation.

git checkout --detach master

(Edit: Remove "-SNAPSHOT" in "version" in build.gradle.)

git add build.gradle

git commit -m "Release vX.Y.Z"

git tag -a vX.Y.Z

(Edit: Write a tag annotation in the changelog format.)

See Keep a Changelog for the changelog format. We adopt a part of it for Git's tag annotation like below.

## [X.Y.Z] - YYYY-MM-DD

### Added
- Added a feature.

### Changed
- Changed something.

### Fixed
- Fixed a bug.

Push the annotated tag, then. It triggers a release operation on GitHub Actions after approval.

git push -u origin vX.Y.Z

Contributor

  • @Civitaspo
  • @muga
  • @sakama

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Languages