civitaspo/embulk-input-hdfs
Hdfs file input plugin for Embulk


Reads files on HDFS.

Overview

  • Plugin type: file input
  • Resume supported: not yet
  • Cleanup supported: no

Configuration

  • config_files list of paths to Hadoop's configuration files (array of strings, default: [])
  • config overwrites configuration parameters (hash, default: {})
  • path file path on HDFS. You can use glob patterns and Date formats such as %Y%m%d/%s (string, required)
  • rewind_seconds when you use a Date format in the path property, the format is expanded using the current time minus this number of seconds (long, default: 0)
  • partition when true, input files are partitioned and the task count increases (boolean, default: true)
  • num_partitions number of partitions (long, default: Runtime.getRuntime().availableProcessors())
  • skip_header_lines number of lines to skip at the beginning of each file. Set to 1 if the file has a header line (long, default: 0)
  • decompression decompress compressed files using the Hadoop compression codec API (boolean, default: false)
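To illustrate how rewind_seconds interacts with a Date format in path, here is a minimal, hypothetical sketch (this is not the plugin's code; the class and method names are invented, and only %Y, %m, and %d are handled): the format is expanded against "now minus rewind_seconds", so rewind_seconds: 86400 targets the previous day's directory.

```java
import java.time.ZonedDateTime;
import java.time.ZoneOffset;

public class RewindExample {
    // Expand %Y/%m/%d in the template using (now - rewindSeconds).
    public static String resolvePath(String template, long rewindSeconds, ZonedDateTime now) {
        ZonedDateTime t = now.minusSeconds(rewindSeconds);
        return template
                .replace("%Y", String.format("%04d", t.getYear()))
                .replace("%m", String.format("%02d", t.getMonthValue()))
                .replace("%d", String.format("%02d", t.getDayOfMonth()));
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.of(2016, 1, 2, 0, 0, 0, 0, ZoneOffset.UTC);
        // rewind_seconds: 86400 -> the previous day's directory
        System.out.println(resolvePath("/user/embulk/test/%Y-%m-%d/*", 86400L, now));
        // prints /user/embulk/test/2016-01-01/*
    }
}
```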

Example

in:
  type: hdfs
  config_files:
    - /etc/hadoop/conf/core-site.xml
    - /etc/hadoop/conf/hdfs-site.xml
  config:
    fs.defaultFS: 'hdfs://hadoop-nn1:8020'
    dfs.replication: 1
    fs.hdfs.impl: 'org.apache.hadoop.hdfs.DistributedFileSystem'
    fs.file.impl: 'org.apache.hadoop.fs.LocalFileSystem'
  path: /user/embulk/test/%Y-%m-%d/*
  rewind_seconds: 86400
  partition: true
  num_partitions: 30
  decoders:
    - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: "\t"
    quote: ''
    escape: ''
    trim_if_not_quoted: true
    skip_header_lines: 0
    allow_extra_columns: true
    allow_optional_columns: true
    columns:
    - {name: c0, type: string}
    - {name: c1, type: string}
    - {name: c2, type: string}
    - {name: c3, type: long}

Note

  • The num_partitions parameter is an approximate value. The actual number of partitions may be larger than this parameter.
  • The partitioning feature supports only three line terminators:
    • \n
    • \r
    • \r\n
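The following minimal sketch (hypothetical names, not the plugin's code) shows why the actual partition count can exceed num_partitions: each file is split with a ceiling division, so per-file rounding adds extra partitions.

```java
public class PartitionCountExample {
    // Mirror the reference arithmetic: partition size from the total length,
    // then ceil(len / partitionSize) partitions per file.
    public static long actualPartitions(long[] fileLengths, long totalLength, long numPartitionsParam) {
        long partitionSize = totalLength / numPartitionsParam;
        long actual = 0;
        for (long len : fileLengths) {
            actual += ((len - 1) / partitionSize) + 1; // ceiling division per file
        }
        return actual;
    }

    public static void main(String[] args) {
        long[] fileLengths = {100L, 100L, 100L}; // three files, 300 bytes total
        // num_partitions: 4 requested, but each 100-byte file needs
        // ceil(100 / 75) = 2 partitions, so we end up with 6.
        System.out.println(actualPartitions(fileLengths, 300L, 4L)); // prints 6
    }
}
```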

The Reference Implementation

The Partitioning Logic

long partitionSizeByOneTask = totalFileLength / approximateNumPartitions;

/*
...
*/
long numPartitions = 1; // default: no partitioning
if (isPartitionable(task, conf, status)) { // partition: true and (decompression: false or CompressionCodec is null)
    numPartitions = ((status.getLen() - 1) / partitionSizeByOneTask) + 1; // ceil(len / partitionSize)
}

for (long i = 0; i < numPartitions; i++) {
    long start = status.getLen() * i / numPartitions;
    long end = status.getLen() * (i + 1) / numPartitions;
    if (start < end) {
        TargetFileInfo targetFileInfo = new TargetFileInfo.Builder()
                .pathString(status.getPath().toString())
                .start(start)
                .end(end)
                .isDecompressible(isDecompressible(task, conf, status))
                .isPartitionable(isPartitionable(task, conf, status))
                .numHeaderLines(task.getSkipHeaderLines())
                .build();
        builder.add(targetFileInfo);
    }
}
/*
...
*/
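The boundary arithmetic above can be checked in isolation. This sketch (hypothetical class name, same start/end formula as the loop) splits a 10-byte file into 3 partitions; note that each partition's end equals the next partition's start, so the ranges cover the file with no gaps or overlaps.

```java
public class BoundaryExample {
    // Same formula as the reference implementation's loop body.
    public static long[] range(long len, long numPartitions, long i) {
        long start = len * i / numPartitions;
        long end = len * (i + 1) / numPartitions;
        return new long[]{start, end};
    }

    public static void main(String[] args) {
        long len = 10L, numPartitions = 3L;
        for (long i = 0; i < numPartitions; i++) {
            long[] r = range(len, numPartitions, i);
            System.out.println(r[0] + ".." + r[1]);
        }
        // prints:
        // 0..3
        // 3..6
        // 6..10
    }
}
```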

Build

$ ./gradlew gem

Development

$ ./gradlew classpath
$ bundle exec embulk run -I lib example.yml