Merged

66 commits
0bef2ed
hadoop connector
Sep 5, 2019
3f972be
Add getSortedChunkGroupMetaDataListByDeviceIds method in TsFileSequen…
Sep 6, 2019
0521df1
Add getSortedChunkGroupMetaDataListByDeviceIds method in TsFileSequen…
Sep 6, 2019
92befe7
Almost done
JackieTien97 Sep 8, 2019
335a959
update junit test code
JackieTien97 Sep 9, 2019
7519656
modify the junit test code
JackieTien97 Sep 9, 2019
10cf27d
test
JackieTien97 Sep 10, 2019
7e7bfec
solve conflicts
JackieTien97 Sep 10, 2019
a07dd1b
Merge remote-tracking branch 'upstream/master' into hadoop-connector
JackieTien97 Sep 10, 2019
680352b
add some test cases
JackieTien97 Sep 10, 2019
1f8e611
delete useless log
JackieTien97 Sep 10, 2019
91f8cdd
delete useless test file
JackieTien97 Sep 10, 2019
97ff6f0
set the default value of READ_TIME_ENABLE and READ_DELTAOBJECT_ENABLE…
JackieTien97 Sep 12, 2019
8e50261
add hadoop windows utils for windows test environment
JackieTien97 Sep 12, 2019
3b31e90
change the dir path format
JackieTien97 Sep 12, 2019
aa03ffc
modify the file path
JackieTien97 Sep 12, 2019
7f91b75
modify move file path
JackieTien97 Sep 12, 2019
0f2c992
Another try
JackieTien97 Sep 12, 2019
ff80bdd
add hadoop environment varibale
JackieTien97 Sep 12, 2019
79fb702
modify the hadoop version
JackieTien97 Sep 12, 2019
3f360a3
only add
JackieTien97 Sep 12, 2019
76b52e2
solve conflict
JackieTien97 Sep 12, 2019
8623099
add more dependency
JackieTien97 Sep 12, 2019
30da99b
modify the hadoop pom
JackieTien97 Sep 16, 2019
683aec9
add a hadoop submodule in example module
JackieTien97 Sep 16, 2019
1280bc8
before change to map
JackieTien97 Sep 17, 2019
b8bc065
Replace ArrayWritable with MapWritable and add a hadoop submodule in …
JackieTien97 Sep 17, 2019
f47e1e1
remove the redundant HDFSInput and HDFSOutput in hadoop module
JackieTien97 Sep 17, 2019
368d41c
add license
JackieTien97 Sep 17, 2019
c9fbb0d
add hive connector only for query
JackieTien97 Sep 25, 2019
e411852
resolve the pom dependency convergence error
JackieTien97 Sep 25, 2019
e07faae
1. add some configuration in hive-connector pom to package with depen…
JackieTien97 Sep 26, 2019
09e1036
customize TSFHiveOutputFormat for Hive
JackieTien97 Sep 26, 2019
74d2254
resolve the pom dependency convergence
JackieTien97 Sep 26, 2019
d21988f
customize TSFHiveInputFormat for Hive
JackieTien97 Sep 26, 2019
a94b434
One strange thing: two different InputSplit in hadoop! Confusing me a…
JackieTien97 Sep 26, 2019
61f0f29
resolve conflict
JackieTien97 Sep 26, 2019
7b94cfc
resolve conflicts
JackieTien97 Sep 27, 2019
0b5b61f
Merge remote-tracking branch 'upstream/master' into hive-connector
JackieTien97 Sep 28, 2019
4bb10d8
refactor the function name
JackieTien97 Sep 28, 2019
fe2433c
delete all the author comments in files
JackieTien97 Sep 28, 2019
a8525f2
change the package version
JackieTien97 Oct 15, 2019
3d03015
solve the hive can't use count() function problem
JackieTien97 Oct 15, 2019
d63b1a9
change key name
JackieTien97 Oct 15, 2019
9e896cf
refactor
JackieTien97 Oct 16, 2019
f0e63eb
add some comments
JackieTien97 Oct 16, 2019
871eec7
hive-connector English version doc
JackieTien97 Oct 16, 2019
095bf93
add Chinese version doc for hive-connector
JackieTien97 Oct 16, 2019
8ab9d6e
add more UT
JackieTien97 Oct 17, 2019
05a5d79
add apache-rat
JackieTien97 Oct 17, 2019
8312007
generate test.tsfile in test
JackieTien97 Oct 17, 2019
9b2e52e
change the tablename and cloumn name
JackieTien97 Oct 18, 2019
8de7f99
solve conflict
JackieTien97 Oct 22, 2019
cfe107e
solve problems
JackieTien97 Oct 22, 2019
03c3268
change the package name
JackieTien97 Oct 22, 2019
62765ee
Merge remote-tracking branch 'upstream/master' into hive-connector
JackieTien97 Oct 23, 2019
04bf1f7
Confusing OOM and change the package name in exmaple module and hadoo…
JackieTien97 Oct 23, 2019
da2d06e
Merge remote-tracking branch 'upstream/master' into hive-connector
JackieTien97 Oct 23, 2019
a9eb68d
change database to device_id and tablename to sensorid
JackieTien97 Oct 24, 2019
6db4fd9
Solve some license
JackieTien97 Oct 25, 2019
d30327f
some docs
JackieTien97 Oct 25, 2019
ffe6177
change the way to get deviceId
JackieTien97 Oct 25, 2019
b097b51
modify docs
JackieTien97 Oct 25, 2019
0d12313
modify docs
JackieTien97 Oct 26, 2019
5bdbed8
Merge remote-tracking branch 'upstream/master' into hive-connector
JackieTien97 Oct 26, 2019
8503ed4
change hive connector doc path and name
JackieTien97 Oct 26, 2019
@@ -0,0 +1,191 @@
<!--

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

-->

<!-- TOC -->
## Outline

- TsFile-Hive-Connector User Guide
- About TsFile-Hive-Connector
- System Requirements
- Data Type Correspondence
- Add Dependency For Hive
- Creating Tsfile-backed Hive tables
- Querying from Tsfile-backed Hive tables
- Select Clause Example
- Aggregate Clause Example
- What's Next

<!-- /TOC -->
# TsFile-Hive-Connector User Guide

## About TsFile-Hive-Connector

TsFile-Hive-Connector implements Hive support for reading external files in the TsFile format, enabling users to operate on TsFiles through Hive.

With this connector, you can
* Load a single TsFile, from either the local file system or HDFS, into Hive
* Load all files in a specific directory, from either the local file system or HDFS, into Hive
* Query a TsFile through HQL
* As of now, write operations are not supported in hive-connector, so `insert` statements in HQL are not allowed

## System Requirements

|Hadoop Version |Hive Version | Java Version | TsFile |
|------------- |------------ | ------------ |------------ |
| `2.7.3` or `3.2.1` | `2.3.6` or `3.1.2` | `1.8` | `0.9.0-SNAPSHOT`|

> Note: For more information about how to download and use TsFile, please see the following link: <https://github.com/apache/incubator-iotdb/tree/master/tsfile>.

## Data Type Correspondence

| TsFile data type | Hive field type |
| ---------------- | --------------- |
| BOOLEAN | Boolean |
| INT32 | INT |
| INT64 | BIGINT |
| FLOAT | Float |
| DOUBLE | Double |
| TEXT | STRING |
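
The correspondence above is a fixed lookup. A minimal illustrative sketch in Python (hypothetical helper names; the connector itself is written in Java):

```python
# Illustrative lookup mirroring the correspondence table above.
# The type names are taken verbatim from the table; this is not connector code.
TSFILE_TO_HIVE = {
    "BOOLEAN": "Boolean",
    "INT32": "INT",
    "INT64": "BIGINT",
    "FLOAT": "Float",
    "DOUBLE": "Double",
    "TEXT": "STRING",
}

def hive_type(tsfile_type: str) -> str:
    """Return the Hive field type for a TsFile data type."""
    try:
        return TSFILE_TO_HIVE[tsfile_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported TsFile type: {tsfile_type}")

print(hive_type("INT64"))  # BIGINT
```

For example, an `INT64` sensor (like `sensor_1` in the examples below) maps to a Hive `BIGINT` column.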


## Add Dependency For Hive

To use hive-connector in Hive, we need to add the hive-connector jar into Hive.

After downloading the code of iotdb from <https://github.com/apache/incubator-iotdb>, you can run `mvn clean package -pl hive-connector -am -Dmaven.test.skip=true` to get a `hive-connector-X.X.X-SNAPSHOT-jar-with-dependencies.jar`.

Then in the Hive CLI, use the `add jar XXX` command to add the dependency. For example:

```
hive> add jar /Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar;

Added [/Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar]
```


## Creating Tsfile-backed Hive tables

To create a Tsfile-backed table, specify the `serde` as `org.apache.iotdb.hive.TsFileSerDe`,
the `inputformat` as `org.apache.iotdb.hive.TSFHiveInputFormat`,
and the `outputformat` as `org.apache.iotdb.hive.TSFHiveOutputFormat`.

Also provide a schema that contains only two fields: `time_stamp` and `sensor_id`.
`time_stamp` is the time value of the time series, and `sensor_id` is the name of the sensor you want to extract from the TsFile for analysis, such as `sensor_1`.
The name of the table can be any valid Hive table name.

Also provide a location from which hive-connector will pull the most current data for the table.

The location must be a specific directory; it can be on your local file system, or on HDFS if you have set up Hadoop.
If it is on your local file system, the location should look like `file:///data/data/sequence/root.baic2.WWS.leftfrontdoor/`.

Finally, set the `device_id` in `TBLPROPERTIES` to the device name you want to analyze.

For example:

```
CREATE EXTERNAL TABLE IF NOT EXISTS only_sensor_1(
time_stamp TIMESTAMP,
sensor_1 BIGINT)
ROW FORMAT SERDE 'org.apache.iotdb.hive.TsFileSerDe'
STORED AS
INPUTFORMAT 'org.apache.iotdb.hive.TSFHiveInputFormat'
OUTPUTFORMAT 'org.apache.iotdb.hive.TSFHiveOutputFormat'
LOCATION '/data/data/sequence/root.baic2.WWS.leftfrontdoor/'
TBLPROPERTIES ('device_id'='root.baic2.WWS.leftfrontdoor.plc1');
```

In this example, we pull the data of `root.baic2.WWS.leftfrontdoor.plc1.sensor_1` from the directory `/data/data/sequence/root.baic2.WWS.leftfrontdoor/`.
This table might result in a description as below:

```
hive> describe only_sensor_1;
OK
time_stamp timestamp from deserializer
sensor_1 bigint from deserializer
Time taken: 0.053 seconds, Fetched: 2 row(s)
```

At this point, the Tsfile-backed table can be worked with in Hive like any other table.
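
The series actually read is the `device_id` from `TBLPROPERTIES` joined with each non-time column name (`root.baic2.WWS.leftfrontdoor.plc1` plus `sensor_1` above). A hypothetical Python helper, only to illustrate the dotted-path naming scheme:

```python
def series_path(device_id: str, column: str) -> str:
    """Join a device id and a sensor column name into the full
    time series path, following the dotted-path convention above."""
    return f"{device_id}.{column}"

print(series_path("root.baic2.WWS.leftfrontdoor.plc1", "sensor_1"))
# root.baic2.WWS.leftfrontdoor.plc1.sensor_1
```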


## Querying from Tsfile-backed Hive tables

Before running any queries, we need to set `hive.input.format` in Hive by executing the following command:

```
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```

Now we have an external table named `only_sensor_1` in Hive.
We can run any HQL queries to analyze the data in it.

For example:

### Select Clause Example

```
hive> select * from only_sensor_1 limit 10;
OK
1 1000000
2 1000001
3 1000002
4 1000003
5 1000004
6 1000005
7 1000006
8 1000007
9 1000008
10 1000009
Time taken: 1.464 seconds, Fetched: 10 row(s)
```

### Aggregate Clause Example

```
hive> select count(*) from only_sensor_1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = jackietien_20191016202416_d1e3e233-d367-4453-b39a-2aac9327a3b6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-10-16 20:24:18,305 Stage-1 map = 0%, reduce = 0%
2019-10-16 20:24:27,443 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local867757288_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
Time taken: 11.334 seconds, Fetched: 1 row(s)
```

## What's Next

We currently support only read operations; writing tables to TsFiles is under development.
@@ -0,0 +1,190 @@
<!--

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

-->

<!-- TOC -->
## Outline

- TsFile-Hive-Connector User Guide
- About TsFile-Hive-Connector
- System Requirements
- Data Type Correspondence
- Add Dependency For Hive
- Creating Tsfile-backed Hive tables
- Querying from Tsfile-backed Hive tables
- Select Clause Example
- Aggregate Clause Example
- What's Next

<!-- /TOC -->
# TsFile-Hive-Connector User Guide

## About TsFile-Hive-Connector

TsFile-Hive-Connector implements Hive support for reading external files in the TsFile format, enabling users to operate on TsFiles through Hive.

With this connector, you can
* Load a single TsFile, from either the local file system or HDFS, into Hive
* Load all files in a specific directory, from either the local file system or HDFS, into Hive
* Query a TsFile through HQL
* As of now, write operations are not supported in hive-connector, so `insert` statements in HQL are not allowed

## System Requirements

|Hadoop Version |Hive Version | Java Version | TsFile |
|------------- |------------ | ------------ |------------ |
| `2.7.3` or `3.2.1` | `2.3.6` or `3.1.2` | `1.8` | `0.9.0-SNAPSHOT`|

> Note: For more information about how to download and use TsFile, please see the following link: <https://github.com/apache/incubator-iotdb/tree/master/tsfile>.

## Data Type Correspondence

| TsFile data type | Hive field type |
| ---------------- | --------------- |
| BOOLEAN | Boolean |
| INT32 | INT |
| INT64 | BIGINT |
| FLOAT | Float |
| DOUBLE | Double |
| TEXT | STRING |
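
The correspondence above is a fixed lookup. A minimal illustrative sketch in Python (hypothetical helper names; the connector itself is written in Java):

```python
# Illustrative lookup mirroring the correspondence table above.
# The type names are taken verbatim from the table; this is not connector code.
TSFILE_TO_HIVE = {
    "BOOLEAN": "Boolean",
    "INT32": "INT",
    "INT64": "BIGINT",
    "FLOAT": "Float",
    "DOUBLE": "Double",
    "TEXT": "STRING",
}

def hive_type(tsfile_type: str) -> str:
    """Return the Hive field type for a TsFile data type."""
    try:
        return TSFILE_TO_HIVE[tsfile_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported TsFile type: {tsfile_type}")

print(hive_type("INT64"))  # BIGINT
```

For example, an `INT64` sensor (like `sensor_1` in the examples below) maps to a Hive `BIGINT` column.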


## Add Dependency For Hive

To use hive-connector in Hive, we need to add the hive-connector jar into Hive.

After downloading the code of iotdb from <https://github.com/apache/incubator-iotdb>, you can run `mvn clean package -pl hive-connector -am -Dmaven.test.skip=true` to get a `hive-connector-X.X.X-SNAPSHOT-jar-with-dependencies.jar`.

Then in the Hive CLI, use the `add jar XXX` command to add the dependency. For example:

```
hive> add jar /Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar;

Added [/Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/Users/hive/incubator-iotdb/hive-connector/target/hive-connector-0.9.0-SNAPSHOT-jar-with-dependencies.jar]
```


## Creating Tsfile-backed Hive tables

To create a Tsfile-backed table, specify the `serde` as `org.apache.iotdb.hive.TsFileSerDe`,
specify the `inputformat` as `org.apache.iotdb.hive.TSFHiveInputFormat`,
and the `outputformat` as `org.apache.iotdb.hive.TSFHiveOutputFormat`.

Also provide a schema that contains only two fields: `time_stamp` and `sensor_id`.
`time_stamp` is the time value of the time series,
and `sensor_id` is the name of the sensor you want to extract from the TsFile for analysis, such as `sensor_1`.
The name of the table can be any valid Hive table name.

Also provide a location from which hive-connector will pull the most current data for the table.

The location must be a specific directory; it can be on your local file system, or on HDFS if you have set up Hadoop.
If it is on your local file system, the location should look like `file:///data/data/sequence/root.baic2.WWS.leftfrontdoor/`.

Finally, set the `device_id` in `TBLPROPERTIES` to the device name you want to analyze.

For example:

```
CREATE EXTERNAL TABLE IF NOT EXISTS only_sensor_1(
time_stamp TIMESTAMP,
sensor_1 BIGINT)
ROW FORMAT SERDE 'org.apache.iotdb.hive.TsFileSerDe'
STORED AS
INPUTFORMAT 'org.apache.iotdb.hive.TSFHiveInputFormat'
OUTPUTFORMAT 'org.apache.iotdb.hive.TSFHiveOutputFormat'
LOCATION '/data/data/sequence/root.baic2.WWS.leftfrontdoor/'
TBLPROPERTIES ('device_id'='root.baic2.WWS.leftfrontdoor.plc1');
```
In this example, we pull the data of `root.baic2.WWS.leftfrontdoor.plc1.sensor_1` from the directory `/data/data/sequence/root.baic2.WWS.leftfrontdoor/`.
This table might result in a description as below:

```
hive> describe only_sensor_1;
OK
time_stamp timestamp from deserializer
sensor_1 bigint from deserializer
Time taken: 0.053 seconds, Fetched: 2 row(s)
```
At this point, the Tsfile-backed table can be worked with in Hive like any other table.
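
The series actually read is the `device_id` from `TBLPROPERTIES` joined with each non-time column name (`root.baic2.WWS.leftfrontdoor.plc1` plus `sensor_1` above). A hypothetical Python helper, only to illustrate the dotted-path naming scheme:

```python
def series_path(device_id: str, column: str) -> str:
    """Join a device id and a sensor column name into the full
    time series path, following the dotted-path convention above."""
    return f"{device_id}.{column}"

print(series_path("root.baic2.WWS.leftfrontdoor.plc1", "sensor_1"))
# root.baic2.WWS.leftfrontdoor.plc1.sensor_1
```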

## Querying from Tsfile-backed Hive tables

Before running any queries, we need to set `hive.input.format` in Hive by executing the following command:

```
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```

Now we have an external table named `only_sensor_1` in Hive.
We can run any HQL queries to analyze the data in it.

For example:

### Select Clause Example

```
hive> select * from only_sensor_1 limit 10;
OK
1 1000000
2 1000001
3 1000002
4 1000003
5 1000004
6 1000005
7 1000006
8 1000007
9 1000008
10 1000009
Time taken: 1.464 seconds, Fetched: 10 row(s)
```

### Aggregate Clause Example

```
hive> select count(*) from only_sensor_1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = jackietien_20191016202416_d1e3e233-d367-4453-b39a-2aac9327a3b6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-10-16 20:24:18,305 Stage-1 map = 0%, reduce = 0%
2019-10-16 20:24:27,443 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local867757288_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
Time taken: 11.334 seconds, Fetched: 1 row(s)
```

## What's Next

We currently support only read operations.
Writing tables to TsFiles is under development.
