
DOCS-561: Repaired docs build #9376

Merged
274 changes: 274 additions & 0 deletions docs/fa/interfaces/formats.md
@@ -28,6 +28,11 @@ Format | INSERT | SELECT
[PrettyCompactMonoBlock](formats.md#prettycompactmonoblock) | ✗ | ✔ |
[PrettyNoEscapes](formats.md#prettynoescapes) | ✗ | ✔ |
[PrettySpace](formats.md#prettyspace) | ✗ | ✔ |
[Protobuf](#protobuf) | ✔ | ✔ |
[Avro](#data-format-avro) | ✔ | ✔ |
[AvroConfluent](#data-format-avro-confluent) | ✔ | ✗ |
[Parquet](#data-format-parquet) | ✔ | ✔ |
[ORC](#data-format-orc) | ✔ | ✗ |
[RowBinary](formats.md#rowbinary) | ✔ | ✔ |
[Native](formats.md#native) | ✔ | ✔ |
[Null](formats.md#null) | ✗ | ✔ |
@@ -750,4 +755,273 @@ struct Message {

</div>

## Protobuf {#protobuf}

Protobuf is a [Protocol Buffers](https://developers.google.com/protocol-buffers/) format.

This format requires an external format schema. The schema is cached between queries.
ClickHouse supports both `proto2` and `proto3` syntaxes. Repeated/optional/required fields are supported.

Usage examples:

```sql
SELECT * FROM test.table FORMAT Protobuf SETTINGS format_schema = 'schemafile:MessageType'
```

```bash
cat protobuf_messages.bin | clickhouse-client --query "INSERT INTO test.table FORMAT Protobuf SETTINGS format_schema='schemafile:MessageType'"
```

where the file `schemafile.proto` looks like this:

```protobuf
syntax = "proto3";

message MessageType {
  string name = 1;
  string surname = 2;
  uint32 birthDate = 3;
  repeated string phoneNumbers = 4;
};
```

To find the correspondence between table columns and the fields of a Protocol Buffers message type, ClickHouse compares their names.
The comparison is case-insensitive, and the characters `_` (underscore) and `.` (dot) are treated as equal.
If the types of a column and a Protocol Buffers field differ, the necessary conversion is applied.
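
For example, the messages described by `schemafile.proto` above could be loaded into a table such as this one (a hypothetical definition; the column names match the message fields, and any compatible column types would work):

```sql
CREATE TABLE test.table
(
    name String,
    surname String,
    birthDate UInt32,
    phoneNumbers Array(String)
) ENGINE = MergeTree() ORDER BY name;
```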

Nested messages are supported. For example, for the field `z` in the following message type

```protobuf
message MessageType {
  message XType {
    message YType {
      int32 z = 1;
    };
    repeated YType y = 1;
  };
  XType x = 1;
};
```

ClickHouse tries to find a column named `x.y.z` (or `x_y_z`, `X.y_Z`, and so on).
Nested messages are suitable for input or output of [nested data structures](../data_types/nested_data_structures/nested.md).
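
For instance, the `z` values from the message type above could be exchanged with a column like this (a sketch; because `y` is `repeated`, the flattened column is an array):

```sql
CREATE TABLE test.nested
(
    `x.y.z` Array(Int32)
) ENGINE = MergeTree() ORDER BY tuple();
```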

Default values defined in a protobuf schema like this

```protobuf
syntax = "proto2";

message MessageType {
  optional int32 result_per_page = 3 [default = 10];
}
```

are not applied; the [table defaults](../query_language/create.md#create-default-values) are used instead.
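
A sketch of declaring the equivalent default on the table side (hypothetical table and column names):

```sql
CREATE TABLE test.pages
(
    search_query String,
    result_per_page Int32 DEFAULT 10 -- used when the field is absent from the message
) ENGINE = MergeTree() ORDER BY search_query;
```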

ClickHouse inputs and outputs protobuf messages in the `length-delimited` format.
This means that every message is preceded by its length written as a [varint](https://developers.google.com/protocol-buffers/docs/encoding#varints).
See also [how to read/write length-delimited protobuf messages in popular languages](https://cwiki.apache.org/confluence/display/GEODE/Delimiting+Protobuf+Messages).
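
For example, a round trip through a file keeps this framing intact (a sketch reusing `test.table` and `schemafile` from above):

```bash
$ clickhouse-client --query "SELECT * FROM test.table FORMAT Protobuf SETTINGS format_schema='schemafile:MessageType'" > messages.bin
$ cat messages.bin | clickhouse-client --query "INSERT INTO test.table FORMAT Protobuf SETTINGS format_schema='schemafile:MessageType'"
```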

## Avro {#data-format-avro}

[Apache Avro](http://avro.apache.org/) is a row-oriented data serialization framework developed within Apache's Hadoop project.

The ClickHouse `Avro` format supports reading and writing [Avro data files](http://avro.apache.org/docs/current/spec.html#Object+Container+Files).

### Data Types Matching

The table below shows supported data types and how they match ClickHouse [data types](../data_types/index.md) in `INSERT` and `SELECT` queries.

| Avro data type `INSERT` | ClickHouse data type | Avro data type `SELECT` |
| -------------------- | -------------------- | ------------------ |
| `boolean`, `int`, `long`, `float`, `double` | [Int(8\|16\|32)](../data_types/int_uint.md), [UInt(8\|16\|32)](../data_types/int_uint.md) | `int` |
| `boolean`, `int`, `long`, `float`, `double` | [Int64](../data_types/int_uint.md), [UInt64](../data_types/int_uint.md) | `long` |
| `boolean`, `int`, `long`, `float`, `double` | [Float32](../data_types/float.md) | `float` |
| `boolean`, `int`, `long`, `float`, `double` | [Float64](../data_types/float.md) | `double` |
| `bytes`, `string`, `fixed`, `enum` | [String](../data_types/string.md) | `bytes` |
| `bytes`, `string`, `fixed` | [FixedString(N)](../data_types/fixedstring.md) | `fixed(N)` |
| `enum` | [Enum(8\|16)](../data_types/enum.md) | `enum` |
| `array(T)` | [Array(T)](../data_types/array.md) | `array(T)` |
| `union(null, T)`, `union(T, null)` | [Nullable(T)](../data_types/nullable.md) | `union(null, T)`|
| `null` | [Nullable(Nothing)](../data_types/special_data_types/nothing.md) | `null` |
| `int (date)` * | [Date](../data_types/date.md) | `int (date)` * |
| `long (timestamp-millis)` * | [DateTime64(3)](../data_types/datetime.md) | `long (timestamp-millis)` * |
| `long (timestamp-micros)` * | [DateTime64(6)](../data_types/datetime.md) | `long (timestamp-micros)` * |

\* [Avro logical types](http://avro.apache.org/docs/current/spec.html#Logical+Types)

Unsupported Avro data types: `record` (non-root), `map`.

Unsupported Avro logical data types: `uuid`, `time-millis`, `time-micros`, `duration`.

### Inserting Data

To insert data from an Avro file into a ClickHouse table:

```bash
$ cat file.avro | clickhouse-client --query="INSERT INTO {some_table} FORMAT Avro"
```

The root schema of the input Avro file must be of type `record`.

To find the correspondence between table columns and the fields of the Avro schema, ClickHouse compares their names. The comparison is case-sensitive.
Unused fields are skipped.

The data types of ClickHouse table columns can differ from the corresponding fields of the inserted Avro data. When inserting data, ClickHouse interprets data types according to the table above and then [casts](../query_language/functions/type_conversion_functions.md#type_conversion_function-cast) the data to the corresponding column type.

### Selecting Data

To select data from a ClickHouse table into an Avro file:

```bash
$ clickhouse-client --query="SELECT * FROM {some_table} FORMAT Avro" > file.avro
```

Column names must:

- start with `[A-Za-z_]`
- subsequently contain only `[A-Za-z0-9_]`

Output Avro file compression and sync interval can be configured with [output_format_avro_codec](../operations/settings/settings.md#settings-output_format_avro_codec) and [output_format_avro_sync_interval](../operations/settings/settings.md#settings-output_format_avro_sync_interval) respectively.
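
For example, these settings can be passed directly on the `clickhouse-client` command line (a sketch; the `snappy` codec and the interval value are illustrative):

```bash
$ clickhouse-client \
    --output_format_avro_codec snappy \
    --output_format_avro_sync_interval 32768 \
    --query="SELECT * FROM {some_table} FORMAT Avro" > file.avro
```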

## AvroConfluent {#data-format-avro-confluent}

AvroConfluent supports decoding single-object Avro messages commonly used with [Kafka](https://kafka.apache.org/) and [Confluent Schema Registry](https://docs.confluent.io/current/schema-registry/index.html).

Each Avro message embeds a schema id that can be resolved to the actual schema with the help of the Schema Registry.

Schemas are cached once resolved.

The Schema Registry URL is configured with [format_avro_schema_registry_url](../operations/settings/settings.md#settings-format_avro_schema_registry_url).

### Data Types Matching

Same as [Avro](#data-format-avro).

### Usage

To quickly verify schema resolution you can use [kafkacat](https://github.com/edenhill/kafkacat) with [clickhouse-local](../operations/utils/clickhouse-local.md):

```bash
$ kafkacat -b kafka-broker -C -t topic1 -o beginning -f '%s' -c 3 | clickhouse-local --input-format AvroConfluent --format_avro_schema_registry_url 'http://schema-registry' -S "field1 Int64, field2 String" -q 'select * from table'
1 a
2 b
3 c
```

To use `AvroConfluent` with [Kafka](../operations/table_engines/kafka.md):
```sql
CREATE TABLE topic1_stream
(
    field1 String,
    field2 String
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka-broker',
    kafka_topic_list = 'topic1',
    kafka_group_name = 'group1',
    kafka_format = 'AvroConfluent';

SET format_avro_schema_registry_url = 'http://schema-registry';

SELECT * FROM topic1_stream;
```
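
In practice, the stream is usually persisted into an ordinary table through a materialized view (a sketch; the target table and view names are illustrative):

```sql
CREATE TABLE topic1_data
(
    field1 String,
    field2 String
) ENGINE = MergeTree() ORDER BY field1;

CREATE MATERIALIZED VIEW topic1_consumer TO topic1_data
    AS SELECT field1, field2 FROM topic1_stream;
```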

!!! note "Warning"
    Setting `format_avro_schema_registry_url` needs to be configured in `users.xml` to maintain its value after a restart.


## Parquet {#data-format-parquet}

[Apache Parquet](http://parquet.apache.org/) is a columnar storage format widespread in the Hadoop ecosystem. ClickHouse supports read and write operations for this format.

### Data Types Matching

The table below shows supported data types and how they match ClickHouse [data types](../data_types/index.md) in `INSERT` and `SELECT` queries.

| Parquet data type (`INSERT`) | ClickHouse data type | Parquet data type (`SELECT`) |
| -------------------- | ------------------ | ---- |
| `UINT8`, `BOOL` | [UInt8](../data_types/int_uint.md) | `UINT8` |
| `INT8` | [Int8](../data_types/int_uint.md) | `INT8` |
| `UINT16` | [UInt16](../data_types/int_uint.md) | `UINT16` |
| `INT16` | [Int16](../data_types/int_uint.md) | `INT16` |
| `UINT32` | [UInt32](../data_types/int_uint.md) | `UINT32` |
| `INT32` | [Int32](../data_types/int_uint.md) | `INT32` |
| `UINT64` | [UInt64](../data_types/int_uint.md) | `UINT64` |
| `INT64` | [Int64](../data_types/int_uint.md) | `INT64` |
| `FLOAT`, `HALF_FLOAT` | [Float32](../data_types/float.md) | `FLOAT` |
| `DOUBLE` | [Float64](../data_types/float.md) | `DOUBLE` |
| `DATE32` | [Date](../data_types/date.md) | `UINT16` |
| `DATE64`, `TIMESTAMP` | [DateTime](../data_types/datetime.md) | `UINT32` |
| `STRING`, `BINARY` | [String](../data_types/string.md) | `STRING` |
| — | [FixedString](../data_types/fixedstring.md) | `STRING` |
| `DECIMAL` | [Decimal](../data_types/decimal.md) | `DECIMAL` |

ClickHouse supports configurable precision of the `Decimal` type. The `INSERT` query treats the Parquet `DECIMAL` type as the ClickHouse `Decimal128` type.

Unsupported Parquet data types: `TIME32`, `FIXED_SIZE_BINARY`, `JSON`, `UUID`, `ENUM`.

The data types of ClickHouse table columns can differ from the corresponding fields of the inserted Parquet data. When inserting data, ClickHouse interprets data types according to the table above and then [casts](../query_language/functions/type_conversion_functions.md#type_conversion_function-cast) the data to the data type set for the ClickHouse table column.

### Inserting and Selecting Data

You can insert Parquet data from a file into a ClickHouse table with the following command:

```bash
$ cat {filename} | clickhouse-client --query="INSERT INTO {some_table} FORMAT Parquet"
```

You can select data from a ClickHouse table and save it to a file in the Parquet format with the following command:

```bash
$ clickhouse-client --query="SELECT * FROM {some_table} FORMAT Parquet" > {some_file.pq}
```

To exchange data with Hadoop, you can use the [HDFS table engine](../operations/table_engines/hdfs.md).
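
For example, a Parquet file in HDFS can be exposed as a table (a sketch; the URI, path, and columns are illustrative):

```sql
CREATE TABLE hdfs_parquet
(
    name String,
    value UInt32
) ENGINE = HDFS('hdfs://hdfs1:9000/some_dir/data.parquet', 'Parquet');
```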

## ORC {#data-format-orc}

[Apache ORC](https://orc.apache.org/) is a columnar storage format widespread in the Hadoop ecosystem. You can only insert data in this format into ClickHouse.

### Data Types Matching

The table below shows supported data types and how they match ClickHouse [data types](../data_types/index.md) in `INSERT` queries.

| ORC data type (`INSERT`) | ClickHouse data type |
| -------------------- | ------------------ |
| `UINT8`, `BOOL` | [UInt8](../data_types/int_uint.md) |
| `INT8` | [Int8](../data_types/int_uint.md) |
| `UINT16` | [UInt16](../data_types/int_uint.md) |
| `INT16` | [Int16](../data_types/int_uint.md) |
| `UINT32` | [UInt32](../data_types/int_uint.md) |
| `INT32` | [Int32](../data_types/int_uint.md) |
| `UINT64` | [UInt64](../data_types/int_uint.md) |
| `INT64` | [Int64](../data_types/int_uint.md) |
| `FLOAT`, `HALF_FLOAT` | [Float32](../data_types/float.md) |
| `DOUBLE` | [Float64](../data_types/float.md) |
| `DATE32` | [Date](../data_types/date.md) |
| `DATE64`, `TIMESTAMP` | [DateTime](../data_types/datetime.md) |
| `STRING`, `BINARY` | [String](../data_types/string.md) |
| `DECIMAL` | [Decimal](../data_types/decimal.md) |

ClickHouse supports configurable precision of the `Decimal` type. The `INSERT` query treats the ORC `DECIMAL` type as the ClickHouse `Decimal128` type.

Unsupported ORC data types: `TIME32`, `FIXED_SIZE_BINARY`, `JSON`, `UUID`, `ENUM`.

The data types of ClickHouse table columns don't have to match the corresponding ORC data fields. When inserting data, ClickHouse interprets data types according to the table above and then [casts](../query_language/functions/type_conversion_functions.md#type_conversion_function-cast) the data to the data type set for the ClickHouse table column.

### Inserting Data

You can insert ORC data from a file into a ClickHouse table with the following command:

```bash
$ cat filename.orc | clickhouse-client --query="INSERT INTO some_table FORMAT ORC"
```

To exchange data with Hadoop, you can use the [HDFS table engine](../operations/table_engines/hdfs.md).


[Original article](https://clickhouse.tech/docs/fa/interfaces/formats/) <!--hide-->
7 changes: 7 additions & 0 deletions docs/toc_fa.yml
@@ -26,6 +26,7 @@ nav:
- ' کلاینت Command-line': 'interfaces/cli.md'
- 'Native interface (TCP)': 'interfaces/tcp.md'
- 'HTTP interface': 'interfaces/http.md'
- 'MySQL Interface': 'interfaces/mysql.md'
- ' فرمت های Input و Output': 'interfaces/formats.md'
- ' درایور JDBC': 'interfaces/jdbc.md'
- ' درایور ODBC': 'interfaces/odbc.md'
@@ -183,6 +184,10 @@ nav:
- 'Operators': 'query_language/operators.md'
- 'General Syntax': 'query_language/syntax.md'

- 'Guides':
- 'Overview': 'guides/index.md'
- 'Applying CatBoost Models': 'guides/apply_catboost_model.md'

- 'Operations':
- 'Introduction': 'operations/index.md'
- 'Requirements': 'operations/requirements.md'
@@ -225,6 +230,8 @@ nav:
- 'Browse ClickHouse Source Code': 'development/browse_code.md'
- 'How to Build ClickHouse on Linux': 'development/build.md'
- 'How to Build ClickHouse on Mac OS X': 'development/build_osx.md'
- 'How to Build ClickHouse on Linux for Mac OS X': 'development/build_cross_osx.md'
- 'How to Build ClickHouse on Linux for AARCH64 (ARM64)': 'development/build_cross_arm.md'
- 'How to Write C++ code': 'development/style.md'
- 'How to Run ClickHouse Tests': 'development/tests.md'
- 'Third-Party Libraries Used': 'development/contrib.md'
1 change: 1 addition & 0 deletions docs/toc_ja.yml
@@ -26,6 +26,7 @@ nav:
- 'Command-Line Client': 'interfaces/cli.md'
- 'Native Interface (TCP)': 'interfaces/tcp.md'
- 'HTTP Interface': 'interfaces/http.md'
- 'MySQL Interface': 'interfaces/mysql.md'
- 'Input and Output Formats': 'interfaces/formats.md'
- 'JDBC Driver': 'interfaces/jdbc.md'
- 'ODBC Driver': 'interfaces/odbc.md'
3 changes: 3 additions & 0 deletions docs/toc_ru.yml
@@ -41,6 +41,7 @@ nav:
- 'Движки баз данных':
- 'Введение': 'database_engines/index.md'
- 'MySQL': 'database_engines/mysql.md'
- 'Lazy': 'database_engines/lazy.md'

- 'Движки таблиц':
- 'Введение': 'operations/table_engines/index.md'
@@ -218,6 +219,7 @@ nav:
- 'Введение': 'operations/utils/index.md'
- 'clickhouse-copier': 'operations/utils/clickhouse-copier.md'
- 'clickhouse-local': 'operations/utils/clickhouse-local.md'
- 'clickhouse-benchmark': 'operations/utils/clickhouse-benchmark.md'

- 'Разработка':
- 'hidden': 'development/index.md'
@@ -227,6 +229,7 @@ nav:
- 'Как собрать ClickHouse на Linux': 'development/build.md'
- 'Как собрать ClickHouse на Mac OS X': 'development/build_osx.md'
- 'Как собрать ClickHouse на Linux для Mac OS X': 'development/build_cross_osx.md'
- 'Как собрать ClickHouse на Linux для AARCH64 (ARM64)': 'development/build_cross_arm.md'
- 'Как писать код на C++': 'development/style.md'
- 'Как запустить тесты': 'development/tests.md'
- 'Сторонние библиотеки': 'development/contrib.md'
7 changes: 7 additions & 0 deletions docs/toc_zh.yml
@@ -26,6 +26,7 @@ nav:
- '命令行客户端接口': 'interfaces/cli.md'
- '原生客户端接口 (TCP)': 'interfaces/tcp.md'
- 'HTTP 客户端接口': 'interfaces/http.md'
- 'MySQL 客户端接口': 'interfaces/mysql.md'
- '输入输出格式': 'interfaces/formats.md'
- 'JDBC 驱动': 'interfaces/jdbc.md'
- 'ODBC 驱动': 'interfaces/odbc.md'
@@ -69,6 +70,7 @@ nav:
- '数据库引擎':
- '介绍': 'database_engines/index.md'
- 'MySQL': 'database_engines/mysql.md'
- 'Lazy': 'database_engines/lazy.md'

- '表引擎':
- '介绍': 'operations/table_engines/index.md'
@@ -182,6 +184,10 @@ nav:
- '操作符': 'query_language/operators.md'
- '语法说明': 'query_language/syntax.md'

- 'Guides':
- 'Overview': 'guides/index.md'
- 'Applying CatBoost Models': 'guides/apply_catboost_model.md'

- '运维':
- '介绍': 'operations/index.md'
- '环境要求': 'operations/requirements.md'
@@ -225,6 +231,7 @@ nav:
- '如何在Linux中编译ClickHouse': 'development/build.md'
- '如何在Mac OS X中编译ClickHouse': 'development/build_osx.md'
- '如何在Linux中编译Mac OS X ClickHouse': 'development/build_cross_osx.md'
- '如何在Linux中编译AARCH64 (ARM64) ClickHouse': 'development/build_cross_arm.md'
- '如何编写C++代码': 'development/style.md'
- '如何运行ClickHouse测试': 'development/tests.md'
- '使用的第三方库': 'development/contrib.md'
4 changes: 2 additions & 2 deletions docs/zh/interfaces/formats.md
@@ -29,7 +29,7 @@ ClickHouse 可以接受多种数据格式,可以在 (`INSERT`) 以及 (`SELECT
| [PrettySpace](#prettyspace) | ✗ | ✔ |
| [Protobuf](#protobuf) | ✔ | ✔ |
| [Avro](#data-format-avro) | ✔ | ✔ |
| AvroConfluent | ✔ | ✗ |
| [AvroConfluent](#data-format-avro-confluent) | ✔ | ✗ |
| [Parquet](#data-format-parquet) | ✔ | ✔ |
| [ORC](#data-format-orc) | ✔ | ✗ |
| [RowBinary](#rowbinary) | ✔ | ✔ |
@@ -914,7 +914,7 @@ Column names must:

Output Avro file compression and sync interval can be configured with [output_format_avro_codec](../operations/settings/settings.md#settings-output_format_avro_codec) and [output_format_avro_sync_interval](../operations/settings/settings.md#settings-output_format_avro_sync_interval) respectively.

## AvroConfluent
## AvroConfluent {#data-format-avro-confluent}

AvroConfluent supports decoding single-object Avro messages commonly used with [Kafka](https://kafka.apache.org/) and [Confluent Schema Registry](https://docs.confluent.io/current/schema-registry/index.html).
