Commit 3f98cd8

[Improve][Connector-V2] Support read archive compress file (#7633)
1 parent bc0326c commit 3f98cd8

50 files changed: +2579 -44 lines changed


docs/en/connector-v2/source/CosFile.md

Lines changed: 12 additions & 0 deletions
@@ -66,6 +66,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| archive_compress_codec | string | no | none |
 | encoding | string | no | UTF-8 |
 | common-options | | no | - |

@@ -284,6 +285,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.

docs/en/connector-v2/source/FtpFile.md

Lines changed: 12 additions & 0 deletions
@@ -60,6 +60,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| archive_compress_codec | string | no | none |
 | encoding | string | no | UTF-8 |
 | common-options | | no | - |

@@ -265,6 +266,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.

docs/en/connector-v2/source/HdfsFile.md

Lines changed: 13 additions & 1 deletion
@@ -63,7 +63,8 @@ Read data from hdfs file system.
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
 | compress_codec | string | no | none | The compress codec of files |
-| encoding | string | no | UTF-8 |
+| archive_compress_codec | string | no | none |
+| encoding | string | no | UTF-8 | |
 | common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

 ### delimiter/field_delimiter [string]
@@ -80,6 +81,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.

docs/en/connector-v2/source/LocalFile.md

Lines changed: 12 additions & 0 deletions
@@ -60,6 +60,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| archive_compress_codec | string | no | none |
 | encoding | string | no | UTF-8 |
 | common-options | | no | - |
 | tables_configs | list | no | used to define a multiple table task |
@@ -263,6 +264,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.

docs/en/connector-v2/source/OssJindoFile.md

Lines changed: 12 additions & 0 deletions
@@ -70,6 +70,7 @@ It only supports hadoop version **2.9.X+**.
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| archive_compress_codec | string | no | none |
 | encoding | string | no | UTF-8 |
 | common-options | | no | - |

@@ -276,6 +277,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.

docs/en/connector-v2/source/S3File.md

Lines changed: 26 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -20,14 +20,14 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
2020
- [x] [parallelism](../../concept/connector-v2-features.md)
2121
- [ ] [support user-defined split](../../concept/connector-v2-features.md)
2222
- [x] file format type
23-
- [x] text
24-
- [x] csv
25-
- [x] parquet
26-
- [x] orc
27-
- [x] json
28-
- [x] excel
29-
- [x] xml
30-
- [x] binary
23+
- [x] text
24+
- [x] csv
25+
- [x] parquet
26+
- [x] orc
27+
- [x] json
28+
- [x] excel
29+
- [x] xml
30+
- [x] binary
3131

3232
## Description
3333

@@ -196,7 +196,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
196196

197197
## Options
198198

199-
| name | type | required | default value | Description |
199+
| name | type | required | default value | Description |
200200
|---------------------------------|---------|----------|-------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
201201
| path | string | yes | - | The s3 path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
202202
| file_format_type | string | yes | - | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
@@ -217,8 +217,9 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
217217
| sheet_name | string | no | - | Reader the sheet of the workbook,Only used when file_format is excel. |
218218
| xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only valid for XML files. |
219219
| xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only valid for XML files. |
220-
| compress_codec | string | no | none |
221-
| encoding | string | no | UTF-8 |
220+
| compress_codec | string | no | none | |
221+
| archive_compress_codec | string | no | none | |
222+
| encoding | string | no | UTF-8 | |
222223
| common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |
223224

224225
### delimiter/field_delimiter [string]
@@ -235,6 +236,17 @@ The compress codec of files and the details that supported as the following show
235236
- orc/parquet:
236237
automatically recognizes the compression type, no additional settings required.
237238

239+
### archive_compress_codec [string]
240+
241+
The compress codec of archive files and the details that supported as the following shown:
242+
243+
| archive_compress_codec | file_format | archive_compress_suffix |
244+
|------------------------|--------------------|-------------------------|
245+
| ZIP | txt,json,excel,xml | .zip |
246+
| TAR | txt,json,excel,xml | .tar |
247+
| TAR_GZ | txt,json,excel,xml | .tar.gz |
248+
| NONE | all | .* |
249+
238250
### encoding [string]
239251

240252
Only used when file_format_type is json,text,csv,xml.
@@ -346,8 +358,8 @@ sink {
346358
### Next version
347359

348360
- [Feature] Support S3A protocol ([3632](https://github.com/apache/seatunnel/pull/3632))
349-
- Allow user to add additional hadoop-s3 parameters
350-
- Allow the use of the s3a protocol
351-
- Decouple hadoop-aws dependencies
361+
- Allow user to add additional hadoop-s3 parameters
362+
- Allow the use of the s3a protocol
363+
- Decouple hadoop-aws dependencies
352364
- [Feature]Set S3 AK to optional ([3688](https://github.com/apache/seatunnel/pull/))
353365

docs/en/connector-v2/source/SftpFile.md

Lines changed: 12 additions & 0 deletions
@@ -92,6 +92,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
 | schema | Config | No | - | Please check #schema below |
 | compress_codec | String | No | None | The compress codec of files and the details that supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any compression format |
+| archive_compress_codec | string | no | none |
 | encoding | string | no | UTF-8 |
 | common-options | | No | - | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. |

@@ -176,6 +177,17 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### archive_compress_codec [string]
+
+The compress codec of archive files and the details that supported as the following shown:
+
+| archive_compress_codec | file_format | archive_compress_suffix |
+|------------------------|--------------------|-------------------------|
+| ZIP | txt,json,excel,xml | .zip |
+| TAR | txt,json,excel,xml | .tar |
+| TAR_GZ | txt,json,excel,xml | .tar.gz |
+| NONE | all | .* |
+
 ### encoding [string]

 Only used when file_format_type is json,text,csv,xml.
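
Usage note (not part of the commit): the `archive_compress_codec` option documented in these hunks could be set in a source block roughly as sketched here. This is a minimal sketch under stated assumptions: it uses the LocalFile connector, a hypothetical directory `/data/archives`, and text files packed into `.zip` archives. The codec value is spelled as in the documented table; whether other spellings (for example lower case, as in the `none` default) are accepted is not shown in this diff.

```hocon
source {
  LocalFile {
    # Hypothetical directory containing .zip archives of text files
    path = "/data/archives"
    # Format of the files inside each archive (the table lists txt, json, excel, xml)
    file_format_type = "text"
    # One of ZIP, TAR, TAR_GZ, NONE, per the table added in this commit
    archive_compress_codec = "ZIP"
  }
}
```

Per the documented table, `TAR` and `TAR_GZ` would be the values to try for `.tar` and `.tar.gz` archives, while the default `none` leaves files to be read as they are.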
