Merge branch 'dev' into jdbc-gbase8a
Hisoka-X committed Oct 18, 2022
2 parents b14234c + 44ee9a8 commit d0430dd
Showing 120 changed files with 1,979 additions and 733 deletions.
52 changes: 52 additions & 0 deletions docs/en/command/usage.mdx
@@ -11,6 +8 @@ import TabItem from '@theme/TabItem';
values={[
{label: 'Spark', value: 'spark'},
{label: 'Flink', value: 'flink'},
{label: 'Spark V2', value: 'spark V2'},
{label: 'Flink V2', value: 'flink V2'},
]}>
<TabItem value="spark">

@@ -25,6 +27 @@ bin/start-seatunnel-spark.sh
bin/start-seatunnel-flink.sh
```

</TabItem>
<TabItem value="spark V2">

```bash
bin/start-seatunnel-spark-connector-v2.sh
```

</TabItem>
<TabItem value="flink V2">

```bash
bin/start-seatunnel-flink-connector-v2.sh
```

</TabItem>
</Tabs>

@@ -37,6 +53,8 @@ bin/start-seatunnel-flink.sh
values={[
{label: 'Spark', value: 'spark'},
{label: 'Flink', value: 'flink'},
{label: 'Spark V2', value: 'spark V2'},
{label: 'Flink V2', value: 'flink V2'},
]}>
<TabItem value="spark">

@@ -52,6 +70,24 @@ bin/start-seatunnel-spark.sh \

- Use `-e` or `--deploy-mode` to specify the deployment mode

</TabItem>
<TabItem value="spark V2">

```bash
bin/start-seatunnel-spark-connector-v2.sh \
-c config-path \
-m master \
-e deploy-mode \
-i city=beijing \
-n spark-test
```

- Use `-m` or `--master` to specify the cluster manager

- Use `-e` or `--deploy-mode` to specify the deployment mode

- Use `-n` or `--name` to specify the app name

</TabItem>
<TabItem value="flink">

@@ -65,6 +101,22 @@ bin/start-seatunnel-flink.sh \

- Use `-r` or `--run-mode` to specify the flink job run mode, you can use `run-application` or `run` (default value)

</TabItem>
<TabItem value="flink V2">

```bash
bin/start-seatunnel-flink-connector-v2.sh \
-c config-path \
-i key=value \
-r run-application \
-n flink-test \
[other params]
```

- Use `-r` or `--run-mode` to specify the flink job run mode, you can use `run-application` or `run` (default value)

- Use `-n` or `--name` to specify the app name

</TabItem>
</Tabs>

1 change: 1 addition & 0 deletions docs/en/connector-v2/sink/Jdbc.md
@@ -117,6 +117,7 @@ there are some reference value for params above.
| sqlserver | com.microsoft.sqlserver.jdbc.SQLServerDriver | jdbc:microsoft:sqlserver://localhost:1433 | com.microsoft.sqlserver.jdbc.SQLServerXADataSource | https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc |
| oracle | oracle.jdbc.OracleDriver | jdbc:oracle:thin:@localhost:1521/xepdb1 | oracle.jdbc.xa.OracleXADataSource | https://mvnrepository.com/artifact/com.oracle.database.jdbc/ojdbc8 |
| gbase8a | com.gbase.jdbc.Driver | jdbc:gbase://e2e_gbase8aDb:5258/test | / | https://www.gbase8.cn/wp-content/uploads/2020/10/gbase-connector-java-8.3.81.53-build55.5.7-bin_min_mix.jar |
| starrocks | com.mysql.cj.jdbc.Driver | jdbc:mysql://localhost:3306/test | / | https://mvnrepository.com/artifact/mysql/mysql-connector-java |
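
For instance, a minimal sink sketch built from the starrocks row above (the connection values and query are illustrative; `url`, `driver`, `user`, `password`, and `query` are the standard options of this sink):

```bash
sink {
  Jdbc {
    # illustrative connection values, taken from the starrocks row above
    url = "jdbc:mysql://localhost:3306/test"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "123456"
    # illustrative insert statement; adjust to your table
    query = "insert into test_table(name, age) values(?, ?)"
  }
}
```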

## Example

24 changes: 22 additions & 2 deletions docs/en/connector-v2/sink/Kafka.md
@@ -21,6 +21,7 @@ By default, we will use 2pc to guarantee the message is sent to kafka exactly on
| bootstrap.servers | string | yes | - |
| kafka.* | kafka producer config | no | - |
| semantic | string | no | NON |
| partition_key | string | no | - |
| partition | int | no | - |
| assign_partitions | list | no | - |
| transaction_prefix | string | no | - |
@@ -50,6 +51,23 @@ In AT_LEAST_ONCE, producer will wait for all outstanding messages in the Kafka b

NON does not provide any guarantees: messages may be lost in case of issues on the Kafka broker and messages may be duplicated.

### partition_key [string]

Configure which field is used as the key of the kafka message.

For example, if you want to use the value of a field from the upstream data as the key, set this option to that field's name.

Upstream data is the following:

| name | age | data |
| ---- | ---- | ------------- |
| Jack | 16 | data-example1 |
| Mary | 23 | data-example2 |

If `name` is set as the key, then the hash value of the `name` column will determine which partition the message is sent to.

If the field name does not exist in the upstream data, the configured parameter will be used as the key.
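
A minimal sink sketch keying messages by the upstream `name` field (the topic and broker address are illustrative):

```bash
sink {
  kafka {
    # illustrative topic and broker address
    topic = "test_topic"
    bootstrap.servers = "localhost:9092"
    # the hash of the upstream "name" field decides the target partition
    partition_key = "name"
  }
}
```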

### partition [int]

We can specify the partition, all messages will be sent to this partition.
@@ -93,7 +111,9 @@ sink {

### change log
#### next version

- Add kafka sink doc
- New feature : Kafka specified partition to send
- New feature : Determine the partition that kafka sends the message to based on the message content
- New feature : Configure which field is used as the key of the kafka message

10 changes: 10 additions & 0 deletions docs/en/connector-v2/sink/common-options.md
@@ -5,18 +5,27 @@
| name | type | required | default value |
| ----------------- | ------ | -------- | ------------- |
| source_table_name | string | no | - |
| parallelism | int | no | - |


### source_table_name [string]

When `source_table_name` is not specified, the current plugin processes the data set output by the previous plugin in the configuration file;

When `source_table_name` is specified, the current plugin processes the data set corresponding to this parameter.

### parallelism [int]

When `parallelism` is not specified, the `parallelism` in env is used by default.

When `parallelism` is specified, it will override the `parallelism` in env.

## Examples

```bash
source {
FakeSourceStream {
parallelism = 2
result_table_name = "fake"
field_name = "name,age"
}
@@ -37,6 +46,7 @@ transform {

sink {
console {
parallelism = 3
source_table_name = "fake_name"
}
}
30 changes: 20 additions & 10 deletions docs/en/connector-v2/source/FakeSource.md
@@ -18,15 +18,17 @@ just for some test cases such as type conversion or connector new feature testin

## Options

| name                | type   | required | default value |
|---------------------|--------|----------|---------------|
| schema              | config | yes      | -             |
| row.num             | int    | no       | 5             |
| split.num           | int    | no       | 1             |
| split.read-interval | long   | no       | 1             |
| map.size            | int    | no       | 5             |
| array.size          | int    | no       | 5             |
| bytes.length        | int    | no       | 5             |
| string.length       | int    | no       | 5             |
| common-options      |        | no       | -             |

### schema [config]

@@ -81,7 +83,15 @@ Source plugin common parameters, please refer to [Source Common Options](common-

### row.num

The total number of rows generated per degree of parallelism

### split.num

The number of splits generated by the enumerator for each degree of parallelism

### split.read-interval

The interval (in milliseconds) between two split reads in a reader
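
A minimal source sketch tying these three options together (the schema is illustrative):

```bash
source {
  FakeSource {
    row.num = 100              # 100 rows generated per degree of parallelism
    split.num = 4              # 4 splits per degree of parallelism
    split.read-interval = 500  # 500 ms between two split reads in a reader
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}
```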

### map.size

43 changes: 29 additions & 14 deletions docs/en/connector-v2/source/FtpFile.md
@@ -21,20 +21,21 @@ Read data from ftp file server.

## Options

| name                       | type    | required | default value       |
|----------------------------|---------|----------|---------------------|
| host                       | string  | yes      | -                   |
| port                       | int     | yes      | -                   |
| user                       | string  | yes      | -                   |
| password                   | string  | yes      | -                   |
| path                       | string  | yes      | -                   |
| type                       | string  | yes      | -                   |
| delimiter                  | string  | no       | \001                |
| parse_partition_from_path  | boolean | no       | true                |
| date_format                | string  | no       | yyyy-MM-dd          |
| datetime_format            | string  | no       | yyyy-MM-dd HH:mm:ss |
| time_format                | string  | no       | HH:mm:ss            |
| schema                     | config  | no       | -                   |
| common-options             |         | no       | -                   |

### host [string]

@@ -62,6 +63,20 @@ Field delimiter, used to tell connector how to slice and dice fields when readin

default `\001`, the same as hive's default delimiter

### parse_partition_from_path [boolean]

Controls whether to parse the partition keys and values from the file path.

For example, if you read a file from the path `ftp://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added:

| name          | age |
|---------------|-----|
| tyrantlucifer | 26  |

Tip: **Do not define partition fields in the schema option**
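
A minimal source sketch for the path above (host and credentials are illustrative):

```bash
source {
  FtpFile {
    # illustrative connection values
    host = "hadoop-cluster"
    port = 21
    user = "seatunnel"
    password = "pass"
    # name=.../age=... directories below this path are parsed into record fields
    path = "/tmp/seatunnel/parquet"
    type = "parquet"
    parse_partition_from_path = true
  }
}
```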

### date_format [string]

Date type format, used to tell connector how to convert string to date, supported as the following formats:
37 changes: 26 additions & 11 deletions docs/en/connector-v2/source/HdfsFile.md
@@ -26,17 +26,18 @@ Read all the data in a split in a pollNext call. What splits are read will be sa

## Options

| name                       | type    | required | default value       |
|----------------------------|---------|----------|---------------------|
| path                       | string  | yes      | -                   |
| type                       | string  | yes      | -                   |
| fs.defaultFS               | string  | yes      | -                   |
| delimiter                  | string  | no       | \001                |
| parse_partition_from_path  | boolean | no       | true                |
| date_format                | string  | no       | yyyy-MM-dd          |
| datetime_format            | string  | no       | yyyy-MM-dd HH:mm:ss |
| time_format                | string  | no       | HH:mm:ss            |
| schema                     | config  | no       | -                   |
| common-options             |         | no       | -                   |

### path [string]

@@ -48,6 +49,20 @@ Field delimiter, used to tell connector how to slice and dice fields when readin

default `\001`, the same as hive's default delimiter

### parse_partition_from_path [boolean]

Controls whether to parse the partition keys and values from the file path.

For example, if you read a file from the path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added:

| name          | age |
|---------------|-----|
| tyrantlucifer | 26  |

Tip: **Do not define partition fields in the schema option**
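
A minimal source sketch for the path above (the cluster address is illustrative):

```bash
source {
  HdfsFile {
    fs.defaultFS = "hdfs://hadoop-cluster"
    # name=.../age=... directories below this path are parsed into record fields
    path = "/tmp/seatunnel/parquet"
    type = "parquet"
    parse_partition_from_path = true
  }
}
```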

### date_format [string]

Date type format, used to tell connector how to convert string to date, supported as the following formats:
1 change: 1 addition & 0 deletions docs/en/connector-v2/source/Jdbc.md
@@ -99,6 +99,7 @@ there are some reference value for params above.
| sqlserver | com.microsoft.sqlserver.jdbc.SQLServerDriver | jdbc:microsoft:sqlserver://localhost:1433 | https://mvnrepository.com/artifact/com.microsoft.sqlserver/mssql-jdbc |
| oracle | oracle.jdbc.OracleDriver | jdbc:oracle:thin:@localhost:1521/xepdb1 | https://mvnrepository.com/artifact/com.oracle.database.jdbc/ojdbc8 |
| gbase8a | com.gbase.jdbc.Driver | jdbc:gbase://e2e_gbase8aDb:5258/test | https://www.gbase8.cn/wp-content/uploads/2020/10/gbase-connector-java-8.3.81.53-build55.5.7-bin_min_mix.jar |
| starrocks | com.mysql.cj.jdbc.Driver | jdbc:mysql://localhost:3306/test | https://mvnrepository.com/artifact/mysql/mysql-connector-java |

## Example

35 changes: 25 additions & 10 deletions docs/en/connector-v2/source/LocalFile.md
@@ -26,16 +26,17 @@ Read all the data in a split in a pollNext call. What splits are read will be sa

## Options

| name                       | type    | required | default value       |
|----------------------------|---------|----------|---------------------|
| path                       | string  | yes      | -                   |
| type                       | string  | yes      | -                   |
| delimiter                  | string  | no       | \001                |
| parse_partition_from_path  | boolean | no       | true                |
| date_format                | string  | no       | yyyy-MM-dd          |
| datetime_format            | string  | no       | yyyy-MM-dd HH:mm:ss |
| time_format                | string  | no       | HH:mm:ss            |
| schema                     | config  | no       | -                   |
| common-options             |         | no       | -                   |

### path [string]

@@ -47,6 +48,20 @@ Field delimiter, used to tell connector how to slice and dice fields when readin

default `\001`, the same as hive's default delimiter

### parse_partition_from_path [boolean]

Controls whether to parse the partition keys and values from the file path.

For example, if you read a file from the path `file://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added:

| name          | age |
|---------------|-----|
| tyrantlucifer | 26  |

Tip: **Do not define partition fields in the schema option**
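
A minimal source sketch for the path above (the directory is illustrative):

```bash
source {
  LocalFile {
    # name=.../age=... directories below this path are parsed into record fields
    path = "/tmp/seatunnel/parquet"
    type = "parquet"
    parse_partition_from_path = true
  }
}
```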

### date_format [string]

Date type format, used to tell connector how to convert string to date, supported as the following formats:
