Merge branch 'develop'
kupferk committed Nov 8, 2022
2 parents 19b58a0 + 3024e78 commit 6701340
Showing 206 changed files with 3,616 additions and 1,177 deletions.
9 changes: 9 additions & 0 deletions .gitlab-ci.yml
@@ -181,3 +181,12 @@ build-cdp7.1:
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-emr6.8:
stage: build
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PEMR-6.8 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-emr6.8"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days
19 changes: 18 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,21 @@
# Version 0.28.0
# Version 0.29.0 - 2022-11-08

* github-260: Remove hive-storage-api from several plugins and lib
* github-261: Add descriptions to all pom.xml
* github-262: Verification of "relation" targets should only check existence
* github-263: Add filter condition to data quality checks in documentation
* github-265: Make JDBC dialects pluggable
* github-264: Provide "jars" for all plugins
* github-267: Add new flowman-spark-dependencies module to simplify dependency management
* github-269: Implement new 'iterativeSql' mapping
* github-270: Upgrade Spark to 3.3.1
* github-271: Upgrade Delta to 2.1.1
* github-272: Create build profile for AWS EMR 6.8.0
* github-273: Refactor file abstraction
* github-274: Print Flowman configuration to console


# Version 0.28.0 - 2022-10-07

* Improve support for MariaDB / MySQL as data sinks
* github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
2 changes: 2 additions & 0 deletions build-release.sh
@@ -44,6 +44,8 @@ build_profile -Phadoop-3.3 -Pspark-3.2 -Dhadoop.version=3.3.1
build_profile -Phadoop-2.7 -Pspark-3.3
build_profile -Phadoop-3.3 -Pspark-3.3 -Dhadoop.version=3.3.2

build_profile -PEMR-6.8

export JAVA_HOME=/usr/lib/jvm/java-1.8.0
build_profile -PCDH-6.3
build_profile -PCDP-7.1
2 changes: 2 additions & 0 deletions docker/Dockerfile
@@ -21,6 +21,8 @@ RUN curl -sL --retry 3 "https://archive.apache.org/dist/spark/spark-${BUILD_SPAR
COPY bin/ /opt/docker/bin/
COPY libexec/ /opt/docker/libexec/

# Update OS
RUN apt-get update && apt-get upgrade --yes && apt-get clean

# Copy and install Repository
COPY $DIST_FILE /tmp/repo/flowman-dist.tar.gz
3 changes: 2 additions & 1 deletion docker/pom.xml
@@ -5,12 +5,13 @@
<modelVersion>4.0.0</modelVersion>
<artifactId>flowman-docker</artifactId>
<name>Flowman Docker image</name>
<description>Flowman Docker image</description>
<packaging>pom</packaging>

<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.28.0</version>
<version>0.29.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion docs/cookbook/advanced-jdbc.md
@@ -1,4 +1,4 @@
# Using advanced Features of JDBC Databases
# Advanced JDBC Database Features

Flowman already provides very robust support for dealing with relational databases, both as data sources and as data sinks.
But when writing into a relational database, you eventually might find yourself in a situation where Flowman does
2 changes: 1 addition & 1 deletion docs/cookbook/data-quality.md
@@ -46,7 +46,7 @@ generated with an independent command with [`flowexec`](../cli/flowexec/index.md
## Data Quality Metrics
In addition to the `validate` and `verify` targets, Flowman also offers a special [measure target](../spec/target/measure.md).
This target provides means to compute important measures on the data and publish the results as metrics. These
in turn can be [published to Prometheus](metrics.md) or other metric collectors.
in turn can be [published to Prometheus](execution-metrics.md) or other metric collectors.


### Example
@@ -1,4 +1,4 @@
# Pre-build Validations
# Data Validation before Build

In many cases, you'd like to perform some validation of input data before you start processing. For example, when
joining data, you often assume some uniqueness constraint on the join key in some tables or mappings. If that
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/cookbook/hadoop-dependencies.md
@@ -1,4 +1,4 @@
# Installing additional Hadoop Dependencies
# Hadoop Dependencies Installation

Starting with version 3.2, Spark has reduced the number of Hadoop libraries which are part of the downloadable Spark
distribution. Unfortunately, some of the libraries which have been removed are required by some Flowman plugins (for
2 changes: 1 addition & 1 deletion docs/cookbook/impala.md
@@ -1,4 +1,4 @@
# Updating Impala Metadata
# Impala Metadata

Impala is another "SQL on Hadoop" execution engine mainly developed and backed by Cloudera. Impala allows you to
access data stored in Hadoop and registered in the Hive metastore, just like Hive itself, but often at a significantly
2 changes: 1 addition & 1 deletion docs/cookbook/kerberos.md
@@ -1,4 +1,4 @@
# Using Kerberos Authentication
# Kerberos Authentication

Of course, you can also run Flowman in a Kerberos environment, as long as the components you use actually support
Kerberos. This includes Spark, Hadoop and Kafka.
2 changes: 1 addition & 1 deletion docs/cookbook/override-jars.md
@@ -1,4 +1,4 @@
# Force Spark to specific jar version
# Override jar versions

A common problem with Spark, and specifically with many Hadoop environments (like Cloudera), is a mismatch between
application jar versions and the jars provided by the runtime environment. Flowman is built with carefully set dependency
55 changes: 55 additions & 0 deletions docs/cookbook/performance.md
@@ -0,0 +1,55 @@
# Performance Tuning

Processing performance is always an important topic for data transformation, and this is also the case with Flowman. In order
to improve overall performance, there are different configuration options: some of them are well-known configuration
parameters for Apache Spark, while others are specific to Flowman.


## Spark Parameters

Since Flowman is based on Apache Spark, you can apply all the performance tuning strategies that apply to Spark.
You can specify almost all settings either in the [`default-namespace.yml` file](../setup/config.md) or in any other
project file in a `config` section. The most important settings are probably the following:

```yaml
config:
# Use 8 CPU cores per Spark executor
- spark.executor.cores=8
# Allocate 54 GB RAM per Spark executor
- spark.executor.memory=54g
# Only keep up to 200 jobs in the Spark web UI
- spark.ui.retainedJobs=200
# Use 400 partitions in shuffle operations
- spark.sql.shuffle.partitions=400
# Number of executors to allocate
- spark.executor.instances=2
# Memory overhead as safety margin
- spark.executor.memoryOverhead=1G
```
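
When sizing executors, keep in mind that the memory requested from the cluster manager per executor is the sum of
the heap (`spark.executor.memory`) and the overhead (`spark.executor.memoryOverhead`). A quick sanity check with the
values above (an illustrative calculation, not an additional Flowman setting):

```
# Memory requested per executor container:
#   spark.executor.memory (54g) + spark.executor.memoryOverhead (1G) = 55g
# Total executor memory for the application:
#   spark.executor.instances (2) x 55g = 110g, plus the driver
```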

Often it is a good idea to make these properties easily configurable via system environment variables as follows:
```yaml
config:
- spark.executor.cores=$System.getenv('SPARK_EXECUTOR_CORES', '8')
- spark.executor.memory=$System.getenv('SPARK_EXECUTOR_MEMORY', '54g')
- spark.ui.retainedJobs=$System.getenv('RETAINED_JOBS', 200)
- spark.sql.shuffle.partitions=$System.getenv('SPARK_PARTITIONS', 400)
```
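
With such a setup, executor resources can be adjusted per environment without touching any project files. A minimal
sketch of such an override (the project location and the job name `main` are illustrative assumptions):

```shell
# Override the executor sizing for a single run
export SPARK_EXECUTOR_CORES=16
export SPARK_EXECUTOR_MEMORY=24g
flowexec -f /opt/flowman/my-project job build main
```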


## Flowman Parameters

In addition to classical Spark tuning parameters, Flowman also offers some advanced functionality which may help to
cut down processing overhead by parallelizing target execution and mapping instantiation. This will not speed
up the processing itself, but it will help to hide some expensive Spark planning costs, such as querying
the Hive metastore or remote file systems, both of which are known to be slow.

```yaml
config:
# Enable building multiple targets in parallel
- flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
# Build up to 4 targets in parallel
- flowman.execution.executor.parallelism=4
# Instantiate up to 16 mappings in parallel
- flowman.execution.mapping.parallelism=16
```
36 changes: 18 additions & 18 deletions docs/cookbook/syntax-highlighting.md
@@ -4,24 +4,6 @@ In order to support the development of Flowman projects, Flowman provides the ca
which can be used by editors to perform syntax validation and auto-completion. You will find a set of pre-generated
files in the `yaml-schema` directory, which contain syntax information for all core entities and all plugins.

Since you might not use all plugins (or have your own plugins), Flowman also offers a small utility to generate
the YAML schema files yourself. Using the provided schema generator will ensure that the schema perfectly matches
your setup with the right plugins. The schema files can be created with the [`flowman-schema`](../cli/schema.md)
command, such that the schema files will include all entities from any plugin loaded via the
[`default-namespace`](../spec/namespace.md).


## Creating YAML schemas

```shell
flowman-schema -o my-schema-directory
```

This command will create multiple different YAML schema files:
* `module.json` - This is the YAML schema for all modules, i.e. defining relations, mappings, etc.
* `project.json` - This YAML schema file contains all entities of the [`project.yml`](../spec/project.md) file.
* `namespace.json` - This YAML schema file contains all entities of [namespace definitions](../spec/namespace.md).
* `documentation.json` - This YAML schema file contains all entities of [`documentation.yml`](../documenting/config.md).


## Supported Editors
@@ -64,3 +46,21 @@ In order to benefit from a really excellent autocompletion in Visual Studio Code
}
}
```

## Creating YAML schemas

Since you might not use all plugins (or have your own plugins), Flowman also offers a small utility to generate
the YAML schema files yourself. Using the provided schema generator will ensure that the schema perfectly matches
your setup with the right plugins. The schema files can be created with the [`flowman-schema`](../cli/schema.md)
command, such that the schema files will include all entities from any plugin loaded via the
[`default-namespace`](../spec/namespace.md).

```shell
flowman-schema -o my-schema-directory
```

This command will create multiple different YAML schema files:
* `module.json` - This is the YAML schema for all modules, i.e. defining relations, mappings, etc.
* `project.json` - This YAML schema file contains all entities of the [`project.yml`](../spec/project.md) file.
* `namespace.json` - This YAML schema file contains all entities of [namespace definitions](../spec/namespace.md).
* `documentation.json` - This YAML schema file contains all entities of [`documentation.yml`](../documenting/config.md).
3 changes: 2 additions & 1 deletion docs/cookbook/target-ordering.md
@@ -1,10 +1,11 @@
# Manual Target Execution Order
# Target Execution Ordering

When executing a [job](../spec/job/index.md), Flowman normally figures out the correct execution order of all
[targets](../spec/target/index.md) automatically. This is implemented by looking at each target's inputs
and outputs, such that Flowman ensures that all inputs of a target are built before the target itself is
executed.


## Cyclic Dependencies
But sometimes, this does not give you the desired result, or Flowman might even detect a cyclic dependency between
your targets. Although this might indicate an issue at the conceptual level, there are completely valid use cases
4 changes: 2 additions & 2 deletions docs/index.md
@@ -41,7 +41,7 @@ central place in your value chain for data preparations for the next steps.
* Powerful yet simple [command line tool for batch execution](cli/flowexec/index.md)
* Powerful [Command line tool for interactive data flow analysis](cli/flowshell/index.md)
* [History server](cli/history-server.md) that provides an overview of past jobs and targets including lineage
* [Metric system](cookbook/metrics.md) with the ability to publish these to servers like Prometheus
* [Metric system](cookbook/execution-metrics.md) with the ability to publish these to servers like Prometheus
* Extendable via Plugins


@@ -109,7 +109,7 @@ Flowman also provides optional plugins which extend functionality. You can find
spec/index
testing/index
documenting/index
workflow
workflow/index
setup/index
connectors/index
plugins/index
19 changes: 18 additions & 1 deletion docs/releases.md
@@ -14,7 +14,24 @@ The following gives an (incomplete) list of past releases of the last 12 months.
changes over time.


### Version 0.28.0
### Version 0.29.0 - 2022-11-08

* github-260: Remove hive-storage-api from several plugins and lib
* github-261: Add descriptions to all pom.xml
* github-262: Verification of "relation" targets should only check existence
* github-263: Add filter condition to data quality checks in documentation
* github-265: Make JDBC dialects pluggable
* github-264: Provide "jars" for all plugins
* github-267: Add new flowman-spark-dependencies module to simplify dependency management
* github-269: Create 'iterativeSql' mapping
* github-270: Upgrade Spark to 3.3.1
* github-271: Upgrade Delta to 2.1.1
* github-272: Create build profile for AWS EMR 6.8.0
* github-273: Refactor file abstraction
* github-274: Print Flowman configuration to console


### Version 0.28.0 - 2022-10-07

* Improve support for MariaDB / MySQL as data sinks
* github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
2 changes: 1 addition & 1 deletion docs/setup/config.md
@@ -51,7 +51,7 @@ the existence of targets and/or the history to decide if a rebuild is required.
- `flowman.execution.executor.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleExecutor`)* (since Flowman 0.16.0)
Configure the executor to use. The default `SimpleExecutor` will process all targets in the correct order
sequentially. The alternative implementation `com.dimajix.flowman.execution.ParallelExecutor` will run multiple
targets in parallel (if they are not depending on each other)
targets in parallel (if they do not depend on each other)

- `flowman.execution.executor.parallelism` *(type: int)* *(default: 4)* (since Flowman 0.16.0)
The number of targets to be executed in parallel, when the `ParallelExecutor` is used.
73 changes: 73 additions & 0 deletions docs/spec/mapping/iterative-sql.md
@@ -0,0 +1,73 @@
# Iterative SQL Mapping
The `iterativeSql` mapping allows you to iteratively execute a SQL transformation which contains Spark SQL code. The
iteration will stop once the data does not change any more.

## Example
The following example will detect trees within a company hierarchy table, which provides simple parent-child
relations. The objective of the query is to assign a separate ID to each company tree. The query will essentially
propagate the `tree_id` from each parent down to its direct children. This step is performed over and over again
until the `tree_id` of the root companies (those without a parent) has been propagated down to the leaf companies
(those without any children).
```yaml
mappings:
organization_hierarchy:
kind: iterativeSql
input: companies
sql: |
SELECT
COALESCE(parent.tree_id, c.tree_id) AS tree_id,
c.parent_company_number,
c.company_number
FROM companies c
LEFT JOIN __this__ parent
ON c.parent_company_number = parent.company_number
```
In the first step, the output of the input mapping `companies` is assigned to the identifier `__this__`. Then the
SQL query is executed for the first time, which provides the start value of the forthcoming iteration. In each
iteration, the result of the previous iteration is assigned to `__this__` and the query is executed again.
Then the result is compared to the result of the previous iteration. If the results are the same, a fixed point is
reached and the execution stops. Otherwise, the iteration will continue.
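
To make the fixed point concrete, consider a hypothetical run on three companies forming a single chain (illustrative
data, not part of the example project):

```
# companies (input), tree_id pre-initialized per company:
#   tree_id | parent_company_number | company_number
#   A       | NULL                  | 1
#   B       | 1                     | 2
#   C       | 2                     | 3
#
# Iteration 1 (__this__ = companies): 2 inherits A from 1, 3 inherits B from 2
#   A | NULL | 1
#   A | 1    | 2
#   B | 2    | 3
#
# Iteration 2 (__this__ = previous result): 3 now inherits A from the updated 2
#   A | NULL | 1
#   A | 1    | 2
#   A | 2    | 3
#
# Iteration 3: result identical to iteration 2 -> fixed point reached, execution stops
```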

## Fields
* `kind` **(mandatory)** *(type: string)*: `iterativeSql`

* `broadcast` **(optional)** *(type: boolean)* *(default: false)*:
Hint for broadcasting the result of this mapping for map-side joins.

* `cache` **(optional)** *(type: string)* *(default: NONE)*:
Cache mode for the results of this mapping. Supported values are
* `NONE` - Disables caching of the results of this mapping
* `DISK_ONLY` - Caches the results on disk
* `MEMORY_ONLY` - Caches the results in memory. If not enough memory is available, records will be uncached.
* `MEMORY_ONLY_SER` - Caches the results in memory in a serialized format. If not enough memory is available, records will be uncached.
* `MEMORY_AND_DISK` - Caches the results first in memory and then spills to disk.
* `MEMORY_AND_DISK_SER` - Caches the results first in memory in a serialized format and then spills to disk.

* `input` **(required)** *(type: string)*:
The input mapping which serves as the starting point of the iteration. This means that for the first execution,
the identifier `__this__` will simply refer to the output of this mapping. Within the next iterations, `__this__` will
refer to the result of the previous iteration.

* `sql` **(optional)** *(type: string)* *(default: empty)*:
The SQL statement to execute

* `file` **(optional)** *(type: string)* *(default: empty)*:
The name of a file containing the SQL to execute.

* `uri` **(optional)** *(type: string)* *(default: empty)*:
A URL pointing to a resource containing the SQL to execute.

* `maxIterations` **(optional)** *(type: int)* *(default: 99)*:
The maximum number of iterations. The mapping will fail if the number of actual iterations required to reach the fixed
point exceeds this number.


## Outputs
* `main` - the only output of the mapping


## Description
The `iterativeSql` mapping allows you to execute recursive SQL statements, which refer to themselves.

Flowman also supports [`recursiveSql` mappings](recursive-sql.md), which provide similar functionality, more along
the lines of classical recursive SQL statements.
13 changes: 12 additions & 1 deletion docs/spec/mapping/recursive-sql.md
@@ -20,6 +20,11 @@ mappings:
WHERE n < 6
"
```
Within the first step, `__this__` is assigned an empty table. Then the SQL query is executed for the first time,
which provides the start value of the forthcoming iteration. In each iteration, the result of the previous
iteration is assigned to `__this__` and the query is executed again. Then the result is compared to the result of
the previous iteration. If the results are the same, a fixed point is reached and the
execution stops. Otherwise, the iteration will continue.
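
Since the example above is only partially visible in this excerpt, here is a sketch of what a complete `recursiveSql`
mapping of this shape could look like (the mapping name and the exact query are assumptions for illustration):

```yaml
mappings:
  numbers:
    kind: recursiveSql
    sql: "
      SELECT 1 AS n

      UNION ALL

      SELECT n+1 AS n
      FROM __this__
      WHERE n < 6
      "
```

Starting from the empty `__this__`, the first branch of the `UNION` seeds the result with `n = 1`; each iteration then
adds `n + 1` until the `WHERE n < 6` condition stops producing new rows and the fixed point `{1, ..., 6}` is reached.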

## Fields
* `kind` **(mandatory)** *(type: string)*: `recursiveSql`
@@ -45,13 +50,19 @@ The name of a file containing the SQL to execute.
* `uri` **(optional)** *(type: string)* *(default: empty)*:
A URL pointing to a resource containing the SQL to execute.

* `maxIterations` **(optional)** *(type: int)* *(default: 99)*:
The maximum number of iterations. The mapping will fail if the number of actual iterations required to reach the fixed
point exceeds this number.


## Outputs
* `main` - the only output of the mapping


## Description
The `recursiveSql` mapping allows you to execute recursive SQL statements, which refer to themselves. The result of each
step is made available as a temporary table `__this__`. Currently the query has to be a `UNION` where the first part
step is made available as a temporary table `__this__`. Currently, the query has to be a `UNION` where the first part
may not contain a reference to `__this__`. The first part of the `UNION` will be used to determine the schema of the
result.

Flowman also supports [`iterativeSql` mappings](iterative-sql.md), which provide similar functionality.
2 changes: 1 addition & 1 deletion docs/spec/measure/index.md
@@ -1,7 +1,7 @@
# Measures

Flowman provides capabilities to assess data quality by taking *measures* from mappings and providing the results as
[metrics](../../cookbook/metrics.md). This enables developers to build data quality dashboards using well-known tools
[metrics](../../cookbook/execution-metrics.md). This enables developers to build data quality dashboards using well-known tools
like Prometheus and Grafana.
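
As a taste of what such a measure can look like, here is a minimal sketch of a `measure` target (the target name,
measure name and query are illustrative assumptions):

```yaml
targets:
  quality_metrics:
    kind: measure
    measures:
      record_stats:
        kind: sql
        query: "
          SELECT
            COUNT(*) AS record_count,
            SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amount_count
          FROM some_mapping"
```

The measured values are then exposed through the metric system and can be forwarded to the collectors mentioned above.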

