Merge branch 'develop'
kupferk committed Nov 8, 2022
2 parents 19b58a0 + 3024e78 commit 6701340
Showing 206 changed files with 3,616 additions and 1,177 deletions.
9 changes: 9 additions & 0 deletions .gitlab-ci.yml
@@ -181,3 +181,12 @@ build-cdp7.1:
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-emr6.8:
stage: build
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PEMR-6.8 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-emr6.8"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days
19 changes: 18 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,21 @@
# Version 0.28.0
# Version 0.29.0 - 2022-11-08

* github-260: Remove hive-storage-api from several plugins and lib
* github-261: Add descriptions to all pom.xml
* github-262: Verification of "relation" targets should only check existence
* github-263: Add filter condition to data quality checks in documentation
* github-265: Make JDBC dialects pluggable
* github-264: Provide "jars" for all plugins
* github-267: Add new flowman-spark-dependencies module to simplify dependency management
* github-269: Implement new 'iterativeSql' mapping
* github-270: Upgrade Spark to 3.3.1
* github-271: Upgrade Delta to 2.1.1
* github-272: Create build profile for AWS EMR 6.8.0
* github-273: Refactor file abstraction
* github-274: Print Flowman configuration to console


# Version 0.28.0 - 2022-10-07

* Improve support for MariaDB / MySQL as data sinks
* github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
2 changes: 2 additions & 0 deletions build-release.sh
@@ -44,6 +44,8 @@ build_profile -Phadoop-3.3 -Pspark-3.2 -Dhadoop.version=3.3.1
build_profile -Phadoop-2.7 -Pspark-3.3
build_profile -Phadoop-3.3 -Pspark-3.3 -Dhadoop.version=3.3.2

build_profile -PEMR-6.8

export JAVA_HOME=/usr/lib/jvm/java-1.8.0
build_profile -PCDH-6.3
build_profile -PCDP-7.1
2 changes: 2 additions & 0 deletions docker/Dockerfile
@@ -21,6 +21,8 @@ RUN curl -sL --retry 3 "https://archive.apache.org/dist/spark/spark-${BUILD_SPAR
COPY bin/ /opt/docker/bin/
COPY libexec/ /opt/docker/libexec/

# Update OS
RUN apt-get update && apt-get upgrade --yes && apt-get clean

# Copy and install Repository
COPY $DIST_FILE /tmp/repo/flowman-dist.tar.gz
3 changes: 2 additions & 1 deletion docker/pom.xml
@@ -5,12 +5,13 @@
<modelVersion>4.0.0</modelVersion>
<artifactId>flowman-docker</artifactId>
<name>Flowman Docker image</name>
<description>Flowman Docker image</description>
<packaging>pom</packaging>

<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.28.0</version>
<version>0.29.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion docs/cookbook/advanced-jdbc.md
@@ -1,4 +1,4 @@
# Using advanced Features of JDBC Databases
# Advanced JDBC Database Features

Flowman already provides very robust support for dealing with relational databases, both as data sources and as data sinks.
But when writing into a relational database, you eventually might find yourself in a situation where Flowman does
2 changes: 1 addition & 1 deletion docs/cookbook/data-quality.md
@@ -46,7 +46,7 @@ generated with an independent command with [`flowexec`](../cli/flowexec/index.md
## Data Quality Metrics
In addition to the `validate` and `verify` targets, Flowman also offers a special [measure target](../spec/target/measure.md).
This target provides means to compute important measures on the data and publish the results as metrics. These
in turn can be [published to Prometheus](metrics.md) or other metric collectors.
in turn can be [published to Prometheus](execution-metrics.md) or other metric collectors.


### Example
@@ -1,4 +1,4 @@
# Pre-build Validations
# Data Validation before Build

In many cases, you'd like to perform some validation of input data before you start processing. For example, when
joining data, you often assume some uniqueness constraint on the join key in some tables or mappings. If that
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/cookbook/hadoop-dependencies.md
@@ -1,4 +1,4 @@
# Installing additional Hadoop Dependencies
# Hadoop Dependencies Installation

Starting with version 3.2, Spark has reduced the number of Hadoop libraries which are part of the downloadable Spark
distribution. Unfortunately, some of the libraries which have been removed are required by some Flowman plugins (for
2 changes: 1 addition & 1 deletion docs/cookbook/impala.md
@@ -1,4 +1,4 @@
# Updating Impala Metadata
# Impala Metadata

Impala is another "SQL on Hadoop" execution engine mainly developed and backed by Cloudera. Impala allows you to
access data stored in Hadoop and registered in the Hive metastore, just like Hive itself, but often at a significantly
2 changes: 1 addition & 1 deletion docs/cookbook/kerberos.md
@@ -1,4 +1,4 @@
# Using Kerberos Authentication
# Kerberos Authentication

Of course, you can also run Flowman in a Kerberos environment, as long as the components you use actually support
Kerberos. This includes Spark, Hadoop and Kafka.
2 changes: 1 addition & 1 deletion docs/cookbook/override-jars.md
@@ -1,4 +1,4 @@
# Force Spark to specific jar version
# Override jar versions

A common problem with Spark, and specifically with many Hadoop environments (like Cloudera), is a mismatch between
application jar versions and the jars provided by the runtime environment. Flowman is built with carefully set dependency
55 changes: 55 additions & 0 deletions docs/cookbook/performance.md
@@ -0,0 +1,55 @@
# Performance Tuning

Processing performance is always an important topic for data transformation, and this is also the case with Flowman. In order
to improve overall performance, there are different configuration options: some of them are well-known configuration
parameters for Apache Spark, while others are specific to Flowman.


## Spark Parameters

Since Flowman is based on Apache Spark, you can apply all the performance tuning strategies that apply to Spark.
You can specify almost all settings either in the [`default-namespace.yml` file](../setup/config.md) or in any other
project file in a `config` section. The most important settings are probably the following:

```yaml
config:
# Use 8 CPU cores per Spark executor
- spark.executor.cores=8
# Allocate 54 GB RAM per Spark executor
- spark.executor.memory=54g
# Only keep up to 200 jobs in the Spark web UI
- spark.ui.retainedJobs=200
# Use 400 partitions in shuffle operations
- spark.sql.shuffle.partitions=400
# Number of executors to allocate
- spark.executor.instances=2
# Memory overhead as safety margin
- spark.executor.memoryOverhead=1G
```
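
When sizing executors, keep in mind that the memory requested from the cluster manager per executor is the sum of
the heap (`spark.executor.memory`) and the overhead (`spark.executor.memoryOverhead`). A quick sanity check with the
values above (an illustrative calculation, not an additional Flowman setting):

```
# Memory requested per executor container:
#   spark.executor.memory (54g) + spark.executor.memoryOverhead (1G) = 55g
# Total executor memory for the application:
#   spark.executor.instances (2) x 55g = 110g, plus the driver
```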

Often it is a good idea to make these properties easily configurable via system environment variables as follows:
```yaml
config:
- spark.executor.cores=$System.getenv('SPARK_EXECUTOR_CORES', '8')
- spark.executor.memory=$System.getenv('SPARK_EXECUTOR_MEMORY', '54g')
- spark.ui.retainedJobs=$System.getenv('RETAINED_JOBS', 200)
- spark.sql.shuffle.partitions=$System.getenv('SPARK_PARTITIONS', 400)
```
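
With such a setup, executor resources can be adjusted per environment without touching any project files. A minimal
sketch of such an override (the project location and the job name `main` are illustrative assumptions):

```shell
# Override the executor sizing for a single run
export SPARK_EXECUTOR_CORES=16
export SPARK_EXECUTOR_MEMORY=24g
flowexec -f /opt/flowman/my-project job build main
```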


## Flowman Parameters

In addition to classical Spark tuning parameters, Flowman also offers some advanced functionality which may help to
cut down processing overhead by parallelizing target execution and mapping instantiation. This will not speed
up the processing itself, but it will help to hide some expensive Spark planning costs, such as querying
the Hive metastore or remote file systems, both of which are known to be slow.

```yaml
config:
# Enable building multiple targets in parallel
- flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
# Build up to 4 targets in parallel
- flowman.execution.executor.parallelism=4
# Instantiate up to 16 mappings in parallel
- flowman.execution.mapping.parallelism=16
```
36 changes: 18 additions & 18 deletions docs/cookbook/syntax-highlighting.md
@@ -4,24 +4,6 @@ In order to support the development of Flowman projects, Flowman provides the ca
which can be used by editors to perform syntax validation and auto-completion. You will find a set of pre-generated
files in the `yaml-schema` directory, which contain syntax information for all core entities and all plugins.

Since you might not use all plugins (or have your own plugins), Flowman also offers a small utility to generate
the YAML schema files yourself. Using the provided schema generator will ensure that the schema perfectly matches
your setup with the right plugins. The schema files can be created with the [`flowman-schema`](../cli/schema.md)
command, such that the schema files will include all entities from any plugin loaded via the
[`default-namespace`](../spec/namespace.md).


## Creating YAML schemas

```shell
flowman-schema -o my-schema-directory
```

This command will create multiple different YAML schema files:
* `module.json` - This is the YAML schema for all modules, i.e. defining relations, mappings, etc.
* `project.json` - This YAML schema file contains all entities of the [`project.yml`](../spec/project.md) file.
* `namespace.json` - This YAML schema file contains all entities of [namespace definitions](../spec/namespace.md).
* `documentation.json` - This YAML schema file contains all entities of [`documentation.yml`](../documenting/config.md).


## Supported Editors
@@ -64,3 +46,21 @@ In order to benefit from a really excellent autocompletion in Visual Studio Code
}
}
```

## Creating YAML schemas

Since you might not use all plugins (or have your own plugins), Flowman also offers a small utility to generate
the YAML schema files yourself. Using the provided schema generator will ensure that the schema perfectly matches
your setup with the right plugins. The schema files can be created with the [`flowman-schema`](../cli/schema.md)
command, such that the schema files will include all entities from any plugin loaded via the
[`default-namespace`](../spec/namespace.md).

```shell
flowman-schema -o my-schema-directory
```

This command will create multiple different YAML schema files:
* `module.json` - This is the YAML schema for all modules, i.e. defining relations, mappings, etc.
* `project.json` - This YAML schema file contains all entities of the [`project.yml`](../spec/project.md) file.
* `namespace.json` - This YAML schema file contains all entities of [namespace definitions](../spec/namespace.md).
* `documentation.json` - This YAML schema file contains all entities of [`documentation.yml`](../documenting/config.md).
3 changes: 2 additions & 1 deletion docs/cookbook/target-ordering.md
@@ -1,10 +1,11 @@
# Manual Target Execution Order
# Target Execution Ordering

When executing a [job](../spec/job/index.md), Flowman normally figures out the correct execution order of all
[targets](../spec/target/index.md) automatically. This is implemented by looking at each target's inputs
and outputs, such that Flowman ensures that all inputs of a target are built before the target itself is
executed.


## Cyclic Dependencies
But sometimes, this does not give you the desired result, or Flowman might even detect a cyclic dependency between
your targets. Although this might indicate an issue at the conceptual level, there are completely valid use cases
4 changes: 2 additions & 2 deletions docs/index.md
@@ -41,7 +41,7 @@ central place in your value chain for data preparations for the next steps.
* Powerful yet simple [command line tool for batch execution](cli/flowexec/index.md)
* Powerful [Command line tool for interactive data flow analysis](cli/flowshell/index.md)
* [History server](cli/history-server.md) that provides an overview of past jobs and targets including lineage
* [Metric system](cookbook/metrics.md) with the ability to publish these to servers like Prometheus
* [Metric system](cookbook/execution-metrics.md) with the ability to publish these to servers like Prometheus
* Extendable via Plugins


@@ -109,7 +109,7 @@ Flowman also provides optional plugins which extend functionality. You can find
spec/index
testing/index
documenting/index
workflow
workflow/index
setup/index
connectors/index
plugins/index
19 changes: 18 additions & 1 deletion docs/releases.md
@@ -14,7 +14,24 @@ The following gives an (incomplete) list of past releases of the last 12 months.
changes over time.


### Version 0.28.0
### Version 0.29.0 - 2022-11-08

* github-260: Remove hive-storage-api from several plugins and lib
* github-261: Add descriptions to all pom.xml
* github-262: Verification of "relation" targets should only check existence
* github-263: Add filter condition to data quality checks in documentation
* github-265: Make JDBC dialects pluggable
* github-264: Provide "jars" for all plugins
* github-267: Add new flowman-spark-dependencies module to simplify dependency management
* github-269: Create 'iterativeSql' mapping
* github-270: Upgrade Spark to 3.3.1
* github-271: Upgrade Delta to 2.1.1
* github-272: Create build profile for AWS EMR 6.8.0
* github-273: Refactor file abstraction
* github-274: Print Flowman configuration to console


### Version 0.28.0 - 2022-10-07

* Improve support for MariaDB / MySQL as data sinks
* github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
2 changes: 1 addition & 1 deletion docs/setup/config.md
@@ -51,7 +51,7 @@ the existence of targets and/or the history to decide if a rebuild is required.
- `flowman.execution.executor.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleExecutor`)* (since Flowman 0.16.0)
Configure the executor to use. The default `SimpleExecutor` will process all targets in the correct order
sequentially. The alternative implementation `com.dimajix.flowman.execution.ParallelExecutor` will run multiple
targets in parallel (if they are not depending on each other)
targets in parallel (if they do not depend on each other)

- `flowman.execution.executor.parallelism` *(type: int)* *(default: 4)* (since Flowman 0.16.0)
The number of targets to be executed in parallel, when the `ParallelExecutor` is used.
73 changes: 73 additions & 0 deletions docs/spec/mapping/iterative-sql.md
@@ -0,0 +1,73 @@
# Iterative SQL Mapping
The `iterativeSql` mapping allows you to iteratively execute a SQL transformation which contains Spark SQL code. The
iteration will stop once the data does not change any more.

## Example
The following example will detect trees within a company hierarchy table, which provides simple parent-child
relations. The objective of the query is to assign a separate ID to each company tree. The query will essentially
propagate the `tree_id` from each parent down to its direct children. This step is performed over and over again
until the `tree_id` of the root companies (those without a parent) has been propagated down to the leaf companies
(those without any children).
```yaml
mappings:
organization_hierarchy:
kind: iterativeSql
input: companies
sql: |
SELECT
COALESCE(parent.tree_id, c.tree_id) AS tree_id,
c.parent_company_number,
c.company_number
FROM companies c
LEFT JOIN __this__ parent
ON c.parent_company_number = parent.company_number
```
In the first step, the output of the input mapping `companies` is assigned to the identifier `__this__`. Then the
SQL query is executed for the first time, which provides the start value of the forthcoming iteration. In each
iteration, the result of the previous iteration is assigned to `__this__` and the query is executed again.
Then the result is compared to the result of the previous iteration. If the results are the same, a fixed point is
reached and the execution stops. Otherwise, the iteration will continue.
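
To make the fixed point concrete, consider a hypothetical run on three companies forming a single chain (illustrative
data, not part of the example project):

```
# companies (input), tree_id pre-initialized per company:
#   tree_id | parent_company_number | company_number
#   A       | NULL                  | 1
#   B       | 1                     | 2
#   C       | 2                     | 3
#
# Iteration 1 (__this__ = companies): 2 inherits A from 1, 3 inherits B from 2
#   A | NULL | 1
#   A | 1    | 2
#   B | 2    | 3
#
# Iteration 2 (__this__ = previous result): 3 now inherits A from the updated 2
#   A | NULL | 1
#   A | 1    | 2
#   A | 2    | 3
#
# Iteration 3: result identical to iteration 2 -> fixed point reached, execution stops
```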

## Fields
* `kind` **(mandatory)** *(type: string)*: `iterativeSql`

* `broadcast` **(optional)** *(type: boolean)* *(default: false)*:
Hint for broadcasting the result of this mapping for map-side joins.

* `cache` **(optional)** *(type: string)* *(default: NONE)*:
Cache mode for the results of this mapping. Supported values are
* `NONE` - Disables caching of the results of this mapping
* `DISK_ONLY` - Caches the results on disk
* `MEMORY_ONLY` - Caches the results in memory. If not enough memory is available, records will be uncached.
* `MEMORY_ONLY_SER` - Caches the results in memory in a serialized format. If not enough memory is available, records will be uncached.
* `MEMORY_AND_DISK` - Caches the results first in memory and then spills to disk.
* `MEMORY_AND_DISK_SER` - Caches the results first in memory in a serialized format and then spills to disk.

* `input` **(required)** *(type: string)*:
The input mapping which serves as the starting point of the iteration. This means that for the first execution,
the identifier `__this__` will simply refer to the output of this mapping. Within the next iterations, `__this__` will
refer to the result of the previous iteration.

* `sql` **(optional)** *(type: string)* *(default: empty)*:
The SQL statement to execute

* `file` **(optional)** *(type: string)* *(default: empty)*:
The name of a file containing the SQL to execute.

* `uri` **(optional)** *(type: string)* *(default: empty)*:
A URL pointing to a resource containing the SQL to execute.

* `maxIterations` **(optional)** *(type: int)* *(default: 99)*:
The maximum number of iterations. The mapping will fail if the number of actual iterations required to reach the fixed
point exceeds this number.


## Outputs
* `main` - the only output of the mapping


## Description
The `iterativeSql` mapping allows you to execute recursive SQL statements, which refer to themselves.

Flowman also supports [`recursiveSql` mappings](recursive-sql.md), which provide similar functionality, more along
the lines of classical recursive SQL statements.
13 changes: 12 additions & 1 deletion docs/spec/mapping/recursive-sql.md
@@ -20,6 +20,11 @@ mappings:
WHERE n < 6
"
```
Within the first step, `__this__` is assigned an empty table. Then the SQL query is executed for the first time,
which provides the start value of the forthcoming iteration. In each iteration, the result of the previous
iteration is assigned to `__this__` and the query is executed again. Then the result is compared to the result of
the previous iteration. If the results are the same, a fixed point is reached and the
execution stops. Otherwise, the iteration will continue.
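
Since the example above is only partially visible in this excerpt, here is a sketch of what a complete `recursiveSql`
mapping of this shape could look like (the mapping name and the exact query are assumptions for illustration):

```yaml
mappings:
  numbers:
    kind: recursiveSql
    sql: "
      SELECT 1 AS n

      UNION ALL

      SELECT n+1 AS n
      FROM __this__
      WHERE n < 6
      "
```

Starting from the empty `__this__`, the first branch of the `UNION` seeds the result with `n = 1`; each iteration then
adds `n + 1` until the `WHERE n < 6` condition stops producing new rows and the fixed point `{1, ..., 6}` is reached.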

## Fields
* `kind` **(mandatory)** *(type: string)*: `recursiveSql`
@@ -45,13 +50,19 @@ The name of a file containing the SQL to execute.
* `uri` **(optional)** *(type: string)* *(default: empty)*:
A URL pointing to a resource containing the SQL to execute.

* `maxIterations` **(optional)** *(type: int)* *(default: 99)*:
The maximum number of iterations. The mapping will fail if the number of actual iterations required to reach the fixed
point exceeds this number.


## Outputs
* `main` - the only output of the mapping


## Description
The `recursiveSql` mapping allows you to execute recursive SQL statements, which refer to themselves. The result of each
step is made available as a temporary table `__this__`. Currently the query has to be a `UNION` where the first part
step is made available as a temporary table `__this__`. Currently, the query has to be a `UNION` where the first part
may not contain a reference to `__this__`. The first part of the `UNION` will be used to determine the schema of the
result.

Flowman also supports [`iterativeSql` mappings](iterative-sql.md), which provide similar functionality.
2 changes: 1 addition & 1 deletion docs/spec/measure/index.md
@@ -1,7 +1,7 @@
# Measures

Flowman provides capabilities to assess data quality by taking *measures* from mappings and providing the results as
[metrics](../../cookbook/metrics.md). This enables developers to build data quality dashboards using well-known tools
[metrics](../../cookbook/execution-metrics.md). This enables developers to build data quality dashboards using well-known tools
like Prometheus and Grafana.
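
As a taste of what such a measure can look like, here is a minimal sketch of a `measure` target (the target name,
measure name and query are illustrative assumptions):

```yaml
targets:
  quality_metrics:
    kind: measure
    measures:
      record_stats:
        kind: sql
        query: "
          SELECT
            COUNT(*) AS record_count,
            SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amount_count
          FROM some_mapping"
```

The measured values are then exposed through the metric system and can be forwarded to the collectors mentioned above.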

