Merge branch 'develop'
kupferk committed Jan 3, 2023
2 parents 6701340 + 2c08207 commit 9d7d051
Showing 426 changed files with 10,628 additions and 4,714 deletions.
17 changes: 12 additions & 5 deletions .gitlab-ci.yml
@@ -52,8 +52,6 @@ build-site:
build-default:
stage: build
script: 'mvn ${MAVEN_CLI_OPTS} clean package -Ddockerfile.skip'
except:
- pushes
artifacts:
name: "flowman-dist-default"
paths:
@@ -182,11 +180,20 @@ build-cdp7.1:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-emr6.8:
build-cdp7.1-spark3.2:
stage: build
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PEMR-6.8 -Ddockerfile.skip'
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDP-7.1-spark-3.2 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-emr6.8"
name: "flowman-dist-cdp7.1-spark-3.2"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days

build-emr6.9:
stage: build
script: 'mvn ${MAVEN_CLI_OPTS} clean package -PEMR-6.9 -Ddockerfile.skip'
artifacts:
name: "flowman-dist-emr6.9"
paths:
- flowman-dist/target/flowman-dist-*-bin.tar.gz
expire_in: 5 days
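The build jobs in this file all follow the same stanza shape, differing only in the job name and Maven profile. A hypothetical shell sketch (not part of the repository) that emits one such job for a given profile, matching the pattern above:

```shell
#!/bin/sh
# Emit a GitLab CI build job stanza following the pattern of the jobs above.
# Hypothetical helper for illustration only; job names and profile flags are
# taken from the .gitlab-ci.yml diff in this commit.
emit_job() {
    name="$1"     # job suffix, e.g. emr6.9
    profile="$2"  # Maven profile, e.g. EMR-6.9
    cat <<EOF
build-${name}:
  stage: build
  script: 'mvn \${MAVEN_CLI_OPTS} clean package -P${profile} -Ddockerfile.skip'
  artifacts:
    name: "flowman-dist-${name}"
    paths:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days
EOF
}

# Example: the new EMR 6.9 job from this commit
emit_job emr6.9 EMR-6.9
```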
6 changes: 6 additions & 0 deletions BUILDING.md
@@ -102,8 +102,10 @@ using the correct version. The following profiles are available:
* hadoop-3.1
* hadoop-3.2
* hadoop-3.3
* EMR-6.9
* CDH-6.3
* CDP-7.1
* CDP-7.1-spark-3.2

With these profiles it is easy to build Flowman to match your environment.

@@ -206,6 +208,10 @@ mvn clean install -PCDH-6.3 -DskipTests
mvn clean install -PCDP-7.1 -DskipTests
```

```shell
mvn clean install -PCDP-7.1-spark-3.2 -DskipTests
```
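With the growing list of profiles it is easy to mistype a flag. A small hypothetical helper (not part of the build) that maps a platform name to the Maven profile flag listed above:

```shell
#!/bin/sh
# Map a target platform name to the corresponding Maven profile flag.
# Hypothetical sketch; the profile names are taken from the BUILDING.md list above.
profile_flags() {
    case "$1" in
        emr-6.9)            echo "-PEMR-6.9" ;;
        cdh-6.3)            echo "-PCDH-6.3" ;;
        cdp-7.1)            echo "-PCDP-7.1" ;;
        cdp-7.1-spark-3.2)  echo "-PCDP-7.1-spark-3.2" ;;
        *)                  echo "" ;;
    esac
}

# Example: assemble the build command for the new EMR 6.9 profile
echo "mvn clean install $(profile_flags emr-6.9) -DskipTests"
```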


# Coverage Analysis

89 changes: 83 additions & 6 deletions CHANGELOG.md
@@ -1,3 +1,36 @@
# Version 0.30.0 - 2023-01-03

* github-278: Parallelize execution of data quality checks. This also introduces a new configuration property
`flowman.execution.check.parallelism` (default `1`)
* github-282: Improve implementation for counting records
* github-288: Support reading local CSV files from fatjar
* github-290: Simplify specifying project name in fatjar
* github-291: Simplify create/destroy Relation interface
* github-292: Upgrade AWS EMR to 6.9
* github-289: Color log output via log4j configuration (requires log4j 2.x)
* Bump postgresql from 42.4.1 to 42.4.3 in /flowman-plugins/postgresql
* Bump loader-utils from 1.4.0 to 1.4.2
* Bump json5 from 2.2.1 to 2.2.3
* github-293: [BUG] Fatal exceptions in parallel mapping instantiation cause deadlock
* github-273: Support projects contained in (fat) jar files
* github-294: [BUG] Parallel execution should not execute more targets after errors
* github-295: Create build profile for CDP 7.1 with Spark 3.2
* github-296: Update npm dependencies (vuetify & co)
* github-297: Parametrize when to execute a specific phase
* github-299: Move migrationPolicy and migrationStrategy from target into relation
* github-115: Implement additional build policy in relation target for forcing dirty. This also introduces a new
configuration property `flowman.default.target.buildPolicy` (default `COMPAT`).
* github-298: Support fine-grained control when to execute each target of a job
* github-300: Implement new 'observe' mapping
* github-301: Upgrade Spark to 3.2.3
* github-302: Upgrade DeltaLake to 2.2.0
* github-303: Use multi-stage build for Docker image
* github-304: Upgrade Cloudera profile to CDP 7.1.8
* github-312: Fix build with Spark 2.4 and Maven 3.8

This version is fully backwards compatible until and including version 0.27.0.
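Several 0.30.0 entries introduce new configuration properties. As a hedged sketch, these would be set like any other Flowman config value, e.g. in the namespace configuration (the exact file layout depends on the installation):

```yaml
# Sketch only: assumes the conventional default-namespace.yml layout with a
# `config` list of key=value entries. Property names and defaults are taken
# from the changelog entries for github-278 and github-115 above.
config:
  - flowman.execution.check.parallelism=4       # default is 1
  - flowman.default.target.buildPolicy=COMPAT   # default value
```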


# Version 0.29.0 - 2022-11-08

* github-260: Remove hive-storage-api from several plugins and lib
@@ -14,22 +47,26 @@
* github-273: Refactor file abstraction
* github-274: Print Flowman configuration to console

This version is fully backwards compatible until and including version 0.27.0.


# Version 0.28.0 - 2022-10-07

* Improve support for MariaDB / MySQL as data sinks
* github-245: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-studio-ui
* github-246: Bump ejs, @vue/cli-plugin-babel, @vue/cli-plugin-eslint and @vue/cli-service in /flowman-server-ui
* github-247: Automatically generate YAML schemas as part of build process
* github-248: Bump scss-tokenizer and node-sass in /flowman-server-u
* github-248: Bump scss-tokenizer and node-sass in /flowman-server-ui
* github-249: Add new options -X and -XX to increase logging
* github-251: Support for log4j2 Configuration
* github-252: Move sftp target into separate plugin
* github-252: Move `sftp` target into separate plugin
* github-253: SQL Server relation should support explicit staging table
* github-254: Use DATETIME2 for timestamps in MS SQL Server
* github-256: Provide Maven archetype for simple Flowman projects
* github-258: Support clustered indexes in MS SQL Server

This version is fully backwards compatible until and including version 0.27.0.


# Version 0.27.0 - 2022-09-09

@@ -38,15 +75,25 @@
* github-235: Implement new `rest` hook with fine control
* github-229: A build target should not fail if Impala "COMPUTE STATS" fails
* github-236: 'copy' target should not apply output schema
* github-237: jdbcQuery relation should use fields "sql" and "file" instead of "query"
* github-237: `jdbcQuery` relation should use fields "sql" and "file" instead of "query"
* github-239: Allow optional SQL statement for creating jdbcTable
* github-238: Implement new 'jdbcCommand' target
* github-238: Implement new `jdbcCommand` target
* github-240: [BUG] Data quality checks in documentation should not fail on NULL values
* github-241: Throw an error on duplicate entity definitions
* github-220: Upgrade Delta-Lake to 2.0 / 2.1
* github-242: Switch to Spark 3.3 as default
* github-243: Use alternative Spark MS SQL Connector for Spark 3.3
* github-244: Generate project HTML documentation with optional external CSS file
* github-228: Change default of config `flowman.default.relation.input.charVarcharPolicy` to `IGNORE`

This version breaks full backwards compatibility with older versions! See more details below.

## Breaking changes

In order to provide more flexibility and to increase consistency, the properties of `jdbcQuery` have changed. Before
this version, you needed to specify the SQL query in the `query` field. Now you have to specify the SQL query either
in the `sql` field or provide an external file and provide the file name in the `file` field. This new syntax is more
consistent with `jdbcView` and `hiveView` relations.
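A hedged sketch of the two new variants (the relation kind and field names come from the note above; the connection name, query text, and file path are placeholders):

```yaml
relations:
  # Inline SQL via the new `sql` field
  my_query:
    kind: jdbcQuery
    connection: my_jdbc_connection   # placeholder connection name
    sql: "SELECT id, amount FROM transactions"

  # Same relation kind, but reading the query from an external file
  my_query_from_file:
    kind: jdbcQuery
    connection: my_jdbc_connection
    file: "sql/my_query.sql"         # placeholder path
```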


# Version 0.26.1 - 2022-08-03
@@ -55,6 +102,8 @@
* github-227: [BUG] Flowman should not fail with field names containing "-", "/" etc
* github-228: Padding and truncation of CHAR(n)/VARCHAR(n) should be configurable

This version is fully backwards compatible until and including version 0.26.0.


# Version 0.26.0 - 2022-07-27

@@ -80,6 +129,8 @@
* github-205: Initial support Oracle DB via JDBC
* github-225: [BUG] Staging schema should not have comments

This version breaks full backwards compatibility with older versions! See more details below.

## Breaking changes

We take backward compatibility very seriously. But sometimes a breaking change is needed to clean up code and to
@@ -101,6 +152,8 @@ described in the following table:
* github-195: [BUG] Metric "target_records" is not reset correctly after an execution phase is finished
* github-197: [BUG] Impala REFRESH METADATA should not fail when dropping views

This version is fully backwards compatible until and including version 0.24.0.


# Version 0.25.0 - 2022-05-31

@@ -114,6 +167,8 @@ described in the following table:
* github-191: Add user provided description to quality checks
* github-192: Provide example queries for JDBC metric sink

This version is fully backwards compatible until and including version 0.24.0.


# Version 0.24.1 - 2022-04-28

@@ -122,6 +177,8 @@ described in the following table:
* github-177: Implement generic SQL schema check
* github-179: Update DeltaLake dependency to 1.2.1

This version is fully backwards compatible until and including version 0.24.0.


# Version 0.24.0 - 2022-04-05

@@ -132,6 +189,8 @@ described in the following table:
* github-153: Use non-privileged user in Docker image
* github-174: Provide application for generating YAML schema

This version breaks full backwards compatibility with older versions! See more details below.

## Breaking changes

We take backward compatibility very seriously. But sometimes a breaking change is needed to clean up code and to
@@ -166,6 +225,8 @@ table:
* github-162: ExpressionColumnCheck does not work when results contain NULL values
* github-163: Implement new column length quality check

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.23.0 - 2022-03-18

@@ -176,6 +237,8 @@ table:
* github-121: Correctly apply documentation, before/after and other common attributes to templates
* github-152: Implement new 'cast' mapping

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.22.0 - 2022-03-01

@@ -190,16 +253,22 @@ table:
* Add new config variable `flowman.default.target.verifyPolicy` to ignore empty tables during VERIFY phase
* Implement initial support for indexes in JDBC relations

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.21.2 - 2022-02-14

* Fix importing projects

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.21.1 - 2022-01-28

* flowexec now returns different exit codes depending on the processing result

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.21.0 - 2022-01-26

@@ -208,11 +277,15 @@ table:
* Implement new `stack` mapping
* Improve error messages of local CSV parser

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.20.1 - 2022-01-06

* Implement detection of dependencies introduced by schema

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.20.0 - 2022-01-05

Expand All @@ -229,6 +302,8 @@ table:
* Change the semantics of config variable `flowman.execution.target.forceDirty` (default is `false`)
* Add new `-d` / `--dirty` option for explicitly marking individual targets as dirty

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.19.0 - 2021-12-13

@@ -244,6 +319,8 @@ table:
* Implement new `measure` target for creating custom metrics for measuring data quality
* Add new config option `flowman.execution.mapping.parallelism`

This version is fully backwards compatible until and including version 0.18.0.


# Version 0.18.0 - 2021-10-13

@@ -517,12 +594,12 @@ parameter `spark.sql.sources.commitProtocolClass`
* Add support for checkpoint directory


# Verison 0.6.4 - 2019-06-20
# Version 0.6.4 - 2019-06-20

* Implement column renaming in projections


# Verison 0.6.3 - 2019-06-17
# Version 0.6.3 - 2019-06-17

* CopyRelationTask also performs projection

12 changes: 7 additions & 5 deletions QUICKSTART.md
@@ -16,18 +16,18 @@ Fortunately, Spark is rather simple to install locally on your machine:

### Download & Install Spark

As of this writing, the latest release of Flowman is 0.25.0 and is available prebuilt for Spark 3.2.1 on the Spark
As of this writing, the latest release of Flowman is 0.29.0 and is available prebuilt for Spark 3.3.1 on the Spark
homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it.

```shell
# Create a nice playground which doesn't mess up your system
mkdir playground
cd playground

# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -
curl -L https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.2.tgz | tar xvzf -

# Create a nice link
ln -snf spark-3.2.1-bin-hadoop3.2 spark
ln -snf spark-3.3.1-bin-hadoop3.2 spark
```

The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other.
@@ -50,8 +50,10 @@ You find prebuilt Flowman packages on the corresponding release page on GitHub.

```shell
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/0.25.0/flowman-dist-0.25.0-oss-spark3.2-hadoop3.3-bin.tar.gz | tar xvzf -# Create a nice link
ln -snf flowman-0.25.0 flowman
curl -L https://github.com/dimajix/flowman/releases/download/0.29.0/flowman-dist-0.29.0-oss-spark3.3-hadoop3.3-bin.tar.gz | tar xvzf -

# Create a nice link
ln -snf flowman-0.29.0-oss-spark3.3-hadoop3.3 flowman
```

### Flowman Configuration
3 changes: 1 addition & 2 deletions README.md
@@ -1,7 +1,6 @@
# [![Flowman Logo](docs/images/flowman-favicon.png) Flowman](https://flowman.io)

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Build Status](https://travis-ci.org/dimajix/flowman.svg?branch=develop)](https://travis-ci.org/dimajix/flowman)
[![Documentation](https://readthedocs.org/projects/flowman/badge/?version=latest)](https://flowman.readthedocs.io/en/latest/)

Flowman is a Spark based *data build tool* that simplifies the act of writing data transformations as part of ETL
Expand All @@ -14,7 +13,7 @@ In addition to writing and executing data transformations, Flowman can also be u
i.e. Hive or JDBC tables. Flowman can create such tables from a specification with the correct schema. This helps to
keep all aspects (like transformations and schema information) in a single place managed by a single program.

[![Flowman Logo](docs/images/flowman-overview.png)](https://flowman.io)
[![Flowman Diagram](docs/images/flowman-overview.png)](https://flowman.io)


### Notable Features
5 changes: 2 additions & 3 deletions RELEASING.md
@@ -34,7 +34,7 @@ You can also deploy to a different repository by setting the following properties
* `deployment.repository.id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
* `deployment.repository.snapshot-id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
* `deployment.repository.server` - the url of the server as used by the nexus-staging-maven-plugin
* `deployment.repository.url` - the url of the default release repsotiory
* `deployment.repository.url` - the url of the default release repository
* `deployment.repository.snapshot-url` - the url of the snapshot repository

Per default, Flowman uses the staging mechanism provided by the nexus-staging-maven-plugin. If this is not what you
@@ -43,8 +43,7 @@ want, you can simply disable the plugin via `skipStaging`
With these settings you can deploy to a different (local) repository, for example

mvn deploy \
-Pspark-2.3 \
-PCDH-5.15 \
-PCDP-7.1 \
-Ddeployment.repository.snapshot-url=https://nexus-snapshots.my-company.net/repository/snapshots \
-Ddeployment.repository.snapshot-id=nexus-snapshots \
-DskipStaging \
3 changes: 2 additions & 1 deletion build-release.sh
@@ -44,7 +44,8 @@ build_profile -Phadoop-3.3 -Pspark-3.2 -Dhadoop.version=3.3.1
build_profile -Phadoop-2.7 -Pspark-3.3
build_profile -Phadoop-3.3 -Pspark-3.3 -Dhadoop.version=3.3.2

build_profile -PEMR-6.8
build_profile -PEMR-6.9
build_profile -PCDP-7.1-spark-3.2

export JAVA_HOME=/usr/lib/jvm/java-1.8.0
build_profile -PCDH-6.3
