#81: Updated user guide (#82)
Co-authored-by: Anastasiia Sergienko <46891819+AnastasiiaSergienko@users.noreply.github.com>
morazow and AnastasiiaSergienko committed Sep 30, 2020
1 parent 237c984 commit 151f8b4
Showing 8 changed files with 535 additions and 518 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -65,6 +65,7 @@ script:

after_success:
- bash <(curl -s https://codecov.io/bash)
- ./sbtx coveralls
- ./scripts/publish.sh

cache:
301 changes: 17 additions & 284 deletions README.md
@@ -4,294 +4,30 @@

[![Build Status][travis-badge]][travis-link]
[![Codecov][codecov-badge]][codecov-link]
[![Coveralls][coveralls-badge]][coveralls-link]
[![Maven Central][maven-img-badge]][maven-link]

<p style="border: 1px solid black;padding: 10px; background-color: #FFFFCC;">
<span style="font-size:200%">&#128712;</span> Please note that this is an open
source project which is officially supported by Exasol. For any questions, you
can contact our support team or open a GitHub issue.
</p>

## Overview

This is a connector library that supports an integration between
[Exasol][exasol] and [Apache Spark][spark]. Using this connector, users can
create Spark dataframes from Exasol queries and save Spark dataframes as Exasol
tables.

The implementation is based on Spark [DataSources API][spark-ds-api] and Exasol
[Sub Connections][sol-546].

* [Prerequisite](#prerequisite)
* [Quick Start](#quick-start)
* [Usage](#usage)
* [Configuration](#configuration)
* [Building and Testing](#building-and-testing)

## Prerequisite

* Deployed and running Spark cluster
* Deployed and running Exasol cluster
* Make sure the Spark cluster has enough resources to start at least as many
  executors as there are Exasol data nodes (see the sketch after this list)
* Make sure that the Spark cluster can connect to the Exasol nodes using their
  private IP addresses, e.g., `10.0.0.11`
* Make sure that the Exasol nodes are reachable from the Spark cluster on port
  `8563` and on the port range `20000-21000`
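
For example, here is a minimal sketch of sizing the Spark application so that
it can start at least as many executors as there are Exasol data nodes. The
numbers below are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf

// Assumption for illustration: the Exasol cluster has 4 data nodes, so the
// application requests at least 4 executors via standard Spark settings.
val sizingConf = new SparkConf()
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
```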

## Quick Start

Here are short code snippets showing how to use the connector in your Spark /
Scala applications.

Reading data from Exasol as a Spark dataframe:

```scala
// An Exasol SQL query string
val exasolQueryString =
  """
    SELECT SALES_DATE, MARKET_ID, PRICE
    FROM RETAIL.SALES
    WHERE MARKET_ID IN (661, 534, 667)
  """

val df = sparkSession
  .read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("query", exasolQueryString)
  .load()
```

Saving a Spark dataframe as an Exasol table:

```scala
df
  .write
  .mode("append")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("table", "RETAIL.ADJUSTED_SALES")
  .format("exasol")
  .save()
```

Additionally, you can set the parameters on `SparkConf`:

```scala
// Configure the Spark session
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .set("spark.exasol.host", "localhost")
  .set("spark.exasol.port", "8563")
  .set("spark.exasol.username", "sys")
  .set("spark.exasol.password", "exasol")
  .set("spark.exasol.max_nodes", "200")

val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()

val queryStr = "SELECT * FROM MY_SCHEMA.MY_TABLE"

val df = sparkSession
  .read
  .format("exasol")
  .option("query", queryStr)
  .load()
```

Please note that parameter values set in the Spark configuration take
precedence over the same options set on the DataFrame reader or writer.
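
For example, here is a minimal sketch of that precedence rule, assuming a
`sparkSession` built with `spark.exasol.username` set to `sys` as in the
configuration example above:

```scala
val df = sparkSession
  .read
  .format("exasol")
  .option("username", "other_user") // also set as a reader option
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()

// Because Spark configuration values take precedence, the connector should
// log in as "sys" (from SparkConf), not as "other_user".
```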

For an example walkthrough please check
[doc/example-walkthrough](doc/example-walkthrough.md).

Additionally, you can read about the latest changes in the
[Changelog](doc/changes/changelog.md) file.

## Usage

The latest release version ([![Maven Central][maven-reg-badge]][maven-link]) is
compiled against Scala 2.11 and Spark 2.1+.

In order to use the connector in your Java or Scala applications, you can
include it as a dependency in your project by adding the artifact information
to your `build.sbt` or `pom.xml` file.

### build.sbt

```scala
resolvers ++= Seq("Exasol Releases" at "https://maven.exasol.com/artifactory/exasol-releases")

libraryDependencies += "com.exasol" % "spark-connector_2.11" % "$LATEST_VERSION"
```

### pom.xml

```xml
<repository>
    <id>maven.exasol.com</id>
    <url>https://maven.exasol.com/artifactory/exasol-releases</url>
</repository>

<dependency>
    <groupId>com.exasol</groupId>
    <artifactId>spark-connector_2.11</artifactId>
    <version>$LATEST_VERSION</version>
</dependency>
```

### Alternative Option

As an alternative, you can provide `--repositories` and `--packages` artifact
coordinates to the **spark-submit**, **spark-shell** or **pyspark** commands.

For example:

Spark Exasol Connector supports an integration between [Exasol][exasol] and
[Apache Spark][spark]. It allows creating a Spark DataFrame from Exasol queries
and saving a Spark DataFrame as an Exasol table.

```sh
spark-shell \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION
```
## Features

### Deployment
* Creates a Spark DataFrame from Exasol query results
* Saves a Spark DataFrame to an Exasol table
* Allows configuring Spark tasks for parallel connections
* Supports Spark DataFrame column pruning and filter pushdown (see the sketch
  after this list)
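
The following is a minimal sketch of the column pruning and filter pushdown
behavior, assuming a `sparkSession` configured as in the Quick Start examples:
only the selected columns and the filter condition should be sent to Exasol
rather than the full table.

```scala
import org.apache.spark.sql.functions.col

val sales = sparkSession
  .read
  .format("exasol")
  .option("query", "SELECT * FROM RETAIL.SALES")
  .load()

// Only MARKET_ID and PRICE are requested and the PRICE filter can be pushed
// down, so Exasol should not need to return the other columns or the
// non-matching rows.
val expensiveSales = sales
  .select("MARKET_ID", "PRICE")
  .filter(col("PRICE") > 100.0)

expensiveSales.show()
```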

Similarly, you can submit a packaged application to the Spark cluster.
## Information for Users

Using spark-submit:
* [User Guide](doc/user_guide/user_guide.md)
* [Changelog](doc/changes/changelog.md)

```sh
spark-submit \
  --master spark://spark-master-url:7077 \
  --repositories https://maven.exasol.com/artifactory/exasol-releases \
  --packages com.exasol:spark-connector_2.11:$LATEST_VERSION \
  --class com.myorg.SparkExasolConnectorApp \
  --conf spark.exasol.password=exaTru3P@ss \
  path/to/project/folder/target/scala-2.11/sparkexasolconnectorapp_2.11-5.3.1.jar
```
## Information for Contributors

This deployment example also shows that you can configure the Exasol parameters
at startup using the `--conf spark.exasol.keyName=value` syntax.

<strong>Please replace `$LATEST_VERSION` with the latest artifact version
number.</strong>

## Configuration

The following configuration parameters are provided mainly to facilitate a
connection to the Exasol cluster. You can add extra JDBC options using
``spark.exasol.jdbc_options`` if needed (see the sketch after the table).

| Spark Configuration | Configuration | Default | Description
| :--- | :--- | :--- | :---
| | ``query`` | *<none>* | A query string to send to Exasol
| | ``table`` | *<none>* | A table name (with schema, e.g. my_schema.my_table) to save the dataframe into
| ``spark.exasol.host`` | ``host`` | ``localhost`` | The IP address of the **first** Exasol node (e.g. 10.0.0.11)
| ``spark.exasol.port`` | ``port`` | ``8888`` | The port number to connect to on the Exasol nodes (e.g. 8563)
| ``spark.exasol.username`` | ``username`` | ``sys`` | An Exasol username for logging in
| ``spark.exasol.password`` | ``password`` | ``exasol`` | An Exasol password for logging in
| ``spark.exasol.max_nodes`` | ``max_nodes`` | ``200`` | The maximum number of Exasol data nodes to use for parallel connections
| ``spark.exasol.batch_size`` | ``batch_size`` | ``1000`` | The number of records batched before executing an insert statement when saving a dataframe
| ``spark.exasol.create_table`` | ``create_table`` | ``false`` | Permission to create the table if it does not exist in Exasol when saving a dataframe
| ``spark.exasol.drop_table`` | ``drop_table`` | ``false`` | Permission to drop the table if it exists in Exasol when saving a dataframe
| ``spark.exasol.jdbc_options`` | ``jdbc_options`` | ``""`` | Extra Exasol JDBC options as a ``key1=value1;key2=value2`` string; it must not start or end with ``;``
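
For instance, here is a minimal sketch of passing extra JDBC options. The
option names used below are only an illustration of the
``key1=value1;key2=value2`` format; please check the Exasol JDBC driver
documentation for the options it actually supports.

```scala
// The jdbc_options string must not start or end with ";".
val df = sparkSession
  .read
  .format("exasol")
  .option("host", "10.0.0.11")
  .option("port", "8563")
  .option("username", "sys")
  .option("password", "exaPass")
  .option("jdbc_options", "debug=1;logdir=/tmp/")
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()
```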

## Building and Testing

Clone the repository,

```bash
git clone https://github.com/exasol/spark-exasol-connector

cd spark-exasol-connector/
```

Compile,

```bash
./sbtx compile
```

Run unit tests,

```bash
./sbtx test
```

To run integration tests, a separate docker network should be created first,

```bash
docker network create -d bridge --subnet 192.168.0.0/24 --gateway 192.168.0.1 dockernet
```

then run,

```bash
./sbtx it:test
```

The integration tests require [docker][docker],
[exasol/docker-db][exa-docker-db], [testcontainers][testcontainers] and
[spark-testing-base][spark-testing-base].

In order to create a bundled jar,

```bash
./sbtx assembly
```

This creates a jar file in the `target/` folder. The jar file can be used with
the `spark-submit`, `spark-shell` or `pyspark` commands. For example,

```shell
spark-shell --jars /path/to/spark-exasol-connector-assembly-*.jar
```

## FAQ

* Getting a `com.exasol.jdbc.ConnectFailed: Connection refused` exception

This usually occurs when the Spark connector cannot reach the Exasol data
nodes. Please make sure that the Exasol data nodes are reachable on port `8563`
and on the port range `20000-21000`.

Additionally, please make sure that the `host` parameter value is set to the
first Exasol data node address, for example, `10.0.0.11`.

* Getting a `Connection was lost and could not be reestablished` error

For example:

```txt
[error] Caused by: com.exasol.jdbc.ConnectFailed: Connection was lost and could not be reestablished. (SessionID: 1615669509094853970)
[error] at com.exasol.jdbc.AbstractEXAConnection.reconnect(AbstractEXAConnection.java:3505)
[error] at com.exasol.jdbc.ServerCommunication.handle(ServerCommunication.java:98)
[error] at com.exasol.jdbc.AbstractEXAConnection.communication(AbstractEXAConnection.java:2537)
[error] at com.exasol.jdbc.AbstractEXAConnection.communication_resultset(AbstractEXAConnection.java:2257)
[error] at com.exasol.jdbc.AbstractEXAStatement.execute(AbstractEXAStatement.java:456)
[error] at com.exasol.jdbc.EXAStatement.execute(EXAStatement.java:278)
[error] at com.exasol.jdbc.AbstractEXAStatement.executeQuery(AbstractEXAStatement.java:601)
[error] at com.exasol.spark.rdd.ExasolRDD.compute(ExasolRDD.scala:125)
[error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
```

This is one of the known issues. It happens when the number of parallel tasks
scheduled by Spark is less than the number of [sub connections][sol-546]. It
can be mitigated by submitting the Spark application with enough resources so
that it can start at least as many parallel tasks as there are parallel Exasol
connections.

Additionally, you can limit the number of parallel Exasol connections using the
`max_nodes` parameter (see the sketch below). However, limiting this value is
not advised in production environments.
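
As a small sketch of that mitigation, assuming the same connection settings as
in the Quick Start examples, the limit can be set as a reader option (or as
`spark.exasol.max_nodes` on the Spark configuration):

```scala
// Cap the number of parallel Exasol connections; the value 4 is only an
// illustrative assumption and is not recommended for production use.
val df = sparkSession
  .read
  .format("exasol")
  .option("max_nodes", "4")
  .option("query", "SELECT * FROM MY_SCHEMA.MY_TABLE")
  .load()
```
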
* [Developer Guide](doc/development/developer_guide.md)

## Dependencies

@@ -344,24 +80,21 @@ These plugins help with project development.
[travis-link]: https://travis-ci.com/exasol/spark-exasol-connector
[codecov-badge]: https://codecov.io/gh/exasol/spark-exasol-connector/branch/master/graph/badge.svg
[codecov-link]: https://codecov.io/gh/exasol/spark-exasol-connector
[coveralls-badge]: https://coveralls.io/repos/github/exasol/spark-exasol-connector/badge.svg?branch=master
[coveralls-link]: https://coveralls.io/github/exasol/spark-exasol-connector?branch=master
[maven-img-badge]: https://img.shields.io/maven-central/v/com.exasol/spark-connector_2.12.svg
[maven-reg-badge]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.12/badge.svg
[maven-link]: https://maven-badges.herokuapp.com/maven-central/com.exasol/spark-connector_2.12
[exasol]: https://www.exasol.com/en/
[spark]: https://spark.apache.org/
[spark-ds-api]: https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
[docker]: https://www.docker.com/
[exa-docker-db]: https://hub.docker.com/r/exasol/docker-db/
[testcontainers]: https://www.testcontainers.org/
[spark-testing-base]: https://github.com/holdenk/spark-testing-base
[sol-546]: https://www.exasol.com/support/browse/SOL-546
[exasol-jdbc-link]: https://www.exasol.com/portal/display/DOWNLOAD/Exasol+Download+Section
[spark-sql-link]: https://spark.apache.org/sql/
[scalatest-link]: http://www.scalatest.org/
[scalatestplus-link]: https://github.com/scalatest/scalatestplus-mockito
[mockitocore-link]: https://site.mockito.org/
[tcont-jdbc-link]: https://www.testcontainers.org/modules/databases/jdbc/
[tcont-scala-link]: https://github.com/testcontainers/testcontainers-scala
[spark-testing-base]: https://github.com/holdenk/spark-testing-base
[sbt-coursier-link]: https://github.com/coursier/coursier
[sbt-wartremover-link]: http://github.com/puffnfresh/wartremover
[sbt-wcontrib-link]: http://github.com/wartremover/wartremover-contrib
8 changes: 8 additions & 0 deletions doc/changes/changes_0.3.2.md
@@ -11,9 +11,17 @@ to `7.0.0`.
* [#76](https://github.com/exasol/spark-exasol-connector/issues/76): Enabled extra options for JDBC connection (PR [#77](https://github.com/exasol/spark-exasol-connector/pull/77)).
* [#78](https://github.com/exasol/spark-exasol-connector/issues/78): Updated Exasol JDBC version to `7.0.0` (PR [#79](https://github.com/exasol/spark-exasol-connector/pull/79)).

## Documentation

* [#81](https://github.com/exasol/spark-exasol-connector/issues/81): Updated user and development guide (PR [#82](https://github.com/exasol/spark-exasol-connector/pull/82)).

## Dependency Updates

* Updated Scala version from `2.12.10` to `2.12.12`.
* Updated ``com.exasol:exasol-jdbc`` from `6.2.5` to `7.0.0`.
* Updated ``org.mockito:mockito-core`` from `3.5.7` to `3.5.13`.
* Updated ``com.dimafeng:testcontainers-scala`` from `0.38.1` to `0.38.4`.

### Plugin Updates

* Added ``org.scoverage:sbt-coveralls`` plugin for additional coverage report.
24 changes: 24 additions & 0 deletions doc/development/developer_guide.md
@@ -0,0 +1,24 @@
# Developer Guide

Please read the general [developer guide for the Scala projects][dev-guide].

## Integration Tests

The integration tests are run using [Docker][docker] containers. The tests use
[exasol/docker-db][exa-docker-db], [testcontainers][testcontainers] and
[spark-testing-base][spark-testing-base].

To run integration tests, a separate docker network should be created first:

```bash
docker network create -d bridge --subnet 192.168.0.0/24 --gateway 192.168.0.1 dockernet
```

The docker network is required since we connect to the Exasol docker container
using an internal IPv4 address.

[dev-guide]: https://github.com/exasol/import-export-udf-common-scala/blob/master/doc/development/developer_guide.md
[docker]: https://www.docker.com/
[exa-docker-db]: https://hub.docker.com/r/exasol/docker-db/
[testcontainers]: https://www.testcontainers.org/
[spark-testing-base]: https://github.com/holdenk/spark-testing-base
