User Guide improvements #274

Merged: 8 commits, Sep 29, 2022
38 changes: 21 additions & 17 deletions docs/README.md
@@ -17,16 +17,20 @@
under the License.
-->

# Developer Documentation
# Ballista Documentation

Developer documentation can be found [here](developer/README.md).
User documentation can be found [here](source/user-guide/introduction.md).
## User Documentation

Documentation for the current published release can be found at https://arrow.apache.org/ballista and the source
content is located [here](source/user-guide/introduction.md).

# User Documentation
## Developer Documentation

Developer documentation can be found [here](developer/README.md).

_These instructions were forked from the `arrow-datafusion` repository and are outdated_
## Building the User Guide

## Dependencies
### Dependencies

It's recommended to install build dependencies and build the documentation
inside a Python virtualenv.
@@ -38,21 +42,21 @@ inside a Python virtualenv.
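
A rough sketch of that setup, assuming the `docs` folder provides a `requirements.txt` (the file name is an assumption and is not shown in this diff):

```bash
# Hypothetical setup sketch -- the requirements file name is an assumption.
cd docs
python3 -m venv venv               # create an isolated environment
source venv/bin/activate           # activate it
pip install -r requirements.txt    # install the documentation build dependencies
```
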
## Build

```bash
make html
./build.sh
```

## Release

The documentation is served through the
[arrow-site](https://github.com/apache/arrow-site/) repo. To release a new
version of the docs, follow these steps:
The documentation is served through the [arrow-site](https://github.com/apache/arrow-site/) repository. To release
a new version of the documentation, follow these steps (a shell sketch follows the list):

1. Run `make html` inside `docs` folder to generate the docs website inside the `build/html` folder.
2. Clone the arrow-site repo
3. Checkout to the `asf-site` branch (NOT `master`)
4. Copy build artifacts into `arrow-site` repo's `datafusion` folder with a command such as
1. Download the release source tarball (we can only publish documentation from official releases)
2. Run `./build.sh` inside `docs` folder to generate the docs website inside the `build/html` folder.
3. Clone the arrow-site repo
4. Checkout to the `asf-site` branch (NOT `master`)
5. Copy build artifacts into `arrow-site` repo's `ballista` folder with a command such as

- `cp -rT ./build/html/ ../../arrow-site/datafusion/` (doesn't work on mac)
- `rsync -avzr ./build/html/ ../../arrow-site/datafusion/`
- `cp -rT ./build/html/ ../../arrow-site/ballista/` (doesn't work on mac)
- `rsync -avzr ./build/html/ ../../arrow-site/ballista/`

5. Commit changes in `arrow-site` and send a PR.
6. Commit changes in `arrow-site` and send a PR.
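
As a rough shell sketch of steps 2 through 6 (the relative paths mirror the bullets above and assume `arrow-site` is checked out two levels up from `docs`; the commit message is illustrative):

```bash
# Sketch only -- repository layout and commit message are assumptions.
cd docs
./build.sh                                                            # step 2: generates build/html
git clone https://github.com/apache/arrow-site.git ../../arrow-site   # step 3
git -C ../../arrow-site checkout asf-site                             # step 4: NOT master
rsync -avzr ./build/html/ ../../arrow-site/ballista/                  # step 5
git -C ../../arrow-site add ballista                                  # step 6: commit, then open a PR
git -C ../../arrow-site commit -m "Update Ballista documentation"
```
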
21 changes: 21 additions & 0 deletions docs/build.sh
@@ -0,0 +1,21 @@
#!/bin/bash

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

rm -rf build
make html
27 changes: 22 additions & 5 deletions docs/source/index.rst
@@ -29,11 +29,28 @@ Table of content
:maxdepth: 1
:caption: User Guide

user-guide/introduction
user-guide/deployment/index
user-guide/python
user-guide/rust
user-guide/cli
Introduction <user-guide/introduction>

.. toctree::
:maxdepth: 1
:caption: Cluster Deployment

Deployment <user-guide/deployment/index>
Scheduler <user-guide/scheduler>

.. toctree::
:maxdepth: 1
:caption: Clients

Python <user-guide/python>
Rust <user-guide/rust>
Flight SQL JDBC <user-guide/flightsql>
SQL CLI <user-guide/cli>

.. toctree::
:maxdepth: 1
:caption: Reference

user-guide/configs
user-guide/tuning-guide
user-guide/faq
69 changes: 30 additions & 39 deletions docs/source/user-guide/cli.md
@@ -17,27 +17,35 @@
under the License.
-->

# DataFusion Command-line Interface
# Ballista Command-line Interface

The DataFusion CLI allows SQL queries to be executed by an in-process DataFusion context, or by a distributed
Ballista context.
The Ballista CLI allows SQL queries to be executed against a Ballista cluster, or in standalone mode in a single
process.

Use Cargo to install:

```bash
cargo install ballista-cli
```
USAGE:
datafusion-cli [FLAGS] [OPTIONS]

FLAGS:
-h, --help Prints help information
-q, --quiet Reduce printing other than the results and work quietly
-V, --version Prints version information
## Usage

```
USAGE:
ballista-cli [OPTIONS]

OPTIONS:
-c, --batch-size <batch-size> The batch size of each query, or use DataFusion default
-p, --data-path <data-path> Path to your data, default to current directory
-f, --file <file>... Execute commands from file(s), then exit
--format <format> Output format [default: table] [possible values: csv, tsv, table, json, ndjson]
--host <host> Ballista scheduler host
--port <port> Ballista scheduler port
-c, --batch-size <BATCH_SIZE> The batch size of each query, or use DataFusion default
-f, --file <FILE>... Execute commands from file(s), then exit
--format <FORMAT> [default: table] [possible values: csv, tsv, table, json,
nd-json]
-h, --help Print help information
--host <HOST> Ballista scheduler host
-p, --data-path <DATA_PATH> Path to your data, default to current directory
--port <PORT> Ballista scheduler port
-q, --quiet Reduce printing other than the results and work quietly
-r, --rc <RC>... Run the provided files on startup instead of ~/.datafusionrc
-V, --version Print version information
```
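
For example, an illustrative invocation that combines several of the options above (the query file name is a placeholder):

```bash
# Run queries from a file against a remote scheduler and print JSON output -- illustrative only.
ballista-cli --host localhost --port 50050 --format json -f queries.sql
```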

## Example
@@ -48,10 +56,12 @@ Create a CSV file to query.
$ echo "1,2" > data.csv
```

## Run Ballista CLI in Standalone Mode

```bash
$ datafusion-cli
$ ballista-cli

DataFusion CLI v8.0.0
Ballista CLI v8.0.0

> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
0 rows in set. Query took 0.001 seconds.
@@ -65,36 +75,17 @@ DataFusion CLI v8.0.0
1 row in set. Query took 0.017 seconds.
```

## DataFusion-Cli

Build the `datafusion-cli` without the feature of ballista.

```bash
cd arrow-datafusion/datafusion-cli
cargo build
```

## Ballista

The DataFusion CLI can also connect to a Ballista scheduler for query execution.

Before you use the `datafusion-cli` to connect the Ballista scheduler, you should build/compile
the `datafusion-cli` with feature of "ballista" first.

```bash
cd arrow-datafusion/datafusion-cli
cargo build --features ballista
```
## Run Ballista CLI in Distributed Mode

Then, you can connect the Ballista by below command.
The CLI can also connect to a Ballista scheduler for query execution.

```bash
ballista-cli --host localhost --port 50050
```

## Cli commands

Available commands inside DataFusion CLI are:
Available commands inside Ballista CLI are:

- Quit

4 changes: 2 additions & 2 deletions docs/source/user-guide/deployment/docker-compose.md
@@ -28,8 +28,8 @@ There is no officially published Docker image so it is currently necessary to bu
Run the following commands to clone the source repository and build the Docker image.

```bash
git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
cd arrow-datafusion
git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
cd arrow-ballista
./dev/build-ballista-docker.sh
```

4 changes: 2 additions & 2 deletions docs/source/user-guide/deployment/docker.md
@@ -26,8 +26,8 @@ There is no officially published Docker image so it is currently necessary to bu
Run the following commands to clone the source repository and build the Docker image.

```bash
git clone git@github.com:apache/arrow-datafusion.git -b 8.0.0
cd arrow-datafusion
git clone git@github.com:apache/arrow-ballista.git -b 8.0.0
cd arrow-ballista
./dev/build-ballista-docker.sh
```

9 changes: 4 additions & 5 deletions docs/source/user-guide/deployment/index.rst
@@ -21,8 +21,7 @@ Start a Ballista Cluster
.. toctree::
:maxdepth: 2

cargo-install
docker
docker-compose
kubernetes
configuration
Cargo <cargo-install>
Docker <docker>
Docker Compose <docker-compose>
Kubernetes <kubernetes>
2 changes: 1 addition & 1 deletion docs/source/user-guide/flightsql.md
@@ -109,7 +109,7 @@ To register a table, find a `.csv`, `.json`, or `.parquet` file for testing, and

```sql
create external table customer stored as CSV with header row
location '/home/username/arrow-datafusion/datafusion/core/tests/tpch-csv/customer.csv';
location '/path/to/customer.csv';
```

Once the table has been registered, all the normal SQL queries can be performed:
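
For example, an illustrative query (the column names assume the TPC-H `customer` schema referenced by the original test file; adjust them to your data):

```sql
select c_custkey, c_name, c_acctbal from customer limit 10;
```
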
29 changes: 14 additions & 15 deletions docs/source/user-guide/introduction.md
@@ -19,16 +19,19 @@

# Overview

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is
built on an architecture that allows other programming languages to be supported as first-class citizens without paying
a penalty for serialization costs.
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow.

The foundational technologies in Ballista are:
Ballista has a scheduler and an executor process that are standard Rust executables and can be executed directly, but
Dockerfiles are provided to build images for use in containerized environments, such as Docker, Docker Compose, and
Kubernetes. See the [deployment guide](deployment.md) for more information.

- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
- [DataFusion](https://github.com/apache/arrow-datafusion/) for query execution.
SQL and DataFrame queries can be submitted from Python and Rust, and SQL queries can be submitted via the Arrow
Flight SQL JDBC driver, supporting your favorite JDBC-compliant tools such as [DataGrip][datagrip]
or [tableau][tableau]. For setup instructions, please see the [FlightSQL guide](flightsql.md).

The scheduler has a web user interface for monitoring query status as well as a REST API.

![Ballista Scheduler Web UI](./images/ballista-web-ui.png)

## How does this compare to Apache Spark?

@@ -45,10 +48,6 @@ Although Ballista is largely inspired by Apache Spark, there are some key differ
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
in any programming language with minimal serialization overhead.

## Status

Ballista is still in the early stages of development but is capable of executing complex analytical queries at scale.

## Usage

Ballista can be used from your favorite JDBC compliant tools such as [DataGrip](https://www.jetbrains.com/datagrip/) or [tableau](https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm). For setup instructions, please see the [FlightSQL guide](flightsql.md).
[deployment]: ./deployment
[datagrip]: https://www.jetbrains.com/datagrip/
[tableau]: https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm
47 changes: 45 additions & 2 deletions docs/source/user-guide/python.md
@@ -21,6 +21,9 @@

Ballista provides Python bindings, allowing SQL and DataFrame queries to be executed from the Python shell.

Like PySpark, it allows you to build a plan through SQL or a DataFrame API against Parquet, CSV, JSON, and other
popular file formats, run it in a distributed environment, and obtain the result back in Python.

## Connecting to a Cluster

The following code demonstrates how to create a Ballista context and connect to a scheduler.
@@ -30,7 +33,13 @@ The following code demonstrates how to create a Ballista context and connect to
>>> ctx = ballista.BallistaContext("localhost", 50050)
```

## Registering Tables
## SQL

The Python bindings support executing SQL queries as well.

### Registering Tables

Before SQL queries can be executed, tables need to be registered with the context.

Tables can be registered against the context by calling one of the `register` methods, or by executing SQL.

@@ -42,7 +51,7 @@ Tables can be registered against the context by calling one of the `register` me
>>> ctx.sql("CREATE EXTERNAL TABLE trips STORED AS PARQUET LOCATION '/mnt/bigdata/nyctaxi'")
```

## Executing Queries
### Executing Queries

The `sql` method creates a `DataFrame`. The query is executed when an action such as `show` or `collect` is executed.
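
For example, a minimal sketch assuming the `trips` table registered above:

```python
>>> # illustrative only -- assumes the `trips` table registered earlier in this guide
>>> df = ctx.sql("SELECT count(*) FROM trips")
>>> df.show()
```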

@@ -88,3 +97,37 @@ The `explain` method can be used to show the logical and physical query plans fo
| | |
+---------------+-------------------------------------------------------------+
```

## DataFrame

The following example demonstrates creating arrays with PyArrow and then creating a Ballista DataFrame.

```python
import ballista
import pyarrow

# an alias
f = ballista.functions

# create a context
ctx = ballista.BallistaContext("localhost", 50050)

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])

# create a new statement
df = df.select(
f.col("a") + f.col("b"),
f.col("a") - f.col("b"),
)

# execute and collect the first (and only) batch
result = df.collect()[0]

assert result.column(0) == pyarrow.array([5, 7, 9])
assert result.column(1) == pyarrow.array([-3, -3, -3])
```
5 changes: 1 addition & 4 deletions docs/source/user-guide/rust.md
@@ -19,10 +19,7 @@

# Ballista Rust Client

Ballista usage is very similar to DataFusion. The main difference is that the starting point is a `BallistaContext`
instead of the DataFusion `SessionContext`. Ballista uses the same DataFrame API as DataFusion.

The following code sample demonstrates how to create a `BallistaContext` to connect to a Ballista scheduler process.
To connect to a Ballista cluster from Rust, start by creating a `BallistaContext`.

```rust
let config = BallistaConfig::builder()
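    // The rest of the original example is collapsed in this diff view. What follows is a
    // hedged sketch, not necessarily the author's code: it assumes the async
    // `BallistaContext::remote` constructor, the `ballista.shuffle.partitions` setting,
    // and that this snippet runs inside an async function (e.g. `#[tokio::main] async fn main`).
    .set("ballista.shuffle.partitions", "4")
    .build()?;

// Connect to a scheduler on localhost and run a trivial query to verify the connection.
let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
let df = ctx.sql("SELECT 1").await?;
df.show().await?;
```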