1 change: 1 addition & 0 deletions .gitignore
@@ -58,6 +58,7 @@ docs/example1.dat
docs/example3.dat
python/.eggs/
python/doc/
python/examples/.ipynb_checkpoints
# Egg metadata
*.egg-info

96 changes: 96 additions & 0 deletions python/README.md
@@ -79,6 +79,102 @@ pyarrow_batches = df.collect()

See [DataFusion Python](https://datafusion.apache.org/python/) for more examples and manuals.

## Jupyter Notebook Support

PyBallista provides first-class Jupyter notebook support with SQL magic commands and rich HTML rendering.

Member

Suggested change
Install Jupyter extras first:
```bash
pip install "ballista[jupyter]"
```

Author

Fixed

### Installation

Install the Jupyter extras first:
```bash
pip install "ballista[jupyter]"
```

### HTML Table Rendering

DataFrames automatically render as styled HTML tables in Jupyter notebooks:

```python
from ballista import BallistaSessionContext

ctx = BallistaSessionContext("df://localhost:50050")
df = ctx.sql("SELECT * FROM my_table LIMIT 10")
df # Renders as HTML table via _repr_html_()
```
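
Jupyter chooses the HTML form because it looks for a `_repr_html_()` method on the last expression in a cell. The sketch below illustrates that display protocol with a hypothetical `TinyTable` class. It is not PyBallista's implementation; the class name, its fields, and the generated markup are assumptions made for illustration.

```python
from html import escape


class TinyTable:
    """Hypothetical stand-in for a DataFrame; illustrates the Jupyter display protocol only."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def _repr_html_(self):
        # Jupyter calls this method (when present) to obtain rich HTML output.
        header = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{header}</tr></thead><tbody>{body}</tbody></table>"


table = TinyTable(["customer_id", "total"], [[1, 250.0], [2, "<n/a>"]])
html_output = table._repr_html_()
```

Cell values are escaped before being placed in the markup, so data containing `<` or `&` cannot break the table.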

### SQL Magic Commands

For a more interactive SQL experience, load the Ballista Jupyter extension:

```python
# Load the extension
%load_ext ballista.jupyter

# Connect to a Ballista cluster
%ballista connect df://localhost:50050

# Register a Parquet file as a table
%register parquet public.test_data_v1 ../testdata/test.parquet

# Check connection status
%ballista status

# List registered tables
%ballista tables

# Show table schema
%ballista schema my_table

# Execute a simple query (line magic)
%sql SELECT COUNT(*) FROM orders

# Execute a complex query (cell magic; %%sql must be the first line of its own cell)
%%sql
SELECT
    customer_id,
    SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10
```

You can also store results in a variable:

```python
%%sql my_result
SELECT * FROM orders WHERE status = 'pending'
```
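
IPython hands a cell magic two strings: the remainder of the `%%sql` line and the cell body. A magic could therefore treat a non-empty first line as the name of the variable to store into. The following is a minimal sketch of that idea, not the extension's actual code; `run_sql_cell`, `run_query`, and `namespace` are hypothetical names.

```python
def run_sql_cell(line, cell, run_query, namespace):
    """Run the SQL in `cell`; if `line` names a variable, store the result under it.

    Hypothetical helper: `run_query` is any callable taking a SQL string, and
    `namespace` is a dict standing in for the notebook's user namespace.
    """
    target = line.strip()
    if target and not target.isidentifier():
        raise ValueError(f"not a valid variable name: {target!r}")

    result = run_query(cell.strip())
    if target:
        namespace[target] = result
    return result


# Stand-in query runner so the sketch runs without a cluster.
def fake_run_query(sql):
    return {"sql": sql, "rows": []}


user_ns = {}
returned = run_sql_cell(
    "my_result",
    "SELECT * FROM orders WHERE status = 'pending'\n",
    fake_run_query,
    user_ns,
)
```

Validating the name before running the query means a typo in the variable name fails fast instead of after a long-running job.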

### Execution Plan Visualization

Visualize query execution plans directly in notebooks:

```python
df = ctx.sql("SELECT * FROM orders WHERE amount > 100")
df.explain_visual() # Displays SVG visualization

# With runtime statistics
df.explain_visual(analyze=True)
```

> **Note:** Full SVG visualization requires graphviz to be installed (`brew install graphviz` on macOS).
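
Because Graphviz is an external dependency, it can help to check for it before requesting an SVG. The sketch below assumes the renderer relies on the Graphviz `dot` executable being on `PATH`, which this README does not confirm; `graphviz_available` is a hypothetical helper, not part of PyBallista.

```python
import shutil


def graphviz_available():
    """Return True if the Graphviz `dot` executable is found on PATH."""
    return shutil.which("dot") is not None


if graphviz_available():
    message = "Graphviz found; SVG plan rendering should be possible."
else:
    message = "Graphviz not found; install it first (e.g. `brew install graphviz` on macOS)."
print(message)
```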

### Progress Indicators

For long-running queries, use `collect_with_progress()` to see execution status:

```python
df = ctx.sql("SELECT * FROM large_table")
batches = df.collect_with_progress()
```

### Example Notebooks

See the `examples/` directory for Jupyter notebooks demonstrating various features:

- `getting_started.ipynb` - Basic connection and queries
- `dataframe_api.ipynb` - DataFrame transformations
- `distributed_queries.ipynb` - Multi-stage distributed query examples

## Scheduler and Executor

The scheduler and executors can be configured and started from Python code.