<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Getting Started with PyBallista

This notebook demonstrates how to get started with Ballista using Python.

## Prerequisites

1. Install PyBallista: `pip install ballista`
2. Have a Ballista cluster running (or use the built-in test cluster)

## Overview

Ballista is a distributed query engine built on Apache DataFusion. PyBallista provides:

- **BallistaSessionContext**: Drop-in replacement for DataFusion's SessionContext
- **SQL Magic Commands**: Interactive SQL in Jupyter notebooks via `%sql` and `%%sql`
- **DataFrame API**: Full DataFrame API for data transformations
- **Rich HTML Display**: DataFrames render as styled HTML tables

## Method 1: Python API

The most straightforward way to use Ballista is via the Python API.

In [None]:
import ballista
from ballista import BallistaSessionContext, setup_test_cluster

# Check versions
print(f"Ballista version: {ballista.__version__}")

In [None]:
# For this demo, we'll use the built-in test cluster
# In production, you would connect to your Ballista scheduler:
# ctx = BallistaSessionContext("df://your-scheduler:50050")

host, port = setup_test_cluster()
ctx = BallistaSessionContext(f"df://{host}:{port}")

print(f"Connected to Ballista at {host}:{port}")
print(f"Session ID: {ctx.session_id}")
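
The scheduler address used above follows the pattern `df://host:port`. When the address comes from configuration rather than from `setup_test_cluster()`, it can help to validate it before opening a session. The following is a minimal sketch using only the standard library. The helper name `parse_scheduler_url` is hypothetical and not part of PyBallista, and the fallback port of 50050 is an assumption taken from the example address in this notebook.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_scheduler_url(url: str, default_port: int = 50050) -> Tuple[str, int]:
    """Split a `df://host:port` address into (host, port).

    Hypothetical helper for illustration; not part of the PyBallista API.
    The default port is an assumption based on the example address above.
    """
    parsed = urlparse(url)
    if parsed.scheme != "df":
        raise ValueError(f"expected a df:// URL, got: {url!r}")
    if not parsed.hostname:
        raise ValueError(f"no host found in: {url!r}")
    # Fall back to the assumed default when the URL omits the port
    return parsed.hostname, parsed.port or default_port


host, port = parse_scheduler_url("df://localhost:50050")
print(f"Scheduler host={host}, port={port}")
```

The validated pair can then be formatted back into the connection string passed to `BallistaSessionContext`, as in the cell above.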

In [None]:
# Register a Parquet file as a table
ctx.register_parquet("test_data", "../testdata/test.parquet")

# List registered tables
print("Registered tables:", ctx.tables())
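
The relative path `../testdata/test.parquet` is resolved against the notebook's working directory, so registration can fail if the notebook is launched from a different location. A small sketch of resolving and checking the path first is shown here. The helper `resolve_data_file` is hypothetical, and it only checks the local filesystem; in a real cluster the file presumably also has to be readable from the executors.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_data_file(
    relative_path: str, base_dir: Optional[Union[str, Path]] = None
) -> str:
    """Return an absolute path for a data file, or raise if it is missing.

    Hypothetical helper for illustration; not part of the PyBallista API.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / relative_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"data file not found: {path}")
    return str(path)


try:
    parquet_path = resolve_data_file("../testdata/test.parquet")
    print(f"Using data file: {parquet_path}")
except FileNotFoundError as exc:
    # Report the problem instead of failing later inside the query engine
    print(exc)
```

The returned absolute path can be passed to `ctx.register_parquet` in place of the relative one.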

In [None]:
# Execute a SQL query. As the last expression in the cell, the DataFrame renders as an HTML table
df = ctx.sql("SELECT * FROM test_data LIMIT 10")
df

In [None]:
# You can also use show() for terminal-style output
df.show(5)

In [None]:
# Get the execution plan
print(df.explain())

In [None]:
# Visualize the execution plan (requires graphviz for full SVG)
df.explain_visual()

## Method 2: SQL Magic Commands

For a more interactive workflow, use the magic commands: `%sql` and `%%sql` run queries, and `%ballista` manages the connection.

In [None]:
# Load the Ballista Jupyter extension
%load_ext ballista.jupyter

In [None]:
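# Connect to the cluster (replace the address with your scheduler's host and port;
# if you started the test cluster above, use the host and port it returned)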
%ballista connect df://localhost:50050

In [None]:
# Check connection status
%ballista status

In [None]:
# List registered tables
%ballista tables

In [None]:
# Execute a single-line SQL query
%sql SELECT COUNT(*) as total_rows FROM test_data

In [None]:
%%sql
-- Multi-line queries work with %%sql cell magic
SELECT
    id,
    bool_col,
    tinyint_col
FROM test_data
WHERE id > 2
ORDER BY id
LIMIT 5

In [None]:
%%sql my_result
-- Store the result in a variable for further processing
SELECT * FROM test_data WHERE id <= 3

In [None]:
# The result is now available as a variable
print(f"Number of rows: {my_result.count()}")

# Convert to pandas for further analysis
pandas_df = my_result.to_pandas()
pandas_df.describe()
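
IPython passes a cell magic two strings: the text after the magic name on the first line, and the body of the cell. The `%%sql my_result` form above can be understood in those terms: the first string names the variable and the second holds the query. The sketch below illustrates that split with a plain function. It is an illustration only, not PyBallista's actual implementation, and `split_sql_cell` is a hypothetical name.

```python
import keyword
from typing import Optional, Tuple


def split_sql_cell(line: str, cell: str) -> Tuple[Optional[str], str]:
    """Return (variable_name, query) for a `%%sql [name]` style cell.

    Illustrative only; this is not how PyBallista is known to implement it.
    """
    name = line.strip() or None
    if name is not None and (not name.isidentifier() or keyword.iskeyword(name)):
        raise ValueError(f"not a valid Python variable name: {name!r}")
    # Drop blank lines and lines that are entirely SQL `--` comments
    query_lines = [
        ln for ln in cell.splitlines()
        if ln.strip() and not ln.strip().startswith("--")
    ]
    return name, "\n".join(query_lines).strip()


name, query = split_sql_cell(
    "my_result",
    "-- Store the result in a variable\nSELECT * FROM test_data WHERE id <= 3\n",
)
print(name, "<-", query)
```

Validating the name up front means a typo such as `%%sql my-result` fails with a clear message rather than silently discarding the result.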

In [None]:
# View query history
%ballista history

## Data Export

Query results can be written to Parquet, CSV, or JSON files, or converted to in-memory Arrow and pandas objects.

In [None]:
df = ctx.sql("SELECT * FROM test_data LIMIT 100")

# Write the result to disk (uncomment the format you need)
# df.write_parquet("output.parquet")
# df.write_csv("output.csv")
# df.write_json("output.json")

# Convert to an in-memory Arrow table (pandas conversion follows in the next cell)
arrow_table = df.to_arrow_table()
print(f"Arrow Table Schema:\n{arrow_table.schema}")

In [None]:
# Convert to pandas
pandas_df = df.to_pandas()
pandas_df.head()

## Next Steps

- Check out the `dataframe_api.ipynb` notebook for more DataFrame operations
- See `distributed_queries.ipynb` for examples of distributed query execution
- Read the [PyBallista documentation](https://datafusion.apache.org/ballista/) for more details