# Interactive Data Visualization with Dask and hvplot

This notebook demonstrates interactive visualization of S3-hosted data using Dask and hvplot.

## Overview

This example shows how to:
1. Read large datasets from S3 using Dask
2. Create interactive visualizations with hvplot
3. Explore data interactively with widgets
4. Perform efficient computations on large datasets

## Data Source

Data is loaded from S3 using the same pattern as the existing `dask_s3_plot` example:
- Dataset: shboost_08july2024_pub.parq
- S3 Endpoint: https://s3.data.aip.de:9000
- Access: Anonymous (public)


In [None]:
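# Import required libraries
import dask.dataframe as dd
import hvplot.pandas  # registers the .hvplot accessor on pandas objects
import pandas as pd
import numpy as np
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')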

# Enable hvplot extension for Jupyter
hvplot.extension('bokeh')

## Loading Data from S3

We'll read the parquet data from S3 using Dask for efficient handling of large datasets.

In [None]:
# Read parquet data from S3
print("Reading parquet data from S3...")
df = dd.read_parquet(
    "s3://shboost2024/shboost_08july2024_pub.parq/*.parquet",
    storage_options={
        'use_ssl': True,
        'anon': True,
        'client_kwargs': dict(endpoint_url='https://s3.data.aip.de:9000')
    }
)

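# Column names and the partition count come from metadata. The row count
# is lazy in Dask (df.shape does not print a number), so it is not shown here.
print(f"Partitions: {df.npartitions}")
print(f"Columns ({len(df.columns)}): {list(df.columns)}")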

## Data Sampling for Interactive Exploration

Since the full dataset is large, we'll sample it for interactive exploration while maintaining the ability to work with the full dataset when needed.

In [None]:
# Sample data for interactive exploration (5% of the dataset)
print("Sampling data for interactive exploration...")
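# Keep the 5% split and discard the rest; random_state makes the split reproducible
df_sample, _ = df.random_split([0.05, 0.95], random_state=42)
# compute() materializes the sample as an in-memory pandas DataFrame
df_computed = df_sample.compute()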

print(f"Sample size: {len(df_computed)} rows")
print(f"Memory usage: {df_computed.memory_usage(deep=True).sum() / 1e6:.2f} MB")
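
The 5% fraction is a fixed choice. One rough way to pick a fraction from a memory budget is to estimate the bytes per row from a few rows and scale up. The helper below is a hypothetical sketch (the name `fraction_for_budget` and the budget figure are assumptions, not part of the original notebook):

```python
import pandas as pd

def fraction_for_budget(head: pd.DataFrame, total_rows: int, budget_mb: float) -> float:
    """Estimate a sampling fraction that keeps the sample under budget_mb."""
    # Average bytes per row, measured on a small in-memory slice (index excluded)
    bytes_per_row = head.memory_usage(index=False, deep=True).sum() / len(head)
    max_rows = budget_mb * 1e6 / bytes_per_row
    # Never ask for more than the whole dataset
    return min(1.0, max_rows / total_rows)

# Toy slice: two float64 columns, i.e. 16 bytes per row
head = pd.DataFrame({"xg": [0.0] * 10, "yg": [1.0] * 10})
frac = fraction_for_budget(head, total_rows=1_000_000, budget_mb=1.6)
print(f"Suggested fraction: {frac:.3f}")
```

With the Dask frame, the slice could come from `df.head(1000)` and the row count from `len(df)`, at the cost of one pass over the data.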

## Basic Data Exploration

Let's examine some basic statistics and data characteristics.

In [None]:
# Display basic statistics
print("Basic Statistics:")
df_computed.describe()

## Interactive Scatter Plot

Create an interactive scatter plot of the Galactic coordinates `xg` and `yg`, with points colored by `bprp0` (BP-RP color).

In [None]:
# Create an interactive scatter plot
scatter_plot = df_computed.hvplot.scatter(
    x='xg', 
    y='yg',
    c='bprp0',
    size=10,
    alpha=0.6,
    hover_cols=['xg', 'yg', 'bprp0', 'mg0'],
    width=600,
    height=400,
    title='Galactic Coordinates with BP-RP Color'
)

scatter_plot

## Interactive Hexbin Plot

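Create an interactive hexbin plot of `xg` versus `yg`. Because `C='bprp0'` and `reduce_function=np.mean` are set, each hexagon is colored by the mean BP-RP color of the points it contains, not by the point count.

In [None]:
# Create an interactive hexbin plot (mean bprp0 per hexagon)
hexbin_plot = df_computed.hvplot.hexbin(
    x='xg', 
    y='yg',
    C='bprp0',
    reduce_function=np.mean,
    gridsize=30,
    width=600,
    height=400,
    title='Mean BP-RP Color in Galactic Coordinates (Hexbin)'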
)

hexbin_plot
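
With `C` and `reduce_function=np.mean` set, each bin reports the mean of the `C` column over the points that fall in it. The sketch below reproduces that reduction with NumPy on toy values. It uses square bins instead of hexagons, so it is an analogy and not hvplot's implementation:

```python
import numpy as np

# Toy points: two in the lower-left bin, two in the upper-right bin
x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0.1, 0.2, 0.8, 0.9])
c = np.array([1.0, 3.0, 10.0, 20.0])

edges = [0.0, 0.5, 1.0]
counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
sums, _, _ = np.histogram2d(x, y, bins=[edges, edges], weights=c)

# Mean of c per bin; empty bins become NaN
with np.errstate(invalid="ignore", divide="ignore"):
    mean_c = sums / counts

print(counts)
print(mean_c)
```

The `counts` array is what a plain density plot would show, while `mean_c` corresponds to the colored quantity when `C` is supplied.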

## Interactive Histogram

Create an interactive histogram of the `bprp0` distribution using 50 bins.

In [None]:
# Create an interactive histogram
histogram_plot = df_computed.hvplot.hist(
    'bprp0',
    bins=50,
    alpha=0.7,
    width=600,
    height=400,
    title='Distribution of BP-RP Color'
)

histogram_plot

## Interactive Filtering with Widgets

Demonstrate how to use widgets for dynamic filtering of data.

In [None]:
# Create interactive plots with widget-based filtering
import panel as pn
pn.extension()

# Create widgets for filtering
x_range = pn.widgets.RangeSlider(name='X Range', start=-20, end=20, value=(-10, 10))
y_range = pn.widgets.RangeSlider(name='Y Range', start=-15, end=15, value=(-10, 10))
color_range = pn.widgets.RangeSlider(name='BP-RP Range', start=-1, end=5, value=(0, 2))

# Define a function to update the plot based on filters
@pn.depends(x_range.param.value, y_range.param.value, color_range.param.value)
def filtered_scatter(x_vals, y_vals, color_vals):
    # Filter data based on widget values
    filtered_df = df_computed[
        (df_computed['xg'] >= x_vals[0]) & (df_computed['xg'] <= x_vals[1]) &
        (df_computed['yg'] >= y_vals[0]) & (df_computed['yg'] <= y_vals[1]) &
        (df_computed['bprp0'] >= color_vals[0]) & (df_computed['bprp0'] <= color_vals[1])
    ]
    
    return filtered_df.hvplot.scatter(
        x='xg', 
        y='yg',
        c='bprp0',
        size=10,
        alpha=0.6,
        width=600,
        height=400,
        title=f'Filtered Galactic Coordinates (Points: {len(filtered_df)})'
    )

# Display the widgets and plot
pn.Column(
    pn.Row(x_range, y_range),
    color_range,
    filtered_scatter
)
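
The filtering step inside `filtered_scatter` can also be written as a small reusable function. This is a sketch on toy data; `filter_ranges` is a hypothetical helper name, and `Series.between` is inclusive on both ends by default, which matches the `>=` / `<=` comparisons above:

```python
import pandas as pd

def filter_ranges(df: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Keep rows where every listed column lies within its (low, high) range."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in ranges.items():
        mask &= df[column].between(low, high)  # inclusive bounds
    return df[mask]

# Toy data: only the second row satisfies all three ranges
toy = pd.DataFrame({
    "xg": [-12.0, -5.0, 0.0, 5.0],
    "yg": [0.0, 0.0, 20.0, 1.0],
    "bprp0": [1.0, 1.0, 1.0, 3.0],
})
kept = filter_ranges(toy, {"xg": (-10, 10), "yg": (-10, 10), "bprp0": (0, 2)})
print(kept)
```

Inside the Panel callback, the slider tuples could be passed directly as the range values.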

## Advanced Interactive Visualization

Combine multiple plots into a comprehensive dashboard.

In [None]:
# Create a combined dashboard
dashboard = pn.Column(
    pn.Row(scatter_plot, hexbin_plot),
    pn.Row(histogram_plot),
    pn.Row(x_range, y_range, color_range),
    filtered_scatter
)

dashboard

## Conclusion

This notebook demonstrates:

1. **Data Loading**: Efficiently reading large datasets from S3 using Dask
2. **Interactive Plots**: Creating responsive visualizations with hvplot
3. **Dynamic Filtering**: Using widgets to interactively explore data
4. **Scalability**: Keeping plots responsive by visualizing an in-memory sample of a large dataset

Key advantages of this approach:
- Uses Dask for lazy evaluation and efficient memory usage
- hvplot provides rich, interactive visualizations
- Panel enables creation of dashboards with interactive widgets
- The plots use a 5% in-memory sample, while the lazy Dask DataFrame `df` remains available for computations over the full S3 dataset

The interactive plots support panning, zooming, hover inspection, and widget-driven filtering of the sample, which makes it easier to spot patterns and relationships before running heavier computations on the full dataset.