# spark-dist-fit API Demo

This notebook demonstrates the complete API for the `spark-dist-fit` library, including:

1. **Configuration** - FitConfig, PlotConfig, SparkConfig, and AppConfig
2. **Config Loading** - From code, strings, and files (HOCON/YAML/JSON)
3. **Distribution Fitting** - Using `DistributionFitter`, including the one-line `from_config()` loader
4. **Working with Results** - `FitResults` and `FitResult` objects
5. **Plotting** - Visualization with `PlotConfig`
6. **Complete Workflow** - An end-to-end example driven by a single nested config

## Setup

First, let's import the required modules and create a Spark session.

In [None]:
import numpy as np
from pyspark.sql import SparkSession

# Import all public API components
from spark_dist_fit import (
    DistributionFitter,
    FitConfig,
    PlotConfig,
    SparkConfig,
    AppConfig,
    DEFAULT_EXCLUDED_DISTRIBUTIONS,
)

# Create Spark session
spark = (
    SparkSession.builder
    .appName("API-Demo")
    .config("spark.sql.shuffle.partitions", "10")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

print(f"Spark version: {spark.version}")

## Generate Sample Data

We'll draw 50,000 samples each from a normal, an exponential, and a gamma distribution, so the fitted results can be compared against known parameters.

In [None]:
np.random.seed(42)

# Normal distribution data
normal_data = np.random.normal(loc=50, scale=10, size=50_000)
df_normal = spark.createDataFrame([(float(x),) for x in normal_data], ["value"])

# Exponential distribution data (non-negative)
exp_data = np.random.exponential(scale=5, size=50_000)
df_exp = spark.createDataFrame([(float(x),) for x in exp_data], ["value"])

# Gamma distribution data
gamma_data = np.random.gamma(shape=2.0, scale=2.0, size=50_000)
df_gamma = spark.createDataFrame([(float(x),) for x in gamma_data], ["value"])

print(f"Normal data: {df_normal.count():,} rows, mean={normal_data.mean():.2f}, std={normal_data.std():.2f}")
print(f"Exponential data: {df_exp.count():,} rows, mean={exp_data.mean():.2f}")
print(f"Gamma data: {df_gamma.count():,} rows, mean={gamma_data.mean():.2f}")
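
As a sanity check on the printed means, the theoretical values follow directly from the parameters used above: the normal mean is `loc`, the exponential mean is `scale`, and the gamma mean is `shape * scale`. The sketch below reproduces that check with the standard library's `random` module instead of NumPy (a substitution made only to keep the example dependency-free), so the exact sample values differ from the notebook's.

```python
import random
import statistics

# Theoretical means for the parameters used in the cell above.
expected_means = {
    "normal": 50.0,        # loc
    "exponential": 5.0,    # scale
    "gamma": 2.0 * 2.0,    # shape * scale
}

rng = random.Random(42)
n = 50_000
samples = {
    "normal": [rng.gauss(50.0, 10.0) for _ in range(n)],
    # expovariate takes the rate (1 / scale), not the scale itself.
    "exponential": [rng.expovariate(1.0 / 5.0) for _ in range(n)],
    # gammavariate(alpha, beta): alpha is the shape, beta is the scale.
    "gamma": [rng.gammavariate(2.0, 2.0) for _ in range(n)],
}

for name, values in samples.items():
    mean = statistics.fmean(values)
    print(f"{name:12s} sample mean={mean:.2f}  expected={expected_means[name]:.2f}")
```

With 50,000 draws per distribution the sample means should land close to the expected values, which is what makes these datasets useful ground truth for the fits later on.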

---

# Part 1: Configuration

spark-dist-fit uses frozen dataclasses for type-safe, immutable configuration.

## 1.1 FitConfig - Distribution Fitting Configuration

`FitConfig` controls histogram computation, sampling, and distribution selection.

In [None]:
# Default configuration
default_fit_config = FitConfig()
print("Default FitConfig:")
print(f"  bins: {default_fit_config.bins}")
print(f"  use_rice_rule: {default_fit_config.use_rice_rule}")
print(f"  support_at_zero: {default_fit_config.support_at_zero}")
print(f"  enable_sampling: {default_fit_config.enable_sampling}")
print(f"  sample_threshold: {default_fit_config.sample_threshold:,}")
print(f"  max_sample_size: {default_fit_config.max_sample_size:,}")
print(f"  max_sample_fraction: {default_fit_config.max_sample_fraction}")
print(f"  random_seed: {default_fit_config.random_seed}")
print(f"  excluded_distributions: {len(default_fit_config.excluded_distributions)} distributions")
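
The `use_rice_rule` flag presumably refers to the Rice rule for choosing a histogram bin count, which sets the number of bins to `ceil(2 * n^(1/3))` for `n` observations. The helper below is a minimal sketch of that textbook formula; `rice_rule_bins` is a hypothetical name, and the library's own calculation may differ in rounding or clamping.

```python
import math


def rice_rule_bins(n: int) -> int:
    """Textbook Rice rule: ceil(2 * cube_root(n)) bins for n observations."""
    if n < 1:
        raise ValueError("n must be a positive row count")
    return math.ceil(2 * n ** (1 / 3))


for rows in (1_000, 50_000, 1_000_000):
    print(f"{rows:>9,} rows -> {rice_rule_bins(rows)} bins")
```

Because the bin count grows with the cube root of the row count, even large datasets end up with a modest number of bins, which is why an explicit `bins` value is mainly useful when finer resolution is needed.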

In [None]:
# Custom configuration
custom_fit_config = FitConfig(
    bins=100,                      # More bins for better resolution
    use_rice_rule=False,           # Don't auto-calculate bins
    support_at_zero=True,          # Only fit non-negative distributions
    enable_sampling=True,          # Enable adaptive sampling
    sample_fraction=0.3,           # Sample 30% of data
    max_sample_size=500_000,       # Cap samples at 500K
    random_seed=123,               # Custom seed for reproducibility
)

print("Custom FitConfig (for non-negative data):")
print(f"  bins: {custom_fit_config.bins}")
print(f"  support_at_zero: {custom_fit_config.support_at_zero}")
print(f"  sample_fraction: {custom_fit_config.sample_fraction}")
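
The sampling fields (`enable_sampling`, `sample_threshold`, `sample_fraction`, `max_sample_size`, `max_sample_fraction`) suggest a size-dependent sampling policy. The sketch below shows one plausible way such fields could combine. It is an illustration only: the function name, the default values, and the precedence rules are all assumptions, not the library's documented behavior.

```python
def effective_sample_fraction(
    row_count,
    *,
    enable_sampling=True,
    sample_threshold=10_000_000,  # hypothetical default
    sample_fraction=None,
    max_sample_size=1_000_000,    # hypothetical default
    max_sample_fraction=0.35,     # hypothetical default
):
    """Return the fraction of rows to keep; 1.0 means no sampling."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    # Small datasets (or sampling disabled) are used in full.
    if not enable_sampling or row_count <= sample_threshold:
        return 1.0
    # Assumed precedence: an explicit fraction wins over the maximum fraction.
    fraction = sample_fraction if sample_fraction is not None else max_sample_fraction
    # Never keep more rows than max_sample_size allows.
    cap = max_sample_size / row_count
    return min(fraction, cap, 1.0)


print(effective_sample_fraction(50_000))
print(effective_sample_fraction(20_000_000, sample_fraction=0.3, max_sample_size=500_000))
```

Under these assumptions the 50,000-row demo data would be fitted in full, while a 20-million-row table would be cut down to the size cap rather than to 30%.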

In [None]:
# Customizing excluded distributions
print(f"\nDefault excluded distributions ({len(DEFAULT_EXCLUDED_DISTRIBUTIONS)}):")
for dist in sorted(DEFAULT_EXCLUDED_DISTRIBUTIONS):
    print(f"  - {dist}")

# Include a specific distribution that's excluded by default
custom_exclusions = tuple(d for d in DEFAULT_EXCLUDED_DISTRIBUTIONS if d != "wald")
config_with_wald = FitConfig(excluded_distributions=custom_exclusions)
print("\n'wald' removed from exclusions; it will be included in the next fit")

## 1.2 PlotConfig - Visualization Configuration

`PlotConfig` controls matplotlib figure settings.

In [None]:
# Default PlotConfig
default_plot_config = PlotConfig()
print("Default PlotConfig:")
print(f"  figsize: {default_plot_config.figsize}")
print(f"  dpi: {default_plot_config.dpi}")
print(f"  show_histogram: {default_plot_config.show_histogram}")
print(f"  histogram_alpha: {default_plot_config.histogram_alpha}")
print(f"  pdf_linewidth: {default_plot_config.pdf_linewidth}")
print(f"  save_format: {default_plot_config.save_format}")

In [None]:
# Custom PlotConfig for presentations
presentation_plot_config = PlotConfig(
    figsize=(16, 10),
    dpi=150,  # Lower DPI for notebook display
    histogram_alpha=0.6,
    pdf_linewidth=3,
    title_fontsize=18,
    label_fontsize=14,
    legend_fontsize=12,
    grid_alpha=0.4,
)

print("Presentation PlotConfig:")
print(f"  figsize: {presentation_plot_config.figsize}")
print(f"  title_fontsize: {presentation_plot_config.title_fontsize}")

## 1.3 SparkConfig - Spark Session Configuration

`SparkConfig` holds Spark session settings such as the application name and the Arrow and adaptive-execution flags.

In [None]:
# Default SparkConfig
default_spark_config = SparkConfig()
print("Default SparkConfig:")
print(f"  app_name: {default_spark_config.app_name}")
print(f"  arrow_enabled: {default_spark_config.arrow_enabled}")
print(f"  adaptive_enabled: {default_spark_config.adaptive_enabled}")
print(f"  adaptive_coalesce_enabled: {default_spark_config.adaptive_coalesce_enabled}")

# Convert to Spark config dict
spark_settings = default_spark_config.to_spark_config()
print("\nSpark settings dict:")
for k, v in spark_settings.items():
    print(f"  {k}: {v}")
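
To make the idea behind `to_spark_config()` concrete, here is a minimal sketch of how boolean fields might map onto Spark property names. `DemoSparkConfig` is a hypothetical stand-in, not the library class; the property keys are real Spark settings, but which keys the library actually emits is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoSparkConfig:
    """Hypothetical stand-in for SparkConfig, for illustration only."""

    app_name: str = "demo-app"
    arrow_enabled: bool = True
    adaptive_enabled: bool = True
    adaptive_coalesce_enabled: bool = True

    def to_spark_config(self) -> dict:
        # Spark properties are strings, so booleans become "true"/"false".
        def flag(value: bool) -> str:
            return "true" if value else "false"

        return {
            "spark.app.name": self.app_name,
            "spark.sql.execution.arrow.pyspark.enabled": flag(self.arrow_enabled),
            "spark.sql.adaptive.enabled": flag(self.adaptive_enabled),
            "spark.sql.adaptive.coalescePartitions.enabled": flag(
                self.adaptive_coalesce_enabled
            ),
        }


settings = DemoSparkConfig(arrow_enabled=False).to_spark_config()
for key, value in settings.items():
    print(f"{key} = {value}")
```

A dict in this shape can be applied to a session builder by calling `builder.config(key, value)` once per entry, the same call used in the Setup cell.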

## 1.4 Config Immutability

All config classes are frozen dataclasses - they cannot be modified after creation.

In [None]:
# Attempt to modify config (raises FrozenInstanceError, a subclass of AttributeError)
try:
    default_fit_config.bins = 200
except AttributeError as e:
    print(f"Cannot modify frozen config: {e}")

# To change config values, create a new instance
new_config = FitConfig(bins=200, support_at_zero=True)
print(f"\nCreated new config with bins={new_config.bins}")
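
If the config classes are ordinary frozen dataclasses, as stated above, the standard library's `dataclasses.replace` is a convenient way to derive a modified copy without repeating every field. The sketch below demonstrates this on `DemoFitConfig`, a hypothetical stand-in class; whether `replace` works unchanged on the library's own classes is an assumption worth verifying.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class DemoFitConfig:
    """Hypothetical stand-in for FitConfig, for illustration only."""

    bins: int = 50
    support_at_zero: bool = False


base = DemoFitConfig()

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.bins = 200
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new instance, copying every field not overridden.
updated = replace(base, bins=200)

print(f"mutated in place: {mutated}")
print(f"base:    {base}")
print(f"updated: {updated}")
```

The original instance is left untouched, so a base config can be shared safely while each experiment derives its own variant.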

---

# Part 2: Config Loading

Configs can be loaded from HOCON, YAML, or JSON files/strings.

## 2.1 Config from String (HOCON format)

In [None]:
# Load FitConfig from HOCON string (flat structure - no nesting)
hocon_string = """
bins = 75
use_rice_rule = false
support_at_zero = false
enable_sampling = true
sample_fraction = 0.25
random_seed = 99
"""

config_from_string = FitConfig.from_string(hocon_string)
print("FitConfig from HOCON string:")
print(f"  bins: {config_from_string.bins}")
print(f"  sample_fraction: {config_from_string.sample_fraction}")
print(f"  random_seed: {config_from_string.random_seed}")

In [None]:
# Load PlotConfig from HOCON string
plot_hocon = """
figsize = [14, 8]
dpi = 100
histogram_alpha = 0.7
pdf_linewidth = 2
title_fontsize = 16
"""

plot_config_from_string = PlotConfig.from_string(plot_hocon)
print("PlotConfig from HOCON string:")
print(f"  figsize: {plot_config_from_string.figsize}")
print(f"  dpi: {plot_config_from_string.dpi}")

## 2.2 Nested Config with AppConfig

For configs with nested sections (`spark {}`, `fit {}`, `plot {}`), use `AppConfig`.

In [None]:
# Nested HOCON config (typical production format)
nested_hocon = """
spark {
    app_name = "demo-app"
    arrow_enabled = true
    adaptive_enabled = true
}

fit {
    bins = 80
    use_rice_rule = false
    support_at_zero = false
    enable_sampling = true
    sample_fraction = 0.4
    random_seed = 42
}

plot {
    figsize = [12, 8]
    dpi = 150
    histogram_alpha = 0.5
}
"""

app_config = AppConfig.from_string(nested_hocon)
print("AppConfig from nested HOCON:")
print(f"  spark.app_name: {app_config.spark.app_name}")
print(f"  fit.bins: {app_config.fit.bins}")
print(f"  fit.sample_fraction: {app_config.fit.sample_fraction}")
print(f"  plot.figsize: {app_config.plot.figsize}")
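
The nested layout maps naturally onto nested dataclasses: each top-level section becomes one config object. The sketch below shows that idea using JSON and the standard library only. `DemoFit`, `DemoPlot`, `DemoApp`, and `from_json` are hypothetical names, and the defaults are invented for illustration; the library's parsing of HOCON and YAML is not reproduced here.

```python
import json
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class DemoFit:
    bins: int = 50
    random_seed: int = 42


@dataclass(frozen=True)
class DemoPlot:
    figsize: Tuple[float, float] = (12.0, 8.0)
    dpi: int = 300  # hypothetical default


@dataclass(frozen=True)
class DemoApp:
    fit: DemoFit = field(default_factory=DemoFit)
    plot: DemoPlot = field(default_factory=DemoPlot)

    @classmethod
    def from_json(cls, text: str) -> "DemoApp":
        raw = json.loads(text)
        plot_raw = dict(raw.get("plot", {}))
        # JSON has no tuple type, so convert the list back to a tuple.
        if "figsize" in plot_raw:
            plot_raw["figsize"] = tuple(plot_raw["figsize"])
        # Unknown keys raise TypeError, which surfaces typos early.
        return cls(fit=DemoFit(**raw.get("fit", {})), plot=DemoPlot(**plot_raw))


app = DemoApp.from_json('{"fit": {"bins": 80}, "plot": {"figsize": [12, 8]}}')
print(app.fit.bins, app.fit.random_seed, app.plot.figsize, app.plot.dpi)
```

Sections or keys missing from the input fall back to the dataclass defaults, which keeps short config files valid.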

## 2.3 Config from File

Load configuration from a file (HOCON, YAML, or JSON).

In [None]:
from pathlib import Path

# Load from repository example config
config_path = Path("../config/example.conf")

if config_path.exists():
    app_config_from_file = AppConfig.from_file(str(config_path))
    print("AppConfig from file:")
    print(f"  File: {config_path}")
    print(f"  spark.app_name: {app_config_from_file.spark.app_name}")
    print(f"  fit.bins: {app_config_from_file.fit.bins}")
    print(f"  fit.excluded_distributions: {len(app_config_from_file.fit.excluded_distributions)} distributions")
    print(f"  plot.dpi: {app_config_from_file.plot.dpi}")
else:
    print(f"Config file not found: {config_path}")
    print("Run this notebook from the examples/ directory")

---

# Part 3: Distribution Fitting

The `DistributionFitter` class is the main entry point for fitting distributions.

## 3.1 Basic Fitting

In [None]:
# Create fitter with default config
fitter = DistributionFitter()

# Fit distributions to normal data (limit to 20 for demo speed)
print("Fitting distributions to normal data...")
results_normal = fitter.fit(df_normal, column="value", max_distributions=20)

print(f"\nFitted {results_normal.count()} distributions")

## 3.2 Fitting with Custom Config

In [None]:
# Config for non-negative distributions only
nonneg_config = FitConfig(
    bins=100,
    support_at_zero=True,  # Only fit non-negative distributions
    enable_sampling=True,
)

fitter_nonneg = DistributionFitter(config=nonneg_config)

print("Fitting non-negative distributions to exponential data...")
results_exp = fitter_nonneg.fit(df_exp, column="value", max_distributions=15)

print(f"Fitted {results_exp.count()} non-negative distributions")

## 3.3 Using from_config() Convenience Method

In [None]:
# One-liner to create fitter from config file
config_path = Path("../config/example.conf")

if config_path.exists():
    # Load fitter with all configs in one line
    fitter_from_config = DistributionFitter.from_config(str(config_path))
    
    print("Fitter created from config file:")
    print(f"  fit.bins: {fitter_from_config.config.bins}")
    print(f"  plot_config available: {fitter_from_config.plot_config is not None}")
    
    # The plot_config is automatically loaded
    if fitter_from_config.plot_config:
        print(f"  plot.dpi: {fitter_from_config.plot_config.dpi}")
else:
    print("Config file not found - run from examples/ directory")

---

# Part 4: Working with Results

The `fit()` method returns a `FitResults` object that supports ranking, filtering, and conversion to pandas.

## 4.1 Getting Best Distributions

In [None]:
# Get best distribution by SSE (default)
best_sse = results_normal.best(n=1)[0]
print(f"Best by SSE: {best_sse.distribution}")
print(f"  SSE: {best_sse.sse:.6f}")
print(f"  AIC: {best_sse.aic:.2f}")
print(f"  BIC: {best_sse.bic:.2f}")
print(f"  Parameters: {[f'{p:.4f}' for p in best_sse.parameters]}")

In [None]:
# Get top 5 by different metrics
print("\nTop 5 by SSE:")
for i, r in enumerate(results_normal.best(n=5, metric="sse"), 1):
    print(f"  {i}. {r.distribution:20s} SSE={r.sse:.6f}")

print("\nTop 5 by AIC:")
for i, r in enumerate(results_normal.best(n=5, metric="aic"), 1):
    print(f"  {i}. {r.distribution:20s} AIC={r.aic:.2f}")

print("\nTop 5 by BIC:")
for i, r in enumerate(results_normal.best(n=5, metric="bic"), 1):
    print(f"  {i}. {r.distribution:20s} BIC={r.bic:.2f}")
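
For reference, the three metrics have standard definitions: SSE is the sum of squared differences between the histogram density and the fitted PDF, AIC is `2k - 2 ln L`, and BIC is `k ln n - 2 ln L`, where `k` is the number of fitted parameters, `n` the number of observations, and `L` the likelihood. The sketch below computes all three for a normal fit using only the standard library. It follows the textbook formulas; the library's exact computation (for example, which bins or sample it scores against) may differ.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def histogram_density(data, bins):
    """Return bin centers and density-normalized bin heights."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp so the maximum value falls into the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(data) * width) for c in counts]
    return centers, density


def score_normal_fit(data, bins=50):
    n = len(data)
    # Maximum-likelihood estimates for a normal distribution.
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    centers, density = histogram_density(data, bins)
    sse = sum((d - normal_pdf(c, mu, sigma)) ** 2 for c, d in zip(centers, density))
    log_lik = sum(math.log(normal_pdf(x, mu, sigma)) for x in data)
    k = 2  # fitted parameters: mu and sigma
    return {
        "sse": sse,
        "aic": 2 * k - 2 * log_lik,
        "bic": k * math.log(n) - 2 * log_lik,
    }


rng = random.Random(42)
data = [rng.gauss(50.0, 10.0) for _ in range(5_000)]
scores = score_normal_fit(data)
print({name: round(value, 6) for name, value in scores.items()})
```

Lower is better for all three. SSE measures how closely the curve tracks the histogram, while AIC and BIC reward likelihood and penalize parameter count, so the rankings can legitimately disagree.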

## 4.2 Filtering Results

In [None]:
# Filter by SSE threshold
good_fits = results_normal.filter(sse_threshold=0.01)
print(f"Distributions with SSE < 0.01: {good_fits.count()}")

for r in good_fits.best(n=10):
    print(f"  {r.distribution:20s} SSE={r.sse:.6f}")
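
Conceptually, `best()` is a sort on one metric followed by a slice, and `filter()` is a threshold test. The sketch below mirrors that behavior on plain Python objects. `DemoResult`, `best`, and `filter_by_sse` are hypothetical stand-ins, the strict `<` comparison is an assumption, and the numbers are made-up toy values chosen only to show that different metrics can pick different winners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoResult:
    """Hypothetical stand-in for FitResult, for illustration only."""

    distribution: str
    sse: float
    aic: float
    bic: float


def best(results, n=1, metric="sse"):
    if metric not in {"sse", "aic", "bic"}:
        raise ValueError(f"unknown metric: {metric}")
    # Lower is better for all three metrics.
    return sorted(results, key=lambda r: getattr(r, metric))[:n]


def filter_by_sse(results, sse_threshold):
    return [r for r in results if r.sse < sse_threshold]


# Toy values, not real fitting output.
toy = [
    DemoResult("norm", sse=0.0002, aic=372150.0, bic=372168.0),
    DemoResult("gamma", sse=0.0004, aic=372140.0, bic=372166.0),
    DemoResult("expon", sse=0.0310, aic=410000.0, bic=410018.0),
]

print("best by SSE:", best(toy, metric="sse")[0].distribution)
print("best by AIC:", best(toy, metric="aic")[0].distribution)
print("SSE < 0.01: ", [r.distribution for r in filter_by_sse(toy, 0.01)])
```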

## 4.3 Converting to Pandas

In [None]:
# Convert to pandas DataFrame for further analysis
df_results = results_normal.to_pandas()
print("Results as pandas DataFrame:")
df_results.head(10)

## 4.4 Using Fitted Distributions

In [None]:
# The FitResult object wraps the scipy.stats distribution
best = results_normal.best(n=1)[0]

# Generate samples from the fitted distribution
samples = best.sample(size=10000, random_state=42)
print(f"Generated {len(samples)} samples from fitted {best.distribution}")
print(f"  Sample mean: {samples.mean():.2f} (original: {normal_data.mean():.2f})")
print(f"  Sample std: {samples.std():.2f} (original: {normal_data.std():.2f})")

In [None]:
# Evaluate the PDF and CDF at specific points
x = np.array([30, 40, 50, 60, 70])
pdf_values = best.pdf(x)
cdf_values = best.cdf(x)

print("\nPDF and CDF values:")
for xi, pdf, cdf in zip(x, pdf_values, cdf_values):
    print(f"  x={xi}: PDF={pdf:.6f}, CDF={cdf:.4f}")
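
To build intuition for what `sample()`, `pdf()`, and `cdf()` return, the same three operations can be tried on a known normal distribution. The sketch below uses the standard library's `statistics.NormalDist` as a dependency-free stand-in for a fitted result; the parameters (mean 50, standard deviation 10) match the data-generating values, not an actual fit.

```python
from statistics import NormalDist, fmean

# Stand-in for a fitted normal distribution with known parameters.
fitted = NormalDist(mu=50.0, sigma=10.0)

# Draw reproducible samples, analogous to best.sample(size=..., random_state=...).
draws = fitted.samples(10_000, seed=42)
print(f"Sample mean: {fmean(draws):.2f}")

# The PDF gives the density at x; the CDF gives P(X <= x).
for x in (30, 40, 50, 60, 70):
    print(f"  x={x}: PDF={fitted.pdf(x):.6f}, CDF={fitted.cdf(x):.4f}")
```

The PDF is symmetric around the mean and the CDF passes through 0.5 there, so values from a good normal fit to this data should show the same pattern.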

---

# Part 5: Plotting

Visualize the fitted distribution with the data histogram.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## 5.1 Basic Plot

In [None]:
# Basic plot with default config
fig, ax = fitter.plot(
    best,
    df_normal,
    "value",
    title="Best Fit Distribution (Normal Data)",
    xlabel="Value",
    ylabel="Density"
)
plt.show()

## 5.2 Plot with Custom PlotConfig

In [None]:
# Custom plot configuration
custom_plot_config = PlotConfig(
    figsize=(14, 8),
    dpi=100,
    histogram_alpha=0.7,
    pdf_linewidth=3,
    title_fontsize=18,
    label_fontsize=14,
    legend_fontsize=12,
    grid_alpha=0.4,
)

fig, ax = fitter.plot(
    best,
    df_normal,
    "value",
    config=custom_plot_config,
    title="Distribution Fit with Custom Styling",
    xlabel="Value",
    ylabel="Density"
)
plt.show()

## 5.3 Plot Non-Negative Distribution

In [None]:
# Best fit for exponential data
best_exp = results_exp.best(n=1)[0]
print(f"Best fit for exponential data: {best_exp.distribution}")

fig, ax = fitter_nonneg.plot(
    best_exp,
    df_exp,
    "value",
    config=custom_plot_config,
    title=f"Best Fit: {best_exp.distribution.capitalize()}",
    xlabel="Value",
    ylabel="Density"
)
plt.show()

---

# Part 6: Complete Workflow Example

Putting it all together - a complete production-style workflow.

In [None]:
# Define config as HOCON string (could also be loaded from file)
production_config = """
spark {
    app_name = "production-fitting"
    arrow_enabled = true
    adaptive_enabled = true
}

fit {
    bins = 100
    use_rice_rule = false
    support_at_zero = false
    enable_sampling = true
    max_sample_size = 1000000
    random_seed = 42
}

plot {
    figsize = [14, 9]
    dpi = 150
    histogram_alpha = 0.6
    pdf_linewidth = 3
    title_fontsize = 16
}
"""

# Load config
config = AppConfig.from_string(production_config)

# Create fitter with config
fitter = DistributionFitter(config=config.fit, spark_config=config.spark)

# Fit distributions
print("Fitting gamma distribution data...")
results = fitter.fit(df_gamma, column="value", max_distributions=25)

# Get best result
best = results.best(n=1)[0]
print(f"\nBest distribution: {best.distribution}")
print(f"SSE: {best.sse:.6f}")
print(f"Parameters: {[f'{p:.4f}' for p in best.parameters]}")

# Plot with config
fig, ax = fitter.plot(
    best,
    df_gamma,
    "value",
    config=config.plot,
    title=f"Gamma Data - Best Fit: {best.distribution.capitalize()}",
    xlabel="Value",
    ylabel="Density"
)
plt.show()

# Show top 5 results, sorted explicitly by SSE (lower is better)
print("\nTop 5 distributions:")
df_top5 = results.to_pandas().sort_values("sse").head(5)
df_top5[["distribution", "sse", "aic", "bic"]]

---

# Cleanup

In [None]:
spark.stop()
print("Spark session stopped.")

---

# Summary

This notebook demonstrated:

1. **Configuration Classes**:
   - `FitConfig` - Distribution fitting parameters
   - `PlotConfig` - Visualization settings
   - `SparkConfig` - Spark session configuration
   - `AppConfig` - Container for nested configs

2. **Config Loading**:
   - `FitConfig()` - Direct instantiation
   - `FitConfig.from_string()` - From HOCON/YAML/JSON string
   - `AppConfig.from_file()` - From nested config file
   - `DistributionFitter.from_config()` - One-liner convenience method

3. **Fitting**:
   - `DistributionFitter.fit()` - Fit distributions to data
   - `max_distributions` parameter to limit fitting scope
   - `support_at_zero` for non-negative data

4. **Results**:
   - `results.best(n, metric)` - Get top N by SSE/AIC/BIC
   - `results.filter()` - Filter by threshold
   - `results.to_pandas()` - Convert to pandas DataFrame
   - `FitResult.sample()`, `.pdf()`, `.cdf()` - Use fitted distribution

5. **Plotting**:
   - `fitter.plot()` - Visualize fitted distribution with data histogram