# PySuricata — Pandas Example

This notebook demonstrates how to use PySuricata with pandas DataFrames.

## 1. Installation

```bash
pip install pysuricata
```

## 2. Basic Report

Generate a full HTML report from any pandas DataFrame:

In [None]:
import pandas as pd
from pysuricata import profile

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print(f"Loaded {len(df)} rows × {len(df.columns)} columns")
df.head()

In [None]:
# Generate and save the report
report = profile(df)
report.save_html("titanic_report.html")
print("Report saved to titanic_report.html")

## 3. Statistics Only (No HTML)

Use `summarize()` when you only need the numbers and no HTML, for example in CI/CD pipelines:

In [None]:
from pysuricata import summarize

stats = summarize(df)

# Dataset-level metrics
print(f"Rows: {stats['dataset']['row_count']}")
print(f"Columns: {stats['dataset']['column_count']}")
print(f"Missing cells: {stats['dataset']['missing_cells_pct']:.1f}%")
print(f"Duplicate rows (est): {stats['dataset']['duplicate_rows_pct_est']:.1f}%")

In [None]:
# Column-level metrics
for col_name, col_stats in stats["columns"].items():
    col_type = col_stats.get("type", "unknown")
    missing = col_stats.get("missing_pct", 0)
    print(f"  {col_name:20s}  type={col_type:12s}  missing={missing:.1f}%")
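
The per-column dictionary can also be filtered rather than just printed. The sketch below assumes only the shape used in the loop above (`stats["columns"]` maps column names to dicts with a `missing_pct` key). The helper name `columns_over_missing_threshold`, the 10% default, and the sample values are all hypothetical, chosen for illustration.

```python
def columns_over_missing_threshold(stats, max_missing_pct=10.0):
    """Return (name, missing_pct) pairs above the threshold, worst first."""
    flagged = [
        (name, col.get("missing_pct", 0.0))
        for name, col in stats["columns"].items()
        if col.get("missing_pct", 0.0) > max_missing_pct
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hand-written sample in the shape used above; not real PySuricata output.
sample_stats = {
    "columns": {
        "col_a": {"type": "numeric", "missing_pct": 20.0},
        "col_b": {"type": "categorical", "missing_pct": 77.0},
        "col_c": {"type": "numeric", "missing_pct": 0.0},
    }
}

flagged = columns_over_missing_threshold(sample_stats)
for name, pct in flagged:
    print(f"{name}: {pct:.1f}% missing")
```

In the notebook, passing the `stats` object returned by `summarize(df)` in place of `sample_stats` should work as long as the keys match those shown above.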

## 4. Custom Configuration

Tune the chunk size, random seed, correlation settings, and report title:

In [None]:
from pysuricata import profile, ReportConfig

config = ReportConfig()
config.compute.chunk_size = 500          # rows per chunk
config.compute.random_seed = 42          # reproducible results
config.compute.compute_correlations = True
config.compute.corr_threshold = 0.5      # only show strong correlations
config.render.title = "Titanic Analysis"

report = profile(df, config=config)
report.save_html("titanic_custom.html")
print("Custom report saved")

## 5. Streaming Large Datasets

Process datasets larger than RAM by passing a chunk generator:

In [None]:
def read_in_chunks(path, chunk_size=10_000):
    """Yield DataFrame chunks from a large CSV without loading it all into memory."""
    # With chunksize set, read_csv returns an iterator of DataFrames
    yield from pd.read_csv(path, chunksize=chunk_size)

# Example usage ("large_dataset.csv" is a placeholder path)
# report = profile(read_in_chunks("large_dataset.csv"))
# report.save_html("large_report.html")
print("Streaming is supported via generators — see docs for details")
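
To check what such a generator yields before handing it to `profile()`, it may help to run it on a tiny in-memory CSV. This sketch uses only pandas and the standard library. The CSV content and the chunk size of 4 are made up for illustration, and `io.StringIO` stands in for a file path.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk (illustrative data).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))


def read_in_chunks(source, chunk_size=4):
    """Yield DataFrame chunks of at most chunk_size rows."""
    yield from pd.read_csv(source, chunksize=chunk_size)


chunk_lengths = [len(chunk) for chunk in read_in_chunks(io.StringIO(csv_text))]
print(chunk_lengths)  # 10 rows in chunks of 4 -> [4, 4, 2]
```

A generator can be consumed only once, so create a fresh one for each pass over the data.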

## 6. Quality Gate Example

Use `summarize()` in CI to enforce data quality thresholds:

In [None]:
stats = summarize(df)

# Define quality gates
assert stats["dataset"]["missing_cells_pct"] < 20.0, "Too many missing values!"
assert stats["dataset"]["duplicate_rows_pct_est"] < 5.0, "Too many duplicates!"

print("All quality gates passed ✓")
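
Python skips `assert` statements when run with the `-O` flag, and a failed assert stops at the first problem. A CI script may therefore prefer explicit checks that report every failure and return an exit code. The following is a minimal sketch under that assumption: the helper names `failed_gates` and `run_gates` are hypothetical, the thresholds mirror the ones above, and the sample dictionary is hand-written rather than real `summarize()` output.

```python
import sys

# Thresholds keyed by the dataset-level metric names used above.
THRESHOLDS = {
    "missing_cells_pct": 20.0,
    "duplicate_rows_pct_est": 5.0,
}


def failed_gates(stats, thresholds=THRESHOLDS):
    """Return one message per dataset metric that meets or exceeds its limit."""
    dataset = stats["dataset"]
    return [
        f"{metric} = {dataset[metric]:.1f}% (limit {limit:.1f}%)"
        for metric, limit in thresholds.items()
        if dataset[metric] >= limit
    ]


def run_gates(stats):
    """Print failures to stderr and return an exit code (0 = pass, 1 = fail)."""
    failures = failed_gates(stats)
    for message in failures:
        print(f"FAILED: {message}", file=sys.stderr)
    return 1 if failures else 0


# Hand-written sample; in CI this would be the result of summarize(df).
sample = {"dataset": {"missing_cells_pct": 25.0, "duplicate_rows_pct_est": 1.0}}
exit_code = run_gates(sample)
# In a CI script, finish with: sys.exit(exit_code)
```

Returning the exit code instead of calling `sys.exit()` inside the function keeps the check easy to reuse in a notebook or a test.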