# PySuricata — Pandas Example

This notebook demonstrates how to use PySuricata with pandas DataFrames.

## 1. Installation

```bash
pip install pysuricata
```

## 2. Basic Report

Generate a full HTML report from any pandas DataFrame:

In [1]:
import pandas as pd
from pysuricata import profile

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print(f"Loaded {len(df)} rows × {len(df.columns)} columns")
df.head()

Loaded 891 rows × 12 columns


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [2]:
# Generate and save the report
report = profile(df)
report.save_html("titanic_report.html")
print("Report saved to titanic_report.html")

Report saved to titanic_report.html


## 3. Statistics Only (No HTML)

Use `summarize()` when you only need the computed statistics and not the HTML report, for example in CI/CD pipelines:

In [3]:
from pysuricata import summarize

stats = summarize(df)

# Dataset-level metrics
print(f"Rows: {stats['dataset']['rows_est']}")
print(f"Columns: {stats['dataset']['cols']}")
print(f"Missing cells: {stats['dataset']['missing_cells_pct']:.1f}%")
print(f"Duplicate rows (est): {stats['dataset']['duplicate_rows_pct_est']:.1f}%")

Rows: 891
Columns: 12
Missing cells: 8.1%
Duplicate rows (est): 0.0%


In [4]:
# Column-level metrics
for col_name, col_stats in stats["columns"].items():
    col_type = col_stats.get("type", "unknown")
    missing = col_stats.get("missing_pct", 0)
    print(f"  {col_name:20s}  type={col_type:12s}  missing={missing:.1f}%")

  PassengerId           type=numeric       missing=0.0%
  Survived              type=categorical   missing=0.0%
  Pclass                type=categorical   missing=0.0%
  Name                  type=categorical   missing=0.0%
  Sex                   type=categorical   missing=0.0%
  Age                   type=numeric       missing=0.0%
  SibSp                 type=categorical   missing=0.0%
  Parch                 type=categorical   missing=0.0%
  Ticket                type=categorical   missing=0.0%
  Fare                  type=numeric       missing=0.0%
  Cabin                 type=categorical   missing=0.0%
  Embarked              type=categorical   missing=0.0%
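
The missing-value percentages above can be cross-checked against pandas directly. The following is a minimal sketch that uses only pandas; the helper name `missing_summary` and the small demo frame are made up for illustration and are not part of PySuricata.

```python
import pandas as pd


def missing_summary(frame):
    """Return (overall missing-cell %, per-column missing % as a Series)."""
    if frame.size == 0:
        return 0.0, pd.Series(dtype=float)
    overall_pct = float(frame.isna().sum().sum()) / frame.size * 100
    per_column_pct = frame.isna().mean() * 100
    return overall_pct, per_column_pct


# Small made-up frame so the sketch runs offline; pass the Titanic `df` instead.
demo = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

overall, per_column = missing_summary(demo)
print(f"Missing cells: {overall:.1f}%")
for name, pct in per_column.items():
    print(f"  {name:20s}  missing={pct:.1f}%")
```

If the per-column figures from pandas disagree with the ones read from `stats["columns"]`, check which keys each column entry actually contains, since `.get(..., 0)` silently falls back to zero when a key is absent.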


## 4. Custom Configuration

Tune the chunk size, random seed, and correlation settings, and set a custom report title:

In [5]:
from pysuricata import profile, ReportConfig

config = ReportConfig()
config.compute.chunk_size = 500          # rows per chunk
config.compute.random_seed = 42          # reproducible results
config.compute.compute_correlations = True
config.compute.corr_threshold = 0.5      # only show strong correlations
config.render.title = "Titanic Analysis"

report = profile(df, config=config)
report.save_html("titanic_custom.html")
print("Custom report saved")

Custom report saved


## 5. Streaming Large Datasets

Process datasets larger than RAM by passing a chunk generator:

In [6]:
def read_in_chunks(path, chunk_size=10_000):
    """Yield chunks from a large CSV without loading it all into memory."""
    yield from pd.read_csv(path, chunksize=chunk_size)

# Example usage (commented out; "large_dataset.csv" is a placeholder path)
# report = profile(read_in_chunks("large_dataset.csv"))
# report.save_html("large_report.html")
print("Streaming is supported via generators — see docs for details")

Streaming is supported via generators — see docs for details
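
Before pointing a chunk generator at a real file, it can help to confirm that it yields chunks of the expected size. The sketch below does that with pandas alone, reading a made-up in-memory CSV through `io.StringIO` in place of a large file; it does not call PySuricata.

```python
import io

import pandas as pd


def read_in_chunks(source, chunk_size=10_000):
    """Yield DataFrame chunks from a CSV path or file-like object."""
    yield from pd.read_csv(source, chunksize=chunk_size)


# Made-up 25-row CSV standing in for a file too large to load at once.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

chunk_lengths = []
total_rows = 0
for chunk in read_in_chunks(io.StringIO(csv_text), chunk_size=10):
    chunk_lengths.append(len(chunk))
    total_rows += len(chunk)

print(f"chunk sizes: {chunk_lengths}, total rows: {total_rows}")
```

Only one chunk is held in memory at a time, so peak memory is governed by `chunk_size` rather than by the size of the file.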


## 6. Quality Gate Example

Use `summarize()` in CI to enforce data quality thresholds:

In [7]:
stats = summarize(df)

# Define quality gates
assert stats["dataset"]["missing_cells_pct"] < 20.0, "Too many missing values!"
assert stats["dataset"]["duplicate_rows_pct_est"] < 5.0, "Too many duplicates!"

print("All quality gates passed ✓")

All quality gates passed ✓
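
Python strips `assert` statements when it runs with the `-O` flag, so a CI script may prefer explicit checks and a non-zero exit code. The following is a minimal sketch of that idea; `check_quality_gates` is a hypothetical helper, and the hand-written `dataset_stats` dictionary stands in for `stats["dataset"]`, reusing the values printed earlier.

```python
import sys


def check_quality_gates(dataset_stats, max_pct):
    """Return a list of failure messages; an empty list means every gate passed."""
    failures = []
    for key, limit in max_pct.items():
        value = dataset_stats.get(key)
        if value is None:
            failures.append(f"{key}: metric missing from stats")
        elif value >= limit:
            failures.append(f"{key}: {value:.1f}% is not below {limit:.1f}%")
    return failures


# Stand-in for stats["dataset"]; in a real run, pass that dictionary instead.
dataset_stats = {"missing_cells_pct": 8.1, "duplicate_rows_pct_est": 0.0}
gates = {"missing_cells_pct": 20.0, "duplicate_rows_pct_est": 5.0}

failures = check_quality_gates(dataset_stats, gates)
for message in failures:
    print(f"FAILED {message}", file=sys.stderr)

exit_code = 1 if failures else 0
print(f"exit code: {exit_code}")
# In a CI script, finish with: sys.exit(exit_code)
```

Collecting every failure before exiting reports all violated thresholds in a single run instead of stopping at the first one.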
