# Start Here: Prerequisites

> **No sample data required!** This framework works directly with your own CSV, Parquet, or Delta files. The datasets below are internal examples for learning - skip to **01_data_discovery.ipynb** if you have your own data.

**Purpose:** Set up your environment and optionally download sample datasets for learning.

**What you'll do:**
- Verify your Python environment
- (Optional) Set up Kaggle API credentials
- (Optional) Download sample churn datasets

---

## 0.1 Verify Environment

First, let's make sure the customer_retention package is installed.

In [1]:
try:
    import customer_retention
    from customer_retention.core.config.experiments import FINDINGS_DIR, EXPERIMENTS_DIR, OUTPUT_DIR, setup_experiments_structure
    # Kept inside the try block so a missing package prints the install hint instead of raising
    from customer_retention.stages.temporal import TEMPORAL_METADATA_COLS
    print("customer_retention is installed")
except ImportError:
    print("customer_retention not found. Install with:")
    print("  uv sync")
    print("  # or: pip install -e .")


customer_retention is installed
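
If you would rather check for the package without importing it, the standard library can look it up by name. The following is a minimal sketch: `is_installed` and `installed_version` are hypothetical helpers, not part of `customer_retention`, and it assumes the distribution is published under the same name as the module.

```python
from importlib import metadata, util
from typing import Optional


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return util.find_spec(module_name) is not None


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: module name and distribution name are both "customer_retention".
print(is_installed("customer_retention"), installed_version("customer_retention"))
```

An editable install (`pip install -e .`) should also be visible to `importlib.metadata`, so the version lookup is a quick way to confirm which checkout is active.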


## 0.2 Available Datasets

This framework includes several internal datasets for testing and learning. **You do not need any of these to use the framework with your own data.**

### Entity-Level Datasets (one row per customer)
Use these with the standard exploration flow (notebooks 02, 03, 04).

| Dataset | Status | Description |
|---------|--------|-------------|
| `customer_retention_retail.csv` | Included | Retail customer retention (~31K rows) |
| `bank_customer_churn.csv` | Download | Bank customer churn (~10K rows) |
| `netflix_customer_churn.csv` | Download | Netflix subscription churn (~5K rows) |

### Event-Level Datasets (multiple rows per customer)
Use these with the Event Bronze Track (notebooks 01a, 01b, 01c, 01d).

| Dataset | Status | Description |
|---------|--------|-------------|
| `customer_transactions.csv` | Included | Transaction events (~50K rows) |
| `customer_emails.csv` | Included | Email engagement events (~83K rows) |

In [2]:
from pathlib import Path

FIXTURES_DIR = Path("../tests/fixtures")

# Entity-level datasets
entity_datasets = {
    "customer_retention_retail.csv": "Included",
    "bank_customer_churn.csv": "Download from Kaggle",
    "netflix_customer_churn.csv": "Download from Kaggle",
}

# Event-level datasets (internal)
event_datasets = {
    "customer_transactions.csv": "Included",
    "customer_emails.csv": "Included",
}

def report_datasets(title, datasets):
    """Print which files in `datasets` are present under FIXTURES_DIR."""
    print(f"{title}:")
    print("-" * 50)
    for name, source in datasets.items():
        path = FIXTURES_DIR / name
        if path.exists():
            size_mb = path.stat().st_size / (1024 * 1024)
            print(f"  [x] {name} ({size_mb:.1f} MB)")
        else:
            print(f"  [ ] {name} - {source}")


report_datasets("Entity-Level Datasets", entity_datasets)
print()
report_datasets("Event-Level Datasets", event_datasets)

Entity-Level Datasets:
--------------------------------------------------
  [x] customer_retention_retail.csv (2.4 MB)
  [x] bank_customer_churn.csv (0.6 MB)
  [x] netflix_customer_churn.csv (0.5 MB)

Event-Level Datasets:
--------------------------------------------------
  [x] customer_transactions.csv (4.8 MB)
  [x] customer_emails.csv (5.6 MB)


## 0.3 Kaggle API Setup

To download datasets from Kaggle, you need to set up API credentials:

1. Create a Kaggle account at https://www.kaggle.com
2. Go to **Account Settings** → **API** → **Create New Token**
3. This downloads `kaggle.json` - move it to `~/.kaggle/kaggle.json`
4. Set permissions: `chmod 600 ~/.kaggle/kaggle.json`

In [3]:
# Check if Kaggle credentials exist
kaggle_config = Path.home() / ".kaggle" / "kaggle.json"

if kaggle_config.exists():
    print(f"Kaggle credentials found at {kaggle_config}")
else:
    print("Kaggle credentials not found.")
    print("\nTo set up:")
    print("1. Go to https://www.kaggle.com/settings")
    print("2. Scroll to 'API' section and click 'Create New Token'")
    print(f"3. Move downloaded file to {kaggle_config}")
    print(f"4. Run: chmod 600 {kaggle_config}")

Kaggle credentials not found.

To set up:
1. Go to https://www.kaggle.com/settings
2. Scroll to 'API' section and click 'Create New Token'
3. Move downloaded file to /Users/Vital/.kaggle/kaggle.json
4. Run: chmod 600 /Users/Vital/.kaggle/kaggle.json
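
Step 4 can also be done from Python, which is convenient if you are already in the notebook. This is a minimal sketch for POSIX systems; `secure_credentials` is a hypothetical helper name, not part of the Kaggle client.

```python
import stat
from pathlib import Path


def secure_credentials(path: Path) -> int:
    """Restrict `path` to owner read/write and return the resulting mode bits."""
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return stat.S_IMODE(path.stat().st_mode)


kaggle_json = Path.home() / ".kaggle" / "kaggle.json"
if kaggle_json.exists():
    print(f"Permissions now: {oct(secure_credentials(kaggle_json))}")
else:
    print(f"Nothing to do: {kaggle_json} does not exist yet")
```

On Windows, `chmod` only toggles the read-only flag, so the returned mode will not be `0o600` there.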


## 0.4 Download Kaggle Datasets

Run the cells below to download each dataset. You only need to do this once.

### Bank Customer Churn Dataset
Source: https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset

In [4]:
# Download Bank Customer Churn dataset
import subprocess
import shutil

FIXTURES_DIR.mkdir(parents=True, exist_ok=True)
bank_churn_path = FIXTURES_DIR / "bank_customer_churn.csv"

if bank_churn_path.exists():
    print(f"Already exists: {bank_churn_path}")
else:
    print("Downloading Bank Customer Churn dataset...")
    try:
        subprocess.run([
            "kaggle", "datasets", "download", "-d", "gauravtopre/bank-customer-churn-dataset",
            "-p", str(FIXTURES_DIR), "--unzip"
        ], check=True)
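        # Rename to a consistent name
        downloaded = FIXTURES_DIR / "Bank_Churn.csv"
        if downloaded.exists():
            shutil.move(str(downloaded), str(bank_churn_path))
        # Only report success if the expected file is really there
        if bank_churn_path.exists():
            print(f"Downloaded to: {bank_churn_path}")
        else:
            print(f"Download finished, but {bank_churn_path.name} was not found. Check {FIXTURES_DIR} and rename the extracted CSV.")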
    except FileNotFoundError:
        print("Error: kaggle CLI not found. Install with: pip install kaggle")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading: {e}")

Already exists: ../tests/fixtures/bank_customer_churn.csv


### Netflix Customer Churn Dataset
Source: https://www.kaggle.com/datasets/vasifasad/netflix-customer-churn-prediction

In [5]:
# Download Netflix Customer Churn dataset
netflix_churn_path = FIXTURES_DIR / "netflix_customer_churn.csv"

if netflix_churn_path.exists():
    print(f"Already exists: {netflix_churn_path}")
else:
    print("Downloading Netflix Customer Churn dataset...")
    try:
        subprocess.run([
            "kaggle", "datasets", "download", "-d", "vasifasad/netflix-customer-churn-prediction",
            "-p", str(FIXTURES_DIR), "--unzip"
        ], check=True)
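        # The archive may extract under a different file name, so confirm before reporting success
        if netflix_churn_path.exists():
            print(f"Downloaded to: {netflix_churn_path}")
        else:
            print(f"Download finished, but {netflix_churn_path.name} was not found. Check {FIXTURES_DIR} and rename the extracted CSV.")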
    except FileNotFoundError:
        print("Error: kaggle CLI not found. Install with: pip install kaggle")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading: {e}")

Already exists: ../tests/fixtures/netflix_customer_churn.csv
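
Neither download cell can know in advance what the extracted CSV will be called: the bank cell assumes `Bank_Churn.csv`, and the Netflix cell does no rename at all. One way to avoid hard-coding the name is to compare the directory listing before and after the download and adopt the single new CSV. The helpers below are hypothetical and only a sketch of that idea.

```python
import shutil
from pathlib import Path


def list_csvs(directory: Path) -> set:
    """Return the set of CSV paths currently in `directory`."""
    return set(directory.glob("*.csv"))


def adopt_new_csv(directory: Path, before: set, target: Path) -> bool:
    """Rename the one CSV that appeared since `before` to `target`.

    Returns False, and renames nothing, if zero or several new CSVs are found,
    so ambiguous downloads are left for manual inspection.
    """
    new_files = list_csvs(directory) - before
    if len(new_files) != 1:
        return False
    shutil.move(str(new_files.pop()), str(target))
    return True
```

To use it, call `before = list_csvs(FIXTURES_DIR)` just before `subprocess.run(...)`, then `adopt_new_csv(FIXTURES_DIR, before, netflix_churn_path)` once the download returns.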


## 0.5 Verify Downloads

In [6]:
import pandas as pd

all_datasets = {**entity_datasets, **event_datasets}

print("Dataset Summary:")
print("=" * 60)

for name in all_datasets.keys():
    path = FIXTURES_DIR / name
    if path.exists():
        df = pd.read_csv(path)
        print(f"\n{name}:")
        print(f"  Rows: {len(df):,}")
        print(f"  Columns: {len(df.columns)}")
        print(f"  Columns: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
    else:
        print(f"\n{name}: Not downloaded")

Dataset Summary:
============================================================

customer_retention_retail.csv:
  Rows: 30,801
  Columns: 15
  Columns: custid, retained, created, firstorder, lastorder...

bank_customer_churn.csv:
  Rows: 10,000
  Columns: 13
  Columns: CustomerId, Surname, CreditScore, Geography, Gender...

netflix_customer_churn.csv:
  Rows: 5,000
  Columns: 14
  Columns: customer_id, age, gender, subscription_type, watch_hours...

customer_transactions.csv:
  Rows: 50,000
  Columns: 15
  Columns: transaction_id, customer_id, transaction_date, amount, product_category...

customer_emails.csv:
  Rows: 83,198
  Columns: 13
  Columns: email_id, customer_id, sent_date, campaign_type, opened...
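
The verification cell loads every file fully with pandas, which is fine at these sizes. For much larger files, a streaming count avoids holding the data in memory. This sketch uses only the standard library; `summarize_csv` is a hypothetical helper.

```python
import csv
from pathlib import Path


def summarize_csv(path: Path, preview: int = 5) -> dict:
    """Stream a CSV once and return row count, column count, and leading column names."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Count data rows only; blank lines are skipped, as pandas does by default
        rows = sum(1 for row in reader if row)
    return {"rows": rows, "columns": len(header), "preview": header[:preview]}
```

Calling `summarize_csv(FIXTURES_DIR / "customer_emails.csv")` would report the same three facts as the pandas loop above, assuming the file is UTF-8 encoded.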


---

## Temporal Framework Overview

This framework includes a **leakage-safe temporal infrastructure** that keeps information from after the prediction point out of your training data:

- **Timestamp Management**: Automatic detection and handling of `feature_timestamp` and `label_timestamp`
- **Versioned Snapshots**: Point-in-time training snapshots with integrity hashing
- **Scenario Detection**: Automatic detection of production vs Kaggle-style datasets
- **Leakage Detection**: Multi-probe validation (correlation, separation, temporal logic)

The temporal framework ensures that:
1. Features are only computed using data available at prediction time
2. Training data is versioned and reproducible
3. Temporal leakage is detected before model training
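
Point 1 is the core rule, and it is easy to state in code. The sketch below is a generic illustration with made-up events, not the framework's implementation: a feature for a customer is computed only from events dated on or before that customer's feature timestamp.

```python
from datetime import date


def point_in_time_total(events, customer_id, feature_timestamp):
    """Sum amounts for one customer using only events on or before `feature_timestamp`."""
    return sum(
        amount
        for cid, event_date, amount in events
        if cid == customer_id and event_date <= feature_timestamp
    )


# Hypothetical events: (customer_id, event_date, amount)
events = [
    ("c1", date(2024, 1, 10), 20.0),
    ("c1", date(2024, 2, 5), 35.0),
    ("c1", date(2024, 4, 1), 50.0),  # after the cutoff: must not leak into the feature
]

cutoff = date(2024, 3, 1)
print(point_in_time_total(events, "c1", cutoff))  # 55.0
```

Using all three events would give 105.0, a value that could not have been known on the cutoff date. That gap is what temporal leakage looks like in a single feature.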

---

## 0.6 Using the Temporal Framework

### Loading Data with Snapshot Manager

For production use, load data through the snapshot system to ensure reproducibility:

```python
from pathlib import Path
from customer_retention.stages.temporal import SnapshotManager

output_path = Path("../experiments/findings")
snapshot_manager = SnapshotManager(output_path)

snapshots = snapshot_manager.list_snapshots()
if snapshots:
    latest = snapshot_manager.get_latest_snapshot()
    df, metadata = snapshot_manager.load_snapshot(latest)
    print(f"Loaded {latest}: {df.shape}, created {metadata['created_at']}")
```

### Auto-Detecting Dataset Scenario

The framework detects whether your data is production-style (explicit timestamp columns) or Kaggle-style (a static table with no temporal information) and proposes a timestamp strategy to match:

```python
from customer_retention.stages.temporal import ScenarioDetector

detector = ScenarioDetector()
scenario, config, discovery_result = detector.detect(df, target_column="churned")

print(f"Scenario: {scenario}")
print(f"Feature timestamp: {config.feature_timestamp_column}")
print(f"Label timestamp: {config.label_timestamp_column}")
print(f"Strategy: {config.strategy.value}")
```

### Manual Override (When Auto-Detection Fails)

If auto-detection picks the wrong columns or an unsuitable strategy, bypass it entirely by creating a `TimestampConfig` directly:

```python
from customer_retention.stages.temporal import TimestampManager, TimestampConfig, TimestampStrategy

config = TimestampConfig(
    strategy=TimestampStrategy.PRODUCTION,
    feature_timestamp_column="my_observation_date",
    label_timestamp_column="my_outcome_date",
    observation_window_days=90,
)
manager = TimestampManager(config)
df_with_timestamps = manager.ensure_timestamps(df)
```

**Available strategies:**

| Strategy | When to Use |
|----------|-------------|
| `PRODUCTION` | Data has explicit timestamp columns |
| `DERIVED` | Timestamps can be computed from other columns (e.g., tenure) |
| `SYNTHETIC_FIXED` | No temporal info - use fixed date for all rows |
| `SYNTHETIC_RANDOM` | No temporal info - generate random dates within range |
| `SYNTHETIC_INDEX` | No temporal info - generate dates based on row order |

**Force synthetic timestamps (Kaggle-style data):**

```python
config = TimestampConfig(
    strategy=TimestampStrategy.SYNTHETIC_FIXED,
    synthetic_base_date="2024-01-01",
    observation_window_days=90,
)
```

**Derive timestamps from tenure column:**

```python
config = TimestampConfig(
    strategy=TimestampStrategy.DERIVED,
    derivation_config={
        "feature_derivation": {
            "formula": "reference_date - tenure_months",
            "sources": ["tenure_months"],
        }
    },
    observation_window_days=90,
)
```
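
How the framework evaluates the `formula` string is not shown here. Purely to illustrate the arithmetic behind "reference date minus tenure in months", the sketch below moves a date back by a whole number of months and clamps the day to the length of the target month. `subtract_months` is a hypothetical helper, and the clamping rule is an assumption.

```python
import calendar
from datetime import date


def subtract_months(reference: date, months: int) -> date:
    """Return `reference` moved back by `months`, clamping the day to the month's length."""
    # Work with a zero-based month index so year boundaries fall out of divmod
    index = reference.year * 12 + (reference.month - 1) - months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(reference.day, last_day))


print(subtract_months(date(2024, 3, 31), 1))   # 2024-02-29
print(subtract_months(date(2024, 1, 15), 14))  # 2022-11-15
```

The clamp matters for end-of-month reference dates: 31 March minus one month has no 31st to land on, so it falls back to the last day of February.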

### Creating a Training Snapshot from Raw Data

```python
from pathlib import Path

from customer_retention.stages.temporal import UnifiedDataPreparer
from customer_retention.core.config import TemporalConfig

config = TemporalConfig(
    feature_timestamp_column="feature_timestamp",
    label_timestamp_column="label_timestamp",
)

preparer = UnifiedDataPreparer(output_path=Path("../experiments/findings"), timestamp_config=config)

# raw_df is your unprocessed DataFrame (for example, loaded with pd.read_csv)
unified_df = preparer.prepare_from_raw(
    df=raw_df, target_column="churned", entity_column="customer_id"
)

snapshot_df, metadata = preparer.create_training_snapshot(df=unified_df, snapshot_name="training")
print(f"Created snapshot: {metadata['snapshot_id']}")
```
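
The snapshot metadata is produced by the framework, and how it computes its integrity hash is not shown here. As a generic illustration of the idea, the sketch below fingerprints a file with SHA-256 and later checks that it has not changed. Both helpers are hypothetical and independent of `SnapshotManager`.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hash: str) -> bool:
    """Return True if the file on disk still matches the recorded hash."""
    return file_sha256(path) == expected_hash
```

Recording the digest when a snapshot is written and calling `verify_file` before training is one simple way to notice that a training file was edited after the fact.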

---

## Next Steps

You're ready to start exploring! Continue to **01_data_discovery.ipynb**.

**Using your own data?** Just set `DATA_PATH` to your file:
```python
DATA_PATH = "/path/to/your/data.csv"
```

**Using sample datasets?** Choose one based on your learning goal:
```python
# Leave exactly one DATA_PATH line uncommented

# Entity-level (standard flow)
DATA_PATH = "../tests/fixtures/customer_retention_retail.csv"
# DATA_PATH = "../tests/fixtures/bank_customer_churn.csv"
# DATA_PATH = "../tests/fixtures/netflix_customer_churn.csv"

# Event-level (time series flow)
# DATA_PATH = "../tests/fixtures/customer_transactions.csv"
# DATA_PATH = "../tests/fixtures/customer_emails.csv"
```