
# 🚀 Data Preparation (SageMaker + Redshift)

**Purpose:** Create a *stable*, **reproducible**, and **config-driven** data preparation process that is safe to promote toward production deployment.

> This notebook is designed to run on SageMaker or locally. It loads data from **Redshift** or **S3 Parquet** via the provided `load_data` function in `data_io.py`, applies deterministic preprocessing & feature engineering, validates schema, and writes versioned artifacts for downstream training.



## 📦 What You Get
- Config-first **data loader** (Redshift or S3 Parquet) using `load_data()` from `data_io.py`  
- **Cleaning & feature engineering** examples
- **Leakage-aware**, time-ordered train/val/test splits → `train.parquet`, `val.parquet`, `test.parquet`
- A run **summary** (row counts and churn rate per split) → `summary.json`



## 🧰 Prerequisites
- Python 3.9+
- Packages: `pandas`, `numpy`, `pyarrow`, `scikit-learn`, `mlflow` (optional), `sqlalchemy`, `redshift_connector`, `s3fs`
- A `data_io.py` next to this notebook containing the provided `load_data(...)` implementation.


In [1]:

# If running on a fresh environment, uncomment as needed (SageMaker kernels usually have most of these)
# %pip install pandas numpy pyarrow scikit-learn mlflow sqlalchemy redshift_connector s3fs



## ♻️ Reproducibility & Environment Capture
- **Fixed seeds** for deterministic results.
- Capture **package versions** for traceability.
- All artifacts are written under a **run folder** with a unique timestamp/hash.


In [2]:
import os, sys, json, platform, random, hashlib
from datetime import datetime, timezone
import numpy as np
import pandas as pd

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Run folder (timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12)
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
RUN_ID = hashlib.sha1(f"{RUN_TS}-{SEED}".encode()).hexdigest()[:8]
ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/run_{RUN_TS}_{RUN_ID}")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

# Minimal environment capture
ENV_INFO = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
    "packages": {"pandas": pd.__version__, "numpy": np.__version__}
}
with open(os.path.join(ARTIFACT_DIR, "env_info.json"), "w") as f:
    json.dump(ENV_INFO, f, indent=2)

ARTIFACT_DIR




'artifacts/run_20251016T180850Z_0e3e59e5'
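
The environment capture above records only the pandas and NumPy versions. A minimal sketch of a broader capture follows, assuming the package names from the Prerequisites list are the ones worth recording. `capture_versions` is a hypothetical helper, not something provided by `data_io.py`.

```python
from importlib import metadata


def capture_versions(packages):
    """Return {package: version}, with None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the Prerequisites list
PACKAGES = ["pandas", "numpy", "pyarrow", "scikit-learn", "mlflow",
            "sqlalchemy", "redshift_connector", "s3fs"]
package_versions = capture_versions(PACKAGES)
print(package_versions)
```

The resulting dict could replace the hand-written `"packages"` entry in `ENV_INFO`; optional packages such as `mlflow` simply show up as `None` when absent.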


## ⚙️ Configuration
Single source of truth for the target column, split size, and feature lists. The cell below does not yet define a `data` section, so the **source** (`"redshift"` or `"parquet"`) cannot be switched from here until one is added.


In [3]:
# Configuration - similar to production but simplified for training
CONFIG = {
    "target_col": "churn",
    "test_size": 0.2,
    "random_seed": SEED,
    
    # Feature lists (identify which columns are what type)
    "id_features": ["idconsumo", "codigocontaservico", "idconta", "iddim_cliente"],
    "int_features": ["codigocontaservico", "codigocliente", "n_dias_subscricao"],
    "datetime_features": ["iddim_date_inicio", "iddim_date_fim"]
}

print(f"Target: {CONFIG['target_col']}")
print(f"Test size: {CONFIG['test_size']:.0%}")

Target: churn
Test size: 20%
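
In the configuration above, `codigocontaservico` is listed under both `id_features` and `int_features`, which later inflates the per-type counts. A minimal sketch of a config check that would surface this, assuming the three feature-list keys shown above; `validate_config` is a hypothetical helper:

```python
from collections import Counter


def validate_config(config):
    """Return a list of human-readable problems found in a CONFIG-style dict."""
    problems = []
    if not 0.0 < config["test_size"] < 1.0:
        problems.append(f"test_size must be in (0, 1), got {config['test_size']}")
    # Count how often each column name occurs across the role lists
    counts = Counter(
        col
        for key in ("id_features", "int_features", "datetime_features")
        for col in config.get(key, [])
    )
    for col, n in counts.items():
        if n > 1:
            problems.append(f"'{col}' is listed {n} times across the feature lists")
    if config["target_col"] in counts:
        problems.append("target_col is also listed as a feature")
    return problems


example_config = {
    "target_col": "churn",
    "test_size": 0.2,
    "id_features": ["idconsumo", "codigocontaservico", "idconta", "iddim_cliente"],
    "int_features": ["codigocontaservico", "codigocliente", "n_dias_subscricao"],
    "datetime_features": ["iddim_date_inicio", "iddim_date_fim"],
}
for problem in validate_config(example_config):
    print("⚠️", problem)
```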



## 📥 Load Data (Redshift or S3 Parquet)
Uses `load_data()` defined in `data_io.py`:

```python
df = load_data(
    source=CONFIG["data"]["source"],
    uri=CONFIG["data"]["parquet_uri"],
    sql=CONFIG["data"]["sql"],
    redshift_kwargs=CONFIG["data"]["redshift_kwargs"]
)
```
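If `data_io.py` is not found, the intent is to fall back to a **synthetic demo dataset** so the rest of the pipeline remains testable. Note that `CONFIG` as defined above has no `data` section, and the cell below reads a local Parquet sample directly instead of calling `load_data()`.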


In [4]:
df = pd.read_parquet('../sample_data_from_redshift/sample.parquet')
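
A minimal sketch of the loader-with-fallback behaviour described in this section, assuming `load_data()` accepts the keyword arguments shown in the snippet above. `make_synthetic_churn` and `load_or_synthesize` are hypothetical helpers, and the synthetic values are made up purely so downstream cells have something to run on.

```python
import numpy as np
import pandas as pd


def make_synthetic_churn(n_rows=1_000, seed=42):
    """Build a small, seeded demo frame; all values are illustrative."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2025-01-01")
    return pd.DataFrame({
        "idconsumo": np.arange(n_rows),
        "iddim_date_inicio": start + pd.to_timedelta(rng.integers(0, 180, n_rows), unit="D"),
        "topup_count": rng.integers(0, 50, n_rows),
        "tipo_stb": rng.choice(["HD", "SD"], n_rows),
        "churn": rng.integers(0, 2, n_rows),
    })


def load_or_synthesize(data_cfg):
    """Use data_io.load_data when it is importable, else return synthetic data."""
    try:
        from data_io import load_data
    except ImportError:
        print("data_io.py not found, using synthetic demo data")
        return make_synthetic_churn()
    return load_data(
        source=data_cfg["source"],
        uri=data_cfg["parquet_uri"],
        sql=data_cfg["sql"],
        redshift_kwargs=data_cfg["redshift_kwargs"],
    )


demo = make_synthetic_churn(200)
print(demo.head(3))
```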


## 🔎 Quick Profile
Lightweight overview to understand data types, nulls, and basic distributions.


In [5]:
# Quick look at the data
print(f"Dataset shape: {df.shape[0]:,} rows x {df.shape[1]} columns\n")

# Check target distribution
print("Target distribution:")
print(df[CONFIG['target_col']].value_counts())
print(f"\nChurn rate: {df[CONFIG['target_col']].mean():.1%}")

# Check for missing values
print(f"\nTotal missing values: {df.isnull().sum().sum():,}")

# Show first few rows
print("\nFirst 3 rows:")
df.head(3)

Dataset shape: 100,000 rows x 58 columns

Target distribution:
churn
0    70329
1    29671
Name: count, dtype: int64

Churn rate: 29.7%

Total missing values: 297,424

First 3 rows:


| | idconsumo | id_contaservico | codigocontaservico | idconta | iddim_date_inicio | iddim_date_fim | id_produto_actual | tipo_produto_actual | tipo_subscricao | tipo_stb | ... | was_contacted | topup_count | topup_total_value | topup_avg_value | topup_std_value | topup_cv_value | topup_days_since_last | used_selfcare | topup_type_nunique | topup_channel_nunique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 511039064 | 3594153 | 110614850201 | 3539750 | 2025-05-17 | 2025-05-19 | 24 | tafacil7 | 7 | HD | ... | 0 | 32 | 12280.7 | 383.77 | 578.333869 | 1.50698 | 2 | 0 | 4 | 1 |
| 1 | 412255701 | 1157023 | 133299940101 | 1120996 | 2024-05-30 | 2024-06-25 | 24 | normal | 7 | HD | ... | 0 | 7 | 10657.89 | 1522.55 | 1898.960612 | 1.247224 | 26 | 0 | 3 | 0 |
| 2 | 534347088 | 2987119 | 111385530401 | 2930755 | 2025-08-04 | 2025-08-06 | 24 | normal | 7 | HD | ... | 0 | 72 | 26570.18 | 369.03 | 653.186315 | 1.770009 | 2 | 0 | 4 | 1 |



## 🧾 Feature Schema (Draft)
Define **types, nullability, and basic constraints**. This schema will be exported to JSON and should be reviewed by ML + DevOps.


In [6]:
# Identify feature types using config (like production)
print("Feature types in the dataset:\n")

# Get feature lists from config
id_cols = [c for c in CONFIG['id_features'] if c in df.columns]
int_cols = [c for c in CONFIG['int_features'] if c in df.columns]
datetime_cols = [c for c in CONFIG['datetime_features'] if c in df.columns]

# Everything else is either numeric or categorical
all_special = set(id_cols + int_cols + datetime_cols + [CONFIG['target_col']])
remaining = [c for c in df.columns if c not in all_special]

# Split remaining into numeric vs categorical
numeric_cols = df[remaining].select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df[remaining].select_dtypes(include=['object']).columns.tolist()

print(f"ID features: {len(id_cols)}")
print(f"Integer features: {len(int_cols)}")
print(f"Datetime features: {len(datetime_cols)}")
print(f"Other numeric: {len(numeric_cols)}")
print(f"Categorical: {len(categorical_cols)}")
print(f"Target: 1")

print(f"\nTotal features: {len(df.columns)}")

Feature types in the dataset:

ID features: 4
Integer features: 3
Datetime features: 2
Other numeric: 38
Categorical: 10
Target: 1

Total features: 58
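
The cell above classifies columns but does not write the schema that this section says will be exported to JSON. A minimal sketch of a draft export, assuming dtype, nullability, and a role label are enough for a first review; `build_feature_schema` and the toy frame are hypothetical:

```python
import json

import pandas as pd


def build_feature_schema(df, roles):
    """Describe each column: dtype, whether it contains nulls, and its role."""
    return {
        col: {
            "dtype": str(df[col].dtype),
            "nullable": bool(df[col].isnull().any()),
            "role": roles.get(col, "feature"),
        }
        for col in df.columns
    }


# Toy frame using column names from the sample above
toy = pd.DataFrame({
    "idconsumo": [1, 2, 3],
    "topup_avg_value": [383.77, None, 369.03],
    "churn": [0, 1, 0],
})
schema = build_feature_schema(toy, roles={"idconsumo": "id", "churn": "target"})
print(json.dumps(schema, indent=2))
```

In the notebook, the same dict could be dumped to `feature_schema.json` under `ARTIFACT_DIR` alongside the other artifacts.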



## 🧼 Cleaning
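- Parse datetime columns with `pd.to_datetime(..., errors='coerce')`
- Fill missing values with fixed sentinels: `-1` for integer features, `0.0` for other numerics, `'<MISSING>'` for categoricals
- Drop rows with a missing target and cast the target to integer
- Not yet implemented in the cell below: normalizing categoricals, fixing invalid values (e.g., negative prices), and optional outlier capping via IQR

> **Stable rule sets** are critical; avoid ad-hoc fixes. All transformations must be deterministic and versioned.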
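
The cleaning cell below does not include the IQR capping mentioned in the list above. A minimal sketch of what it could look like, assuming the conventional 1.5 × IQR fences; `cap_iqr` is a hypothetical helper:

```python
import pandas as pd


def cap_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; NaNs pass through unchanged."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
capped = cap_iqr(values)
print(capped.tolist())
```

To keep the rule deterministic and free of leakage, one option is to compute the fences on the training split only and store them with the run artifacts, then apply the stored fences to validation and test data.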


In [7]:
# Data cleaning - based on production preprocess_data() function
print("Starting data cleaning...\n")

# 1. Convert datetime features to datetime type
for col in datetime_cols:
    if 'date' in col.lower():
        df[col] = pd.to_datetime(df[col], errors='coerce')
        print(f"Converted {col} to datetime")

# 2. Fill integer features with -1 (production approach)
for col in int_cols:
    missing = df[col].isnull().sum()
    if missing > 0:
        df[col] = df[col].fillna(-1).astype(int)
        print(f"Filled {missing:,} missing in '{col}' with -1")
    else:
        df[col] = df[col].astype(int)

# 3. Fill other numeric features with 0.0
for col in numeric_cols:
    missing = df[col].isnull().sum()
    if missing > 0:
        df[col] = df[col].fillna(0.0)
        print(f"Filled {missing:,} missing in '{col}' with 0.0")

# 4. Fill categorical features with '<MISSING>' (production approach)
for col in categorical_cols:
    missing = df[col].isnull().sum()
    if missing > 0:
        df[col] = df[col].fillna('<MISSING>')
        print(f"Filled {missing:,} missing in '{col}' with '<MISSING>'")

# 5. Drop rows where target is missing
initial_rows = len(df)
df = df.dropna(subset=[CONFIG['target_col']])
dropped = initial_rows - len(df)
if dropped > 0:
    print(f"Dropped {dropped:,} rows with missing target")

# 6. Ensure target is integer
df[CONFIG['target_col']] = df[CONFIG['target_col']].astype(int)

print(f"\n✓ Cleaning complete: {len(df):,} rows, {df.isnull().sum().sum()} missing values")

Starting data cleaning...

Converted iddim_date_inicio to datetime
Converted iddim_date_fim to datetime
Filled 80 missing in 'codigocliente' with -1
Filled 3,751 missing in 'gap_since_prev_expiry' with 0.0
Filled 7,229 missing in 'mean_len_prev' with 0.0
Filled 10,721 missing in 'std_len_prev' with 0.0
Filled 80 missing in 'idcliente' with 0.0
Filled 66,524 missing in 'age' with 0.0
Filled 80 missing in 'age_missing' with 0.0
Filled 80 missing in 'account_age_d_cliente' with 0.0
Filled 6,102 missing in 'days_since_last_update_cliente' with 0.0
Filled 80 missing in 'iddim_conta' with 0.0
Filled 848 missing in 'idgrupo_dim_contadimensao' with 0.0
Filled 80 missing in 'account_age_d_conta' with 0.0
Filled 80 missing in 'iddim_contaservico_dth' with 0.0
Filled 80 missing in 'account_age_d_contaservico' with 0.0
Filled 99,373 missing in 'days_since_last_contact' with 0.0
Filled 196 missing in 'topup_avg_value' with 0.0
Filled 1,012 missing in 'topup_std_value' with 0.0
Filled 778 missing in


## 🧩 Feature Engineering (Examples)
- Ratios and interactions
- Encoding categoricals for model training (defer heavy encoders to training pipeline)
- Date/time features (if present)


In [8]:
# Feature engineering (optional - keep it minimal for now)
# Production code doesn't create new features here, just prepares existing ones
# You can add domain-specific features if needed

print("Feature engineering step...")

# Example: Convert tenure_bucket to string if it exists (production does this)
if 'tenure_bucket' in df.columns:
    df['tenure_bucket'] = df['tenure_bucket'].astype(str)
    print("Converted tenure_bucket to string")

# Add your own feature engineering here if needed
# Examples:
# - Ratios: df['ratio'] = df['col1'] / df['col2']  
# - Bins: df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100])
# - Flags: df['is_active'] = (df['days_since_last'] < 30).astype(int)

print(f"\nTotal features: {len(df.columns)}")

Feature engineering step...
Converted tenure_bucket to string

Total features: 58
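
The commented examples in the cell above (ratios, bins, flags) can be sketched on a toy frame as follows. The column names come from the sample shown earlier; the derived feature names, bin edges, and the 30-day threshold are illustrative assumptions, not production rules.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "topup_total_value": [12280.7, 0.0, 26570.18],
    "topup_count": [32, 0, 72],
    "topup_days_since_last": [2, 45, 2],
})

# Ratio: map a zero denominator to NaN so the division never yields inf
denominator = toy["topup_count"].replace(0, np.nan)
toy["value_per_topup"] = toy["topup_total_value"] / denominator

# Flag: 1 when the last top-up happened fewer than 30 days ago
toy["recent_topup"] = (toy["topup_days_since_last"] < 30).astype(int)

# Bins: explicit edges keep the bucketing identical from run to run
toy["topup_count_bucket"] = pd.cut(
    toy["topup_count"],
    bins=[-1, 0, 10, 50, np.inf],
    labels=["none", "low", "medium", "high"],
)
print(toy)
```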



## ✅ Lightweight Validation
Simple checks before export. For production, consider a formal framework (e.g., Great Expectations or pandera).


In [9]:
# Quick validation checks
print("Running basic checks...\n")

# Check 1: No missing targets
target_col = CONFIG['target_col']
assert df[target_col].isnull().sum() == 0, "Target has missing values!"
print("✓ No missing targets")

# Check 2: Target is binary (0 and 1)
unique_targets = df[target_col].unique()
assert set(unique_targets).issubset({0, 1}), f"Target should be 0/1, found: {unique_targets}"
print("✓ Target is binary")

# Check 3: Enough data
assert len(df) > 1000, f"Not enough data: {len(df)} rows"
print(f"✓ Sufficient data: {len(df):,} rows")

# Check 4: Check class balance
churn_rate = df[target_col].mean()
print(f"✓ Churn rate: {churn_rate:.1%} ({churn_rate*len(df):.0f} churners)")

print("\nAll checks passed!")

Running basic checks...

✓ No missing targets
✓ Target is binary
✓ Sufficient data: 100,000 rows
✓ Churn rate: 29.7% (29671 churners)

All checks passed!
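
The `assert`-based checks above stop at the first failure and are skipped entirely when Python runs with `-O`. A minimal sketch of an alternative that collects every failure before reporting, assuming the same three rules; `run_checks` is a hypothetical helper:

```python
import pandas as pd


def run_checks(df, target_col, min_rows=1000):
    """Return a list of failed-check messages; an empty list means all passed."""
    if target_col not in df.columns:
        return [f"missing target column '{target_col}'"]
    failures = []
    if df[target_col].isnull().any():
        failures.append("target has missing values")
    # Any non-null value other than 0 or 1 violates the binary-target rule
    extra = set(df[target_col].dropna().unique()) - {0, 1}
    if extra:
        failures.append(f"target has non-binary values: {sorted(float(v) for v in extra)}")
    if len(df) <= min_rows:
        failures.append(f"not enough data: {len(df)} rows")
    return failures


bad = pd.DataFrame({"churn": [0, 1, 2, None]})
failures = run_checks(bad, "churn", min_rows=3)
for message in failures:
    print("✗", message)
```

A caller can then raise once, with the full list, if `failures` is non-empty.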



## ✂️ Train / Validation / Test Split
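- **Time-based** by default: rows are sorted on `iddim_date_inicio` and cut 80/10/10 by position, so the model trains on older data and is evaluated on newer data
- Falls back to a random, stratified split (`stratify=df[target_col]`, fixed `random_state`) when the date column is missing
- `CONFIG["test_size"]` (20%) applies only to the fallback; the time-based path hard-codes its fractions (edit here if needed)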


In [10]:
# Time-based split (simplified from prepare_data_for_training in production)
# This is better for time-series data - train on older data, test on newer data

print("Splitting data by time...\n")

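# Sort by date column. A stable sort keeps rows with the same date in their
# incoming order, so the positional cut below is reproducible for a given input.
# Rows whose date could not be parsed (NaT) sort last.
date_col = 'iddim_date_inicio'
if date_col in df.columns:
    df = df.sort_values(date_col, kind="stable")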
    
    # Define cutoff points (80% train, 10% val, 10% test)
    train_size = 0.8
    val_size = 0.1
    
    n = len(df)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    
    # Split by position (time-ordered)
    train_df = df.iloc[:train_end].copy()
    val_df = df.iloc[train_end:val_end].copy()
    test_df = df.iloc[val_end:].copy()
    
    # Show results
    print(f"Train: {len(train_df):,} rows ({len(train_df)/len(df):.0%})")
    print(f"Val:   {len(val_df):,} rows ({len(val_df)/len(df):.0%})")
    print(f"Test:  {len(test_df):,} rows ({len(test_df)/len(df):.0%})")
    
    # Check date ranges
    print(f"\nDate ranges:")
    print(f"Train: {train_df[date_col].min()} to {train_df[date_col].max()}")
    print(f"Val:   {val_df[date_col].min()} to {val_df[date_col].max()}")
    print(f"Test:  {test_df[date_col].min()} to {test_df[date_col].max()}")
    
    # Check churn rates
    target_col = CONFIG['target_col']
    print(f"\nChurn rates:")
    print(f"Train: {train_df[target_col].mean():.1%}")
    print(f"Val:   {val_df[target_col].mean():.1%}")
    print(f"Test:  {test_df[target_col].mean():.1%}")
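else:
    print(f"⚠️ Date column '{date_col}' not found, falling back to random split")
    from sklearn.model_selection import train_test_split
    # Hold out the test set first, stratified on the target
    train_df, test_df = train_test_split(
        df, test_size=CONFIG['test_size'],
        random_state=CONFIG['random_seed'],
        stratify=df[CONFIG['target_col']]
    )
    # Carve a validation set out of the remaining rows so that the save
    # step always has a val_df to write
    train_df, val_df = train_test_split(
        train_df, test_size=CONFIG['test_size'],
        random_state=CONFIG['random_seed'],
        stratify=train_df[CONFIG['target_col']]
    )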

Splitting data by time...

Train: 80,000 rows (80%)
Val:   10,000 rows (10%)
Test:  10,000 rows (10%)

Date ranges:
Train: 2017-12-22 00:00:00 to 2025-06-23 00:00:00
Val:   2025-06-23 00:00:00 to 2025-08-07 00:00:00
Test:  2025-08-07 00:00:00 to 2025-09-29 00:00:00

Churn rates:
Train: 34.5%
Val:   11.6%
Test:  9.0%
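
The date ranges above show `2025-06-23` in both train and val, and `2025-08-07` in both val and test, because the positional cut falls inside a single day. A minimal sketch of a cutoff-based variant, assuming whole days should stay in one split; `split_by_date_cutoff` is a hypothetical helper, and the resulting proportions are only approximately 80/10/10.

```python
import pandas as pd


def split_by_date_cutoff(df, date_col, train_frac=0.8, val_frac=0.1):
    """Split on date cutoffs so that no single date spans two splits.

    Rows with a missing date match none of the comparisons and are excluded.
    """
    ordered = df[date_col].dropna().sort_values(kind="stable")
    n = len(ordered)
    # Move each boundary to the end of the day that contains the positional cut
    train_cut = ordered.iloc[max(int(n * train_frac), 1) - 1]
    val_cut = ordered.iloc[max(int(n * (train_frac + val_frac)), 1) - 1]
    dates = df[date_col]
    train = df[dates <= train_cut]
    val = df[(dates > train_cut) & (dates <= val_cut)]
    test = df[dates > val_cut]
    return train, val, test


toy = pd.DataFrame({
    "iddim_date_inicio": pd.to_datetime(
        ["2025-01-01"] * 4 + ["2025-01-02"] * 4 + ["2025-01-03", "2025-01-04"]
    ),
    "churn": [0, 1] * 5,
})
train, val, test = split_by_date_cutoff(toy, "iddim_date_inicio")
print(len(train), len(val), len(test))
```

When many rows share the boundary date, the train share can grow well past 80%, so it is worth printing the split sizes after switching to this approach.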



## 💾 Write Artifacts
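- **Processed splits** (`train.parquet`, `val.parquet`, `test.parquet`)
- **Run summary** (`summary.json`) with row counts and churn rate per split
- **Feature schema** (`feature_schema.json`) and **splits** (`splits.json`, with IDs for deterministic reuse) are not yet written by the cell below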


In [11]:
# Save the processed datasets
print("Saving data...\n")

# Save as Parquet (efficient format for ML)
train_path = os.path.join(ARTIFACT_DIR, "train.parquet")
val_path = os.path.join(ARTIFACT_DIR, "val.parquet")
test_path = os.path.join(ARTIFACT_DIR, "test.parquet")

train_df.to_parquet(train_path, index=False)
val_df.to_parquet(val_path, index=False)
test_df.to_parquet(test_path, index=False)

print(f"✓ Train data saved: {train_path}")
print(f"✓ Val data saved:   {val_path}")
print(f"✓ Test data saved:  {test_path}")

# Save a simple summary
target_col = CONFIG['target_col']
summary = {
    "timestamp": RUN_TS,
    "train_rows": len(train_df),
    "val_rows": len(val_df),
    "test_rows": len(test_df),
    "features": len(train_df.columns),
    "train_churn_rate": float(train_df[target_col].mean()),
    "val_churn_rate": float(val_df[target_col].mean()),
    "test_churn_rate": float(test_df[target_col].mean()),
    "split_type": "time_based"
}

summary_path = os.path.join(ARTIFACT_DIR, "summary.json")
with open(summary_path, "w") as f:
    json.dump(summary, f, indent=2)

print(f"✓ Summary saved: {summary_path}")
print(f"\nAll files saved to: {ARTIFACT_DIR}")

Saving data...

✓ Train data saved: artifacts/run_20251016T180850Z_0e3e59e5/train.parquet
✓ Val data saved:   artifacts/run_20251016T180850Z_0e3e59e5/val.parquet
✓ Test data saved:  artifacts/run_20251016T180850Z_0e3e59e5/test.parquet
✓ Summary saved: artifacts/run_20251016T180850Z_0e3e59e5/summary.json

All files saved to: artifacts/run_20251016T180850Z_0e3e59e5
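
The cell above does not write `splits.json`. A minimal sketch of how the ID lists could be assembled, assuming `idconsumo` uniquely identifies a row (not confirmed by the notebook); `build_splits` is a hypothetical helper and the toy frames stand in for `train_df`, `val_df`, and `test_df`.

```python
import json

import pandas as pd


def build_splits(frames, id_col):
    """Map each split name to the list of row IDs it contains."""
    return {name: frame[id_col].tolist() for name, frame in frames.items()}


train = pd.DataFrame({"idconsumo": [511039064, 412255701]})
val = pd.DataFrame({"idconsumo": [534347088]})
test = pd.DataFrame({"idconsumo": [600000001]})  # made-up ID for illustration

splits = build_splits({"train": train, "val": val, "test": test}, id_col="idconsumo")

# An ID appearing in two splits would mean the same row leaks between them
all_ids = [i for ids in splits.values() for i in ids]
if len(all_ids) != len(set(all_ids)):
    raise ValueError("an ID appears in more than one split")

splits_json = json.dumps(splits, indent=2)
print(splits_json)
```

In the notebook, `splits_json` would be written to `os.path.join(ARTIFACT_DIR, "splits.json")` next to `summary.json`.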



## 📈 (Optional) MLflow Trace
Record data prep parameters and artifacts for lineage. Enable by setting `USE_MLFLOW = True` in the cell below.


In [12]:
# Optional: Track with MLflow (skip this if you don't have MLflow set up)
USE_MLFLOW = False  # Set to True to enable

if USE_MLFLOW:
    try:
        import mlflow
        
        mlflow.set_experiment("data-preparation")
        
        with mlflow.start_run():
            # Log basic info
            mlflow.log_params({
                "train_rows": len(train_df),
                "val_rows": len(val_df),
                "test_rows": len(test_df),
                "split_type": "time_based",
                "churn_rate": summary['train_churn_rate']
            })
            
            # Log the data files
            mlflow.log_artifact(train_path)
            mlflow.log_artifact(val_path)
            mlflow.log_artifact(test_path)
            
            print("✓ Logged to MLflow")
    except Exception as exc:  # ImportError, or a tracking/logging failure
        print(f"⚠️ MLflow logging skipped: {exc}")
else:
    print("MLflow tracking disabled (set USE_MLFLOW=True to enable)")

MLflow tracking disabled (set USE_MLFLOW=True to enable)
