# Skyulf Core: Real-World Pipeline Example (Integrated Splitting)

This notebook demonstrates how to use `SkyulfPipeline` with **data splitting as one of the pipeline steps**.
Raw data goes in, and the pipeline prevents leakage internally: each preprocessing step is fitted on the training split only and then applied to all splits.

Key concepts covered:
1.  **Integrated Splitting**: Using `TrainTestSplitter` inside the preprocessing config.
2.  **Configuration**: Defining the full lifecycle from raw data -> split -> process -> model.
3.  **Calculator/Applier**: Seeing how Skyulf separates learning from applying.
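
The Calculator/Applier idea can be illustrated without Skyulf. The sketch below is plain Python; the function names `calculate_mean_fill` and `apply_fill`, the artifact layout, and the toy rows are hypothetical and are not Skyulf's API. The calculator learns a statistic from training rows only, and the applier reuses that stored value on any rows, so test data never influences the fill value.

```python
from statistics import mean


def calculate_mean_fill(train_rows, column):
    """Calculator: learn the fill value from training rows only."""
    observed = [row[column] for row in train_rows if row[column] is not None]
    return {"column": column, "fill_value": mean(observed)}


def apply_fill(rows, artifact):
    """Applier: fill missing values from the stored artifact; learns nothing."""
    column, fill = artifact["column"], artifact["fill_value"]
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Made-up rows for illustration
train = [{"Age": 20.0}, {"Age": 40.0}, {"Age": None}]
test = [{"Age": None}, {"Age": 80.0}]

artifact = calculate_mean_fill(train, "Age")  # uses the train rows only
train_filled = apply_fill(train, artifact)
test_filled = apply_fill(test, artifact)      # test rows never change fill_value

print(artifact)
print(test_filled)
```

Keeping the learned state in a plain dictionary is one possible design; it makes the artifact easy to inspect, which is similar in spirit to the learned parameters printed later in this notebook.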

In [11]:
import pandas as pd
import numpy as np
from skyulf.pipeline import SkyulfPipeline

# 1. Load Real-World Dataset (Titanic)
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Select relevant columns for this demo
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]

print("Original Data Info:")
print(df.info())  # df.info() prints the summary itself and returns None, so "None" is printed too

# Note: We do NOT manually split here. We pass the full dataframe to the pipeline.
print(f"\nTotal Samples: {len(df)}")

Original Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   Fare      891 non-null    float64
 5   Embarked  889 non-null    object 
dtypes: float64(2), int64(2), object(2)
memory usage: 41.9+ KB
None

Total Samples: 891


In [12]:
# 2. Define Pipeline Configuration with Splitting

# Note: The configuration format must match what Skyulf expects internally.
# Preprocessing steps are a list of dicts with: 'name', 'transformer', 'params'.
# Modeling config is a dict with: 'type', 'node_id', 'params'.

pipeline_config = {
    "preprocessing": [
        # Step 0: Split Data internally
        # This converts DataFrame -> SplitDataset
        {
            "name": "splitter_node",
            "transformer": "TrainTestSplitter",
            "params": {
                "test_size": 0.2,
                "random_state": 42,
                "shuffle": True,
                # Stratify by target to ensure balanced classes in splits
                "stratify": True, 
                "target_column": "Survived"
            }
        },
        # Step 1: Impute Missing Age (Numerical)
        {
            "name": "imputer_age",
            "transformer": "SimpleImputer", 
            "params": {
                "columns": ["Age"],
                "strategy": "mean"
            }
        },
        # Step 2: Impute Missing Embarked (Categorical)
        {
            "name": "imputer_embarked",
            "transformer": "SimpleImputer",
            "params": {
                "columns": ["Embarked"],
                "strategy": "most_frequent"
            }
        },
        # Step 3: One-Hot Encode Categorical Columns
        {
            "name": "encoder_sex_embarked",
            "transformer": "OneHotEncoder",
            "params": {
                "columns": ["Sex", "Embarked"],
                "drop_first": False
            }
        },
        # Step 4: Scale Age and Fare
        {
            "name": "scaler_numeric",
            "transformer": "StandardScaler",
            "params": {
                "columns": ["Age", "Fare"]
            }
        }
    ],
    "modeling": {
        "type": "random_forest_classifier",
        "node_id": "rf_model",  # Modeling uses 'node_id'
        "params": {  # Modeling uses 'params' for hyperparameters
            "n_estimators": 50,
            "max_depth": 5,
            "random_state": 42
        }
    }
}

print("Configuration defined (including Splitter).")

Configuration defined (including Splitter).
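
To make the splitter parameters (`test_size`, `random_state`, `shuffle`, `stratify`) concrete, here is a minimal plain-Python sketch of a stratified, shuffled split. It is an illustration only: `stratified_split` is a hypothetical helper, and how Skyulf's `TrainTestSplitter` performs the split internally is not shown in this notebook. Each class is shuffled and divided separately, so the train and test parts keep roughly the same class ratio as the full data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, random_state=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with a 60/40 class balance (made-up data)
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

With an imbalanced target such as `Survived`, this kind of split keeps the evaluation set from ending up with a very different class mix than the training set.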


In [13]:
# 3. Initialize & Fit Pipeline

# Create the pipeline instance
pipeline = SkyulfPipeline(pipeline_config)

print("Fitting pipeline on Raw DataFrame...")

# We pass the raw DataFrame here. The first node (TrainTestSplitter) performs the split.
# Subsequent nodes (imputers, encoder, scaler) receive a SplitDataset:
# they fit on the train split only and then transform all splits.
metrics = pipeline.fit(df, target_column="Survived")

print("\n--- Training Complete ---")
if metrics:
    print("\nMetrics from internal evaluation:")
    # Note: Evaluation happens on the 'test' split created by the splitter node
    print(metrics)

Fitting pipeline on Raw DataFrame...

--- Training Complete ---

Metrics from internal evaluation:
{'preprocessing': {'missing_counts': {'Embarked': 2}, 'total_missing': 2, 'fill_values': {'Embarked': 'S'}, 'new_features_count': 5, 'encoded_columns_count': 2, 'mean': [29.807686956521742, 31.819826264044945], 'scale': [13.005910822597324, 48.025343047105174], 'var': [169.1537163253542, 2306.433574792133], 'columns': ['Age', 'Fare']}, 'modeling': {'problem_type': 'classification', 'splits': {'train': ModelEvaluationReport(dataset_name='train', metrics={'accuracy': 0.8497191011235955, 'precision_weighted': 0.8567625119057833, 'recall_weighted': 0.8497191011235955, 'f1_weighted': 0.8445362684754207, 'precision': 0.9029126213592233, 'recall': 0.6813186813186813, 'f1': 0.7766179540709812, 'g_score': 0.8172603449434848, 'roc_auc': 0.9107570485702604, 'pr_auc': 0.8964230411195061}, classification=ClassificationEvaluation(confusion_matrix=ConfusionMatrixData(labels=['0', '1'], matrix=[[0, 0], [
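
The "fit on train, apply on all" behaviour mentioned in the code comments can be sketched as follows. `SplitData` and `fit_then_apply` are hypothetical stand-ins, not Skyulf's `SplitDataset` or engine code, and the numbers are made up. The scaling statistics come from the train part only and are then reused unchanged on the test part.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class SplitData:
    train: list
    test: list


def calculate_scaler(values):
    """Learn the mean and population standard deviation of the given values."""
    return {"mean": mean(values), "scale": pstdev(values)}


def apply_scaler(values, artifact):
    """Standardize values using previously learned statistics."""
    return [(v - artifact["mean"]) / artifact["scale"] for v in values]


def fit_then_apply(split, calculate, apply):
    """Fit on the train part only, then transform both parts."""
    artifact = calculate(split.train)
    transformed = SplitData(
        train=apply(split.train, artifact),
        test=apply(split.test, artifact),
    )
    return transformed, artifact


split = SplitData(train=[10.0, 20.0, 30.0], test=[40.0])
scaled, artifact = fit_then_apply(split, calculate_scaler, apply_scaler)

print(artifact)
print(scaled.test)
```

The scaled train values are centred on zero, while the test value is not; that is expected, because the test part is measured against the training statistics rather than its own.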

In [14]:
# 4. Inspect Learned Parameters

print("--- Learned Parameters Inspection ---\n")

# FeatureEngineer stores fitted steps in a list of dicts called 'fitted_steps'
# Splitters are structural and typically not stored in fitted_steps (which are for transform/inference).

# Step 0 in fitted_steps corresponds to 'imputer_age' (since Splitter was skipped)
if len(pipeline.feature_engineer.fitted_steps) > 0:
    step_info = pipeline.feature_engineer.fitted_steps[0]
    print(f"Node Name: {step_info['name']}")
    print(f"Node Type: {step_info['type']}")
    print(f"Learned Params: {step_info['artifact']}")
    # Expected: the mean of Age, learned from the training split only

    print("-" * 30)

    # Step: OneHotEncoder (Index 2 in fitted_steps, since we have AgeImputer(0), EmbarkedImputer(1), OneHot(2))
    if len(pipeline.feature_engineer.fitted_steps) > 2:
        encoder_step = pipeline.feature_engineer.fitted_steps[2]
        print(f"Node Name: {encoder_step['name']}")
        print(f"Node Type: {encoder_step['type']}")
        
        params = encoder_step['artifact']
        if 'categories_' in params:
            print(f"Learned Categories: {params['categories_']}")
        else:
            print(f"Learned Params Keys: {params.keys()}")
else:
    print("No fitted steps found (Pipeline might not have fit successfully).")

--- Learned Parameters Inspection ---

Node Name: imputer_age
Node Type: SimpleImputer
Learned Params: {'type': 'simple_imputer', 'strategy': 'mean', 'fill_values': {'Age': 29.807686956521735}, 'columns': ['Age'], 'missing_counts': {'Age': 137}, 'total_missing': 137}
------------------------------
Node Name: encoder_sex_embarked
Node Type: OneHotEncoder
Learned Params Keys: dict_keys(['type', 'columns', 'encoder_object', 'feature_names', 'prefix_separator', 'drop_original', 'include_missing'])
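
The encoder artifact has no `categories_` key, so the `else` branch prints the available keys instead. As a rough illustration of what an encoder artifact needs to carry, the plain-Python sketch below learns categories from training rows and reuses them at apply time. The names `calculate_one_hot` and `apply_one_hot` and the artifact fields are hypothetical; Skyulf's actual artifact stores an `encoder_object` whose internals are not shown in this notebook.

```python
def calculate_one_hot(train_rows, columns, sep="_"):
    """Learn the sorted set of categories per column from training rows."""
    categories = {
        col: sorted({row[col] for row in train_rows}) for col in columns
    }
    return {"columns": columns, "categories": categories, "sep": sep}


def apply_one_hot(rows, artifact):
    """Replace each encoded column with 0/1 indicator columns."""
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in artifact["columns"]}
        for col in artifact["columns"]:
            for cat in artifact["categories"][col]:
                name = f"{col}{artifact['sep']}{cat}"
                out[name] = int(row[col] == cat)
        encoded.append(out)
    return encoded


# Made-up training rows covering two Sex values and three Embarked values
train = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
    {"Sex": "male", "Embarked": "Q"},
]
artifact = calculate_one_hot(train, ["Sex", "Embarked"])

# A category never seen during fitting yields all zeros for that column.
new_rows = [{"Sex": "female", "Embarked": "X"}]
encoded = apply_one_hot(new_rows, artifact)
print(encoded)
```

Two `Sex` categories plus three `Embarked` categories give five indicator columns, which would be consistent with the `new_features_count` of 5 reported in the fit metrics.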


In [15]:
# 5. Run Inference on New Data
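# To test inference, we take a few rows from the original dataframe and drop the label
# (simulating "production" data that arrives later). These rows may overlap with the
# training split, so this demonstrates the predict API, not generalization.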

production_data_sample = df.sample(5, random_state=999).drop(columns=['Survived'])
print("New data sample (no labels):")
print(production_data_sample)

print("\nRunning inference...")
# The pipeline.predict() method automatically ignores the "TrainTestSplitter" step
# because it's flagged as a training-only step in the engine.
predictions = pipeline.predict(production_data_sample)

print("\nPredictions:")
print(predictions)

New data sample (no labels):
     Pclass   Sex   Age    Fare Embarked
857       1  male  51.0  26.550        S
666       2  male  25.0  13.000        S
350       3  male  23.0   9.225        S
90        3  male  29.0   8.050        S
583       1  male  36.0  40.125        C

Running inference...

Predictions:
857    0
666    0
350    0
90     0
583    0
dtype: int64
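
One way an engine could skip training-only steps at inference is sketched below. This is a hypothetical illustration: the `Step` class, its `training_only` flag, and `run_inference` are invented for this example and do not reflect Skyulf's internal names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    apply: Callable[[list], list]
    training_only: bool = False  # hypothetical flag


def run_inference(steps, rows):
    """Apply every step except those marked as training-only."""
    applied = []
    for step in steps:
        if step.training_only:
            continue  # e.g. a splitter has nothing to do at inference time
        rows = step.apply(rows)
        applied.append(step.name)
    return rows, applied


def fail_if_called(rows):
    raise RuntimeError("splitter must not run at inference time")


steps = [
    Step("splitter_node", fail_if_called, training_only=True),
    Step("imputer_age", lambda values: [29.8 if v is None else v for v in values]),
]

rows, applied = run_inference(steps, [51.0, None])
print(applied, rows)
```

Because the splitter's function raises if it is ever called, the example also shows that the skipped step is never executed, not merely ignored in the output.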
