# Skyulf Core: Low-Level Component Example

This notebook demonstrates the **"Component Way"** of using Skyulf. 
Instead of a single `SkyulfPipeline` wrapper, we will manually:
1.  Split the data.
2.  Use the `FeatureEngineer` component directly to fit and transform.
3.  Use the `SklearnCalculator` / `SklearnApplier` components directly for modeling.

This approach gives you maximum flexibility for debugging or custom workflows.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Skyulf Low-Level Components
from skyulf.preprocessing.pipeline import FeatureEngineer
from skyulf.modeling.classification import RandomForestClassifierCalculator, RandomForestClassifierApplier

# 1. Load Data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]

print("Data Loaded.")
print(df.head(2))

Data Loaded.
   Survived  Pclass     Sex   Age     Fare Embarked
0         0       3    male  22.0   7.2500        S
1         1       1  female  38.0  71.2833        C


In [2]:
# 2. Manual Data Separation & Splitting
# Since we aren't using the integrated pipeline, we must split X and y manually.

X = df.drop(columns=['Survived'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train Shape: {X_train.shape}")
print(f"Test Shape: {X_test.shape}")

Train Shape: (712, 5)
Test Shape: (179, 5)


In [3]:
# 3. Initialize Feature Engineer
# We define the steps list exactly as we did in the pipeline, 
# but now we feed it into the FeatureEngineer class directly.

fe_steps = [
    {
        "name": "imputer_age",
        "transformer": "SimpleImputer", 
        "params": {"columns": ["Age"], "strategy": "mean"}
    },
    {
        "name": "imputer_embarked",
        "transformer": "SimpleImputer",
        "params": {"columns": ["Embarked"], "strategy": "most_frequent"}
    },
    {
        "name": "encoder",
        "transformer": "OneHotEncoder",
        "params": {"columns": ["Sex", "Embarked"], "drop_first": False}
    },
    {
        "name": "scaler",
        "transformer": "StandardScaler",
        "params": {"columns": ["Age", "Fare"]}
    }
]

feature_engineer = FeatureEngineer(fe_steps)

print("Feature Engineer Initialized.")

Feature Engineer Initialized.


### 3b. Deep Dive: The Atomic "Calculator-Applier" Pattern
Before running the full `FeatureEngineer`, let's demonstrate what happens **inside** it at each step (e.g., Age Imputation).
This is the "Ultra Low-Level" API: using individual Calculator and Applier classes.
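
To make the pattern concrete before the Skyulf classes appear, here is a minimal sketch in plain Python. `MeanImputerCalculator` and `MeanImputerApplier` are hypothetical stand-ins written for this illustration, not Skyulf classes; they only mimic the shape used in this notebook, where `fit` returns a plain artifact and `apply` consumes it.

```python
import math
from statistics import mean


def _is_missing(value):
    # Treat None and float NaN as missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))


class MeanImputerCalculator:
    """Hypothetical calculator: learns state and returns it as a plain dict."""

    def fit(self, rows, config):
        fill_values = {}
        for col in config["columns"]:
            observed = [r[col] for r in rows if not _is_missing(r[col])]
            fill_values[col] = mean(observed)
        return {"type": "mean_imputer", "fill_values": fill_values}


class MeanImputerApplier:
    """Hypothetical applier: holds no state, driven only by the artifact."""

    def apply(self, rows, artifact):
        out = []
        for r in rows:
            new_row = dict(r)  # copy so the input rows are left untouched
            for col, value in artifact["fill_values"].items():
                if _is_missing(new_row[col]):
                    new_row[col] = value
            out.append(new_row)
        return out


train_rows = [{"Age": 22.0}, {"Age": None}, {"Age": 38.0}]
artifact = MeanImputerCalculator().fit(train_rows, {"columns": ["Age"]})
imputed = MeanImputerApplier().apply(train_rows, artifact)
print(artifact)
print(imputed)
```

Because the applier keeps nothing on the instance, the artifact dict is the only thing that has to travel between the fit step and the apply step.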

In [4]:
from skyulf.preprocessing.imputation import SimpleImputerCalculator, SimpleImputerApplier
from skyulf.preprocessing.encoding import OneHotEncoderCalculator, OneHotEncoderApplier
from skyulf.preprocessing.scaling import StandardScalerCalculator, StandardScalerApplier
import json
import numpy as np

# Define a mapping from string names to actual implementation classes
# In the real Skyulf backend, this is handled by a sophisticated Registry
component_map = {
    "SimpleImputer": (SimpleImputerCalculator, SimpleImputerApplier),
    "OneHotEncoder": (OneHotEncoderCalculator, OneHotEncoderApplier),
    "StandardScaler": (StandardScalerCalculator, StandardScalerApplier)
}

# Helper to serialize numpy types for display
class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super(NumpyEncoder, self).default(obj)

print("--- 🔬 Atomic Execution Loop (The 'Under the Hood' View) ---")

# We start with a copy of X_train so we don't affect other cells
X_current = X_train.copy()

for step in fe_steps:
    step_name = step["name"]
    transformer_type = step["transformer"]
    params = step["params"]
    
    print(f"\n[Step: {step_name}] Type: {transformer_type}")
    
    # 1. Resolve Components
    CalculatorCls, ApplierCls = component_map[transformer_type]
    
    # 2. Instantiate Calculator & Fit
    # The SDK expects a config dictionary wrapping the params
    step_config = {"params": params}
    
    calculator = CalculatorCls()
    # .fit() creates the portable artifact (state)
    fitted_artifact = calculator.fit(X_current, step_config)
    
    print(f"  ├── Calculator: {calculator.__class__.__name__}")
    print(f"  ├── Config (Input params): {json.dumps(params, indent=2)}")
    
    # Clean up the artifact for display: swap any binary object (encoders usually carry one) for a placeholder
    display_artifact = {k: v for k, v in fitted_artifact.items() if k != 'encoder_object'}
    if 'encoder_object' in fitted_artifact:
        display_artifact['encoder_object'] = "<Binary Scikit-Learn Object>"
        
    print(f"  ├── Fitted Artifact (Learned State): \n{json.dumps(display_artifact, cls=NumpyEncoder, indent=4)}")
    
    # 3. Instantiate Applier & Apply
    applier = ApplierCls()
    # .apply() takes data + the artifact from the calculator
    X_current = applier.apply(X_current, fitted_artifact)
    
    print(f"  ├── Applier: {applier.__class__.__name__}")
    print(f"  └── Output Shape: {X_current.shape}")

print("\n--- ✅ Final Transformed Data (Matching FeatureEngineer Output) ---")
display(X_current.head())

--- 🔬 Atomic Execution Loop (The 'Under the Hood' View) ---

[Step: imputer_age] Type: SimpleImputer
  ├── Calculator: SimpleImputerCalculator
  ├── Config (Input params): {
  "columns": [
    "Age"
  ],
  "strategy": "mean"
}
  ├── Fitted Artifact (Learned State): 
{
    "type": "simple_imputer",
    "strategy": "mean",
    "fill_values": {
        "Pclass": 2.330056179775281,
        "Age": 29.498846153846156,
        "Fare": 32.5862761235955
    },
    "columns": [
        "Pclass",
        "Age",
        "Fare"
    ],
    "missing_counts": {
        "Pclass": 0,
        "Age": 140,
        "Fare": 0
    },
    "total_missing": 140
}
  ├── Applier: SimpleImputerApplier
  └── Output Shape: (712, 5)

[Step: imputer_embarked] Type: SimpleImputer
  ├── Calculator: SimpleImputerCalculator
  ├── Config (Input params): {
  "columns": [
    "Embarked"
  ],
  "strategy": "most_frequent"
}
  ├── Fitted Artifact (Learned State): 
{
    "type":

       Pclass       Age      Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  Embarked_nan
331 -1.614136  1.232263 -0.078684           0         1           0           0           1             0
733 -0.400551 -0.500482 -0.377145           0         1           0           0           1             0
382  0.813034  0.192616 -0.474867           0         1           0           0           1             0
704  0.813034 -0.269449  -0.47623           0         1           0           0           1             0
813  0.813034 -1.809667 -0.025249           1         0           0           0           1             0


In [4]:
# 4. Fit & Transform Features
# We call fit_transform on training data.
# This learns the means/categories AND applies them to X_train.

# Note: Skyulf's FeatureEngineer returns (transformed_data, metrics)
print("Fitting Feature Engineer on Train Data...")
X_train_transformed, metrics = feature_engineer.fit_transform(X_train)

# Now we apply the learned transformations to Test Data
print("Transforming Test Data (using learned params)...")
X_test_transformed = feature_engineer.transform(X_test)

print("\nTransformed Train Shape:", X_train_transformed.shape)
print("Transformed Test Shape:", X_test_transformed.shape)
print("\nSample Transformed Data:")
print(X_train_transformed.head(3))

Fitting Feature Engineer on Train Data...
Transforming Test Data (using learned params)...

Transformed Train Shape: (712, 8)
Transformed Test Shape: (179, 8)

Sample Transformed Data:
     Pclass       Age      Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  \
331       1  1.232263 -0.078684           0         1           0           0   
733       2 -0.500482 -0.377145           0         1           0           0   
382       3  0.192616 -0.474867           0         1           0           0   

     Embarked_S  
331           1  
733           1  
382           1  
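
The reason for calling `fit_transform` on the training split and only `transform` on the test split is that the learned statistics should come from training data alone. The sketch below illustrates this with a hand-rolled standardizer in plain Python. `fit_scaler` and `apply_scaler` are hypothetical helpers for this example, the numbers are made up, and nothing here is claimed to mirror how Skyulf computes its scaling internally.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    # Learn the statistics from the training values only.
    return {"mean": mean(values), "scale": pstdev(values)}


def apply_scaler(values, state):
    # Reuse the learned statistics; nothing is re-estimated here.
    return [(v - state["mean"]) / state["scale"] for v in values]


train_fare = [10.0, 20.0, 30.0]  # toy training values
test_fare = [40.0]               # toy unseen value

state = fit_scaler(train_fare)
train_scaled = apply_scaler(train_fare, state)
test_scaled = apply_scaler(test_fare, state)

print(state)
print(train_scaled)
print(test_scaled)
```

The scaled training values are centred on zero, while the test value lands wherever the training statistics put it. That is the intended behaviour: the test split is treated as data the pipeline has never seen.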


In [5]:
# 5. Fit Model (Calculator)
# Now we manually fit the Random Forest Calculator.

rf_config = {
    "n_estimators": 50,
    "max_depth": 5,
    "random_state": 42
}

calculator = RandomForestClassifierCalculator()
applier = RandomForestClassifierApplier()

print("Fitting Model Calculator...")
# fit() returns the model artifact (here, the fitted sklearn estimator itself)
model_artifact = calculator.fit(X_train_transformed, y_train, rf_config)

# Verify we got an artifact
print("\nModel Artifact Type:", type(model_artifact))
# In Skyulf's sklearn wrapper, this is the actual fitted sklearn object
print(model_artifact)

Fitting Model Calculator...

Model Artifact Type: <class 'sklearn.ensemble._forest.RandomForestClassifier'>
RandomForestClassifier(max_depth=5, min_samples_leaf=2, min_samples_split=5,
                       n_estimators=50, n_jobs=-1, random_state=42)


In [6]:
# 6. Apply Model (Inference)
# Use the Applier with the fitted artifact to predict on Test Data.

print("Applying Model to Test Data...")
predictions = applier.predict(X_test_transformed, model_artifact)

# Calculate Accuracy
acc = accuracy_score(y_test, predictions)

print("\n--- Results ---")
print(f"Test Set Accuracy: {acc:.4f}")
print("Sample Predictions:", predictions[:5].tolist())

Applying Model to Test Data...

--- Results ---
Test Set Accuracy: 0.8101
Sample Predictions: [0, 0, 0, 1, 1]
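
Accuracy here is the fraction of test rows where the prediction equals the true label. The sketch below spells that out in plain Python. The `accuracy` helper is written for illustration as a stand-in for `accuracy_score`, and the toy labels are made up. With 179 test rows, the reported 0.8101 is consistent with 145 correct predictions.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction matches the label.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


toy_acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 3 of 4 match
implied_acc = 145 / 179                          # correct / total test rows

print(toy_acc)
print(f"{implied_acc:.4f}")
```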


### 4. The 'State' in Action: Saving and Loading
This section demonstrates the concept of portable state in three steps:
1. We **Simulate the Tailor** (Fit) to generate a JSON artifact.
2. We **Save** that JSON to disk (`scaler_state.json`).
3. We **Simulate the Factory** (Apply) by loading that JSON from disk and using it to transform data.

**Note:** No Python objects are pickled. Only data (JSON) is exchanged.
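
As a library-free illustration of the same round trip, the sketch below writes a small state dict to JSON, reads it back, and applies it. The state values, the `apply_scaler` helper, and the temporary file location are all assumptions made for this example; the real artifact produced by Skyulf may carry different keys.

```python
import json
import os
import tempfile

# Hypothetical learned state, loosely shaped like a scaler artifact.
state = {
    "type": "standard_scaler",
    "columns": ["Age", "Fare"],
    "mean": [29.5, 32.6],
    "scale": [14.5, 49.7],
}

# "Ship" the state: only plain data is written, no Python objects.
path = os.path.join(tempfile.mkdtemp(), "scaler_state.json")
with open(path, "w") as f:
    json.dump(state, f, indent=2)

# "Factory" side: load the state without access to whatever produced it.
with open(path) as f:
    loaded = json.load(f)


def apply_scaler(row, artifact):
    # Scale each configured column with the loaded mean and scale.
    scaled = dict(row)
    for col, m, s in zip(artifact["columns"], artifact["mean"], artifact["scale"]):
        scaled[col] = (row[col] - m) / s
    return scaled


result = apply_scaler({"Age": 44.0, "Fare": 32.6}, loaded)
print(result)
```

Since the file holds only strings, lists, and floats, it can be inspected, diffed, or loaded by a process that never imported the fitting code.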

In [None]:
import json
import os
from skyulf.preprocessing.scaling import StandardScalerCalculator, StandardScalerApplier

# --- 1. The Tailor (Learning State) ---
print("1. [Tailor] Measuring the data (Fitting Scaler)...")
tailor = StandardScalerCalculator()
# Learning the mean/std of Age and Fare
tailor_config = {"params": {"columns": ["Age", "Fare"]}}
learned_state_artifact = tailor.fit(X_train, tailor_config)

# --- 2. The Shipment (Saving to JSON) ---
print("2. [Shipping] Saving state to 'scaler_state.json'...")
# Note: We use the NumpyEncoder helper we defined earlier to handle numpy float types
with open("scaler_state.json", "w") as f:
    json.dump(learned_state_artifact, f, cls=NumpyEncoder, indent=2)

# Verify the file exists
print(f"   File created: {os.path.abspath('scaler_state.json')}")
print(f"   File Content Preview: {json.dumps(learned_state_artifact, cls=NumpyEncoder)[:100]}...")

# --- 3. The Factory (Loading & Applying) ---
print("\n3. [Factory] Loading state and starting production...")

# Simulate a clean slate - we don't need the 'tailor' object anymore!
del tailor 

with open("scaler_state.json", "r") as f:
    loaded_state_artifact = json.load(f)

worker = StandardScalerApplier()
# The worker takes the LOADED state and applies it to new data
# Note: We use X_test here to show it working on unseen data
X_test_scaled = worker.apply(X_test, loaded_state_artifact)

print("   Factory Output (First 3 rows of scaled Age/Fare):")
display(X_test_scaled[["Age", "Fare"]].head(3))

1. [Tailor] Measuring the data (Fitting Scaler)...
2. [Shipping] Saving state to 'scaler_state.json'...
   File created: c:\Users\Murat\Desktop\skyulf-mlflow\docs\examples\notebooks\scaler_state.json
   File Content Preview: {"type": "standard_scaler", "mean": [2.330056179775281, 29.498846153846156, 32.5862761235955], "scal...

3. [Factory] Loading state and starting production...
   Factory Output (First 3 rows of scaled Age/Fare):


          Age      Fare
709       NaN -0.333901
439  0.103618 -0.425284
840 -0.655664 -0.474867
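
The missing Age value for row 709 is most likely expected rather than a bug: this cell applies the scaler to the raw `X_test`, which was never imputed, and arithmetic on a missing value stays missing. A tiny sketch of that behaviour, using made-up mean and scale values:

```python
import math

mean_age, scale_age = 29.5, 14.5  # hypothetical learned statistics
ages = [math.nan, 31.0]           # first value is missing, as in an unimputed row

# Standardizing a NaN yields NaN; the scaler does not fill gaps.
scaled = [(a - mean_age) / scale_age for a in ages]
print(scaled)
```

To avoid this in practice, the imputation step has to run before scaling, which is the order the `fe_steps` list uses.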
