# 🚀 Simple Pipeline - Interactive Demo

This notebook shows how to use Simple Pipeline to build synthetic datasets with Ollama.

## 📋 Contents
1. Setup and imports
2. Basic pipeline
3. Pipeline with filters and transformations
4. Robust pipeline with error handling
5. Results analysis

## 1. Setup and Imports

In [30]:
import pandas as pd
import sys
from pathlib import Path

# Add the parent directory to the path so the local package can be imported
sys.path.insert(0, str(Path.cwd().parent))

from simple_pipeline import SimplePipeline
from simple_pipeline.steps import (
    LoadDataFrame,
    OllamaLLMStep,
    RobustOllamaStep,
    FilterRows,
    SortRows,
    SampleRows,
    AddColumn,
    KeepColumns
)

print("✅ Imports complete")

✅ Imports complete


## 2. Basic Pipeline

We start with a simple pipeline that generates explanations of concepts.

In [31]:
# Input data
concepts = pd.DataFrame({
    'concept': ['Machine Learning', 'Blockchain', 'Quantum Computing'],
    'audience': ['beginner', 'intermediate', 'advanced']
})

print("📊 Input data:")
display(concepts)

📊 Input data:


|   | concept | audience |
|---|---------|----------|
| 0 | Machine Learning | beginner |
| 1 | Blockchain | intermediate |
| 2 | Quantum Computing | advanced |


In [37]:
# Create the pipeline
pipeline = SimplePipeline(
    name="basic-demo",
    description="Basic explanations demo",
    log_level="DEBUG"
)

# Add steps
pipeline.add_step(
    LoadDataFrame(name="load", df=concepts)
)

pipeline.add_step(
    OllamaLLMStep(
        name="explain",
        model_name="deepseek-r1:8b",
        prompt_column="concept",
        output_column="explanation",
        prompt_template=lambda row: f"Explain {row['concept']} to a {row['audience']} audience in 2-3 sentences.",
        system_prompt="You are a clear, concise technical educator.",
        batch_size=3,
        generation_kwargs={"temperature": 0.7, "num_predict": 100}
    )
)

print("✅ Pipeline configured")

2025-10-16 16:15:32 - SimplePipeline.basic-demo - INFO - Added step: load
2025-10-16 16:15:32 - SimplePipeline.basic-demo - INFO - Added step: explain


✅ Pipeline configured
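
Because each model call is slow, it can help to preview the rendered prompts before running the pipeline. The sketch below applies the same `prompt_template` lambda to plain dictionaries that stand in for DataFrame rows. `preview_prompts` is a hypothetical helper written for this illustration, not part of Simple Pipeline.

```python
# Plain dictionaries standing in for the rows of the `concepts` DataFrame.
rows = [
    {"concept": "Machine Learning", "audience": "beginner"},
    {"concept": "Blockchain", "audience": "intermediate"},
    {"concept": "Quantum Computing", "audience": "advanced"},
]

# Same template as in the "explain" step.
prompt_template = lambda row: (
    f"Explain {row['concept']} to a {row['audience']} audience in 2-3 sentences."
)

def preview_prompts(rows, template):
    """Hypothetical helper: return the prompt each row would produce."""
    return [template(row) for row in rows]

prompts = preview_prompts(rows, prompt_template)
for p in prompts:
    print(p)
```

With the real DataFrame, `concepts.apply(prompt_template, axis=1)` should yield the same strings, since the lambda only indexes the row by column name.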


In [38]:
# Run the pipeline
result = pipeline.run(use_cache=False)

print("\n📊 Results:")
display(result)

2025-10-16 16:15:38 - SimplePipeline.basic-demo - INFO - Starting pipeline: basic-demo
2025-10-16 16:15:38 - SimplePipeline.basic-demo - INFO - Number of steps: 2
2025-10-16 16:15:38 - SimplePipeline.basic-demo - INFO - Executing generator step: load
2025-10-16 16:15:38 - SimplePipeline.basic-demo - INFO - Executing step: explain
Processing explain: 100%|██████████| 3/3 [00:16<00:00,  5.35s/it]
2025-10-16 16:15:54 - SimplePipeline.basic-demo - INFO -   ✓ Complete (3 rows, 4 columns)
2025-10-16 16:15:54 - SimplePipeline.basic-demo - INFO - Pipeline execution complete!



📊 Results:


|   | concept | audience | explanation | model_name |
|---|---------|----------|-------------|------------|
| 0 | Machine Learning | beginner | `<think>\nOkay, I need to recall the user's que...` | deepseek-r1:8b |
| 1 | Blockchain | intermediate | `<think>\nThe user is asking me to explain bloc...` | deepseek-r1:8b |
| 2 | Quantum Computing | advanced | `<think>\nOkay, let's tackle this user wants a ...` | deepseek-r1:8b |
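
Every explanation in the output begins with `<think>`, the tag that deepseek-r1 uses to wrap its reasoning before the final answer. The following is a minimal sketch for removing that reasoning in a post-processing step. `strip_think` is a hypothetical helper, not a Simple Pipeline function. It also assumes that a block left open (for example because `num_predict` cut generation short) means no final answer was produced.

```python
import re

# Matches a complete <think>...</think> block, including newlines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text):
    """Remove reasoning blocks; return '' if a block was never closed."""
    cleaned = _THINK_BLOCK.sub("", text)
    if "<think>" in cleaned:
        # Generation stopped before </think>, so keep only what precedes the tag.
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()

print(strip_think("<think>\nplan the answer\n</think>\nML learns patterns from data."))
print(repr(strip_think("<think>\nOkay, I need to recall the user's que")))
```

Applied to the results, this would look like `result['explanation'].apply(strip_think)`. An empty string then signals a row whose generation budget was spent entirely on reasoning.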


## 3. Pipeline with Filters and Transformations

Next we build a more complex pipeline that chains several transformation steps.

In [18]:
# Larger dataset
topics_data = pd.DataFrame({
    'topic': [
        'Python', 'JavaScript', 'Rust', 'Go', 'Java',
        'C++', 'Ruby', 'Swift', 'Kotlin', 'TypeScript'
    ],
    'type': [
        'interpreted', 'interpreted', 'compiled', 'compiled', 'compiled',
        'compiled', 'interpreted', 'compiled', 'compiled', 'interpreted'
    ],
    'popularity': [95, 85, 75, 80, 90, 70, 60, 65, 70, 85]
})

print("📊 Initial dataset:")
display(topics_data)

📊 Initial dataset:


|   | topic | type | popularity |
|---|-------|------|------------|
| 0 | Python | interpreted | 95 |
| 1 | JavaScript | interpreted | 85 |
| 2 | Rust | compiled | 75 |
| 3 | Go | compiled | 80 |
| 4 | Java | compiled | 90 |
| 5 | C++ | compiled | 70 |
| 6 | Ruby | interpreted | 60 |
| 7 | Swift | compiled | 65 |
| 8 | Kotlin | compiled | 70 |
| 9 | TypeScript | interpreted | 85 |


In [19]:
# Pipeline with transformations
transform_pipeline = SimplePipeline(
    name="transform-demo",
    description="Demo with filters and transformations"
)

# Step 1: Load the data
transform_pipeline.add_step(
    LoadDataFrame(name="load", df=topics_data)
)

# Step 2: Keep only popular languages (popularity > 75)
transform_pipeline.add_step(
    FilterRows(
        name="filter_popular",
        filter_column="popularity",
        condition="> 75"
    )
)

# Step 3: Sort by popularity, highest first
transform_pipeline.add_step(
    SortRows(
        name="sort",
        by="popularity",
        ascending=False
    )
)

# Step 4: Sample 3 rows (SampleRows draws at random, so this is not a strict top 3)
transform_pipeline.add_step(
    SampleRows(
        name="top_3",
        n=3
    )
)

# Step 5: Generate key features
transform_pipeline.add_step(
    OllamaLLMStep(
        name="generate_features",
        model_name="deepseek-r1:8b",
        prompt_column="topic",
        output_column="key_features",
        prompt_template=lambda row: f"List 3 key features of {row['topic']} programming language. Be concise.",
        batch_size=3
    )
)

print("✅ Transformation pipeline configured")

2025-10-16 15:10:22 - SimplePipeline.transform-demo - INFO - Added step: load
2025-10-16 15:10:22 - SimplePipeline.transform-demo - INFO - Added step: filter_popular
2025-10-16 15:10:22 - SimplePipeline.transform-demo - INFO - Added step: sort
2025-10-16 15:10:22 - SimplePipeline.transform-demo - INFO - Added step: top_3
2025-10-16 15:10:22 - SimplePipeline.transform-demo - INFO - Added step: generate_features


✅ Transformation pipeline configured
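
The `condition="> 75"` argument pairs a comparison operator with a threshold. This notebook does not show how `FilterRows` parses that string, so the sketch below is only one plausible interpretation built on the standard `operator` module. `parse_condition` is a hypothetical name.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [
    (">=", operator.ge), ("<=", operator.le),
    ("==", operator.eq), ("!=", operator.ne),
    (">", operator.gt), ("<", operator.lt),
]

def parse_condition(condition):
    """Turn a string such as '> 75' into a predicate on numbers."""
    text = condition.strip()
    for symbol, func in _OPS:
        if text.startswith(symbol):
            threshold = float(text[len(symbol):])
            return lambda value: func(value, threshold)
    raise ValueError(f"Unsupported condition: {condition!r}")

popularity = [95, 85, 75, 80, 90, 70, 60, 65, 70, 85]
is_popular = parse_condition("> 75")
kept = [p for p in popularity if is_popular(p)]
print(kept)  # [95, 85, 80, 90, 85]
```

The comparison is strict, so Rust (75) is excluded and five of the ten languages remain.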


In [26]:
# Run
result_transform = transform_pipeline.run(use_cache=False)

print("\n📊 3 sampled popular languages with features:")
display(result_transform)

2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Starting pipeline: transform-demo
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Number of steps: 5
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Executing generator step: load
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Executing step: filter_popular
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO -   ✓ Complete (5 rows, 3 columns)
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Executing step: sort
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO -   ✓ Complete (5 rows, 3 columns)
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Executing step: top_3
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO -   ✓ Complete (3 rows, 3 columns)
2025-10-16 15:12:13 - SimplePipeline.transform-demo - INFO - Executing step: generate_features


  Filtered 5 rows (10 → 5)
  Sorted by: popularity
  Sampled 3 rows from 5 (60.0%)


Processing generate_features: 100%|██████████| 3/3 [07:33<00:00, 151.04s/it]
2025-10-16 15:19:46 - SimplePipeline.transform-demo - INFO -   ✓ Complete (3 rows, 5 columns)
2025-10-16 15:19:46 - SimplePipeline.transform-demo - INFO - Pipeline execution complete!



📊 3 sampled popular languages with features:


|   | topic | type | popularity | key_features | model_name |
|---|-------|------|------------|--------------|------------|
| 0 | Python | interpreted | 95 | `<think>\nOkay, let me think about this query: ...` | deepseek-r1:8b |
| 1 | Go | compiled | 80 | `<think>\nWe are going to list three key featur...` | deepseek-r1:8b |
| 2 | TypeScript | interpreted | 85 | `<think>\nWe are given a list three key feature...` | deepseek-r1:8b |
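
The result contains Go (80) but not Java (90), which shows that sampling after a sort does not produce the top rows. The sketch below contrasts the two operations on the five filtered languages using only the standard library. The seed is arbitrary and only makes the example repeatable.

```python
import random

# The five rows that pass the popularity > 75 filter, in their original order.
filtered = [
    {"topic": "Python", "popularity": 95},
    {"topic": "JavaScript", "popularity": 85},
    {"topic": "Go", "popularity": 80},
    {"topic": "Java", "popularity": 90},
    {"topic": "TypeScript", "popularity": 85},
]

# Deterministic top 3: sort descending, then slice. The sort is stable,
# so the tie at 85 keeps JavaScript ahead of TypeScript.
top_3 = sorted(filtered, key=lambda r: r["popularity"], reverse=True)[:3]
print([r["topic"] for r in top_3])  # ['Python', 'Java', 'JavaScript']

# Random sample of 3: any three of the five rows, in arbitrary order.
rng = random.Random(0)
sampled = rng.sample(filtered, 3)
print([r["topic"] for r in sampled])
```

In pandas, `topics_data.nlargest(3, "popularity")` or `sort_values("popularity", ascending=False).head(3)` gives a true top 3. Whether Simple Pipeline ships an equivalent step is not shown in this notebook.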


## 4. Robust Pipeline with Error Handling

We use `RobustOllamaStep` to handle errors gracefully.

In [21]:
# Dataset that may cause problems
tricky_data = pd.DataFrame({
    'task': [
        'Write a function to sort a list',
        'Explain recursion',
        'This is an intentionally bad prompt ###@@@',  # May fail
        'Implement binary search',
    ]
})

display(tricky_data)

|   | task |
|---|------|
| 0 | Write a function to sort a list |
| 1 | Explain recursion |
| 2 | This is an intentionally bad prompt ###@@@ |
| 3 | Implement binary search |


In [22]:
# Robust pipeline
robust_pipeline = SimplePipeline(
    name="robust-demo",
    description="Robust error handling demo"
)

robust_pipeline.add_step(
    LoadDataFrame(name="load", df=tricky_data)
)

robust_pipeline.add_step(
    RobustOllamaStep(
        name="generate_code",
        model_name="deepseek-r1:8b",
        prompt_column="task",
        output_column="code",
        system_prompt="You are a coding assistant. Generate Python code.",
        batch_size=2,
        max_retries=2,
        save_failures=True,
        continue_on_error=True  # Keep going even if some rows fail
    )
)

print("✅ Robust pipeline configured")

2025-10-16 15:10:22 - SimplePipeline.robust-demo - INFO - Added step: load
2025-10-16 15:10:22 - SimplePipeline.robust-demo - INFO - Added step: generate_code


✅ Robust pipeline configured
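
The options `max_retries` and `continue_on_error` describe a retry-then-record pattern: try each row several times and, if it still fails, store the failure instead of aborting. The sketch below illustrates that pattern in plain Python. It is not the implementation of `RobustOllamaStep`, and it assumes that `max_retries` counts attempts made after the first one. `run_with_retries` and `fake_model` are hypothetical names.

```python
def run_with_retries(fn, prompt, max_retries=2):
    """Call fn(prompt); retry on failure and report a status instead of raising."""
    last_error = None
    # One initial attempt plus up to max_retries further attempts.
    for attempt in range(1 + max_retries):
        try:
            return {"output": fn(prompt), "status": "success",
                    "error": None, "attempts": attempt + 1}
        except Exception as exc:  # broad on purpose: any failure is recorded
            last_error = str(exc)
    return {"output": None, "status": "failed",
            "error": last_error, "attempts": 1 + max_retries}

def fake_model(prompt):
    """Stand-in for a model call that fails on one specific input."""
    if "###" in prompt:
        raise ValueError("malformed prompt")
    return f"# code for: {prompt}"

tasks = ["Explain recursion", "This is an intentionally bad prompt ###@@@"]
results = [run_with_retries(fake_model, t) for t in tasks]
for r in results:
    print(r["status"], r["error"])
```

The `status` and `error` keys mirror the column names this notebook reads from `result_robust`.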


In [27]:
# Run
result_robust = robust_pipeline.run(use_cache=False)

print("\n📊 Results (including errors):")
display(result_robust[['task', 'status', 'error']])

2025-10-16 15:20:59 - SimplePipeline.robust-demo - INFO - Starting pipeline: robust-demo
2025-10-16 15:20:59 - SimplePipeline.robust-demo - INFO - Number of steps: 2
2025-10-16 15:20:59 - SimplePipeline.robust-demo - INFO - Executing generator step: load
2025-10-16 15:20:59 - SimplePipeline.robust-demo - INFO - Executing step: generate_code
Processing generate_code:   0%|          | 0/4 [31:56<?, ?it/s]


KeyboardInterrupt: 
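
This run was interrupted by hand after about 32 minutes with no rows processed, which suggests that bounding the time spent per call would be useful. The following sketch uses a worker thread with a timeout from `concurrent.futures`. `call_with_timeout` is a hypothetical helper, and whether `RobustOllamaStep` exposes a timeout option of its own is not shown in this notebook.

```python
import concurrent.futures
import time

def call_with_timeout(fn, arg, timeout_s):
    """Run fn(arg) in a worker thread; return (status, value_or_message)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, arg)
    try:
        return "success", future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "timeout", f"no response within {timeout_s}s"
    finally:
        # A running thread cannot be killed, so do not block waiting for it.
        executor.shutdown(wait=False)

def slow_call(prompt):
    time.sleep(0.3)  # stands in for a model that is slow to respond
    return prompt.upper()

print(call_with_timeout(slow_call, "hello", timeout_s=0.05)[0])  # timeout
print(call_with_timeout(slow_call, "hello", timeout_s=2.0))      # ('success', 'HELLO')
```

The abandoned call keeps running in the background until it finishes, so a timeout set on the HTTP client itself is usually preferable when one is available.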

In [24]:
# Show only successful results
successful = result_robust[result_robust['status'] == 'success']
print(f"\n✅ {len(successful)} of {len(result_robust)} succeeded")
display(successful[['task', 'code']])


✅ 0 of 4 succeeded


|   | task | code |
|---|------|------|


## 5. Results Analysis

A closer look at the generated results.

In [25]:
# Basic pipeline statistics
print("📊 Basic Pipeline Analysis:")
print(f"Total rows: {len(result)}")
print(f"Columns: {list(result.columns)}")

# Explanation length (raw text, including any <think> reasoning the model emitted)
result['explanation_length'] = result['explanation'].str.len()
print(f"\nAverage explanation length: {result['explanation_length'].mean():.0f} characters")

# Visualize (requires matplotlib: pip install matplotlib)
import matplotlib.pyplot as plt

result['explanation_length'].plot(kind='bar', title='Explanation Length')
plt.ylabel('Characters')
plt.xlabel('Concept')
plt.xticks(range(len(result)), result['concept'], rotation=45)
plt.tight_layout()
plt.show()

📊 Basic Pipeline Analysis:
Total rows: 3
Columns: ['concept', 'audience', 'explanation', 'model_name']

Average explanation length: 539 characters


ModuleNotFoundError: No module named 'matplotlib'
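
The plot failed because matplotlib is not installed in this environment. One option is to check for the package first and fall back to a plain-text chart. The sketch below does that with the standard library only. `text_bars` is a hypothetical helper, and the lengths are illustrative placeholders rather than values from the run.

```python
import importlib.util

def text_bars(labels, values, width=40):
    """Render a horizontal bar chart as a list of plain-text lines."""
    largest = max(values) if values else 0
    lines = []
    for label, value in zip(labels, values):
        bar_len = round(width * value / largest) if largest else 0
        lines.append(f"{label:<20} {'#' * bar_len} {value}")
    return lines

# Illustrative lengths only; these are not measurements from the notebook.
labels = ["Machine Learning", "Blockchain", "Quantum Computing"]
lengths = [500, 250, 1000]

if importlib.util.find_spec("matplotlib") is None:
    print("matplotlib is not installed; showing a text chart instead")
for line in text_bars(labels, lengths):
    print(line)
```

With the real data, the call would be `text_bars(result['concept'], result['explanation_length'])`.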

## 🎯 Conclusion

You have learned how to:
- ✅ Build basic pipelines with SimplePipeline
- ✅ Use filters and transformations
- ✅ Handle errors robustly
- ✅ Analyze results

### 📚 Next steps:
1. Experiment with different Ollama models
2. Write your own custom steps
3. Try larger datasets
4. Explore caching to speed up iterations
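
As background for the caching step, here is a minimal sketch of the general idea: key each request by a hash of everything that affects the output, and reuse the stored result on a repeat. This is not how Simple Pipeline's own cache works, which this notebook does not show. `cache_key`, `cached_generate` and `fake_generate` are hypothetical names.

```python
import hashlib
import json

_cache = {}

def cache_key(model_name, prompt, generation_kwargs):
    """Build a stable key from everything that affects the output."""
    payload = json.dumps(
        {"model": model_name, "prompt": prompt, "kwargs": generation_kwargs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_generate(generate, model_name, prompt, generation_kwargs):
    """Return the stored result when the same request was seen before."""
    key = cache_key(model_name, prompt, generation_kwargs)
    if key not in _cache:
        _cache[key] = generate(prompt)
    return _cache[key]

calls = []

def fake_generate(prompt):
    """Stand-in for a model call; records each real invocation."""
    calls.append(prompt)
    return prompt[::-1]

kwargs = {"temperature": 0.7, "num_predict": 100}
first = cached_generate(fake_generate, "deepseek-r1:8b", "Explain recursion", kwargs)
second = cached_generate(fake_generate, "deepseek-r1:8b", "Explain recursion", kwargs)
print(first == second, len(calls))  # True 1
```

Including the model name and generation settings in the key prevents a cached answer from one configuration being returned for another.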

## 🧹 Cleanup

Optional: clear the pipeline caches.

In [None]:
# Uncomment to clear the cache
# pipeline.clear_cache()
# transform_pipeline.clear_cache()
# robust_pipeline.clear_cache()
# print("✅ Cache cleared")