# Unit 5 - Example 16: Production Pipelines

## 📚 Learning Objectives

By completing this notebook, you will:
- Build a scikit-learn `Pipeline` that chains feature scaling and a regression model
- Add logging and error handling around a pipeline run
- Save run metadata (version, timestamp, metrics, features) to a JSON file

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 05, Unit 5** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---



## 🔗 Solving the Problem from Example 15

**Remember the dead end from Example 15?**
- We learned RAPIDS for GPU-accelerated workflows
- But we needed to put these workflows into production
- We needed automation, scheduling, and monitoring

**This notebook addresses the production deployment problem from Example 15.** We'll cover:
- **Production pipeline design**
- **Automation and scheduling**
- **Error handling and monitoring**
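
The automation and error-handling goals above can be sketched as a small retry wrapper. This is a minimal illustration under stated assumptions, not part of the course material: `run_with_retries`, `max_attempts`, and `delay_seconds` are hypothetical names, and `flaky_step` stands in for any pipeline stage that might fail transiently.

```python
import logging
import time

logger = logging.getLogger("pipeline_retry_demo")


def run_with_retries(task, max_attempts=3, delay_seconds=0.0):
    """Call task() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration: logs every failure and
    re-raises the last exception if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            logger.info("Task succeeded on attempt %d", attempt)
            return result
        except Exception:
            logger.exception("Attempt %d of %d failed", attempt, max_attempts)
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


# Stand-in for a pipeline stage that fails twice before succeeding.
calls = {"count": 0}


def flaky_step():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_step, max_attempts=3)
print(result, calls["count"])
```

A scheduler (cron, or a workflow tool) would typically call an entry point like this one. Re-raising after the final attempt keeps a persistent failure visible to whatever is monitoring the job.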


In [1]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import logging
import json
from datetime import datetime

# Configure logging
# Note: Using just 'pipeline.log' since notebook runs from examples/ directory
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s',
                    handlers=[logging.FileHandler('pipeline.log'), logging.StreamHandler()])
logger = logging.getLogger(__name__)

print("=" * 70)
print("Example 16: Production Pipelines")
print("=" * 70)
print("\n📚 Prerequisites: Examples 14-15 completed, pipeline knowledge")
print("🔗 This is the THIRD example in Unit 5 - production pipelines")
print("🎯 Goal: Master building production-ready ML pipelines")
print("Reference: Study 16.pdf before running this code example.\n")


Example 16: Production Pipelines

📚 Prerequisites: Examples 14-15 completed, pipeline knowledge
🔗 This is the THIRD example in Unit 5 - production pipelines
🎯 Goal: Master building production-ready ML pipelines
Reference: Study 16.pdf before running this code example.



## 1. Create Sample Data


In [2]:
print("\n1. Creating Sample Data")
print("-" * 70)
np.random.seed(42)
n_samples = 1000
data = {
    'feature1': np.random.randn(n_samples),
    'feature2': np.random.randn(n_samples),
    'feature3': np.random.randn(n_samples),
    'target': np.random.randn(n_samples),
}
df = pd.DataFrame(data)
# Introduce some missing values: 25 in feature1 and 25 in feature2
missing_indices = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_indices[:25], 'feature1'] = np.nan
df.loc[missing_indices[25:], 'feature2'] = np.nan
print(f"✓ Created dataset with {len(df)} rows")
print(f"✓ Missing values: {df.isnull().sum().sum()}")


1. Creating Sample Data
----------------------------------------------------------------------
✓ Created dataset with 1000 rows
✓ Missing values: 50
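
In a production setting, incoming data is usually checked before it reaches the model. The sketch below shows one way to do that for a DataFrame like the one above. `validate_input`, its parameters, and the 10% missing-value threshold are illustrative assumptions rather than anything prescribed by the course.

```python
import numpy as np
import pandas as pd


def validate_input(df, required_columns, max_missing_fraction=0.1):
    """Return a list of problems found in df (an empty list means OK).

    Hypothetical helper: the threshold is an illustrative choice.
    """
    problems = []
    missing_cols = [c for c in required_columns if c not in df.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
    for col in required_columns:
        if col in df.columns:
            frac = df[col].isna().mean()  # fraction of NaN values in this column
            if frac > max_missing_fraction:
                problems.append(f"{col}: {frac:.1%} missing exceeds limit")
    return problems


rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature1": rng.normal(size=100),
                     "feature2": rng.normal(size=100)})
demo.loc[:49, "feature2"] = np.nan  # 50 of 100 values missing

problems = validate_input(demo, ["feature1", "feature2", "feature3"])
print(problems)
```

Returning a list of problems, rather than raising on the first one, lets the caller log every issue in a single run before deciding whether to abort.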


## 2. Build Production Pipeline


In [4]:
print("\n\n2. Building Production Pipeline")
print("-" * 70)
try:
    logger.info("Starting pipeline execution")
    # Prepare data
    X = df[['feature1', 'feature2', 'feature3']]
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    logger.info(f"Train set: {len(X_train)} samples, Test set: {len(X_test)} samples")
    # Impute missing values with training-set means only, so that no
    # information from the test set leaks into preprocessing
    train_means = X_train.mean()
    X_train = X_train.fillna(train_means)
    X_test = X_test.fillna(train_means)
    # Safety net: a column that is entirely NaN has a NaN mean, so fall back to 0
    if X_train.isnull().sum().sum() > 0 or X_test.isnull().sum().sum() > 0:
        logger.warning("Some NaN values remain after fillna, filling with 0")
        X_train = X_train.fillna(0)
        X_test = X_test.fillna(0)
    # Create pipeline: standardize features, then fit a linear model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ])
    # Train pipeline
    logger.info("Training pipeline...")
    pipeline.fit(X_train, y_train)
    # Make predictions
    y_pred = pipeline.predict(X_test)
    # Evaluate
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    logger.info("Pipeline trained successfully")
    logger.info(f"MSE: {mse:.4f}, R²: {r2:.4f}")
    print("\n✓ Pipeline executed successfully")
    print(f"MSE: {mse:.4f}, R² Score: {r2:.4f}")
except Exception as e:
    # Log the full traceback, then re-raise so the failure is not hidden
    logger.error(f"Pipeline execution failed: {e}", exc_info=True)
    raise


2026-01-15 23:08:32,689 - INFO - Starting pipeline execution
2026-01-15 23:08:32,693 - INFO - Train set: 800 samples, Test set: 200 samples
2026-01-15 23:08:32,694 - INFO - Training pipeline...
2026-01-15 23:08:32,696 - INFO - Pipeline trained successfully
2026-01-15 23:08:32,696 - INFO - MSE: 1.0702, R²: -0.0327

2. Building Production Pipeline
----------------------------------------------------------------------

✓ Pipeline executed successfully
MSE: 1.0702, R² Score: -0.0327

**Note:** The slightly negative R² is expected here. The target is random noise generated independently of the features, so the model has nothing to learn and does no better than predicting the mean.
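
The cell above imputes missing values by hand before the data reaches the pipeline. An alternative, sketched below, is to make imputation a pipeline step with scikit-learn's `SimpleImputer`, so the learned means travel with the fitted pipeline. The variable names (`X_demo`, `imputing_pipeline`, and so on) and the synthetic data are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)),
                      columns=["feature1", "feature2", "feature3"])
y_demo = pd.Series(rng.normal(size=200), name="target")
X_demo.loc[X_demo.index[:10], "feature1"] = np.nan  # inject missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

imputing_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # means learned from fit data only
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
imputing_pipeline.fit(X_tr, y_tr)
preds = imputing_pipeline.predict(X_te)  # any NaNs in X_te are imputed by the pipeline

# The imputer's learned statistics match the training-set column means.
learned_means = imputing_pipeline.named_steps["imputer"].statistics_
means_match = np.allclose(learned_means, X_tr.mean().to_numpy())
print(means_match)
```

With this design, new data containing missing values can be passed straight to `predict`, and there is no separate preprocessing code to keep in sync with the model.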


## 3. Save Pipeline Metadata


In [6]:
print("\n\n3. Saving Pipeline Metadata")
print("-" * 70)
metadata = {
    'pipeline_version': '1.0',
    'created_at': datetime.now().isoformat(),
    'train_samples': len(X_train),
    'test_samples': len(X_test),
    'metrics': {
        'mse': float(mse),
        'r2': float(r2),
    },
    'features': list(X.columns),
}
with open('pipeline_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print("✓ Pipeline metadata saved")



3. Saving Pipeline Metadata
----------------------------------------------------------------------
✓ Pipeline metadata saved
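
The metadata file records what was trained, but not the fitted pipeline itself. One common way to persist a scikit-learn pipeline is `joblib`, sketched below under stated assumptions: the file names, the `model_file` metadata key, and the toy data are all hypothetical and only illustrate pairing a model file with its metadata.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = rng.normal(size=50)

fitted = Pipeline([("scaler", StandardScaler()),
                   ("model", LinearRegression())]).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "pipeline_v1.joblib"  # hypothetical file name
    meta_path = Path(tmp) / "pipeline_v1.json"     # hypothetical file name

    joblib.dump(fitted, model_path)
    meta_path.write_text(json.dumps({"pipeline_version": "1.0",
                                     "model_file": model_path.name}))

    # Later (for example in a serving process): read the metadata,
    # then load the model file it points to.
    meta = json.loads(meta_path.read_text())
    restored = joblib.load(Path(tmp) / meta["model_file"])

# The restored pipeline reproduces the original predictions.
same = np.allclose(fitted.predict(X_small), restored.predict(X_small))
print(same)
```

Because `joblib` files are pickle-based, they should only be loaded from trusted sources, and ideally with the same library versions that wrote them.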


## 4. Summary


In [7]:
print("\n" + "=" * 70)
print("Summary")
print("=" * 70)
print("\nKey Concepts Covered:")
print("1. Pipeline design and structure")
print("2. Error handling and logging")
print("3. Metadata and versioning")
print("4. Production best practices")
print("\nNext Steps: Continue to Example 17 for Performance Optimization")


Summary

Key Concepts Covered:
1. Pipeline design and structure
2. Error handling and logging
3. Metadata and versioning
4. Production best practices

Next Steps: Continue to Example 17 for Performance Optimization


## 🚫 When Production Pipelines Hit a Dead End

**BEFORE**: We've learned to build production pipelines.

**AFTER**: We discover that a pipeline can work correctly and still be slow, so we need optimization.

**Why this matters**: Production pipelines must be fast and efficient, which makes optimization essential.

---

### The Problem We've Discovered

We've learned:
- ✅ How to build production pipelines
- ✅ How to handle errors and logging
- ✅ How to save metadata and version information

**But we have a problem:**
- ❓ **What if the pipeline is too slow?**
- ❓ **What if we need to optimize performance?**
- ❓ **What if we need to reduce resource usage?**

**The Dead End:**
- Pipelines work correctly
- But they may be slow or inefficient
- We need performance optimization techniques

---

### Demonstrating the Problem

Let's see why optimization is needed:


In [8]:
print("\n" + "=" * 70)
print("🚫 DEMONSTRATING THE DEAD END: Need for Performance Optimization")
print("=" * 70)

import time

print("\n📊 Current Pipeline Performance:")
print("   ✓ Pipeline works correctly")
print("   ✓ Error handling in place")
print("   ✓ Logging and metadata saved")

# Simulate slow pipeline
print("\n⚠️  Performance Issues:")
print("   - Pipeline execution time: May be slow for large datasets")
print("   - Resource usage: May consume too much memory/CPU")
print("   - Scalability: May not scale well with data size")

# Time a simple operation to show potential slowness
start_time = time.time()
# Simulate some processing
time.sleep(0.1)  # Simulate processing time
processing_time = time.time() - start_time

print("\n💡 The Problem:")
print("   - Pipelines work, but may be inefficient")
print("   - Need to optimize:")
print("     • Reduce execution time")
print("     • Reduce memory usage")
print("     • Improve scalability")
print("     • Optimize data processing steps")

print("\n📋 Optimization Needs:")
print("   1. Code optimization: Faster algorithms, vectorization")
print("   2. Memory optimization: Reduce memory footprint")
print("   3. Parallel processing: Use multiple cores/GPUs")
print("   4. Caching: Avoid redundant computations")
print("   5. Profiling: Identify bottlenecks")

print("\n➡️  Solution Needed:")
print("   - We need performance optimization techniques")
print("   - We need profiling tools to find bottlenecks")
print("   - We need optimization strategies")
print("   - This leads us to Example 17: Performance Optimization")

print("\n" + "=" * 70)



🚫 DEMONSTRATING THE DEAD END: Need for Performance Optimization

📊 Current Pipeline Performance:
   ✓ Pipeline works correctly
   ✓ Error handling in place
   ✓ Logging and metadata saved

⚠️  Performance Issues:
   - Pipeline execution time: May be slow for large datasets
   - Resource usage: May consume too much memory/CPU
   - Scalability: May not scale well with data size

💡 The Problem:
   - Pipelines work, but may be inefficient
   - Need to optimize:
     • Reduce execution time
     • Reduce memory usage
     • Improve scalability
     • Optimize data processing steps

📋 Optimization Needs:
   1. Code optimization: Faster algorithms, vectorization
   2. Memory optimization: Reduce memory footprint
   3. Parallel processing: Use multiple cores/GPUs
   4. Caching: Avoid redundant computations
   5. Profiling: Identify bottlenecks

➡️  Solution Needed:
   - We need performance optimization techniques
   - We need profiling tools to find bottlenecks
   - We need optimization strategies
   - This leads us to Example 17: Performance Optimization
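
The cell above only simulates slowness with `time.sleep`. To see how fit time actually changes with data size, a measurement like the following sketch can help. `time_fit` and the row counts are hypothetical choices, and the timings will differ from machine to machine.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def time_fit(n_rows, n_features=3, seed=0):
    """Return seconds taken to fit a scaler + linear model on random data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = rng.normal(size=n_rows)
    pipe = Pipeline([("scaler", StandardScaler()),
                     ("model", LinearRegression())])
    start = time.perf_counter()  # monotonic, high-resolution clock
    pipe.fit(X, y)
    return time.perf_counter() - start


timings = {n: time_fit(n) for n in (1_000, 10_000, 100_000)}
for n, seconds in timings.items():
    print(f"{n:>7} rows: {seconds:.4f} s")
```

`time.perf_counter` is preferred over `time.time` for measuring short intervals because it is monotonic and not affected by system clock adjustments.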

### What We Need Next

**The Solution**: We need performance optimization:
- **Code optimization**: Faster algorithms, vectorization, efficient data structures
- **Memory optimization**: Reduce memory footprint, efficient data types
- **Parallel processing**: Use multiple cores/GPUs effectively
- **Caching**: Avoid redundant computations
- **Profiling**: Identify and fix bottlenecks

**This dead end leads us to Example 17: Performance Optimization**
- Example 17 will teach us optimization techniques
- We'll learn profiling and bottleneck identification
- This solves the performance problem for production pipelines!
