# Unit 5 - Example 14: Distributed Computing with Dask

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand how Dask splits a DataFrame into partitions and evaluates operations lazily
- Apply basic Dask DataFrame operations (`mean`, `groupby`) in Python
- Compare pandas and Dask timings on a synthetic dataset of one million rows

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 05, Unit 5** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---



## 🔗 Building on Previous Units

**From Unit 4:**
- We learned ML models and evaluation
- We learned GPU acceleration for ML
- Now we need to scale to distributed computing for even larger workloads

**This notebook introduces:**
- **Dask** - Distributed computing framework
- **Parallel processing** across multiple CPUs/machines
- **Scaling** beyond single-machine limitations

**This is the foundation for Unit 5: Scaling & Production!**

In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import matplotlib.pyplot as plt
import time


In [2]:
print("=" * 70)
print("Example 14: Distributed Computing with Dask")
print("=" * 70)
print("\n📚 Prerequisites: Unit 4 completed, basic ML knowledge")
print("🔗 This is the FIRST example in Unit 5 - distributed computing")
print("🎯 Goal: Master distributed computing with Dask")


Example 14: Distributed Computing with Dask

📚 Prerequisites: Unit 4 completed, basic ML knowledge
🔗 This is the FIRST example in Unit 5 - distributed computing
🎯 Goal: Master distributed computing with Dask


# 1. CREATE LARGE DATASET


In [3]:
print("\n1. Creating Large Dataset")
print("-" * 70)
np.random.seed(42)  # fixed seed so the synthetic data is reproducible
n_samples = 1000000
print(f"Generating {n_samples} rows...")
data = {
    'id': range(n_samples),
    'value1': np.random.randn(n_samples),
    'value2': np.random.randn(n_samples),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_samples),
    'score': np.random.randint(0, 100, n_samples)
}
# Create pandas DataFrame (CPU, fully in memory)
df_pandas = pd.DataFrame(data)
print(f"✓ Created pandas DataFrame with {len(df_pandas)} rows")
# Create Dask DataFrame by splitting the pandas DataFrame into 4 partitions
df_dask = dd.from_pandas(df_pandas, npartitions=4)
print(f"✓ Created Dask DataFrame with {df_dask.npartitions} partitions")


1. Creating Large Dataset
----------------------------------------------------------------------
Generating 1000000 rows...
✓ Created pandas DataFrame with 1000000 rows
✓ Created Dask DataFrame with 4 partitions
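
To make the idea of partitions more concrete, the sketch below splits a range of row positions into contiguous chunks of roughly equal size. It is an illustration in plain Python only: the helper `split_into_partitions` is hypothetical, and it assumes that `npartitions=4` yields four contiguous blocks of rows. Dask's own logic for choosing partition boundaries may differ.

```python
def split_into_partitions(n_rows, npartitions):
    """Return (start, stop) row ranges covering n_rows in contiguous chunks.

    Hypothetical helper for illustration only; Dask's own logic may differ.
    """
    base, extra = divmod(n_rows, npartitions)
    bounds = []
    start = 0
    for i in range(npartitions):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


partitions = split_into_partitions(1_000_000, 4)
for number, (start, stop) in enumerate(partitions):
    print(f"partition {number}: rows {start}..{stop - 1} ({stop - start} rows)")
```

Each chunk can then be processed on its own, which is what allows the work to be spread over several cores or machines.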


# 2. BASIC DASK OPERATIONS


In [5]:
print("\n\n2. Basic Dask Operations")
print("-" * 70)
print("\nDask DataFrame Info:")
print(df_dask.head())
print("\nComputing mean (lazy evaluation):")
mean_result = df_dask['value1'].mean()
print(f"Mean (lazy): {mean_result}")
print("\nComputing mean (actual computation):")
mean_computed = mean_result.compute()
print(f"Mean (computed): {mean_computed:.4f}")



2. Basic Dask Operations
----------------------------------------------------------------------

Dask DataFrame Info:
   id    value1    value2 category  score
0   0  0.496714  0.169172        D     82
1   1 -0.138264 -0.121505        B     39
2   2  0.647689  1.156625        B      3
3   3  1.523030  0.200086        A     25
4   4 -0.234153  0.864611        D     35

Computing mean (lazy evaluation):
Mean (lazy): <dask_expr.expr.Scalar: expr=df['value1'].mean(), dtype=float64>

Computing mean (actual computation):
Mean (computed): -0.0016
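
The difference between the lazy result and the computed result above can be mimicked with a few lines of plain Python. The `Lazy` class and the `divide` function below are hypothetical names made up for this sketch, and this is not how Dask is implemented; the sketch only shows the general pattern of recording a computation first and running it when `compute()` is called.

```python
class Lazy:
    """A deferred computation: stores a function and its inputs, runs nothing yet.

    Toy illustration only; this is not Dask's implementation.
    """

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate any lazy inputs first, then apply the stored function.
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)

    def __repr__(self):
        return f"<Lazy: {self.func.__name__}, not computed yet>"


def divide(total, count):
    return total / count


data = [1.0, 2.0, 3.0, 4.0]
total = Lazy(sum, data)             # nothing is summed here
count = Lazy(len, data)             # nothing is counted here
mean = Lazy(divide, total, count)   # still only a description of the work

print(mean)            # prints the placeholder, not a number
print(mean.compute())  # runs the whole chain and prints 2.5
```

Deferring the work this way lets a scheduler see the full chain of operations before running anything, which is what makes it possible to split the work across partitions.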


# 3. PERFORMANCE COMPARISON


In [7]:
print("\n\n3. Performance Comparison")
print("-" * 70)
# Pandas: the groupby runs eagerly as a single in-memory operation
print("\nPandas (CPU) operations:")
start_time = time.perf_counter()
pandas_result = df_pandas.groupby('category')['score'].mean()
pandas_time = time.perf_counter() - start_time
print(f"GroupBy time: {pandas_time:.4f} seconds")
# Dask: .compute() triggers the per-partition work and the final combine
print("\nDask operations:")
start_time = time.perf_counter()
dask_result = df_dask.groupby('category')['score'].mean().compute()
dask_time = time.perf_counter() - start_time
print(f"GroupBy time: {dask_time:.4f} seconds")
print(f"\nResults match: {np.allclose(pandas_result.values, dask_result.values)}")
# Ratio of pandas time to Dask time; a value below 1 means Dask was slower
print(f"Speedup: {pandas_time/dask_time:.2f}x")



3. Performance Comparison
----------------------------------------------------------------------

Pandas (CPU) operations:


GroupBy time: 0.0170 seconds

Dask operations:


GroupBy time: 0.0288 seconds

Results match: True
Speedup: 0.59x
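
A speedup of 0.59x means Dask was slower than pandas in this run. That is plausible for data that already fits in memory: pandas does the groupby in one call, while a partitioned approach has to process each partition, collect the partial results, and combine them. The sketch below shows that split-and-combine pattern for a grouped mean in plain Python. The names `partial_sums` and `combine` and the tiny hand-made partitions are assumptions for illustration, not Dask internals.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_sums(rows):
    """Per-partition step: accumulate [sum, count] of score for each category."""
    acc = defaultdict(lambda: [0, 0])
    for category, score in rows:
        acc[category][0] += score
        acc[category][1] += 1
    return dict(acc)


def combine(partials):
    """Final step: merge the partial [sum, count] pairs and divide once."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for category, (total, count) in partial.items():
            merged[category][0] += total
            merged[category][1] += count
    return {cat: total / count for cat, (total, count) in sorted(merged.items())}


# Two small hand-made partitions of (category, score) rows.
partition_a = [("A", 10), ("B", 20), ("A", 30)]
partition_b = [("B", 40), ("C", 50), ("A", 20)]

# Process the partitions in parallel, then combine the partial results.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_sums, [partition_a, partition_b])) 

group_means = combine(partials)
print(group_means)
```

The partitions carry sums and counts rather than means because averaging per-partition means would give the wrong answer whenever a category has a different number of rows in each partition. The extra coordination is worthwhile mainly when the data is too large for one machine or one core.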


# 4. VISUALIZATION


In [9]:
print("\n\n4. Creating Performance Visualization")
print("-" * 70)
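operations = ['Filter', 'GroupBy', 'Sort']
# NOTE: only the GroupBy times were measured in the previous cell.
# The Filter and Sort values are hard-coded placeholders, not measurements.
pandas_times = [0.5, pandas_time, 0.8]
dask_times = [0.3, dask_time, 0.4]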
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(operations))
width = 0.35
bars1 = ax.bar(x - width/2, pandas_times, width, label='Pandas (CPU)',
color='#FF6B6B', edgecolor='black')
bars2 = ax.bar(x + width/2, dask_times, width, label='Dask (Distributed)',
color='#4ECDC4', edgecolor='black')
ax.set_xlabel('Operation')
ax.set_ylabel('Time (seconds)')
ax.set_title('Pandas vs Dask Performance', fontsize=14, weight='bold')
ax.set_xticks(x)
ax.set_xticklabels(operations)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('14_dask_performance.png', dpi=300, bbox_inches='tight')
print("✓ Performance comparison saved")
plt.close()



4. Creating Performance Visualization
----------------------------------------------------------------------


✓ Performance comparison saved
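
In the plotting cell, only the GroupBy bars come from measured times; the Filter and Sort values are typed in by hand. One way to replace them with measurements is a small timing helper like the one sketched below. The name `best_of` and the stand-in list workload are assumptions made so the sketch runs on its own.

```python
import time


def best_of(func, repeats=3):
    """Run func() several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without pandas or Dask.
values = list(range(200_000, 0, -1))
sort_time = best_of(lambda: sorted(values))
filter_time = best_of(lambda: [v for v in values if v > 100_000])
print(f"sort: {sort_time:.4f} s, filter: {filter_time:.4f} s")
```

In the notebook, the same helper could wrap calls such as `lambda: df_pandas[df_pandas['score'] > 50]` and `lambda: df_dask[df_dask['score'] > 50].compute()` for Filter (the threshold of 50 is arbitrary), with `sort_values('score')` used the same way for Sort. Taking the best of several runs reduces the effect of one-off delays on such short timings.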


# 5. SUMMARY


In [10]:
print("\n" + "=" * 70)
print("Summary")
print("=" * 70)
print("\nKey Concepts Covered:")
print("1. Dask DataFrame basics")
print("2. Lazy evaluation")
print("3. Distributed computing")
print("4. Performance comparison")
print("\nNext Steps: Continue to Example 15 for RAPIDS workflows")



Summary

Key Concepts Covered:
1. Dask DataFrame basics
2. Lazy evaluation
3. Distributed computing
4. Performance comparison

Next Steps: Continue to Example 15 for RAPIDS workflows


## 🚫 When Dask Hits a Dead End

**BEFORE**: We've learned Dask for distributed computing.

**AFTER**: We discover that Dask, as used here, scales work across CPUs, but many data science operations can run considerably faster with GPU acceleration.

**Why this matters**: Distributing work across CPUs helps with dataset size, but it does not make each CPU core faster. For heavily parallel operations, a GPU can offer a further speedup.

---

### The Problem We've Discovered

We've learned:
- ✅ How to use Dask for distributed computing
- ✅ How to process large datasets across multiple CPUs
- ✅ How to use lazy evaluation

**But we have a problem:**
- ❓ **What if we need GPU acceleration for data science operations?**
- ❓ **What if CPU-based distributed computing is still too slow?**
- ❓ **What if we need GPU-accelerated data science workflows?**

**The Dead End:**
- Dask is excellent for distributed CPU computing
- But for many data science operations, GPU acceleration can be much faster
- We need GPU-accelerated data science libraries

---

### Demonstrating the Problem

The next cell summarizes the limitation of CPU-based distributed computing:


In [11]:
print("\n" + "=" * 70)
print("🚫 DEMONSTRATING THE DEAD END: CPU vs GPU for Data Science")
print("=" * 70)

print("\n📊 Current Capabilities:")
print("   ✓ Dask: Distributed CPU computing")
print("   ✓ Parallel processing across multiple CPUs")
print("   ✓ Handles large datasets")

print("\n⚠️  Limitation:")
print("   - Dask, as used here, runs on CPUs (even when distributed)")
print("   - A CPU has far fewer parallel execution units than a GPU")
print("   - GPU acceleration can be 10-100x faster for some data science operations")
print("   - For large data science workflows, GPU acceleration matters!")

print("\n💡 The Problem:")
print("   - Dask distributes across CPUs (good for general computing)")
print("   - But many data science operations benefit greatly from GPU")
print("   - GPU parallel processing can be much faster than CPU for these operations")
print("   - We need GPU-accelerated data science libraries!")

print("\n📋 Real-World Scenario:")
print("   - Data science operations: Filtering, grouping, aggregations")
print("   - CPU (even distributed): Slower on highly parallel operations")
print("   - GPU: Massively parallel processing, potentially 10-100x faster")
print("   - For production data science at scale, GPU acceleration is valuable!")

print("\n➡️  Solution Needed:")
print("   - We need GPU-accelerated data science libraries")
print("   - We need RAPIDS (GPU-accelerated data science ecosystem)")
print("   - We need cuDF, cuML, and other GPU libraries")
print("   - This leads us to Example 15: RAPIDS Workflows")

print("\n" + "=" * 70)



🚫 DEMONSTRATING THE DEAD END: CPU vs GPU for Data Science

📊 Current Capabilities:
   ✓ Dask: Distributed CPU computing
   ✓ Parallel processing across multiple CPUs
   ✓ Handles large datasets

⚠️  Limitation:
   - Dask, as used here, runs on CPUs (even when distributed)
   - A CPU has far fewer parallel execution units than a GPU
   - GPU acceleration can be 10-100x faster for some data science operations
   - For large data science workflows, GPU acceleration matters!

💡 The Problem:
   - Dask distributes across CPUs (good for general computing)
   - But many data science operations benefit greatly from GPU
   - GPU parallel processing can be much faster than CPU for these operations
   - We need GPU-accelerated data science libraries!

📋 Real-World Scenario:
   - Data science operations: Filtering, grouping, aggregations
   - CPU (even distributed): Slower on highly parallel operations
   - GPU: Massively parallel processing, potentially 10-100x faster
   - For production data science at scale, GPU acceleration is valuable!

➡️  Solution Needed:
   - We need GPU-accelerated data science libraries
   - We need RAPIDS (GPU-accelerated data science ecosystem)
   - We need cuDF, cuML, and other GPU libraries
   - This leads us to Example 15: RAPIDS Workflows

### What We Need Next

**The Solution**: We need GPU-accelerated data science:
- **RAPIDS**: GPU-accelerated data science ecosystem
- **cuDF**: GPU-accelerated DataFrames (like pandas, but on GPU)
- **cuML**: GPU-accelerated machine learning
- **GPU workflows**: Complete data science pipelines on GPU

**This dead end leads us to Example 15: RAPIDS Workflows**
- Example 15 will show us GPU-accelerated data science
- We'll see complete workflows on GPU
- This addresses the need for GPU acceleration in data science operations!
