# Unit 5 - Example 07: Large Dataset Handling

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand why datasets that exceed available memory need chunked or streaming processing
- Apply chunked reading and incremental aggregation using Python and pandas
- Practice on a simulated one-million-row CSV file

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 05, Unit 5** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---



## 🔗 Solving the Problem from Example 06

**Remember the dead end from Example 06?**
- We learned performance optimization techniques
- But even optimized, very large datasets require special handling
- We needed strategies for handling massive datasets

**This notebook solves that problem!** We'll learn:
- **Large dataset handling strategies**
- **Chunking, streaming, and memory-efficient processing**
- **Techniques for datasets that don't fit in memory**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time


In [2]:
print("=" * 70)
print("Example 07: Large Dataset Handling")
print("=" * 70)
print("\n📚 Prerequisites: Examples 02-06 completed, memory management knowledge")
print("🔗 This is the FIFTH example in Unit 5 - large dataset handling")
print("🎯 Goal: Master processing large datasets efficiently")
print("Reference: Study 18.pdf before running this code example.\n")


======================================================================
Example 07: Large Dataset Handling
======================================================================

📚 Prerequisites: Examples 02-06 completed, memory management knowledge
🔗 This is the FIFTH example in Unit 5 - large dataset handling
🎯 Goal: Master processing large datasets efficiently
Reference: Study 18.pdf before running this code example.





# 1. CREATE SIMULATED LARGE DATASET




In [3]:
print("\n1. Simulating Large Dataset")
print("-" * 70)
# Create large CSV file in chunks
np.random.seed(42)
chunk_size = 100000
total_rows = 1000000
n_chunks = total_rows // chunk_size
print(f"Creating {total_rows:,} rows in {n_chunks} chunks...")

large_file = 'large_dataset.csv'
# Remove existing file if it exists to ensure clean creation
import os
if os.path.exists(large_file):
    os.remove(large_file)
    print(f"Removed existing {large_file} to recreate with proper headers")

for i in range(n_chunks):
    # Build one chunk at a time so the full dataset is never held in memory
    chunk_data = {
        'id': range(i * chunk_size, (i + 1) * chunk_size),
        'value1': np.random.randn(chunk_size),
        'value2': np.random.randn(chunk_size),
        'category': np.random.choice(['A', 'B', 'C'], chunk_size),
        'score': np.random.randint(0, 100, chunk_size)
    }
    chunk_df = pd.DataFrame(chunk_data)

    # Write the first chunk with a header, then append the rest without one
    mode = 'w' if i == 0 else 'a'
    header = (i == 0)
    chunk_df.to_csv(large_file, mode=mode, header=header, index=False)

    if (i + 1) % 2 == 0 or i == 0:
        print(f"  Created chunk {i+1}/{n_chunks}...")

print(f"\n✓ Created large CSV file: {large_file} ({total_rows:,} rows)")
# Verify the file was created correctly by reading only the first rows
verify_df = pd.read_csv(large_file, nrows=5)
print(f"✓ Verified: CSV has columns: {list(verify_df.columns)}")
print(f"✓ Sample data:\n{verify_df.head()}")


1. Simulating Large Dataset
----------------------------------------------------------------------
Creating 1,000,000 rows in 10 chunks...
Removed existing large_dataset.csv to recreate with proper headers
  Created chunk 1/10...
  Created chunk 2/10...
  Created chunk 4/10...
  Created chunk 6/10...
  Created chunk 8/10...
  Created chunk 10/10...

✓ Created large CSV file: large_dataset.csv (1,000,000 rows)
✓ Verified: CSV has columns: ['id', 'value1', 'value2', 'category', 'score']
✓ Sample data:
   id    value1    value2 category  score
0   0  0.496714  1.030595        A     34
1   1 -0.138264 -1.155355        C     83
2   2  0.647689  0.575437        A     75
3   3  1.523030 -0.619238        C     90
4   4 -0.234153 -0.327403        A      6
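
Reading the first five rows confirms the columns but not the row count. One way to check the size of a file without loading it is to stream it line by line, as in the minimal sketch below. The helper `count_data_rows` and the temporary demo file are illustrative assumptions, and the count assumes no field contains an embedded newline.

```python
import os
import tempfile


def count_data_rows(path):
    """Count rows after the header by streaming the file one line at a time."""
    with open(path, "r", encoding="utf-8") as f:
        next(f, None)  # skip the header line (if the file is not empty)
        return sum(1 for _ in f)


# Hypothetical small file standing in for the large CSV
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("id,score\n")
        for i in range(1000):
            f.write(f"{i},{i % 100}\n")

    n_rows = count_data_rows(demo_path)
    size_bytes = os.path.getsize(demo_path)

print(f"Rows: {n_rows:,}, size on disk: {size_bytes:,} bytes")
```

Only one line is held in memory at a time, so the same check should scale to files far larger than available RAM.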




In [4]:
# 2. PROCESSING IN CHUNKS




In [5]:
print("\n\n2. Processing in Chunks")
print("-" * 70)

# Verify the file exists and has the expected columns before processing
import os
if not os.path.exists(large_file):
    print(f"Error: {large_file} not found. Please run the previous cell first to create the file.")
else:
    # Read only the first few rows to check the structure
    sample_df = pd.read_csv(large_file, nrows=5)
    print(f"File exists. Columns in CSV: {list(sample_df.columns)}")
    if 'category' not in sample_df.columns or 'score' not in sample_df.columns:
        print("Warning: Required columns not found. Re-run the previous cell to recreate the file.")

results = []
chunk_processing_times = []

# Read the CSV in chunks; only one chunk is held in memory at a time
start_total = time.time()
chunk_reader = pd.read_csv(large_file, chunksize=chunk_size)
for i, chunk_df in enumerate(chunk_reader, 1):
    chunk_start = time.time()
    
    # Verify columns exist
    if 'category' not in chunk_df.columns or 'score' not in chunk_df.columns:
        print(f"Warning: Chunk {i} missing required columns. Available columns: {list(chunk_df.columns)}")
        # Try to fix column names (remove whitespace)
        chunk_df.columns = chunk_df.columns.str.strip()
        if 'category' not in chunk_df.columns or 'score' not in chunk_df.columns:
            print(f"Skipping chunk {i} due to missing columns")
            continue
    
    # Process chunk
    chunk_result = chunk_df.groupby('category')['score'].mean()
    results.append(chunk_result)
    chunk_time = time.time() - chunk_start
    chunk_processing_times.append(chunk_time)
    
    if i % 5 == 0:
        print(f"Processed chunk {i}, time: {chunk_time:.4f}s")

total_time = time.time() - start_total

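# Combine results
# Note: averaging the per-chunk means gives every chunk equal weight, so the
# result is exact only when each category has the same number of rows in every
# chunk. Otherwise it is an approximation of the true per-category mean.
if results:
    final_result = pd.concat(results).groupby(level=0).mean()
    print(f"\n✓ Processed {len(results)} chunks in {total_time:.4f} seconds")
    print(f"✓ Average chunk processing time: {np.mean(chunk_processing_times):.4f} seconds")
    print("\nFinal aggregated result:")
    print(final_result)
else:
    print("\n✗ No chunks were processed successfully")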



2. Processing in Chunks
----------------------------------------------------------------------
File exists. Columns in CSV: ['id', 'value1', 'value2', 'category', 'score']
Processed chunk 5, time: 0.0018s
Processed chunk 10, time: 0.0016s

✓ Processed 10 chunks in 0.1866 seconds
✓ Average chunk processing time: 0.0018 seconds

Final aggregated result:
category
A    49.489571
B    49.472483
C    49.501032
Name: score, dtype: float64
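
Averaging per-chunk means gives every chunk equal weight, which matches the true mean only when group sizes are identical across chunks. A minimal sketch of an exact alternative keeps a per-category sum and count for each chunk and divides once at the end. The helper name `grouped_mean_in_chunks` and the tiny in-memory CSV are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd


def grouped_mean_in_chunks(csv_source, key, value, chunksize):
    """Exact per-group mean from per-chunk sums and counts."""
    partials = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # One small row per group: the sum and the number of values in this chunk
        partials.append(chunk.groupby(key)[value].agg(["sum", "count"]))
    # Add up the partial sums and counts per group, then divide once
    totals = pd.concat(partials).groupby(level=0).sum()
    return totals["sum"] / totals["count"]


# Hypothetical data: group A has two rows in the first chunk and one in the second
demo_csv = "category,score\nA,10\nA,20\nB,30\nA,60\nB,50\n"

exact = grouped_mean_in_chunks(io.StringIO(demo_csv), "category", "score", chunksize=3)

# For comparison: the mean of per-chunk means, as in the cell above
chunk_means = [
    chunk.groupby("category")["score"].mean()
    for chunk in pd.read_csv(io.StringIO(demo_csv), chunksize=3)
]
mean_of_means = pd.concat(chunk_means).groupby(level=0).mean()

print("Exact:", exact.to_dict())
print("Mean of means:", mean_of_means.to_dict())
```

On this toy input the two methods disagree for category A (30.0 versus 37.5) because its rows are spread unevenly across chunks. The partial results stay small (one row per category per chunk), so the approach remains memory-efficient.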




In [6]:
# 3. MEMORY-EFFICIENT PROCESSING




In [7]:
print("\n\n3. Memory Efficient Processing")
print("-" * 70)
# Use iterator to process without loading all into memory
total_sum = 0
total_count = 0
print("Processing with iterator (memory-efficient)...")
start_time = time.time()

chunk_reader = pd.read_csv(large_file, chunksize=chunk_size)
chunk_num = 0
for chunk_df in chunk_reader:
    chunk_num += 1
    # Verify columns exist
    chunk_df.columns = chunk_df.columns.str.strip()  # Remove any whitespace
    if 'score' not in chunk_df.columns:
        print(f"Warning: Chunk {chunk_num} missing 'score' column. Available: {list(chunk_df.columns)}")
        continue
    
    # Accumulate a running sum and count; the mean is computed once at the end
    chunk_sum = chunk_df['score'].sum()
    chunk_count = len(chunk_df)
    total_sum += chunk_sum
    total_count += chunk_count

if total_count > 0:
    avg_score = total_sum / total_count
    iterator_time = time.time() - start_time
    print(f"Average score (computed incrementally): {avg_score:.2f}")
    print(f"Processing time: {iterator_time:.4f} seconds")
    print(f"Memory used: Minimal (one chunk at a time)")
    print(f"Processed {chunk_num} chunks")
else:
    print("Error: No valid chunks were processed. Please check the CSV file structure.")



3. Memory Efficient Processing
----------------------------------------------------------------------
Processing with iterator (memory-efficient)...


Average score (computed incrementally): 49.49
Processing time: 0.1678 seconds
Memory used: Minimal (one chunk at a time)
Processed 10 chunks
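
When only one column is needed, each chunk can be made smaller still by asking `pd.read_csv` to parse just that column and to use a narrower dtype. The sketch below shows the idea on a tiny in-memory CSV. The sample rows and the choice of `int32` are assumptions for illustration; pick a dtype that safely covers the real value range.

```python
import io

import pandas as pd

# Hypothetical sample with the same kind of columns as the large file
demo_csv = (
    "id,value1,category,score\n"
    "0,0.5,A,34\n"
    "1,-0.1,C,83\n"
    "2,0.6,A,75\n"
    "3,1.5,C,90\n"
)

total_sum = 0
total_count = 0

reader = pd.read_csv(
    io.StringIO(demo_csv),
    usecols=["score"],          # parse only the column we aggregate
    dtype={"score": "int32"},   # narrower than the default int64
    chunksize=2,
)
for chunk in reader:
    total_sum += int(chunk["score"].sum())
    total_count += len(chunk)

avg_score = total_sum / total_count
print(f"Average score: {avg_score:.2f} over {total_count} rows")
```

Skipping unused columns reduces both parsing time and per-chunk memory, which matters most when rows are wide.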




In [8]:
# 4. VISUALIZATION




In [9]:
print("\n\n4. Creating Visualization")
print("-" * 70)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Large Dataset Processing', fontsize=16, weight='bold')

# Left panel: time taken to process each chunk
mean_chunk_time = np.mean(chunk_processing_times)
axes[0].plot(range(1, len(chunk_processing_times) + 1), chunk_processing_times,
             marker='o', color='#4ECDC4', linewidth=2, markersize=4)
axes[0].axhline(y=mean_chunk_time, color='r', linestyle='--',
                label=f'Mean: {mean_chunk_time:.4f}s')
axes[0].set_xlabel('Chunk Number')
axes[0].set_ylabel('Processing Time (s)')
axes[0].set_title('Chunk Processing Time', fontsize=12, weight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right panel: memory comparison
# NOTE: these values are simulated placeholders for illustration, not measurements
methods = ['Load All', 'Chunking']
memory_usage = [500, 50]  # MB (simulated)
axes[1].bar(methods, memory_usage, color=['#FF6B6B', '#4ECDC4'], edgecolor='black')
axes[1].set_ylabel('Memory (MB)')
axes[1].set_title('Memory Usage Comparison (simulated)', fontsize=12, weight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('18_large_dataset.png', dpi=300, bbox_inches='tight')
print("✓ Visualization saved")
plt.close()



4. Creating Visualization
----------------------------------------------------------------------
✓ Visualization saved
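
The memory figures in the bar chart are simulated. To get real numbers, `DataFrame.memory_usage(deep=True)` reports the bytes held by each column. The sketch below compares a full frame, a single chunk-sized slice, and a copy with more compact dtypes. The row counts, column names, and dtype choices are assumptions for illustration, and the measured values will vary with the pandas version and platform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# Hypothetical frame shaped like the notebook's dataset
df = pd.DataFrame({
    "value1": rng.standard_normal(n_rows),
    "category": rng.choice(["A", "B", "C"], n_rows),
    "score": rng.integers(0, 100, n_rows),
})


def size_mb(frame):
    """Total memory of a DataFrame in megabytes, including string contents."""
    return frame.memory_usage(deep=True).sum() / 1e6


full_mb = size_mb(df)
chunk_mb = size_mb(df.iloc[:10_000])  # what one 10,000-row chunk would hold

# Compact dtypes: few distinct labels fit 'category'; scores 0-99 fit int8
compact = df.assign(
    category=df["category"].astype("category"),
    score=df["score"].astype("int8"),
)
compact_mb = size_mb(compact)

print(f"Full frame:      {full_mb:.2f} MB")
print(f"One chunk:       {chunk_mb:.2f} MB")
print(f"Compact dtypes:  {compact_mb:.2f} MB")
```

Measured values like these could replace the placeholder list in the chart once the same calculation is run against the real file.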




# 5. SUMMARY




In [10]:
print("\n" + "=" * 70)
print("Summary")
print("=" * 70)
print("\nKey Concepts Covered:")
print("1. Chunking strategies")
print("2. Streaming processing")
print("3. Memory-efficient operations")
print("4. Incremental aggregation")
print("\nNext Steps: Continue to Example 08 for Deployment")



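======================================================================
Summary
======================================================================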

Key Concepts Covered:
1. Chunking strategies
2. Streaming processing
3. Memory-efficient operations
4. Incremental aggregation

Next Steps: Continue to Example 08 for Deployment
