# Notebook 2: Advanced Dictionary Serialization & Performance

In Notebook 1, we covered the basics of serializing dictionaries using JSON, Pickle, YAML, and MessagePack. Now, let's tackle more complex scenarios and explore how to optimize serialization performance.

**Learning Objectives:**
*   Handle custom objects within dictionaries during JSON serialization.
*   Understand challenges with circular references.
*   Compare performance (speed, size) of different serialization formats.
*   Learn techniques like incremental serialization and compression.
*   See how serialization is used for caching.
*   Compare Pickle and HDF5 for specific high-performance use cases.

## Part 2: Advanced Dictionary Serialization Techniques

### Handling Complex Dictionary Structures

**1. Custom Objects in Dictionaries with JSON**

Standard JSON serializers (`json.dumps`) don't know how to handle custom Python objects or types like `datetime`. We need to provide custom encoders and decoders.

*   **Encoding:** Subclass `json.JSONEncoder` and override the `default` method.
*   **Decoding:** Use the `object_hook` parameter in `json.loads`.

In [None]:
import json
from datetime import datetime
from pprint import pprint

# Example dictionary with a datetime object (not directly JSON serializable)
event_dict = {
    'event_id': 'conf_2025',
    'name': 'Annual Tech Conference',
    'start_date': datetime(2025, 10, 20, 9, 0, 0),
    'attendees': ['Alice', 'Bob', 'Charlie']
}

print("Original Dictionary with datetime:")
pprint(event_dict)

# --- Attempt standard JSON serialization (will fail) ---
try:
    json.dumps(event_dict)
except TypeError as e:
    print(f"\nStandard JSON failed as expected: {e}")

# --- Custom Encoder --- 
class DateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            # Represent datetime as an ISO 8601 string
            return {'__datetime__': obj.isoformat()} # Add hint for decoder
        # Let the base class default method raise the TypeError for other types
        return super().default(obj)

print("\nSerializing with custom encoder...")
json_string_custom = json.dumps(event_dict, cls=DateTimeEncoder, indent=2)
print(json_string_custom)

# --- Custom Decoder (using object_hook) --- 
def datetime_decoder(dct):
    if '__datetime__' in dct:
        # Convert back to datetime object
        return datetime.fromisoformat(dct['__datetime__'])
    return dct # Return other dictionaries unchanged

print("\nDeserializing with custom decoder...")
reconstructed_custom = json.loads(json_string_custom, object_hook=datetime_decoder)
pprint(reconstructed_custom)

# Verify the datetime object was correctly reconstructed
assert isinstance(reconstructed_custom['start_date'], datetime)
assert event_dict == reconstructed_custom

print("\nSuccessfully serialized and deserialized dictionary with datetime object!")
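
The same encoder/decoder pattern extends from `datetime` to your own classes. The sketch below is illustrative only: `Speaker`, `SpeakerEncoder`, `speaker_decoder`, and the `__speaker__` tag are hypothetical names that do not appear elsewhere in this notebook, and a dataclass is assumed so that `asdict` can produce the payload.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Speaker:  # Hypothetical custom class, for illustration only
    name: str
    topic: str

class SpeakerEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Speaker):
            # Tag the payload so the decoder can recognize it later
            return {'__speaker__': asdict(obj)}
        # Let the base class raise TypeError for unsupported types
        return super().default(obj)

def speaker_decoder(dct):
    # object_hook is called for every decoded JSON object, innermost first
    if '__speaker__' in dct:
        return Speaker(**dct['__speaker__'])
    return dct

session = {'room': 'A1', 'speaker': Speaker('Alice', 'Serialization')}
encoded = json.dumps(session, cls=SpeakerEncoder)
decoded = json.loads(encoded, object_hook=speaker_decoder)

print(encoded)
print(decoded)
```

Because dataclasses compare by field values, `decoded == session` can be used as a round-trip check, much like the `assert` in the cell above.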

**2. Nested Dictionaries and Circular References**

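Dictionaries that contain references to themselves, or to other objects that refer back to them, create circular references. By default, `json.dumps` detects the cycle and raises `ValueError: Circular reference detected`. If the check is disabled with `check_circular=False`, the encoder recurses until Python raises a `RecursionError`.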

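*   **Pickle:** Handles circular references automatically.
*   **YAML:** Can represent shared and recursive structures with anchors and aliases, which PyYAML emits automatically.
*   **JSON/MessagePack:** Require custom logic to detect and handle cycles, often by replacing repeated object references with a special marker or ID.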

In [None]:
import json
import pickle

# Create a dictionary with a circular reference
circular_dict = {'name': 'self-referential', 'value': 42}
circular_dict['myself'] = circular_dict # The circular reference!

# --- Attempt standard JSON serialization (will fail) ---
print("Attempting JSON serialization on circular dictionary:")
try:
    json.dumps(circular_dict)
except ValueError as e: # Default behavior: json detects the cycle (check_circular=True)
    print(f"  >>> Failed with ValueError as expected: {e}")
except RecursionError as e: # Only reached if check_circular=False is passed to json.dumps
    print(f"  >>> Failed with RecursionError: {e}")
    
# --- Pickle serialization (works!) ---
print("\nAttempting Pickle serialization on circular dictionary:")
try:
    pickled_circular = pickle.dumps(circular_dict)
    print(f"  >>> Pickle succeeded! Size: {len(pickled_circular)} bytes")
    
    # Deserialization also works
    unpickled_circular = pickle.loads(pickled_circular)
    assert unpickled_circular['name'] == 'self-referential'
    assert unpickled_circular['myself'] is unpickled_circular # The reference is restored
    print("  >>> Pickle deserialization restored the circular reference.")
except Exception as e:
    print(f"  >>> Pickle failed unexpectedly: {e}")
    
# Note: Handling circular refs in JSON requires complex custom logic (beyond scope here)
# involving tracking object ids ('seen' set) and using placeholders.
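
As a rough illustration of the "seen set plus placeholder" idea mentioned in the note above, the sketch below replaces any reference back to an enclosing container with a marker dictionary. The function name `replace_cycles` and the `__cycle__` marker are assumptions made for this example, not an established convention, and the approach is lossy: the cycle is flagged in the output but not restored on load.

```python
import json

def replace_cycles(obj, _ancestors=None):
    """Return a copy of obj in which references back to an enclosing
    dict or list are replaced by a placeholder dictionary."""
    if _ancestors is None:
        _ancestors = set()
    if isinstance(obj, (dict, list)):
        if id(obj) in _ancestors:
            # We are already inside this container: emit a marker instead
            return {'__cycle__': True}
        _ancestors.add(id(obj))
        try:
            if isinstance(obj, dict):
                return {k: replace_cycles(v, _ancestors) for k, v in obj.items()}
            return [replace_cycles(v, _ancestors) for v in obj]
        finally:
            # Leaving this container: siblings may legitimately reference it
            _ancestors.discard(id(obj))
    return obj

node = {'name': 'self-referential', 'value': 42}
node['myself'] = node  # Circular reference

safe_json = json.dumps(replace_cycles(node))
print(safe_json)
```

Tracking only the current chain of ancestors (rather than every object ever seen) means an object that merely appears twice, without forming a cycle, is still serialized in full both times.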

### Performance Optimization for Dictionary Serialization

When dealing with large dictionaries or frequent serialization, performance matters.

**1. Binary Serialization for Speed and Size**

Binary formats like MessagePack (or Protocol Buffers, not shown here) are generally faster and produce smaller output than text-based formats like JSON, especially for numerical data.
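
To see why binary encodings tend to be smaller for numerical data, the standard-library sketch below compares the JSON text of a list of floats with a fixed-width binary packing via `struct`. This is not MessagePack itself, which uses its own type-tagged encoding; it only illustrates the general text-versus-binary size gap. The `readings` data is made up for the example.

```python
import json
import struct

# Made-up numerical data: 1000 floats, most with long decimal representations
readings = [i / 7 for i in range(1000)]

# Text encoding: every float is written out as decimal digits
text_bytes = json.dumps(readings).encode('utf-8')

# Binary encoding: each float is a fixed 8-byte little-endian double
fmt = f'<{len(readings)}d'
binary_bytes = struct.pack(fmt, *readings)

# The binary form round-trips exactly
restored = list(struct.unpack(fmt, binary_bytes))

print(f"JSON text:     {len(text_bytes)} bytes")
print(f"Packed binary: {len(binary_bytes)} bytes")
print(f"Round trip exact: {restored == readings}")
```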

In [None]:
# Note: Requires msgpack: pip install msgpack
import json
import pickle
import time

try:
    import msgpack
except ImportError:
    msgpack = None # The MessagePack measurement is skipped below

# Create a reasonably large dictionary
num_items = 100000
large_dict = {f'key_{i}': f'value_long_string_{i}' for i in range(num_items)}

print(f"Comparing performance for a dictionary with {num_items} items:\n")

# --- Measure JSON --- 
start_time = time.perf_counter()
json_data = json.dumps(large_dict).encode('utf-8') # Encode to bytes for fair size comparison
json_time = time.perf_counter() - start_time
json_size = len(json_data) # Payload size in bytes (sys.getsizeof would also count object overhead)

start_time_des = time.perf_counter()
json_des = json.loads(json_data.decode('utf-8'))
json_time_des = time.perf_counter() - start_time_des

print(f"JSON Serialization:   {json_time:.4f}s, Size: {json_size / 1024:.2f} KB")
print(f"JSON Deserialization: {json_time_des:.4f}s")

# --- Measure MessagePack --- 
if msgpack is not None:
    start_time = time.perf_counter()
    msgpack_data = msgpack.packb(large_dict)
    msgpack_time = time.perf_counter() - start_time
    msgpack_size = len(msgpack_data)

    start_time_des = time.perf_counter()
    msgpack_des = msgpack.unpackb(msgpack_data, raw=False)
    msgpack_time_des = time.perf_counter() - start_time_des

    print(f"\nMessagePack Serialization:   {msgpack_time:.4f}s, Size: {msgpack_size / 1024:.2f} KB")
    print(f"MessagePack Deserialization: {msgpack_time_des:.4f}s")
else:
    print("\nMessagePack test skipped: msgpack not installed.")

# --- Measure Pickle --- 
# Use highest protocol for best performance
start_time = time.perf_counter()
pickle_data = pickle.dumps(large_dict, protocol=pickle.HIGHEST_PROTOCOL)
pickle_time = time.perf_counter() - start_time
pickle_size = len(pickle_data)

start_time_des = time.perf_counter()
pickle_des = pickle.loads(pickle_data)
pickle_time_des = time.perf_counter() - start_time_des

print(f"\nPickle Serialization:   {pickle_time:.4f}s, Size: {pickle_size / 1024:.2f} KB")
print(f"Pickle Deserialization: {pickle_time_des:.4f}s")

# Note: Actual performance varies greatly based on data complexity, Python version, and hardware.
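
A single `time.perf_counter()` measurement can be noisy. One possible refinement, sketched here with the standard library only, is to repeat each measurement with `timeit.repeat` and keep the best run. The helper `best_time` and the smaller `sample` dictionary are assumptions made for this example, and only JSON and Pickle are compared so that no third-party package is needed.

```python
import json
import pickle
import timeit

# Smaller dictionary than above so repeated runs stay quick
sample = {f'key_{i}': f'value_long_string_{i}' for i in range(10_000)}

def best_time(func, repeat=5, number=3):
    """Best average seconds per call across several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

serializers = {
    'json': lambda: json.dumps(sample).encode('utf-8'),
    'pickle': lambda: pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL),
}

results = {}
for name, serialize in serializers.items():
    payload = serialize()
    results[name] = {
        'seconds': best_time(serialize),
        'bytes': len(payload),  # Serialized payload size
    }
    print(f"{name:>6}: {results[name]['seconds']:.5f}s per call, "
          f"{results[name]['bytes'] / 1024:.2f} KB")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs usually reflect interference from other processes rather than the code under test.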

**2. Incremental Serialization for Large Dictionaries**

If a dictionary is too large to fit into memory all at once during serialization (e.g., loading from a huge database or generator), you might need to serialize it incrementally. This is easier with formats that support streaming.

For JSON, this often means manually constructing the JSON structure piece by piece.

In [None]:
import json
import os

# Simulate generating key-value pairs for a huge dictionary
def generate_large_dict_items(n):
    print(f"Simulating generation of {n} items...")
    for i in range(n):
        # Simulate some work to get the item
        yield f"item_key_{i}", {'value': i*i, 'status': 'generated'}
        if (i+1) % 10000 == 0:
            print(f"  ...generated {i+1} items")
            
# File to write to
output_filename = 'large_incremental_data.json'
num_items_to_generate = 50000 # Smaller number for demo

print(f"Writing {num_items_to_generate} items incrementally to {output_filename}\n")

try:
    with open(output_filename, 'w') as f:
        f.write('{') # Start JSON object
        first_item = True
        for key, value in generate_large_dict_items(num_items_to_generate):
            if not first_item:
                f.write(',\n ') # Add comma and newline for readability
            else:
                f.write('\n ') # Start first item on new line
                first_item = False
            
            # Manually serialize key and value
            # Ensure key is properly escaped JSON string
            f.write(f'{json.dumps(key)}: {json.dumps(value)}') 
            
        f.write('\n}') # End JSON object
    print(f"\nSuccessfully wrote {output_filename}")

    # Clean up the file afterwards for demo purposes
    # os.remove(output_filename)
    # print(f"Cleaned up {output_filename}")

except Exception as e:
    print(f"An error occurred during incremental write: {e}")

# Note: Deserializing such a large file might also require streaming parsers (e.g., ijson library)
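
If you control the file format, one alternative that avoids a streaming parser is JSON Lines: one small JSON object per line, which can be written and read back incrementally with the standard library alone. The sketch below assumes that layout; the helper names `iter_items` and `read_items`, the record keys, and the file name are hypothetical.

```python
import json
import os
import tempfile

def iter_items(n):
    """Stand-in for a generator that yields key-value pairs lazily."""
    for i in range(n):
        yield f'item_key_{i}', {'value': i * i, 'status': 'generated'}

def read_items(filename):
    """Yield (key, value) pairs one line at a time, never loading the whole file."""
    with open(filename, encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            yield record['key'], record['value']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'incremental_demo.jsonl')  # Hypothetical file name

    # Write: each pair becomes one self-contained JSON document on its own line
    with open(path, 'w', encoding='utf-8') as f:
        for key, value in iter_items(1000):
            f.write(json.dumps({'key': key, 'value': value}) + '\n')

    # Read: aggregate over the stream without building the full dictionary
    item_count = 0
    total = 0
    for key, value in read_items(path):
        item_count += 1
        total += value['value']

print(f"Read {item_count} items, sum of values = {total}")
```

The trade-off is that the file as a whole is no longer a single JSON object, so consumers must know to parse it line by line.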

**3. Compression for Serialized Data**

Combine serialization with compression libraries like `gzip` or `bz2` to save disk space or reduce network bandwidth, especially for text-based formats like JSON or verbose binary formats.

*Trade-off:* Compression adds CPU overhead during both serialization and deserialization.
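
The trade-off can be observed directly by compressing the same payload at different gzip levels. The sketch below works in memory with `gzip.compress`; the payload and the chosen levels are arbitrary choices for illustration, and timings will differ between machines.

```python
import gzip
import json
import time

# Arbitrary JSON payload with repetitive keys and values
payload = json.dumps(
    {f'key_{i}': f'value_long_string_{i}' for i in range(20_000)}
).encode('utf-8')

sizes = {}
for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    elapsed = time.perf_counter() - start

    # Compression is lossless: decompressing returns the original bytes
    assert gzip.decompress(compressed) == payload

    sizes[level] = len(compressed)
    print(f"level {level}: {elapsed:.4f}s, "
          f"{len(compressed) / 1024:.2f} KB (from {len(payload) / 1024:.2f} KB)")
```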

In [None]:
import gzip
import os
import pickle
import time

# Use the large dictionary from the performance test
num_items = 100000
large_dict = {f'key_{i}': f'value_long_string_{i}' for i in range(num_items)}

output_file_pkl = 'large_dict.pkl'
output_file_pkl_gz = 'large_dict.pkl.gz'

# --- Standard Pickle --- 
print("Serializing with standard Pickle...")
start = time.perf_counter()
with open(output_file_pkl, 'wb') as f:
    pickle.dump(large_dict, f, protocol=pickle.HIGHEST_PROTOCOL)
time_pkl = time.perf_counter() - start
size_pkl = os.path.getsize(output_file_pkl)
print(f"  Pickle: {time_pkl:.4f}s, Size: {size_pkl / 1024:.2f} KB")

# --- Pickle with Gzip Compression --- 
print("Serializing with Pickle + Gzip...")
start = time.perf_counter()
with gzip.open(output_file_pkl_gz, 'wb') as f:
    # Gzip works with file-like objects, pickle.dump writes to it
    pickle.dump(large_dict, f, protocol=pickle.HIGHEST_PROTOCOL)
time_pkl_gz = time.perf_counter() - start
size_pkl_gz = os.path.getsize(output_file_pkl_gz)
print(f"  Pickle+Gzip: {time_pkl_gz:.4f}s, Size: {size_pkl_gz / 1024:.2f} KB")

# --- Decompression and Deserialization ---
print("\nDeserializing compressed file...")
start = time.perf_counter()
with gzip.open(output_file_pkl_gz, 'rb') as f:
    reconstructed_dict = pickle.load(f)
time_des_gz = time.perf_counter() - start
print(f"  Pickle+Gzip Deserialization: {time_des_gz:.4f}s")

assert len(reconstructed_dict) == num_items
print("  Data integrity verified.")

# --- Clean up files ---
os.remove(output_file_pkl)
os.remove(output_file_pkl_gz)
print(f"\nCleaned up {output_file_pkl} and {output_file_pkl_gz}")

### Serialization for Specific Use Cases: Caching

Dictionaries are perfect for caching results of expensive operations. Serialize the cache dictionary to disk (e.g., using Pickle) to persist it across script runs.

In [None]:
import pickle
import os
import time

cache_file = 'computation_cache.pkl'

def expensive_calculation(param1, param2):
    print(f"  -> Performing expensive calculation for ({param1}, {param2})...")
    time.sleep(1) # Simulate work
    return (param1 * param2) + param1 + param2

def load_cache(filename):
    if os.path.exists(filename):
        try:
            with open(filename, 'rb') as f:
                print(f"Loading cache from {filename}")
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError, FileNotFoundError) as e:
             print(f"Cache file {filename} corrupted or empty, starting fresh. Error: {e}")
             return {}
    else:
        print("Cache file not found, creating new cache.")
        return {}

def save_cache(cache, filename):
     try:
        with open(filename, 'wb') as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
            print(f"Cache saved to {filename}")
     except Exception as e:
        print(f"Error saving cache to {filename}: {e}")

def cached_calculation(param1, param2, cache):
    # Use a tuple of parameters as the dictionary key
    cache_key = (param1, param2)
    
    if cache_key in cache:
        print(f"Cache hit for {cache_key}!")
        return cache[cache_key]
    else:
        print(f"Cache miss for {cache_key}. Calculating...")
        result = expensive_calculation(param1, param2)
        cache[cache_key] = result # Store result in cache
        # In a real app, you might save the cache less frequently
        # save_cache(cache, cache_file) 
        return result

# --- Main execution --- 
calculation_cache = load_cache(cache_file)

print("\nFirst call (10, 5):")
res1 = cached_calculation(10, 5, calculation_cache)
print(f"Result: {res1}")

print("\nSecond call (10, 5):")
res2 = cached_calculation(10, 5, calculation_cache)
print(f"Result: {res2}")

print("\nFirst call (7, 3):")
res3 = cached_calculation(7, 3, calculation_cache)
print(f"Result: {res3}")

# Save the updated cache at the end
save_cache(calculation_cache, cache_file)

# Optional: Clean up cache file
if os.path.exists(cache_file):
    os.remove(cache_file)
    print(f"\nCleaned up {cache_file}")
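
If the script is interrupted while the cache is being written, the file can be left truncated, which is the situation the `EOFError` handling in `load_cache` guards against. One common way to reduce that risk, sketched here as an assumption rather than a requirement, is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The function name `save_cache_atomic` is hypothetical.

```python
import os
import pickle
import tempfile

def save_cache_atomic(cache, filename):
    """Write the cache to a temporary file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Same directory as the target so the final rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
        # Readers see either the old file or the complete new one
        os.replace(tmp_path, filename)
    except BaseException:
        # Do not leave a stray temporary file behind on failure
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demo in a throwaway directory so no real cache file is touched
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, 'computation_cache.pkl')
    save_cache_atomic({(10, 5): 65}, demo_path)
    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)

print(restored)
```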

### Pickle vs. HDF5: Performance Comparison

While Pickle is fast for general Python objects, HDF5 (Hierarchical Data Format) is often used in scientific computing, especially for large numerical arrays (like NumPy arrays), potentially stored within dictionaries. It offers features Pickle lacks, like partial I/O and better memory efficiency for huge datasets.

*Requires Installation:* `pip install h5py` (and potentially `numpy`)

**Key Differences Summary:**

*   **Serialization Speed:** Pickle often faster, especially for non-numerical or mixed data.
*   **Deserialization Speed:** Pickle often faster, but HDF5 can be faster for very large datasets or partial reads.
*   **Memory Usage (Serialization):** HDF5 significantly better (doesn't need full copy in memory).
*   **File Size:** HDF5 with compression can be much smaller for large numerical data.
*   **Features:** HDF5 allows partial reads/writes, metadata handling, better cross-platform/language support. Pickle is Python-specific.

**When to Consider HDF5 (over Pickle for dictionary values):**
*   Dictionaries contain *very large* NumPy arrays.
*   Memory is constrained during serialization.
*   Need to read/write only parts of the data.
*   Cross-language compatibility or long-term archiving is needed.

**Note:** Directly serializing a complex *dictionary structure itself* might be better handled by Pickle or formats like MessagePack unless the *values* within the dictionary are large numerical arrays suited for HDF5.

In [None]:
%pip install h5py numpy

In [None]:
# Note: Requires h5py and numpy: pip install h5py numpy
try:
    import h5py
    import numpy as np
    import pickle
    import time
    import os

    # Create a dictionary where values are large NumPy arrays
    array_size = (1000, 1000) # 1 million elements per array
    num_arrays = 5
    dict_with_arrays = {f'array_{i}': np.random.rand(*array_size) for i in range(num_arrays)}

    print(f"Created dictionary with {num_arrays} arrays of size {array_size}\n")

    pickle_file = 'large_arrays.pkl'
    hdf5_file = 'large_arrays.h5'

    # --- Pickle Timing ---
    start = time.perf_counter()
    with open(pickle_file, 'wb') as f:
        pickle.dump(dict_with_arrays, f, protocol=pickle.HIGHEST_PROTOCOL)
    pickle_write_time = time.perf_counter() - start
    pickle_size = os.path.getsize(pickle_file)

    start = time.perf_counter()
    with open(pickle_file, 'rb') as f:
        loaded_pickle = pickle.load(f)
    pickle_read_time = time.perf_counter() - start
    print(f"Pickle Write: {pickle_write_time:.4f}s, Read: {pickle_read_time:.4f}s, Size: {pickle_size / (1024*1024):.2f} MB")
    
    # --- HDF5 Timing ---
    start = time.perf_counter()
    with h5py.File(hdf5_file, 'w') as f:
        for key, value in dict_with_arrays.items():
            f.create_dataset(key, data=value) # Store each array as a dataset
            # Could also store metadata: f[key].attrs['timestamp'] = time.time()
    hdf5_write_time = time.perf_counter() - start
    hdf5_size = os.path.getsize(hdf5_file)

    start = time.perf_counter()
    loaded_hdf5 = {}
    with h5py.File(hdf5_file, 'r') as f:
        for key in f.keys():
            loaded_hdf5[key] = f[key][:] # Load the full array data
    hdf5_read_time = time.perf_counter() - start
    print(f"HDF5 Write:   {hdf5_write_time:.4f}s, Read: {hdf5_read_time:.4f}s, Size: {hdf5_size / (1024*1024):.2f} MB")

    # --- HDF5 Partial Read Example ---
    start = time.perf_counter()
    with h5py.File(hdf5_file, 'r') as f:
        # Read only the first 10x10 slice of the first array
        partial_data = f['array_0'][:10, :10]
    hdf5_partial_read_time = time.perf_counter() - start
    print(f"\nHDF5 Partial Read (10x10 slice): {hdf5_partial_read_time:.6f}s")
    print(f"  Shape of partial data: {partial_data.shape}")

    # Clean up
    os.remove(pickle_file)
    os.remove(hdf5_file)
    print(f"\nCleaned up {pickle_file} and {hdf5_file}")

except ImportError:
    print("h5py or numpy not installed. Run 'pip install h5py numpy' to run this cell.")
except Exception as e:
    print(f"An error occurred during HDF5 comparison: {e}")

# Note: HDF5 shines more with even larger arrays or when memory during write is very constrained.

### Summary

We've explored advanced serialization topics:

*   Handling custom objects in JSON requires custom encoders/decoders.
*   Circular references are handled automatically by Pickle, while JSON rejects them unless you add custom handling.
*   Binary formats (Pickle, MessagePack, HDF5) often offer better performance/size than JSON for large data, with trade-offs.
*   Techniques like incremental serialization and compression help manage very large datasets.
*   Serialization is key for use cases like caching.
*   HDF5 provides advantages over Pickle for specific scenarios involving large numerical arrays and partial I/O needs.

**Next Steps:** The final notebook focuses on real-world applications, security considerations, and best practices for robust serialization.