# Section 8 - Python example: optimizing data processing

In data science, optimizing data processing can significantly enhance the performance and scalability of data analysis workflows. This section demonstrates practical Python techniques for optimizing data processing tasks using libraries such as Pandas, NumPy, and Dask. These examples highlight methods to improve computational efficiency, manage memory usage effectively, and reduce processing time, especially when dealing with large datasets.

1. Setting Up the Environment:

Before diving into data processing optimization, ensure your Python environment has the necessary libraries. If Pandas, NumPy, or Dask are not installed, they can be added using pip. The time module is part of the Python standard library, so it does not need to be installed:

In [None]:
pip install pandas numpy dask

2. Importing Required Libraries:

Start by importing the libraries that will be used throughout the examples:

In [None]:
import pandas as pd
import numpy as np
import dask.array as da
import time

3. Using Vectorization in NumPy:

Vectorization replaces explicit Python loops with array-level operations that NumPy executes in compiled code, which is usually much faster. Here’s how you can use NumPy for vectorized operations:

In [None]:
# Create large numpy arrays
a = np.random.rand(1000000)
b = np.random.rand(1000000)
# Vectorized addition
result = a + b
# Much faster than iterating through arrays
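
To make the comparison concrete, the sketch below times an explicit Python loop against the vectorized expression. The array size `n` and the variable names are illustrative choices rather than part of the original example, and the measured times will vary from machine to machine.

```python
import time

import numpy as np

n = 200_000  # smaller than the arrays above so the loop finishes quickly
a = np.random.rand(n)
b = np.random.rand(n)

# Element-by-element addition with an explicit Python loop
start = time.perf_counter()
loop_result = np.empty(n)
for i in range(n):
    loop_result[i] = a[i] + b[i]
loop_seconds = time.perf_counter() - start

# The same addition written as a single vectorized expression
start = time.perf_counter()
vectorized_result = a + b
vectorized_seconds = time.perf_counter() - start

print(f"Loop:       {loop_seconds:.4f} s")
print(f"Vectorized: {vectorized_seconds:.4f} s")
print("Results match:", np.allclose(loop_result, vectorized_result))
```

Both approaches produce the same values; only the time taken differs.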

4. Efficient Data Loading and Processing with Pandas:

Handling large datasets efficiently in Pandas involves optimizing how data is loaded and manipulated. When loading a large file, it helps to consider what to do if the file is too big for the system's memory. A good option is to read the file in chunks. This can take longer, but it is a necessary trade-off when the whole file will not fit in memory. Here is an example that uses the time module to measure both approaches:

In [None]:
# Option 1: Attempt to load the entire large file at once
try:
    print("Attempting to load the entire file at once...")
    start_time = time.time()
    large_df = pd.read_csv('large_dataset.csv')
    load_time_entire = time.time() - start_time
    print(f"Loaded entire file successfully in {load_time_entire:.2f} seconds. Shape:", large_df.shape)
except MemoryError:
    print("MemoryError: The file is too large to load into memory all at once.")

# Option 2: Load the file in chunks and measure the time taken
print("\nProcessing the file in chunks...")
start_time = time.time()

# Read the large CSV file in chunks
iterator = pd.read_csv('large_dataset.csv', chunksize=10000)

# Process each chunk
for i, chunk in enumerate(iterator):
    # Process data within each chunk
    chunk['new_column'] = chunk['existing_column'] * 10

    # Optionally display the first processed chunk
    if i == 0:
        print("Sample of the first processed chunk:")
        print(chunk.head())  # Display the first few rows as an example

# Calculate time taken for chunk processing
load_time_chunks = time.time() - start_time
print(f"\nChunk processing complete in {load_time_chunks:.2f} seconds.")
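
Chunks are discarded once the loop moves on, so any result that spans the whole file has to be accumulated as you go. The following is a minimal sketch of that pattern, computing a column mean from running totals. It writes its own small stand-in CSV (the file name `sample_dataset.csv` and its contents are hypothetical) so that it runs without `large_dataset.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small stand-in CSV so the example runs anywhere (hypothetical data)
csv_path = os.path.join(tempfile.mkdtemp(), "sample_dataset.csv")
pd.DataFrame({"existing_column": np.arange(50_000)}).to_csv(csv_path, index=False)

running_sum = 0
running_count = 0

# Only one chunk is held in memory at a time
for chunk in pd.read_csv(csv_path, chunksize=10_000):
    running_sum += chunk["existing_column"].sum()
    running_count += len(chunk)

chunked_mean = running_sum / running_count
print(f"Mean computed from {running_count} rows in chunks: {chunked_mean}")
```

The same idea extends to counts, minima, maxima, and other statistics that can be combined from per-chunk results.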


5. Parallel Processing with Dask:

Dask provides parallel computing capabilities that make it well suited to working with large datasets. We will look at this again in Section 10. For now, let's create a Dask DataFrame and perform an operation on it in parallel:

In [None]:
pip install "dask[dataframe]"


In [None]:
import dask.dataframe as dd

# Create a Dask DataFrame from a Pandas DataFrame
dask_df = dd.from_pandas(pd.DataFrame({'x': range(100000), 'y': range(100000)}), npartitions=10)
# Perform operations in parallel
result = dask_df.x + dask_df.y
# This operation is lazy and computed in parallel
computed_result = result.compute() # Trigger computation

How could we use the time module to measure the compute time of this operation? Have a go at adapting the previous timing example to look at the speed of running tasks in parallel with Dask. Is Dask quicker?

In [None]:
# your code here

6. Memory Management:

Effective memory management is crucial for handling large datasets. Techniques such as using smaller data types and deleting DataFrames that are no longer needed can help:

In [None]:
# Optimize data types
df = pd.DataFrame({'A': pd.Series(np.random.randint(1, 100, size=1000000))})

# Print memory usage before optimization
print(df['A'].memory_usage(deep=True), 'bytes')

# Convert the column to a smaller integer type to save memory
df['A'] = df['A'].astype('int8')

# Print memory usage after optimization
print(df['A'].memory_usage(deep=True), 'bytes')  # Memory usage is reduced

# Explicitly delete dataframes when they're no longer needed
del df
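
Casting with `astype('int8')` is safe above only because every value lies between 1 and 99; values outside the int8 range (-128 to 127) would not be preserved. As a hedged alternative, the sketch below lets `pd.to_numeric` choose the smallest signed integer type that fits each column. The column names and value ranges are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_ints": np.random.randint(1, 100, size=100_000),
    "large_ints": np.random.randint(0, 1_000_000, size=100_000),
})

before = df.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest signed integer type that holds every value
for column in df.columns:
    df[column] = pd.to_numeric(df[column], downcast="integer")

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"Memory: {before} bytes -> {after} bytes")
```

Because the type is chosen from the data, the column with larger values keeps a wider integer type instead of being truncated.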


7. Utilizing Efficient File Formats:

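Using efficient file formats can speed up read and write operations significantly. HDF5 is a binary, hierarchical format designed for large numerical datasets, and it is typically faster to read and write than a text format such as CSV. Pandas can read and write HDF5 directly, although this relies on the PyTables package (installed with pip as tables):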

In [None]:
# Using HDF5 format
df = pd.DataFrame({'A': np.random.rand(1000000)})
df.to_hdf('data.h5', key='df', mode='w')
# Fast loading
df = pd.read_hdf('data.h5')

Optimizing data processing tasks in Python involves a combination of efficient coding practices, leveraging powerful libraries like Pandas, NumPy, and Dask, and employing strategies for effective memory management and parallel processing. By implementing these techniques, data scientists can handle larger datasets more effectively, perform faster analyses, and scale their data processing workflows to meet the demands of increasingly data-intensive applications. These optimizations not only save time but also reduce computational costs, making them indispensable in modern data science.

In this notebook, we’ve explored how Pandas and Dask handle large datasets and why Dask is often chosen for "big data" operations. By examining each approach through the lens of Big O notation, we can better understand the theoretical efficiency and trade-offs involved.

When loading a dataset and performing operations such as addition on columns, Pandas operates with **linear time complexity, \( O(N) \)**. This means that as the dataset size \( N \) grows, the time required to load and process the dataset scales linearly. Pandas is efficient for datasets that fit in memory, as it doesn’t introduce the overhead required for parallel processing, making it the faster option in this case.

Dask, on the other hand, also has **linear complexity** for basic operations, but it introduces additional overhead due to partitioning the dataset and managing parallel tasks. The overall complexity of Dask is **\( O(N) + O(P) \)**, where \( P \) represents the number of partitions. Although this added overhead can make Dask slower for smaller datasets, it’s a necessary trade-off for managing larger-than-memory datasets. By partitioning the dataset and processing each partition in parallel, Dask makes it possible to work with massive datasets that would otherwise be infeasible in Pandas alone.
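
The cost structure described above can be illustrated without Dask. The sketch below is a serial NumPy stand-in, not Dask's actual scheduler: it splits an array into \( P \) partitions, reduces each one, and then combines the partial results. The names `num_partitions` and `partial_sums` are hypothetical.

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.int64)
num_partitions = 10  # plays the role of P

# Split the data into P roughly equal partitions
partitions = np.array_split(values, num_partitions)

# Each element is visited once across all partitions: O(N) work in total
partial_sums = [partition.sum() for partition in partitions]

# Combining one result per partition adds O(P) work
total = sum(partial_sums)

print("Partitioned total:", total)
print("Direct total:     ", values.sum())
```

In Dask the per-partition reductions can run in parallel, which shortens wall-clock time, while the total amount of work stays proportional to \( N \) plus the per-partition overhead.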

### Summary Table

| Approach        | Time Complexity | Practical Use Case                                 |
|-----------------|-----------------|----------------------------------------------------|
| **Pandas**      | \( O(N) \)      | Best for datasets that fit comfortably in memory, where linear processing can happen without parallel overhead. |
| **Dask**        | \( O(N) + O(P) \) | Best for large, memory-intensive datasets, as Dask's partitioning and parallel processing enable scalable operations, despite some overhead.|

This comparison highlights that **Dask doesn’t necessarily reduce complexity** but instead provides a way to handle large data efficiently by using parallelism. Through this analysis, we see how Big O notation not only reveals the efficiency of an algorithm but also highlights trade-offs related to practical memory constraints and computational scalability.