# Loading a Large CSV File with Dask

This notebook demonstrates how to load a large CSV file (`HI-Large_Trans.csv`) using Dask, which is designed for parallel computing and handling datasets larger than memory.

## Install required packages

If you haven't installed Dask and its dependencies yet, uncomment and run the cell below:

In [1]:
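# The extras are quoted so the shell does not expand the square brackets.
# pyarrow is included because dd.read_csv needs pyarrow>=10.0.1 for Arrow-backed strings.
# !pip install "dask[dataframe,distributed]" dask-ml matplotlib "pyarrow>=10.0.1"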

## Import necessary libraries

In [4]:
import dask.dataframe as dd
import pandas as pd
import matplotlib.pyplot as plt
import os
from dask.distributed import Client, progress

# Set up a local Dask client for parallel processing
client = Client()
client

Matplotlib is building the font cache; this may take a moment.


Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 5
Total threads: 10
Total memory: 32.00 GiB
Status: running
Using processes: True
Comm: tcp://127.0.0.1:63004
Started: Just now

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:63024 | http://127.0.0.1:63028/status | tcp://127.0.0.1:63007 | 2 | 6.40 GiB | /var/folders/9p/n_0wt7rd5f5bjsyzp0wy4mfr0000gn/T/dask-scratch-space/worker-lx15g4z4 |
| tcp://127.0.0.1:63017 | http://127.0.0.1:63018/status | tcp://127.0.0.1:63009 | 2 | 6.40 GiB | /var/folders/9p/n_0wt7rd5f5bjsyzp0wy4mfr0000gn/T/dask-scratch-space/worker-k9lheagp |
| tcp://127.0.0.1:63021 | http://127.0.0.1:63025/status | tcp://127.0.0.1:63011 | 2 | 6.40 GiB | /var/folders/9p/n_0wt7rd5f5bjsyzp0wy4mfr0000gn/T/dask-scratch-space/worker-jigdqbr0 |
| tcp://127.0.0.1:63020 | http://127.0.0.1:63022/status | tcp://127.0.0.1:63013 | 2 | 6.40 GiB | /var/folders/9p/n_0wt7rd5f5bjsyzp0wy4mfr0000gn/T/dask-scratch-space/worker-wdbfp6ut |
| tcp://127.0.0.1:63027 | http://127.0.0.1:63030/status | tcp://127.0.0.1:63015 | 2 | 6.40 GiB | /var/folders/9p/n_0wt7rd5f5bjsyzp0wy4mfr0000gn/T/dask-scratch-space/worker-22czqnov |

## Define file path and examine file size

In [5]:
file_path = 'HI-Large_Trans.csv'

# Check file size (dividing by 1024**3 gives binary gigabytes, i.e. GiB)
file_size_bytes = os.path.getsize(file_path)
file_size_gib = file_size_bytes / (1024**3)

print(f"File size: {file_size_gib:.2f} GiB")

File size: 15.88 GiB


## Load the CSV file using Dask

Dask will read the file in chunks, allowing us to work with data larger than memory.
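
To make the chunking concrete, here is a rough sketch of how many partitions to expect for a file of the size reported above. It assumes a hypothetical byte count derived from the rounded 15.88 figure, and it treats the block size string `"100MB"` as decimal megabytes (10**8 bytes), which is how Dask parses it. The helper `estimate_partitions` is illustrative only and is not part of Dask; the real partition count can differ slightly depending on how Dask aligns block boundaries.

```python
import math


def estimate_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Rough number of blocks when a file is cut into blocksize-sized pieces."""
    return max(1, math.ceil(file_size_bytes / blocksize_bytes))


# Hypothetical byte count: 15.88 binary gigabytes, the rounded size reported above
file_size_bytes = int(15.88 * 1024**3)

# "100MB" is a decimal unit: 100 * 10**6 bytes
blocksize_bytes = 100 * 10**6

estimated = estimate_partitions(file_size_bytes, blocksize_bytes)
print(f"Estimated partitions: {estimated}")
```

A larger block size means fewer, bigger partitions (less scheduling overhead, more memory per task); a smaller one means the opposite.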

In [7]:
# Load the CSV file into a Dask DataFrame
# We're using 'blocksize' to control the partition size
# Adjust the blocksize based on your available memory
ddf = dd.read_csv(file_path, blocksize="100MB")

# Display information about the Dask DataFrame
print(f"Number of partitions: {ddf.npartitions}")
print("\nDataFrame structure:")
ddf

ImportError: An error occurred while calling the read_csv method registered to the pandas backend.
Original Message: pyarrow>=10.0.1 is required for PyArrow backed StringArray.
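
The error above says that `pyarrow>=10.0.1` is missing or too old. A minimal sketch for checking this before calling `dd.read_csv` follows. It uses only the standard library; the helper names `parse_version` and `pyarrow_is_recent_enough` are hypothetical, and the version parsing is deliberately simple (it only looks at the leading numeric parts).

```python
import re
from importlib import metadata

# Minimum version taken from the error message above
REQUIRED = (10, 0, 1)


def parse_version(text: str) -> tuple:
    """Return up to three leading numeric components of a version string."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def pyarrow_is_recent_enough(required: tuple = REQUIRED) -> bool:
    """True if pyarrow is installed and at least the required version."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= required


if pyarrow_is_recent_enough():
    print("pyarrow is recent enough for Arrow-backed strings.")
else:
    print('pyarrow is missing or too old; try: pip install "pyarrow>=10.0.1"')
    # Possible alternative (depends on your Dask version): disable the
    # Arrow-backed string conversion before reading the file, e.g.
    #   dask.config.set({"dataframe.convert-string": False})
```

After installing or upgrading pyarrow, restart the kernel and re-run the notebook from the top so the cells below can execute.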

## Examine the DataFrame schema

In [None]:
# Get column names and data types
print("Column names and data types:")
ddf.dtypes

## Preview the first few rows

In [None]:
# Preview the first 5 rows
# This triggers computation only for the first partition
ddf.head()

## Basic data analysis

Let's perform some basic analysis on the data without loading the entire dataset into memory.

In [None]:
# Calculate basic statistics for numeric columns
# This will trigger computation across all partitions
print("Computing basic statistics (this may take a while for a large file)...")
ddf.describe().compute()

## Count the number of rows

In [None]:
# Count the total number of rows
print("Counting rows (this may take a while)...")
row_count = len(ddf)
print(f"Total number of rows: {row_count}")

## Working with specific columns

You can select specific columns to work with, which reduces memory usage.

In [None]:
# Select the first three columns by position
# (replace with explicit column names once you have examined the schema)
try:
    first_columns = ddf.iloc[:, :3]
    print("First few columns:")
    # print() is needed: inside a try block the value is not displayed automatically
    print(first_columns.head())
except Exception as e:
    print(f"Error selecting columns: {e}")
    print("Please adjust the column selection after examining the dataframe structure.")

## Performing operations on the DataFrame

With Dask, you can perform many of the same operations as with pandas, but they're executed in parallel and can handle larger-than-memory datasets.

In [None]:
# Example: Group by operation (adjust column names based on your data)
# This is just a placeholder - modify after examining your data
try:
    # Replace 'category_column' with an actual column name from your data
    # group_result = ddf.groupby('category_column').size().compute()
    # print("Group by results:")
    # group_result
    print("Uncomment and modify the groupby operation with actual column names from your data")
except Exception as e:
    print(f"Error in groupby operation: {e}")
    print("Please adjust the groupby operation after examining the dataframe structure.")

## Saving results

You can save processed results to formats such as CSV or Parquet.

In [None]:
# Example: Save a sample of the data to a new CSV file
# Uncomment and adjust as needed

# sample_size = 1000  # Adjust based on your needs
# sample_df = ddf.head(sample_size)  # head() returns a pandas DataFrame
# sample_df.to_csv('sample_data.csv', index=False)  # pandas to_csv has no single_file argument
# print(f"Saved {len(sample_df)} rows to 'sample_data.csv'")

## Clean up

Close the Dask client when you're done.

In [None]:
# Close the Dask client
client.close()