### Dask

- A flexible library for parallel computing in Python, designed to scale computations to large datasets. It integrates with existing Python libraries such as NumPy and pandas.

import dask.dataframe as dd
import pandas as pd
import numpy as np

# Generate a large dataset (as before)
np.random.seed(42)
n_rows = 1_000_000
# Ten numeric columns, generated in the same order as listing them one by one
data = {f'Numeric_{i}': np.random.rand(n_rows) for i in range(10)}
data['Categorical_3'] = np.random.choice(['P', 'Q', 'R', 'S', 'T'], size=n_rows)

# Create a pandas DataFrame
pdf = pd.DataFrame(data)

# Save the DataFrame to a CSV file
pdf.to_csv('large_dataset.csv', index=False)

# Now use Dask to read the CSV file
df = dd.read_csv('large_dataset.csv')

# Build a lazy element-wise expression; nothing is computed yet
result = df['Numeric_0'] + df['Numeric_9'] * df['Numeric_3']

# Display the first few results (head() only needs the first partition)
print(result.head())

# Calculate summary statistics
summary = df.describe().compute()
print(summary)

# Count value occurrences in a categorical column
value_counts = df['Categorical_3'].value_counts().compute()
print(value_counts)


0    0.881853
1    0.969647
2    1.161214
3    0.617403
4    0.818888
dtype: float64
          Numeric_0     Numeric_1     Numeric_2     Numeric_3     Numeric_4  \
count  1.000000e+06  1.000000e+06  1.000000e+06  1.000000e+06  1.000000e+06   
mean   5.003345e-01  4.994787e-01  5.001205e-01  5.000972e-01  4.999022e-01   
std    2.885911e-01  2.885151e-01  2.887497e-01  2.887905e-01  2.886225e-01   
min    5.188446e-07  3.774576e-07  9.958058e-07  3.907137e-07  3.869667e-07   
25%    2.505941e-01  2.505341e-01  2.506572e-01  2.504032e-01  2.508170e-01   
50%    5.012175e-01  4.998475e-01  5.016284e-01  5.006099e-01  5.003154e-01   
75%    7.508154e-01  7.495893e-01  7.504384e-01  7.507173e-01  7.500634e-01   
max    9.999983e-01  9.999994e-01  9.999979e-01  9.999999e-01  9.999966e-01   

          Numeric_5     Numeric_6     Numeric_7     Numeric_8     Numeric_9  
count  1.000000e+06  1.000000e+06  1.000000e+06  1.000000e+06  1.000000e+06  
mean   4.999513e-01  4.998491e-01  4.998747e-01

## Step by step

The example above covers:

- Reading large CSV files efficiently using dd.read_csv().

- Performing element-wise operations on columns.

- Using compute() to execute lazy operations and retrieve results.

- Calculating summary statistics with describe().

- Counting value occurrences in categorical columns.

Dask can perform these operations on datasets that do not fit into memory by splitting them into smaller partitions and processing those partitions in parallel. This makes it possible to work with large datasets on a single machine or across a cluster.