# Pandas tip #4: Chunking your CSV directly in Pandas
When your data gets close to the size of your memory, you might run into trouble. One way to solve this is chunking: cut the dataset into smaller pieces, process each piece, and combine the results afterwards, a bit like a tiny map-reduce. Writing a chunking algorithm yourself is not difficult, but for large CSV files it is not even necessary, as Pandas has chunking built into `read_csv`.

Chunking is enabled with the `chunksize=<size>` parameter, which takes care of the whole chunking process. Instead of the full DataFrame, `read_csv` returns an iterator (a `TextFileReader`) that yields DataFrames of at most `<size>` rows. As with any iterator in Python, you can only iterate over it once. You process each chunk and store only the result. At the end of the iteration you combine the results. This can be done using Pandas or using a reducer.
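To make the shape of the chunks concrete, here is a minimal sketch. The in-memory CSV (`csv_text`) is a made-up stand-in for a large file, and the chunk size of 2 is only chosen to keep the example tiny.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "value,category\n1,A\n2,B\n3,A\n4,B\n5,A\n"

# With chunksize set, read_csv hands back an iterator of DataFrames.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Single pass over the iterator: keep only a small result per chunk.
rows_per_chunk = [len(chunk) for chunk in chunks]

print(rows_per_chunk)  # [2, 2, 1]
```

Every chunk is an ordinary DataFrame, so anything you would do to the full DataFrame can be done per chunk. Only the last chunk may be shorter than `chunksize`.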

Let's generate some data in a file:

In [23]:
import numpy as np
import pandas as pd

n_rows = 10_000  # number of rows
groups = ['A', 'B', 'C', 'D']  # Groups

rng = np.random.default_rng()
pd.DataFrame({
    'value1': rng.integers(0, 100, size=n_rows),
    'value2': rng.integers(0, 100, size=n_rows),
    'category': rng.choice(groups, size=n_rows),
}).to_csv('large_data.csv', index=False)

In [39]:
# get an iterator that yields chunks of 1000 rows each
dfs = pd.read_csv('large_data.csv', chunksize=1000)

With `chunksize` set, `pd.read_csv` returns an iterator and not a DataFrame:

In [40]:
dfs

<pandas.io.parsers.TextFileReader at 0x7f104962cd10>
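In recent Pandas versions (1.2 and later) the `TextFileReader` can also be used as a context manager, which closes the underlying file when the block ends. The following is a small sketch of that pattern; the inline CSV and the chunk size are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical small CSV; in practice this would be a path to a large file.
csv_text = "value1,category\n10,A\n20,B\n30,A\n40,B\n"

# The reader is closed automatically when the with-block ends.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    partial_sums = [
        chunk.groupby("category")["value1"].sum() for chunk in reader
    ]

print(len(partial_sums))  # one partial result per chunk
```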

When we process the chunks we need to iterate over them and collect the results:

In [43]:
result = []
for chunk in dfs:
    result.append(chunk.groupby('category').sum())

There are many ways to combine the results. A compact one is a reducer, which folds the partial results together pairwise:

In [28]:
from functools import reduce

# fill_value=0 keeps a category that is missing from one chunk from turning into NaN
reduce(lambda a, b: a.add(b, fill_value=0), result)

| category | value1 | value2 |
|----------|--------|--------|
| A | 126460 | 127562 |
| B | 122934 | 122848 |
| C | 127314 | 127761 |
| D | 122390 | 121614 |
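The same combination can be done with Pandas alone: stack the partial results and group again on the index. This is a sketch with two made-up partial results shaped like the output of `chunk.groupby('category').sum()`; the numbers are illustrative, not taken from the file above.

```python
import pandas as pd

# Hypothetical per-chunk results, shaped like chunk.groupby('category').sum().
result = [
    pd.DataFrame(
        {"value1": [1, 2], "value2": [3, 4]},
        index=pd.Index(["A", "B"], name="category"),
    ),
    pd.DataFrame(
        {"value1": [10], "value2": [30]},
        index=pd.Index(["A"], name="category"),
    ),
]

# Stack all partial results, then sum the rows that share a category.
combined = pd.concat(result).groupby(level="category").sum()
print(combined)
```

A side benefit of this variant is that a category missing from one chunk (here `B` in the second result) is handled naturally, whereas a plain `DataFrame.add` without `fill_value` yields `NaN` for a label that is missing on one side.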


This result is exactly what we would have gotten from the full DataFrame:

In [33]:
(pd
    .read_csv('large_data.csv')
    .groupby('category')
    .sum()
)

| category | value1 | value2 |
|----------|--------|--------|
| A | 126460 | 127562 |
| B | 122934 | 122848 |
| C | 127314 | 127761 |
| D | 122390 | 121614 |
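The match is exact because a sum can be split over chunks and added up again. Not every aggregate behaves like that: averaging per-chunk means gives the wrong answer when chunks contribute different numbers of rows. One common workaround, sketched here on a made-up inline CSV, is to carry the sum and the count per chunk and divide only at the end.

```python
import io

import pandas as pd

# Hypothetical data: category A is spread unevenly over the two chunks.
csv_text = "value1,category\n1,A\n2,A\n3,B\n10,A\n"

totals = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Keep the additive pieces of the mean: sum and count per category.
    part = chunk.groupby("category")["value1"].agg(["sum", "count"])
    totals = part if totals is None else totals.add(part, fill_value=0)

# Divide only after all chunks have been combined.
chunked_mean = totals["sum"] / totals["count"]

# Reference: the mean computed on the full DataFrame.
full_mean = (
    pd.read_csv(io.StringIO(csv_text)).groupby("category")["value1"].mean()
)
print(chunked_mean)
print(full_mean)
```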


In [46]:
!rm ./large_data.csv

If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis).