## Reading larger-than-memory CSV files in batches

To work with larger-than-memory datasets we must:
- process the dataset in chunks
- combine the chunks into a single output

We refer to each chunk of a dataset as a *batch*.

In [1]:
import polars as pl

pl.Config.set_tbl_rows(8)

polars.config.Config

In [2]:
csv_file = "data/titanic.csv"

## Batched reader

Read a CSV in batches by calling `pl.read_csv_batched`.

Use the `batch_size` argument to tell Polars how many rows we want in each batch `DataFrame`.

In [5]:
reader = pl.read_csv_batched(
    csv_file,
    batch_size=10
)

reader

<polars.io.csv.batched_reader.BatchedCsvReader at 0x1508f7f65d0>

The `pl.read_csv_batched` function accepts all the standard arguments for CSV processing, such as setting delimiters or changing column names.

Extract batches with `next_batches`, where `n` is the number of batches to return.

In [7]:
batches = reader.next_batches(n=2)

The output of `next_batches` is a `list` of `DataFrame` objects.

In [8]:
type(batches)

list

In [10]:
batches[0]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
…,…,…,…,…,…,…,…,…,…,…,…
7,0,1,"""McCarthy, Mr. Timothy J""","""male""",54.0,0,0,"""17463""",51.8625,"""E46""","""S"""
8,0,3,"""Palsson, Master. Gosta Leonard""","""male""",2.0,3,1,"""349909""",21.075,,"""S"""
9,1,3,"""Johnson, Mrs. Oscar W (Elisabe…","""female""",27.0,0,2,"""347742""",11.1333,,"""S"""
10,1,2,"""Nasser, Mrs. Nicholas (Adele A…","""female""",14.0,1,0,"""237736""",30.0708,,"""C"""


In [12]:
batches[1]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
11,1,3,"""Sandstrom, Miss. Marguerite Ru…","""female""",4.0,1,1,"""PP 9549""",16.7,"""G6""","""S"""
12,1,1,"""Bonnell, Miss. Elizabeth""","""female""",58.0,0,0,"""113783""",26.55,"""C103""","""S"""
13,0,3,"""Saundercock, Mr. William Henry""","""male""",20.0,0,0,"""A/5. 2151""",8.05,,"""S"""
14,0,3,"""Andersson, Mr. Anders Johan""","""male""",39.0,1,5,"""347082""",31.275,,"""S"""
…,…,…,…,…,…,…,…,…,…,…,…
17,0,3,"""Rice, Master. Eugene""","""male""",2.0,4,1,"""382652""",29.125,,"""Q"""
18,1,2,"""Williams, Mr. Charles Eugene""","""male""",,0,0,"""244373""",13.0,,"""S"""
19,0,3,"""Vander Planke, Mrs. Julius (Em…","""female""",31.0,1,0,"""345763""",18.0,,"""S"""
20,1,3,"""Masselmani, Mrs. Fatima""","""female""",,0,0,"""2649""",7.225,,"""C"""


The number of rows in each batch is not guaranteed to equal the `batch_size` argument, because Polars has to estimate how large a batch will be in bytes before reading it.

### Estimating batch size

Polars makes an estimate by first reading a sample of lines to get the mean and standard deviation of their length in bytes. 

It uses this to estimate the total number of bytes per line.

If a CSV has text data with variable length then the number of bytes per row will vary considerably and the actual batch size will differ from `batch_size`.

Typically the relative difference between the actual batch size and `batch_size` will be smaller for larger datasets.
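To make the idea concrete, here is a pure-Python sketch of this kind of estimate. It illustrates the reasoning only and is not Polars's actual implementation. The helper name `estimate_batch_bytes` and the sample lines are hypothetical.

```python
import statistics


def estimate_batch_bytes(sample_lines, batch_size):
    """Estimate the bytes occupied by `batch_size` rows from a sample of lines."""
    # Length of each sampled line in bytes (not characters)
    lengths = [len(line.encode("utf-8")) for line in sample_lines]
    mean_len = statistics.mean(lengths)
    # The spread shows how far an individual line can deviate from the mean
    std_len = statistics.pstdev(lengths)
    # Bytes to read so that we get roughly `batch_size` rows
    return mean_len, std_len, int(batch_size * mean_len)


# Hypothetical sample of CSV lines with a variable-length text field
sample = [
    "1,0,3,Braund,male,22.0",
    "2,1,1,Cumings,female,38.0",
    "3,1,3,Heikkinen,female,26.0",
]
mean_len, std_len, est_bytes = estimate_batch_bytes(sample, batch_size=10)
```

The larger `std_len` is relative to `mean_len`, the more the number of rows that fit in `est_bytes` can deviate from the requested `batch_size`.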

## Processing batches

If we keep calling `reader.next_batches`, it eventually returns `None` instead of a `list` once it has gone through all the batches.

In [13]:
reader = pl.read_csv_batched(csv_file, batch_size=70)
batches0 = reader.next_batches(5)
batches1 = reader.next_batches(5)
batches2 = reader.next_batches(5)
batches3 = reader.next_batches(5)
[type(batches0), type(batches1), type(batches2), type(batches3)]

[list, list, list, NoneType]

In [14]:
[el.shape[0] for el in batches0]

[70, 70, 70, 70, 70]

In [None]:
(
    pl.read_csv(csv_file)
    .select(["Age","Fare"])
    .mean()
)

Age,Fare
f64,f64
29.699118,32.204208
