# Working with DataFrames.jl beyond CSV files

# Part 3: Important limitations of Parquet

## Bogumił Kamiński
### June 25, 2023

What is covered in part 3:
* Limitations of `RowGroup` size
* Avoid excessive copying of data

## Setup

In [1]:
using DataFrames

In [2]:
using Parquet2

## Handling tables with a large number of rows

In [3]:
isfile("large_df.parquet") && rm("large_df.parquet")

In [4]:
large_df = DataFrame(x=rand(3*10^8))

| Row | x (Float64) |
|-----|-------------|
| 1 | 0.970803 |
| 2 | 0.841825 |
| 3 | 0.0484231 |
| 4 | 0.764995 |
| 5 | 0.438937 |
| 6 | 0.972333 |
| 7 | 0.357896 |
| 8 | 0.830436 |
| 9 | 0.090152 |
| 10 | 0.780723 |


This table has too many rows and cannot be stored in Parquet as one `RowGroup`.

In [5]:
Parquet2.writefile("large_df.parquet", large_df)

LoadError: InexactError: trunc(Int32, 2400000000)
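
The number in the error message equals the raw size of column `x` in bytes. The sketch below reproduces that arithmetic. It assumes, based only on the error message, that the writer stores a per-`RowGroup` size as an `Int32`; the names `nrows`, `nbytes` and `max_rows` are illustrative and not part of Parquet2.jl.

```julia
# Size of one Float64 column with 3*10^8 rows, in bytes.
nrows = 3 * 10^8
nbytes = nrows * sizeof(Float64)      # 2_400_000_000, the value in the error

# Largest value an Int32 can hold; the conversion fails above it.
limit = typemax(Int32)                # 2_147_483_647
too_large = nbytes > limit            # true

# Assumed upper bound on Float64 rows per RowGroup under this limit.
max_rows = limit ÷ sizeof(Float64)    # 268_435_455

# A chunk of 10^8 rows needs 8*10^8 bytes, which fits comfortably.
chunk_fits = 10^8 * sizeof(Float64) <= limit   # true
```

Under this assumption, any chunk of at most 268,435,455 `Float64` rows would fit; a round value such as `10^8` leaves headroom.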

We need to split it into partitions of smaller size using `Iterators.partition`:

In [6]:
Parquet2.writefile("large_df.parquet", Iterators.partition(large_df, 10^8))

✏ Parquet2.FileWriter{IOStream}(large_df.parquet)

Drop original data to save memory:

In [7]:
large_df = nothing
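
To see what `Iterators.partition` produced in the write step above, here is a minimal sketch on a small table. The data frame `small_df` and the chunk size of 4 are made up for illustration; the sketch only assumes a DataFrames.jl version that supports `Iterators.partition` on a data frame, as used above.

```julia
using DataFrames

# A tiny stand-in for large_df.
small_df = DataFrame(x=1:10)

# Split into consecutive chunks of at most 4 rows each.
part_sizes = [nrow(part) for part in Iterators.partition(small_df, 4)]
# part_sizes == [4, 4, 2]

# The chunks cover all rows of the source table exactly once.
total_rows = sum(part_sizes)   # 10
```

The last chunk is shorter when the number of rows is not a multiple of the chunk size, so `large_df` with `3*10^8` rows and a chunk size of `10^8` is split into three equal parts.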

## Impact of `copycols` keyword argument when fetching data to a `DataFrame`

Default behavior, which copies the columns (`copycols=true`):

In [8]:
DataFrame(Parquet2.readfile("large_df.parquet"))
GC.gc(); GC.gc(); GC.gc(); GC.gc()
@allocated DataFrame(Parquet2.readfile("large_df.parquet"))

12000208304

With `copycols=false`, which reuses the columns without copying them:

In [9]:
DataFrame(Parquet2.readfile("large_df.parquet"), copycols=false)
GC.gc(); GC.gc(); GC.gc(); GC.gc()
@allocated DataFrame(Parquet2.readfile("large_df.parquet"), copycols=false)

9600208336
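
The two measurements differ by roughly the size of one copy of column `x`. The sketch below checks that arithmetic and shows the effect of `copycols` on a small in-memory table. The `NamedTuple` `src` is a made-up stand-in for the Parquet source, so this illustrates the `DataFrame` constructor only, not Parquet2.jl internals.

```julia
using DataFrames

# Difference between the two @allocated results above, in bytes.
extra = 12000208304 - 9600208336          # 2_399_999_968
one_column = 3 * 10^8 * sizeof(Float64)   # 2_400_000_000

# A NamedTuple of vectors is a valid Tables.jl column table.
src = (x = rand(5),)

df_copy = DataFrame(src)                    # columns are copied
df_nocopy = DataFrame(src, copycols=false)  # columns are reused

copied = df_copy.x !== src.x      # true: a fresh vector was allocated
reused = df_nocopy.x === src.x    # true: the same vector is stored
```

Because `copycols=false` stores the source vector itself, later in-place changes to `src.x` would also be visible in `df_nocopy`.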

**The excessive copying shown above was fixed in Parquet2.jl version 0.2.18. From that version on, you can omit `copycols=false`; the extra copy is avoided automatically.**

*Preparation of this workshop has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*

![SGH & NAWA](logo.png)