# Working with DataFrames.jl beyond CSV files

# Part 3: Important limitations of Parquet

## Bogumił Kamiński
### June 25, 2023

What is covered in part 3:
* Limitations of `RowGroup` size
* Avoid excessive copying of data

## Setup

In [1]:
using DataFrames

In [2]:
using Parquet2

## Handling tables with a large number of rows

In [3]:
isfile("large_df.parquet") && rm("large_df.parquet")

false

In [4]:
large_df = DataFrame(x=rand(3*10^8))

| Row | x |
|---|---|
| | Float64 |
| 1 | 0.715886 |
| 2 | 0.906553 |
| 3 | 0.428327 |
| 4 | 0.734823 |
| 5 | 0.428111 |
| 6 | 0.24559 |
| 7 | 0.894124 |
| 8 | 0.897556 |
| 9 | 0.874006 |
| 10 | 0.548504 |


This table has too many rows to be stored in Parquet as a single `RowGroup`: its 3×10^8 `Float64` values occupy 2,400,000,000 bytes, which does not fit in an `Int32`.

In [5]:
Parquet2.writefile("large_df.parquet", large_df)

LoadError: InexactError: trunc(Int32, 2400000000)
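The number in the error message can be reproduced with plain arithmetic. The following is a minimal sketch using only Base Julia; the variable names are illustrative, and the reading that the failing quantity is the column size in bytes is inferred from the error message rather than from the Parquet2 source.

```julia
# Size in bytes of one Float64 column with 3*10^8 rows
n_rows = 3 * 10^8
n_bytes = n_rows * sizeof(Float64)      # 2400000000, the value in the error

# The largest value an Int32 can hold
limit = typemax(Int32)                  # 2147483647

# The column is too large for an Int32 byte count
too_large = n_bytes > limit             # true

# Largest number of Float64 rows whose byte size still fits in an Int32
max_rows = limit ÷ sizeof(Float64)      # 268435455
```

Under this reading, any partition size below roughly 2.68×10^8 rows keeps a single `Float64` column within the limit.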

We need to split it into smaller partitions with `Iterators.partition`, so that each partition is written as a separate `RowGroup`:

In [6]:
Parquet2.writefile("large_df.parquet", Iterators.partition(large_df, 10^8))

✏ Parquet2.FileWriter{IOStream}(large_df.parquet)
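To see what `Iterators.partition` produces, it can help to try it on a tiny table first. This is a minimal sketch, assuming a DataFrames.jl version that supports `Iterators.partition` on data frames; `small_df` and `parts` are illustrative names that do not appear in the workshop.

```julia
using DataFrames

# A small stand-in for large_df
small_df = DataFrame(x=1:10)

# Split into chunks of at most 4 rows; the last chunk holds the remainder
parts = collect(Iterators.partition(small_df, 4))

n_parts = length(parts)     # 3
sizes = nrow.(parts)        # [4, 4, 2]
```

With 3×10^8 rows and a partition size of 10^8, the same logic gives three partitions of equal size.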

Drop the original data to save memory:

In [7]:
large_df = nothing

## Impact of `copycols` keyword argument when fetching data to a `DataFrame`

Default call, in which the columns are copied (the `copycols=true` behavior):

In [8]:
DataFrame(Parquet2.readfile("large_df.parquet"))
GC.gc(); GC.gc(); GC.gc(); GC.gc()
@time DataFrame(Parquet2.readfile("large_df.parquet"));

  9.391013 seconds (1.29 k allocations: 11.176 GiB, 5.37% gc time)


With `copycols=false`, the `DataFrame` constructor does not copy the columns it receives:

In [9]:
DataFrame(Parquet2.readfile("large_df.parquet"), copycols=false)
GC.gc(); GC.gc(); GC.gc(); GC.gc()
@time DataFrame(Parquet2.readfile("large_df.parquet"), copycols=false);

 8.985564 seconds (1.29 k allocations: 8.941 GiB, 1.64% gc time)
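The effect of `copycols` can be checked on a small vector, and the gap between the two allocation figures above matches the size of one copy of the column. The sketch below uses the keyword-argument `DataFrame` constructor rather than a Parquet file, so it only illustrates the general behavior; the variable names are illustrative.

```julia
using DataFrames

v = [1.0, 2.0, 3.0]

# Default: the constructor stores a copy of v
df_copy = DataFrame(x=v)
# copycols=false: the constructor stores v itself
df_alias = DataFrame(x=v, copycols=false)

is_copy_same = df_copy.x === v      # false
is_alias_same = df_alias.x === v    # true

# Size of one copy of the 3*10^8-row Float64 column, in GiB (about 2.235)
copy_gib = 3 * 10^8 * sizeof(Float64) / 2^30
```

The difference between 11.176 GiB and 8.941 GiB is 2.235 GiB, which is consistent with exactly one extra copy of the column being made when `copycols=false` is not passed.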


*Preparation of this workshop has been supported by the Polish National Agency for Academic Exchange under the Strategic Partnerships programme, grant number BPI/PST/2021/1/00069/U/00001.*

![SGH & NAWA](logo.png)