[Apache Parquet](https://arrow.apache.org/docs/python/parquet.html) is an efficient columnar storage format. Compared to saving this dataset as CSVs, using Parquet:
- Greatly reduces the necessary disk space
- Loads the data into Pandas with memory-efficient datatypes
- Enables fast reads from disk
- Allows us to easily work with partitions of the data

Pandas has a parquet integration that makes loading data into a dataframe trivial; we'll try that now.

In [1]:
import pandas as pd

In [2]:
book_train = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet')

If this data were stored as a CSV, the numeric types would all default to their 64-bit versions when loaded. Parquet retains the more efficient types I specified while saving the data.
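
To illustrate the CSV behaviour, here is a small sketch using a made-up two-row frame (the values are illustrative, not taken from the dataset). It writes compact dtypes to an in-memory CSV and reads them back with default settings:

```python
import io

import numpy as np
import pandas as pd

# Toy frame with compact dtypes; the values are made up for illustration.
compact = pd.DataFrame({
    "time_id": np.array([5, 11], dtype="int16"),
    "bid_price1": np.array([1.0014, 1.0008], dtype="float32"),
    "bid_size1": np.array([3, 100], dtype="int32"),
})

# Round-trip through CSV text held in memory. CSV stores only text,
# so the dtypes are not recorded anywhere in the file.
buffer = io.StringIO()
compact.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(compact.dtypes.astype(str).to_dict())
print(from_csv.dtypes.astype(str).to_dict())
```

The second line should show `int64` and `float64` throughout. Passing a `dtype=` mapping to `pd.read_csv` can restore the compact types, but it has to be repeated on every load, whereas a parquet file carries its schema with it.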

**Expect memory usage to spike to roughly double the final dataframe size while parquet loads a file. Consider loading your largest dataset first or using partitions to mitigate this.**

In [3]:
book_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167253289 entries, 0 to 167253288
Data columns (total 11 columns):
 #   Column             Dtype   
---  ------             -----   
 0   time_id            int16   
 1   seconds_in_bucket  int16   
 2   bid_price1         float32 
 3   ask_price1         float32 
 4   bid_price2         float32 
 5   ask_price2         float32 
 6   bid_size1          int32   
 7   ask_size1          int32   
 8   bid_size2          int32   
 9   ask_size2          int32   
 10  stock_id           category
dtypes: category(1), float32(4), int16(2), int32(4)
memory usage: 5.8 GB
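
The reported memory usage can be checked by hand from the row count and the dtypes listed above. The sketch below is a back-of-the-envelope calculation, assuming the category column costs one byte per row (pandas uses `int8` codes when there are fewer than 128 categories) and ignoring the small overhead of the index and the category labels:

```python
import numpy as np

n_rows = 167_253_289  # from the RangeIndex line of the info() output

# Number of columns of each numeric dtype, as listed by info().
dtype_counts = {"int16": 2, "float32": 4, "int32": 4}

compact_bytes_per_row = sum(
    np.dtype(name).itemsize * count for name, count in dtype_counts.items()
)
# Assumption: one int8 category code per row for stock_id.
compact_bytes_per_row += 1

# The same ten numeric columns at 8 bytes each, plus a 64-bit stock_id.
wide_bytes_per_row = 8 * (sum(dtype_counts.values()) + 1)

gib = 1024 ** 3  # info() divides by 1024 and labels the result "GB"
compact_gb = n_rows * compact_bytes_per_row / gib
wide_gb = n_rows * wide_bytes_per_row / gib

print(f"compact: {compact_bytes_per_row} bytes/row, {compact_gb:.1f} GB")
print(f"64-bit:  {wide_bytes_per_row} bytes/row, {wide_gb:.1f} GB")
```

This gives 37 bytes per row and 5.8 GB for the compact types, matching the `info()` output, against 88 bytes per row and roughly 13.7 GB if every column were 64-bit.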


The one exception to the preserved types is the `stock_id` column, which has been converted to the category type because it is [the partition column](https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets). The parquet files in this dataset are all partitioned by `stock_id`, so it's not necessary to load an entire file at once. In fact, if you examine the parquet files you'll see that they are actually directories.

In [4]:
! ls ../input/optiver-realized-volatility-prediction/book_train.parquet | head -n 5

stock_id=0
stock_id=1
stock_id=10
stock_id=100
stock_id=101


Those are in turn also directories, which would be relevant if the data were partitioned by more than one column.

In [5]:
! ls ../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/

c439ef22282f412ba39e9137a3fdabac.parquet
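
Because each partition is just a directory of files, the partitions can be enumerated with ordinary filesystem tools and loaded one at a time, which keeps peak memory low. The helper below is a minimal sketch; `partition_files` is a hypothetical name of my own, and it assumes the `stock_id=<integer>` directory layout shown in the listings above. It also sorts the partitions numerically, unlike `ls`, which orders them as text (0, 1, 10, 100, ...).

```python
import glob
import os
import re


def partition_files(dataset_dir, column="stock_id"):
    """Map each integer partition value to its parquet files, in numeric order."""
    pattern = re.compile(rf"{re.escape(column)}=(\d+)$")
    found = {}
    for directory in glob.glob(os.path.join(dataset_dir, f"{column}=*")):
        match = pattern.search(os.path.basename(directory))
        if match and os.path.isdir(directory):
            files = sorted(glob.glob(os.path.join(directory, "*.parquet")))
            found[int(match.group(1))] = files
    # Sort by the integer value so that 2 comes before 10.
    return dict(sorted(found.items()))


# Possible usage on the competition data (not run here):
# import pandas as pd
# root = '../input/optiver-realized-volatility-prediction/book_train.parquet'
# for stock_id, files in partition_files(root).items():
#     part = pd.read_parquet(files[0])  # one stock at a time
#     ...  # compute per-stock features, then let `part` go out of scope
```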


In [6]:
book_train_0 = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/c439ef22282f412ba39e9137a3fdabac.parquet')
book_train_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 917553 entries, 0 to 917552
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   time_id            917553 non-null  int16  
 1   seconds_in_bucket  917553 non-null  int16  
 2   bid_price1         917553 non-null  float32
 3   ask_price1         917553 non-null  float32
 4   bid_price2         917553 non-null  float32
 5   ask_price2         917553 non-null  float32
 6   bid_size1          917553 non-null  int32  
 7   ask_size1          917553 non-null  int32  
 8   bid_size2          917553 non-null  int32  
 9   ask_size2          917553 non-null  int32  
dtypes: float32(4), int16(2), int32(4)
memory usage: 31.5 MB


Note that because we loaded a single partition, **the partition column was not included**. We could remedy that manually if we need the stock ID, or just load a larger subset of the data by passing a list of paths. The glob pattern below matches every stock ID that starts with 11 (that is, 11 and 110-119, where present), reducing memory usage without implicitly dropping the partition column:

In [7]:
import glob
subset_paths = glob.glob('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=11*/*')
book_train_subset = pd.read_parquet(subset_paths)
book_train_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14006182 entries, 0 to 14006181
Data columns (total 11 columns):
 #   Column             Dtype   
---  ------             -----   
 0   time_id            int16   
 1   seconds_in_bucket  int16   
 2   bid_price1         float32 
 3   ask_price1         float32 
 4   bid_price2         float32 
 5   ask_price2         float32 
 6   bid_size1          int32   
 7   ask_size1          int32   
 8   bid_size2          int32   
 9   ask_size2          int32   
 10  stock_id           category
dtypes: category(1), float32(4), int16(2), int32(4)
memory usage: 494.2 MB
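
For the single-partition case, the manual remedy mentioned earlier could look something like the sketch below. `with_partition_column` is a hypothetical helper of my own, not part of pandas; it assumes the partition value is an integer encoded in the path as `stock_id=<value>`. A toy frame stands in for the real partition so the example runs without the data.

```python
import re

import pandas as pd


def with_partition_column(df, path, column="stock_id"):
    """Return a copy of df with the partition value parsed from the file path."""
    match = re.search(rf"{re.escape(column)}=([^/\\]+)", path)
    if match is None:
        raise ValueError(f"no '{column}=' segment found in {path!r}")
    # Assumption: partition values are integers, as in the directory listing.
    return df.assign(**{column: int(match.group(1))})


# Toy stand-in for a loaded partition; the values are made up.
toy = pd.DataFrame({"time_id": [5, 5], "seconds_in_bucket": [0, 1]})
path = ('../input/optiver-realized-volatility-prediction/book_train.parquet/'
        'stock_id=0/c439ef22282f412ba39e9137a3fdabac.parquet')
restored = with_partition_column(toy, path)
print(restored)
```

Note that this adds a plain integer column rather than the category type produced by a partition-aware load, so cast it if consistent dtypes matter downstream.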
