# Summary

Attempt to recreate an issue with reading large Pandas DataFrames from parquet files. The original issue: reading some small parquet files into a DataFrame (~25MB on disk, ~250MB once loaded as a DataFrame) led to very high memory usage (>11GB) during the read. The original problem occurred on a non-internet-connected machine with non-public data, so this notebook tries to recreate it with synthetic data.

Can't recreate on Linux with:
* pandas==1.0.3
* pyarrow==0.16.0


In [1]:
%load_ext autoreload
%autoreload 2

%load_ext memory_profiler

In [2]:
import numpy as np
import pandas as pd
import string
import time
import tempfile
import os
import random

In [3]:
LETTER_POOL = string.ascii_letters + string.punctuation + ' '

In [4]:
def create_data(n_rows, n_int_cols=0, n_float_cols=0, str_cols=None):
    data = {}
    for i in range(n_int_cols):
        col_name = f"int_{i}"
        this_data = np.random.randint(0, 65000, size=n_rows)
        data[col_name] = this_data

    for i in range(n_float_cols):
        col_name = f"float_{i}"
        this_data = np.random.rand(n_rows)
        data[col_name] = this_data
    
    if str_cols is not None:
        for i, width in enumerate(str_cols):
            col_name = f"str_{i}"
            this_data = [random_string(width) for _ in range(n_rows)]
            this_data = np.array(this_data)
            data[col_name] = this_data
            
    return data

def random_string(length, letter_pool=LETTER_POOL):
    return ''.join(random.choices(letter_pool, k=length))

In [5]:
n_rows = 100000
n_int_cols = 5
n_float_cols = 5
str_cols = [300, 300, 300]
df = pd.DataFrame(create_data(n_rows, n_int_cols, n_float_cols, str_cols))
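
The summary compares on-disk size with in-DataFrame size. A minimal sketch of how the in-memory side could be estimated, assuming `memory_usage(deep=True)` is a good enough measure (it also counts the Python string objects held in object columns). `frame_size_mb` and the tiny `small` frame are hypothetical names for illustration, not part of the original investigation:

```python
import numpy as np
import pandas as pd


def frame_size_mb(frame: pd.DataFrame) -> float:
    """Deep in-memory size of a DataFrame in MB (10**6 bytes)."""
    # deep=True inspects object columns instead of counting only pointers
    return float(frame.memory_usage(deep=True).sum() / 1_000_000)


# Small stand-in frame: one int column and one 300-character string column
small = pd.DataFrame({
    "int_0": np.arange(1000),
    "str_0": ["x" * 300] * 1000,
})
print(f"In-memory size = {frame_size_mb(small):.2f}MB")
```

Calling `frame_size_mb(df)` on the frame built above would give the figure to compare against the parquet file size.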

In [7]:
n_dfs = 20
test_df = pd.concat([df] * n_dfs)
fname = f'df_{n_dfs}.parquet'
test_df.to_parquet(fname)
print(f"Output file size = {os.path.getsize(fname) / 1000000:.2f}MB")

Output file size = 1876.36MB


In [8]:
%%memit
df_in = pd.read_parquet(fname)

peak memory: 4625.16 MiB, increment: 2815.89 MiB
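
`%%memit` reports the peak for this cell and the increment over the baseline. As a rough cross-check outside IPython, the process-wide peak resident set size can be read from the standard library. A minimal sketch, assuming a Unix-like system (the `resource` module is not available on Windows); `peak_rss_mib` is a hypothetical helper, and the value is the peak over the whole process lifetime, not per statement:

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


print(f"peak RSS so far: {peak_rss_mib():.2f} MiB")
```

Because the number only ever grows, it is most useful when the read is the first heavy operation in a fresh process.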


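Seems way less than I was experiencing elsewhere: the increment here is only about 1.5x the file size on disk, nowhere near the >11GB seen for a ~25MB file.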