# Compressio basics

This notebook demonstrates the basic usage of compressio on synthetic data.

In [7]:
%load_ext autoreload
%autoreload 2

import random
import datetime

import numpy as np
import pandas as pd
from visions import StandardSet

from compressio import Compress, storage_size, compress_report, savings_report


Generate a DataFrame with one million rows and one column per data type:

In [8]:
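n = 1_000_000  # number of rows
df = pd.DataFrame({
    # Small non-negative integers, stored as int64 by default
    'integer': [random.choice([0, 1, 2]) for _ in range(n)],
    # Same values with missing entries, stored in the nullable Int32 dtype
    'integer_missing': pd.Series([random.choice([0, 1, 2, np.nan]) for _ in range(n)], dtype="Int32"),
    # Whole-number floats, stored as float64 by default
    'float': [3.0 for _ in range(n)],
    'complex': pd.Series([complex(0, 2) for _ in range(n)], dtype='complex128'),
    # A single repeated string and a single repeated timestamp
    'object': ['strings' for _ in range(n)],
    'datetime': pd.Series([datetime.datetime(2020, 10, 10) for _ in range(n)])
})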

Initialize the `Compress` object with type inference enabled:

In [9]:
compress = Compress(with_type_inference=True)

The original DataFrame occupies about 53 MB:

In [10]:
original_size = storage_size(df).to('megabyte')
print(f'Original DataFrame size: {original_size}')

Original DataFrame size: 53.000128 megabyte
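
The 53 MB figure can be reproduced by hand. The sketch below assumes that `storage_size` counts shallow per-element bytes (so the `object` column counts 8-byte pointers rather than the strings themselves); this is consistent with the reported number, but the notebook does not confirm it. The 128-byte index overhead is taken from the `df.memory_usage()` output at the end of this notebook.

```python
n = 1_000_000  # rows, as in the DataFrame above

# Shallow bytes per element for each column's original dtype.
bytes_per_element = {
    "integer": 8,          # int64
    "integer_missing": 5,  # Int32: 4-byte value + 1-byte missing-value mask
    "float": 8,            # float64
    "complex": 16,         # complex128
    "object": 8,           # pointer to a Python object
    "datetime": 8,         # datetime64[ns]
}
index_bytes = 128  # index overhead as reported by df.memory_usage() in this notebook

total_bytes = index_bytes + sum(size * n for size in bytes_per_element.values())
total_megabytes = total_bytes / 1e6

print(f"{total_bytes} bytes = {total_megabytes} megabyte")
```

The total is 53,000,128 bytes, which matches the size printed above.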


A single call compresses the DataFrame:

In [11]:
df_compressed = compress.it(df)

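The typeset reports the type inferred for each column. Note that the `float` column, which holds only whole numbers, is inferred as `Integer`:

In [17]: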
compress.typeset.infer_type(df)

{'integer': Integer,
 'integer_missing': Integer,
 'float': Integer,
 'complex': Complex,
 'object': String,
 'datetime': DateTime}

The compression report lists, for each column, the original dtype, the compressed dtype, and the memory saved:

In [12]:
compress_report(df, compress.typeset, compress.compressor)

integer: was int64 compressed uint8 savings 7.0 megabyte
integer_missing: was Int32 compressed UInt8 savings 3.0 megabyte
float: was float64 compressed float16 savings 6.0 megabyte
complex: was complex128 compressed complex64 savings 8.0 megabyte
object: was object compressed category savings 6.999911999999999 megabyte
datetime: was datetime64[ns] compressed category savings 6.999911999999999 megabyte
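
The `int64` to `uint8` change in the report follows from the value range: the column only holds 0, 1, and 2. As a rough illustration of the idea (not compressio's actual implementation), the sketch below picks the narrowest NumPy integer dtype that can hold an array's values. The helper name `smallest_int_dtype` is hypothetical.

```python
import numpy as np


def smallest_int_dtype(values: np.ndarray) -> np.dtype:
    """Return the narrowest integer dtype that can represent all values."""
    lo, hi = int(values.min()), int(values.max())
    # Prefer unsigned types when there are no negative values.
    if lo >= 0:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    else:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in any 64-bit integer dtype")


values = np.array([0, 1, 2, 1, 0], dtype=np.int64)
compact = values.astype(smallest_int_dtype(values))

print(values.dtype, values.nbytes)    # 8 bytes per element
print(compact.dtype, compact.nbytes)  # 1 byte per element
```

Going from 8 bytes to 1 byte per element saves 7 bytes per row, which for one million rows is the 7.0 megabyte reported for the `integer` column.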


The savings report summarizes the total reduction:

In [7]:
savings_report(df, df_compressed)

Original size: 53.000128 megabyte
Compressed size: 15.000304 megabyte
Savings: 37.999824 megabyte
Reduction percentage: 71.70%
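
The figures in this report are related by simple arithmetic. The snippet below recomputes the savings and the reduction percentage from the two sizes printed above; it is a check of the numbers, not a reimplementation of `savings_report`.

```python
original_mb = 53.000128    # "Original size" from the report
compressed_mb = 15.000304  # "Compressed size" from the report

savings_mb = original_mb - compressed_mb
reduction_pct = 100 * savings_mb / original_mb

print(f"Savings: {savings_mb:.6f} megabyte")          # Savings: 37.999824 megabyte
print(f"Reduction percentage: {reduction_pct:.2f}%")  # Reduction percentage: 71.70%
```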


Per-column memory usage of the compressed DataFrame, in bytes:

In [8]:
df_compressed.memory_usage()

Index                  128
integer            1000000
integer_missing    2000000
float              2000000
complex            8000000
object             1000088
datetime           1000088
dtype: int64
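
The `object` and `datetime` columns drop to roughly one byte per row because each holds a single repeated value, which a categorical column stores once and references through small integer codes. The following is a minimal sketch of that effect using plain pandas on a smaller Series; it illustrates the `category` dtype only and assumes nothing about how compressio decides when to apply it.

```python
import pandas as pd

# One repeated string, stored as Python object pointers (8 bytes each on 64-bit).
strings = pd.Series(["strings"] * 1000, dtype="object")

# The categorical version keeps the unique values once plus one code per row.
compact = strings.astype("category")

original_bytes = strings.memory_usage(index=False)
compact_bytes = compact.memory_usage(index=False)

print(f"object:   {original_bytes} bytes")
print(f"category: {compact_bytes} bytes (codes stored as {compact.cat.codes.dtype})")
```

With fewer than 128 distinct values pandas stores the codes as `int8`, so the cost is one byte per row plus a small fixed overhead for the categories themselves.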

For comparison, the per-column memory usage of the original DataFrame:

In [10]:
df.memory_usage()

Index                   128
integer             8000000
integer_missing     5000000
float               8000000
complex            16000000
object              8000000
datetime            8000000
dtype: int64