# Compressing sparse datasets

Values in real-world data are often not evenly distributed.
When there is a predominant value in your data, such as 0 or "missing", it is often more memory efficient to store only the values that differ from it.

In this notebook, we let compressio consider sparse data structures when compressing a DataFrame.

You can find more information on how the SparseArray in pandas works on [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html).
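
As a rough illustration of the idea, independent of compressio, the sketch below builds a mostly-zero array and stores it as a pandas `SparseArray`. The array contents and the names `dense` and `sparse` are made up for this example.

```python
import numpy as np
import pandas as pd

# 1,000 integers, of which only 10 differ from the predominant value 0.
dense = np.zeros(1000, dtype="int64")
dense[::100] = 7

# Only values that differ from fill_value are stored, along with their positions.
sparse = pd.arrays.SparseArray(dense, fill_value=0)

print(sparse.density)  # fraction of values that are stored explicitly
print(dense.nbytes)    # bytes used by the dense array
print(sparse.nbytes)   # bytes used by the stored values and their positions
```

The sparse version pays a small overhead per stored value for its position, so it only wins when most values equal the fill value.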

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

from compressio import Compress, storage_size, SparseCompressor, compress_report, savings_report

We create a fictional cars dataset to illustrate how sparse data structures can reduce memory usage in pandas. The amount of missing data is exaggerated for demonstration purposes: 95% of the values in the sparse columns are missing. You can play around with the parameters below to see at which proportions of missing values the sparse representation is more memory efficient.

In [3]:
import random

n_missing = 19000
n_present = 1000
n_total = n_missing + n_present

data = pd.DataFrame({
    'primary_color': pd.Series(random.choices(['green', 'blue', 'red', 'yellow', 'white', 'pink'], k=n_total), dtype=str),
    'secondary_color': pd.Series([None] * n_missing + random.choices(['gold', 'black', 'silver'], k=n_present), dtype=str),
    'date_registered': pd.Series([1] * n_total, dtype="datetime64[ns]"),
    'date_scrapped': pd.Series([pd.NaT] * n_missing + random.choices([1,2,3,4,5,6,7,8,9,10],k=n_present), dtype="datetime64[ns]"),
    'number_of_modifications': pd.Series([pd.NA] * n_missing + random.choices([0,1,2,3,4],k=n_present), dtype="Int64"),
    'imported': pd.Series([pd.NA] * n_missing + random.choices([True, False], k=n_present), dtype="boolean"),
})

data

| | primary_color | secondary_color | date_registered | date_scrapped | number_of_modifications | imported |
|---|---|---|---|---|---|---|
| 0 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 1 | pink | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 2 | green | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 3 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 4 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| ... | ... | ... | ... | ... | ... | ... |
| 19995 | red | silver | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000008 | 4 | False |
| 19996 | green | black | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000007 | 4 | True |
| 19997 | pink | gold | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000009 | 0 | True |
| 19998 | yellow | black | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000004 | 2 | True |
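
To get a feel for where the sparse representation stops paying off, one option is to sweep the share of missing values on a single column with plain pandas. The sketch below is an assumption-laden stand-in for the dataset above: it uses one `float64` column, and the helper `sizes_for` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

n_total = 20_000


def sizes_for(missing_fraction):
    """Return (dense_bytes, sparse_bytes) for a float column with the given share of NaN."""
    n_missing = int(n_total * missing_fraction)
    values = np.concatenate([np.full(n_missing, np.nan), np.ones(n_total - n_missing)])
    dense = pd.Series(values, dtype="float64")
    # NaN is the fill value, so only the non-missing entries are stored.
    sparse = dense.astype(pd.SparseDtype("float64", np.nan))
    return dense.memory_usage(index=False), sparse.memory_usage(index=False)


for fraction in (0.95, 0.75, 0.50, 0.25):
    dense_bytes, sparse_bytes = sizes_for(fraction)
    print(f"{fraction:.0%} missing: dense={dense_bytes} B, sparse={sparse_bytes} B")
```

Because each stored value also carries its position, the sparse column becomes larger than the dense one once too few values are missing. The exact break-even point depends on the dtype of the column.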


In [4]:
compress = Compress(compressor=SparseCompressor())

In [5]:
original_size = storage_size(data).to('megabyte')
print(f'Original DataFrame size: {original_size}')

Original DataFrame size: 0.860128 megabyte


In [6]:
compress.typeset.detect_type(data)

{'primary_color': String,
 'secondary_color': String,
 'date_registered': DateTime,
 'date_scrapped': DateTime,
 'number_of_modifications': Integer,
 'imported': Boolean}

In [7]:
data_compressed = compress.it(data)
data_compressed

| | primary_color | secondary_color | date_registered | date_scrapped | number_of_modifications | imported |
|---|---|---|---|---|---|---|
| 0 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 1 | pink | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 2 | green | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 3 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| 4 | red | | 1970-01-01 00:00:00.000000001 | NaT | | |
| ... | ... | ... | ... | ... | ... | ... |
| 19995 | red | silver | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000008 | 4 | False |
| 19996 | green | black | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000007 | 4 | True |
| 19997 | pink | gold | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000009 | 0 | True |
| 19998 | yellow | black | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000004 | 2 | True |


In [8]:
compress_report(data, compress.typeset, compress.compressor, with_inference=False, units="kilobytes")

primary_color: converting from object to category saves 139.792 kilobyte (use `data[primary_color].astype("category")`)
secondary_color: converting from object to Sparse[object, nan] saves 148.0 kilobyte (use `data[secondary_color].astype("Sparse[object, nan]")`)
date_registered: converting from datetime64[ns] to category saves 139.912 kilobyte (use `data[date_registered].astype("category")`)
date_scrapped: converting from datetime64[ns] to category saves 139.6 kilobyte (use `data[date_scrapped].astype("category")`)
number_of_modifications: converting from Int64 to Sparse[int8, <NA>] saves 175.0 kilobyte (use `data[number_of_modifications].astype("Sparse[int8, <NA>]")`)
imported: converting from boolean to Sparse[bool, <NA>] saves 35.0 kilobyte (use `data[imported].astype("Sparse[bool, <NA>]")`)
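
A conversion like the one suggested for `secondary_color` can also be applied by hand. The sketch below is a minimal illustration on a stand-in Series (`colors`, a hypothetical name) rather than the notebook's data, and it spells the target dtype with the `pd.SparseDtype` constructor instead of the dtype string from the report.

```python
import numpy as np
import pandas as pd

# Stand-in for a mostly-missing object column: 19,000 NaN followed by 1,000 strings.
colors = pd.Series([np.nan] * 19_000 + ["gold", "black"] * 500, dtype=object)

# Object subtype with NaN as the fill value, so only the strings are stored.
sparse_colors = colors.astype(pd.SparseDtype(object, np.nan))

# Shallow memory usage (pointers only), excluding the index.
dense_bytes = colors.memory_usage(index=False)
sparse_bytes = sparse_colors.memory_usage(index=False)

print(sparse_colors.dtype)
print(f"saved {dense_bytes - sparse_bytes} bytes")
```

The values themselves are unchanged by the conversion; only the storage layout differs.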


In [9]:
savings_report(data, data_compressed)

Original size: 0.860128 megabyte
Compressed size: 0.082824 megabyte
Savings: 0.777304 megabyte
Reduction percentage: 90.37%


In [10]:
data_compressed["secondary_color"].sparse.density

0.05

In [11]:
data_compressed["number_of_modifications"].sparse.density

0.05

In [12]:
data_compressed["imported"].sparse.density

0.05
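
The density of 0.05 matches the 1,000 present values out of 20,000 rows. The sketch below shows how that number relates to other attributes of the pandas `.sparse` accessor, using a made-up float column (`column` is a hypothetical name) rather than the compressed DataFrame.

```python
import numpy as np
import pandas as pd

n_missing, n_present = 19_000, 1_000

# A float column stored sparsely, with NaN as the fill value.
column = pd.Series(
    [np.nan] * n_missing + [1.0] * n_present,
    dtype=pd.SparseDtype("float64", np.nan),
)

print(column.sparse.fill_value)             # the value that is not stored
print(column.sparse.npoints)                # number of explicitly stored values
print(column.sparse.npoints / len(column))  # same ratio as column.sparse.density
```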