# Working with Sparse Data

Values in data are often not evenly distributed ... 
When there is a predominant value in your data ...

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
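
As a minimal sketch of the idea in the guide linked above, using plain pandas rather than compressio: when one value (here NaN) dominates a column, a sparse dtype stores only the remaining values and their positions. The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 10,000 missing values followed by 10 real ones.
dense = pd.Series([np.nan] * 10_000 + [float(i) for i in range(1, 11)])

# Store only the non-NaN entries; NaN becomes the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Compare the data buffers only (index excluded).
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```

The saving depends entirely on how dominant the fill value is; a column with few repeated values gains little or nothing from a sparse dtype.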

In [54]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [68]:
import pandas as pd
import numpy as np

from compressio import Compress, storage_size, SparseCompressor, compress_report, savings_report

We create a fictional cars dataset to illustrate how sparse data structures can reduce memory usage in pandas.

In [63]:
import random

data = pd.DataFrame({
    'primary_color': pd.Series(random.choices(['green', 'blue', 'red', 'yellow', 'white', 'pink'], k=10010), dtype=str),
    'secondary_color': pd.Series([None] * 10000 + random.choices(['gold', 'black', 'silver'], k=10), dtype=str),
    'date_registered': pd.Series([1] * 10010, dtype="datetime64[ns]"),
    'date_scrapped': pd.Series([pd.NaT] * 10000 + [1,2,3,4,5,6,7,8,9,10], dtype="datetime64[ns]"),
    'number_of_modifications': pd.Series([pd.NA] * 100 + [0] * 9000 + [1] * 500 + [2] * 400 + [3] * 10, dtype="Int64"),
})

data

| | primary_color | secondary_color | date_registered | date_scrapped | number_of_modifications |
|---|---|---|---|---|---|
| 0 | white | | 1970-01-01 00:00:00.000000001 | NaT | |
| 1 | blue | | 1970-01-01 00:00:00.000000001 | NaT | |
| 2 | pink | | 1970-01-01 00:00:00.000000001 | NaT | |
| 3 | red | | 1970-01-01 00:00:00.000000001 | NaT | |
| 4 | red | | 1970-01-01 00:00:00.000000001 | NaT | |
| ... | ... | ... | ... | ... | ... |
| 10005 | red | gold | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000006 | 3 |
| 10006 | pink | black | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000007 | 3 |
| 10007 | white | silver | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000008 | 3 |
| 10008 | pink | silver | 1970-01-01 00:00:00.000000001 | 1970-01-01 00:00:00.000000009 | 3 |


In [64]:
compress = Compress(compressor=SparseCompressor())

In [65]:
original_size = storage_size(data).to('megabyte')
print(f'Original DataFrame size: {original_size}')

Original DataFrame size: 0.41053799999999996 megabyte
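
For reference, the figure above is consistent with pandas' shallow memory accounting: four columns at 8 bytes per row, one nullable `Int64` column at 9 bytes per row (8 for the value, 1 for the missing-value mask), over 10,010 rows, plus a small index. That reading is inferred from the numbers, not from how `storage_size` is implemented. A small sketch with a stand-in frame (column contents are made up):

```python
import pandas as pd

n = 10_010
frame = pd.DataFrame({
    "primary_color": pd.Series(["red"] * n, dtype=object),
    "date_registered": pd.Series([1] * n, dtype="datetime64[ns]"),
    "number_of_modifications": pd.Series([0] * n, dtype="Int64"),
})

# Shallow usage: object columns count only their 8-byte pointers,
# not the Python strings they point to.
per_column = frame.memory_usage(index=False)
print(per_column)
print(f"{per_column.sum() / 1e6} megabyte (index excluded)")
```

Passing `deep=True` to `memory_usage` would also count the string objects themselves, which gives a larger figure for object columns.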


In [66]:
compress.typeset.detect_type(data)

{'primary_color': String,
 'secondary_color': String,
 'date_registered': DateTime,
 'date_scrapped': DateTime,
 'number_of_modifications': Integer}

In [69]:
data_compressed = compress.it(data)
data_compressed

ValueError: cannot convert to 'Sparse[UInt8, nan]'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
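
The error above appears to stem from converting a nullable integer column that still contains `pd.NA` into a sparse dtype backed by a NumPy array. One possible workaround in plain pandas, shown as a sketch and not as what compressio does internally, is to convert to float64 so that missing values become NaN, and to use 0 as the fill value since it is the predominant value in this column.

```python
import numpy as np
import pandas as pd

# Same value counts as number_of_modifications in the cars dataset.
mods = pd.Series(
    [pd.NA] * 100 + [0] * 9000 + [1] * 500 + [2] * 400 + [3] * 10,
    dtype="Int64",
)

# Convert explicitly, mapping pd.NA to NaN so NumPy can represent it.
as_float = mods.to_numpy(dtype="float64", na_value=np.nan)

# Treat 0 as the implicit fill value; NaN and non-zero counts are stored.
sparse_mods = pd.Series(pd.arrays.SparseArray(as_float, fill_value=0.0))

print(sparse_mods.dtype)
print(sparse_mods.sparse.density)
```

The trade-off is that the column becomes floating point, so the integer dtype is lost in exchange for a representation that can hold both NaN and a sparse fill value.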

In [62]:
compress_report(data, compress.typeset, compress.compressor, units="kilobytes")

primary_color: converting from object to category saves 69.862 kilobyte
secondary_color: converting from object to Sparse[object, nan] saves 79.96000000000001 kilobyte
date_registered: converting from datetime64[ns] to category saves 69.982 kilobyte
date_scrapped: converting from datetime64[ns] to category saves 69.67 kilobyte
number_of_modifications: converting from float64 to float16 saves 60.06 kilobyte


In [52]:
savings_report(data, data_compressed)

Original size: 0.400528 megabyte
Compressed size: 0.050994 megabyte
Savings: 0.349534 megabyte
Reduction percentage: 87.27%


In [53]:
data_compressed["secondary_color"].sparse.density

0.000999000999000999
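
The density is the fraction of entries that are actually stored, that is, entries that differ from the fill value. In `secondary_color`, 10 of the 10,010 rows hold a colour, so the density is 10 / 10010 ≈ 0.000999. The sketch below checks this on a stand-in object column built directly with pandas (the colour values are made up):

```python
import numpy as np
import pandas as pd

# 10,000 missing entries followed by 10 colours, kept as Python objects.
colors = ["gold", "black", "silver", "gold", "black",
          "silver", "gold", "black", "silver", "gold"]
values = np.array([np.nan] * 10_000 + colors, dtype=object)

# NaN is the fill value, so only the 10 colours are stored explicitly.
sparse_colors = pd.Series(pd.arrays.SparseArray(values, fill_value=np.nan))

stored = sparse_colors.sparse.npoints
density = sparse_colors.sparse.density
print(stored, len(sparse_colors), density)
```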

In [29]:
# data_compressed["number_of_modifications"].sparse.density