# Real-world dataset example

In this example we demonstrate ...

In [1]:
%load_ext autoreload
%autoreload 2

import random
import datetime

import numpy as np
import pandas as pd
from visions import StandardSet

from compressio import Compress, storage_size, savings, savings_report

In [2]:
url = "https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv?accessType=DOWNLOAD"

# Load dataset
df = pd.read_csv(url)

In [3]:
compress = Compress()

In [4]:
original_size = storage_size(df, deep=True).to('megabyte')
print(f'Original DataFrame size: {original_size}')

Original DataFrame size: 11.826331999999999 megabyte


In [5]:
df_compressed = compress.it(df)

In [6]:
savings(df, df_compressed, deep=True)

In [7]:
savings_report(df, df_compressed, deep=True)

Original size: 11.826331999999999 megabyte
Compressed size: 3.00723 megabyte
Savings: 8.819101999999999 megabyte
Reduction percentage: 74.57%
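The reduction percentage in the report can be checked by hand from the two sizes it prints. The sketch below assumes the percentage is simply the saved size divided by the original size; `reduction_percentage` is a hypothetical helper written for this illustration and is not part of `compressio`.

```python
def reduction_percentage(original: float, compressed: float) -> float:
    """Share of the original size removed by compression, in percent."""
    return 100 * (original - compressed) / original


# Sizes in megabytes, copied from the report above
original_mb = 11.826332
compressed_mb = 3.00723

pct = reduction_percentage(original_mb, compressed_mb)
print(f"Reduction percentage: {pct:.2f}%")
```

Under that assumption the result agrees with the 74.57% shown in the report.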


A memory reduction of around 75%, nice. But why stop there? So far we have used `visions` to detect dtypes and compress accordingly. However, `visions` also supports type inference and coercion: relations between types can be defined so that values are coerced automatically, for instance integers stored as floats (`[1.0, 2.0, 3.0]` => `[1, 2, 3]`).
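To make the float-to-integer idea concrete, here is a minimal sketch in plain pandas. It does not use the `visions` relation API; `floats_to_ints` is a hypothetical helper written only for this illustration, and it assumes that a float column may be converted only when it has no missing values and every value is a whole number that fits in `int64`.

```python
import numpy as np
import pandas as pd


def floats_to_ints(series: pd.Series) -> pd.Series:
    """Return an integer version of a float series when that is lossless.

    Hypothetical helper for illustration, not part of visions or compressio.
    """
    if not pd.api.types.is_float_dtype(series):
        return series  # only float columns are candidates
    if series.isna().any():
        return series  # NaN cannot be stored in a NumPy integer dtype
    if not np.all(np.mod(series.to_numpy(), 1) == 0):
        return series  # fractional (or infinite) values present
    # Convert, then let pandas pick the smallest integer dtype that fits
    return pd.to_numeric(series.astype("int64"), downcast="integer")


s = pd.Series([1.0, 2.0, 3.0])
converted = floats_to_ints(s)

print(s.dtype, "->", converted.dtype)
print(s.memory_usage(index=False), "->", converted.memory_usage(index=False), "bytes")
```

A column containing `1.5` or `NaN` is returned unchanged, so the conversion never alters the values.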

In [8]:
df_compressed.memory_usage(deep=True)

Index                    128
Name                 2443548
Job Titles            197207
Department             36665
Full or Part-Time      33124
Salary or Hourly       33134
Typical Hours          65856
Annual Salary         131712
Hourly Rate            65856
dtype: int64

In [9]:
df.memory_usage(deep=True)

Index                    128
Name                 2443548
Job Titles           2487171
Department           2120925
Full or Part-Time    1909824
Salary or Hourly     2074464
Typical Hours         263424
Annual Salary         263424
Hourly Rate           263424
dtype: int64
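The two listings are easier to compare side by side. The sketch below builds a per-column savings table; `column_savings` is a hypothetical helper, and the byte counts are copied from the two outputs above so that the block runs without downloading the dataset.

```python
import pandas as pd


def column_savings(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Per-column bytes saved and reduction percentage (hypothetical helper)."""
    report = pd.DataFrame({"original": before, "compressed": after})
    report["saved_bytes"] = report["original"] - report["compressed"]
    report["reduction_pct"] = (100 * report["saved_bytes"] / report["original"]).round(2)
    return report.sort_values("saved_bytes", ascending=False)


columns = ["Index", "Name", "Job Titles", "Department", "Full or Part-Time",
           "Salary or Hourly", "Typical Hours", "Annual Salary", "Hourly Rate"]

# Byte counts copied from df.memory_usage(deep=True) above
before = pd.Series([128, 2443548, 2487171, 2120925, 1909824,
                    2074464, 263424, 263424, 263424], index=columns)
# Byte counts copied from df_compressed.memory_usage(deep=True) above
after = pd.Series([128, 2443548, 197207, 36665, 33124,
                   33134, 65856, 131712, 65856], index=columns)

report = column_savings(before, after)
print(report)
```

With the DataFrames loaded, the same helper can be called as `column_savings(df.memory_usage(deep=True), df_compressed.memory_usage(deep=True))`. The `Name` column is the same size in both listings, so it accounts for most of the remaining compressed size.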