**Conclusion:** the 'unitid' column takes about 50% of the total memory of EPA CEMS. It is a string dtype with only about 1500 unique values. Casting it to the 'category' dtype removes almost all of that memory usage, reducing the total EPA CEMS memory footprint by almost 50%.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from src.data.load_dataset import load_epacems
import pudl

In [3]:
import pandas as pd

In [4]:
cems = load_epacems(states=None, columns=None) # all states, one year, all columns

In [5]:
cems.dtypes

plant_id_eia                               Int32
unitid                                    string
operating_datetime_utc       datetime64[ns, UTC]
operating_time_hours                     float32
gross_load_mw                            float32
steam_load_1000_lbs                      float32
so2_mass_lbs                             float32
so2_mass_measurement_code               category
nox_rate_lbs_mmbtu                       float32
nox_rate_measurement_code               category
nox_mass_lbs                             float32
nox_mass_measurement_code               category
co2_mass_tons                            float32
co2_mass_measurement_code               category
heat_content_mmbtu                       float32
facility_id                                Int32
unit_id_epa                                Int32
year                                    category
state                                   category
dtype: object

Half of total memory is from a single string column: unitid

In [6]:
cems.memory_usage(deep=True) / 2**20 # megabytes

Index                           0.000122
plant_id_eia                  172.931328
unitid                       2069.946968
operating_datetime_utc        276.690125
operating_time_hours          138.345062
gross_load_mw                 138.345062
steam_load_1000_lbs           138.345062
so2_mass_lbs                  138.345062
so2_mass_measurement_code      34.587056
nox_rate_lbs_mmbtu            138.345062
nox_rate_measurement_code      34.587120
nox_mass_lbs                  138.345062
nox_mass_measurement_code      34.587120
co2_mass_tons                 138.345062
co2_mass_measurement_code      34.587056
heat_content_mmbtu            138.345062
facility_id                   172.931328
unit_id_epa                   172.931328
year                           34.586987
state                          34.590045
dtype: float64

In [7]:
cems.memory_usage(deep=True).sum() / 2**20 # total megabytes

4179.717080116272
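
From the two outputs above, unitid holds 2069.95 of 4179.72 MB, or about 49.5% of the frame. A minimal sketch of computing that share directly for every column follows. The helper name `memory_share` and the two-column `demo` frame are made up for illustration and are not part of the project code.

```python
import pandas as pd


def memory_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of the frame's deep memory usage held by each column, largest first."""
    usage = df.memory_usage(deep=True, index=False)
    return (usage / usage.sum()).sort_values(ascending=False)


# Hypothetical stand-in for the CEMS frame: one string column, one float32 column.
demo = pd.DataFrame(
    {
        "unitid": pd.Series(["unit_1", "unit_2"] * 50_000, dtype="string"),
        "gross_load_mw": pd.Series([1.0, 2.0] * 50_000, dtype="float32"),
    }
)

share = memory_share(demo)
print(share)
```

On the real data this would be called as `memory_share(cems)`.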

In [8]:
# check cardinality - can this be categorical?
cems.unitid.unique().shape

(1472,)

In [9]:
# yes it can
cems.unitid.astype('category').memory_usage(deep=True) / 2**20

69.29010486602783
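
The same comparison can be reproduced without loading EPA CEMS. The sketch below uses synthetic data, assuming 1,500 made-up labels spread over one million rows, so the absolute numbers will not match the outputs above; only the direction of the effect is expected to carry over.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the unitid column: many rows, few distinct labels.
rng = np.random.default_rng(0)
labels = np.array([f"unit_{i}" for i in range(1500)], dtype=object)
unitid = pd.Series(rng.choice(labels, size=1_000_000), dtype="string")

as_string_mb = unitid.memory_usage(deep=True) / 2**20
as_category_mb = unitid.astype("category").memory_usage(deep=True) / 2**20

print(f"string:   {as_string_mb:.1f} MB")
print(f"category: {as_category_mb:.1f} MB")
```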

This saves about 2 GB per year of data, almost half of total memory.

In [10]:
# ~2 GB saved per year * number of EPA CEMS years = total GB saved
2 * len(pudl.constants.data_years['epacems'])

50

Saving about 50 GB across all EPA CEMS years brings the full dataset down to workstation-scale analysis.
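
The arithmetic behind that estimate can be spelled out with the measured values from the cells above. This is a rough sketch that assumes every year is about the same size as the one loaded here, which the notebook does not verify. The year count of 25 is inferred from the output of 50 in cell 10.

```python
# Measured values copied from the outputs above (MB means 2**20 bytes).
string_mb = 2069.946968    # unitid as string, one year (cell 6)
category_mb = 69.290105    # unitid as category, one year (cell 9)
total_mb = 4179.717080     # whole frame, one year (cell 7)
n_years = 25               # inferred: 2 * n_years == 50 in cell 10

saved_per_year_gb = (string_mb - category_mb) / 1024
saved_total_gb = saved_per_year_gb * n_years

# Assumption: every year is about as large as the one measured here.
remaining_total_gb = (total_mb - (string_mb - category_mb)) / 1024 * n_years

print(f"saved per year: {saved_per_year_gb:.2f} GB")
print(f"saved in total: {saved_total_gb:.1f} GB")
print(f"left in total:  {remaining_total_gb:.1f} GB")
```

Under that assumption the saving is roughly 49 GB, leaving roughly 53 GB for all years with all columns loaded.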

In [11]:
del cems

How does this scale? Does cardinality get out of control with more years?

In [12]:
cems = load_epacems(states=None, years=[2018, 2019], columns=None) # all states, 2 years, all columns

In [13]:
cems.memory_usage(deep=True) / 2**20 # megabytes

Index                           0.000122
plant_id_eia                  348.258591
unitid                       4167.796188
operating_datetime_utc        557.213745
operating_time_hours          278.606873
gross_load_mw                 278.606873
steam_load_1000_lbs           278.606873
so2_mass_lbs                  278.606873
so2_mass_measurement_code      69.652509
nox_rate_lbs_mmbtu            278.606873
nox_rate_measurement_code      69.652573
nox_mass_lbs                  278.606873
nox_mass_measurement_code      69.652573
co2_mass_tons                 278.606873
co2_mass_measurement_code      69.652509
heat_content_mmbtu            278.606873
facility_id                   348.258591
unit_id_epa                   348.258591
year                           69.652439
state                          69.655498
dtype: float64

In [14]:
cems.memory_usage(deep=True).sum() / 2**20 # total megabytes

8416.55890750885

In [15]:
cems.unitid.astype('category').memory_usage(deep=True) / 2**20

139.42235374450684

In [16]:
# cardinality barely changes with a second year (it was 1472 for one year)
cems.unitid.unique().shape

(1495,)
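
The categorical footprint doubled (69.3 MB to 139.4 MB) while cardinality barely moved, which is consistent with how pandas stores categoricals: one small integer code per row plus a single copy of each label. At this cardinality the codes are int16, so 2 bytes per row, or half the footprint of a float32 column (138.35 / 2 = 69.17 MB for one year, close to the 69.29 MB measured). A small sketch on synthetic data, with the hypothetical helper `category_footprint`, shows the code width depending on the number of labels rather than the number of rows:

```python
import numpy as np
import pandas as pd


def category_footprint(n_rows: int, n_labels: int):
    """Return (codes dtype, codes bytes, number of categories) for a synthetic column."""
    labels = np.array([f"unit_{i}" for i in range(n_labels)], dtype=object)
    # np.resize cycles through the labels, so every label appears when n_rows >= n_labels.
    col = pd.Series(np.resize(labels, n_rows)).astype("category")
    codes = col.cat.codes
    return codes.dtype, codes.to_numpy().nbytes, len(col.cat.categories)


small = category_footprint(n_rows=200_000, n_labels=1_500)
double_rows = category_footprint(n_rows=400_000, n_labels=1_500)
many_labels = category_footprint(n_rows=200_000, n_labels=40_000)

print(small)
print(double_rows)
print(many_labels)
```

Doubling the rows doubles the bytes spent on codes, while the code width only grows once the label count no longer fits in the smaller integer type. With roughly 1,500 unit IDs there is a lot of headroom before that happens.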