# Data type optimization and parquet storage

- Data types chosen automatically by `pandas.read_csv()` are not always optimal. For example:
  - leading zeros in ZIP codes are lost when the column is parsed as integers
  - 8 bytes are used per value where 1 byte would suffice
- String columns use a lot of memory. Convert them to categoricals when the number of unique values is small relative to the number of observations.
- The Parquet storage format preserves dtype information and enables partitioning.
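
The first point can be illustrated with a minimal sketch. The two-row CSV and its column names (`ZIP`, `FLAG`) are made up for illustration and are not part of SynIG.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV; column names are illustrative only.
csv = "ZIP,FLAG\n01234,1\n02139,0\n"

# Default conversion: ZIP is parsed as an integer, FLAG as a 64-bit integer.
default = pd.read_csv(io.StringIO(csv))

# Custom conversion: keep ZIP as text, store FLAG in a single byte.
custom = pd.read_csv(io.StringIO(csv), dtype={"ZIP": str, "FLAG": "uint8"})

print(default["ZIP"].tolist())   # leading zeros are lost
print(custom["ZIP"].tolist())    # leading zeros are kept
print(default["FLAG"].dtype, custom["FLAG"].dtype)
```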

# Data types

Pandas columns are stored internally as NumPy arrays, so [NumPy data types](https://numpy.org/doc/stable/user/basics.types.html) are used.

**Boolean**

`np.bool_` takes 1 byte per item but cannot hold missing values. Logical operations on columns return a series of this dtype, unless some of the element-wise tests result in an NA value, in which case the result is of `object` dtype.


**Integer**

Limits and other details can be looked up with `numpy.iinfo()`.

Storing a value outside these limits causes overflow.

|  dtype | size (bytes) |             min            |             max            |
|:------:|:------------:|:--------------------------:|:--------------------------:|
| uint8  |       1      | 0                          | 255                        |
| uint16 |       2      | 0                          | 65,535                     |
| uint32 |       4      | 0                          | 4,294,967,295              |
| uint64 |       8      | 0                          | 18,446,744,073,709,551,615 |
| int8   |       1      | -128                       | 127                        |
| int16  |       2      | -32,768                    | 32,767                     |
| int32  |       4      | -2,147,483,648             | 2,147,483,647              |
| int64  |       8      | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807  |

Integer dtypes offer a wide range of sizes, but their biggest constraint is that in standard pandas they do not allow missing values.
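
As a small sketch of the limits and of overflow, the snippet below looks up the `int8` range and then pushes values past it. Array arithmetic and `astype()` wrap around silently here; other operations may raise an error instead, depending on the NumPy version.

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)                     # limits of int8

a = np.array([127], dtype=np.int8)
wrapped = a + np.array([1], dtype=np.int8)    # exceeds the int8 maximum and wraps around
print(wrapped)

truncated = np.array([300]).astype(np.uint8)  # 300 does not fit into uint8
print(truncated)
```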

**Floating point**

Wikipedia: [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format), [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [float64](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

Limits and other details can be looked up with `numpy.finfo()`.

The spacing between a number and its adjacent neighbor (`numpy.spacing()`) increases with the number's absolute magnitude. Therefore, care should be taken when storing large integers as floats.


|  dtype  | size (bytes) |           max           | precision (significant decimal digits) |         max exact integer        |
|:-------:|:------------:|:-----------------------:|:--------------------------------------:|:--------------------------------:|
| float16 |       2      |       6.55040e+04       |                 3 to 4                 |         $2^{11}$ = 2,048         |
| float32 |       4      |      3.4028235e+38      |                 6 to 9                 |       $2^{24}$ = 16,777,216      |
| float64 |       8      | 1.7976931348623157e+308 |                15 to 17                | $2^{53}$ = 9,007,199,254,740,992 |
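
The growth in spacing and the "max exact integer" column can be checked with a short sketch; the values 1.0 and 1e8 are arbitrary probe points.

```python
import numpy as np

print(np.spacing(np.float32(1.0)))   # gap to the next float32 after 1.0
print(np.spacing(np.float32(1e8)))   # the gap is much larger at 1e8

limit = 2 ** 24                      # max exact integer for float32, from the table
# limit + 1 is not representable in float32 and rounds back to limit
print(np.float32(limit) == np.float32(limit + 1))
# float64 still stores limit + 1 exactly
print(np.float64(limit + 1) == limit + 1)
```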

Even though `float16` might have good use cases (notably booleans with missing data), it is not always fully supported.

*Floats are used by pandas to store integers with missing values.*
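
A minimal sketch of this upcasting: reindexing an integer series so that one position has no value forces pandas to insert `NaN`, which changes the dtype to float.

```python
import pandas as pd

ints = pd.Series([1, 2, 3], dtype="int64")
with_gap = ints.reindex(range(4))   # index 3 has no value, so NaN is inserted

print(ints.dtype, with_gap.dtype)
print(with_gap.tolist())
```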

**Date and time**

Turn string dates, times and time intervals into the 64-bit dtypes `np.datetime64` and `np.timedelta64`, which support a wide range of [specialized functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).
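
A brief sketch of the conversion, using made-up date strings. The exact time unit of the resulting dtype (for example nanoseconds) may differ between pandas versions.

```python
import pandas as pd

raw = pd.Series(["2001-01-15", "2001-03-01", "2002-12-31"])
dates = pd.to_datetime(raw)        # strings become a datetime64 column

print(dates.dtype)
print(dates.dt.year.tolist())      # date components via the .dt accessor

gaps = dates.diff()                # differences are timedelta64 values
print(gaps.dtype)
print(gaps.dt.days.tolist())
```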

**Strings**

Although NumPy has a fixed-length Unicode string dtype (e.g. `np.dtype('U3')`), pandas uses `np.object_`. This is an array of pointers (item size of 32 or 64 bits, depending on platform architecture) to the memory locations where the actual strings are stored.
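
The difference between the pointer array and the strings it points to shows up in `memory_usage()`. This is a rough sketch with made-up values; `dtype=object` is requested explicitly because newer pandas versions may pick a dedicated string dtype by default.

```python
import numpy as np
import pandas as pd

# Fixed-length NumPy strings: every element takes the same number of bytes.
fixed = np.array(["WI", "CT", "NY"], dtype="U3")
print(fixed.itemsize)

# Object column: an array of pointers to separate Python string objects.
s = pd.Series(["WI", "CT", "NY"] * 1000, dtype=object)
shallow = s.memory_usage(index=False)           # pointers only
deep = s.memory_usage(index=False, deep=True)   # pointers plus the strings themselves
print(shallow, deep)
```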



### Categoricals

[Categorical variables](https://pandas.pydata.org/docs/user_guide/categorical.html) are a great way to reduce memory usage when working with string data. Instead of storing one string per observation (and a pointer to it), the column contains only category codes (which in most use cases fit into `np.int8`) plus the overhead of a codes-to-labels mapping. Operations on categorical columns (select, groupby) will also be faster than on string columns. If you perform an operation on two categorical columns (compare, merge), make sure that their categories are the same in order to realize the potential performance gains.
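
The following sketch compares memory usage of the same made-up column stored as strings and as a categorical. It then shows one way to keep categories identical across columns, by sharing a single `CategoricalDtype`.

```python
import pandas as pd

s = pd.Series(["WI", "CT", "NY"] * 1000, dtype=object)
cat = s.astype("category")

print(cat.cat.codes.dtype)   # codes are stored as small integers
print(s.memory_usage(index=False, deep=True),
      cat.memory_usage(index=False, deep=True))

# Sharing one CategoricalDtype guarantees identical categories in both columns.
shared = pd.CategoricalDtype(["CT", "NY", "WI"])
left = pd.Series(["WI", "CT"], dtype=shared)
right = pd.Series(["WI", "NY"], dtype=shared)
print((left == right).tolist())
```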

### Experimental nullable dtypes

Recent versions of pandas have new data types that support a new form of [missing values](https://pandas.pydata.org/docs/user_guide/integer_na.html). With them, we no longer need floats to store integers with missing values. The feature is still experimental, though.
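
A short sketch of the nullable dtypes, assuming a pandas version that provides them. Note the capitalized dtype name `"Int8"`, which differs from NumPy's `"int8"`.

```python
import pandas as pd

# Nullable integer: the missing value is stored without converting to float.
s = pd.Series([1, 2, None], dtype="Int8")
print(s.dtype, s.isna().tolist())
print(s.sum())   # the missing value is skipped

# Nullable boolean: True/False with missing values.
b = pd.Series([True, None, False], dtype="boolean")
print(b.dtype, b.isna().sum())
```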

## Example

Load a subset of SynIG columns and compare memory usage under different conversion regimes.

In [None]:
import pandas as pd

cols = ['STATE', 'SECTOR', 'NAICS', 'EMPLOYEES', 'EMPLOYEES_CODE', 'COUNTY_CODE', 'LONGITUDE', 'LATITUDE']

# default conversion
df0 = pd.read_csv('data/synig/2001.csv', usecols=cols, nrows=100_000)

# no conversion
df1 = pd.read_csv('data/synig/2001.csv', usecols=cols, nrows=100_000, dtype='str')

# custom conversion
from tools import state_00_aa
sectors = ['11', '21', '22', '23', '31', '42', '44', '48', '51', '52',
           '53', '54', '55', '56', '61', '62', '71', '72', '81', '92', '99']
states = list(state_00_aa.values())
dt = {
    'STATE': pd.CategoricalDtype(states),
    'SECTOR': pd.CategoricalDtype(sectors),
    'NAICS': 'str',
    'EMPLOYEES': 'float32',
    'EMPLOYEES_CODE': pd.CategoricalDtype(list('ABCDEFGHIJK'), ordered=True),
    'COUNTY_CODE': 'str',
    'LONGITUDE': 'float64',
    'LATITUDE': 'float64'
}
df2 = pd.read_csv('data/synig/2001.csv', usecols=cols, nrows=100_000, dtype=dt)

def dt_mem(df):
    mem = (df.memory_usage(index=False, deep=True) / 1e6).round(1)
    # axis must be passed by keyword in recent pandas versions
    return pd.concat([df.dtypes, mem], axis=1).rename(columns={0: 'dtype', 1: 'mem, MB'})

pd.concat({'default': dt_mem(df0),
           'no conversion': dt_mem(df1),
           'custom': dt_mem(df2)}, axis=1)

# Parquet

- Binary format: data type is preserved
- Columnar storage: efficient reading of subset of columns and dtype-specific compression
- Partitioning: only read chunks that satisfy a given condition
  - Every partition adds metadata overhead. With too many partitions, this can incur a significant performance cost. For example, if SynIG is partitioned by YEAR, STATE and SECTOR (about 17,000 partitions), reading becomes much slower.

## Save single dataframe

In [None]:
import pandas as pd
from tools import convert_synig_dtypes

df = pd.read_csv('data/synig/2001.csv', dtype='str')
convert_synig_dtypes(df)
df.to_parquet('data/synig_2001.pq', index=False, partition_cols=['STATE'])

df0 = df
df1 = pd.read_parquet('data/synig_2001.pq')

pd.concat({'before storage': dt_mem(df0),
           'after loading': dt_mem(df1)}, axis=1)

We can now efficiently load subsets of the data.

In [None]:
# STATE is requested explicitly because it is used in the groupby below
df = pd.read_parquet('data/synig_2001.pq', columns=['STATE', 'SECTOR', 'EMPLOYEES'],
                     filters=[('STATE', 'in', ['WI', 'CT'])])
(df.groupby(['SECTOR', 'STATE'])['EMPLOYEES'].sum()
 .dropna().unstack().fillna(0).astype(int).style.format('{:,d}'))

## Convert SynIG from CSV to parquet

We want to store the entire SynIG dataset (all years and columns) in parquet, partitioned by YEAR and STATE. We cannot save it all in one go, because doing so would require loading it all into memory at once. So we use a split-apply-combine strategy and process it year by year.

In [None]:
%%time
# remove old 'data/synig.pq' and restart kernel before running

import pandas as pd
import fastparquet
from tools import convert_synig_dtypes

years = range(2001, 2021)
# reduce number of years for faster demonstration
years = years[:3]
paths = []
for year in years:
    print(year, end=' ')
    df = pd.read_csv(f'data/synig/{year}.csv', dtype=str)
    del df['YEAR']
    convert_synig_dtypes(df)
    path = f'data/synig.pq/YEAR={year}'
    fastparquet.write(path, df, file_scheme='hive', write_index=False, partition_on=['STATE'])
    paths.append(path)
pf = fastparquet.writer.merge(paths)
print()

## Compare performance

In [None]:
import pandas as pd
from tools import convert_synig_dtypes, ResourceMonitor
from time import sleep

### Read one year

In [None]:
mon = ResourceMonitor(interval=0.3)
def read_csv():
    mon.tag('read csv')
    df = pd.read_csv('data/synig/2001.csv', dtype=str)
    mon.tag('convert')
    convert_synig_dtypes(df)
    print(df.shape)
def read_pq():
    mon.tag('read pq')
    df = pd.read_parquet('data/synig.pq', filters=[('YEAR', '==', 2001)])
    print(df.shape)

mon.start()
sleep(1)
read_csv()
sleep(1)
read_pq()
sleep(1)
mon.stop()
mon.plot()

### Read one state

Read a single state across several years, loading only a subset of columns.

In [None]:
mon = ResourceMonitor(interval=0.3)
years = range(2001, 2021)
# reduce number of years for faster demonstration
years = years[:3]
state = 'WI'
cols = ['YEAR', 'STATE', 'SECTOR', 'EMPLOYEES', 'NAICS', 'LONGITUDE', 'LATITUDE']

def read_csv():
    mon.tag('read csv')
    df = []
    for year in years:
        print(year, end=' ')
        d = pd.read_csv(f'data/synig/{year}.csv', dtype=str, usecols=cols)
        convert_synig_dtypes(d)
        d = d[d['STATE'] == state]
        df.append(d)
    df = pd.concat(df, ignore_index=True)
    print()
    print(df.shape)
    sleep(1)
    
def read_pq():
    mon.tag('read pq')
    df = pd.read_parquet('data/synig.pq', columns=cols, 
                         filters=[('YEAR', 'in', list(years)), ('STATE', '==', state)])
    print(df.shape)
    sleep(1)

mon.start()
sleep(1)
read_csv()
sleep(1)
read_pq()
mon.stop()
mon.plot()