# Data types optimization

> Convert columns to use more memory-efficient dtypes.

In [None]:
import random
import itertools
import string
import timeit

import numpy as np
import pandas as pd

# NumPy data types

Pandas columns are stored internally as NumPy arrays, so [NumPy data types](https://numpy.org/doc/stable/user/basics.types.html) are used.

**Boolean**

`np.bool_` takes 1 byte per item but cannot hold missing values. Logical operations on columns return a series of this dtype, unless some of the element-wise tests result in an NA value, in which case the result is of `object` dtype.
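
As a minimal sketch of the point above (the series names are made up for illustration), the following compares a plain boolean column with one that contains a missing value. It relies only on default pandas type inference.

```python
import numpy as np
import pandas as pd

# A plain boolean column uses one byte per element.
flags = pd.Series([True, False, True])
bool_itemsize = flags.dtype.itemsize

# A missing value cannot be stored in np.bool_, so with default
# inference the column falls back to object dtype.
flags_with_na = pd.Series([True, False, None])

print(flags.dtype, bool_itemsize, flags_with_na.dtype)
```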


**Integer**

Limits and other details can be looked up with `numpy.iinfo()`.

Storing a value outside these limits causes overflow.

|  dtype | size (bytes) |             min            |             max            |
|:------:|:------------:|:--------------------------:|:--------------------------:|
| uint8  |       1      | 0                          | 255                        |
| uint16 |       2      | 0                          | 65,535                     |
| uint32 |       4      | 0                          | 4,294,967,295              |
| uint64 |       8      | 0                          | 18,446,744,073,709,551,615 |
| int8   |       1      | -128                       | 127                        |
| int16  |       2      | -32,768                    | 32,767                     |
| int32  |       4      | -2,147,483,648             | 2,147,483,647              |
| int64  |       8      | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807  |

Integer dtypes provide a wide range of options, but their biggest constraint is that, in standard pandas, they do not allow missing values.
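
The limits, the overflow behaviour and the missing-value constraint can be checked with a short sketch like the one below. The variable names are illustrative, and the wrap-around shown applies to element-wise array arithmetic.

```python
import numpy as np
import pandas as pd

# Look up the limits instead of memorising the table.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max, int8_info.bits)

# Array arithmetic wraps around silently when the result does not fit.
a = np.array([127], dtype=np.int8)
b = np.array([1], dtype=np.int8)
wrapped = a + b

# A missing value cannot be stored in a NumPy integer column,
# so pandas infers float64 for this data.
with_na = pd.Series([1, 2, None])

print(wrapped, wrapped.dtype, with_na.dtype)
```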

**Floating point**

Wikipedia: [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format), [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [float64](https://en.wikipedia.org/wiki/Double-precision_floating-point_format).

Limits and other details can be looked up with `numpy.finfo()`.

The spacing between a number and its adjacent neighbor (`numpy.spacing()`) increases with the number's absolute magnitude. Therefore, care should be taken when storing large integers as floats.


|  dtype  | size (bytes) |           max           |       max exact integer       |
|:-------:|:------------:|:-----------------------:|:-----------------------------:|
| float16 |       2      |       6.55040e+04       |         $2^{11}$ = 2,048         |
| float32 |       4      |      3.4028235e+38      |       $2^{24}$ = 16,777,216      |
| float64 |       8      | 1.7976931348623157e+308 | $2^{53}$ = 9,007,199,254,740,992 |

Even though `float16` might have good use cases (notably booleans with missing data), it does not appear to be fully supported everywhere.

**String**

Although NumPy has fixed-length Unicode string dtypes (e.g. `np.dtype('U3')`), pandas uses `np.object_`. This is an array of pointers (item size of 32 or 64 bits, depending on platform architecture) to the memory locations where the actual strings are stored.
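
A rough way to see the pointer layout is to compare shallow and deep memory usage of an object column, as sketched below. The series is forced to `object` dtype explicitly, on the assumption that some pandas versions or configurations may infer a dedicated string dtype instead.

```python
import numpy as np
import pandas as pd

# Fixed-length NumPy unicode dtype: 4 bytes per character.
u3_itemsize = np.dtype('U3').itemsize

# Object column: the array holds pointers, the strings live elsewhere.
s = pd.Series(['abc'] * 1000, dtype=object)
shallow = s.memory_usage(index=False)          # pointers only
deep = s.memory_usage(index=False, deep=True)  # pointers plus string objects

print(u3_itemsize, s.dtype.itemsize, shallow, deep)
```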

In [None]:
# floats spacing increases with number magnitude
dt = np.float16
info = np.finfo(dt)
for exp in range(-16, 17):
    x = 2**exp
    npx = dt(x)
    print(exp, x, npx, np.spacing(npx))

In [None]:
# integer precision limits on floats
for dt in [np.float16, np.float32, np.float64]:
    info = np.finfo(dt)
    max_int = 2**(info.nmant + 1)
    print(dt.__name__, info.nmant, max_int)
    assert dt(max_int - 1) != dt(max_int)
    assert dt(max_int) == dt(max_int + 1)

## Automatic conversion

Be mindful of possible overflow when performing operations on numerical series, as dtypes are not always converted to wider types automatically.

The result of a series aggregation is a NumPy scalar with its own NumPy dtype.
Summation of signed ints results in `int64`, regardless of the input dtype.
Summation of `float32` remains `float32`, so precision may be lost.

In [None]:
# going beyond float32 integer precision
s = pd.Series([2**24] * 3, dtype='float32')
assert ((s + 1) == s).all()

In [None]:
# 127 is max int8, but s.sum() does not overflow, because result is stored in int64
s = pd.Series([127, 127, 127], dtype='int8')
ss = s.sum()
print(ss.dtype, ss, 127 * 3)

In [None]:
# 2**24 is the largest int up to which every integer is exactly representable in float32
s = pd.Series([2**24] * 3, dtype='float32')
# s.sum() stays float32; 3 * 2**24 happens to be exactly representable, but sums in this range are rounded in general
ss = s.sum()
print(ss.dtype, ss, 2**24 * 3)
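
Aggregations are widened, but element-wise arithmetic is not. The sketch below (with made-up values) contrasts the two on a `uint8` series.

```python
import pandas as pd

s = pd.Series([200, 100], dtype='uint8')

# Element-wise addition keeps uint8, so 400 and 200 are stored modulo 256.
doubled = s + s

# Aggregation uses a wider accumulator, so the total is correct.
total = s.sum()

print(doubled.tolist(), doubled.dtype, total)
```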

# Categorical

[User guide](https://pandas.pydata.org/docs/user_guide/categorical.html)

[#](https://pandas.pydata.org/docs/user_guide/categorical.html#missing-data)
> Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility. When working with the Categorical’s codes, missing values will always have a code of -1.
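
The quoted behaviour can be verified on a tiny categorical. This is a sketch with arbitrary values; it also shows that a small number of categories is encoded with `int8` codes.

```python
import pandas as pd

c = pd.Categorical(['a', None, 'b', 'a'])

# The missing value is not a category ...
print(c.categories.tolist())

# ... and is encoded as -1 in the integer codes.
print(c.codes.tolist(), c.codes.dtype)
```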

In [None]:
def gen_unique_str(n, l, alphabet=None):
    """Return list of `n` random unique strings of length `l`."""
    if alphabet is None:
        alphabet = string.ascii_lowercase
    assert len(alphabet) ** l >= n, f'Cannot generate {n} unique strings of length {l} from alphabet of length {len(alphabet)}.'
    str_set = set()
    while len(str_set) < n:
        str_set.add(''.join(random.choices(alphabet, k=l)))
    return list(str_set)
    

def gen_mock_data(n_rows, num=None, str_=None, cat=None):
    """Return dataframe with random data.
    
    `num`: number of columns.
    
    `str_`: {'n': number of columns, 'len': string length, 'nuni': number of uniques}.
    If `str_` is number, it is interpreted as number of columns with defaults for other options.
    
    
    `cat`: {'n': number of columns, 'len': string length, 'nuni': number of uniques}.
    If `cat` is number, it is interpreted as number of columns with defaults for other options.
    """
    
    def str_df(par, categorical):
        if isinstance(par, int):
            par = {
                'n': par,
                'len': 8,
                'nuni': n_rows // 10
            }
        df = pd.DataFrame()
        for i in range(par['n']):
            uniques = gen_unique_str(par['nuni'], par['len'])
            if categorical: 
                df[f'cat{i}'] = pd.Categorical(random.choices(uniques, k=n_rows), uniques)
            else:
                df[f'str{i}'] = random.choices(uniques, k=n_rows)
        return df
    
    dfs = [pd.DataFrame({'id': range(n_rows)})]
    
    if num is not None:
        dfs.append(pd.DataFrame(np.random.rand(n_rows, num), columns=[f'num{i}' for i in range(num)]))
    if str_ is not None:
        dfs.append(str_df(str_, False))
    if cat is not None:
        dfs.append(str_df(cat, True))
    return pd.concat(dfs, axis=1)

## Select and groupby

Selection by equality test is about 220x faster with categoricals in the benchmark below.

Groupby aggregation is about 27x faster with categoricals.

In [None]:
df = gen_mock_data(1_000_000, str_=1, cat=1)
print('select str')
needle = df['str0'][0]
%timeit _ = (df['str0'] == needle)
print('select cat')
needle = df['cat0'].cat.categories[0]
%timeit _ = (df['cat0'] == needle)

In [None]:
df = gen_mock_data(1_000_000, num=1, str_=1, cat=1)
print('groupby str')
%timeit _ = df.groupby('str0')['num0'].sum()
print('groupby cat')
%timeit _ = df.groupby('cat0')['num0'].sum()

## String methods

[#](https://pandas.pydata.org/docs/user_guide/categorical.html#string-and-datetime-accessors)

`.str` and `.dt` accessors work on categoricals if categories are of an appropriate type.

> The work is done on the categories and then a new Series is constructed. This has some performance implication if you have a Series of type string, where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series). In this case it can be faster to convert the original Series to one of type category and use .str.\<method\> or .dt.\<property\> on that.

`startswith()` and `contains()` are about 8x faster on categoricals, but the gain naturally declines as the share of unique values increases.

In [None]:
df = gen_mock_data(1_000_000, str_=1, cat=1)

print('str: startswith')
%timeit _ = df.str0.str.startswith('a')
print('cat: startswith')
%timeit _ = df.cat0.str.startswith('a')

print('str: contains non-regex')
%timeit _ = df.str0.str.contains('a', regex=False)
print('cat: contains non-regex')
%timeit _ = df.cat0.str.contains('a', regex=False)

print('str: contains regex')
%timeit _ = df.str0.str.contains('[ab]', regex=True)
print('cat: contains regex')
%timeit _ = df.cat0.str.contains('[ab]', regex=True)

In [None]:
n_rows = 10_000_000
df = gen_mock_data(n_rows, str_=dict(n=1, len=10, nuni=n_rows//2), cat=dict(n=1, len=10, nuni=n_rows//2))

print('str: startswith')
%timeit _ = df.str0.str.startswith('a')
print('cat: startswith')
%timeit _ = df.cat0.str.startswith('a')

print('str: contains non-regex')
%timeit _ = df.str0.str.contains('a', regex=False)
print('cat: contains non-regex')
%timeit _ = df.cat0.str.contains('a', regex=False)

print('str: contains regex')
%timeit _ = df.str0.str.contains('[ab]', regex=True)
print('cat: contains regex')
%timeit _ = df.cat0.str.contains('[ab]', regex=True)

## Merge

[#](https://pandas.pydata.org/docs/user_guide/categorical.html#merging-concatenation)
[#](https://pandas.pydata.org/docs/user_guide/merging.html#merge-dtypes)

> By default, combining Series or DataFrames which contain the same categories results in category dtype, otherwise results will depend on the dtype of the underlying categories. **Merges that result in non-categorical dtypes will likely have higher memory usage.** Use .astype or union_categoricals to ensure category results.

> The category dtypes must be exactly the same, meaning the same categories and the ordered attribute. Otherwise the result will coerce to the categories’ dtype.

> Merging on category dtypes that are the same can be quite performant compared to object dtype merging.
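
The dtype rule quoted above can be illustrated with a small sketch. The frames and the `shared` dtype are hypothetical; casting one side with `.astype` is one of the two remedies the documentation mentions.

```python
import pandas as pd

shared = pd.CategoricalDtype(['a', 'b', 'c'])
left = pd.DataFrame({'key': pd.Series(['a', 'b'], dtype=shared), 'x': [1, 2]})

# Same categories on both sides: the key stays categorical.
right_same = pd.DataFrame({'key': pd.Series(['a', 'b'], dtype=shared), 'y': [3, 4]})
merged_same = left.merge(right_same, on='key')

# Different categories (only 'a' and 'b'): the key is no longer categorical.
right_diff = pd.DataFrame({'key': pd.Series(['a', 'b'], dtype='category'), 'y': [3, 4]})
merged_diff = left.merge(right_diff, on='key')

# Casting one side to the shared dtype before merging keeps the category result.
right_fixed = right_diff.assign(key=right_diff['key'].astype(shared))
merged_fixed = left.merge(right_fixed, on='key')

print(merged_same['key'].dtype, merged_diff['key'].dtype, merged_fixed['key'].dtype)
```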

In [None]:
import random
import itertools
import string
import timeit

import numpy as np
import pandas as pd

def gen_cat_data(n_rows, n_cats, cat_len, cat):
    cat_gen = itertools.product(string.ascii_lowercase, repeat=cat_len)
    cats = [''.join(next(cat_gen)) for _ in range(n_cats)]
    assert len(cats) == len(set(cats))
    df = pd.DataFrame({'key': random.choices(cats, k=n_rows),
                       'val': np.random.rand(n_rows)})
    if cat: df['key'] = pd.Categorical(df['key'], cats)
    agg = df.groupby('key')['val'].sum().rename('sum').reset_index()    
    return df, agg

In [None]:
times = {}
for rows_order in range(4, 9):
    nr = 10**rows_order
    for nc in [10, 1000]:
        for cl in [5, 200]:
            for c in [False, True]:
                t = timeit.Timer('df.merge(agg)',
                                 f'df, agg = gen_cat_data({nr}, {nc}, {cl}, {c})',
                                 globals=globals())
                repeats, time = t.autorange()
                times[(nr, nc, cl, c)] = time / repeats

df = pd.Series(times).rename_axis(index=['n_rows', 'n_cats', 'cat_len', 'cats'])

The length of the strings does not matter.

In [None]:
x = df.unstack('cat_len')
x.iloc[:, 1] / x.iloc[:, 0]

Categoricals improve merge performance from about 100k rows upward, reaching up to a 2x speedup at 100M rows.

In [None]:
x = df.unstack('cat_len').mean(1)
x = x.unstack('cats')
x.iloc[:, 1] / x.iloc[:, 0]

A larger number of categories slows the merge down.

In [None]:
x = df.unstack('cat_len').mean(1)
x = x.unstack('n_cats')
(x.iloc[:, 1] / x.iloc[:, 0]).unstack('cats')

`time / n_rows` declines when categoricals with few categories are used. There is no clear pattern otherwise.

In [None]:
x = df.unstack('cat_len').mean(1)
x /= x.index.get_level_values('n_rows')
x.unstack(['cats', 'n_cats'])

### If dataframe becomes wide

When many columns are merged on one or both sides, the merge takes significantly more time, mainly because the data has to be copied to a new object.

With wide dataframes, merging on cat keys becomes slower than on str keys, although the difference is small compared to the overall merge time.

In [None]:
# merge on strings
df, agg = gen_cat_data(10_000_000, 100, 10, False)
print('merge few columns')
%time _ = df.merge(agg)
for i in range(100):
    df[f'var{i}'] = np.random.rand(len(df))
print('merge many columns')
%time _ = df.merge(agg)
for i in range(100):
    agg[f'agg{i}'] = np.random.rand(len(agg))
print('merge many-many columns')
%time _ = df.merge(agg)

In [None]:
# merge on categoricals
df, agg = gen_cat_data(10_000_000, 100, 10, True)
print('merge few columns')
%time _ = df.merge(agg)
for i in range(100):
    df[f'var{i}'] = np.random.rand(len(df))
print('merge many columns')
%time _ = df.merge(agg)
for i in range(100):
    agg[f'agg{i}'] = np.random.rand(len(agg))
print('merge many-many columns')
%time _ = df.merge(agg)

## Container for boolean with NA

This might be a better solution than using `float32` (or the less supported `float16`). Each item occupies only one byte, and NA-related methods work as expected.

`fillna()` does not accept values outside the preset categories, so call `add_categories()` first.

Categories can be `[0, 1]` or `[False, True]`, but the latter is not supported by the fastparquet writer.
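
The `fillna()` restriction can be sketched as follows. The `'unknown'` label is a hypothetical fill value, and the exception type is caught broadly on the assumption that it differs between pandas versions.

```python
import pandas as pd

boo = pd.Series([True, False, None, True]).astype('category')
print(boo.cat.categories.tolist(), boo.cat.codes.dtype)

# Filling with a value that is not a category is rejected.
try:
    boo.fillna('unknown')
except (TypeError, ValueError) as err:
    print('fillna rejected:', type(err).__name__)

# Registering the new category first makes the fill work.
filled = boo.cat.add_categories('unknown').fillna('unknown')
print(filled.tolist())
```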

In [None]:
df = gen_mock_data(100_000, num=1)
df['boo'] = (df.num0 > 0.8)
df.loc[df.sample(frac=0.1).index, 'boo'] = np.nan
print(df.boo.value_counts(dropna=False))
df['boo_cat'] = df.boo.astype('category').cat.rename_categories({0: False, 1: True})
print(df.boo_cat.value_counts(dropna=False))
print(df.memory_usage())

# Date and time

To be added later when use case arises.

# Parquet support

Integer and float types are stored and converted automatically, with the exception of `float16`.

In [None]:
import numpy as np
import pandas as pd
import fastparquet as pq

data = list(range(100))
df = pd.DataFrame()
for dt in ['uint8', 'uint16', 'uint32', 'uint64',
           'int8', 'int16', 'int32', 'int64',
           'float16', 'float32', 'float64']:
    df[dt] = pd.Series(data, dtype=dt)

dfpq_path = '/tmp/dataframe.pq'
# keyword arguments: no compression, do not write the index
df.to_parquet(dfpq_path, engine='fastparquet', compression=None, index=False)

dfpq = pd.read_parquet(dfpq_path, engine='fastparquet')
pd.concat([df.dtypes, dfpq.dtypes], axis=1)

Categories cannot be `[False, True]`. This may be a fastparquet bug.

In [None]:
# df = pd.Series([False, False, True], dtype=pd.CategoricalDtype([False, True])).to_frame('col') # <-- this fails
df = pd.Series([False, False, True], dtype=pd.CategoricalDtype([False, True, 2])).to_frame('col') # <-- this works
df.to_parquet('/tmp/dataframe.pq', engine='fastparquet', compression=None, index=False)

# Extension types

Newer versions of pandas introduced [extension dtypes](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-extension-types), although they are still at an experimental stage as of pandas 1.1. These include support for nullable [integers](https://pandas.pydata.org/docs/user_guide/integer_na.html) and [strings](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.StringDtype.html).

Test performance before using.
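
A minimal sketch of the nullable dtypes, assuming pandas 1.0 or later: `'Int8'` is the nullable integer dtype, and `'boolean'` is the nullable boolean dtype (not mentioned above, but part of the same family). The values are arbitrary.

```python
import pandas as pd

# Nullable integer and boolean columns keep their dtype despite missing values.
ints = pd.Series([1, None, 3], dtype='Int8')
bools = pd.Series([True, None, False], dtype='boolean')
print(ints.dtype, bools.dtype)
print(ints.isna().tolist(), bools.isna().tolist())

# Arithmetic propagates the missing value instead of converting to float.
plus_one = ints + 1
print(plus_one.tolist())
```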