The Pandas data frame package has a number of interesting features, including complex indices (that allow implicit joins) and time-oriented features.

However, a number of Pandas design decisions pose problems for classic (non-finance) data science. Some undesirable sharp edges include:


  * Separate types for atomic columns (such as `int`, `bool`, and `float`) and columns of objects (such as `str`).
  * No out-of-band representation of missing values. Instead, missingness must be signaled by the insertion of a value representing missingness. This causes problems for types that don't have such a representation such as `int` and `bool`.

To work around the above, Pandas data frames have a number of unavoidable column type promotion rules and cell type promotion rules. These promotion rules can introduce their own complexity.

Let's take a look at a Pandas data frame.

In [1]:
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
from typing import Dict, Optional, Set, Type


In [2]:
# show an example data frame
# notice types in the frame are often types from the input data
d =  pd.DataFrame({
        'b': [1, 3, 4],
        'q': np.nan,
        'r': [1, None, 3],
        's': [np.nan, 2.0, 3.0],
        'x': [1, 7.0, 2],
        'y': ["a", None, np.nan],
        'z': [1, 1.0, False],
    })

d

   b   q    r    s    x     y      z
0  1 NaN  1.0  NaN  1.0     a      1
1  3 NaN  NaN  2.0  7.0  None    1.0
2  4 NaN  3.0  3.0  2.0   NaN  False


Notice that `None` has been converted to `NaN` in column `r`, but not in column `y`. The declared column types tell part of the story.

In [3]:
# column types, not same as cell value types
d.dtypes

b      int64
q    float64
r    float64
s    float64
x    float64
y     object
z     object
dtype: object

To deal with mixed types, Pandas must promote the column declarations to something that can contain both the original intended non-null values and the missingness indicators. For integers the column is promoted to floating point, as that allows the use of floating point `nan` and the use of a non-object column. The floating point promotion causes cell types to be changed from integer to floating point. For more complicated cases the column must be promoted to object (a more expensive proposition). The object promotion is used to allow both object cell types and heterogeneous cell types (such as both `float` and `bool`). Without full knowledge of the cell values, the user cannot anticipate the chosen conversions and resulting types.
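The promotion rules described above can be seen in isolation with a small sketch. The series names below are illustrative only (they are not part of the original notebook), and the reported dtypes are what we would expect under default, non-nullable type inference as in the Pandas version used in this note.

```python
import pandas as pd

# all-integer input keeps an integer column type
ints = pd.Series([1, 3, 4])
# a single missing value forces promotion to floating point
ints_with_missing = pd.Series([1, None, 3])
# bool has no in-band missing value, so the column falls back to object
bools_with_missing = pd.Series([True, None, False])

# collect the declared column types for comparison
promotion_report = {
    "ints": str(ints.dtype),
    "ints_with_missing": str(ints_with_missing.dtype),
    "bools_with_missing": str(bools_with_missing.dtype),
}
print(promotion_report)
```

The same three integers end up with different column types depending only on whether a neighboring cell is missing, which is the data-dependent behavior discussed above.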

Directly inspecting the types found in the data frame cells shows a bit more detail.

In [4]:
def non_null_types_in_frame(d) -> Dict[str, Optional[Set[Type]]]:
    """
    Return dictionary of non-null types seen in dataframe columns.

    :param d: Pandas or Polars data frame.
    """
    result = dict()
    for col_name in d.columns:
        types_seen = {type(vi) for vi in d[col_name] if not pd.isnull(vi)}
        if len(types_seen) < 1:
            result[col_name] = None
        else:
            result[col_name] = types_seen
    return result

In [5]:
# report non-null (not None, NaN, or NaT) types found in cells
pprint(non_null_types_in_frame(d))

{'b': {<class 'int'>},
 'q': None,
 'r': {<class 'float'>},
 's': {<class 'float'>},
 'x': {<class 'float'>},
 'y': {<class 'str'>},
 'z': {<class 'bool'>, <class 'float'>, <class 'int'>}}


There is a risk of "nearly compatible types" such as `float` vs. `numpy.float64` and `int` vs. `numpy.int64`. In fact, notice that the types found by inspecting the entries do not match what is found in the column declarations.

This could be related to the following. Values taken through the `.values` attribute can have different types than those taken through the column interface!

In [6]:
x0_type_from_values_access = type(list(d["x"].values)[0])

x0_type_from_values_access

numpy.float64

In [7]:
x0_type_from_column_access = type(list(d["x"])[0])

x0_type_from_column_access

float

In [8]:
x0_type_from_values_access == x0_type_from_column_access

False

If you are not sure that all of your code base (and its dependencies) consistently uses only columns or only the values attribute, you may experience incompatible mixed types even on uniform data. We know one is not supposed to use "`.values`" [from the Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html):


> **pandas.DataFrame.values**
>
> *property* `DataFrame.values`
>
> Return a Numpy representation of the DataFrame.
>
> **Warning:** We recommend using `DataFrame.to_numpy()` instead.
>
> Only the values in the DataFrame will be returned, the axes labels will be removed.
>
> **Returns:** `numpy.ndarray`: The values of the DataFrame.

So, presumably, Pandas `.values` is not the plain attribute it syntactically presents as, but a property: a method interface that runs code on each access.

The recommended method `.to_numpy()` seems to return the same `numpy.float64` type, which is *not* the type we saw when reading cells through the Pandas data frame column or Series interface. In any case, the type you see in a cell depends on what types are in related cells, and on what path you use to access the value.

In [9]:
type(list(d["x"].to_numpy())[0])

numpy.float64
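To make the dependence on access path concrete, here is a minimal sketch that reads the first cell of one float column through several routes. The series `s` and the dictionary `access_types` are illustrative names introduced here, and the set of access paths is not meant to be exhaustive.

```python
import pandas as pd

# a small float column, mirroring column "x" above
s = pd.Series([1.0, 7.0, 2.0])

# type of the first cell as seen through different access paths
access_types = {
    "iteration": type(list(s)[0]),      # iterating the Series
    "tolist": type(s.tolist()[0]),      # explicit conversion to a Python list
    "iloc": type(s.iloc[0]),            # positional scalar access
    "values": type(s.values[0]),        # discouraged attribute access
    "to_numpy": type(s.to_numpy()[0]),  # recommended NumPy conversion
}
for path, seen_type in access_types.items():
    print(f"{path}: {seen_type.__name__}")
```

On the versions reported in this note we would expect iteration and `tolist()` to report `float`, and the remaining paths to report `float64`, so the same cell shows up as two different types.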

Any and all of the above inconsistencies can be fairly hazardous to any insufficiently careful system that tries to export Pandas data to other type-sensitive systems (such as databases, JSON, Arrow, and so on).
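As one illustration of the export hazard, the sketch below tries to send a cell to JSON with the standard library `json` module. The helper `to_python_scalar` is hypothetical (it is not part of Pandas or of the original notebook) and is only one of several ways a careful exporter might normalize cell values.

```python
import json
import math

import numpy as np
import pandas as pd


def to_python_scalar(v):
    """Hypothetical helper: map NumPy scalars and missing values to plain Python."""
    if isinstance(v, np.generic):
        v = v.item()  # NumPy scalar -> built-in Python scalar
    if v is None or (isinstance(v, float) and math.isnan(v)):
        return None  # represent missingness as JSON null
    return v


frame = pd.DataFrame({"b": [1, 3, 4], "s": [np.nan, 2.0, 3.0]})

# positional access hands back a NumPy integer, which json does not accept
raw_cell = frame["b"].iloc[0]
try:
    json.dumps(raw_cell)
    raw_export_ok = True
except TypeError:
    raw_export_ok = False

# normalize every cell before export
records = [
    {col: to_python_scalar(frame[col].iloc[i]) for col in frame.columns}
    for i in range(frame.shape[0])
]
exported = json.dumps(records)
print(raw_export_ok)
print(exported)
```

The design choice here is to normalize at the export boundary, so the exporter does not need to know which access path or promotion rule produced each cell.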

In [10]:
pprint({'np': np.__version__, 'pd': pd.__version__})

{'np': '1.25.2', 'pd': '2.0.3'}
