# Chapter 5

Pandas supports more than one underlying type system: the classic NumPy types and the newer PyArrow types. Both can coexist in the same data frame, but seemingly similar types can have different properties:

In [1]:
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], dtype="float64")
s2 = pd.Series([0.3, 1.3, 2.7], dtype="float64[pyarrow]")

df = pd.DataFrame({"first": s1, "second": s2})

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype          
---  ------  --------------  -----          
 0   first   3 non-null      float64        
 1   second  3 non-null      double[pyarrow]
dtypes: double[pyarrow](1), float64(1)
memory usage: 180.0 bytes


Properties of NumPy types:
- Automatic conversion to Python `int` (stored as `object`) when integers are too large for the 64-bit types
- Automatic conversion of integer data to `float64` when `NaN` values appear
- Conversion to "smaller" types is possible, but overflows happen without warning
- Pandas does not use a dedicated NumPy string type; strings are stored as general `object`s
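
The first three behaviors can be checked with a short sketch. The sample values (`2**70` and `300`) are illustrative assumptions, not taken from the book, and the results assume the default NumPy-backed types.

```python
import numpy as np
import pandas as pd

# An integer too large for any 64-bit type falls back to Python ints (object dtype).
big = pd.Series([2**70])

# A missing value turns otherwise-integer data into float64.
with_nan = pd.Series([1, 2, np.nan])

# Downcasting wraps around silently: 300 does not fit in int8 (-128..127).
wrapped = pd.Series([300], dtype="int64").astype("int8")

print(big.dtype, with_nan.dtype, wrapped.iloc[0])
```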

Properties of PyArrow types:
- No automatic type conversion for large integers; an error is raised instead
- No silent overflow when converting to smaller types; an error is raised if one would occur
- Integer types can handle `<NA>` values
- No direct conversion from strings to floating point types; you must go through NumPy
- PyArrow does have a dedicated string type, but you need to create it with `pd.ArrowDtype(pa.string())`

## Categorical data

Use the `category` type for categorical data with a *low* number of categories. With too many categories, it becomes less efficient than treating the values as strings. PyArrow has a `dictionary` type for this kind of data, but the author ignores it in favor of the Pandas 1.x `category` type. Unlike other PyArrow data types, `dictionary` is not exposed directly in Pandas.
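
One way to see the efficiency argument is to compare memory usage for a column with many rows but few distinct values. This is a rough sketch; the repeated state codes are a made-up sample, and exact byte counts depend on the Pandas version.

```python
import pandas as pd

# 30,000 rows, but only three distinct values.
states = pd.Series(["CA", "NY", "TX"] * 10_000)
as_category = states.astype("category")

string_bytes = states.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it wins when the number of categories is low.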

Pandas also supports ordered categories, where the sort order follows the order of the category list rather than the alphabet.

In [3]:
states = ['CA', 'NY', 'TX']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

pd.Series(states, dtype='category')

0    CA
1    NY
2    TX
dtype: category
Categories (3, object): ['CA', 'NY', 'TX']

In [4]:
month_cat = pd.CategoricalDtype(categories=months, ordered=True)
pd.Series(months, dtype=month_cat).sort_values()

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Jan' < 'Feb' < 'Mar' < 'Apr' ... 'Sep' < 'Oct' < 'Nov' < 'Dec']
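
Beyond sorting, an ordered categorical also supports comparisons and `min`/`max`. A small sketch, using a shuffled subset of months as an assumed example:

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_cat = pd.CategoricalDtype(categories=months, ordered=True)

shuffled = pd.Series(['Mar', 'Jan', 'Dec', 'Feb'], dtype=month_cat)

# Sorting follows calendar order, not alphabetical order.
in_order = shuffled.sort_values().tolist()

# Comparisons against a category use the same ordering.
before_march = shuffled[shuffled < 'Mar'].tolist()

latest = shuffled.max()

print(in_order, before_march, latest)
```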

In [5]:
import pyarrow as pa

string_pa = pd.ArrowDtype(pa.string())
pd.Series(months, dtype=string_pa).astype(month_cat)

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Jan' < 'Feb' < 'Mar' < 'Apr' ... 'Sep' < 'Oct' < 'Nov' < 'Dec']

In [6]:
# Not included in book, but this is how to use the 1.x categorical type with an
# underlying 2.x PyArrow data type.
month_cat = pd.CategoricalDtype(categories=pd.Series(months, dtype=string_pa), ordered=True)
pd.Series(months, dtype=string_pa).astype(month_cat)

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, string[pyarrow]): [Jan < Feb < Mar < Apr ... Sep < Oct < Nov < Dec]

## Dates and times

In [7]:
import datetime as dt
dt_list = [dt.datetime(2020, 1, 1, 4, 30), dt.datetime(2020, 1, 2), dt.datetime(2020, 1, 3)]
string_dates = ['2020-01-01 04:30:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00']
string_dates_missing = ['2020-01-01 4:30', None, '2020-01-03']
epoch_dates = [1577836800, 1577923200, 1578009600]

In [8]:
pd.Series(dt_list)

0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]

In [9]:
pd.Series(string_dates, dtype='datetime64[ns]')

0   2020-01-01 04:30:00
1   2020-01-02 00:00:00
2   2020-01-03 00:00:00
dtype: datetime64[ns]

In [10]:
pd.Series(string_dates_missing, dtype='datetime64[ns]')

0   2020-01-01 04:30:00
1                   NaT
2   2020-01-03 00:00:00
dtype: datetime64[ns]

Be careful with epoch times; make sure you know the units. Here, the times are in seconds:

In [11]:
pd.Series(epoch_dates, dtype='datetime64[s]')

0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[s]
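
To illustrate why the unit matters, here is a sketch that interprets the same integers as seconds and as milliseconds. It uses `pd.to_datetime` with its `unit` parameter, which is not used in the cells above.

```python
import pandas as pd

epoch_dates = [1577836800, 1577923200, 1578009600]

# Correct interpretation: seconds since 1970-01-01.
as_seconds = pd.to_datetime(epoch_dates, unit="s")

# Wrong interpretation: the same numbers read as milliseconds.
as_millis = pd.to_datetime(epoch_dates, unit="ms")

print(as_seconds[0])  # 2020-01-01 00:00:00
print(as_millis[0])   # a timestamp in January 1970
```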

In [12]:
pd.Series(dt_list, dtype='timestamp[ns][pyarrow]')

0    2020-01-01 04:30:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[ns][pyarrow]

In [13]:
pd.Series(string_dates, dtype='timestamp[ns][pyarrow]')

0    2020-01-01 04:30:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[ns][pyarrow]

Creating a PyArrow timestamp series directly from strings with missing data seems to fail:

`pd.Series(string_dates_missing, dtype='timestamp[ns][pyarrow]')`

In [14]:
pd.Series(epoch_dates, dtype='timestamp[s][pyarrow]')

0    2020-01-01 00:00:00
1    2020-01-02 00:00:00
2    2020-01-03 00:00:00
dtype: timestamp[s][pyarrow]

## Exercises

1. To represent the number of people in the US, I would use the `uint32[pyarrow]` type, which has a maximum value of around 4 billion. For the number of people worldwide, I would use the `uint64[pyarrow]` type.

2. To describe a product, I would likely use the `category` type. For the name, I would use a PyArrow string. For the price, I might use a floating-point value like `float32[pyarrow]`, but PyArrow provides exact decimals in `pa.decimal128`, so I might consider that for its precision.

3. For the date and time of a stock trade, I would use the `timestamp[ns][pyarrow]` type. For a person's date of birth, I would likely use the `pa.date32` type, which stores only the date.