To handle string data, we use the PyArrow-backed type defined below. Passing 'string[pyarrow]' as the dtype does not give us the new Pandas 2 PyArrow types: that alias predates Pandas 2 (it was introduced in the Pandas 1.x series), and operations on it generally return legacy NumPy-backed data.

In [1]:
import pyarrow as pa
import pandas as pd
string_pa = pd.ArrowDtype(pa.string())
text_freeform = ['My name is Arthur', 'I like pandas', 'I like programming']
text_with_missing = ['My name is Arthur', None, 'I like programming']

The series below has dtype object because it stores each value as a Python object. The NumPy backend that Pandas uses here has no dedicated string type.

In [2]:
pd.Series(text_freeform)

0     My name is Arthur
1         I like pandas
2    I like programming
dtype: object

In [3]:
pd.Series(text_with_missing)

0     My name is Arthur
1                  None
2    I like programming
dtype: object
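To make "storing Python objects" concrete, the following minimal sketch inspects the elements of an object series. It requests dtype=object explicitly so the storage choice is visible in the code; the variable names `object_series` and `element_types` are illustrative only.

```python
import pandas as pd

text_with_missing = ['My name is Arthur', None, 'I like programming']

# Ask for object storage explicitly so the example does not depend on inference.
object_series = pd.Series(text_with_missing, dtype=object)

# Each element is an ordinary Python object: str for text, NoneType for the gap.
element_types = [type(value).__name__ for value in object_series]
print(element_types)  # ['str', 'NoneType', 'str']

# Pandas still treats None as missing when asked.
missing_mask = object_series.isna().tolist()
print(missing_mask)  # [False, True, False]
```

Because an object series holds arbitrary Python objects, nothing in the dtype itself guarantees that every value is a string.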

Both series are created without calling `.astype(str)` to convert the values to strings. However, the dtype is still object: Pandas 1.x stores str values as Python objects because its NumPy backend has no native string type. So, let's use the Pandas 2.0 string type:

In [4]:
tf1 = pd.Series(text_freeform, dtype=string_pa)
tf1

0     My name is Arthur
1         I like pandas
2    I like programming
dtype: string[pyarrow]

In [5]:
tf2 = pd.Series(text_freeform, dtype='string[pyarrow]')
tf2

0     My name is Arthur
1         I like pandas
2    I like programming
dtype: string

In [8]:
tf1.dtype == tf2.dtype

False

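The comparison is False because the two dtypes are distinct, even though both are backed by PyArrow: string_pa is a pd.ArrowDtype, while 'string[pyarrow]' is the older string type.

The text with missing data can also be stored as string[pyarrow]; the missing value is displayed as <NA>. This type generally uses less memory and is faster than the object storage of Pandas 1.x.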

In [6]:
pd.Series(text_with_missing, dtype=string_pa)

0     My name is Arthur
1                  <NA>
2    I like programming
dtype: string[pyarrow]