# Convert `dtype`

Sometimes the data types that pandas infers are not a good fit. This can happen, for example, when a serialisation format carries no type information. In other cases you may want to change the type deliberately, either to gain additional operations or to reduce memory requirements. In the following examples, we apply different conversions to a `Series`:

In [1]:
import numpy as np
import pandas as pd

In [2]:
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))

In [3]:
s

0    0.684464
1   -0.521004
2    0.520805
3    0.086337
4   -1.427918
5    0.096508
6   -0.131822
dtype: float64

## Automatic conversion

[pandas.Series.convert_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.Series.convert_dtypes.html) tries to convert a `Series` to a type that supports `NA`. In the case of our `Series`, the type is changed from `float64` to `Float64`:

In [4]:
s.convert_dtypes()

0    0.684464
1   -0.521004
2    0.520805
3    0.086337
4   -1.427918
5    0.096508
6   -0.131822
dtype: Float64
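The benefit of a type that supports `NA` shows up once values are missing. The following is a small sketch with made-up values (the name `measurements` is only illustrative): a `Series` of whole numbers with a gap is stored as `float64` by default, and `convert_dtypes` turns it into the nullable `Int64`.

```python
import pandas as pd

# Whole numbers with one missing value; pandas falls back to float64
measurements = pd.Series([1, 2, None])
print(measurements.dtype)

# convert_dtypes picks a nullable type, so the integers stay integers
converted = measurements.convert_dtypes()
print(converted.dtype)
print(converted)
```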

Unfortunately, `convert_dtypes` gives you little control over which data type is chosen. Therefore, I prefer [pandas.Series.astype](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html), which takes the target type explicitly:

In [5]:
s.astype("Float32")

0    0.684464
1   -0.521004
2    0.520805
3    0.086337
4   -1.427918
5    0.096508
6   -0.131822
dtype: Float32

Using the right type can save memory. The default numeric types, `int64` and `float64`, are 8 bytes wide. If a narrower type can hold your values, switching to it reduces memory consumption significantly and lets you process more data. You can use NumPy to check the limits of integer and float types:

In [6]:
np.iinfo("int64")

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [7]:
np.finfo("float32")

finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)

In [8]:
np.finfo("float64")

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)
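These limits can be checked against your data before you narrow a type. The following sketch assumes a small, made-up `Series` called `counts` and a hypothetical helper `fits`; neither is part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 12, 250, 1_000])  # stored as int64 by default


def fits(series, dtype):
    """Return True if all values of series lie within the range of dtype."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)


print(fits(counts, "int8"))   # int8 only reaches 127
print(fits(counts, "int16"))  # int16 reaches 32767

# Convert only after the check, so no value is silently altered
small = counts.astype("int16")
print(counts.nbytes, small.nbytes)
```

pandas also offers `pandas.to_numeric` with the `downcast` argument, which performs a similar search for the smallest fitting type.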

## Memory usage

To calculate the memory consumption of a `Series`, use [pandas.Series.nbytes](https://pandas.pydata.org/docs/reference/api/pandas.Series.nbytes.html) for the memory used by the data alone. [pandas.Series.memory_usage](https://pandas.pydata.org/docs/reference/api/pandas.Series.memory_usage.html) also includes the memory of the index. With `deep=True`, pandas additionally inspects `object` values, such as Python strings, and counts the memory they actually occupy.

In [9]:
s.nbytes

56

In [10]:
s.astype("Float32").nbytes

35

In [11]:
s.memory_usage()

188

In [12]:
s.astype("Float32").memory_usage()

167

In [13]:
s.memory_usage(deep=True)

188
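Where these numbers come from can be sketched with a larger, made-up `Series` (the name `values` is only illustrative). NumPy-backed types need their width in bytes per value, while the nullable pandas types also store a boolean mask with one byte per value. That is consistent with the 35 bytes reported above for seven `Float32` values.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1_000, dtype="int64"))

# NumPy-backed types: bytes per value times number of values
print(values.nbytes)                  # 1000 * 8
print(values.astype("int16").nbytes)  # 1000 * 2

# Nullable pandas types also keep a boolean mask with one byte per value
print(values.astype("Int16").nbytes)  # 1000 * 2 + 1000

# Without the index, memory_usage matches nbytes for numeric data
print(values.memory_usage(index=False) == values.nbytes)
```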

## String and category types

The [pandas.Series.astype](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html) method can also convert numeric series into strings if you pass `str`. Note the `dtype` in the following example:

In [14]:
s.astype(str)

0     0.6844642165885412
1    -0.5210038291133791
2     0.5208054433837618
3    0.08633690916390123
4    -1.4279177804659373
5     0.0965079370150347
6    -0.1318224826843188
dtype: object

In [15]:
s.astype(str).memory_usage()

188

In [16]:
s.astype(str).memory_usage(deep=True)

605

To convert to a categorical type, you can pass `'category'` as the type:

In [17]:
s.astype(str).astype("category")

0     0.6844642165885412
1    -0.5210038291133791
2     0.5208054433837618
3    0.08633690916390123
4    -1.4279177804659373
5     0.0965079370150347
6    -0.1318224826843188
dtype: category
Categories (7, object): ['-0.1318224826843188', '-0.5210038291133791', '-1.4279177804659373', '0.08633690916390123', '0.0965079370150347', '0.5208054433837618', '0.6844642165885412']

A categorical `Series` is useful for string data and can lead to large memory savings. When converting to categorical data, pandas no longer stores a Python string for each value. Instead, each distinct value is stored once and the rows refer to it by an integer code, so repeated values are not duplicated. You still have all the features of the `str` accessor, but you save a lot of memory when there are many duplicate values, and performance improves because fewer string operations are needed.

In [18]:
s.astype("category").memory_usage(deep=True)

495
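The savings become visible only when values repeat, which is not the case for the seven distinct values above. As a rough sketch with made-up data (the `cities` series is an assumption for illustration), three distinct strings repeated many times shrink considerably as a category, and the `str` accessor keeps working:

```python
import pandas as pd

# Three distinct strings, each repeated 10,000 times
cities = pd.Series(["Berlin", "Hamburg", "Munich"] * 10_000)
as_category = cities.astype("category")

# Compare the full memory footprint of both representations
string_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Only the distinct values are stored as categories
print(list(as_category.cat.categories))

# String methods are still available on the categorical Series
print(as_category.str.upper().head(3))
```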

## Ordered categories

To create ordered categories, you need to define your own [pandas.CategoricalDtype](https://pandas.pydata.org/docs/reference/api/pandas.CategoricalDtype.html):

In [19]:
from pandas.api.types import CategoricalDtype


# Avoid naming the variable `sorted`, which would shadow the built-in function
categories = pd.Series(sorted(set(s)))
cat_dtype = CategoricalDtype(categories=categories, ordered=True)

s.astype(cat_dtype)

0    0.684464
1   -0.521004
2    0.520805
3    0.086337
4   -1.427918
5    0.096508
6   -0.131822
dtype: category
Categories (7, float64): [-1.427918 < -0.521004 < -0.131822 < 0.086337 < 0.096508 < 0.520805 < 0.684464]

In [20]:
s.astype(cat_dtype).memory_usage(deep=True)

495
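What the ordering enables is easier to see with labels than with random floats. The following sketch uses hypothetical clothing sizes (the names `sizes` and `size_dtype` are assumptions for illustration). With `ordered=True`, comparisons, `min`, `max` and sorting follow the category order rather than the alphabetical one.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(["M", "S", "L", "S"])

# The order of the categories list defines the ranking: S < M < L
size_dtype = CategoricalDtype(categories=["S", "M", "L"], ordered=True)
ordered = sizes.astype(size_dtype)

# Comparisons and aggregations respect the category order
print(ordered > "S")
print(ordered.min(), ordered.max())

# Sorting also follows S < M < L instead of alphabetical order
print(ordered.sort_values().tolist())
```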

The following table lists the types you can pass to `astype`.

Data type | Description
:-------- | :----------
`str`, `'str'` | convert to Python string
`'string'` | convert to pandas string with `pandas.NA`
`int`, `'int'`, `'int64'` | convert to NumPy `int64`
`'int32'`, `'uint32'` | convert to NumPy `int32` or `uint32`
`'Int64'` | convert to pandas `Int64` with `pandas.NA`
`float`, `'float'`, `'float64'` | convert to NumPy `float64`
`'category'` | convert to `CategoricalDtype` with `pandas.NA`
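The difference between the NumPy types and the pandas types with `pandas.NA` matters as soon as values are missing. A minimal sketch with made-up data (the names `raw`, `nullable` and `texts` are only illustrative):

```python
import pandas as pd

raw = pd.Series([1.0, None, 3.0])  # float64 with NaN

# NumPy integers cannot represent a missing value, so this fails
try:
    raw.astype("int64")
except ValueError as error:
    print("int64 failed:", error)

# The nullable pandas type keeps the gap as pandas.NA
nullable = raw.astype("Int64")
print(nullable)

# The pandas string type also marks missing values with pandas.NA
texts = pd.Series(["a", None]).astype("string")
print(texts)
```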

## Conversion to other data types

The [pandas.Series.to_numpy](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_numpy.html) method or the [pandas.Series.values](https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html) property gives us a NumPy array of the values, and [pandas.Series.to_list](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html) returns a Python list of the values. You will rarely need either: pandas objects are usually much more user-friendly and the code is easier to read, and Python lists are much slower to process. With [pandas.Series.to_frame](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html) you can create a `DataFrame` with a single column, if necessary:

In [21]:
s.to_frame()

          0
0  0.684464
1 -0.521004
2  0.520805
3  0.086337
4 -1.427918
5  0.096508
6 -0.131822
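The three conversions can be compared side by side. This is a small sketch with made-up values (the `Series` named `readings` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.5, 2.5, 3.5], name="reading")

# NumPy array with the same dtype as the Series
array = readings.to_numpy()
print(type(array), array.dtype)

# Plain Python list of Python floats
values = readings.to_list()
print(values)

# DataFrame with one column, named after the Series
frame = readings.to_frame()
print(frame.columns.tolist())
```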


The function [pandas.to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) can also be useful for converting values in pandas to dates and times.
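As a brief sketch with made-up strings (the `Series` named `dates` is only illustrative), `errors="coerce"` turns values that cannot be parsed into `NaT` instead of raising an error:

```python
import pandas as pd

dates = pd.Series(["2024-01-15", "2024-02-29", "not a date"])

# Unparseable entries become NaT instead of raising an exception
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed)

# The datetime type gives access to the .dt accessor
print(parsed.dt.month)
```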