
Type Support in Pandas API on Spark

.. currentmodule:: pyspark.pandas

In this chapter, we will briefly show how data types change when converting a pandas-on-Spark DataFrame from/to a PySpark DataFrame or a pandas DataFrame.

Type casting between PySpark and pandas API on Spark

When converting a pandas-on-Spark DataFrame from/to a PySpark DataFrame, the data types are automatically cast to the appropriate type.

The example below shows how data types are cast from a PySpark DataFrame to a pandas-on-Spark DataFrame.

# 1. Create a PySpark DataFrame (an active SparkSession named `spark` is assumed)
>>> from decimal import Decimal
>>> from datetime import datetime
>>> sdf = spark.createDataFrame([
...     (1, Decimal(1.0), 1., 1., 1, 1, 1, datetime(2020, 10, 27), "1", True, datetime(2020, 10, 27)),
... ], 'tinyint tinyint, decimal decimal, float float, double double, integer integer, long long, short short, timestamp timestamp, string string, boolean boolean, date date')

# 2. Check the PySpark data types
>>> sdf
DataFrame[tinyint: tinyint, decimal: decimal(10,0), float: float, double: double, integer: int, long: bigint, short: smallint, timestamp: timestamp, string: string, boolean: boolean, date: date]

# 3. Convert PySpark DataFrame to pandas-on-Spark DataFrame
>>> psdf = sdf.pandas_api()

# 4. Check the pandas-on-Spark data types
>>> psdf.dtypes
tinyint                int8
decimal              object
float               float32
double              float64
integer               int32
long                  int64
short                 int16
timestamp    datetime64[ns]
string               object
boolean                bool
date                 object
dtype: object

The example below shows how data types are cast from a pandas-on-Spark DataFrame to a PySpark DataFrame.

# 1. Create a pandas-on-Spark DataFrame
>>> import datetime, decimal
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"int8": [1], "bool": [True], "float32": [1.0], "float64": [1.0], "int32": [1], "int64": [1], "int16": [1], "datetime": [datetime.datetime(2020, 10, 27)], "object_string": ["1"], "object_decimal": [decimal.Decimal("1.1")], "object_date": [datetime.date(2020, 10, 27)]})

# 2. Type casting by using `astype`
>>> psdf['int8'] = psdf['int8'].astype('int8')
>>> psdf['int16'] = psdf['int16'].astype('int16')
>>> psdf['int32'] = psdf['int32'].astype('int32')
>>> psdf['float32'] = psdf['float32'].astype('float32')

# 3. Check the pandas-on-Spark data types
>>> psdf.dtypes
int8                        int8
bool                        bool
float32                  float32
float64                  float64
int32                      int32
int64                      int64
int16                      int16
datetime          datetime64[ns]
object_string             object
object_decimal            object
object_date               object
dtype: object

# 4. Convert pandas-on-Spark DataFrame to PySpark DataFrame
>>> sdf = psdf.to_spark()

# 5. Check the PySpark data types
>>> sdf
DataFrame[int8: tinyint, bool: boolean, float32: float, float64: double, int32: int, int64: bigint, int16: smallint, datetime: timestamp, object_string: string, object_decimal: decimal(2,1), object_date: date]

Type casting between pandas and pandas API on Spark

When converting a pandas-on-Spark DataFrame to a pandas DataFrame, the data types remain essentially the same as in pandas.

# Convert pandas-on-Spark DataFrame to pandas DataFrame
>>> pdf = psdf.to_pandas()

# Check the pandas data types
>>> pdf.dtypes
int8                        int8
bool                        bool
float32                  float32
float64                  float64
int32                      int32
int64                      int64
int16                      int16
datetime          datetime64[ns]
object_string             object
object_decimal            object
object_date               object
dtype: object

However, there are several data types that are only provided by pandas.

# pd.Categorical type is not supported in pandas API on Spark yet.
>>> ps.Series([pd.Categorical([1, 2, 3])])
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert [1, 2, 3]
Categories (3, int64): [1, 2, 3] with type Categorical: did not recognize Python value type when inferring an Arrow data type

The pandas-specific data types below are not currently supported in the pandas API on Spark, but are planned to be supported (a possible workaround is sketched after this list):

  • pd.Timedelta
  • pd.Categorical
  • pd.CategoricalDtype
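
While these types are unsupported, a common workaround (shown below as a hedged sketch, not an official recommendation) is to convert the values to a supported representation, for example a Categorical's integer codes or its plain string labels, before handing the data to pandas API on Spark.

>>> import pandas as pd
>>> import pyspark.pandas as ps

>>> cat = pd.Categorical(["a", "b", "a"])

# Pass the integer category codes instead of the Categorical itself ...
>>> codes = ps.Series(cat.codes)

# ... or convert the values to plain Python strings first.
>>> labels = ps.Series([str(c) for c in cat])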

The pandas-specific data types below are currently not planned to be supported in the pandas API on Spark (a conversion workaround is sketched after this list):

  • pd.SparseDtype
  • pd.DatetimeTZDtype
  • pd.UInt*Dtype
  • pd.BooleanDtype
  • pd.StringDtype
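
As a hedged sketch (assuming a pandas DataFrame that already uses these extension dtypes and contains no missing values), such columns can usually be cast back to NumPy-backed dtypes in plain pandas before the data is moved to pandas API on Spark.

>>> import pandas as pd
>>> import pyspark.pandas as ps

# A pandas DataFrame using nullable/extension dtypes.
>>> pdf = pd.DataFrame({
...     "u": pd.array([1, 2], dtype="UInt8"),
...     "b": pd.array([True, False], dtype="boolean"),
...     "s": pd.array(["x", "y"], dtype="string"),
... })

# Cast to NumPy-backed dtypes first, then convert.
>>> psdf = ps.from_pandas(pdf.astype({"u": "int64", "b": "bool", "s": "object"}))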

Internal type mapping

The table below shows which NumPy data types are matched to which PySpark data types internally in the pandas API on Spark.

NumPy             PySpark
----------------  -----------------------
np.character      BinaryType
np.bytes_         BinaryType
np.string_        BinaryType
np.int8           ByteType
np.byte           ByteType
np.int16          ShortType
np.int32          IntegerType
np.int64          LongType
np.float32        FloatType
np.float64        DoubleType
np.unicode_       StringType
np.datetime64     TimestampType
np.ndarray        ArrayType(StringType())

The table below shows which Python data types are matched to which PySpark data types internally in pandas API on Spark.

Python              PySpark
------------------  -------------------
bytes               BinaryType
int                 LongType
float               DoubleType
str                 StringType
bool                BooleanType
datetime.datetime   TimestampType
datetime.date       DateType
decimal.Decimal     DecimalType(38, 18)

For decimal type, pandas API on Spark uses Spark's system default precision and scale.

You can check this mapping by using the as_spark_type function.

>>> import typing
>>> import numpy as np
>>> from pyspark.pandas.typedef import as_spark_type

>>> as_spark_type(int)
LongType

>>> as_spark_type(np.int32)
IntegerType

>>> as_spark_type(typing.List[float])
ArrayType(DoubleType,true)
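
Other rows of the two tables above can be checked the same way. The exact repr of the returned type objects may differ between Spark versions, but per the tables the expected results are, for example:

>>> import decimal

>>> as_spark_type(np.float64)
DoubleType

>>> as_spark_type(decimal.Decimal)
DecimalType(38,18)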

You can also check the underlying PySpark data type of a Series, or the schema of a DataFrame, by using the Spark accessor.

>>> ps.Series([0.3, 0.1, 0.8]).spark.data_type
DoubleType

>>> ps.Series(["welcome", "to", "pandas-on-Spark"]).spark.data_type
StringType

>>> ps.Series([[False, True, False]]).spark.data_type
ArrayType(BooleanType,true)

>>> ps.DataFrame({"d": [0.3, 0.1, 0.8], "s": ["welcome", "to", "pandas-on-Spark"], "b": [False, True, False]}).spark.print_schema()
root
 |-- d: double (nullable = false)
 |-- s: string (nullable = false)
 |-- b: boolean (nullable = false)

Note

Pandas API on Spark currently does not support multiple types of data in a single column.

>>> ps.Series([1, "A"])
Traceback (most recent call last):
...
TypeError: an integer is required (got type str)
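
If mixed values really do need to go into one column, a hedged workaround is to normalize them to a single type (typically str) on the Python side before the Series is constructed.

# Convert the mixed values to a single type first, e.g. str.
>>> psser = ps.Series([str(x) for x in [1, "A"]])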