- Title: Handling Complicated Data Types in Python and PySpark
- Slug: python-complicated-data-types
- Date: 2020-05-30 12:10:31
- Category: Computer Science
- Tags: programming, Python, PySpark, Parquet, IO, data types, complicated
- Author: Ben Du

## Tips and Traps

1. An element in a pandas DataFrame can be any (complicated) type in Python.
    To save a padnas DataFrame with arbitrary (complicated) types as it is, 
    you have to use pickle. 
    For example, 
    you can use `pandas.DataFrame.to_pickle` and `pandas.DataFrame.read_pickle`
    or you can use `pick.dump` and `pickle.load` directly
    since `to_pickle` and `read_pickle` in pandas are simple wrappers over `pickle.dump` and `pickle.load`.
    
2. Apache Parquet is a binary file format 
    that stores data in a columnar fashion 
    for compressed, efficient columnar data representation.
    It is a very popular file format when working with big data (Hadoop/Spark, etc.) ecosystem. 
    However, 
    be aware that a Parquet file does not support arbitrary data types in Python!
    For example, 
    an element of the list type is converted to a numpy array first.
    This requires types of elements of a column to be consistent.
    For this reason,
    `numpy.ndarray` is preferred to `list` 
    if you want write the pandas DataFrame to a Parquet file later.
    
3. It is good practice to have consistent and specific types when working with Parquet file in Python,
    especially when you have to deal with the Parquet file in Spark/PySpark later.
    
    - `numpy.ndarray` is preferred to `list` and `tuple`.
    - Avoid mixing different types (`numpy.ndarray`, `list`, `tuple`, etc.) in the same column,
        even if it still might work.
    - An empty `numpy.ndarray` is preferred to `None` as handling of `None` can be inconssitent in different situations.
        Specically, 
        avoid a column with all `None`'s. 
        When written to a Parquet file and then read into Spark/PySpark,
        a column with all `None`'s is inferred as `IntegerType` (due to lack of specific type information). 
        This might or might not what you want.

4. You can specify a schema to help Spark/PySpark to read a Parquet file. 
    However, 
    I don't think this is a good practice.
    One advantage of Parquet file is that it has schema. 
    The accurate schema should be stored in the Parquet file.
    Otherwise, it is hard for other people for figure the correct shcema to use.

In [110]:
import pandas as pd
import numpy as np
import pickle

In [48]:
df_p = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "z": [None, np.array([]), [],
              np.array([0.1, 0.2, 0.3]), [0.4, 0.5, 0.6]]
    }
)

df_p.head()

Unnamed: 0,x,z
0,1,
1,2,[]
2,3,[]
3,4,"[0.1, 0.2, 0.3]"
4,5,"[0.4, 0.5, 0.6]"


In [124]:
df_p = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y":
            [
                None,
                np.array([]), {
                    "key": 1
                },
                np.array([0.1, 0.2, 0.3]), ["how", 0.5, 0.6]
            ]
    }
)

df_p.head()

Unnamed: 0,x,y
0,1,
1,2,[]
2,3,{'key': 1}
3,4,"[0.1, 0.2, 0.3]"
4,5,"[how, 0.5, 0.6]"


In [125]:
df_p.to_pickle("/tmp/j1.pickle")

In [126]:
pickle.load(open("/tmp/j1.pickle", "rb"))

Unnamed: 0,x,y
0,1,
1,2,[]
2,3,{'key': 1}
3,4,"[0.1, 0.2, 0.3]"
4,5,"[how, 0.5, 0.6]"


In [127]:
df_p.to_parquet(file1)

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column y with type object')

## Mixed Types

Explore the behavior of mixed types of None, numpy.ndarray and list.

In [None]:
### Pandas DataFrame

In [128]:
df_p = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [None, np.array([]), [],
              np.array([0.1, 0.2, 0.3]), [0.4, 0.5, 0.6]]
    }
)

df_p.head()

Unnamed: 0,x,y
0,1,
1,2,[]
2,3,[]
3,4,"[0.1, 0.2, 0.3]"
4,5,"[0.4, 0.5, 0.6]"


In [129]:
df_p.dtypes

x     int64
y    object
dtype: object

In [130]:
file1 = "/tmp/j1.parquet"

In [133]:
df_p.to_parquet(file, flavor='spark')

In [134]:
df2_p = pd.read_parquet(file)
df2_p

Unnamed: 0,x,y
0,1,
1,2,[]
2,3,[]
3,4,"[0.1, 0.2, 0.3]"
4,5,"[0.4, 0.5, 0.6]"


In [90]:
df2_p.dtypes

x     int64
y    object
dtype: object

In [93]:
type(df2_p.y[0])

NoneType

In [94]:
type(df2_p.y[1])

numpy.ndarray

In [95]:
type(df2_p.y[2])

numpy.ndarray

In [96]:
type(df2_p.y[3])

numpy.ndarray

In [97]:
type(df2_p.y[4])

numpy.ndarray

## Read the Parquet File into PySpark

In [69]:
import findspark
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("PySpark") \
    .enableHiveSupport().getOrCreate()

In [70]:
df = spark.read.parquet(file)
df.show()

+---+---------------+
|  x|              z|
+---+---------------+
|  1|           null|
|  2|             []|
|  3|             []|
|  4|[0.1, 0.2, 0.3]|
|  5|[0.4, 0.5, 0.6]|
+---+---------------+



In [71]:
df.schema

StructType(List(StructField(x,LongType,true),StructField(z,ArrayType(DoubleType,true),true)))

In [72]:
schema = StructType(
    [
        StructField("x", LongType(), False),
        StructField("z", ArrayType(DoubleType()), True),
    ]
)

In [73]:
df2 = spark.read.schema(schema).parquet(file)
df2.show()

+---+---------------+
|  x|              z|
+---+---------------+
|  1|           null|
|  2|             []|
|  3|             []|
|  4|[0.1, 0.2, 0.3]|
|  5|[0.4, 0.5, 0.6]|
+---+---------------+



In [74]:
df2.schema

StructType(List(StructField(x,LongType,true),StructField(z,ArrayType(DoubleType,true),true)))

In [39]:
df2.select(col("x"), col("y"), col("z"), col("z").isNull().alias("is_null")).show()

+---+---+---+-------+
|  x|  y|  z|is_null|
+---+---+---+-------+
|  1|  5| []|  false|
|  2|  4| []|  false|
|  3|  3| []|  false|
+---+---+---+-------+



In [75]:
file2 = "/tmp/j2.parquet"

In [76]:
df2.write.mode("overwrite").parquet(file2)

In [77]:
df3_p = pd.read_parquet(file2)
df3_p

Unnamed: 0,x,z
0,1,
1,2,[]
2,3,[]
3,4,"[0.1, 0.2, 0.3]"
4,5,"[0.4, 0.5, 0.6]"


In [78]:
df3_p.dtypes

x     int64
z    object
dtype: object

In [80]:
type(df3_p.z[0])

NoneType

In [81]:
type(df3_p.z[1])

numpy.ndarray

In [82]:
type(df3_p.z[2])

numpy.ndarray

In [83]:
type(df3_p.z[3])

numpy.ndarray

In [84]:
type(df3_p.z[4])

numpy.ndarray

## References

https://arrow.apache.org/docs/python/parquet.html