## Create a Sample DataFrame in Spark

Ref: https://bryancutler.github.io/toPandas/ (Spark toPandas() with Arrow, a Detailed Look)

To generate some sample data, we will make a DataFrame with two columns, one long and one double, and 4,194,304 (2^22) records.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("PySpark_to_Pandas_with_Arrow")\
    .getOrCreate()

In [3]:
from pyspark.sql.functions import rand
df = spark.range(1 << 22).toDF("id").withColumn("x", rand())
df.printSchema()

root
 |-- id: long (nullable = false)
 |-- x: double (nullable = false)



## Conversion to a Pandas DataFrame without Arrow

This uses the default Spark serializers to transfer the data and load it into Pandas one record at a time. It is a very inefficient process because of the high serialization overhead and the need to process individual scalar values.

In [4]:
%time pdf = df.toPandas()

CPU times: user 15.5 s, sys: 977 ms, total: 16.4 s
Wall time: 21.8 s
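To get a feel for why per-record handling is costly, the sketch below compares serializing each record separately with packing each column into one contiguous buffer. It uses only the Python standard library: `pickle` stands in for a generic per-record serializer and `array` for a columnar layout. It is an illustration of the idea, not Spark's actual code path, and the record count `n` is an arbitrary choice.

```python
import pickle
from array import array

n = 1000  # small, arbitrary sample size for illustration
ids = list(range(n))
xs = [i / n for i in ids]

# Row-at-a-time: every record becomes its own serialized payload,
# so the per-payload framing is paid n times
row_payloads = [pickle.dumps((i, x)) for i, x in zip(ids, xs)]
row_bytes = sum(len(p) for p in row_payloads)

# Columnar: each column is packed once into a contiguous buffer
id_buf = array("q", ids).tobytes()  # 8-byte signed integers
x_buf = array("d", xs).tobytes()    # 8-byte doubles
col_bytes = len(id_buf) + len(x_buf)

print(f"row-wise: {row_bytes} bytes in {len(row_payloads)} payloads")
print(f"columnar: {col_bytes} bytes in 2 buffers")
```

The columnar layout carries exactly 16 bytes per record, while the row-wise payloads add framing to every record and must each be decoded into individual Python scalars on the receiving side.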


## Enable Arrow with a Spark property
By default, Arrow is not enabled in Spark. You can enable it by setting the following SQLConf, or by adding `spark.sql.execution.arrow.enabled=true` to your Spark configuration at `conf/spark-defaults.conf`.

In [5]:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
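One caveat when reusing this setting: in Spark 3.0 and later the property is named `spark.sql.execution.arrow.pyspark.enabled`, and the name used above is deprecated. A small helper such as the following can pick the key from the version string. The function name `arrow_conf_key` is hypothetical, and the sketch assumes the version string starts with a numeric major version, as `spark.version` normally does.

```python
def arrow_conf_key(spark_version: str) -> str:
    """Return the Arrow toggle property name for a given Spark version string."""
    # Assumption: the version looks like "2.4.5" or "3.1.2"
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


# With a live session this would be used as:
#   spark.conf.set(arrow_conf_key(spark.version), "true")
print(arrow_conf_key("2.4.5"))
print(arrow_conf_key("3.1.2"))
```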

## Run the conversion again, this time with Arrow
With Arrow enabled, the `toPandas()` call is optimized to use Arrow to transfer the data and avoid serialization costs. Arrow can then use zero-copy methods to produce a Pandas DataFrame from chunks of data at a time, making the entire process very efficient.

In [8]:
%time pdf = df.toPandas()

CPU times: user 86 ms, sys: 83.8 ms, total: 170 ms
Wall time: 1.31 s
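The idea of zero-copy access to chunks can be illustrated with the standard library alone. In the sketch below, `memoryview` slices stand in for chunks of a column: each slice refers to the original buffer rather than copying it. This is a conceptual analogy only and does not use the Arrow API; the column contents and `chunk_size` are arbitrary.

```python
from array import array

# A contiguous column of 8 doubles
column = array("d", [0.5 * i for i in range(8)])

# A memoryview exposes the buffer without copying it
view = memoryview(column)

# Split the column into fixed-size chunks; each slice is another view
chunk_size = 4
chunks = [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)]

# Writing to the original buffer is visible through the chunk,
# which shows that no copy was made
column[5] = 99.0
print(len(chunks), chunks[1][1])
```

Because each chunk is a view over memory that already exists, handing it to a consumer costs almost nothing regardless of how many values it holds.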
