# How to Create a Spark DataFrame efficiently from Pandas using Arrow

Ref: https://bryancutler.github.io/createDataFrame/

This notebook will demonstrate how to enable Arrow to quickly and efficiently create a Spark DataFrame from an existing Pandas DataFrame.


## Generate a Pandas DataFrame
First, let's start a Spark session and make a function that generates sample data with NumPy and wraps it in a Pandas DataFrame. The function takes an integer `num_records` and creates a 2D array of doubles, which becomes a DataFrame of 10 columns by `num_records` rows.

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("PySpark_createDataFrame_with_Arrow")\
    .getOrCreate()

In [5]:
import pandas as pd
import numpy as np

def gen_pdf(num_records):
    return pd.DataFrame(np.random.rand(num_records, 10), columns=list("abcdefghij"))

## Without Arrow, Life is Painful!
Let's first try to create a DataFrame without Arrow. To avoid too much waiting around, we will only use 100,000 records and time a single call to create the DataFrame (this takes ~6-7s running locally on my laptop).

In [6]:
spark.conf.set("spark.sql.execution.arrow.enabled", "false")

pdf = gen_pdf(100000)

%timeit -r1 spark.createDataFrame(pdf)

7.73 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


This really wasn't much data, but it is still extremely slow! That is mostly because Spark must iterate through each row of data and do type checking and conversions from Python to Java for each value, which in turn forces NumPy to convert the data to plain Python objects and serialize these to the JVM.
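
To make that per-value cost concrete, here is a rough sketch using only NumPy and Pandas. It is not Spark's internal code. It assumes a small 1,000-row frame and contrasts a row-wise conversion, where every value becomes a separate Python `float`, with a column-wise view, where each column stays a single typed NumPy array. The names `rows` and `cols` are illustrative.

```python
import numpy as np
import pandas as pd

# Small sample frame with the same shape as gen_pdf's output (10 double columns).
pdf = pd.DataFrame(np.random.rand(1000, 10), columns=list("abcdefghij"))

# Row-wise: each record becomes a tuple of plain Python floats,
# so 1000 * 10 individual objects have to be created and later serialized.
rows = pdf.to_records(index=False).tolist()

# Column-wise: each column stays one float64 array,
# the kind of layout that can be shipped without per-value conversion.
cols = {name: pdf[name].to_numpy() for name in pdf.columns}

print(type(rows[0]), type(rows[0][0]))    # a tuple holding Python floats
print(cols["a"].dtype, cols["a"].nbytes)  # one typed array per column
```

The row-wise form is what makes the non-Arrow path expensive: the work grows with the number of individual values, not the number of columns.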

## Enable Arrow with a Spark Conf
Now enable Arrow. This can also be done by adding a line to `spark-defaults.conf`.
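
As a sketch of the config-file route, the entry would look like the line below, with the key and the value separated by whitespace. Spark 3.0 and later rename this key to `spark.sql.execution.arrow.pyspark.enabled` and treat the older name as deprecated, so check which one your version expects.

```
spark.sql.execution.arrow.enabled  true
```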

In [8]:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [9]:
pdf = gen_pdf(1000000)

%timeit spark.createDataFrame(pdf)

504 ms ± 90.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


That's more like it! Even with 10 times as many records, the call finishes in about half a second instead of nearly 8 seconds, which works out to roughly 150 times faster per record. Arrow allows the NumPy data to be sent to the JVM in batches, where it can be consumed directly without a bunch of per-value conversions, while still ensuring accurate type info.
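
The batching idea can be sketched with Pandas alone. This is a conceptual sketch, not Spark's implementation: `slice_pdf` is an illustrative helper, and `num_slices=4` is an assumed stand-in for however many chunks Spark decides to send.

```python
import numpy as np
import pandas as pd

def slice_pdf(pdf, num_slices):
    """Split a DataFrame into at most num_slices row chunks (illustrative helper)."""
    # Ceiling division, with a floor of 1 so an empty frame does not give a zero step.
    step = max(1, -(-len(pdf) // num_slices))
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]

pdf = pd.DataFrame(np.random.rand(1000, 10), columns=list("abcdefghij"))
batches = slice_pdf(pdf, num_slices=4)  # 4 is an assumed value

# Each chunk keeps the column dtypes, so type information travels with the batch.
for batch in batches:
    print(len(batch), set(batch.dtypes))
```

Because every chunk is still a typed, column-oriented block, nothing has to be inspected value by value on the receiving side.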

Just to be sure nothing fishy is going on, we can take a look at the data and make sure it checks out.

In [10]:
spark.createDataFrame(pdf) \
  .select("a").summary().show()

+-------+--------------------+
|summary|                   a|
+-------+--------------------+
|  count|             1000000|
|   mean| 0.49979741282671175|
| stddev|  0.2887255683364922|
|    min|8.588998741121401E-7|
|    25%| 0.24983102003302005|
|    50%|  0.4998108544865818|
|    75%|  0.7498876936843447|
|    max|  0.9999972840513918|
+-------+--------------------+
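
One way to cross-check these numbers is to compute the same statistics on the Pandas side. The sketch below assumes a small stand-in frame so that it runs on its own. In the notebook you would call `describe()` on the `pdf` that was passed to `createDataFrame`. Pandas labels the standard deviation `std` rather than `stddev`, and Spark's `summary()` may compute percentiles approximately, so small differences in the quartiles are to be expected.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's pdf (assumed size; the notebook uses 1,000,000 rows).
pdf = pd.DataFrame(np.random.rand(1000, 10), columns=list("abcdefghij"))

# describe() reports count, mean, std, min, quartiles, and max for one column.
stats = pdf["a"].describe()
print(stats)
```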

