- Title: Conversion Between PySpark DataFrames and pandas DataFrames
- Slug: pyspark-pandas-dataframe-conversion
- Date: 2020-06-18 08:43:44
- Category: Computer Science
- Tags: programming, Python, HPC, high performance computing, PySpark, DataFrame, construct
- Author: Ben Du

## Comments

1. A PySpark DataFrame can be converted to a pandas DataFrame by calling the method `DataFrame.toPandas`,
    and a pandas DataFrame can be converted to a PySpark DataFrame by calling `SparkSession.createDataFrame`.
    Notice that `DataFrame.toPandas` collects the whole Spark DataFrame to the driver machine!
    This means that you should only call `DataFrame.toPandas`
    when the Spark DataFrame is small enough to fit into the memory of the driver machine.

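2. Apache Arrow can be leveraged to speed up the conversion between Spark DataFrames and pandas DataFrames
    by transferring the data in a columnar format instead of serializing it row by row.
    However, 
    there are some restrictions on this:
    Arrow-based conversion is not enabled by default and not all Spark SQL data types are supported.
    Please refer to 
    [PySpark Usage Guide for Pandas with Apache Arrow](https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html)
    for more details.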

3. Perhaps the most convenient way to create an ad hoc PySpark DataFrame 
    is to first [create a pandas DataFrame](http://www.legendu.net/en/blog/construct-pandas-dataframe-python/)
    and then convert it to a PySpark DataFrame (using `SparkSession.createDataFrame`).
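
The first two comments can be sketched as two small helpers. This is a minimal sketch rather than part of PySpark: the function names `enable_arrow` and `to_pandas_capped` and the default row cap are hypothetical, and `spark` and `sdf` are assumed to be an existing `SparkSession` and Spark DataFrame. The configuration keys are those used by Spark 3.0 and later (Spark 2.3 and 2.4 used `spark.sql.execution.arrow.enabled` instead), and Arrow-based conversion also assumes that `pyarrow` is installed.

```python
def enable_arrow(spark, fallback=True):
    """Turn on Arrow-based conversion for toPandas() and createDataFrame(pandas_df).

    `spark` is assumed to be a pyspark.sql.SparkSession (Spark 3.0 or later).
    """
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # With fallback enabled, Spark reverts to the non-Arrow path
    # when a column type is not supported by Arrow.
    spark.conf.set(
        "spark.sql.execution.arrow.pyspark.fallback.enabled",
        "true" if fallback else "false",
    )
    return spark


def to_pandas_capped(sdf, max_rows=100_000):
    """Collect at most `max_rows` rows of a Spark DataFrame to the driver.

    `sdf` is assumed to be a pyspark.sql.DataFrame; the default cap is arbitrary.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is a lazy transformation; only toPandas() moves data to the driver.
    return sdf.limit(max_rows).toPandas()
```

With a session `spark` and a Spark DataFrame `sdf`, one would call `enable_arrow(spark)` once and then `to_pandas_capped(sdf, max_rows=1000)`. Capping silently truncates the data, so aggregating or filtering on the Spark side first is often the better choice when the full result matters.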

In [1]:
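import pandas as pd
import findspark
# Point findspark at the local Spark installation before importing pyspark.
findspark.init("/opt/spark")

from pyspark.sql import SparkSession, DataFrame
# Import the functions module under an alias; a wildcard import would shadow
# Python built-ins such as sum, min and max.
import pyspark.sql.functions as F
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark_pandas")
    .enableHiveSupport()
    .getOrCreate()
)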

In [6]:
df_p = pd.DataFrame(data=[
    ["Ben", 2, 30],
    ["Dan", 4, 25],
    ["Will", 1, 26],
], columns=["name", "id", "age"])
df_p

|   | name | id | age |
|---|------|----|-----|
| 0 | Ben  | 2  | 30  |
| 1 | Dan  | 4  | 25  |
| 2 | Will | 1  | 26  |


In [9]:
# Convert the pandas DataFrame to a PySpark DataFrame; column types are inferred.
df1 = spark.createDataFrame(df_p)
df1.show()

+----+---+---+
|name| id|age|
+----+---+---+
| Ben|  2| 30|
| Dan|  4| 25|
|Will|  1| 26|
+----+---+---+
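
The reverse direction uses `DataFrame.toPandas`. The sketch below chains both conversions; the helper name `round_trip` is hypothetical, and `spark` is assumed to be an existing `SparkSession`. Note that `createDataFrame` does not carry over the pandas index, and Spark makes no general guarantee about row order, so sort before comparing the result with the input.

```python
def round_trip(spark, pdf):
    """Convert a pandas DataFrame to a Spark DataFrame and back again."""
    sdf = spark.createDataFrame(pdf)  # pandas -> Spark (distributed)
    return sdf.toPandas()             # Spark -> pandas (collected to the driver)
```

Calling `round_trip(spark, df_p)` with the session and pandas DataFrame defined above should give back a pandas DataFrame with the columns `name`, `id` and `age` and the same three rows.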



## References

https://stackoverflow.com/questions/37612622/spark-unionall-multiple-dataframes

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions