# Convert PySpark RDD to DataFrame

---

**In PySpark, the toDF() function of an RDD converts the RDD to a DataFrame. The conversion is often worthwhile because a DataFrame is a distributed collection of data organized into named columns, similar to a database table, and it benefits from Spark's query optimization and the resulting performance improvements.**

## 1. Create PySpark RDD

--- 

**In PySpark, a Python list is a collection of data held in the driver's memory. When you create an RDD from it with parallelize(), the collection is distributed across the cluster.**

In [0]:

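# Sample data held in driver memory: (dept_name, dept_id) tuples
dept = [('Finance', 10), ('Marketing', 20), ('Sales', 30), ('IT', 40)]

# sc is the SparkContext provided by the notebook (also available as spark.sparkContext)
rdd = sc.parallelize(dept)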

## 2. Convert PySpark RDD to DataFrame

---

**A PySpark RDD can be converted to a DataFrame using either toDF() or createDataFrame(). The following subsections cover both approaches.**

### 2.1 Using rdd.toDF() function

---

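**PySpark provides a toDF() function on RDD that converts the RDD into a DataFrame. Called without arguments, it names the columns _1, _2, and so on, and infers the column types from the data.**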

In [0]:
df = rdd.toDF()
df.printSchema()
df.show(truncate=False)

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|_1       |_2 |
+---------+---+
|Finance  |10 |
|Marketing|20 |
|Sales    |30 |
|IT       |40 |
+---------+---+



**toDF() also accepts a schema argument, which can be a list of column names, as shown below.**

In [0]:
deptColumns = ['dept_name', 'dept_id']

df2 = rdd.toDF(schema=deptColumns)
df2.printSchema()

df2.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



### 2.2 Using PySpark createDataFrame() function

---

**The SparkSession class provides the createDataFrame() method, which creates a DataFrame and accepts an RDD as its data argument.**

In [0]:
deptDF = spark.createDataFrame(data=rdd, schema=deptColumns)

deptDF.printSchema()

deptDF.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



### 2.3 Using createDataFrame() with StructType schema

---


**When the schema is inferred, the data type of each column is derived from the data and nullable is set to true for all columns. We can change this behavior by supplying a schema using StructType, which lets us specify the column name, data type, and nullable flag for each field. In the example below, dept_id is declared as StringType, so it appears as string instead of the inferred long.**

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

deptSchema = StructType([
    StructField('dept_name', StringType(), True),
    StructField('dept_id', StringType(), True)
])


deptDF1 = spark.createDataFrame(data=rdd, schema=deptSchema)

deptDF1.printSchema()

deptDF1.show(truncate=False)


root
 |-- dept_name: string (nullable = true)
 |-- dept_id: string (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+

