<h2>PySpark DataFrame</h2>
<h2>1. Create Empty RDD in PySpark</h2>
<p>Create an <strong>empty RDD</strong> by calling the emptyRDD() method of SparkContext, for example <strong>spark.sparkContext.emptyRDD()</strong>.</p>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
from pyspark.sql import SparkSession</br>
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()</br>

#Creates Empty RDD</br>
emptyRDD = spark.sparkContext.emptyRDD()</br>
print(emptyRDD)</br>

#Displays</br>
#EmptyRDD[188] at emptyRDD</br>
</p>
<p>Alternatively, you can create an empty RDD by calling spark.sparkContext.parallelize([]).</p>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
#Creates Empty RDD using parallelize</br>
rdd2= spark.sparkContext.parallelize([])</br>
print(rdd2)</br>

#Displays (the RDD id and call site vary by session)</br>
#ParallelCollectionRDD[206] at readRDDFromFile at PythonRDD.scala:262</br>
</p>
<h2>2. Create Empty DataFrame with Schema (StructType)</h2>
<p>Here is an example of creating an empty DataFrame with a schema. First, define the schema as a StructType.</p>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
#Create Schema</br>
from pyspark.sql.types import StructType,StructField, StringType</br>
schema = StructType([</br>
  StructField('firstname', StringType(), True),</br>
  StructField('middlename', StringType(), True),</br>
  StructField('lastname', StringType(), True)</br>
  ])</br>
</p>
<p>Now pass the empty RDD created above to createDataFrame() of SparkSession, along with the schema that defines the column names and data types.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
#Create empty DataFrame from empty RDD</br>
df = spark.createDataFrame(emptyRDD,schema)</br>
df.printSchema()</br>
</p>
<h2>3. Convert Empty RDD to DataFrame</h2>
<p>You can also create an empty DataFrame by converting the empty RDD with toDF(), or by passing an empty list directly to createDataFrame().</p>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
#Convert empty RDD to Dataframe</br>
df1 = emptyRDD.toDF(schema)</br>
df1.printSchema()</br>

#Create empty DataFrame directly.</br>
df2 = spark.createDataFrame([], schema)</br>
df2.printSchema()</br>
</p>
<h2>Create DataFrame from RDD</h2>
<h2>1. Create PySpark RDD</h2>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
from pyspark.sql import SparkSession</br>
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()</br>
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]</br>
rdd = spark.sparkContext.parallelize(dept)</br>
</p>
<h2>2. Convert PySpark RDD to DataFrame</h2>
<p>You can convert a PySpark RDD to a DataFrame using <strong>toDF()</strong> or <strong>createDataFrame()</strong>.</p>
<p height="200" width="100%" style="background:black;font-size:20px;color:white">
# Using rdd.toDF() function</br>
df = rdd.toDF()</br>
df.printSchema()</br>
df.show(truncate=False)</br>
# Using PySpark createDataFrame() function</br>
deptColumns = ["dept_name","dept_id"]</br>
deptDF = spark.createDataFrame(rdd, schema = deptColumns)</br>
deptDF.printSchema()</br>
deptDF.show(truncate=False)</br>
# Using createDataFrame() with StructType schema</br>
# dept_id holds Python ints, so it is declared as IntegerType</br>
from pyspark.sql.types import StructType,StructField, StringType, IntegerType</br>
deptSchema = StructType([</br>
    StructField('dept_name', StringType(), True),</br>
    StructField('dept_id', IntegerType(), True)</br>
])</br>

deptDF1 = spark.createDataFrame(rdd, schema = deptSchema)</br>
deptDF1.printSchema()</br>
deptDF1.show(truncate=False)</br>
</p>

<h2>Complete Example</h2>
<p height="400" width="100%" style="background:black;font-size:20px;color:white">
import pyspark</br>
from pyspark.sql import SparkSession</br>
</br>
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()</br>
</br>
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]</br>
rdd = spark.sparkContext.parallelize(dept)</br>
</br>
df = rdd.toDF()</br>
df.printSchema()</br>
df.show(truncate=False)</br>
</br>
deptColumns = ["dept_name","dept_id"]</br>
df2 = rdd.toDF(deptColumns)</br>
df2.printSchema()</br>
df2.show(truncate=False)</br>
</br>
deptDF = spark.createDataFrame(rdd, schema = deptColumns)</br>
deptDF.printSchema()</br>
deptDF.show(truncate=False)</br>
</br>
# dept_id holds Python ints, so it is declared as IntegerType</br>
from pyspark.sql.types import StructType,StructField, StringType, IntegerType</br>
deptSchema = StructType([</br>
    StructField('dept_name', StringType(), True),</br>
    StructField('dept_id', IntegerType(), True)</br>
])</br>
</br>
deptDF1 = spark.createDataFrame(rdd, schema = deptSchema)</br>
deptDF1.printSchema()</br>
deptDF1.show(truncate=False)</br>
</p>
<h2>Convert PySpark DataFrame to Pandas</h2>
<p>Use the toPandas() method available on PySpark DataFrame objects to convert them to Pandas DataFrames.</p>
<p>Pandas DataFrames are in-memory data structures, so consider memory constraints when converting large PySpark DataFrames.</p>
<p>Converting PySpark DataFrames to Pandas DataFrames allows you to leverage Pandas’ extensive functionality for data manipulation and analysis.</p>
<p height="200" width="100%" style="background:black;font-size:20px;color:white">
import pyspark</br>
from pyspark.sql import SparkSession</br>

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()</br>

data = [("James","","Smith","36636","M",60000),</br>
        ("Michael","Rose","","40288","M",70000),</br>
        ("Robert","","Williams","42114","",400000),</br>
        ("Maria","Anne","Jones","39192","F",500000),</br>
        ("Jen","Mary","Brown","","F",0)]</br>

columns = ["first_name","middle_name","last_name","dob","gender","salary"]</br>
pysparkDF = spark.createDataFrame(data = data, schema = columns)</br>
pysparkDF.printSchema()</br>
pysparkDF.show(truncate=False)</br>
</p>
<p><strong>toPandas()</strong> collects all records of the PySpark DataFrame to the driver program, so it should be used only on a small subset of the data. Running it on a larger dataset can cause an out-of-memory error and crash the application.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
pandasDF = pysparkDF.toPandas()</br>
print(pandasDF)</br>
</p>
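<p>You can <strong>rename</strong> pandas columns by using the rename() function.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
pandasDF = pysparkDF.toPandas()</br>
# 'firstname' is only an illustrative new column name</br>
pandasDF = pandasDF.rename(columns={"first_name": "firstname"})</br>
print(pandasDF)</br>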
</p>
<h2>Convert Spark Nested Struct DataFrame to Pandas</h2>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
# Nested structure elements</br>
from pyspark.sql.types import StructType, StructField, StringType,IntegerType</br>
dataStruct = [(("James","","Smith"),"36636","M","3000"),</br>
      (("Michael","Rose",""),"40288","M","4000"),</br>
      (("Robert","","Williams"),"42114","M","4000"),</br>
      (("Maria","Anne","Jones"),"39192","F","4000"),</br>
      (("Jen","Mary","Brown"),"","F","-1")</br>
]</br>
</p>
<p height="150" width="100%" style="background:black;font-size:20px;color:white">
schemaStruct = StructType([</br>
        StructField('name', StructType([</br>
             StructField('firstname', StringType(), True),</br>
             StructField('middlename', StringType(), True),</br>
             StructField('lastname', StringType(), True)</br>
             ])),</br>
          StructField('dob', StringType(), True),</br>
         StructField('gender', StringType(), True),</br>
         StructField('salary', StringType(), True)</br>
         ])</br>
df = spark.createDataFrame(data=dataStruct, schema = schemaStruct)</br>
df.printSchema()</br>
pandasDF2 = df.toPandas()</br>
print(pandasDF2)</br>
</p>