#Using PySpark Row on DataFrame and RDD

**In PySpark, the Row class is available by importing pyspark.sql.Row. It represents a record (row) in a DataFrame. You can create a Row object by using named arguments, or define a custom Row-like class.**

---


***This notebook shows how to use the Row class on RDD and DataFrame, along with its functions.***

####Key Points of Row Class:


- Before Spark 3.0, a Row created with named arguments had its fields sorted by name.
- Since Spark 3.0, Rows created from named arguments are no longer sorted alphabetically; the fields keep the order in which they were entered.
- To re-enable sorting by name, set the environment variable PYSPARK_ROW_FIELD_SORTING_ENABLED to true.
- The Row class also provides a way to create a struct-type column.

##1. Create a Row Object

---

**The Row class extends tuple, so it takes a variable number of arguments; Row() is used to create the Row object. Once the Row object is created, we can retrieve its data by index, just as with a tuple.**

In [0]:
from pyspark.sql import Row

In [0]:
row = Row("James", 40)
print(row[0]+','+str(row[1]))

James,40


**This outputs James,40. Alternatively, you can use named arguments. The benefit of named arguments is that you can access a value by field name, as in row.name. The example below prints “Alice”.**

In [0]:
row = Row(name='Alice', age=11)
print(row.name)

Alice


##2. Create Custom Class from Row

---

**We can also create a Row-like class, for example “Person”, and use it the same way as a Row object. This is helpful when you want to create objects at runtime and refer to their properties. In the example below, we create a Person class and use it like Row.**

In [0]:
Person = Row("name", "age")

p1 = Person("James", 40)
p2 = Person("Alice", 35)

print(p1.name+','+p2.name)

James,Alice


##3. Using Row class on PySpark RDD

---

**We can use the Row class on a PySpark RDD. When you use Row to create an RDD, collecting the data returns the result as Row objects.**

In [0]:
from pyspark.sql import Row

data = [
    Row(name='James,,Smith', lang=['Java','Scala','C++'], state='CA'),
    Row(name="Michael, Rose,", lang=['Spark', 'Java', 'C++'], state='NJ'),
    Row(name="Robert,,Williams", lang=['CSharp',"VB"], state='NV')
]
rdd = sc.parallelize(data)
print(rdd.collect())

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'), Row(name='Michael, Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'), Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]


***Now, let’s collect the data and access it using the Row properties.***

In [0]:
collData = rdd.collect()

for row in collData:
    print(row.name + ',' + str(row.lang))

James,,Smith,['Java', 'Scala', 'C++']
Michael, Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


***Alternatively, you can do the same by creating a Row-like class “Person”.***

In [0]:
Person = Row('name', 'lang', 'state')

data_v2 = [
    Person('James,,Smith',['java','Scala','C++'],'CA'),
    Person('Michael,Rose,',['Spark','Java','C++'],"NJ"),
    Person('Robert,,Williams',['CSharp','VB'],'NV')
]

rdd_v2 = sc.parallelize(data_v2)
rdd_v2.collect()

Out[6]: [Row(name='James,,Smith', lang=['java', 'Scala', 'C++'], state='CA'),
 Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'),
 Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]

In [0]:
collData_v2 = rdd_v2.collect()
for row in collData_v2:
    print(row.name, row.lang)

James,,Smith ['java', 'Scala', 'C++']
Michael,Rose, ['Spark', 'Java', 'C++']
Robert,,Williams ['CSharp', 'VB']


##4. Using Row class on PySpark DataFrame

---

**Similarly, the Row class can also be used with a PySpark DataFrame. By default, data in a DataFrame is represented as Row objects. To demonstrate, I will use the same data that was created for the RDD.**

**Note that a Row used with a DataFrame may not omit a named argument to indicate that a value is None or missing. In that case, the argument should be set explicitly to None.**

In [0]:
df = spark.createDataFrame(data=data)
df.printSchema()
df.show()

#This yields the output below. Note that the DataFrame takes its column names from the Row objects.

root
 |-- name: string (nullable = true)
 |-- lang: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- state: string (nullable = true)

+----------------+------------------+-----+
|            name|              lang|state|
+----------------+------------------+-----+
|    James,,Smith|[Java, Scala, C++]|   CA|
|  Michael, Rose,|[Spark, Java, C++]|   NJ|
|Robert,,Williams|      [CSharp, VB]|   NV|
+----------------+------------------+-----+



***You can also change the column names by using the toDF() function.***

In [0]:
columns = ['name', 'languageAtSchool', 'currentState']

df = spark.createDataFrame(data=data).toDF(*columns)
df.printSchema()

#This yields the output below. Note the new column names “languageAtSchool” and “currentState”, which replace “lang” and “state” from the previous example.

root
 |-- name: string (nullable = true)
 |-- languageAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)



##5. Create Nested Struct Using Row Class

---

**The example below shows a way to create a struct type using the Row class. Alternatively, you can create a struct type by providing a schema with PySpark StructType and StructField.**

In [0]:
# Create a DataFrame with a struct column using the Row class
from pyspark.sql import Row

data = [
    Row(name='James', prop=Row(hair='black', eye='blue')),
    Row(name='Ann', prop=Row(hair='grey', eye='black'))
]

df = spark.createDataFrame(data=data)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- prop: struct (nullable = true)
 |    |-- hair: string (nullable = true)
 |    |-- eye: string (nullable = true)

+-----+-------------+
| name|         prop|
+-----+-------------+
|James|{black, blue}|
|  Ann|{grey, black}|
+-----+-------------+
+-----+-------------+

