#PySpark map() Transformation


---

**PySpark map() is an RDD transformation that applies a function (typically a lambda) to every element of an RDD and returns a new RDD. In this article, you will learn the syntax and usage of the RDD map() transformation with an example, and how to use it with a DataFrame.**

**The RDD map() transformation can apply arbitrarily complex per-record operations, such as adding a column, updating a column, or otherwise transforming the data. The output of a map() transformation always has the same number of records as the input.**


---


- Note1: A PySpark DataFrame doesn’t have a map() transformation, so you need to convert the DataFrame to an RDD first.

- Note2: If your function needs heavy initialization, use the PySpark mapPartitions() transformation instead of map(). With mapPartitions(), the heavy initialization runs only once per partition instead of once per record.
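
The difference described in Note2 can be sketched as follows. This is a minimal illustration rather than a benchmark: `load_lookup_table()` is a hypothetical stand-in for any expensive setup (opening a connection, loading a model), and `run_with_spark()` is an illustrative wrapper that assumes a local PySpark installation.

```python
def load_lookup_table():
    """Hypothetical stand-in for heavy initialization (e.g. opening a connection)."""
    return {"in": "IN", "Project": "PROJECT"}


def process_partition(records):
    """Function for mapPartitions(): receives an iterator over one partition."""
    lookup = load_lookup_table()  # executed once per partition
    for record in records:
        # Reuse the already-initialized resource for every record.
        yield lookup.get(record, record)


def run_with_spark():
    """Illustrative driver; pyspark is imported lazily so the helpers above work without it."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapPartitionsSketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(["Project", "in", "Wonderland"], 2)
    return rdd.mapPartitions(process_partition).collect()
```

With map(), the equivalent code would have to call `load_lookup_table()` inside the per-record function, so the setup would run once for every record.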


---


**First, let’s create an RDD from the list.**

In [0]:
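data = [
    "Project", "Gutenberg’s", "Alice’s", "Adventures",
    "in", "Wonderland", "Project", "Gutenberg’s", "Adventures",
    "in", "Wonderland", "Project", "Gutenberg’s"
]

# `sc` is the SparkContext that the PySpark shell and most notebook environments predefine.
rdd = sc.parallelize(data)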

##map() Syntax

###map(f, preservesPartitioning=False)
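
Here `f` is the function applied to each element. According to the PySpark API documentation, `preservesPartitioning` should stay `False` unless the RDD is a pair RDD and `f` does not modify the keys. The sketch below illustrates that case; `scale_value()` and `run_with_spark()` are illustrative names, and the Spark part assumes a local PySpark installation.

```python
def scale_value(pair):
    """Change only the value of a (key, value) pair; the key is left untouched."""
    key, value = pair
    return (key, value * 10)


def run_with_spark():
    """Illustrative driver; pyspark is imported lazily so scale_value works without it."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapSyntaxSketch").getOrCreate()
    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2)]).partitionBy(2)
    # scale_value leaves the keys unchanged, so it is safe to tell Spark
    # that the existing partitioning still holds.
    scaled = pairs.map(scale_value, preservesPartitioning=True)
    return scaled.collect()
```

For a function that changes the keys, or for a non-pair RDD such as the word list above, the default `preservesPartitioning=False` is the right choice.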



---



##PySpark map() Example with RDD

**In this PySpark map() example, we pair each element with the value 1. The result is an RDD of key-value pairs (Python tuples), with the word (a str) as the key and 1 (an int) as the value.**

In [0]:
rdd2 = rdd.map(lambda x:(x,1))
for element in rdd2.collect():
    print(element)


('Project', 1)
('Gutenberg’s', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('Project', 1)
('Gutenberg’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('Project', 1)
('Gutenberg’s', 1)
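
As a quick check that map() returns exactly one output record per input record, the sketch below compares record counts before and after the transformation. `to_pair()` and `run_with_spark()` are illustrative names rather than part of the original example, and the Spark part assumes a local PySpark installation.

```python
def to_pair(word):
    """Same logic as the lambda above: pair each word with the value 1."""
    return (word, 1)


def run_with_spark(words):
    """Illustrative driver; pyspark is imported lazily so to_pair works without it."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapCountSketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(words)
    rdd2 = rdd.map(to_pair)
    # map() is one-to-one, so both counts should match.
    return rdd.count(), rdd2.count()


# Plain-Python equivalent of the same one-to-one mapping; no cluster needed.
words = ["Project", "Adventures", "in", "Wonderland"]
pairs = [to_pair(w) for w in words]
```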


##PySpark map() Example with DataFrame


---


**A PySpark DataFrame doesn’t have a map() transformation for applying a lambda function. When you want to apply a custom transformation, you need to convert the DataFrame to an RDD and apply map() to it. Let’s use another dataset to explain this.**

In [0]:
data = [
    ('James','Smith','M',30),
    ('Anna','Rose','F',41),
    ('Robert','Williams','M',62)
]

columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data = data, schema=columns)
df.show(truncate=False)

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|James    |Smith   |M     |30    |
|Anna     |Rose    |F     |41    |
|Robert   |Williams|M     |62    |
+---------+--------+------+------+



In [0]:
# Referring to columns by index.

rdd2 = df.rdd.map(lambda x: (x[0]+","+x[1], x[2], x[3]*2))

df2 = rdd2.toDF( ["name", "gender", "new_salary"] )

df2.show(truncate=False)

+---------------+------+----------+
|name           |gender|new_salary|
+---------------+------+----------+
|James,Smith    |M     |60        |
|Anna,Rose      |F     |82        |
|Robert,Williams|M     |124       |
+---------------+------+----------+



**Note that above I used the index to get the column values. Alternatively, you can refer to the DataFrame column names while iterating.**

In [0]:
# Referring to columns by name (dictionary-style access).

rdd2 = df.rdd.map(lambda x: (x["firstname"]+","+x["lastname"], x["gender"], x["salary"]*2))

df3 = rdd2.toDF( ["name", "gender", "new_salary"])
df3.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|        60|
|      Anna,Rose|     F|        82|
|Robert,Williams|     M|       124|
+---------------+------+----------+



In [0]:
# Referring to columns by name (attribute-style access).


rdd2 = df.rdd.map(lambda x: ( x.firstname+","+x.lastname, x.gender, x.salary*2 ))

df4 = rdd2.toDF( ["name", "gender", "new_salary"] )

df4.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|        60|
|      Anna,Rose|     F|        82|
|Robert,Williams|     M|       124|
+---------------+------+----------+



**You can also create a custom function to perform an operation. The func1() function below is called from the lambda function for every DataFrame row.**

In [0]:
# By Calling Function

def func1(x):
    firstname = x.firstname
    lastname = x.lastname
    name = firstname + "," + lastname 
    gender = x.gender.lower()
    salary = x.salary*2
    return (name, gender, salary)


rdd2 = df.rdd.map(lambda x: func1(x))

df5 = rdd2.toDF( ["name", "gender", "new_salary"] )

df5.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     m|        60|
|      Anna,Rose|     f|        82|
|Robert,Williams|     m|       124|
+---------------+------+----------+
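
Because func1() already takes a single row argument, it can also be passed to map() directly, without the wrapping lambda. The sketch below shows this under the assumption of a local PySpark installation; `run_with_spark()` is an illustrative wrapper, and the `namedtuple` is only a stand-in for a Spark `Row` so the function can be tried without a cluster.

```python
from collections import namedtuple


def func1(x):
    """Build (name, gender, new_salary) from a row with named fields."""
    name = x.firstname + "," + x.lastname
    return (name, x.gender.lower(), x.salary * 2)


def run_with_spark():
    """Illustrative driver; pyspark is imported lazily so func1 works without it."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapFunctionSketch").getOrCreate()
    data = [("James", "Smith", "M", 30), ("Anna", "Rose", "F", 41)]
    df = spark.createDataFrame(data, ["firstname", "lastname", "gender", "salary"])
    # Pass the function itself; map() calls it once per Row.
    return df.rdd.map(func1).toDF(["name", "gender", "new_salary"])


# Stand-in for a Spark Row: any object with the same attribute names works here.
Person = namedtuple("Person", ["firstname", "lastname", "gender", "salary"])
result = func1(Person("James", "Smith", "M", 30))
```

Keeping the row logic in a named function like this also makes it straightforward to unit test outside Spark.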

