# Transformation

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


## Use `map()` on a PySpark DataFrame's `.rdd`

*Avoid the RDD API where possible; prefer built-in DataFrame operations, which Spark can optimize.*

[rdd.map func](https://sparkbyexamples.com/pyspark/pyspark-map-transformation/)

A PySpark DataFrame has no `map()` transformation of its own, so you need to convert the DataFrame to an RDD first:

```python
# rdd 
sparkDataframe.rdd
```

> PySpark `map()` is an RDD transformation that applies a transformation function (lambda) to every element of an RDD/DataFrame and returns a new RDD.

It is useful for complex operations such as adding a column, updating a column, or otherwise transforming the data.

**Note:** The function must return every column you want to keep; any column it leaves out is absent from the result.

In [12]:
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# explicit schema: two nullable columns
schema = StructType([StructField("name", StringType(), True),
                    StructField("age", IntegerType(), True)])
textFile = spark.read.csv("sample.csv", header=True, schema=schema)
textFile.show()
textFile.printSchema()

+-----+---+
| name|age|
+-----+---+
| john| 33|
| jack| 26|
|derry| 28|
| mary| 64|
+-----+---+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



In [34]:
def funcX(row):
    # each RDD element is a Row; return the new values as a tuple
    return (row.name + "!", row.age + 1)

# use a named function
lambdaCase1 = textFile.rdd.map(funcX).collect()

# or put everything in the lambda expression
lambdaCase2 = textFile.rdd.map(lambda x: (x['name'] + "!", x['age'] + 1)).collect()
lambdaCase1 == lambdaCase2

True

### After `map()`: get a list or a PySpark DataFrame

Convert to a Python list with `collect()`:

In [27]:
rs = textFile.rdd.map(lambda x : funcX(x)).collect()
rs

[('john!', 34), ('jack!', 27), ('derry!', 29), ('mary!', 65)]

Convert back to a DataFrame with `toDF(schema)` or `toDF([columns])`:

In [33]:
rs = textFile.rdd.map(lambda x : (x.name + "!", x.age + 1)).toDF(schema)
rs.printSchema()
type(rs)

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



pyspark.sql.dataframe.DataFrame