# PySpark training for data engineers
## 05. Data Enriching

### Goal

Add more value to the data by:
* Adding new columns
* Using lambda functions
* Using user defined functions

### Highlights
* `df.withColumn('new_col', Function())` returns a DataFrame with the new column `new_col` added, where `Function()` is a placeholder for any expression that produces a column
* `len_fun = udf(lambda z: len(z), IntegerType())` defines a user defined function (UDF) that returns the length of its input as an integer
* `df = df.withColumn('length_col', len_fun('text_col'))` adds a column `length_col` holding the length of each value in `text_col`

### Implementation

In [None]:
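from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Run Spark locally; note that `spark` here is a SparkContext, not a SparkSession.
config = SparkConf().setMaster('local')
spark = SparkContext.getOrCreate(conf=config)
# The SQLContext provides the DataFrame reader used in the next cell.
sqlContext = SQLContext(spark)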

In [None]:
df = sqlContext.read.parquet('notebook-04-parquet/')

In [None]:
df.show()

In [None]:
# Add a column holding twice the value of `age`.
df = df.withColumn('double_age', df.age * 2)

Define a user defined function (UDF) with the `@udf` decorator, where `'integer'` declares the return type:

In [None]:
from pyspark.sql.functions import udf

@udf('integer')
def calc_name_length(name):
    return len(name)

In [None]:
df = df.withColumn('name_length', calc_name_length(df.first_name))
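
A UDF receives `None` for null cells, so `len(name)` would raise a `TypeError` on such a row. The following is a minimal sketch of a null-tolerant variant, assuming `first_name` may contain nulls (the notebook does not say whether it does). The names `safe_name_length` and `add_safe_length` are hypothetical and not part of the original notebook.

```python
def safe_name_length(name):
    """Return the length of `name`, or None when the cell is null."""
    return None if name is None else len(name)


def add_safe_length(df, source_col, target_col):
    """Return `df` with `target_col` holding the null-safe length of `source_col`."""
    # Imported here so the plain helper above can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    safe_len_udf = udf(safe_name_length, IntegerType())
    return df.withColumn(target_col, safe_len_udf(df[source_col]))
```

With the DataFrame from this notebook, a call such as `df = add_safe_length(df, 'first_name', 'name_length')` would then leave a null in `name_length` wherever `first_name` is null, instead of failing the job.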

Define a UDF from a lambda function with one input parameter:

In [None]:
from pyspark.sql.types import IntegerType
len_udf_int = udf(lambda z: len(z), IntegerType())

In [None]:
df = df.withColumn('last_name_length', len_udf_int('last_name'))

Define a UDF from a lambda function with two input parameters:

In [None]:
len_udf_two_int = udf(lambda z, y: len(z) + len(y), IntegerType())

In [None]:
df = df.withColumn('full_name_length', len_udf_two_int('first_name', 'last_name'))
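
The two-parameter lambda can also be written as a named function, which makes it possible to check the logic in plain Python before wrapping it as a UDF. This is only a sketch: `combined_length` and `with_full_name_length` are hypothetical names, and the code assumes neither column contains nulls.

```python
def combined_length(first, last):
    """Return the summed length of two strings (assumes neither is None)."""
    return len(first) + len(last)


def with_full_name_length(df):
    """Return `df` with a `full_name_length` column built from both name columns."""
    # Imported here so combined_length can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    combined_udf = udf(combined_length, IntegerType())
    # Columns are passed positionally, in the same order as the function's parameters.
    return df.withColumn('full_name_length', combined_udf('first_name', 'last_name'))
```

For example, `combined_length('Ada', 'Lovelace')` returns `11`, the same value the UDF would write for a row with those two names.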

Remove a column from the dataframe: