<img src="../static/logo.png" alt="datio" style="width: 200px" align="right"/>

## Defining Spark UDFs

Defining a UDF is straightforward: we create an anonymous function and register it either through the SQLContext or through the udf function in pyspark.sql.functions (org.apache.spark.sql.functions.udf in the Scala API), depending on how we want to use it.  
A UDF operates on distributed DataFrames and is applied row by row.  

As a motivating example, assume we want to split the string column "f_cierre", which holds date information, into year, month and day.

In [None]:
import pyspark
from pyspark.sql.context import SQLContext
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]') 
sqlContext = SQLContext(sc)
# We are going to work with the Ttgofici data
dataPath = "../data/"
customSchema = StructType([
 StructField("cod_bancsb",  StringType(), True),
 StructField("cod_ofici",  IntegerType(), True),
 StructField("cnivel",  StringType(), True),
 StructField("cod_zona",  StringType(), True),
 StructField("cod_territor",  StringType(), True),
 StructField("cod_dirgener",  StringType(), True),
 StructField("cod_areanego",  IntegerType(), True),
 StructField("cod_dar",  StringType(), True),
 StructField("des_nomco",  StringType(), True),
 StructField("des_nomab",  StringType(), True),
 StructField("f_cierre",  StringType(), True),
 StructField("cod_cbc",  StringType(), True)])

ttgoficiDF = sqlContext.read.format("com.databricks.spark.csv")\
            .option("header", "true")\
            .load(dataPath + "ttgofici.csv", schema=customSchema)

In [None]:
from pyspark.sql.functions import UserDefinedFunction
getDay = UserDefinedFunction(lambda x: x[8:10], StringType())
getMonth = UserDefinedFunction(lambda x: x[5:7], StringType())
getYear = UserDefinedFunction(lambda x: x[0:4], StringType())
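
The slices above only make sense if f_cierre looks like "YYYY-MM-DD", which is an assumption inferred from the indices and not stated in the data description. The column is also declared nullable, and a null reaches the Python function as None, where `x[8:10]` raises a TypeError. A minimal sketch of null-safe row functions in plain Python (the names `safe_slice`, `get_day`, `get_month` and `get_year` are hypothetical helpers, not part of the notebook):

```python
def safe_slice(start, end):
    """Build a function that slices a string and passes None through unchanged."""
    def _slice(value):
        return value[start:end] if value is not None else None
    return _slice

# Same index ranges as the lambdas above
get_day = safe_slice(8, 10)
get_month = safe_slice(5, 7)
get_year = safe_slice(0, 4)

sample = "2016-03-15"  # hypothetical value, assuming a YYYY-MM-DD layout
print(get_year(sample), get_month(sample), get_day(sample))
print(get_day(None))  # None instead of a TypeError
```

Any of these functions can be wrapped the same way as the lambdas, for example `UserDefinedFunction(get_day, StringType())`.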

In [None]:
ttgoficiDF2 = ttgoficiDF\
.withColumn("dia",getDay("f_cierre"))\
.withColumn("mes",getMonth("f_cierre"))\
.withColumn("anio",getYear("f_cierre"))

In [None]:
ttgoficiDF2.show()

In [None]:
ttgoficiDF2.registerTempTable("TtgoficiDMY")

## Registering a UDF in Spark SQL

Another option is to register a function as a UDF so it can be used in SQL statements.  

registerFunction(name, f, returnType=StringType())

### 1. Register an anonymous function

    Syntax  

    A lambda function consists of a single expression, with the following syntax:  

    lambda [arg1 [,arg2,.....argn]]:expression  
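
As a small illustration of that syntax, the sketch below compares a lambda with the equivalent named function in plain Python. The sample value and the name `get_year_def` are assumptions made for the example only.

```python
# A lambda is a single expression; its value is returned implicitly.
get_year = lambda x: x[0:4]

# The equivalent named function needs an explicit return statement.
def get_year_def(x):
    return x[0:4]

sample = "2016-03-15"  # hypothetical value
print(get_year(sample))      # 2016
print(get_year_def(sample))  # 2016
```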

In [None]:
sqlContext.registerFunction("getDay", lambda x: x[8:10], StringType())
sqlContext.registerFunction("getMonth", lambda x: x[5:7], StringType())
sqlContext.registerFunction("getYear", lambda x: x[0:4], StringType())

In [None]:
sqlContext.registerDataFrameAsTable(ttgoficiDF,"ttgoficiDF")

Now we can use our function directly in SparkSQL.

In [None]:
sqlContext.sql("select *, getDay(f_cierre) as dia, getMonth(f_cierre) as mes, getYear(f_cierre) as anio \
from ttgoficiDF")

### 2. Register a function

    Syntax  

    def functionname( parameters ):  
       "function_docstring"  
       function_suite  
       return [expression]  

In [None]:
def includePref(value, pref):
    return pref + value

In [None]:
sqlContext.registerFunction("includePref",includePref)

In [None]:
sqlContext.sql("select *, includePref(cnivel, 'C-') as cnivel from ttgoficiDF").show()

In [None]:
# ...but not from the DataFrame API: udfincludePref has not been defined yet, so this cell raises a NameError
ttgoficiDF.withColumn("cnivel2",udfincludePref(col("cnivel"), lit("C-"))).show()

You can see above that the registered function can be used within SQL but not outside of it.  
To use it with the DataFrame API we have to create a separate UDF with the udf function from pyspark.sql.functions, which wraps a Python function so that it can be applied to DataFrame columns.

In [None]:
from pyspark.sql.functions import udf,lit,col
udfincludePref = udf(includePref, StringType())
#now this works
ttgoficiDF.withColumn("cnivel2",udfincludePref("cnivel", lit("C-"))).show()

## Exercise 1:

Add a new column named "area" to the ttgoficiDF DataFrame, based on the value of cod_territor, as follows:  
    cod_territor >= 8000 -> area = A  
    cod_territor >= 6000 -> area = B  
    cod_territor >= 4000 -> area = C  
    cod_territor <  4000 -> area = D  

In [None]:
def codTerritorToArea(cod_territor):
    territor = int(cod_territor)
    if territor >= 8000: return 'A'
    elif territor >= 6000: return 'B'
    elif territor >= 4000: return 'C'
    else: return 'D'
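
Before wrapping the function in a UDF, it can help to check the thresholds in plain Python. The sketch below is a variant of the function above under the hypothetical name `cod_territor_to_area`. The None check is an addition, not part of the original solution. It is there because cod_territor is declared nullable and `int(None)` raises a TypeError.

```python
def cod_territor_to_area(cod_territor):
    """Map a territory code to an area letter; None is passed through."""
    if cod_territor is None:  # assumption: nulls should stay null
        return None
    territor = int(cod_territor)
    if territor >= 8000:
        return 'A'
    elif territor >= 6000:
        return 'B'
    elif territor >= 4000:
        return 'C'
    else:
        return 'D'

# Check the values on both sides of each threshold
for code in ["8000", "7999", "6000", "5999", "4000", "3999", None]:
    print(code, "->", cod_territor_to_area(code))
```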

In [None]:
udfcodTerritorToArea = udf(codTerritorToArea, StringType())
ttgoficiDF.withColumn("area", udfcodTerritorToArea("cod_territor")).show()

Implement the same functionality with a SQL statement: case...