# Part VI: User Defined Functions

This notebook looks at our client data *(please note this is not real data)* and shows how complex logic can be written in Python and then applied to a DataFrame as a user-defined function (UDF).

In this scenario we are looking at data quality. Our sales team has noticed that when client emails have uncommon top-level domains, they are unable to contact those clients quickly. According to research by the sales team's analysts, emails ending in .com or .org have faster response times.

To help with this, we will flag any row whose email does not end with .com or .org as invalid.

#### Load Data

We already have our mock client data in DBFS from earlier. Now we just need to load it into our notebook.

In [0]:
# File location and type
file_location = "/FileStore/tables/MOCK_CLIENT_DATA-1.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# Create Dataframe.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df.select(['id','email','account_balance','portfolio_theme']).show(10, False)

+---+------------------------+---------------+---------------+
|id |email                   |account_balance|portfolio_theme|
+---+------------------------+---------------+---------------+
|1  |bruse0@lulu.com         |$558243.81     |core           |
|2  |meffemy1@paypal.com     |$3739.01       |smart-beta     |
|3  |ciannello2@tinypic.com  |$347550.67     |smart-beta     |
|4  |lbonnette3@dropbox.com  |$250191.19     |core           |
|5  |mspellicy4@freewebs.com |$705098.95     |sustainable    |
|6  |cbauduin5@ustream.tv    |$414601.73     |smart-beta     |
|7  |rwolverson6@cnbc.com    |$183900.44     |sustainable    |
|8  |pgueste7@dagondesign.com|$824124.75     |core           |
|9  |pdrescher8@forbes.com   |$347858.05     |core           |
|10 |bhousbie9@posterous.com |$26098.72      |core           |
+---+------------------------+---------------+---------------+
only showing top 10 rows



Let's create a user-defined function (UDF) that will look at the email column and only consider it a valid email if it ends in ".com" or ".org". 

Notice that this Python function is decorated with '@udf', which lets Spark apply it to a DataFrame column one row at a time.

In [0]:
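from pyspark.sql.functions import udf

# Create a UDF that checks the text after the last '.' in the email address.
# With no return type given, @udf produces a string column.
@udf
def checkEmail(email):
    validDomains = ["com", "org"]
    # Null emails cannot be split, so flag them as invalid.
    if email is None:
        return "Invalid domain"
    arrEnd = email.split(".")[-1]
    if arrEnd not in validDomains:
        return "Invalid domain"
    else:
        return "Valid domain"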
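
To see what the function does to a single value before Spark applies it row by row, the same split logic can be run as ordinary Python. This is only a minimal sketch: `domain_suffix` is a hypothetical helper name that does not appear in the notebook, and the two sample addresses are taken from the table above.

```python
def domain_suffix(email):
    """Return the text after the last '.' in an email address."""
    return email.split(".")[-1]

valid_domains = ["com", "org"]

# Two addresses from the sample output: one .com and one .tv
for email in ["bruse0@lulu.com", "cbauduin5@ustream.tv"]:
    suffix = domain_suffix(email)
    label = "Valid domain" if suffix in valid_domains else "Invalid domain"
    print(email, "->", suffix, "->", label)
```

Trying the logic on a few plain strings like this is a quick way to catch mistakes before wrapping it as a UDF.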


In [0]:
# create new column that applies our udf to each row of the email column
df.withColumn("email_check", checkEmail("email")) \
    .select(['id','email','account_balance','portfolio_theme','email_check']) \
    .show(10, truncate=False)

+---+------------------------+---------------+---------------+--------------+
|id |email                   |account_balance|portfolio_theme|email_check   |
+---+------------------------+---------------+---------------+--------------+
|1  |bruse0@lulu.com         |$558243.81     |core           |Valid domain  |
|2  |meffemy1@paypal.com     |$3739.01       |smart-beta     |Valid domain  |
|3  |ciannello2@tinypic.com  |$347550.67     |smart-beta     |Valid domain  |
|4  |lbonnette3@dropbox.com  |$250191.19     |core           |Valid domain  |
|5  |mspellicy4@freewebs.com |$705098.95     |sustainable    |Valid domain  |
|6  |cbauduin5@ustream.tv    |$414601.73     |smart-beta     |Invalid domain|
|7  |rwolverson6@cnbc.com    |$183900.44     |sustainable    |Valid domain  |
|8  |pgueste7@dagondesign.com|$824124.75     |core           |Valid domain  |
|9  |pdrescher8@forbes.com   |$347858.05     |core           |Valid domain  |
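|10 |bhousbie9@posterous.com |$26098.72      |core           |Valid domain  |
+---+------------------------+---------------+---------------+--------------+
only showing top 10 rows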

Great! We can see in the sample that our UDF checked each email, found the .tv entry, and flagged it as invalid.

While this is a very simple example, and there are many scenarios where the logic implemented here would not hold up, you should now have a better understanding of how to create and apply UDFs to your data.
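
As one illustration of where the simple rule falls short, the sketch below compares it with a slightly stricter check in plain Python. Both function names (`simple_check`, `strict_check`) are hypothetical, and the sample inputs are made up for illustration. The stricter rules (trim whitespace, ignore case, require an '@' and a dotted domain) are assumptions about what "valid" might mean, not requirements from the sales team.

```python
VALID_DOMAINS = {"com", "org"}

def simple_check(email):
    """Mirror of the notebook's rule: look only at the text after the last '.'."""
    suffix = email.split(".")[-1]
    return "Valid domain" if suffix in VALID_DOMAINS else "Invalid domain"

def strict_check(email):
    """Hypothetical stricter rule: handles nulls, whitespace, case, and a missing '@'."""
    if email is None:
        return "Invalid domain"
    cleaned = email.strip().lower()
    # Split on the last '@' into the local part and the domain.
    local, at, domain = cleaned.rpartition("@")
    if not at or not local or "." not in domain:
        return "Invalid domain"
    suffix = domain.rsplit(".", 1)[-1]
    return "Valid domain" if suffix in VALID_DOMAINS else "Invalid domain"

# Made-up inputs: upper-case suffix, trailing space, no '@' at all, uncommon domain
for sample in ["user@example.COM", "user@example.com ", "com", "user@example.tv"]:
    print(repr(sample), "|", simple_check(sample), "|", strict_check(sample))
```

If a stricter rule along these lines were adopted, it could be wrapped with 'udf' and applied with 'withColumn' in the same way as 'checkEmail'.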

Thanks for reading!