# UDFs in BigQuery

This notebook shows how to use UDFs (user-defined functions) in Google BigQuery. A UDF takes column values from a table as input, applies an arbitrary transformation, and returns the result as a value. Datalab currently supports temporary UDFs, which exist only within the query that uses them.

You can read more about UDFs in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions).

## Scenario

In this notebook we are going to look at anonymized logs that originated in Google App Engine. These logs include the paths of requested URIs, which contain a number of query parameters of interest. To make the logs easier to use, we will create a UDF that extracts the query parameter values and puts them into a new column of type `ARRAY`.


## Examining the Data

We are going to look at logs from a week in October 2015. These logs were imported from Google App Engine, with a few relevant fields extracted and anonymized. There is a separate table for each day. Let's look at one of these tables, starting with its schema and then a few sample rows:

In [1]:
%load_ext google.datalab.kernel

In [2]:
%bq tables describe --name cloud-datalab-samples.appenginelogs.sample_logs_20151027

In [3]:
%bq sample --count 5 --table cloud-datalab-samples.appenginelogs.sample_logs_20151027

timestamp,method,status,latency,path
2015-10-27 07:32:33.178964,POST,404,0.003679,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 06:12:54.465872,POST,404,0.004054,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 05:56:20.662169,POST,404,0.003147,/log/error?project=9&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 17:22:35.503786,POST,404,0.003487,/log/error?project=17&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 02:26:26.651847,POST,404,0.003345,/log/error?project=22&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha


The table has five columns; the `path` column needs the most processing. Each URI has the form `/log/<event>?<params>`, where `<params>` is a list of `name=value` pairs separated by `&`, and each name is one of project, instance, user, page, path, version, or release. We are going to extract the values into a separate column of type `ARRAY`.
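To make the structure of the `path` column concrete, here is a minimal sketch that splits one of the sample paths into its event name and its named parameters. It is meant for local exploration only, assuming a Node.js runtime; the function name `describePath` is hypothetical and is not part of Datalab or BigQuery.

```javascript
// Hypothetical helper for local exploration: splits a logged request path
// into the event name and an object of query parameter names and values.
function describePath(path) {
  // Everything before the first '?' is the route; the rest is the query string.
  const [route, query = ''] = path.split('?');
  const params = {};
  for (const pair of query.split('&')) {
    if (pair === '') continue; // no query string, or a stray '&'
    const idx = pair.indexOf('=');
    const name = idx === -1 ? pair : pair.slice(0, idx);
    const value = idx === -1 ? '' : pair.slice(idx + 1);
    params[decodeURIComponent(name)] = decodeURIComponent(value);
  }
  // The event is the last segment of the route, e.g. 'error' in '/log/error'.
  return { event: route.split('/').pop(), params };
}

const sample =
  '/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha';
console.log(describePath(sample));
```

For the sample rows shown above, the event is `error` and all seven parameter names are present. The UDF built in the next section keeps only the values, in the order in which they appear in the path.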

## Creating and Testing the UDF

UDFs are functions written in one of the supported languages (currently SQL and JavaScript) that take one or more column values as input and produce a value after performing some computation. As the BigQuery UDF documentation explains, a UDF is defined with a `CREATE TEMP FUNCTION` statement that specifies its parameter names and types, return type, and language. Datalab simplifies this syntax: it uses jsdoc-style `// @param` and `// @returns` comments to declare the same information. It also exposes UDFs through a Python class and a magic command, which makes building queries simpler. Let's see how we can do this:

In [4]:
%%bq query
SELECT * FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

timestamp,method,status,latency,path
2015-10-27 07:32:33.178964,POST,404,0.003679,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 06:12:54.465872,POST,404,0.004054,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 05:56:20.662169,POST,404,0.003147,/log/error?project=9&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 17:22:35.503786,POST,404,0.003487,/log/error?project=17&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 02:26:26.651847,POST,404,0.003345,/log/error?project=22&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha


In [5]:
%%bq udf --name extract_params -l js
// A function to split a set of URL query parameters into an array
// @param path STRING
// @returns ARRAY<STRING>
var re = /[?&]([^=]*)=([^&]*)/g;
var result = [];
var match;
while ((match = re.exec(path)) != null) {
  result.push(decodeURIComponent(match[2]));
}
return result;
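The body of this UDF is plain JavaScript, so its logic can be checked outside BigQuery before it is used in a query. The sketch below assumes a Node.js runtime and wraps the same statements in a function named `extractParams` (a hypothetical name chosen for this example) so that they can be called directly on one of the sample paths.

```javascript
// Local copy of the UDF body, wrapped in a function so it can run in Node.js.
// The name extractParams is hypothetical; in Datalab the UDF is extract_params.
function extractParams(path) {
  // Each match is one 'name=value' pair introduced by '?' or '&'.
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = [];
  var match;
  while ((match = re.exec(path)) !== null) {
    // match[1] is the parameter name, match[2] is its value; keep only the value.
    result.push(decodeURIComponent(match[2]));
  }
  return result;
}

var samplePath =
  '/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha';
console.log(extractParams(samplePath));
// prints [ '2', '2', '2', 'master', '3', '0.1.1', 'alpha' ]
```

Because the regular expression carries the `g` flag, each call to `re.exec` resumes where the previous match ended, which is what lets the `while` loop walk through every parameter. A path without a query string produces an empty array.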

Now we can try calling the UDF. To do this, we define a query and make the UDF available to it with the `--udfs` flag. Inside the query, the UDF is called like any regular function, taking one or more columns as input.

In [6]:
%%bq query -n extract_params_query --udfs extract_params
SELECT *, extract_params(path) as parameters FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`
LIMIT 5

In [7]:
%bq execute -q extract_params_query

timestamp,method,status,latency,path,parameters
2015-10-27 07:32:33.178964,POST,404,0.003679,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha,"['2', '2', '2', 'master', '3', '0.1.1', 'alpha']"
2015-10-27 06:12:54.465872,POST,404,0.004054,/log/error?project=2&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha,"['2', '2', '2', 'master', '3', '0.1.1', 'alpha']"
2015-10-27 05:56:20.662169,POST,404,0.003147,/log/error?project=9&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha,"['9', '2', '2', 'master', '3', '0.1.1', 'alpha']"
2015-10-27 17:22:35.503786,POST,404,0.003487,/log/error?project=17&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha,"['17', '2', '2', 'master', '3', '0.1.1', 'alpha']"
2015-10-27 02:26:26.651847,POST,404,0.003345,/log/error?project=22&instance=2&user=2&page=master&path=3&version=0.1.1&release=alpha,"['22', '2', '2', 'master', '3', '0.1.1', 'alpha']"


To see the actual expanded SQL, including the UDF defined above, we can inspect the query object by evaluating its name:

In [8]:
extract_params_query
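The output of this cell is not reproduced here. As a rough guide to what an expanded query contains, the sketch below assembles a BigQuery standard SQL `CREATE TEMP FUNCTION` statement from the same pieces of information the `%%bq udf` cell declares (name, parameter, return type, language, body) and appends the query text. The helper `buildTempFunctionSql` is hypothetical and runs under Node.js; the exact text that Datalab generates may differ in layout and detail.

```javascript
// Hypothetical helper: assembles a CREATE TEMP FUNCTION statement for a
// JavaScript UDF. This illustrates the standard SQL form only; it is not
// Datalab's own code generator.
function buildTempFunctionSql(name, params, returnType, body) {
  const signature = params.map((p) => `${p.name} ${p.type}`).join(', ');
  return [
    `CREATE TEMP FUNCTION ${name}(${signature})`,
    `RETURNS ${returnType}`,
    'LANGUAGE js AS """',
    body,
    '""";',
  ].join('\n');
}

// The UDF body from the %%bq udf cell, without the jsdoc-style comments.
const udfBody = [
  'var re = /[?&]([^=]*)=([^&]*)/g;',
  'var result = [];',
  'var match;',
  'while ((match = re.exec(path)) != null) {',
  '  result.push(decodeURIComponent(match[2]));',
  '}',
  'return result;',
].join('\n');

const queryText = [
  'SELECT *, extract_params(path) AS parameters',
  'FROM `cloud-datalab-samples.appenginelogs.sample_logs_20151027`',
  'LIMIT 5',
].join('\n');

const expandedSql =
  buildTempFunctionSql(
    'extract_params',
    [{ name: 'path', type: 'STRING' }],
    'ARRAY<STRING>',
    udfBody
  ) + '\n' + queryText;

console.log(expandedSql);
```

The `// @param path STRING` and `// @returns ARRAY<STRING>` comments in the UDF cell correspond to the parameter list and the `RETURNS` clause of the statement, and `-l js` corresponds to `LANGUAGE js`.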

## Next Steps

You can learn how to test your UDF in the notebook by following the [UDF Testing in the Notebook](./notebooks/datalab/tutorials/BigQuery/UDF%20Testing%20in%20the%20Notebook.ipynb) tutorial. If you have code that you regularly use in your UDFs, you can factor it out and put it in Google Cloud Storage, then import it. This technique is covered in the [UDFs using Code in Cloud Storage](./notebooks/datalab/tutorials/BigQuery/UDFs%20using%20Code%20in%20Cloud%20Storage.ipynb) tutorial.