Initial commit.
bmc committed Feb 27, 2016
0 parents commit 1e0de6d
Showing 14 changed files with 738 additions and 0 deletions.
17 changes: 17 additions & 0 deletions .editorconfig
@@ -0,0 +1,17 @@
# EditorConfig helps developers define and maintain consistent
# coding styles between different editors and IDEs
# editorconfig.org

root = true


[*]
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
indent_style = space
indent_size = 2

[*.{diff,md}]
trim_trailing_whitespace = false
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
/RUNNING_PID
/logs/
/project/*-shim.sbt
/project/project/
/project/target/
/target/
/.idea*
/*.iml
/metastore_db
/derby.log
36 changes: 36 additions & 0 deletions LICENSE.md
@@ -0,0 +1,36 @@
License
=======

This software is released under a BSD license, adapted from
<http://opensource.org/licenses/BSD-3-Clause>

---

Copyright &copy; 2016, Brian M. Clapper.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the names "clapper.org" nor the names of any contributors may
be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
120 changes: 120 additions & 0 deletions README.md
@@ -0,0 +1,120 @@
# Sample Hive UDF project

## Introduction

This project is an example containing several
[Hive User Defined Functions][] (UDFs). It's intended to demonstrate how
to build a Hive UDF in Scala and use it within [Apache Spark][].

## Why use a Hive UDF?

One especially good use of Hive UDFs is with Python and DataFrames.
Native Spark UDFs written in Python are slow, because they have to be
executed in a Python process, rather than a JVM-based Spark Executor.
For a Spark Executor to run a Python UDF, it must:

* send data from the partition over to a Python process associated with
the Executor, and
* wait for the Python process to deserialize the data, run the UDF on it,
reserialize the data, and send it back.

By contrast, a Hive UDF, whether written in Scala or Java, can be executed
in the Executor JVM, _even if the DataFrame logic is in Python_.

There's really only one drawback: a Hive UDF _must_ be invoked via SQL.
You can't call it as a function from the DataFrame API.
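To illustrate the approach, a Hive UDF like the `to_hex` function registered later in this README is just a class extending Hive's `UDF` base class with an `evaluate` method. The sketch below is a minimal illustration and is not necessarily identical to this repo's actual source:

```scala
import org.apache.hadoop.hive.ql.exec.UDF

// Sketch of a UDF along the lines of com.ardentex.spark.hiveudf.ToHex:
// converts a long integer to a hex string. Hive locates the evaluate()
// method by reflection, so no interface method needs to be overridden.
class ToHex extends UDF {
  def evaluate(value: java.lang.Long): String = {
    if (value == null) null
    else f"0x${value.longValue}%x"
  }
}
```

Because `evaluate` runs directly inside the Executor JVM, calling it from a Python DataFrame query involves no Python serialization round-trip.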

## Building

This project builds with [SBT][], but you don't have to download SBT. Just use
the `activator` script in the root directory. To build the jar file, use
this command:

```
$ ./activator package
```

That command will download the dependencies (if they haven't already been
downloaded), compile the code, run the unit tests, and create a jar file
in `target/scala-2.10`.

### Building with Maven

Honestly, I'm not a big fan of Maven; I prefer SBT or Gradle. But, if you
prefer Maven (or are simply required to use it for your project), you _can_
build this project with Maven. I've included a `pom.xml`. Just run:

```
$ mvn package
```

to build `target/hiveudf-0.0.1.jar`. Be sure to change the jar paths,
below, if you use Maven to build the jar.

## Running in Spark

The following Python code demonstrates the UDFs in this package and assumes
that you've packaged the code into `target/scala-2.10/hiveudf_2.10-0.0.1.jar`.
These commands assume Spark local mode, but they should also work fine within
a cluster manager like Spark Standalone or YARN.

First, fire up PySpark:

```
$ pyspark --jars target/scala-2.10/hiveudf_2.10-0.0.1.jar
```

At the PySpark prompt, enter the following. (The IPython prompts are shown
for context; don't type them.)

```
In [1]: from datetime import datetime
In [2]: from collections import namedtuple
In [3]: Person = namedtuple('Person', ('first_name', 'last_name', 'birth_date', 'salary'))
In [4]: fmt = "%Y-%m-%d"
In [5]: people = [
...: Person('Joe', 'Smith', datetime.strptime("1993-10-20", fmt), 70000L),
...: Person('Jenny', 'Harmon', datetime.strptime("1987-08-02", fmt), 94000L)
...: ]
In [6]: df = sc.parallelize(people).toDF()
In [7]: sqlContext.sql("CREATE TEMPORARY FUNCTION to_hex AS 'com.ardentex.spark.hiveudf.ToHex'")
In [8]: sqlContext.sql("CREATE TEMPORARY FUNCTION datestring AS 'com.ardentex.spark.hiveudf.FormatTimestamp'")
In [9]: df.registerTempTable("people")
In [10]: df2 = sqlContext.sql("SELECT first_name, last_name, datestring(birth_date, 'MMMM dd, yyyy') as birth_date2, to_hex(salary) as hex_salary FROM people")
```

Then, take a look at the second DataFrame:

```
In [11]: df2.show()
+----------+---------+----------------+----------+
|first_name|last_name| birth_date2|hex_salary|
+----------+---------+----------------+----------+
| Joe| Smith|October 20, 1993| 0x11170|
| Jenny| Harmon| August 02, 1987| 0x16f30|
+----------+---------+----------------+----------+
```
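For comparison, the equivalent session from the Scala `spark-shell` might look like the following (a sketch, assuming the same jar is passed via `--jars` and the same `people` table has been registered):

```scala
// Started with: spark-shell --jars target/scala-2.10/hiveudf_2.10-0.0.1.jar
sqlContext.sql(
  "CREATE TEMPORARY FUNCTION to_hex AS 'com.ardentex.spark.hiveudf.ToHex'"
)

// Assumes a "people" temporary table registered as in the PySpark session.
val df2 = sqlContext.sql(
  "SELECT first_name, last_name, to_hex(salary) AS hex_salary FROM people"
)
df2.show()
```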

## Converting to Java

The Scala code for the UDFs is relatively straightforward and easy to
read. If you prefer to write your UDFs in Java, it shouldn't be that
difficult for you to write Java versions.


[Hive User Defined Functions]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[Apache Spark]: http://spark.apache.org
[SBT]: http://scala-sbt.org