Tensorflow wrapper for DataFrames on Apache Spark
Scala Python Protocol Buffer Shell Makefile Batchfile
Latest commit f5bc2c6 Jan 18, 2017 @thunterdb thunterdb committed on GitHub Merge pull request #53 from jaceklaskowski/gitignore-idea
Remove already-gitignored .idea directory
Permalink
Failed to load latest commit information.
bin permissions Mar 17, 2016
build moving files outside private repo Mar 10, 2016
project merge with master Jan 19, 2017
python travis Jul 15, 2016
src Spark 2.0.0 + ScalaTest 3.0.0 + updates sbt plugins Aug 4, 2016
.gitignore cleanup of the build script May 17, 2016
.travis.yml scala 2.10 Jul 16, 2016
LICENSE Initial commit Mar 4, 2016
README.md build icon Aug 11, 2016
build.sbt merge with master Jan 19, 2017

README.md

build

TensorFrames

Experimental TensorFlow binding for Scala and Apache Spark.

TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate Apache Spark's DataFrames with TensorFlow programs.

This package is experimental and is provided as a technical preview only. While the interfaces are all implemented and working, there are still some areas of low performance.

Supported platforms:

This package only officially supports linux 64bit platforms as a target. Contributions are welcome for other platforms.

See the file project/Dependencies.scala for adding your own platform.

Officially supported Spark versions: 1.6 and Scala version 2.10. It is also known to work with Spark 2.0 (pre-released) and Scala 2.11.

See the user guide for extensive information about the API.

TensorFrames is available as a Spark package.

Requirements

  • A working version of Apache Spark (1.6 or greater)

  • java version >= 7

  • (Optional) python >= 2.7 if you want to use the python interface. Python 3+ should work but it has not been tested.

  • (Optional) the python TensorFlow package if you want to use the python interface. See the official instructions on how to get the latest release of TensorFlow.

Additionally, if you want to run unit tests for python, you need the following dependencies:

  • nose >= 1.3

How to run in python

Assuming that SPARK_HOME is set, you can use PySpark like any other Spark package.

$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.3-s_2.10

Here is a small program that uses Tensorflow to add 3 to an existing column.

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

data = [Row(x=float(x)) for x in range(10)]
df = sqlContext.createDataFrame(data)
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')
    # The resulting dataframe
    df2 = tfs.map_blocks(z, df)

# The transform is lazy as for most DataFrame operations. This will trigger it:
df2.collect()

# Notice that z is an extra column next to x

# [Row(z=3.0, x=0.0),
#  Row(z=4.0, x=1.0),
#  Row(z=5.0, x=2.0),
#  Row(z=6.0, x=3.0),
#  Row(z=7.0, x=4.0),
#  Row(z=8.0, x=5.0),
#  Row(z=9.0, x=6.0),
#  Row(z=10.0, x=7.0),
#  Row(z=11.0, x=8.0),
#  Row(z=12.0, x=9.0)]

The second example shows the block-wise reducing operations: we compute the sum of a field containing vectors of integers, working with blocks of rows for more efficient processing.

# Build a DataFrame of vectors
data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = sqlContext.createDataFrame(data)
# Because the dataframe contains vectors, we need to analyze it first to find the
# dimensions of the vectors.
df2 = tfs.analyze(df)

# The information gathered by TF can be printed to check the content:
tfs.print_schema(df2)
# TF has inferred that y contains vectors of size 2
# root
#  |-- y: array (nullable = false) DoubleType[?,2]

# Let's use the analyzed dataframe to compute the sum and the elementwise minimum 
# of all the vectors:
# First, let's make a copy of the 'y' column. This will be very cheap in Spark 2.0+
df3 = df2.select(df2.y, df2.y.alias("z"))
with tf.Graph().as_default() as g:
    # The placeholders. Note the special name that end with '_input':
    y_input = tfs.block(df3, 'y', tf_name="y_input")
    z_input = tfs.block(df3, 'z', tf_name="z_input")
    y = tf.reduce_sum(y_input, [0], name='y')
    z = tf.reduce_min(z_input, [0], name='z')
    # The resulting dataframe
    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)

# The final results are numpy arrays:
print data_sum
# [45.0, -45.0]
print data_min
# [0.0, -9.0]

Notes

Note the scoping of the graphs above. This is important because TensorFrames finds which DataFrame column to feed to TensorFrames based on the placeholders of the graph. Also, it is good practice to keep small graphs when sending them to Spark.

For small tensors (scalars and vectors), TensorFrames usually infers the shapes of the tensors without requiring a preliminary analysis. If it cannot do it, an error message will indicate that you need to run the DataFrame through tfs.analyze() first.

Look at the python documentation of the TensorFrames package to see what methods are available.

How to run in Scala

The scala support is a bit more limited than python. In scala, operations can be loaded from an existing graph defined in the ProtocolBuffers format, or using a simple scala DSL. The Scala DSL only features a subset of TensorFlow transforms. It is very easy to extend though, so other transforms will be added without much effort in the future.

You simply use the published package:

$SPARK_HOME/bin/spark-shell --packages databricks:tensorframes:0.2.3

Here is the same program as before:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._

val df = sqlContext.createDataFrame(Seq(1.0->1.1, 2.0->2.2)).toDF("a", "b")

// As in Python, scoping is recommended to prevent name collisions.
val df2 = tf.withGraph {
    val a = df.block("a")
    // Unlike python, the scala syntax is more flexible:
    val out = a + 3.0 named "out"
    // The 'mapBlocks' method is added using implicits to dataframes.
    df.mapBlocks(out).select("a", "out")
}

// The transform is all lazy at this point, let's execute it with collect:
df2.collect()
// res0: Array[org.apache.spark.sql.Row] = Array([1.0,1.1,2.1], [2.0,2.2,4.2])   

How to compile and install for developers

The C++ bindings are already compiled though, so you should only have to deal with compiling the scala code. The recommended procedure is to use the assembly:

build/sbt assembly
# Builds the spack package:
build/sbt tfPackage

Assuming that SPARK_HOME is set and that you are in the root directory of the project:

$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/tensorframes-assembly-0.2.3.jar

If you want to run the python version:

PYTHONPATH=$PWD/target/scala-2.10/tensorframes-assembly-0.2.3.jar IPYTHON=1 \
$SPARK_HOME/bin/pyspark --jars $PWD/target/scala-2.10/tensorframes-assembly-0.2.3.jar

How to change the version of TensorFlow

By default, TensorFrames features a relatively stable version of TensorFlow that is optimized for build sizes and for CPUs. If you want to change the internal version being used, you should check the tensorframes-artifacts project. That project contains scripts to build the proper jar files.

It is also possible to drop in some precompiled versions of javacpp and tensorflow into the lib directory; they will be picked up in the compilation and the assembly. Ask the developers if you have more questions.

Acknowledgements

This project builds on the great javacpp project, that implements the low-level bindings between TensorFlow and the Java virtual machine.

Many thanks to Google for the release of TensorFlow.