Skip to content

TensorFrames user guide

Timothy Hunter edited this page Mar 16, 2016 · 24 revisions

TensorFrames user guide

TensorFrames (TensorFlow on Spark Dataframes) lets you manipulate Spark's DataFrames with TensorFlow programs.

This package is highly experimental and is provided as a technical preview only.

Officially supported Spark versions: 1.6+

This user guide helps you run some simple examples with the python and scala interface.

Core API

The core API provides some primitives to express transformation of DataFrames using TensorFlow programs. These programs can be written in python using the official TensorFlow API, in Scala using TensorFrames, or directly by passing a protocol buffer description of the operations graph.

The most simple way of using TensorFrames in a python program is to import TensorFlow and TensorFrames into PySpark:

import tensorflow as tf
import tensorframes as tfs

Additionally, this guide will make use of the following imports from PySpark:

from pyspark.sql import Row
from pyspark.sql.functions import *

Basic concepts

At its core, TensorFlow expresses operations on tensors: homogeneous data structures that consist in an array and a shape. They can be interpreted as a generalization of vectors and matrices:

  • a scalar (real or integer) is a tensor of dimension zero,
  • a vector is a tensor of one dimension,
  • a matrix is a tensor of two dimensions and so on. Tensors of an order higher than four are usually less common and they are not well-supported right now in TensorFrames. For more information about tensors, the TensorFlow tutorial contains extensive documentation.

Blocks In Spark SQL, tensors are represented as single values, as arrays or as arrays of arrays (or arrays of arrays of arrays, etc.). Consider the following table with some reals x and some vectors y.

x y
1 [1.1 1.2]
2 [2.1 2.2]
3 [3.1 3.2]

Spark SQL may decide to chunk this table into several pieces to distribute them across a cluster. For example, it may split the previous table into a first block:

x y
1 [1.1 1.2]

and a second block:

x y
2 [2.1 2.2]
3 [3.1 3.2]

TensorFlow is optimized for operations that manipulate large vectors of numbers at a time, and TensorFrames provides most operations in two forms: a row-based version and a block-based version. In the row based version, TensorFrames will work on one row after the other. In our example, it will process the table as follows:

> process_row: x = 1, y = [1.1 1.2]
> process_row: x = 2, y = [2.1 2.2]
> process_row: x = 3, y = [3.1 3.2]

In the block-based version, TensorFrames will consider each column in a block as a single tensor. In our example, a block transform will process the table as follows:

process_block: x = [1], y = [1.1 1.2] process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]

The block transforms are usually more efficient: there is less overhead in calling TensorFlow, and they can manipulate more data at once.

Mapping

Reducing

Aggregation

Using a different version of TensorFlow

Clone this wiki locally