## TensorFrames: Quick Start
This notebook provides a TensorFrames Quick Start using Databricks Community Edition.  You can run this from the `pyspark` shell like any other Spark package:
`$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.8-s_2.11`

For more information, refer to the [TensorFrames](https://github.com/databricks/tensorframes) GitHub repo and the [TensorFrames User Guide](https://github.com/databricks/tensorframes/wiki/TensorFrames-user-guide).

## Cluster set-up

TensorFrames is available as a Spark Package. To use it on your cluster, create a new library with the Source option "Maven Coordinate", using "Search Spark Packages and Maven Central" to find "tensorframes". Then [attach the library to a cluster](https://docs.databricks.com/user-guide/libraries.html). To run this notebook, create and attach the following libraries:
* via PyPI: tensorflow
* via Spark Packages: tensorframes

The latest version of TensorFrames is compatible with Spark versions 2.0 or higher and works with any instance type (CPU or GPU).

## TensorFlow Quick Start
Before we get into TensorFrames, let's review the concepts of *tensors*, *operations*, and *data flow graphs*.

TensorFlow performs numerical computation using data flow graphs. The nodes (or vertices) of the graph represent mathematical operations, while the edges represent the multidimensional arrays (that is, tensors) passed between those operations. In the following diagram, t1 is a 2x3 matrix and t2 is a 3x2 matrix; these are the tensors (the edges of the graph). The node is the mathematical operation, represented as op1.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-matmul_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

In this example, op1 is a matrix multiplication, as the next diagram shows, though it could be any of the many mathematical operations available in TensorFlow.

Put together, a numerical computation within the graph is a flow of multidimensional arrays (tensors) between mathematical operations (nodes): the flow of tensors, or *TensorFlow*.
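
To make the graph idea concrete, here is a minimal sketch of a data flow graph in plain Python. It is not TensorFlow's API: the `Placeholder`, `MatMul`, and `evaluate` names are hypothetical, and the matrix values are chosen only for illustration. The point is that building the graph and running it are separate steps.

```python
# Toy data flow graph in plain Python (illustrative only, not TensorFlow's API).

class Placeholder(object):
    """A graph input whose value is supplied when the graph is run."""
    def __init__(self, name):
        self.name = name


class MatMul(object):
    """An operation node: its two incoming edges carry the input tensors."""
    def __init__(self, left, right):
        self.left = left
        self.right = right


def evaluate(node, feed):
    """Evaluate `node`, pulling placeholder values from the `feed` dict."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    a = evaluate(node.left, feed)
    b = evaluate(node.right, feed)
    # Row-by-column products: (n x k) times (k x m) gives (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Build the graph: two tensors flowing into one matrix multiplication op.
t1 = Placeholder("t1")
t2 = Placeholder("t2")
op1 = MatMul(t1, t2)

# Run it with a 2x3 and a 3x2 matrix (values chosen for illustration).
result = evaluate(op1, {"t1": [[1, 2, 3], [4, 5, 6]],
                        "t2": [[1, 0], [0, 1], [1, 1]]})
print(result)  # [[4, 5], [10, 11]]
```

Nothing is computed when `op1` is constructed; the multiplication only happens inside `evaluate`, which loosely mirrors how a TensorFlow session runs a graph.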

### Matrix Multiplication using Placeholders

The next few steps will involve running a *TensorFlow* data flow graph involving matrix multiplication.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-matrix-multiplication_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

#### Creating Placeholders
Create placeholders to define our tensors (in this case t1 and t2) as well as the operation (`tp`, which plays the role of op1).

In [7]:
# Import TensorFlow
import tensorflow as tf

# Set up placeholders for your model
#   t1: placeholder tensor
#   t2: placeholder tensor
t1 = tf.placeholder(tf.float32)
t2 = tf.placeholder(tf.float32)

# tp: matrix multiplication (t1 x t2)
tp = tf.matmul(t1, t2)


#### Running the model
Within a TensorFlow graph, recall that the nodes are called operations (or *ops*). Here the matrix multiplication is the op, while the two matrices (m1, m2) are the tensors (typed multi-dimensional arrays). An op takes zero or more tensors as input, performs a computation, and produces zero or more tensors as output. In Python, these outputs are returned as [NumPy](http://www.numpy.org/) `ndarray` objects; in C and C++, as `tensorflow::Tensor` instances.

In [9]:
# Define input matrices
m1 = [[3., 2., 1.]]
m2 = [[-1.], [2.], [1.]]

# Execute the graph within a session
with tf.Session() as s:
  print(s.run([tp], feed_dict={t1:m1, t2:m2}))

#### Why Placeholders?
We use placeholders so that we can execute the same operation with different inputs.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-matrix-multiplication-2_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

For example, let's re-run this using m1 (1 x 4) and m2 (4 x 1) matrices.

In [11]:
# setup input matrices
m1 = [[3., 2., 1., 0.]]
m2 = [[-5.], [-4.], [-3.], [-2.]]
# Execute the graph within a session
with tf.Session() as s:
  print(s.run([tp], feed_dict={t1:m1, t2:m2}))
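
As a sanity check on what the two runs should produce, the same arithmetic can be done by hand. Both runs multiply a 1 x N row by an N x 1 column, so each result is a 1 x 1 matrix holding a single dot product. The `dot` helper below is a hypothetical plain-Python stand-in, not part of TensorFlow.

```python
def dot(row, column):
    """Dot product of a row vector and a column vector (plain lists)."""
    return sum(r * c for r, c in zip(row, column))


# First run: m1 is 1x3 and m2 is 3x1.
first = dot([3., 2., 1.], [-1., 2., 1.])
# Second run: m1 is 1x4 and m2 is 4x1.
second = dot([3., 2., 1., 0.], [-5., -4., -3., -2.])

print(first)   # 2.0
print(second)  # -26.0
```

The graph definition (`tp`) is identical in both runs; only the values fed to the placeholders change.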

## TensorFlow, Spark, TensorFrames...oh my!

With TensorFrames, one can manipulate Spark DataFrames with TensorFlow programs. Referring to the tensor diagrams in the previous section, we have updated the figure to show how Spark DataFrames work with TensorFlow, as shown in the following diagram.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-TensorFrames-Diagram_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

TensorFrames provides a bridge between Spark DataFrames and TensorFlow. This allows you to take your DataFrames and apply them as input into your TensorFlow computation graph. TensorFrames also allows you to take the TensorFlow computation graph output and push it back into DataFrames so you can continue your downstream Spark processing.
Common usage scenarios for TensorFrames typically include:
* Utilize TensorFlow with your data
* Parallel training to determine optimal hyperparameters

The following is a simple TensorFrames program in which the `op` performs a simple addition. The original source code can be found in the [databricks/tensorframes](https://github.com/databricks/tensorframes) GitHub repo, in the [How to Run in Python](https://github.com/databricks/tensorframes#how-to-run-in-python) section of the TensorFrames README.


### Use TensorFlow to add 3 to an existing column
First, import TensorFlow, TensorFrames, and `pyspark.sql.Row`, then create a DataFrame from a list of float-valued rows.

In [15]:
# Import TensorFlow, TensorFrames, and Row
#import tensorflow as tf  # already imported above
import tensorframes as tfs
from pyspark.sql import Row

# Create a list of float-valued Rows (named `rdd` here) and convert it into DataFrame `df`
rdd = [Row(x=float(x)) for x in range(10)]
df = sqlContext.createDataFrame(rdd)

View the `df` DataFrame generated from the floats.

In [17]:
df.show()

#### Execute the Tensor Graph
As noted above, this tensor graph adds 3 to the tensor built from the `x` column of the `df` DataFrame.
* `x` uses `tfs.block`, which builds a block placeholder based on the content of a column in a DataFrame.
* `z` is the output tensor from the TensorFlow add method (`tf.add`).
* `df2` is the new DataFrame, which extends the `df` DataFrame with an extra column holding the `z` tensor, computed block by block.

In [19]:
# Run the TensorFlow program:
#   The `op` performs the addition (i.e. `x` + `3`)
#   The result is placed back into a DataFrame
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")
    
    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')
    
    # The resulting dataframe
    # `map_blocks` transforms a DataFrame into another DataFrame block by block
    df2 = tfs.map_blocks(z, df)

# Note that `z` is the tensor output from the `tf.add` operation
print(z)

#### Review the output dataframe
With the tensor added as column `z` to the `df` DataFrame, you now have the `df2` DataFrame, which lets you continue working with your data as a Spark DataFrame.

In [21]:
df2.show()
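
The block-by-block idea behind `map_blocks` can be sketched without Spark. The code below is an assumption-laden illustration, not TensorFrames' implementation: `split_into_blocks`, `map_blocks_sketch`, and the `block_size` parameter are hypothetical names, and rows are modeled as plain dictionaries.

```python
def split_into_blocks(rows, block_size):
    """Yield consecutive slices of `rows`, each with at most `block_size` rows."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def map_blocks_sketch(rows, column, new_column, block_fn, block_size=4):
    """Apply `block_fn` to one block of column values at a time and
    return new rows that carry the result as an extra column."""
    out = []
    for block in split_into_blocks(rows, block_size):
        values = [row[column] for row in block]
        results = block_fn(values)  # one call per block, not per row
        for row, result in zip(block, results):
            new_row = dict(row)
            new_row[new_column] = result
            out.append(new_row)
    return out


rows = [{"x": float(x)} for x in range(10)]
rows2 = map_blocks_sketch(rows, "x", "z", lambda xs: [x + 3 for x in xs])
print(rows2[:3])
```

Calling the function once per block rather than once per row is what lets a vectorized engine such as TensorFlow amortize its per-call overhead.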

### Block-wise reducing operations example
In this next section, we will show how to work with block-wise reducing operations. Specifically, we will compute the `sum` and `min` of a field of vectors, working with blocks of rows for more efficient processing.



#### Building a DataFrame of vectors
First, we will create a one-column DataFrame of vectors.

In [24]:
# Build a DataFrame of vectors
data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = sqlContext.createDataFrame(data)
df.show()

### Analyze the DataFrame 
We need to analyze the DataFrame to determine its shape (that is, the dimensions of the vectors). For example, below we use the `tfs.print_schema` command on the `df` DataFrame.

In [26]:
# Print the information gathered by TensorFlow to check the content of the DataFrame
tfs.print_schema(df)

Notice the `double[?,?]`, which means that TensorFrames does not yet know the dimensions of the vectors.

In [28]:
# Because the dataframe contains vectors, we need to analyze it first to find the
# dimensions of the vectors.
df2 = tfs.analyze(df)

# The information gathered by TF can be printed to check the content:
tfs.print_schema(df2)

#### Analyze This
After the analysis that produced the `df2` DataFrame, TensorFrames has inferred that `y` contains vectors of size 2. For small tensors (scalars and vectors), TensorFrames usually infers the shapes of the tensors without requiring a preliminary analysis. If it cannot, an error message will indicate that you need to run the DataFrame through `tfs.analyze()` first.
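
Conceptually, this kind of analysis amounts to scanning the data and checking whether every vector has the same length. The sketch below illustrates that idea only; `infer_vector_shape` is a hypothetical helper and says nothing about how `tfs.analyze` is actually implemented.

```python
def infer_vector_shape(vectors):
    """Return (None, n) when every vector has length n, else (None, None).

    None stands for an unknown dimension, like `?` in the printed schema;
    the row count is left unknown in this sketch.
    """
    lengths = set(len(v) for v in vectors)
    if len(lengths) == 1:
        return (None, lengths.pop())
    return (None, None)


vectors = [[float(y), float(-y)] for y in range(10)]
print(infer_vector_shape(vectors))              # (None, 2)
print(infer_vector_shape([[1.0], [1.0, 2.0]]))  # (None, None)
```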

### Compute Elementwise Sum and Min of all vectors
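Now, let's use the analyzed DataFrame to compute the sum and the elementwise minimum of all the vectors using `tf.reduce_sum` and `tf.reduce_min`. The graph below reduces along axis 0, so each result is a vector with one entry per vector position. When called without an axis argument, as in the following examples, both functions *fully reduce* a tensor to one element: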
* [`tf.reduce_sum`](https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_sum): compute the sum of elements across dimensions of a tensor, e.g. if `x = [[3, 2, 1], [-1, 2, 1]]` then `tf.reduce_sum(x) ==> 8`.
* [`tf.reduce_min`](https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_min): compute the minimum of elements across dimensions of a tensor, e.g. if `x = [[3, 2, 1], [-1, 2, 1]]` then `tf.reduce_min(x) ==> -1`.

![](https://github.com/dennyglee/databricks/blob/master/images/Element%20Wise%20Diagrams.png?raw=true)

In [31]:
# Note: First, let's make a copy of the 'y' column. This will be very cheap in Spark 2.0+
df3 = df2.select(df2.y, df2.y.alias("z"))

# Execute the Tensor Graph
with tf.Graph().as_default() as g:
    # The placeholders. Note the special names that end with '_input':
    y_input = tfs.block(df3, 'y', tf_name="y_input")
    z_input = tfs.block(df3, 'z', tf_name="z_input")
    
    # Perform elementwise sum and minimum 
    y = tf.reduce_sum(y_input, [0], name='y')
    z = tf.reduce_min(z_input, [0], name='z')
    
    # The resulting dataframe
    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)

In [32]:
# The final results are numpy arrays:
print("Elementwise sum: %s and minimum: %s " % (data_sum, data_min))
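
The values to expect from this cell can be worked out with a plain-Python sketch of a block-wise reduction: reduce each block along axis 0, then reduce the partial results the same way. This is an illustration under assumptions, not TensorFrames' implementation; `reduce_blocks_sketch` and `block_size` are hypothetical names.

```python
from functools import reduce


def reduce_blocks_sketch(vectors, combine, block_size=4):
    """Reduce each block of vectors element-wise, then reduce the partials."""
    def reduce_block(block):
        # zip(*block) groups the values by vector position (axis 0 reduction).
        return [reduce(combine, column) for column in zip(*block)]

    partials = [reduce_block(vectors[i:i + block_size])
                for i in range(0, len(vectors), block_size)]
    # Each partial result has the same shape as one row, so the same
    # reduction can merge them.
    return reduce_block(partials)


vectors = [[float(y), float(-y)] for y in range(10)]
data_sum = reduce_blocks_sketch(vectors, lambda a, b: a + b)
data_min = reduce_blocks_sketch(vectors, min)
print(data_sum)  # [45.0, -45.0]
print(data_min)  # [0.0, -9.0]
```

This two-stage scheme only works because sum and minimum are associative: the result does not depend on how the rows are split into blocks.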

#### Notes:
* The scoping of the graphs above is important because TensorFrames decides which DataFrame column to feed to TensorFlow based on the placeholders of the graph.
* It is good practice to keep graphs small when sending them to Spark.

In [34]:
# Full sum of all elements (tf.reduce_sum with no axis argument)
x = [[3, 2, 1], [-1, 2, 1]]
t1 = tf.placeholder(tf.float32)
tp = tf.reduce_sum(t1)

# Execute the graph within a session
with tf.Session() as s:
  print(s.run([tp], feed_dict={t1:x}))

In [35]:
# Full minimum of all elements (tf.reduce_min with no axis argument)
x = [[3, 2, 1], [-1, 2, 1]]
t1 = tf.placeholder(tf.float32)
# Alternative: per-row minimum along axis 1
#tp = tf.reduce_min(t1, 1)
tp = tf.reduce_min(t1)

# Execute the graph within a session
with tf.Session() as s:
  print(s.run([tp], feed_dict={t1:x}))
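
The difference between a full reduction, the axis-0 reduction used in the TensorFrames graph (`[0]`), and the axis-1 variant (`tf.reduce_min(t1, 1)`) can be checked with plain Python lists. This sketch deliberately avoids TensorFlow so that it runs anywhere; the variable names are illustrative.

```python
x = [[3, 2, 1], [-1, 2, 1]]

# Full reduction (no axis): every element collapses into one scalar.
total = sum(sum(row) for row in x)
smallest = min(min(row) for row in x)

# Axis 0: reduce down the rows, leaving one value per column.
sum_axis0 = [sum(col) for col in zip(*x)]
min_axis0 = [min(col) for col in zip(*x)]

# Axis 1: reduce across each row, leaving one value per row.
min_axis1 = [min(row) for row in x]

print(total, smallest)        # 8 -1
print(sum_axis0, min_axis0)   # [2, 4, 2] [-1, 2, 1]
print(min_axis1)              # [1, -1]
```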