Adding TensorFrames Quick Start notebook #117

Open: wants to merge 1 commit into `master`
1 change: 1 addition & 0 deletions notebooks/TensorFrames- Quick Start.ipynb
{"cells":[{"cell_type":"markdown","source":["## TensorFrames: Quick Start\nThis notebook provides a TensorFrames Quick Start using Databricks Community Edition. You can run this from the `pyspark` shell like any other Spark package:\n`$SPARK_HOME/bin/pyspark --packages databricks:tensorframes:0.2.8-s_2.11`\n\nFor more information, please refer to the Sources: [Tensorframes](https://github.com/databricks/tensorframes) github repo and the [TensorFrames User Guide](https://github.com/databricks/tensorframes/wiki/TensorFrames-user-guide)"],"metadata":{}},{"cell_type":"markdown","source":["## Cluster set-up\n\nTensorFrames is available as a Spark Package. To use it on your cluster, create a new library with the Source option \"Maven Coordinate\", using \"Search Spark Packages and Maven Central\" to find \"spark-deep-learning\". Then [attach the library to a cluster](https://docs.databricks.com/user-guide/libraries.html). To run this notebook, also create and attach the following libraries: \n* via PyPI: tensorflow\n* via Spark Packages: tensorframes\n\nThe latest version of TensorFrames is compatible with Spark versions 2.0 or higher and works with any instance type (CPU or GPU)."],"metadata":{}},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["## TensorFlow Quick Start\nBefore we get into TensorFrames, let's review the concept of *tensors*, *operations*, and *data flow graph*.\n\nTensorFlow performs numerical computation using data flow graphs. When thinking about graph, the node (or vertices) of this graph represent mathematical operations while the graph edges represent the multidimensional arrays (that is, tensors) that communicate between the different nodes (that is, mathematical operations). Referring to the following diagram, t1 is a 2x3 matrix while t2 is a 3x2 matrix; these are the tensors (or edges of the tensor graph). The node is the mathematical operations represented as op1.\n\n![](https://github.com/dennyglee/databricks/blob/master/images/TF-matmul_500w.png?raw=true)\n\n*Source:* [Learning PySpark](https://learningpyspark.com)\n\nIn this example, op1 is a matrix multiplication operation represented by the following diagram, though this could be any of the many mathematics operations available in TensorFlow. 
### Matrix Multiplication using Placeholders

The next few steps run a *TensorFlow* data flow graph that performs matrix multiplication.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-matrix-multiplication_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

#### Creating Placeholders

Create placeholders to define our tensors (in this case t1 and t2) as well as the operation (tp).

```python
# Import TensorFlow
import tensorflow as tf

# Set up placeholders for your model
#   t1: placeholder tensor
#   t2: placeholder tensor
t1 = tf.placeholder(tf.float32)
t2 = tf.placeholder(tf.float32)

# tp: matrix multiplication (t1 x t2)
tp = tf.matmul(t1, t2)
```

#### Running the model

Within the context of a TensorFlow graph, recall that the nodes in the graph are called operations (or *ops*). The matrix multiplication below is the *op*, while the two matrices (m1, m2) are the tensors (typed multi-dimensional arrays). An op takes zero or more tensors as input and performs an operation, such as a mathematical calculation, with the output being zero or more tensors in the form of [numpy ndarray](http://www.numpy.org/) objects (or `tensorflow::Tensor` instances in C/C++).

```python
# Define input matrices
m1 = [[3., 2., 1.]]
m2 = [[-1.], [2.], [1.]]

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1: m1, t2: m2}))
```

#### Why Placeholders?

We use placeholders so we can execute the same operation with different inputs.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-matrix-multiplication-2_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

For example, let's re-run this using m1 (1 x 4) and m2 (4 x 1) matrices.

```python
# Set up input matrices
m1 = [[3., 2., 1., 0.]]
m2 = [[-5.], [-4.], [-3.], [-2.]]

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1: m1, t2: m2}))
```
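Because the graph is built once and fed at run time, the same `tp` op can be reused for any shape-compatible pair of inputs. A small sketch (the input pairs below are illustrative):

```python
# Reuse the same placeholder graph with several different inputs.
inputs = [
    ([[3., 2., 1.]], [[-1.], [2.], [1.]]),               # 1x3 times 3x1
    ([[3., 2., 1., 0.]], [[-5.], [-4.], [-3.], [-2.]]),  # 1x4 times 4x1
]

with tf.Session() as s:
    for a, b in inputs:
        print(s.run(tp, feed_dict={t1: a, t2: b}))
```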
## TensorFlow, Spark, TensorFrames... oh my!

With TensorFrames, you can manipulate Spark DataFrames with TensorFlow programs. Referring back to the tensor diagrams in the previous section, the updated figure below shows how Spark DataFrames work with TensorFlow.

![](https://github.com/dennyglee/databricks/blob/master/images/TF-TensorFrames-Diagram_500w.png?raw=true)

*Source:* [Learning PySpark](https://learningpyspark.com)

TensorFrames provides a bridge between Spark DataFrames and TensorFlow. It lets you take your DataFrames and apply them as input to your TensorFlow computation graph, and it lets you take the TensorFlow computation graph output and push it back into DataFrames so you can continue your downstream Spark processing.

Common usage scenarios for TensorFrames include:
* Utilizing TensorFlow with your data
* Parallel training to determine optimal hyperparameters

The following is a simple TensorFrames program where the `op` performs a simple addition. The original source code can be found in the [databricks/tensorframes](https://github.com/databricks/tensorframes) GitHub repo; see the README's [How to Run in Python](https://github.com/databricks/tensorframes#how-to-run-in-python) section.

### Use TensorFlow to add 3 to an existing column

First, import TensorFlow, TensorFrames, and `pyspark.sql.Row`, and create a DataFrame of floats.

```python
# Import TensorFlow, TensorFrames, and Row
# import tensorflow as tf  # already imported above
import tensorframes as tfs
from pyspark.sql import Row

# Create a list of Rows of floats and convert it into the DataFrame `df`
data = [Row(x=float(x)) for x in range(10)]
df = sqlContext.createDataFrame(data)
```

View the `df` DataFrame generated from the list of floats:

```python
df.show()
```

#### Execute the Tensor Graph

As noted above, this tensor graph consists of adding 3 to the tensor created from the `df` DataFrame:
* `x` uses `tfs.block`, where `block` builds a block placeholder based on the content of a column in a DataFrame.
* `z` is the output tensor from the TensorFlow add method (`tf.add`).
* `df2` is the new DataFrame, which appends the `z` tensor to the `df` DataFrame block by block.

```python
# Run the TensorFlow program:
# the `op` performs the addition (i.e. `x` + 3)
# and the result is placed back into a DataFrame.
with tf.Graph().as_default() as g:
    # The TensorFlow placeholder that corresponds to column 'x'.
    # The shape of the placeholder is automatically inferred from the DataFrame.
    x = tfs.block(df, "x")

    # The output that adds 3 to x
    z = tf.add(x, 3, name='z')

    # The resulting DataFrame.
    # `map_blocks` transforms a DataFrame into another DataFrame block by block.
    df2 = tfs.map_blocks(z, df)

# Note that `z` is the tensor output from the `tf.add` operation
print(z)
```

#### Review the output DataFrame

With the tensor appended as column `z` to the `df` DataFrame, you now have the `df2` DataFrame, which lets you continue working with your data as a Spark DataFrame.

```python
df2.show()
```
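For intuition, this particular graph is equivalent to an ordinary Spark column expression; a sketch of the same result without TensorFrames:

```python
from pyspark.sql import functions as F

# The same "add 3" result expressed as a plain Spark column expression.
# TensorFrames pays off when the per-block computation is a real TensorFlow
# graph rather than simple arithmetic like this.
df_plain = df.withColumn("z", F.col("x") + 3)
df_plain.show()
```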
### Block-wise reducing operations example

In this next section, we will show how to work with block-wise reducing operations. Specifically, we will compute the `sum` and `min` of a field of vectors, working with blocks of rows for more efficient processing.

#### Building a DataFrame of vectors

First, we will create a one-column DataFrame of vectors.

```python
# Build a DataFrame of vectors
data = [Row(y=[float(y), float(-y)]) for y in range(10)]
df = sqlContext.createDataFrame(data)
df.show()
```

### Analyze the DataFrame

We need to analyze the DataFrame to determine its shape (i.e., the dimensions of the vectors). Below, we use the `tfs.print_schema` command on the `df` DataFrame.

```python
# Print the information gathered by TensorFlow to check the content of the DataFrame
tfs.print_schema(df)
```

Notice the `double[?,?]`, meaning that TensorFlow does not know the dimensions of the vectors.

```python
# Because the DataFrame contains vectors, we need to analyze it first to find
# the dimensions of the vectors.
df2 = tfs.analyze(df)

# The information gathered by TensorFlow can be printed to check the content:
tfs.print_schema(df2)
```

#### Analyze This

After analysis (the `df2` DataFrame), TensorFlow has inferred that `y` contains vectors of size 2. For small tensors (scalars and vectors), TensorFrames usually infers the shapes of the tensors without requiring a preliminary analysis. If it cannot, an error message will indicate that you need to run the DataFrame through `tfs.analyze()` first.
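Once the shape is known, the block APIs from the previous section should work on the vector column as well. A sketch under that assumption (the scaling factor is arbitrary; as in the earlier example, the op name becomes the new column name):

```python
# Scale each 2-element vector by 2.0, block by block, using the analyzed df2.
with tf.Graph().as_default() as g:
    y = tfs.block(df2, "y")
    z = tf.multiply(y, 2.0, name="z")   # op name 'z' becomes the new column
    df_scaled = tfs.map_blocks(z, df2)

df_scaled.show()
```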
### Compute the Elementwise Sum and Min of all vectors

Now, let's use the analyzed DataFrame to compute the sum and the element-wise minimum of all the vectors using `tf.reduce_sum` and `tf.reduce_min`, *fully reduced to one element*.
* [`tf.reduce_sum`](https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_sum): computes the sum of elements across dimensions of a tensor, e.g. if `x = [[3, 2, 1], [-1, 2, 1]]` then `tf.reduce_sum(x) ==> 8`.
* [`tf.reduce_min`](https://www.tensorflow.org/api_docs/python/math_ops/reduction#reduce_min): computes the minimum of elements across dimensions of a tensor, e.g. if `x = [[3, 2, 1], [-1, 2, 1]]` then `tf.reduce_min(x) ==> -1`.

![](https://github.com/dennyglee/databricks/blob/master/images/Element%20Wise%20Diagrams.png?raw=true)

```python
# Note: first, make a copy of the 'y' column. This will be very cheap in Spark 2.0+.
df3 = df2.select(df2.y, df2.y.alias("z"))

# Execute the tensor graph
with tf.Graph().as_default() as g:
    # The placeholders. Note the special names that end with '_input':
    y_input = tfs.block(df3, 'y', tf_name="y_input")
    z_input = tfs.block(df3, 'z', tf_name="z_input")

    # Perform the elementwise sum and minimum
    y = tf.reduce_sum(y_input, [0], name='y')
    z = tf.reduce_min(z_input, [0], name='z')

    # Reduce the blocks; the final results are numpy arrays
    (data_sum, data_min) = tfs.reduce_blocks([y, z], df3)
```

```python
# The final results are numpy arrays:
print("Elementwise sum: %s and minimum: %s" % (data_sum, data_min))
```

#### Notes
* The scoping of the graphs above is important, because TensorFrames determines which DataFrame columns to feed to the graph based on the placeholders of that graph.
* It is good practice to keep graphs small when sending them to Spark.

```python
# Verify tf.reduce_sum: the sum of all elements of x (==> 8.0)
x = [[3, 2, 1], [-1, 2, 1]]
t1 = tf.placeholder(tf.float32)
tp = tf.reduce_sum(t1)

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1: x}))
```

```python
# Verify tf.reduce_min: the minimum of all elements of x (==> -1.0)
x = [[3, 2, 1], [-1, 2, 1]]
t1 = tf.placeholder(tf.float32)
tp = tf.reduce_min(t1)

# Execute the graph within a session
with tf.Session() as s:
    print(s.run([tp], feed_dict={t1: x}))
```
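Note that the TensorFrames reduction above uses dimension `[0]`, which keeps one value per vector component rather than collapsing everything to a scalar; a sketch of that difference in plain TensorFlow:

```python
# Reducing along dimension 0 keeps one value per component, matching
# tf.reduce_sum(y_input, [0]) and tf.reduce_min(z_input, [0]) above.
x = [[3., 2., 1.], [-1., 2., 1.]]
t1 = tf.placeholder(tf.float32)
col_sum = tf.reduce_sum(t1, [0])  # ==> [2. 4. 2.]
col_min = tf.reduce_min(t1, [0])  # ==> [-1. 2. 1.]

with tf.Session() as s:
    print(s.run([col_sum, col_min], feed_dict={t1: x}))
```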