
https://www.tensorflow.org/versions/r0.11/tutorials/mnist/beginners/index.html


$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
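
To make the formula concrete, here is a minimal plain-Python sketch of softmax for a single list of scores. It is only an illustration, not how TensorFlow implements `tf.nn.softmax`; the function name and the example scores are made up for this note, and subtracting the maximum is a common numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(xs):
    """Return softmax(xs) as a list of probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes
probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Larger scores get larger probabilities, and the outputs always sum to 1.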


<tr>
<td> <img src="mnist-img/softmax-weights.png" alt="Softmax Weights" style="width: 250px;"/> </td>
<td> <img src="mnist-img/softmax-regression-scalargraph.png" alt="Softmax Regression" style="width: 400px;"/> </td>
</tr>




TensorFlow lets us describe all of our operations as a graph that is then executed entirely outside of Python. This makes TensorFlow efficient because it avoids switching back to Python after every operation, and that transfer of data between languages is often the main bottleneck.

In [3]:
import tensorflow as tf

In [6]:
# x is a placeholder for a 2D float tensor of shape [None, 784]:
# any number of images (None), each flattened to 784 pixel values
x = tf.placeholder(tf.float32, [None, 784])

In [7]:
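# W maps 784 pixel values to 10 class scores; b is a per-class bias.
# Both are trainable variables initialized to zeros.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))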

In [8]:
y = tf.nn.softmax(tf.matmul(x, W) + b)
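
As a rough illustration of what this line computes for a single image, the plain-Python sketch below evaluates softmax(xW + b) by hand. The helper names and the stand-in input row are hypothetical, not part of the tutorial. It assumes the same zero initialization as `W` and `b` above, which means every class starts out with the same probability.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x_row, W, b):
    """Compute softmax(x_row @ W + b) for one input row."""
    logits = [
        sum(x_row[k] * W[k][j] for k in range(len(x_row))) + b[j]
        for j in range(len(b))
    ]
    return softmax(logits)

n_pixels, n_classes = 784, 10
W = [[0.0] * n_classes for _ in range(n_pixels)]  # zeros, like tf.zeros([784, 10])
b = [0.0] * n_classes                             # zeros, like tf.zeros([10])
x_row = [0.5] * n_pixels  # hypothetical stand-in for one flattened image

y_row = predict(x_row, W, b)
print(y_row)  # ten equal probabilities, since all scores are zero
```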

[x] [Understanding cross-entropy](http://colah.github.io/posts/2015-09-Visual-Information/)
> What is Information Theory? A language for describing information? Could be highly interesting.
> Review up to http://colah.github.io/posts/2015-09-Visual-Information/#cross-entropy

<tr>
<td> <img src="./cross-entropy-img/Hxy-3D.png" alt="Entropy of what I wear and the weather" style="width: 250px;"/> </td>
</tr>

In [9]:
# y_ is a placeholder for the correct answers: one 10-element label vector per image
y_ = tf.placeholder(tf.float32, [None, 10])

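Here is our cross-entropy function, where $y$ is the predicted probability distribution and $y'$ is the true distribution (the label):
$$H_{y'}(y) = -\sum_i y'_i \log(y_i)$$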

In [11]:
# Sum -y' * log(y) over the 10 classes (axis 1), then average over the batch.
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
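
A small worked example may help show how this loss behaves. The plain-Python sketch below computes the per-example cross-entropy for two made-up predictions against the same one-hot label, then averages them the way `tf.reduce_mean` does over a batch. The function name and the numbers are hypothetical. Unlike the TensorFlow expression, this sketch skips terms where the label is 0 instead of multiplying by zero.

```python
import math

def cross_entropy(y_true, y_pred):
    """-sum_i y'_i * log(y_i) for one example; zero-label terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 1.0, 0.0]        # one-hot label: the correct class is index 1
confident = [0.05, 0.90, 0.05]  # puts most probability on the correct class
unsure = [0.40, 0.30, 0.30]     # spreads probability across all classes

loss_confident = cross_entropy(y_true, confident)
loss_unsure = cross_entropy(y_true, unsure)
batch_loss = (loss_confident + loss_unsure) / 2  # mean over the "batch"

print(loss_confident, loss_unsure, batch_loss)
```

The confident, correct prediction gets a loss near zero, while the unsure one is penalized more heavily.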


Because TensorFlow knows the entire graph of your computations, it can automatically use the
[ ] [Backpropagation algorithm](http://colah.github.io/posts/2015-08-Backprop/) to efficiently determine how your variables affect the loss you ask it to minimize.


Then it can apply your choice of optimization algorithm to modify the variables and reduce the loss.

In [12]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)


This asks TensorFlow to minimize `cross_entropy` using the
[ ] [Gradient Descent Algorithm](https://en.wikipedia.org/wiki/Gradient_descent) with a learning rate of `0.5`. Gradient descent shifts each variable a little bit in the direction that reduces the cost.
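
To see the update rule in isolation, here is a minimal sketch of gradient descent on a one-variable toy function, f(w) = (w - 3)^2. The function, the starting point, the step count, and the learning rate of 0.1 are all chosen for illustration; this is not the MNIST model and not what `GradientDescentOptimizer` does internally beyond the basic idea of stepping against the gradient.

```python
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0               # hypothetical starting value
learning_rate = 0.1   # illustrative; the cell above uses 0.5 for its own loss

for step in range(50):
    # Move w a small step in the direction that reduces f
    w = w - learning_rate * grad_f(w)

print(w, f(w))  # w ends up close to the minimum at 3
```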

TensorFlow provides [many other optimization algorithms](https://www.tensorflow.org/versions/r0.11/api_docs/python/train.html#optimizers); using one is as simple as changing one line of code.


In [13]:
# Create (but do not yet run) an operation that initializes all variables
init = tf.initialize_all_variables()

Now we can launch the model in a `Session` and run the operation that initializes the variables:

In [14]:
sess = tf.Session()
sess.run(init)

Now we can train by running the training step 1000 times:

In [15]:
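# Assumes `mnist` was loaded in an earlier cell (not shown here), e.g. with
# input_data.read_data_sets("MNIST_data/", one_hot=True) as in the linked tutorial.
for i in range(1000):
    # Fetch the next batch of 100 training images and their labels
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})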

Each time we run `train_step`, we feed in a batch of data to replace the placeholders `x` and `y_`.

In [19]:
# tf.argmax(y, 1) is the predicted digit; tf.argmax(y_, 1) is the true digit
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))

In [21]:
# Cast the booleans to floats (True -> 1.0, False -> 0.0) and take the mean
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
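
The accuracy computation can be sketched in plain Python on a tiny made-up batch: take the index of the largest entry in each prediction and each label, compare them, and average the matches. The `argmax` helper and the four example rows are hypothetical and only meant to mirror the argmax, equal, cast, mean steps.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical predicted probabilities for 4 examples and 3 classes
predictions = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
# One-hot labels for the same 4 examples
labels = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]

# Compare predicted class with true class for each example
correct = [argmax(p) == argmax(l) for p, l in zip(predictions, labels)]
# Cast booleans to floats and average them
accuracy = sum(1.0 if c else 0.0 for c in correct) / len(correct)

print(correct)   # [True, True, False, True]
print(accuracy)  # 0.75
```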

In [22]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9202


[Next portion of tutorial](https://www.tensorflow.org/versions/r0.11/tutorials/mnist/pros/index.html)