# [TensorFlow](https://www.tensorflow.org/) basics

In this tutorial we are going to classify images from the notMNIST dataset. The goal is to automatically detect the letter based on the image in the dataset.

In [35]:
import tensorflow as tf

In [5]:
# Create TensorFlow object called hello_constant
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

b'Hello World!'


## Tensor

In TensorFlow, data is not stored as integers, floats, or strings. These values are encapsulated in an object called a [tensor](https://en.wikipedia.org/wiki/Tensor). In the case of `hello_constant = tf.constant('Hello World!')`, `hello_constant` is a 0-dimensional string tensor, but tensors come in a variety of sizes, as shown below:

In [6]:
# A is a 0-dimensional int32 tensor
A = tf.constant(1234)

# B is a 1-dimensional int32 tensor
B = tf.constant([123, 456, 789])

# C is a 2-dimensional int32 tensor
C = tf.constant([[123, 456, 789], [222, 333, 444]])

The tensor returned by `tf.constant()` is called a constant tensor, because the value of the tensor never changes.
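To make the notion of dimensionality concrete, here is a minimal pure-Python sketch that reports the shape a nested list would have. The helper `nested_shape` is a hypothetical name introduced only for illustration; it is not a TensorFlow function, and it assumes the nested lists are rectangular and non-empty.

```python
def nested_shape(value):
    """Return the shape of a rectangular nested list as a tuple.

    Hypothetical helper for illustration only; it is not a TensorFlow API.
    """
    shape = []
    # Descend into the first element at each nesting level
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0]
    return tuple(shape)


a_shape = nested_shape(1234)                                 # a single number
b_shape = nested_shape([123, 456, 789])                      # a list of numbers
c_shape = nested_shape([[123, 456, 789], [222, 333, 444]])   # a list of lists

# The number of dimensions is the length of the shape tuple
for name, shape in [("A", a_shape), ("B", b_shape), ("C", c_shape)]:
    print(name, "shape:", shape, "dimensions:", len(shape))
```

The values mirror `A`, `B`, and `C` from the cell above: no nesting gives 0 dimensions, one level of nesting gives 1, and two levels give 2.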

## Session

TensorFlow’s API is built around the idea of a computational graph. The previous TensorFlow code can be turned into a graph:

![TensorFlow_Session](./figures/session.png)

A "TensorFlow Session", as shown above, is an environment for running a graph. The session is in charge of allocating the operations to GPU(s) and/or CPU(s), including remote machines.

In [7]:
with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)

The code has already created the tensor, `hello_constant`, from the previous lines. The next step is to evaluate the tensor in a session.

The code creates a session instance, `sess`, using `tf.Session`. The `sess.run()` function then evaluates the tensor and returns the results.

## Input

If we want to use a non-constant value, we use [`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) and `feed_dict`. Next we go over the basics of feeding data into TensorFlow.

### tf.placeholder()

We use [`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) as a placeholder for arbitrary input data. This allows TensorFlow to take in different datasets with different parameters.

[`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) returns a tensor that gets its value from data passed to the [`tf.Session.run()`](https://www.tensorflow.org/api_docs/python/tf/Session#run) function, allowing us to set the input right before the session runs.

### Session's feed_dict

In [8]:
x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})

Use the `feed_dict` parameter in [`tf.Session.run()`](https://www.tensorflow.org/api_docs/python/tf/Session#run) to set the placeholder tensor. The above example shows the tensor `x` being set to the string `'Hello World'`. It's also possible to set more than one tensor using `feed_dict`, as shown below.

In [9]:
x = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Test String', y: 123, z: 45.67})

**Note:** If the data passed to the `feed_dict` doesn’t match the tensor type and can’t be cast into the tensor type, we get the error `ValueError: invalid literal for...`.

## TensorFlow Math

After getting the input, we are going to use it by applying basic math functions (add, subtract, multiply, and divide) to tensors. There are many more math functions; see the [documentation](https://www.tensorflow.org/api_docs/python/math_ops/).

### Addition

In [10]:
x = tf.add(5, 2)  # 7

The [`tf.add()`](https://www.tensorflow.org/api_guides/python/math_ops) function does exactly what you expect it to do. It takes in two numbers, two tensors, or one of each, and returns their sum as a tensor.

### Subtraction and Multiplication

Here’s an example with subtraction and multiplication.

In [11]:
x = tf.subtract(10, 4) # 6
y = tf.multiply(2, 5)  # 10

The `x` tensor will evaluate to `6`, because `10 - 4 = 6`. The `y` tensor will evaluate to `10`, because `2 * 5 = 10`.

### Converting types

It may be necessary to convert between types to make certain operators work together. For example, if we tried the following, it would fail with an exception:

In [12]:
#tf.subtract(tf.constant(2.0),tf.constant(1))
# Fails with ValueError: Tensor conversion requested dtype float32 
# for Tensor with dtype int32:

That's because the constant `1` is an integer but the constant `2.0` is a floating point value and subtract expects them to match.

In cases like these, we can either make sure our data is all of the same type, or we can cast a value to another type. In this case, converting the `2.0` to an integer before subtracting gives the correct result:

In [13]:
tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1

<tf.Tensor 'Sub_1:0' shape=() dtype=int32>

In [14]:
x = tf.constant(10)
y = tf.constant(2)
z = tf.subtract(tf.cast(tf.divide(x,y), tf.int32),tf.constant(1))

# Note: TensorFlow has multiple ways to divide.
#   tf.divide(x,y) uses Python 3 division semantics and will return a float here
#          It would be the best choice if all the other values had been floats
#   tf.div(x,y) uses Python 2 division semantics and will return an integer here
#          TensorFlow documentation suggests we should prefer tf.divide
#   tf.floordiv(x,y) will do floating point division and then round down to the nearest
#          integer (but the documentation says it may still represent
#          its result as a floating point value)
#   tf.cast(tf.divide(x,y), tf.int32)
#          This lets us do floating point division and then cast it to an integer
#          to match the 1 passed to subtract


# Print z from a session
with tf.Session() as sess:
    output = sess.run(z)
    print(output)

4
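The comments in the cell above refer to Python 2 and Python 3 division semantics. As a rough illustration, the same distinction can be seen in plain Python 3 with the numbers from the cell. This sketch does not call TensorFlow; the built-in `int()` merely stands in for the cast to `tf.int32`.

```python
x = 10
y = 2

# Python 3 "true" division always yields a float
true_div = x / y

# Floor division keeps integer operands as integers
floor_div = x // y

# Divide first, then convert to an integer so it matches the integer 1
cast_then_sub = int(x / y) - 1

print(true_div, type(true_div).__name__)
print(floor_div, type(floor_div).__name__)
print(cast_then_sub)
```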


### Recap

We did the following:
- Ran operations in [`tf.Session`](https://www.tensorflow.org/api_docs/python/tf/Session).
- Created a constant tensor with [`tf.constant()`](https://www.tensorflow.org/api_docs/python/tf/constant).
- Used [`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) and `feed_dict` to get input.
- Applied the [`tf.add()`](https://www.tensorflow.org/api_docs/python/tf/add), [`tf.subtract()`](https://www.tensorflow.org/api_docs/python/tf/subtract), [`tf.multiply()`](https://www.tensorflow.org/api_docs/python/tf/multiply), and [`tf.divide()`](https://www.tensorflow.org/api_docs/python/tf/divide) functions using numeric data.
- Learned about casting between types with [`tf.cast()`](https://www.tensorflow.org/api_docs/python/tf/cast)

These are the basics of TensorFlow. Next we learn about one of the most popular applications of neural networks - classification.

## Supervised Classification

See [statistical classification](https://en.wikipedia.org/wiki/Statistical_classification).

## Training a Logistic Classifier

A logistic classifier takes an input $X$, e.g. the pixels of an image, and applies a linear function to it to generate its predictions.

$$
WX + b = y
$$

$W$ are the weights and $b$ is the bias term. The output vector $y$ reflects the class of the input. This should be a probability vector in which the probability of the correct class is very close to one and the probability of every other class is close to zero.
The way to turn scores into probabilities is to use a softmax function:

$$
S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

This function, denoted by $S$, can turn any scores into proper probabilities.
Proper probabilities sum to one; they are large when the corresponding scores are large and small when the scores are comparatively small. In the context of logistic regression, scores are also often called logits.
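As a minimal sketch of the formula above, the softmax can be written in a few lines of plain Python. Subtracting the largest score before exponentiating is an extra step assumed here for numerical safety; it cancels in the fraction and leaves the result unchanged.

```python
import math


def softmax(scores):
    """Turn a list of scores (logits) into probabilities."""
    # Shifting by the maximum avoids overflow in exp() and cancels in the ratio
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

The largest score receives the largest probability (roughly 0.66 here), and the three probabilities sum to one up to floating point rounding.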

By training our network we are going to try to find values for the weights and bias that are good at making correct predictions.



## Linear functions in TensorFlow

The most common operation in neural networks is calculating the linear combination of inputs, weights, and biases. As a reminder, we can write the output of the linear operation as

$$
y = xW + b
$$

Here, $W$ is a matrix of the weights connecting two layers. The output $y$, the input $x$, and the biases $b$ are all vectors.
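To illustrate the shapes involved, here is a small pure-Python sketch of $y = xW + b$ for a single input vector. The function name `linear_output` and the numbers are made up for illustration; $W$ is assumed to have one row per input feature and one column per output.

```python
def linear_output(x, W, b):
    """Compute y = xW + b for one input vector x.

    x has n_features entries, W has n_features rows and n_labels columns,
    and b has n_labels entries.
    """
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]


x = [1.0, 2.0, 3.0]      # 3 input features
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]         # 3 features x 2 outputs
b = [0.5, -0.5]          # one bias per output

y = linear_output(x, W, b)
print(y)
```

Each output is a weighted sum of all inputs plus its own bias, so three features and two outputs require a 3 x 2 weight matrix.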

### Weights and Bias in TensorFlow

The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and bias, we'll need a Tensor that can be modified. This leaves out [`tf.placeholder()`](https://www.tensorflow.org/api_docs/python/tf/placeholder) and [`tf.constant()`](https://www.tensorflow.org/api_docs/python/tf/constant), since those Tensors can't be modified. This is where the [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) class comes in.

#### tf.Variable()

In [15]:
x = tf.Variable(5)

The [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so we must initialize the state of the tensor manually. We'll use the [`tf.global_variables_initializer()`](https://www.tensorflow.org/api_docs/python/tf/global_variables_initializer) function to initialize the state of all the Variable tensors.

##### Initialization

In [16]:
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

The [`tf.global_variables_initializer()`](https://www.tensorflow.org/api_docs/python/tf/global_variables_initializer) call returns an operation that will initialize all TensorFlow variables from the graph. We call the operation using a session to initialize all the variables as shown above. Using the [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) class allows us to change the weights and bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time we train it. We'll see more about this in the next section, when we study gradient descent.

Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. We'll use the [`tf.truncated_normal()`](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) function to generate random numbers from a normal distribution.

#### tf.truncated_normal()

In [17]:
n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

The [`tf.truncated_normal()`](https://www.tensorflow.org/api_docs/python/tf/truncated_normal) function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
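One way to picture such a distribution is the following pure-Python sketch, which redraws any sample that falls more than 2 standard deviations from the mean. The function below is an illustrative stand-in written for this tutorial, not TensorFlow's implementation, and the small matrix size is chosen only to keep the output short.

```python
import random


def truncated_normal_sample(mean=0.0, stddev=1.0):
    """Draw one sample, redrawing until it lies within 2 stddev of the mean."""
    while True:
        sample = random.gauss(mean, stddev)
        if abs(sample - mean) <= 2.0 * stddev:
            return sample


random.seed(0)  # fixed seed so the sketch is repeatable

n_features, n_labels = 4, 3
weights = [[truncated_normal_sample() for _ in range(n_labels)]
           for _ in range(n_features)]

for row in weights:
    print(row)
```

Every entry ends up in the interval from -2 to 2, so no single initial weight can be extremely large compared to the others.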

Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use the simplest solution, setting the bias to 0.

#### tf.zeros()

In [18]:
n_labels = 5
bias = tf.Variable(tf.zeros(n_labels))

The [`tf.zeros()`](https://www.tensorflow.org/api_docs/python/tf/zeros) function returns a tensor with all zeros.

### Linear Classifier Example

![mnist](./figures/mnist-012.png)
A subset of the MNIST dataset

We'll be classifying the handwritten numbers `0`, `1`, and `2` from the MNIST dataset using TensorFlow. The above is a small sample of the data we'll be training on. Notice how some of the `1`s are written with a [serif](https://en.wikipedia.org/wiki/Serif) at the top and at different angles. The similarities and differences will play a part in shaping the weights of the model.

![weights](./figures/weights-0-1-2.png)
Left: Weights for labeling 0. Middle: Weights for labeling 1. Right: Weights for labeling 2.

The images above are trained weights for each label (`0`, `1`, and `2`). The weights display the unique properties of each digit they have found. In the following we will train our own weights using the MNIST dataset.

In [20]:
def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # Return biases
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b)

In [22]:
from tensorflow.examples.tutorials.mnist import input_data


def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('./mnist', one_hot=True)

    # In order to make this run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels only if the example belongs to one of the first <n> labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

# get init operation that is used to initialize all TensorFlow variables
init = tf.global_variables_initializer()

with tf.Session() as session:
    # initialize all TensorFlow variables
    session.run(init)

    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting ./mnist/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting ./mnist/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting ./mnist/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting ./mnist/t10k-labels-idx1-ubyte.gz
Loss: 6.363794803619385


## TensorFlow Softmax

In the Intro to TFLearn we used the softmax function to calculate class probabilities as output from the network. The softmax function squashes its inputs, typically called logits or logit scores, to be between `0` and `1` and also normalizes the outputs such that they all sum to `1`. This means the output of the softmax function is equivalent to a categorical probability distribution. It's the perfect function to use as the output activation for a network predicting multiple classes.

![](./figures/softmax-input-output.png)
Example of the softmax function at work.

### TensorFlow Softmax
We're using TensorFlow to build neural networks and, appropriately, there's a function for calculating softmax.

In [23]:
x = tf.nn.softmax([2.0, 1.0, 0.2])

[`tf.nn.softmax()`](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) implements the softmax function. It takes in logits and returns softmax activations.

In [24]:
def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    softmax = tf.nn.softmax(logits)

    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict={logits: logit_data})

    return output

In [25]:
run()

array([ 0.65900117,  0.24243298,  0.09856589], dtype=float32)

## One-Hot Encoding

We need a way to represent our labels mathematically. We want the probability for the correct class to be close to `1` and the probability for all the other classes close to `0`.
Using [one-hot encoding](https://en.wikipedia.org/wiki/One-hot), each label is represented by a vector that has one entry per class, with the value `1.0` for the correct class and `0.0` everywhere else.

### One-Hot Encoding With Scikit-Learn

Transforming labels into one-hot encoded vectors is pretty simple with scikit-learn using `LabelBinarizer`.

In [27]:
import numpy as np
from sklearn import preprocessing

# Example labels
labels = np.array([1,5,3,2,1,4,2,1,3])

# Create the encoder
lb = preprocessing.LabelBinarizer()

# Here the encoder finds the classes and assigns one-hot vectors 
lb.fit(labels)

# And finally, transform the labels into one-hot encoded vectors
lb.transform(labels)

array([[1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0]])
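For intuition about what the encoder does, the same result can be sketched in plain Python. The function `one_hot_encode` is a hypothetical helper written for this illustration; it assumes that the classes are ordered by sorting the distinct labels.

```python
def one_hot_encode(labels):
    """Return (classes, vectors), where each vector contains a single 1."""
    classes = sorted(set(labels))                    # one column per distinct class
    column = {c: i for i, c in enumerate(classes)}   # class -> column index
    vectors = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        vectors.append(row)
    return classes, vectors


classes, encoded = one_hot_encode([1, 5, 3, 2, 1, 4, 2, 1, 3])
print(classes)
for row in encoded:
    print(row)
```

With five distinct labels, every vector has five entries, and the position of the `1` identifies the class.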

## Cross Entropy in TensorFlow

In the Intro to TFLearn lesson we discussed using cross entropy as the cost function for classification with one-hot encoded labels. Again, TensorFlow has a function to do the cross entropy calculations for us.

![cross_entropy](./figures/cross-entropy-diagram.png)
Cross entropy loss function

To create a cross entropy function in TensorFlow, we'll need to use two new functions:

- [`tf.reduce_sum()`](https://www.tensorflow.org/api_docs/python/tf/reduce_sum)
- [`tf.log()`](https://www.tensorflow.org/api_docs/python/tf/log)

### Reduce Sum

In [31]:
x = tf.reduce_sum([1, 2, 3, 4, 5])  # 15

The [`tf.reduce_sum()`](https://www.tensorflow.org/api_docs/python/tf/reduce_sum) function takes an array of numbers and sums them together.

### Natural Log

In [40]:
x = tf.log(100.0)  # 4.60517

This function does exactly what you would expect it to do. [`tf.log()`](https://www.tensorflow.org/api_docs/python/tf/log) takes the natural log of a number.

In [41]:
softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# Calculate cross entropy
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot, tf.log(softmax)))

# Print cross entropy from session
with tf.Session() as session:
    output = session.run(cross_entropy, feed_dict={one_hot: one_hot_data, softmax: softmax_data})
    print(output)

0.356675
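The same number can be checked by hand with a short pure-Python sketch. Since the label vector is one-hot, only the term for the correct class contributes, so the cross entropy reduces to the negative log of the probability assigned to that class.

```python
import math


def cross_entropy(probabilities, one_hot):
    """Cross entropy between predicted probabilities and a one-hot label."""
    # Terms with a zero label contribute nothing; skipping them also avoids log(0)
    return -sum(y * math.log(p)
                for p, y in zip(probabilities, one_hot) if y > 0)


loss = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
worse_loss = cross_entropy([0.1, 0.2, 0.7], [1.0, 0.0, 0.0])

print(loss)
print(worse_loss)
```

A prediction that puts only 0.1 on the correct class gives a noticeably larger loss than one that puts 0.7 on it.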


## Minimizing Cross Entropy

Having all the pieces of the puzzle, the question is how we are going to find the weights $W$ and the biases $b$ that will get our classifier to do what we want it to do. That is, have a low distance for the correct class but a high distance for the incorrect classes. One thing we can do is to measure that distance averaged over the entire training set, for all the inputs and the labels that we have available. That is called the training loss:

$$
L = \frac{1}{N} \sum_i D(S(W x_i + b), y_i)
$$

This loss, which is the average cross entropy over our entire training set, is one humongous function. Every example in our training set gets multiplied by this one big matrix $W$, and then they all get added up in one big sum.
We want all the distances to be small, which would mean we are doing a good job at classifying every example in the training data. So we want the loss to be small. The loss is a function of the weights and the biases, so we are going to minimize that function.

We can turn the machine learning problem into a numerical optimization problem and use gradient descent to find the minimum loss.
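To see that the training loss really is just a function of the weights and biases, here is a minimal pure-Python sketch that evaluates it on a tiny made-up dataset. The data, the two weight matrices, and the helper names are assumptions chosen for illustration; $D$ is taken to be the cross entropy and $S$ the softmax.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot):
    return -sum(y * math.log(p)
                for p, y in zip(probabilities, one_hot) if y > 0)


def training_loss(W, b, inputs, labels):
    """Average cross entropy of softmax(W x_i + b) over all examples."""
    total = 0.0
    for x, y in zip(inputs, labels):
        # One score per class: row j of W dotted with x, plus bias j
        scores = [sum(w * v for w, v in zip(row, x)) + bias
                  for row, bias in zip(W, b)]
        total += cross_entropy(softmax(scores), y)
    return total / len(inputs)


inputs = [[1.0, 0.0], [0.0, 1.0]]     # two toy examples with two features each
labels = [[1.0, 0.0], [0.0, 1.0]]     # one-hot labels
b = [0.0, 0.0]

loss_zero = training_loss([[0.0, 0.0], [0.0, 0.0]], b, inputs, labels)
loss_good = training_loss([[5.0, -5.0], [-5.0, 5.0]], b, inputs, labels)

print(loss_zero)
print(loss_good)
```

With all-zero weights the classifier is undecided between the two classes and the loss equals the natural log of 2; the second weight matrix separates the examples and drives the loss close to zero.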

## Transition into Practical Aspects of Learning

In the next section we will see how we can use TensorFlow tools to compute the derivatives, and look at the pros and cons of gradient descent. But for now we assume that we have the optimizer as a black box that we can simply use.
There are still two last practical things that stand in the way before we can train our model:

1. How do we feed image pixels to this classifier?
2. Where do we initialize the optimization?

## Numerical Stability

When we do numerical computations, we always have to worry a bit about calculating values that are too large or too small. In particular, adding very small values to a very large value can introduce a lot of error.

In [42]:
a = 1000000000
for i in range(1000000):
    a = a + 1e-6
print(a - 1000000000)

0.95367431640625


If we add 1e-6 a million times to the value one billion and then subtract one billion again, we should mathematically get 1.0. But the code returns about 0.95, which is a big difference.

If we replace the one billion with just one we see that the error becomes very tiny.

In [43]:
a = 1
for i in range(1000000):
    a = a + 1e-6
print(a - 1)

0.9999999999177334


### Normalized Inputs and Initial Weights

Because of these numerical issues we want the values involved in the loss function to never get too big or too small.
One good guiding principle is that we want our values to always have zero mean and equal variance whenever possible.
On top of the numerical issues, there are also very good mathematical reasons to keep the values we compute roughly around a mean of zero and with equal variance when we are doing optimization: the problem is then well conditioned. A badly conditioned problem means that the optimizer has to do a lot of searching to find a good solution. A well conditioned problem makes it a lot easier for the optimizer to do its job.

![normalization](./figures/normalization.jpeg)

If we are dealing with images, it is simple. We can take the pixel values of the images, which are typically between 0 and 255, and simply subtract 128 and divide by 128. This does not change the content of the image, but it makes it much easier for the optimization to proceed numerically.
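As a quick sketch of this rescaling, assuming pixel values are plain integers between 0 and 255:

```python
def normalize_pixel(value):
    """Map a pixel value in [0, 255] to roughly [-1, 1]."""
    return (value - 128.0) / 128.0


pixels = [0, 64, 128, 255]
normalized = [normalize_pixel(p) for p in pixels]
print(normalized)
```

Black (0) maps to -1.0, mid-gray (128) maps to 0.0, and white (255) maps to just under 1.0, so the values are centered around zero.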

We also want our weights and biases to be initialized at a good enough starting point for the gradient descent to proceed. There are a lot of schemes to find good initialization values, but we are going to focus on a simple general method. We draw the weights randomly from a Gaussian distribution with mean zero and standard deviation sigma. The sigma value determines the order of magnitude of our outputs at the initial point of our optimization. Because of the softmax on top of it, the order of magnitude also determines the peakiness of our initial probability distribution. A large sigma means that our distribution will have large peaks and will be very opinionated. A small sigma means that our distribution is very uncertain about things. It is usually better to begin with an uncertain distribution and let the optimization become more confident as training progresses. So we use a small sigma to begin with.
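The effect of the output magnitude on the softmax can be sketched in plain Python. As a simplifying assumption, instead of drawing weights with different sigma values, the example scales one fixed score vector up and down by a factor of 10, which stands in for the larger or smaller outputs that a larger or smaller sigma would produce.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


base_scores = [2.0, 1.0, 0.1]

# Large-magnitude scores, as a large sigma would tend to produce
peaky = softmax([s * 10 for s in base_scores])

# Small-magnitude scores, as a small sigma would tend to produce
uncertain = softmax([s / 10 for s in base_scores])

print(peaky)
print(uncertain)
```

The scaled-up scores put almost all probability on one class, while the scaled-down scores give a nearly uniform distribution, which is the uncertain starting point we prefer.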

Now that we have everything, we can train our classifier. We have our training data, which is normalized to have zero mean and unit variance. We multiply it by a large matrix which is initialized with random weights. We apply the softmax, then the cross entropy loss function, and finally calculate the average of this loss over the entire training data.

Then our optimization package computes the derivative of this loss with respect to the weights and to the biases and takes a step in the direction opposite to that derivative. After that we start all over again. We repeat the process until we reach a minimum of the loss function.
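The loop described above can be sketched without any optimization package. The following pure-Python example is only an illustration under stated assumptions: the tiny dataset, the learning rate of 0.5, the 100 steps, and sigma of 0.1 are made up, and the gradient is written out by hand using the fact that the derivative of the softmax cross entropy with respect to each score is the predicted probability minus the label.

```python
import math
import random


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    scores = [sum(w * v for w, v in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)


def average_loss(W, b, inputs, labels):
    total = 0.0
    for x, y in zip(inputs, labels):
        p = predict(W, b, x)
        total -= sum(t * math.log(q) for q, t in zip(p, y) if t > 0)
    return total / len(inputs)


def gradient_step(W, b, inputs, labels, learning_rate):
    n = len(inputs)
    grad_W = [[0.0] * len(W[0]) for _ in W]
    grad_b = [0.0] * len(b)
    for x, y in zip(inputs, labels):
        p = predict(W, b, x)
        for j in range(len(b)):
            error = (p[j] - y[j]) / n     # derivative of the loss w.r.t. score j
            grad_b[j] += error
            for k in range(len(x)):
                grad_W[j][k] += error * x[k]
    # Step in the direction opposite to the derivative
    new_W = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_b = [v - learning_rate * g for v, g in zip(b, grad_b)]
    return new_W, new_b


random.seed(0)  # fixed seed so the sketch is repeatable
inputs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Small random initial weights (sigma = 0.1) and zero biases
W = [[random.gauss(0.0, 0.1) for _ in range(2)] for _ in range(2)]
b = [0.0, 0.0]

initial_loss = average_loss(W, b, inputs, labels)
for step in range(100):
    W, b = gradient_step(W, b, inputs, labels, learning_rate=0.5)
final_loss = average_loss(W, b, inputs, labels)

print(initial_loss, final_loss)
```

After the repeated steps the average loss is lower than at the random starting point, which is all that "training" means in this setting.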

## Measuring Performance

Now that we know how to train our model, there is another important part:
We have seen that we have a training set as well as a validation set and a test set.
This has to do with measuring how well we are doing. Measuring performance is subtle.
A classifier should not memorize the training set, because it would then fail to generalize to new input examples.
This is not just a theoretical problem. Every classifier that we will build will try to memorize the training set. It is our job to help it generalize to new data instead. Therefore we use a small test set that the classifier has not seen before to test the current performance of the classifier. The problem is that training a classifier is usually trial and error. It is a cycle: we train a classifier, measure its performance, try another classifier, and measure again, and again. We tweak the model, explore the different parameters, measure, and finally we think that we have the perfect classifier.
