The cross-entropy is easy to implement as part of a program which learns using gradient descent and backpropagation.
The new program is called network2.py, and incorporates not just the cross-entropy, but also several other techniques developed in this chapter.
For now, let's look at how well our new program classifies MNIST digits. As was the case in Chapter 1, we'll use a network with 30 hidden neurons, and we'll use a mini-batch size of 10. We set the learning rate to η=0.5 and train for 30 epochs.
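To illustrate why the cross-entropy is easy to fit into a gradient-descent program, here is a minimal pure-Python sketch of the cost for a single training example, together with the output-layer error it produces. The function names are hypothetical and chosen only for this illustration; this is not the code in network2.py. It assumes sigmoid output neurons, so every activation lies strictly between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic function, assumed here as the output activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cross_entropy_cost(a, y):
    """Cross-entropy for one example, summed over output neurons.

    a: output activations, each strictly between 0 and 1
    y: desired outputs, each between 0 and 1
    Terms whose coefficient is zero are skipped, so log(0) is never
    evaluated when a desired output is exactly 0 or 1.
    """
    total = 0.0
    for a_j, y_j in zip(a, y):
        if y_j > 0.0:
            total -= y_j * math.log(a_j)
        if y_j < 1.0:
            total -= (1.0 - y_j) * math.log(1.0 - a_j)
    return total

def cross_entropy_delta(a, y):
    """Output-layer error for the cross-entropy: simply a - y."""
    return [a_j - y_j for a_j, y_j in zip(a, y)]

def quadratic_delta(z, a, y):
    """Output-layer error for the quadratic cost: (a - y) * sigma'(z)."""
    return [(a_j - y_j) * sigmoid_prime(z_j) for z_j, a_j, y_j in zip(z, a, y)]
```

The only change backpropagation needs is in the output-layer error: the cross-entropy version drops the σ′(z) factor that appears in the quadratic version.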

In Chapter 1 we used the quadratic cost and a learning rate of η=3.0. As discussed above, it's not possible to say precisely what it means to use the "same" learning rate when the cost function is changed. For both cost functions I experimented to find a learning rate that provides near-optimal performance, given the other hyper-parameter choices.

There is, incidentally, a very rough general heuristic for relating the learning rate for the cross-entropy and the quadratic cost. As we saw earlier, the gradient terms for the quadratic cost have an extra σ′=σ(1−σ) term in them. Suppose we average this over values for σ: ∫₀¹ σ(1−σ) dσ = 1/6. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.
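The averaging step can be checked numerically. The sketch below estimates the integral of σ(1−σ) over [0, 1] with a midpoint rule and applies the resulting factor to the quadratic-cost learning rate of 3.0 used in Chapter 1. The function name and the number of sub-intervals are arbitrary choices made for this illustration.

```python
def average_sigma_prime(n=100000):
    """Midpoint-rule estimate of the integral of s * (1 - s) for s in [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h          # midpoint of the i-th sub-interval
        total += s * (1.0 - s)
    return total * h

avg = average_sigma_prime()        # close to 1/6
quadratic_eta = 3.0                # learning rate used with the quadratic cost
cross_entropy_eta = quadratic_eta * avg   # same as dividing by about 6

print("average of sigma(1 - sigma):", avg)
print("suggested cross-entropy learning rate:", cross_entropy_eta)
```

The suggested value comes out very close to 0.5, the learning rate used for the cross-entropy in this section, though as the text says the heuristic is only a rough starting point.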

In [2]:
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

In [6]:
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5,
        evaluation_data=test_data,
        monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9142 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 9304 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 9361 / 10000

Epoch 3 training complete
Accuracy on evaluation data: 9383 / 10000

Epoch 4 training complete
Accuracy on evaluation data: 9404 / 10000

Epoch 5 training complete
Accuracy on evaluation data: 9416 / 10000

Epoch 6 training complete
Accuracy on evaluation data: 9419 / 10000

Epoch 7 training complete
Accuracy on evaluation data: 9481 / 10000

Epoch 8 training complete
Accuracy on evaluation data: 9497 / 10000

Epoch 9 training complete
Accuracy on evaluation data: 9487 / 10000

Epoch 10 training complete
Accuracy on evaluation data: 9494 / 10000

Epoch 11 training complete
Accuracy on evaluation data: 9519 / 10000

Epoch 12 training complete
Accuracy on evaluation data: 9514 / 10000

Epoch 13 training complete
Accuracy on evaluation data: 9508 / 10000

Epoch 14 training complete
Acc

([],
 [9142,
  9304,
  9361,
  9383,
  9404,
  9416,
  9419,
  9481,
  9497,
  9487,
  9494,
  9519,
  9514,
  9508,
  9519,
  9521,
  9511,
  9537,
  9515,
  9514,
  9523,
  9532,
  9527,
  9496,
  9524,
  9506,
  9532,
  9528,
  9511,
  9526],
 [],
 [])

Note, by the way, that the net.large_weight_initializer() command is used to initialize the weights and biases in the same way as described in Chapter 1. We need to run this command because later in this chapter we'll change the default weight initialization in our networks. The result from running the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.
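As a rough picture of what such an initializer does, the following pure-Python sketch draws every weight and bias from a Gaussian with mean 0 and standard deviation 1, which is the Chapter 1 scheme as I understand it. It is a stand-in written for this illustration, not the method in network2.py; the function signature and the `seed` parameter are assumptions.

```python
import random

def large_weight_initializer(sizes, seed=None):
    """Return (biases, weights) for a network with the given layer sizes.

    Every entry is drawn from a Gaussian with mean 0 and standard
    deviation 1. No biases are created for the input layer.
    """
    rng = random.Random(seed)
    # One bias per neuron in each non-input layer.
    biases = [[rng.gauss(0.0, 1.0) for _ in range(n)] for n in sizes[1:]]
    # weights[l][j][k] connects neuron k in layer l to neuron j in layer l + 1.
    weights = [
        [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]
    return biases, weights

biases, weights = large_weight_initializer([784, 30, 10], seed=0)
print(len(weights[0]), "x", len(weights[0][0]))   # hidden-layer weight shape
print(len(weights[1]), "x", len(weights[1][0]))   # output-layer weight shape
```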

Let's look also at the case where we use 100 hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of 96.82 percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of 96.59 percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from 3.41 percent to 3.18 percent. That is, we've eliminated about one in fourteen of the original errors. That's quite a handy improvement.
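The error-rate arithmetic is easy to reproduce. The short sketch below converts the two accuracies quoted above into error rates and computes the fraction of the original errors that was eliminated; the helper name is hypothetical.

```python
def error_reduction(old_accuracy_pct, new_accuracy_pct):
    """Return (old error %, new error %, fraction of old errors eliminated)."""
    old_error = 100.0 - old_accuracy_pct
    new_error = 100.0 - new_accuracy_pct
    return old_error, new_error, (old_error - new_error) / old_error

# Accuracies quoted in the text: quadratic cost, then cross-entropy.
old_error, new_error, fraction = error_reduction(96.59, 96.82)

print("error rate: %.2f%% -> %.2f%%" % (old_error, new_error))
print("fraction of original errors eliminated: %.3f" % fraction)
print("that is roughly one error in every %.1f" % (1.0 / fraction))
```

The fraction works out to a little under 7 percent of the original errors, which is the "about one in fourteen" figure once rounded loosely.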

It's encouraging that the cross-entropy cost gives us similar or better results than the quadratic cost. However, these results don't conclusively prove that the cross-entropy is a better choice. The reason is that I've put only a little effort into choosing hyper-parameters such as learning rate, mini-batch size, and so on. For the improvement to be really convincing we'd need to do a thorough job optimizing such hyper-parameters. Still, the results are encouraging, and reinforce our earlier theoretical argument that the cross-entropy is a better choice than the quadratic cost.

This, by the way, is part of a general pattern that we'll see through this chapter and, indeed, through much of the rest of the book. We'll develop a new technique, we'll try it out, and we'll get "improved" results. It is, of course, nice that we see such improvements. But the interpretation of such improvements is always problematic. They're only truly convincing if we see an improvement after putting tremendous effort into optimizing all the other hyper-parameters. That's a great deal of work, requiring lots of computing power, and we're not usually going to do such an exhaustive investigation. Instead, we'll proceed on the basis of informal tests like those done above. Still, you should keep in mind that such tests fall short of definitive proof, and remain alert to signs that the arguments are breaking down.

In [7]:
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost) 

In [8]:
net.large_weight_initializer()

# The model overfits when trained on only the first 1,000 training images

In [9]:
net.SGD(training_data[:1000], 400, 10, 0.5,
        evaluation_data=test_data,
        monitor_evaluation_accuracy=True,
        monitor_training_cost=True)

Epoch 0 training complete
Cost on training data: 1.84139565543
Accuracy on evaluation data: 6202 / 10000

Epoch 1 training complete
Cost on training data: 1.38580417171
Accuracy on evaluation data: 6893 / 10000

Epoch 2 training complete
Cost on training data: 1.15467253384
Accuracy on evaluation data: 7166 / 10000

Epoch 3 training complete
Cost on training data: 0.919018693931
Accuracy on evaluation data: 7723 / 10000

Epoch 4 training complete
Cost on training data: 0.817511801552
Accuracy on evaluation data: 7778 / 10000

Epoch 5 training complete
Cost on training data: 0.67193699903
Accuracy on evaluation data: 7889 / 10000

Epoch 6 training complete
Cost on training data: 0.553130571116
Accuracy on evaluation data: 8023 / 10000

Epoch 7 training complete
Cost on training data: 0.507954768934
Accuracy on evaluation data: 8049 / 10000

Epoch 8 training complete
Cost on training data: 0.437310636767
Accuracy on evaluation data: 8181 / 10000

Epoch 9 training complete
Cost on trainin

([],
 [6202,
  6893,
  7166,
  7723,
  7778,
  7889,
  8023,
  8049,
  8181,
  8105,
  8208,
  8248,
  8266,
  8292,
  8291,
  8330,
  8320,
  8348,
  8364,
  8355,
  8340,
  8317,
  8384,
  8388,
  8368,
  8378,
  8398,
  8378,
  8401,
  8416,
  8428,
  8403,
  8434,
  8421,
  8411,
  8429,
  8440,
  8436,
  8432,
  8434,
  8429,
  8426,
  8430,
  8439,
  8439,
  8439,
  8460,
  8447,
  8444,
  8438,
  8450,
  8436,
  8431,
  8444,
  8441,
  8443,
  8448,
  8449,
  8448,
  8449,
  8448,
  8447,
  8445,
  8443,
  8448,
  8444,
  8446,
  8446,
  8432,
  8452,
  8445,
  8444,
  8442,
  8447,
  8444,
  8454,
  8448,
  8456,
  8453,
  8449,
  8444,
  8449,
  8446,
  8444,
  8461,
  8453,
  8456,
  8447,
  8452,
  8455,
  8445,
  8455,
  8460,
  8453,
  8463,
  8460,
  8462,
  8463,
  8461,
  8468,
  8468,
  8467,
  8471,
  8471,
  8469,
  8470,
  8472,
  8472,
  8477,
  8470,
  8470,
  8475,
  8473,
  8476,
  8474,
  8476,
  8481,
  8481,
  8478,
  8477,
  8477,
  8480,
  8482,
  8479,
  8