
Current result: 85.35%.

The good news and the bad news.

31 Jan 2016

I tested two more architectures, one with maxout and one without, and achieved a better error rate: 14.50%.

nn = compose(
	conv2D(3, 128, 3),		# 31 x 31 
	relu(), 
	conv2D(128, 128, 3),	# 30 x 30
	bnorm2D(128, 0.1),		
	relu(), 
	max_pool_2d(2),			# 15 x 15
	conv2D(128, 128, 3),	# 14 x 14
	bnorm2D(128, 0.1),		
	relu(), 
	max_pool_2d(2),			# 7 x 7
	conv2D(128, 128, 5),	# 5 x 5
	flatten(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 10),
	relu(),
	softmax()
	)

nn = compose(
	conv2D(3, 128, 3),		# 31 x 31 
	relu(), 
	conv2D(128, 128, 3),	# 30 x 30
	bnorm2D(128, 0.1),		
	relu(), 
	max_pool_2d(2),			# 15 x 15
	conv2D(128, 128, 3),	# 14 x 14
	bnorm2D(128, 0.1),		
	relu(), 
	max_pool_2d(2),			# 7 x 7
	conv2D(128, 128, 5),    # 5 x 5
	flatten(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	maxout(512, 512, 4),
	xaffine(512, 10),
	relu(),
	softmax()
	)

The second network (the one with maxout) was slightly better (by a tenth of a percent), but it was much more computationally expensive to train in terms of wall-clock time. However, it was learning twice as fast when time is measured in epochs. So, in the end they came out more or less equal.

As I get more time to play with these networks, I'll try one with a maxout layer that has more, smaller sublayers instead of 4 big ones. The networks are not overfitting as badly as at the beginning, so maybe I should make them slightly more powerful. But adding convolutional layers isn't the way to go: beyond the current depth the learning process starts to crawl, and I want to see some results in a reasonable time.
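
For reference, a rough NumPy sketch of the linear maxout idea (forward pass only; the names and shapes are illustrative, and I take the third argument of maxout(512, 512, 4) to be the number of sublayers k):

import numpy as np

def maxout_forward(x, W, b):
	# x: (batch, n_in), W: (k, n_in, n_out), b: (k, n_out)
	# each of the k sublayers computes its own affine map of x...
	pieces = np.einsum('bi,kio->bko', x, W) + b		# (batch, k, n_out)
	# ...and the output is the element-wise maximum over the sublayers
	return pieces.max(axis=1)						# (batch, n_out)

# e.g. a 512 -> 512 maxout with k = 4 sublayers
rng = np.random.RandomState(0)
W = 0.01 * rng.randn(4, 512, 512)
b = np.zeros((4, 512))
y = maxout_forward(rng.randn(32, 512), W, b)		# y has shape (32, 512)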

ReLU.

30 Jan 2016

I ran three tests overnight and here are the results. Tests #1 and #2 compare whether a ReLU unit does anything when placed before the max_pool layer.

# Test #1
nn = compose(
	conv2D(3, 128, 3),
	bnorm2D(128, 0.1), 
	max_pool_2d(2),
	conv2D(128, 128, 3),
	bnorm2D(128, 0.1),
	max_pool_2d(2),
	conv2D(128, 128, 3),
	bnorm2D(128, 0.1),
	max_pool_2d(2),
	flatten(),
	dropout(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 10),
	relu(),
	softmax()
	)

# Test #2
nn = compose(
	conv2D(3, 128, 3),
	bnorm2D(128, 0.1), 
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3),
	bnorm2D(128, 0.1),
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3),
	bnorm2D(128, 0.1),
	relu(), 
	max_pool_2d(2),
	flatten(),
	dropout(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 10),
	relu(),
	softmax()
	)

Test #1 reached a test error rate of 25.62% and test #2 reached 22.50%. So, it is clearly beneficial to have an additional nonlinearity before max pooling (in fact it works out the same as taking an additional 0 value alongside the four pixels in each pooling window). The networks have a dropout layer in them, but it doesn't seem to do any good. Maybe it doesn't work very well together with the batch normalization layers?
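
A quick NumPy check of that equivalence (illustrative only): max-pooling ReLU-ed values gives the same result as pooling the raw values together with an extra 0.

import numpy as np

window = np.random.randn(2, 2)					# one 2x2 pooling window
relu_then_pool = np.maximum(window, 0).max()	# relu(), then max_pool_2d(2)
pool_with_zero = max(window.max(), 0.0)			# max over the 4 pixels plus an extra 0
assert np.isclose(relu_then_pool, pool_with_zero)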

Another network I trained is my best one so far, but with batch normalization layers added after the convolutional layers. So far its score is 18.50%, but I'll train it a little more with a lower learning rate; that usually gives an additional one or two percent.

Overfitting.

29 Jan 2016

I'm training the following network:

nn = compose(
	conv2D(3, 64, 5), 
	relu(), 
	max_pool_2d(2),
	conv2D(64, 128, 5), 
	relu(), 
	max_pool_2d(2),
	flatten(),
	xaffine(3200, 625),
	relu(),
	xaffine(625, 10),
	relu(),
	softmax()
	)
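
The 3200 in xaffine(3200, 625) is the flattened size of the last feature map; assuming valid convolutions and non-overlapping 2x2 pooling, the arithmetic is:

size = 32					# 32 x 32 input
size = size - 5 + 1			# conv2D(3, 64, 5)   -> 28
size = size // 2			# max_pool_2d(2)     -> 14
size = size - 5 + 1			# conv2D(64, 128, 5) -> 10
size = size // 2			# max_pool_2d(2)     -> 5
print(128 * size * size)	# flatten()          -> 3200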

And the results of training are:

|----------------------|-------------------|-------------------|--------------------|--------------------|
|     epoch//batch     | validation error  | avg. train error  |   avg. train nll   |  avg. train loss   |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    1 / 3 / 1600      |             50.40 |             62.56 |            1.78533 |            7.10849 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    2 / 5 / 3200      |             43.31 |             44.01 |            1.24278 |            6.14392 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   19 / 35 / 30400    |             27.57 |              2.27 |           0.108749 |            1.53602 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   36 / 55 / 57600    |             29.00 |              0.64 |          0.0603659 |           0.623329 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   37 / 55 / 59200    |             29.25 |              0.76 |           0.062785 |           0.604012 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   38 / 55 / 60800    |             26.88 |              0.91 |           0.067559 |           0.589266 |
|----------------------|-------------------|-------------------|--------------------|--------------------|

The network is clearly overfitting. I'll try to generate some more data and retrain the network.

Update #1

I tried feeding the network all 90-degree rotations of the input images and... it doesn't work. The results were the same as before; the network was just learning 4 times slower. Then I tried rotating every picture by a small angle (about ±25 degrees), and that had some effect: the network isn't overfitting as badly, but I'm still unable to beat the 25% threshold. Here are the results from training (a sketch of the rotation augmentation follows the table):

|----------------------|-------------------|-------------------|--------------------|--------------------|
|     epoch//batch     | validation error  | avg. train error  |   avg. train nll   |  avg. train loss   |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    1 / 3 / 1600      |             57.73 |             67.73 |            1.89809 |            7.22001 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   14 / 27 / 22400    |             33.50 |             29.36 |           0.851831 |            2.74235 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   15 / 27 / 24000    |             33.91 |             28.47 |            0.82959 |              2.586 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   16 / 33 / 25600    |             31.80 |             27.74 |           0.803085 |            2.43744 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   27 / 55 / 43200    |             28.51 |             20.91 |           0.618744 |            1.44841 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 142 / 285 / 227200   |             25.67 |              3.86 |           0.162509 |            0.55653 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 178 / 285 / 284800   |             26.89 |              3.11 |           0.144652 |           0.524722 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
.                      .                   .                   .                    .                    .
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 204 / 285 / 326400   |             26.44 |              2.83 |           0.132378 |           0.504439 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
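
For reference, the small-angle rotation augmentation looks roughly like this (a sketch using scipy.ndimage; the exact library, interpolation order and fill mode may differ from what the training script does):

import numpy as np
from scipy.ndimage import rotate

def augment_rotate(images, max_angle=25.0, rng=np.random):
	# images: array of HxWxC pictures; each one gets its own random angle
	out = np.empty_like(images)
	for i, img in enumerate(images):
		angle = rng.uniform(-max_angle, max_angle)
		# reshape=False keeps the 32x32 size; borders are filled with the nearest pixel
		out[i] = rotate(img, angle, reshape=False, mode='nearest', order=1)
	return out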

Maybe the model is wrong; next I'll try something deeper.

Update #2

Indeed, changing to a deeper model improved its performance.

nn = compose(
	conv2D(3, 128, 3), 
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3), 
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3), 
	relu(), 
	max_pool_2d(2),
	flatten(),
	xaffine(512, 625),
	relu(),
	xaffine(625, 10),
	relu(),
	softmax()
	)

|----------------------|-------------------|-------------------|--------------------|--------------------|
|     epoch//batch     | validation error  | avg. train error  |   avg. train nll   |  avg. train loss   |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    1 / 3 / 1600      |             62.56 |             69.23 |            1.95699 |            3.46333 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    3 / 7 / 4800      |             49.66 |             51.15 |             1.4326 |            2.72909 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    4 / 9 / 6400      |             45.82 |             48.16 |            1.35441 |            2.55898 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   9 / 19 / 14400     |             38.36 |             39.01 |            1.10923 |            1.96608 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   35 / 65 / 56000    |             29.66 |             26.28 |           0.757308 |            1.09091 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|  68 / 109 / 108800   |             27.10 |             21.16 |           0.617656 |           0.939542 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|  86 / 171 / 137600   |             26.47 |             19.58 |           0.568533 |           0.898887 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|  87 / 175 / 139200   |             24.07 |             19.15 |           0.561152 |           0.891842 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 115 / 223 / 184000   |             24.11 |             17.51 |           0.516204 |           0.857924 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 175 / 327 / 280000   |             23.40 |             15.10 |           0.451168 |           0.807419 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 176 / 353 / 281600   |             22.32 |             14.86 |           0.448549 |           0.804895 |
|----------------------|-------------------|-------------------|--------------------|--------------------|

Test error rate is 23.09%.

I've implemented a batch normalization layer and put it into the upper layers of my network. The first results are promising.
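
For the record, the training-time forward pass of batch normalization is essentially the following (NumPy sketch; the parameter names are mine, and I take the 0.1 in bnorm(512, 0.1) to be the running-average momentum):

import numpy as np

def bnorm_forward(x, gamma, beta, running_mean, running_var, momentum=0.1, eps=1e-5):
	# x: (batch, n_features); gamma, beta, running_*: (n_features,)
	mu, var = x.mean(axis=0), x.var(axis=0)
	x_hat = (x - mu) / np.sqrt(var + eps)			# normalize over the batch
	# exponential running averages, used instead of batch statistics at test time
	running_mean += momentum * (mu - running_mean)
	running_var += momentum * (var - running_var)
	return gamma * x_hat + beta						# learned scale and shift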

Update #3

Something like this takes too long to compute :(.

nn = compose(
	conv2D(3, 128, 5),   # 30x30
	conv2D(128, 128, 5), # 28x28
	conv2D(128, 128, 5), # 26x26
	conv2D(128, 128, 5), # 24x24
	max_pool_2d(2),	     # 12x12
	conv2D(128, 128, 5), # 10x10
	max_pool_2d(2),		 # 5x5
	flatten(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 10),
	relu(),
	softmax()
	)

And I got 21.95% with batch normalization : ).

Update #4

nn = compose(
	conv2D(3, 128, 3), 
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3), 
	relu(), 
	max_pool_2d(2),
	conv2D(128, 128, 3), 
	relu(), 
	max_pool_2d(2),
	flatten(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 512),
	bnorm(512, 0.1),
	relu(),
	xaffine(512, 10),
	relu(),
	softmax()
	)

With the above network I achieved a test error of 17.04%; batch normalization speeds up learning by about a factor of two.

|----------------------|-------------------|-------------------|--------------------|--------------------|
|     epoch//batch     | validation error  | avg. train error  |   avg. train nll   |  avg. train loss   |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    1 / 3 / 1600      |             52.36 |             59.41 |            1.64795 |            3.64217 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|    2 / 5 / 3200      |             46.34 |             49.98 |            1.39405 |             3.2366 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|   15 / 31 / 24000    |             31.64 |             30.38 |            0.86917 |            1.59159 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|  51 / 101 / 81600    |             25.63 |             20.44 |           0.591895 |           0.878576 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
|  52 / 101 / 83200    |             25.27 |             20.45 |           0.594006 |           0.880513 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 120 / 239 / 192000   |             21.14 |             12.89 |           0.375537 |           0.640512 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 212 / 425 / 339200   |             16.96 |              4.92 |           0.163319 |           0.405602 |
|----------------------|-------------------|-------------------|--------------------|--------------------|
| 213 / 425 / 340800   |             17.25 |              5.07 |           0.168446 |           0.410606 |
|----------------------|-------------------|-------------------|--------------------|--------------------|

Test error rate is 17.04%.

More processing power.

24 Jan 2016

So far I've been running the training on my notebook's GeForce 740M card. The speed wasn't too bad, about 2 minutes per epoch, but it wasn't very good either. I expected a great boost when running it on the lab's computers (GTX 780; by the way, where is the GTX 980 advertised in the lab guide? I cannot spot it. Has someone stolen it?). It took me a while to set up my computations there, but after a few hours of tinkering (and cursing; having no access to apt-get sucks), I had my training running there and... I was disappointed. It doesn't run that much faster at all. 2.5x faster? C'mon.

The current test error rate is 25.58%. It comes from a slightly enlarged version of the starter-code network. When it got stuck at 26%, I lowered the learning rate and momentum a bit by hand, and that helped it gain the next 1%. Maybe I should change the learning rate scheduling in some way?
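
One thing to try instead of tweaking by hand is a simple step schedule (a sketch; the drop interval and factor below are placeholders, not tuned values):

def step_schedule(base_rate, epoch, drop_every=30, factor=0.5):
	# multiply the learning rate by `factor` every `drop_every` epochs
	return base_rate * factor ** (epoch // drop_every)

# with base rate 0.1: epochs 0-29 -> 0.1, 30-59 -> 0.05, 60-89 -> 0.025, ...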

I also did more experiments with maxout layers, but I still don't know how to train them. So far I've just tried to replace the affine and rectifier layers with maxout, but 40% is the best I can do.

Tinkering with maxout layers.

23 Jan 2016

Implemented a linear maxout layer. The best result so far is 27.89%. Not good.

Some tidying up.

22 Jan 2016

Rewrote the code so that it's easier to tinker with now.
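
The core idea is that a network is now just a composition of layer callables, roughly in this spirit (a simplified sketch; the real layers also carry parameters and a backward pass):

def compose(*layers):
	# chain layer callables into a single forward function
	def forward(x):
		for layer in layers:
			x = layer(x)
		return x
	return forward

# usage: nn = compose(conv2D(3, 128, 3), relu(), ..., softmax()); scores = nn(batch)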

Hello world!

21 Jan 2016

Forked starter code from the nn homepage. Made it run on GPU.