# First Somewhat Successful Scale Up with XTCAV

Attempt to classify lasing vs. nolasing.

## Input

* 80 xtcav images, randomly chosen from amo86815 runs 69 (no lasing), 70 and 71 (lasing), balanced so that 40 are no lasing and 40 are lasing
* background subtracted (from dark run 68)
* downsampled by a factor of 2 along each axis, to (284, 363) from (568, 726)
* log transformed: $ I \rightarrow \log(1 + \max(0,I)) $

input tensor shape=(80, 284, 363, 1) # one channel

8,247,360 pixels, 33 MB
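
As a quick check of the figures above, here is a minimal sketch of the log transform and of the tensor footprint. It assumes 4-byte (float32) pixels, which these notes do not state explicitly but which matches the 33 MB figure; the function names are illustrative only.

```python
import math

def log_transform(pixel):
    """Map I -> log(1 + max(0, I)); negative values left by background subtraction clamp to 0."""
    return math.log1p(max(0.0, pixel))

def tensor_footprint(shape, bytes_per_value=4):
    """Return (number of values, size in bytes) for a dense tensor; float32 is assumed."""
    count = 1
    for dim in shape:
        count *= dim
    return count, count * bytes_per_value

num_pixels, num_bytes = tensor_footprint((80, 284, 363, 1))
print(num_pixels, round(num_bytes / 1e6, 1), "MB")
```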

## Network

* Two layers - one convolutional layer, one fully connected layer mapping to two outputs
* softmax classifier

* convolutional kernels: window=(16,16) numchannels=16, strides=(2,2)
* bias_init=0, K_init_stddev=0.01

* conv output shape=(80, 142, 182, 16)   (126 MB)
* relu activation
* avg pool: window=(12,12) strides=(10,10)
* pool output shape=(80, 15, 19, 16)  1.39 MB

The cvn layer produces 4560 output units (15 x 19 x 16) that feed the fully connected output layer, with

* Weights = (4560,2)  
* bias_init=0
* W_init_stddev=0.01

The convnet has 13,234 unknown variables: 4,112 (31%) in the convolutional layer and 9,122 (69%) in the fully connected output layer.

It maps 103,092 input features (284 x 363 pixels) to 2 outputs.
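
The shapes and variable counts above can be reproduced with a little arithmetic. This sketch assumes `'SAME'` padding, where each spatial output size is `ceil(input / stride)`; that is inferred from the reported shapes rather than stated in these notes.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size under 'SAME' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

height, width, in_channels = 284, 363, 1
kernel, conv_channels, conv_stride = 16, 16, 2
pool_stride = 10
num_classes = 2

conv_shape = (same_padding_output(height, conv_stride),
              same_padding_output(width, conv_stride), conv_channels)
pool_shape = (same_padding_output(conv_shape[0], pool_stride),
              same_padding_output(conv_shape[1], pool_stride), conv_channels)

# Units handed to the fully connected layer after flattening the pool output.
flat_units = pool_shape[0] * pool_shape[1] * pool_shape[2]
# Kernel weights plus one bias per output channel.
conv_params = kernel * kernel * in_channels * conv_channels + conv_channels
# Weight matrix plus one bias per class.
fc_params = flat_units * num_classes + num_classes

print(conv_shape, pool_shape, flat_units)
print(conv_params, fc_params, conv_params + fc_params)
```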

## loss/optimization

* average cross entropy, i.e.
``` 
reduce_mean( softmax_cross_entropy_with_logits( H_O, labels))
```
where `H_O` is the `(80,2)` output tensor from the linear operations of the fully connected output layer, and `labels` are the one-hot vectors of the lasing/no lasing truth.
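
To make the loss concrete, here is a plain Python sketch of mean softmax cross entropy. It is not the TensorFlow implementation, and the function names are illustrative. It also shows why a freshly initialized network (small weights, zero biases, so logits near zero) starts at a loss of about $\ln 2 \approx 0.6931$, which is the value logged at step 1 in the job output further down.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross entropy between softmax(logits) and a one-hot label, for one sample."""
    shift = max(logits)  # subtract the max for numerical stability
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return -sum(t * (v - log_norm) for v, t in zip(logits, one_hot))

def mean_loss(batch_logits, batch_labels):
    """Average the per-sample cross entropy over the batch."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# With zero logits both classes get probability 0.5, so the loss is ln(2)
# regardless of the labels.
logits = [[0.0, 0.0]] * 4
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
initial_loss = mean_loss(logits, labels)
print(round(initial_loss, 4))
```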

* no regularization
* momentum optimizer with mom=0.4
* learning rate starts at 0.01, exponential decay rate of 0.96, decay steps=10, staircase=True
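
The learning rate schedule above can be written out directly. This is a sketch of exponential decay with a staircase, assuming the step counter starts at 0; the function name is illustrative.

```python
def staircase_learning_rate(global_step, base_rate=0.01, decay_rate=0.96, decay_steps=10):
    """Exponential decay with staircase=True: the exponent uses integer division,
    so the rate only changes once every decay_steps steps."""
    return base_rate * decay_rate ** (global_step // decay_steps)

for step in (0, 5, 10, 20, 30):
    print(step, round(staircase_learning_rate(step), 4))
```

The rate stays at 0.01 for the first 10 steps and then drops by 4% every 10 steps, which is the pattern in the `learnrate` column of the job output.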

## Job

* told tensorflow to use 4 threads via 
```
sess = tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads = FLAGS.intra_op_parallelism_threads))
```
which seemed to work; I get these messages:
```
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 4
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 24
```

After about 10 hours the job had done 30 steps, about 20 minutes a step. Below is some output. Key to the columns:

* `m1s` is the number of 1's in the training data (40/80); this is more interesting for minibatch and stochastic gradient descent
* `tr.acc/#1s` is the training set accuracy and the number of 1's predicted in the training set
* `tst.acc/#1s` is the test set accuracy and the number of 1's predicted in the test set, which has 60 randomly chosen samples (not overlapping the train set)
* `loss` is the same as `xentropy` since there is no regularization
* `gr-ang` is $\cos(g_{t-1}, g_t)$, the cosine of the angle between successive gradients, where $g_t$ is the gradient (a 13,234 long vector) at step $t$
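
For reference, a minimal sketch of how a `gr-ang` style value could be computed from two flattened gradient vectors. Returning 0 when either vector is zero is an assumed convention (for example at the first step, when there is no previous gradient); it is not necessarily what the job itself does.

```python
import math

def gradient_cosine(previous, current):
    """Cosine of the angle between two gradient vectors given as flat lists."""
    dot = sum(p * c for p, c in zip(previous, current))
    norm_prev = math.sqrt(sum(p * p for p in previous))
    norm_curr = math.sqrt(sum(c * c for c in current))
    if norm_prev == 0.0 or norm_curr == 0.0:
        return 0.0  # assumed convention when a gradient is missing or zero
    return dot / (norm_prev * norm_curr)

print(gradient_cosine([1.0, 0.0], [2.0, 0.0]))  # same direction
print(gradient_cosine([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

A value that stays close to 1 means successive gradients point in nearly the same direction.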

```
  step m1s  tr.acc/#1s tst.acc/#1s xentropy  loss  |grad| gr-ang  learnrate 
     1  40   0.49  75   0.52  57   0.6931   0.6931  0.260   0.00   0.0100
     6  40   0.66  47   0.57  36   0.6882   0.6882  0.253   1.00   0.0100
    11  40   0.59  43   0.50  30   0.6823   0.6823  0.297   1.00   0.0096
    16  40   0.60  44   0.48  31   0.6745   0.6745  0.334   1.00   0.0096
    21  40   0.60  46   0.47  30   0.6653   0.6653  0.350   1.00   0.0092
    26  40   0.62  48   0.43  32   0.6558   0.6558  0.354   1.00   0.0092
    31  40   0.64  49   0.42  31   0.6463   0.6463  0.351   1.00   0.0088
```
It does seem to be learning the training data: there is only about a 1/161 chance of getting .64 accuracy at random ($Z$ score of 2.5). However, it is not learning anything that helps with the test data; test accuracy falls from 0.52 at step 1 to 0.42 at step 31.


### Job status
Doing `bjobs -l` I see
```
Mon Feb 22 22:46:01: Started 1 Task(s) on Host(s) <psana1411>, Allocated 1 Slot
                     (s) on Host(s) <psana1411>, 

Tue Feb 23 08:56:29: Resource usage collected.
                     The CPU time used is 86926 seconds.
                     MEM: 4.1 Gbytes;  SWAP: 4 Mbytes;  NTHREAD: 36
                     PGID: 904;  PIDs: 904 977 979 


 MEMORY USAGE:
 MAX MEM: 6.7 Gbytes;  AVG MEM: 3.6 Gbytes
```
When I run `top` on psana1411, I see other jobs running, so I am not reserving slots correctly. I should add `-n X` to the submission. Here is a ganglia plot:

![ganglia](psana1411_ganglia_first_scale_up.png)


```
# calculate a Z score
acc = .64 #.42
mu=40
X=acc*80
std=(.5*.5*80)**0.5
Z = (X-mu)/std
print Z
```

```
2.5043961348
```
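
The step from a Z score to a "1 in N" chance can be sketched as follows. This uses the normal approximation to the binomial and is written for Python 3; the function names are illustrative.

```python
import math

def z_score(accuracy, num_samples, p_chance=0.5):
    """Z score of an observed accuracy under a binomial(num_samples, p_chance) null."""
    mean = p_chance * num_samples
    std = math.sqrt(p_chance * (1.0 - p_chance) * num_samples)
    return (accuracy * num_samples - mean) / std

def upper_tail_probability(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z = z_score(0.64, 80)
p = upper_tail_probability(z)
print(round(z, 4), "-> about 1 in", round(1.0 / p))
```

Rounding the Z score to 2.5 before taking the tail gives the 1/161 figure quoted earlier; the unrounded Z score gives a slightly smaller probability.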


# Unsuccessful Scale Up Issues with XTCAV data

Before dropping down to the small network, I was testing with 

* 2-3 convnet layers
* 2 fully connected layers
* 'whitening', or basically GCN (global contrast normalization): each pixel normalized over the minibatch to mean=0, stddev=1
* not whitening
* high momentum, .9, .95, .99
* different learning rates, up to .1, down to .0005 (I think)
* different bias inits
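
As an illustration of the 'whitening' item above, here is a minimal sketch of per-pixel standardization over a minibatch, with each image flattened to a list. The `eps` term and the function name are assumptions added for this example; the actual whitening code may differ.

```python
import math

def standardize_per_pixel(batch, eps=1e-6):
    """Give each pixel position mean 0 and stddev 1 across the minibatch.

    batch is a list of flattened images of equal length. eps (an assumed
    safeguard) avoids division by zero for pixels that never vary.
    """
    n = len(batch)
    columns = list(zip(*batch))  # one tuple of values per pixel position
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return [[(v - m) / (s + eps) for v, m, s in zip(img, means, stds)]
            for img in batch]

batch = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
for row in standardize_per_pixel(batch):
    print([round(v, 3) for v in row])
```

With a very small minibatch the per-pixel mean and stddev are estimated from only a handful of values, so the normalized images become noisy.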

## minibatch and SGD

* Tried to read 128 random images for each step
* worried that all time was going to reading the images
* dropped down to minibatch of 8
* smaller minibatch - maybe not a good idea to whiten
* accuracy was not improving
* train steps still long

## Swing between all 1s or 0s, image normalization

* My networks often swing between predicting all class 0 and all class 1
* this seems weird: shouldn't proper random initialization and a small learning rate give me a balanced amount of 0's vs. 1's in the loss function?
* Fiddled a lot with bias/weight initialization, learning rate, etc.
* after reading ch 13 of deeplearningbook.net, I think some image preprocessing is important: a dynamic range of [-1,1] or [0,1], a log transform, whitening; I am not sure whether we should also normalize by the per-pixel stddev
* after this, inference function predicted more balanced 0s vs 1s during training


# Simulated Data Scale Up

Here we report on scaling up with simulated data.

## simulated data

This is signal vs. noise

### miniBatch
```
size=128 images 
img I, shape=(40,40)
signal: 10+ vertical line - mean=5, stddev=.1
noise: stddev=.1
```
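
A minimal sketch of how such a minibatch could be generated. Several details are assumptions made for this example rather than taken from the actual generator: the noise has mean 0, "10+" is read as a line at least 10 pixels long, the line position is random, and labels are drawn at random rather than balanced. All names are hypothetical.

```python
import random

def make_image(is_signal, size=40, min_line_length=10, rng=random):
    """Return a size x size image (list of rows) of Gaussian noise,
    optionally with a vertical line drawn into one column."""
    image = [[rng.gauss(0.0, 0.1) for _ in range(size)] for _ in range(size)]
    if is_signal:
        length = rng.randint(min_line_length, size)  # assumed: 10 or more pixels
        top = rng.randint(0, size - length)
        column = rng.randrange(size)
        for row in range(top, top + length):
            image[row][column] = rng.gauss(5.0, 0.1)  # line pixels: mean=5, stddev=.1
    return image

def make_minibatch(batch_size=128, seed=0):
    """Return (images, labels) with label 1 for signal and 0 for noise."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(batch_size)]
    images = [make_image(label == 1, rng=rng) for label in labels]
    return images, labels

images, labels = make_minibatch()
print(len(images), len(images[0]), len(images[0][0]), sum(labels))
```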

## network
A 4-layer feed-forward network:

* whiten: `WI=ApproxWhiten(I)`, applied to each pixel over the miniBatch
* architecture/loss

```
Layer    kernel   kstrides  bias nonlinear  lrn maxpool-ksize strides
CVN01   (5,5,1,4)  (2,2)    Yes   relu     False  (3,3)      (2,2)
CVN02   (5,5,4,3)  (2,2)    Yes   relu     False  (3,3)      (2,2)

CVN01: (40,40) -> (20,20)
CVN02(20,20).shape=(10,10)

H03     W-shape=(100,5)
H04     W-shape=(5,2)

loss = avg(xentropy)  + 0.005 L1-norm(CVN01_K, CVN02_K, H03_W)
```
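
The regularized loss in the last line of the block above can be sketched as follows, with each weight tensor flattened to a list. The function names are illustrative; only the 0.005 coefficient and the choice of regularized tensors come from the notes.

```python
def l1_norm(values):
    """Sum of absolute values of a flattened weight tensor."""
    return sum(abs(v) for v in values)

def regularized_loss(mean_xentropy, weight_groups, l1_coefficient=0.005):
    """Average cross entropy plus an L1 penalty over the given weight tensors
    (here standing in for CVN01_K, CVN02_K and H03_W)."""
    penalty = sum(l1_norm(group) for group in weight_groups)
    return mean_xentropy + l1_coefficient * penalty

# Toy weights: the penalty is 0.005 * (1 + 2 + 3) = 0.03.
loss = regularized_loss(0.5, [[1.0, -2.0], [3.0]])
print(round(loss, 4))
```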

## scale up - more parameters

Still 40 x 40 images, but more parameters:

* momentum .9
* learnrate .01
* CVN01 kernel window (12,12) channels 20, strides (4,4)
* CVN02 kernel window (8,8) channels 16   strides (3,3)
* hidden units 10

got to 100% accuracy in 130 steps.

## scale up - larger image - 100 x 100

* go from (40,40) image to (100,100) image
* initialize biases as:

```
CVN01_B=.1
CVN02_B=.1
H03_B=.1
H04_B=0
```
got 100% acc in 310 steps

## scale up - larger image 200 x 200, more parameters

* 307k parameters, 160 kB per image (200 x 200 pixels at 4 bytes each)
* CVN01: kernel window (32,32) channels 16, strides (2,2), relu, maxpool window=(4,4) strides=(3,3)
* CVN01 output is (34,34)
* CVN02 output is (6,6) with 16 channels
* H03: 576 inputs, 50 outputs
* H04: 2 outputs

trained at 12 sec/step, got 100% on validation in 5 minutes (12 steps)

Seemed to use 100GB mem on ganglia?

## scale up to big - 1000 x 1000 image

Went to 905,912 parameters and 4MB images (1,000,000 pixels). Created 7GB of simulated data in under 2 minutes; however, ganglia reported that memory usage went to swap. Details of the architecture:


```
(tensorflow)psana1612: ~/condaDev/xtcav-mlearn/convnet $ bsub -q psnehq -x -I python convnet_app.py -c convnet_flags_big.py -t 32 -d 1000
Warning: job being submitted without an AFS token.
Job <7322> is submitted to queue <psnehq>.
<<Waiting for dispatch ...>>
<<Starting on psana1503.pcdsn>>
SimpleImgData - about to produce 1780 (1000 x 1000) images (6790.16 MB)
  made data in 100.02 sec
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 32
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 32
('CVN01', 'CVN02')
('H03', 'H04')
evolving learning rate
whitenedInput.shape=(128, 1000, 1000, 1)
CVN01:
             CVN_K.shape=(32, 32, 1, 16)
          CVN_conv.shape=(128, 500, 500, 16)
             CVN_B.shape=(16,)
     CVN_nonlinear.shape=(128, 500, 500, 16)
         CVN_pool.shape=(128, 167, 167, 16)
             CVN_U.shape=(128, 167, 167, 16)
CVN02:
             CVN_K.shape=(32, 32, 16, 16)
          CVN_conv.shape=(128, 84, 84, 16)
             CVN_B.shape=(16,)
     CVN_nonlinear.shape=(128, 84, 84, 16)
         CVN_pool.shape=(128, 28, 28, 16)
             CVN_U.shape=(128, 28, 28, 16)
H03:
   H_W.shape=(12544,50)
   H_B.shape=(50,)
H04:
   H_W.shape=(50,2)
   H_B.shape=(2,)
convnet has 905912 unknown variables, 278560 (30%) in convnet layers, and 627352 (69%) in hidden layers.
convnet maps 1000000 features to 2 outputs for hidden layers.
initial loss=0.70
```
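
The shapes in this log allow a rough estimate of where the memory goes. The sketch below recomputes the variable count and sums the forward activations for one minibatch of 128, assuming 4-byte (float32) values, which is consistent with the 6790.16 MB reported for 1780 images. The dictionary keys combine the layer and tensor names from the log. The total is only a lower bound: it ignores gradients, any temporaries TensorFlow allocates, and the simulated dataset itself.

```python
def num_values(shape):
    """Number of elements in a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

BYTES_PER_VALUE = 4  # float32 assumed

variables = {
    "CVN01_K": (32, 32, 1, 16), "CVN01_B": (16,),
    "CVN02_K": (32, 32, 16, 16), "CVN02_B": (16,),
    "H03_W": (12544, 50), "H03_B": (50,),
    "H04_W": (50, 2), "H04_B": (2,),
}
activations = {
    "whitenedInput": (128, 1000, 1000, 1),
    "CVN01_conv": (128, 500, 500, 16),
    "CVN01_nonlinear": (128, 500, 500, 16),
    "CVN01_pool": (128, 167, 167, 16),
    "CVN02_conv": (128, 84, 84, 16),
    "CVN02_nonlinear": (128, 84, 84, 16),
    "CVN02_pool": (128, 28, 28, 16),
}

total_params = sum(num_values(s) for s in variables.values())
activation_bytes = sum(num_values(s) for s in activations.values()) * BYTES_PER_VALUE
print(total_params, "variables")
print(round(activation_bytes / 2**30, 2), "GiB of forward activations per minibatch")
```

The first convolution output alone is about 2 GB per minibatch under these assumptions, so activations rather than the 905,912 variables dominate, which may help explain the memory pressure.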
