
TODO List for 0.1 release #17

Open · 14 of 20 tasks

pavanky opened this issue Jun 17, 2015 · 53 comments

@pavanky
Member

pavanky commented Jun 17, 2015

  • Base Classes
  • Autograd
  • Neural Network
  • Solvers / Optimizers
  • Examples

@pavanky pavanky added this to the 1.0 milestone Jun 17, 2015
@pavanky pavanky changed the title Base Layer class Layer class and derivatives Jun 17, 2015
@pavanky pavanky modified the milestones: 0.1, 1.0 Jun 17, 2015
@jramapuram
Member

@pavanky : It doesn't necessarily have to be a classifier, though.
Neural networks can easily handle regression-type problems as well.
The only difference would be the loss function used, i.e. L2 loss (regression) vs. cross-entropy + softmax (classification).
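
A minimal sketch of that difference using the ArrayFire C++ API (l2Loss and crossEntropyLoss are hypothetical helpers, not part of any proposed API here):

#include <arrayfire.h>

// Mean squared (L2) loss for regression.
float l2Loss(const af::array& pred, const af::array& target) {
    af::array diff = pred - target;
    return af::sum<float>(diff * diff) / pred.dims(0);
}

// Row-wise softmax followed by cross-entropy, for classification.
float crossEntropyLoss(const af::array& logits, const af::array& oneHot) {
    // subtract the per-row max for numerical stability
    af::array shifted = logits - af::tile(af::max(logits, 1), 1, (unsigned)logits.dims(1));
    af::array probs = af::exp(shifted);
    probs /= af::tile(af::sum(probs, 1), 1, (unsigned)probs.dims(1));
    return -af::sum<float>(oneHot * af::log(probs)) / logits.dims(0);
}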

@jramapuram
Member

I think the following would be helpful from an API standpoint:

struct Model {
  int add(Layer layer_type);
  int compile(Optimizer opt, Loss loss, int max_iter = 200, bool early_stop = true);
  // Batch training up to max_iter, or until early stopping triggers (if set).
  // Accepts either explicit cross-validation data or a ratio for splitting the given data.
  float fit(DataSet train_data, DataSet target_data,
            std::tuple<DataSet, DataSet>* validation = nullptr,
            float validation_split = 0.0f);
  float train(DataSet train_data, DataSet target_data); // single step for online methods (can be called from fit)
  DataSet predict(DataSet test_data); // for evaluating new data
};

All the ints in the above are return codes.

This will give maximum flexibility in:

  1. Layer creation
  2. Model training
  3. Online + batch setting
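
As a rough usage sketch of that Model API (Dense, SGD, and MSE are hypothetical placeholder types, just to show the intended flow):

// trainX, trainY, testX are DataSets loaded elsewhere
Model model;
model.add(Dense(784, 256));                  // hidden layer
model.add(Dense(256, 10));                   // output layer
model.compile(SGD(0.01f), MSE(), /*max_iter=*/100, /*early_stop=*/true);
float err = model.fit(trainX, trainY);       // batch training
DataSet preds = model.predict(testX);        // evaluating new data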

The Layer class as you mentioned should do the following:

struct Layer {
  int connect(Layer prev); // connect to the previous layer in a deep net
  DataSet derivative(DataSet x);
  DataSet forwardPass(DataSet x);
  DataSet input();   // merely returns the input that was last used (or the output of the previous layer)
  Weights weights(); // returns just the weights
  Bias* bias();      // returns the bias (or stack of biases) if any, otherwise nullptr
  std::map<std::string, std::string> conf(); // getter returning the config of the layer itself
};
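
Wiring two such layers together might then look like this (hypothetical usage, assuming forwardPass simply chains outputs to inputs):

// trainData is a DataSet loaded elsewhere
Layer hidden, output;
output.connect(hidden);                      // hidden feeds into output
DataSet h = hidden.forwardPass(trainData);   // activations of the hidden layer
DataSet y = output.forwardPass(h);           // prediction of the network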

@jramapuram jramapuram mentioned this issue Jun 18, 2015
@futurely

According to much recent research, the most powerful neural networks are no longer stacked layers but arbitrarily complex graphs, e.g. the many advanced recursive neural networks, Facebook AI Research's Memory Networks, Google DeepMind's Neural Turing Machine, etc. So node is a more general name than layer.

connect can't reflect the direction of connections between the nodes. Except for the nodes that connect to the input data, each node's inputs are the outputs of its predecessor nodes. Caffe uses Google Protocol Buffers to define and serialize the network; Apache Thrift, which was open sourced by Facebook, supports many more languages.

The following API is inspired by Caffe's Blob, Layer and Net.

typedef shared_ptr<array> ArrayPtr;

class Data {
  public:
    explicit Data(vector<int>& size);
    int nDimension() const;
    vector<int> size();
    // Caffe exposes the raw CPU/GPU pointers for use in BLAS functions.
    // array has a high-level API, so there is no need to do so here.
    ArrayPtr weights() const;
    ArrayPtr gradients() const;
};

typedef shared_ptr<Data> DataPtr;
typedef vector<DataPtr> DataPtrVec;

class Node {
 public:
  explicit Node(NodeParam& nodeParam);
  virtual ~Node();
  // Calls initNode which subclass can override
  void init();
  // Input and output are more general than the top and bottom of Caffe
  virtual void forward(const DataPtrVec& input, 
      const DataPtrVec& output);
  // propagate_back is more general than propagate_down of Caffe
  virtual void backward(const DataPtrVec& input,
      const vector<bool>& propagate_back,
      const DataPtrVec& output);
  // The model is a DAG (Directed Acyclic Graph);
  // it's more intuitive for the predecessor to add the successor.
  void addSuccessor(Node& node);
  void addSuccessors(vector<Node>& nodes);
 protected:
  virtual void initNode();
};

// Dtype is float or double
template <typename Dtype>
class Graph {
 public:
  explicit Graph(GraphParam& graphParam);
  virtual ~Graph();
  virtual void forward(const DataPtrVec& inputs, DataPtrVec* outputs,
      Dtype* loss = NULL);
  /**
   * (Caffe) The network backward should take no input and output, since it solely
   * computes the gradient w.r.t. the parameters, and the data has already been
   * provided during the forward pass.
   */
  virtual void backward();
  Dtype forwardBackward(const DataPtrVec& inputs) {
    Dtype loss;
    DataPtrVec outputs;
    forward(inputs, &outputs, &loss);
    backward();
    return loss;
  }
};
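
Hypothetical usage of this proposed Graph API for a single training step (the GraphParam contents and the solver that applies the gradients are left out):

GraphParam param;               // would describe the topology
Graph<float> net(param);
DataPtrVec batch;               // input Data filled from the dataset
float loss = net.forwardBackward(batch);
// an external solver would then apply the accumulated gradients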

@futurely

Microsoft Research's "Computational Networks: A Generalization of Deep Learning Models" presented its open source deep learning framework CNTK as a generalization of such models. [Screenshots from the paper omitted.]

@jramapuram
Member

I like the generalizations. A few notes:

  void addSuccessor(Node& node);

seems redundant with:

  void addSuccessors(vector<Node>& nodes);

since a node can have subnodes as well, right?
Why not just have a pointer to the next and previous nodes: a basic doubly-linked list.

One very important item that your schema is missing is some form of accuracy tracking.
99% of the time you will need some form of data splitting and verification against a cross-validation data set.
It would be a good idea to incorporate this right off the bat.

@pavanky
Member Author

pavanky commented Jun 23, 2015

@jramapuram

If I am reading correctly, the following is for cases when a single node is connected to multiple successors (like the CN with shared params diagram).

void addSuccessors(vector<Node>& nodes);

I am not sure a linked list is the solution here.

@pavanky
Member Author

pavanky commented Jun 23, 2015

@futurely Thanks for the great feedback! The proposed API looks solid. However, the one issue I see is that going the Node/Graph route will reduce the amount of available parallelism in networks that use more traditional layers. I wonder if we can also specialize a Layer class that sits on top of Node to achieve this.

@pavanky pavanky changed the title Layer class and derivatives Building blocks for the library Jun 23, 2015
@pavanky pavanky changed the title Building blocks for the library Building blocks for neural networks in the library Jun 23, 2015
@pavanky
Member Author

pavanky commented Jun 23, 2015

@futurely Sorry for jumping the gun. I re-read the entire discussion. It looks like the proposed Node is just a generalized Layer.

@pavanky
Member Author

pavanky commented Jun 23, 2015

@futurely @jramapuram Would it be possible to continue the discussion over here: https://gitter.im/arrayfire/arrayfire_ml ?

@pavanky
Member Author

pavanky commented Jun 25, 2015

Suggestions from @alcinos:

  • Instead of having a list of child nodes in each Node, it may be better to have an adjacency list in the Network class.
  • The weights and gradients (and I guess you also meant the biases) shouldn't be part of the input parameters of the "forward" function; they should be private to the node (of course, some getters can be written).
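
A minimal sketch of the first point (hypothetical names, reusing the Node type proposed above): the topology lives in the Network rather than in the nodes.

#include <map>
#include <vector>

class Network {
    std::map<Node*, std::vector<Node*>> adjacency_;  // node -> its successors
  public:
    void connect(Node* from, Node* to) { adjacency_[from].push_back(to); }
    const std::vector<Node*>& successors(Node* n) { return adjacency_[n]; }
};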

@alcinos

alcinos commented Jun 25, 2015

Here is a proposition including the modifications:
https://gist.github.com/alcinos/3bedb2f7c4518fa93220

@futurely

The Graph or Network class is not needed at all. Here's a simple illustration, with the structure living entirely in the nodes:

struct Node {
    std::map<Node*, float> neighbors;  // successor -> edge cost
    Data data;
};

// Edge lookup; fails if the two nodes are not connected.
float cost(const Node& x, Node* y) {
    auto it = x.neighbors.find(y);
    if (it == x.neighbors.end()) throw std::runtime_error("Not possible");
    return it->second;
}

std::vector<Node> nodes;

@alcinos

alcinos commented Jun 26, 2015

@futurely
The point is that the net structure should be independent from the nodes. The same node can be involved in different topologies depending on the context. For example, in stacked autoencoders, the training is done layer by layer, which requires a different topology to train each level.

@futurely

#22.

@pavanky
Member Author

pavanky commented Jun 30, 2015

@alcinos Can you explain how having an adjacency list helps in the situations you mentioned? I still think a centralized location for the representation is the better option; however, I do not see it solving the problems of greedy layer-by-layer training.

@alcinos

alcinos commented Jun 30, 2015

@pavanky Well, let's say we have a 3-layer stacked autoencoder.
I will denote by "O" an output layer, "I" an input layer, "E" an encoding layer and "D" a decoding layer.
The first step is to greedily train the first layer. This training is performed on the network:
I -> E1 -> D1 -> O. After some number of iterations (or once the reconstruction error goes below a given threshold), we can train the second layer. This time, the network is
I -> E1 -> E2 -> D2 -> O (we want the second layer to reconstruct the encoding of the first one).
And so on for the other layers.
The last step is a fine-tuning of the weights (by gradient descent), performed on the net:
I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O

Eventually, depending on the application, the interesting part of the trained net is likely the output of E3 (high-level features of the input). Once trained, we'll thus only use the first part of the net: I -> E1 -> E2 -> E3.

The point is that in all those training steps the architecture of the net is different, hence it makes more sense to store this architecture independently of the nodes.
Moreover, several architectures can be used more or less concurrently for the same nodes: for example, we can use the net I -> E1 -> E2 -> E3 as a feature generator for some problem (control, classification, ...), while constantly improving the reconstruction error (of the full net I -> E1 -> E2 -> E3 -> D3 -> D2 -> D1 -> O) given new samples that come from experience (the training set of the net is not always fully available from the beginning).
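
For instance (with a hypothetical Network type that owns the adjacency information), the same node objects could be assembled into a different topology for each phase:

// the same I, E1..E3, D1..D3, O node objects are shared throughout
Network pretrain1({&I, &E1, &D1, &O});                      // greedy step 1
Network pretrain2({&I, &E1, &E2, &D2, &O});                 // greedy step 2
Network finetune ({&I, &E1, &E2, &E3, &D3, &D2, &D1, &O});  // fine-tuning pass
Network features ({&I, &E1, &E2, &E3});                     // feature generator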

@pavanky
Member Author

pavanky commented Jun 30, 2015

I understand what autoencoders are doing. My question was more about implementation. What you are suggesting requires updating the adjacency list after each step, or creating a new network after each step. Am I correct in this observation?

@alcinos

alcinos commented Jun 30, 2015

Absolutely!

@alcinos

alcinos commented Jul 1, 2015

@futurely The code you are showing doesn't feature any encapsulation. This is a problem:

  • What if we want to change the number of hidden layers?
  • What if we would like to use convolution layers instead of fully connected ones?
  • ...

@futurely

Caffe creator's plan on "Improving Caffe: Some Refactoring".

@futurely

Minerva's DAG and operators based implementation.

@pavanky
Member Author

pavanky commented Jul 14, 2015

@futurely @alcinos @jramapuram I pushed some of the basic classes out. I do not have any convolution nodes yet, and there are no network classes yet. I am going to push some sample network classes out later today to demonstrate how the structure can be stored.

The network class will also extend Node so that it can be used as a super node in a larger network.

I understand that this is not much right now, but feedback is welcome.

@jramapuram
Member

Looks great so far, @pavanky! A few queries/comments:

  • Why is lr > 1 in your example for the perceptron? Generally it is << 1.0.
  • Is the thought to make LinearLayer : LinearNode, etc.? I like this idea.
  • normalize() should be factored out of Node in my opinion (into a Normalize class). There are many types of normalization that one might decide to use, for example.

I will take a deeper look a little later, but good job!

@unbornchikken
Member

@pavanky @jramapuram I believe forward and backward computations have to be refactored out of the base interface too, because I can see no way to support forward-phase gradient computations with the current Node class. I mean RTLR, for example.

@jramapuram
Member

@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)

@unbornchikken
Member

But for calculating RTLR you'll need a totally different kind of network traversal logic than forward or backward. I meant that if the traversal logic is refactored out of the Node interface into separate classes (ForwardPhase, BackwardPhase, RTLRPhase, etc.), it's easy to define separate ones for every possible algorithm AFML wants to support in the future and beyond.

@jramapuram
Member

I think we might be talking about two separate things. Are you talking about Real-Time Recurrent Learning: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.9724&rep=rep1&type=pdf ?

@unbornchikken
Member

@jramapuram Exactly.

@jramapuram
Member

Exactly in that it is the same? Or different? I didn't find any mention of RTLR online. RTRL involves just computing the gradients of time n+1 using the gradients and activations of time n, in addition to accumulating the gradients from 0 --> T_current. It is fully online. My solution of having virtual forward/backward functions should easily solve this problem.

@unbornchikken
Member

@jramapuram :) Yeah, I have trouble with acronyms; I often write LSTM as LTSM too. Sorry. OK, but for calculating gradients you'll need to feed in the desired outputs and do a special kind of forward iteration of the whole network for each weight:

http://www.willamette.edu/~gorr/classes/cs449/rtrl.html (18)

Or maybe @pavanky doesn't want to support RTRL at all in favor of LSTM, which is a separate structure altogether.

@jramapuram
Member

@unbornchikken : no worries, just wanted to be on the same page :). So RTRL is an optimization algorithm. It can be applied to LSTMs/GRUs/... (insert your RNN flavor here). If you look at that link (specifically steps #7 and #8), it isn't really anything fancy. You need to keep an accumulator for the error up to the current timestep as well as the gradient (as opposed to BPTT, where things are unfolded all the way from T_current --> T0).

You can ignore a lot of the delta function stuff mentioned there. It is, as in the paper, a way to unify the input/output pairs.

@unbornchikken
Member

@jramapuram What I tried to say is that to support RTRL with your rigid Node abstract class, you gotta define some other methods on it for accepting and holding these forward-propagated p values. If you have some other special algorithm, or are trying to prototype one, you gotta modify the Node class for that too, which cannot be done at home; you gotta PR the master there. Am I right? But if the Node class's role changed to only holding and providing values, and those traversal algorithms could be defined in separate classes on top of this basic interface, then you'd have the ability to design the next Realtime Backward-Forward Long Term Populated Short Theory of Everything at Once neural network learning algorithm using ArrayFire ML. This ability is often missing from the ML libraries out there because of these early decisions.

@jramapuram
Member

@unbornchikken : Agreed. I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class.

If you look at Linear.hpp, it implements its own forward/backward.
The easiest way to solve your problem is to make a RecurrentNode class that inherits from Node.
The details of whether to use RTRL or BPTT can be resolved there.

you gotta modify the Node class for that too, which cannot be done at home

If you are implementing a new algorithm then you will have to do this. This is C++, not Python.
Abstracting forward/backward away into its own class doesn't seem like it would gain anything at this point. If anything, it adds another level of indirection, i.e. you will still have to recompile the library.

Now, that being said, it makes sense to have an Optimizer class. However, this is separate from the Nodes, as optimizers just take a cost and a gradient and do some form of ascent/descent.
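
A minimal sketch of that RecurrentNode idea (hypothetical members, reusing the Node and DataPtrVec types sketched earlier in this thread):

class RecurrentNode : public Node {
    af::array state_;      // hidden state carried across timesteps
    af::array gradAccum_;  // accumulated derivatives (RTRL-style)
  public:
    void forward(const DataPtrVec& input, const DataPtrVec& output) override {
        // combine input with state_, write activations to output, update state_
    }
    void backward(const DataPtrVec& input,
                  const std::vector<bool>& propagate_back,
                  const DataPtrVec& output) override {
        // RTRL: update gradAccum_ using the previous step's derivatives;
        // a BPTT variant would instead unroll over stored timesteps
    }
};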

@unbornchikken
Member

@jramapuram You're talking about object-oriented design and I'm talking about composite design here. In your case RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported-operation exception, a forward method that accepts a whole new vector for storing derivatives along with the input, and another forward method propagating p values, declared in the RTRLNode class. In my case Node doesn't have forward and backward methods; it only provides connections, weights, and other weight-related values. Forward, backward, etc. are implemented separately, and an Optimizer component groups them to provide training functionality.
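
A rough sketch of that composite design (all names hypothetical): Node only holds connections and weight-related state, while each traversal algorithm lives in its own class on top of that interface.

class Node {
  public:
    std::vector<Node*> successors;  // connections only
    Weights weights;                // weight-related values only
};

struct TraversalPhase {             // one subclass per algorithm
    virtual void run(Node& root) = 0;
    virtual ~TraversalPhase() {}
};

struct ForwardPhase  : TraversalPhase { void run(Node& root) override { /* ... */ } };
struct BackwardPhase : TraversalPhase { void run(Node& root) override { /* ... */ } };
struct RTRLPhase     : TraversalPhase { void run(Node& root) override { /* forward-phase gradients */ } };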

@jramapuram
Member

In your case RTRLNode : RecurrentNode will have backward and forward methods that throw an unsupported-operation exception, and a forward method that accepts a whole new vector for storing derivatives

RTRLNode will have two private internal members that merely accumulate the error and derivatives locally. All RTRL is doing is estimating the derivatives for t+1 using the derivatives for t (this is done on a node-by-node basis). Each unit is a linear combination of the previous layer's activations coupled with a gemv call. The only thing that needs to be passed along is the current update (which is already being done).

object-oriented design and I'm talking about composite design

In either case, how does your solution help you prototype faster?

@unbornchikken
Member

Maybe this is not the right issue to discuss the internals of RTRL, but we can agree that you will end up having an RTRLNode class that has to throw a not-supported exception in its backward method, right? That's because your fundamental design hardcodes the ability to serve a single suite of algorithms. Btw, I haven't talked about anything that helps pavan prototype faster; I've talked about something that I believe makes the fundamental design more extensible. It was only that: "I believe forward and backward computations have to be refactored from the base interface too" :)

@pavanky
Member Author

pavanky commented Jul 15, 2015

Why is lr > 1 in your example for perceptron? Generally it is << 1.0

The LR was greater than 1 because I knew it was going to converge for that simple test.

Are the thoughts to make LinearLayer : LinearNode , etc ? I like this idea

Each "Node" here is the equivalent of a "Layer" from caffe and "Module" from torch-nn.

normalize() should be factored out of Node in my opinion (into a Normalize class). There are many types of normalization that one might decide to use, for example.

If you can point to some examples, I will look into this.

I think a simple solution is to create a virtual forward / backward in the Node class & implement it in each class. That way you can unwind recurrences if need be (or keep a truncated moving average of the weights as in RTRL)

That is how things are at the moment.

But for calculating RTLR you'll need a totally different kind of network traversal logic than forward or backward.

Each "Node" can be a simple layer or a composite node like Multi Layer Perceptrons, Recurrent Neural networks, Autoencoders and Restricted Boltzman Machines etc. The composite nodes can be used as is or by plugging them into a larger network. The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network. For training these networks, the methods used will obviously be different.

That said, none of the API is final; it will keep changing until we can address all the problems. This project is still in the prototyping phase, after all.

P.S. If you want to have prolonged discussions, can we move to the gitter room instead? :-)

@jramapuram
Member

const double lr = 10;

@pavanky : This is what I was referring to in perceptron.cpp

That is how things are at the moment..

Yea, noticed that in the later comments. I need to learn to read :)

If you can point to some examples, I will look into this.

Do you mean examples of other normalization strategies? An example would be whitening using SVD.
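
As a concrete sketch of that (a hypothetical helper built on the ArrayFire C++ API; whiten is not an existing library function):

#include <arrayfire.h>

// PCA-whitens the rows of X (samples x features) via an SVD of the covariance.
af::array whiten(const af::array& X, float eps = 1e-5f) {
    af::array mu  = af::mean(X, 0);                          // per-feature means
    af::array Xc  = X - af::tile(mu, (unsigned)X.dims(0));   // center the data
    af::array cov = af::matmulTN(Xc, Xc) / (float)(X.dims(0) - 1);
    af::array u, s, vt;
    af::svd(u, s, vt, cov);                                  // cov = u * diag(s) * vt
    af::array scale = af::diag(1.0f / af::sqrt(s + eps), 0, false);
    return af::matmul(Xc, af::matmul(u, scale));             // decorrelate + rescale
}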

@unbornchikken : we can discuss further if you like, but I think the only other area where removing the forward/backward paradigm would matter is when straying away from neural networks. In that case you can simply ignore 'backward'. Otherwise, extending them should prove sufficient. RTRL still has a forward and a backward btw; you don't need to throw an undefined exception.

@pavanky
Member Author

pavanky commented Jul 15, 2015

@jramapuram The LR was only high because the test (binary AND) is a linear operation and the weights will always be updated in the same direction once the algorithm starts. I just wanted to speed things up a little while testing. I understand why learning rates are usually small, but in this case it does not really matter.

@pavanky
Member Author

pavanky commented Jul 15, 2015

Do you mean examples of other normalization strategies? An example would be whitening using SVD.

I do not see how such strategies can be applied at each node. From what I understand, this can be calculated near the output and the norm can be propagated back for scaling up/down. Perhaps I should rename the method to simply say "scale".

@jramapuram
Member

Scale sounds more appropriate IMO. But do we want to do internal node scaling?
Generally the scaling op should be done beforehand (on the dataset directly) or using something like batch normalization.

@pavanky
Member Author

pavanky commented Jul 15, 2015

I agree. I will just merge normalize and the update step for now.

@unbornchikken
Member

@pavanky "The "forward" and "backward" methods help the composite nodes interface with other nodes in the larger network."

Oh, I get it! This was not obvious; I've seen many ML libraries out there that have this fw/bw-only traversal hard-coded in their fundamentals. Now I'm getting excited to see where your design leads. I cannot wait to get my hands on something that supports this meta-network building approach. I'm imagining something like Peltarion Synapse but with 100x the performance.

@UniqueFool

Also see: arrayfire/arrayfire#1441

@pavanky pavanky changed the title Building blocks for neural networks in the library TODO List for 0.1 release Jul 10, 2017
@pavanky
Member Author

pavanky commented Jul 10, 2017

Updated the issue to reflect the new architecture.
