Implement Matrix class to abstract algorithms away from data storage details #54

Closed
kloudkl opened this issue Jan 24, 2014 · 3 comments

@kloudkl
Contributor

kloudkl commented Jan 24, 2014

Currently, the algorithm code is intimately aware of the memory layout of the underlying data. Adding a Matrix class in between helps separate the concerns of different modules, which is good practice in software engineering.

The biggest benefit is simpler code and improved development productivity. It will also make existing and future algorithms easier to understand. As a result, development and adoption will accelerate.

The Matrix class is intended to be a view of a 2D array contained in a Blob. Its main purpose is to provide high-level wrappers for the common operations.

using boost::move;

template<typename Dtype>
class Matrix {
public:
  Matrix();
  explicit Matrix(shared_ptr<Blob<Dtype> > blob);
  Matrix<Dtype> mul(Matrix<Dtype>& that) {
    Matrix<Dtype> product;
    caffe_gpu_gemm(...);
    return move(product);
  }
  Matrix<Dtype> add(Matrix<Dtype>& that);
  // minus, div, rdiv, sqr, pow, exp, conv, sum, max, min, mean, std,
  // ones, zeros, rand, randn, size, rows, cols, row, col, roi,
  // t/transpose, rot90, ...
private:
  shared_ptr<Blob<Dtype> > blob_;
  size_t num_;
  size_t channel_;
  size_t offset_;
};

With this class, we can write code like the following snippets.
The convolution:

output = image.conv(filter);

The fully connected layer:

output = weight.mul(input).add(bias);

The ReLU activation:

activation = input.max(0);

The Softmax activation:

activations = input.exp();
probs = activations.rdiv(activations.sum(dim));

As you can see, the API is highly inspired by MATLAB, which also motivated ArrayFire C++. But of course the snippets are only rough sketches, and many more details need to be considered. For example, if the performance cost of boost move operations is too high, they could be replaced by shared_ptr, which would complicate the user code a little. Another question is whether we should pass in a shared_ptr to the result matrix instead of returning it (see the sketch below). More importantly, the GPU code may differ greatly from the CPU code depending on how well CUDA plays with the proposed API syntax.
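A rough sketch of that output-parameter alternative, assuming the same Matrix and Blob types as in the class above; the names and signatures are illustrative only, not a final API:

template<typename Dtype>
class Matrix {
public:
  explicit Matrix(shared_ptr<Blob<Dtype> > blob);
  // The caller preallocates result, so no temporary Matrix has to be
  // moved or copied out of the function.
  void mul(const Matrix<Dtype>& that, shared_ptr<Matrix<Dtype> > result);
  void add(const Matrix<Dtype>& that, shared_ptr<Matrix<Dtype> > result);
private:
  shared_ptr<Blob<Dtype> > blob_;
};

The fully connected layer would then read

weight.mul(input, product);
product->add(bias, output);

which is a bit more verbose than the chained version but does not rely on move semantics at all.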

Therefore, this issue's scope is limited to implementing the Matrix classes for both kinds of devices. Porting the algorithms should be deferred to separate issues until benchmark results show no performance gap between the low-level API and the proposed high-level API.

Efforts to refine the API and help implement it are welcome.

@Yangqing
Member

I am in general against writing a matrix class, or using an existing matrix class (in which case it would be very tricky to synchronize CPU and GPU operations). What we essentially need is a Tensor class that supports 4-dimensional array operations, but that involves substantial changes to more than half of the code.

I am also a little against Matlab-style implementations. For example, the code:

activations = input.exp();
probs = activations.rdiv(activations.sum(dim));

effectively allocates two arrays, activations and probs, and then discards them on the fly. Of course this could be written more carefully by preallocating arrays, like exp(input, &activation), but it would invite careless code more often. The current code actually requires you to explicitly define such "buffer" blobs, which I believe is important for writing efficient code.
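To make the preallocation point concrete, here is a minimal CPU-only sketch of the exp(input, &activation) style, assuming the count(), cpu_data() and mutable_cpu_data() accessors of the existing Blob class (blob_exp is a hypothetical name, not proposed code):

#include <cmath>

template<typename Dtype>
void blob_exp(const Blob<Dtype>& input, Blob<Dtype>* output) {
  // output must already be shaped to match input; the caller owns the
  // buffer, so no temporary array is allocated and discarded inside the call.
  const Dtype* in = input.cpu_data();
  Dtype* out = output->mutable_cpu_data();
  for (int i = 0; i < input.count(); ++i) {
    out[i] = std::exp(in[i]);
  }
}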

I do like the idea of separating the interface from the actual implementation. The Blob class is sort of halfway there - I was iterating quickly when writing all that code, but one can imagine a better separation between the blob operation interfaces and the actual blob implementation (e.g. add(blob1, blob2), or conv(blob1, blob2)), which is essentially what you are proposing here. At this stage, I don't think refactoring is an urgent issue, though.
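The separation suggested above could, for instance, take the form of free functions over Blob rather than a new container class; a hypothetical sketch (none of these functions exist in the code today):

template<typename Dtype> class Blob;  // the existing storage class

template<typename Dtype>
void add(const Blob<Dtype>& a, const Blob<Dtype>& b, Blob<Dtype>* result);

template<typename Dtype>
void conv(const Blob<Dtype>& input, const Blob<Dtype>& filter, Blob<Dtype>* output);

Blob would remain a plain storage and CPU/GPU synchronization container, while the operation interface lives in these functions and can be reimplemented without touching the callers.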

@kloudkl
Contributor Author

kloudkl commented Jan 24, 2014

Thanks for your suggestions! In the larger context of this proposal, I have been wondering for a while what the vision, scope, prioritized dos, and don'ts of Caffe are. If you have a plan that can direct the community towards a shared destination, it would concentrate the limited resources out there and lead to more effective development and wider adoption in the near future.

@Yangqing
Member

Closed per #85.
