Gram matrix using VexCL #93

Closed
ddemidov opened this issue Feb 10, 2014 · 1 comment
@ddemidov
Owner

This is a copy of a stackoverflow question. Keeping it here for future reference.

I have a pretty large dataset (it does not fit into GPU memory) containing many vectors, each several MBs in size. I'd like to compute the Gram matrix with a gaussian kernel, using multiple GPU devices.

In other words, for every pair of vectors x, y I need to compute the norm of x - y. So if I have N vectors, there are (N^2 + N)/2 such pairs. I don't care about saving space or time by exploiting the symmetry; computing the whole N^2 is fine.

How can I do this with VexCL? As far as I know, it is the only library supporting multiple GPUs, and my attempts to do this effectively with plain OpenCL have had no success so far.

Please note that the dataset won't even fit into the machine's RAM; I'm reading blocks of vectors from a memory-mapped file.

Thanks much!!

@ddemidov
Owner Author

You will obviously need to split your vectors into groups of, say, m vectors, load the groups onto your GPUs one by one, and do the computations. Here is a complete program that does the computation (as I understood it) for two currently loaded chunks:

#include <vexcl/vexcl.hpp>

int main() {
    const size_t n = 1024; // Each vector size.
    const size_t m = 4;    // Number of vectors in a chunk.

    vex::Context ctx( vex::Filter::Count(1) );

    // The input vectors...
    vex::vector<double> chunk1(ctx, m * n);
    vex::vector<double> chunk2(ctx, m * n);

    // ... with some data.
    chunk1 = vex::element_index();
    chunk2 = vex::element_index();

    vex::vector<double> gram(ctx, m * m); // The Gram matrix to fill.

    /*
     * chunk1 and chunk2 both have dimensions [m][n].
     * We want to take each of chunk2 m rows, subtract those from each of
     * chunk1 rows, and reduce the result along the dimension n (I assume
     * norm of (x - y) is max(abs(x - y)), but this could be easily generalized
     * for other norms).
     *
     * In order to do this, we create two virtual 3D matrices (x and y below,
     * those are just expressions and are never instantiated) sized [m][m][n],
     * where
     *
     *     x[i][j][k] = chunk1[i][k] for each j, and
     *     y[i][j][k] = chunk2[j][k] for each i.
     *
     * Then what we need to compute is
     *
     *     gram[i][j] = max_k(abs(x[i][j][k] - y[i][j][k]));
     *
     * Here it goes:
     */
    using vex::extents;
    using vex::_;

    auto x = vex::reshape(chunk1, extents[m][m][n], extents[0][2]);
    auto y = vex::reshape(chunk2, extents[m][m][n], extents[1][2]);

    // The single OpenCL kernel is generated and launched here:
    gram = vex::reduce<vex::MAX>(
            // the dimensions of the expression to reduce:
            vex::slicer<3>(extents[m][m][n])[_],
            fabs(x - y),          // The expression to reduce.
            2                     // The dimension to reduce along.
            );

    // Copy the result to host, spread it across your complete gram matrix.
    // I am lazy though, so let's just dump it to std::cout:
    std::cout << gram << std::endl;
}

I suggest you load chunk1 once, then load all chunk2 variants in sequence and do the computations, then load the next chunk1, and so on. Note that slicing, reshaping, and multidimensional reduction operations are only supported for a context with a single compute device in it.

So what is left is how to spread the computations across all of your compute devices. The easiest way to do this is probably to create a single VexCL context that grabs all available GPUs, and then create vectors of command queues out of it:

vex::Context ctx( vex::Filter::Any );

// One single-device queue list per GPU:
std::vector<std::vector<vex::command_queue>> q;
for(size_t d = 0; d < ctx.size(); ++d)
    q.push_back({ctx.queue(d)});

//...
// In a std::thread perhaps, for device d:
vex::vector<double> chunk1(q[d], m * n);
vex::vector<double> chunk2(q[d], m * n);
// ...

I hope this is enough to get you started.
