Simplify SharedTensor syncing #37

Open
hobofan opened this issue Feb 2, 2016 · 16 comments

hobofan commented Feb 2, 2016

The current way of syncing a SharedTensor is somewhat unintuitive and in places incomplete.

Things to consider changing (loosely ordered by importance):

  1. Add a way to set the latest device without syncing. This is absolutely necessary for performant usage that doesn't require complex syncing.
  2. Change the behaviour of add_device to one that doesn't return an Error if the Tensor is already tracking memory for that device. The current behaviour is unintuitive. This change is breaking and will require adjustments in all plugins.
  3. Add a convenience function that handles both add_device and sync. This should clean up end-applications and the "managed" part of plugins.
  4. Track write access. This could allow us to skip syncing if we know the value was only read but not written between a back-and-forth sync. This should improve performance for the typical use case of only reading the results on Native without manually having to use the function from 1.
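The four points above could be combined into a small device-tracking API. Below is a minimal, hypothetical sketch (all names here — `set_latest_device`, `sync_to`, the `Device` newtype — are assumptions for illustration, not existing collenchyma API):

```rust
use std::collections::HashSet;

#[derive(PartialEq, Eq, Hash, Clone, Copy, Debug)]
struct Device(u32); // stand-in for a real device handle

struct SharedTensor {
    devices: HashSet<Device>,   // devices for which memory is tracked
    latest: Option<Device>,     // device holding the latest data, if any
}

impl SharedTensor {
    fn new() -> Self {
        SharedTensor { devices: HashSet::new(), latest: None }
    }

    // Point 2: adding an already-tracked device is a no-op, not an error.
    fn add_device(&mut self, dev: Device) {
        self.devices.insert(dev);
    }

    // Point 1: declare that `dev` holds the latest data, without copying.
    fn set_latest_device(&mut self, dev: Device) {
        self.add_device(dev);
        self.latest = Some(dev);
    }

    // Point 3: one call that both tracks the device and syncs to it.
    fn sync_to(&mut self, dev: Device) {
        self.add_device(dev);
        if self.latest != Some(dev) {
            // ... actual data transfer from `self.latest` would happen here ...
            self.latest = Some(dev);
        }
    }
}

fn main() {
    let mut t = SharedTensor::new();
    t.add_device(Device(0));
    t.add_device(Device(0)); // no error the second time
    t.set_latest_device(Device(0));
    t.sync_to(Device(1));    // tracks Device(1) and would copy from Device(0)
    assert_eq!(t.latest, Some(Device(1)));
    assert!(t.devices.contains(&Device(0)) && t.devices.contains(&Device(1)));
}
```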
@alexandermorozov
Contributor

The interface could be implemented like this (breaking change):

pub struct Tensor<'a> {
    dim: Cow<'a, [usize]>,
    memory: Rc<MemoryType>,
}

pub struct SharedTensor {
    dim: Vec<usize>,

    /// Current version and memory descriptor for each device.
    copies: LinearMap<Device, (usize, Rc<MemoryType>)>,

    latest_version: usize,
}

impl SharedTensor {
    pub fn new(dim: Vec<usize>) -> SharedTensor {
        SharedTensor {
            dim: dim,
            copies: LinearMap::new(),
            latest_version: 1,
        }
    }

    pub fn read<'a>(&'a self, device: &Device) -> Result<Tensor<'a>, Error> {
        // Check if there is initialized data anywhere.
        // Look up the memory and its version for `device`; allocate if it doesn't exist.
        // Check the version; if it's old, synchronize.
        Ok(Tensor {
            dim: Cow::Borrowed(&self.dim),
            memory: unimplemented!(), // lookup/allocation elided in this sketch
        })
    }

    pub fn read_write<'a>(&'a mut self, device: &Device) -> Result<Tensor<'a>, Error> {
        // Same as ::read(), but also bump the memory's version and latest_version.
        Ok(Tensor {
            dim: Cow::Borrowed(&self.dim),
            memory: unimplemented!(), // lookup/allocation elided in this sketch
        })
    }

    pub fn write_only<'a>(&'a mut self, device: &Device) -> Result<Tensor<'a>, Error> {
        // Same as ::read_write(), but skip the initialization check and the sync.
        Ok(Tensor {
            dim: Cow::Borrowed(&self.dim),
            memory: unimplemented!(), // lookup/allocation elided in this sketch
        })
    }
}

SharedTensor holds copies and their version numbers. The user can request any number of immutable Tensors or a single mutable Tensor (enforced by the borrow checker). It's possible to validate at runtime that tensor data is initialized when the user requests a Tensor for reading, and to allow initialization to be skipped if the Tensor is requested only for writing.
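As a rough illustration of the "many readers xor one writer" rule the borrow checker enforces here, consider this stripped-down mock (memory handling omitted; only the `dim` field and the method names follow the sketch above):

```rust
// Simplified stand-ins, not the real collenchyma types.
struct Tensor<'a> {
    dim: &'a [usize],
}

struct SharedTensor {
    dim: Vec<usize>,
}

impl SharedTensor {
    // Immutable borrow: any number may coexist.
    fn read(&self) -> Tensor<'_> {
        Tensor { dim: &self.dim }
    }
    // Mutable borrow: requires exclusive access to the SharedTensor.
    fn read_write(&mut self) -> Tensor<'_> {
        Tensor { dim: &self.dim }
    }
}

fn main() {
    let mut t = SharedTensor { dim: vec![2, 3] };

    // Any number of immutable views may coexist:
    let a = t.read();
    let b = t.read();
    assert_eq!(a.dim, &[2, 3][..]);
    assert_eq!(b.dim, &[2, 3][..]);

    // Uncommenting the next line while `a`/`b` are still in use would be
    // a compile error (cannot borrow `t` as mutable):
    // let w = t.read_write();

    let w = t.read_write(); // fine once the immutable views are done
    assert_eq!(w.dim, &[2, 3][..]);
}
```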

It's also possible to add some restricted slicing to Tensor.

What I don't like here:

  • Tensor doesn't contain any info about whether it's writable or not, so someone could write into it by mistake. It's possible to create two types, Tensor and MutTensor; the latter could have methods for getting writable memory. Or there may be other ways to encode mutability in the type system.
  • it would be nice to include Framework in the Tensor and DeviceType type signatures: SharedTensor::read<'a, F: Framework>(&'a self, device: &F::D) -> Tensor<'a, F>. I have a feeling that this kind of approach, taken further, would remove a lot of enum wrappers and unwraps, and would allow moving all frameworks out of collenchyma and removing the feature flags. That's not for free though: some things would have to be wrapped in smart pointers, since the sizes of framework-related structures wouldn't be known at collenchyma's compile time. But I doubt that would add perceptible runtime cost for Leaf use cases.

@alexandermorozov
Contributor

I've created a draft/prototype here. It's still a work in progress, but please let me know what you think about it!

@alexandermorozov
Contributor

I've updated the prototype. There is a full list of features in the comments at the head of the file; here is the short version:

  • Memory is versioned. The version is incremented when memory is mutably borrowed. If all memory locations are up to date, then no sync is done when any one of them is borrowed. It's also possible to borrow uninitialized memory write-only.
  • A request to borrow memory returns a Tensor struct that contains the dimensions along with the memory pointer. A Tensor can be reshaped without affecting its parent SharedTensor. It's also possible to implement a restricted form of slicing on Tensor.
  • Adding a new memory location and internal synchronization don't require the SharedTensor to be mutable. Mutability is only required to change its contents.
  • Backends can be separated into their own crates, e.g. collenchyma-cuda, collenchyma-opencl, so there is mostly no need for feature flags. See the cuda mock implementation. This has a slight runtime cost: Device/Memory has to be wrapped in Any and unwrapped on synchronization.
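A minimal sketch of how such version tracking might work, assuming simplified stand-in types (a `Device` enum, versions in a `HashMap`, and `&mut self` instead of the interior mutability the prototype uses):

```rust
use std::collections::HashMap;

#[derive(PartialEq, Eq, Hash, Clone, Copy, Debug)]
enum Device { Native, Cuda } // mock devices, not the real types

struct SharedTensor {
    copies: HashMap<Device, u64>, // device -> version of its copy
    latest_version: u64,          // 0 means "never written"
}

impl SharedTensor {
    fn new() -> Self {
        SharedTensor { copies: HashMap::new(), latest_version: 0 }
    }

    /// Immutable borrow: error if uninitialized; sync only if stale.
    fn read(&mut self, dev: Device) -> Result<(), &'static str> {
        if self.latest_version == 0 {
            return Err("no copy has been initialized yet");
        }
        let v = self.copies.get(&dev).copied().unwrap_or(0);
        if v < self.latest_version {
            // ... transfer bytes from an up-to-date copy here ...
            self.copies.insert(dev, self.latest_version);
        }
        Ok(())
    }

    /// Mutable borrow: sync like `read`, then bump versions so every
    /// other copy becomes outdated.
    fn read_write(&mut self, dev: Device) -> Result<(), &'static str> {
        self.read(dev)?;
        self.latest_version += 1;
        self.copies.insert(dev, self.latest_version);
        Ok(())
    }

    /// Write-only borrow: skip both the init check and the sync.
    fn write_only(&mut self, dev: Device) {
        self.latest_version += 1;
        self.copies.insert(dev, self.latest_version);
    }
}

fn main() {
    let mut t = SharedTensor::new();
    assert!(t.read(Device::Native).is_err()); // uninitialized
    t.write_only(Device::Native);             // no sync, no init check
    assert!(t.read(Device::Cuda).is_ok());    // stale -> one transfer
    assert!(t.read(Device::Cuda).is_ok());    // up to date -> no transfer
}
```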

What is not done:

  • it's not possible to erase the type of Device/Backend and use a trait object instead. So e.g. solvers in Leaf have to be parametrized with Backend types. Well, it couldn't be done in the current collenchyma code either.
  • nothing has changed with respect to async transfers.

@hobofan @MichaelHirn I'm ready to refactor collenchyma, the plugins and leaf according to this proposal, but I'd like to know your opinion on it beforehand. It'd be a waste if, after all is done, it turned out that it was clear from the start that the PR wouldn't be accepted. Of course, something unexpected may still show up during refactoring and it may turn out that those changes are a bad idea after all, but that's another matter.

@MichaelHirn
Member

Hey @alexandermorozov,

That is pretty huge. I looked through the code and read about the intention. Here are my thoughts and ideas (I might take concepts from your earlier comments which are no longer in the code; I try to avoid it, though):

  1. Why version numbers?
    I see that version numbers make the interface cleaner. Can you elaborate on why you chose the version number design and what benefits it holds?

  2. Costs of decoupling frameworks and shared tensor.
    I am a big fan of this approach. Do you have any performance tests for this one?
    Also, @hobofan, do you have a feeling for how profound this decoupling would be in terms of runtime performance? Quote:

    it would be nice to include Framework in the Tensor and DeviceType type signatures: SharedTensor::read<'a, F: Framework>(&'a self, device: &F::D) -> Tensor<'a, F>. I have a feeling that this kind of approach, taken further, would remove a lot of enum wrappers and unwraps, and would allow moving all frameworks out of collenchyma and removing the feature flags. That's not for free though: some things would have to be wrapped in smart pointers, since the sizes of framework-related structures wouldn't be known at collenchyma's compile time. But I doubt that would add perceptible runtime cost for Leaf use cases.

  3. Tensor reshape
    Here I feel that our ideas of what a SharedTensor is differ in a fundamental way. The role of a SharedTensor is to track the location of memory across devices for one conceptually similar piece of data. I am not a fan of giving the memory copies a way to change their shape. In my opinion, this makes it harder to reason about the state of a SharedTensor and adds functionality that is not really needed. But I am open to seeing what the intention behind this design approach is and what benefits it holds.

  4. Problems solved
    On a higher level, could you help me understand which of the above problems become directly solved?

  • Add a way to set the latest device without syncing. This is absolutely necessary for performant usage that doesn't require complex syncing.
  • Change the behaviour of add_device to one that doesn't return an Error if the Tensor is already tracking memory for that device. The current behaviour is unintuitive. This change is breaking and will require adjustments in all plugins.
  • Add a convenience function that handles both add_device and sync. This should clean up end-applications and the "managed" part of plugins.
  • Track write access. This could allow us to skip syncing if we know the value was only read but not written between a back-and-forth sync. This should improve performance for the typical use case of only reading the results on Native without manually having to use the function from 1.
  5. Future
    There are some big features that we would like to introduce soon, and I am curious to see how/if they could play out with this SharedTensor proposal.
  • Multi-device support, via tracking async actions on a shared-tensor basis.
  • Integrating functionality from ndarray, which already provides logic for slicing and co.

Conclusion:

I like your approach and your proof of concept. I just have the feeling that the code changes might have gone a bit further than necessary. Before we move further, I would like to find a solution that gives us all the benefits while changing only what is really needed.

@alexandermorozov
Contributor

@MichaelHirn, hi!

I agree, the scope is rather big, but that's mostly to see what is and isn't possible within this approach, to avoid restricting future possibilities. I don't think that every feature mentioned here should be implemented. Answering your questions:

  1. Why use version numbers? Mainly to be able to tell that several memory locations contain the latest version of the data; borrowing any of those locations doesn't require synchronization. The implementation is very straightforward too: on an immutable borrow, compare the memory's version with the latest version; if they don't match, sync. On a mutable borrow do the same, but additionally bump the memory's version and the latest version. It's also easy to tell if the SharedTensor wasn't initialized and return an error if somebody tries to borrow it for reading. It's also possible to implement the same thing with an array of bool flags (or BitVec/BitSet) that indicate whether each memory contains the latest data or not. Actually, it looks like that implementation would be simpler and faster; I should try it instead of versioning.
  2. I don't have benchmarks yet. I'll add some as time permits. The main costs as I see them now are: a) use of boxes to hold each device/memory. If we do decoupling, I think that would be difficult to work around -- there is no way to know the sizes of Device and Memory beforehand. b) Use of Any and downcasts. AFAIK, under the hood a downcast is a method call on a trait object that compares struct ids, so it should be reasonably fast. c) Another decoupling-related thing: if Native is defined in collenchyma, and Cuda in collenchyma-cuda, then without feature flags only Cuda may implement sync in/sync out from Native, since Native doesn't know anything about Cuda. So we have to try two calls, since generic code doesn't know about the possible transfer directions between objects with erased types: native_src.sync_out(cuda_dst) and cuda_dst.sync_in(native_src). The first call will fail, the second will do the work. So, one extra sync attempt. It could be fixed if Native were made aware of Cuda, but, well, that spoils the decoupling a bit and introduces feature flags again.
  3. Reshaping a Tensor doesn't affect its parent SharedTensor; if it did, that would destroy locality and lead to quite a few unpleasant surprises for users. In the current implementation Tensor holds its dimensions as copy-on-write data. In the Leaf source there are places where, during the forward pass, output tensors are reshaped to the form required by the next layer, on each pass. That requires the SharedTensors to be mutable. Maybe that's why nearly every tensor is wrapped in an ArcLock? If that's the case, then a temporary Tensor that can be reshaped could help.
  4. Features:
  • Add a way to set the latest device without syncing: yes. Though it only makes sense if the memory will be used as write-only in the next step, so borrowing for reading is prohibited.
  • The other points are solved by replacing the API: add_device() / sync() / get() / get_mut() are replaced with read() / read_write() / write_only(). Those functions allocate memory on the device (if it isn't allocated already), do synchronization and return a reference to the memory (or a Tensor).
  • Write tracking is done mostly by convention: if something mutably borrows memory (read_write(), write_only()), it's expected to write something to the memory, so other locations are immediately considered outdated. If memory is borrowed with write_only(), it's not synchronized, and the caller is expected to overwrite the memory completely. Failure to do so may result in the use of uninitialized memory later. It's not perfect, but mostly cost-free and much better than nothing.
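The two-call fallback from point 2 can be sketched with `Any`-erased devices (all types here are mocks; the real collenchyma signatures differ). Generic code can't know which transfer direction is implemented, so it tries `sync_out` on the source and falls back to `sync_in` on the destination:

```rust
use std::any::Any;

trait SyncDevice {
    /// Push data from `self` into `dst`; Err if `self` doesn't know `dst`.
    fn sync_out(&self, dst: &dyn Any) -> Result<(), ()>;
    /// Pull data from `src` into `self`; Err if `self` doesn't know `src`.
    fn sync_in(&self, src: &dyn Any) -> Result<(), ()>;
    fn as_any(&self) -> &dyn Any;
}

struct Native;
struct Cuda;

impl SyncDevice for Native {
    // Native knows nothing about Cuda, so both directions fail here.
    fn sync_out(&self, _dst: &dyn Any) -> Result<(), ()> { Err(()) }
    fn sync_in(&self, _src: &dyn Any) -> Result<(), ()> { Err(()) }
    fn as_any(&self) -> &dyn Any { self }
}

impl SyncDevice for Cuda {
    // Cuda implements transfers to and from Native.
    fn sync_out(&self, dst: &dyn Any) -> Result<(), ()> {
        if dst.is::<Native>() { Ok(()) } else { Err(()) }
    }
    fn sync_in(&self, src: &dyn Any) -> Result<(), ()> {
        if src.is::<Native>() { Ok(()) } else { Err(()) }
    }
    fn as_any(&self) -> &dyn Any { self }
}

/// Generic transfer: try the source side first, then the destination side.
fn transfer(src: &dyn SyncDevice, dst: &dyn SyncDevice) -> Result<(), ()> {
    src.sync_out(dst.as_any()).or_else(|_| dst.sync_in(src.as_any()))
}

fn main() {
    // Native -> Cuda: the first call fails, the second does the work.
    assert!(transfer(&Native, &Cuda).is_ok());
    // Cuda -> Native: the first call already succeeds.
    assert!(transfer(&Cuda, &Native).is_ok());
}
```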

Can you tell me more about multi-device and async? Multi-device as in sync between OpenCL devices and Cuda? Or between several Cuda devices? If the latter, would they share the same Cuda context? From a cursory glance it looks like async transfers require pinned host memory, is this the case? That may not mix well with native ndarray types... And it looks like, from a performance point of view, collenchyma should be able to set up several Cuda -> host transfers and be able to wait on all of them for completion...

@MichaelHirn
Copy link
Member

Nice, thank you so much for the clarification. Helped a lot.

I think there are some great concepts here that we should borrow and use for the 0.1 release of Collenchyma. I am very excited to move this forward with you. I will make some free time to expand on my answer and suggestions on Thursday.

@hobofan
Member Author

hobofan commented Apr 12, 2016

Wow, that looks pretty awesome!

It generally looks good to me. As already mentioned in 1., I think an enum instead of version numbers is probably faster and should be well optimizable by the compiler.

Apart from that I can only provide the bikeshedding opinion that MutTensor should be named Tensor and Tensor should be named TensorView. It seems that similar systems (in other languages) already follow that convention. However, not an overly strong opinion on that.

@alexandermorozov
Contributor

@hobofan Well, if you mean enums like DeviceType and MemoryType, then they are quite compatible with the use of version numbers / bitsets. They don't play well with decoupling, though... Those enums have to be defined in the base crate, so the base crate needs to explicitly enumerate Cuda, OpenCL and other known backends, mostly defeating the benefits. But full decoupling is nice -- it'll make it easier to experiment with new backends, like GPU-over-Capnproto or something.

I'll begin to work on a stand-alone PR for enhanced synchronization, since it mostly has no downsides (and big upsides!). After that I'll try to benchmark/implement decoupling; then it'll become clear what it costs at runtime and whether it's worth it.

@hobofan
Member Author

hobofan commented Apr 12, 2016

@alexandermorozov
I meant using an enum like

enum SyncState {
  Uninitialized,
  Outdated,
  Latest,
}

though that would require setting all other copies to Outdated every time one is updated. Not sure if that's more or less efficient than using version numbers.

@alexandermorozov
Contributor

@hobofan Ah, I see. Well, in practice Uninitialized should be the same as Outdated -- they both mean that this memory is not for reading. So there are really only 2 states and a bool will do. Bools can be packed into an integer, and the integer can be set/reset in one operation. A u64 will limit the max number of memories to 64, which looks like it should be enough for all foreseeable use cases. Does anyone really need to keep more than 64 copies? Maybe something like holding the same tensor on multiple nodes in a cluster? If the implementation turns out to be restricting, it can easily be modified to use a BitSet instead of a single integer, in exchange for some runtime cost. Sorry I wasn't clear -- the initial proposal was to use versions; later, while replying to @MichaelHirn, it became clear that a BitSet is enough. I mentioned it but didn't elaborate; now it seems like a u64 will be even better...
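A sketch of the u64-bitmask idea (names hypothetical): one bit per memory location, where a set bit means "holds the latest data", and uninitialized is indistinguishable from outdated:

```rust
struct SyncState {
    up_to_date: u64, // bit i set => memory location i holds the latest data
}

impl SyncState {
    fn new() -> Self {
        SyncState { up_to_date: 0 } // no location initialized yet
    }

    /// At least one copy holds valid data.
    fn is_initialized(&self) -> bool {
        self.up_to_date != 0
    }

    /// Location `loc` may be read without synchronization.
    fn is_current(&self, loc: u32) -> bool {
        self.up_to_date & (1 << loc) != 0
    }

    /// After syncing location `loc`, mark it current alongside the others.
    fn mark_synced(&mut self, loc: u32) {
        self.up_to_date |= 1 << loc;
    }

    /// After a write through location `loc`, every other copy is stale:
    /// resetting the whole mask is a single operation.
    fn mark_written(&mut self, loc: u32) {
        self.up_to_date = 1 << loc;
    }
}

fn main() {
    let mut s = SyncState::new();
    assert!(!s.is_initialized()); // reading now would be an error
    s.mark_written(0);            // first write initializes location 0
    s.mark_synced(3);             // after syncing, location 3 is also current
    assert!(s.is_current(0) && s.is_current(3));
    s.mark_written(3);            // a write through 3 invalidates all others
    assert!(!s.is_current(0));
    assert!(s.is_current(3));
}
```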

@MichaelHirn
Member

I need to push it to tomorrow, I think. I've reviewed the SharedTensor topic extensively now, but I still need to do the write-up; I hope I can get to it tomorrow.

@alexandermorozov
Contributor

@MichaelHirn No problem! I'm quite busy too, most likely I won't have enough mindspace to implement anything until weekend.

@MichaelHirn
Member

I discussed most of the topics with @hobofan last week as well, and we reached a conclusion on nearly all points. The summary is that the implementations you proposed make a lot of sense and would provide great value for the future.

Also, sorry for the delay; the last days were kind of unusual and I wanted to make sure that I can follow the reasoning of the new SharedTensor proposal.

  1. Version numbers
    I think we all agree that they make a lot of sense. Max's and my thoughts regarding the implementation details were that a boolean enum (like you proposed) makes a lot of sense, as a SharedTensor is usually not tracking too many Tensors anyway. So the overhead of setting all other Tensors to outdated should be negligible. This would keep the logic super simple. I suspect that the other solutions would require more edge-case handling and might make debugging really hard. But we are open to discussing this further if needed.
  2. Decoupling
    We are generally pro decoupling, as it gives us more flexibility, but it would make maintenance harder, as you need to manage another layer of crate versions. Also, I think, as you already mentioned, this would require some benchmarks first, to make sure that run-time performance doesn't suffer from it.
  3. Reshaping
    I was a bit sceptical about this at first, but it really might help to make the Leaf code cleaner and remove some ponderous parts (like the ArcLock thing), as a mutable borrow of the SharedTensor is not required anymore. How this would look exactly we might have to explore further, but in general I think this could be promising.
  4. Features and others
    Most of the other features, as you already mentioned, would then become quite easy to implement. Regarding the naming of SharedTensor/Tensor, I would vote for TensorView for the immutable interface and Tensor for the mutable one.
  5. Multi-Device support
    Multi-device support as in: a Backend can be created over multiple devices (instead of just one). If you pass such a Backend to Leaf, e.g., it would/could make use of the multi-device setup and distribute/parallelize the work. The basic concept is that the (async) events of syncing (either to another device or the host) and executing kernels would be stored with the SharedTensor (or the Tensor; not sure yet). That would introduce a bit more management on the operation/functionality side, but it could be abstracted behind a macro.
    I think whatever we do with the SharedTensor now, multi-device support can be added anytime later on.

@alexandermorozov
Contributor

  1. Great, I'll start implementing bitmasks and the read()/read_write()/write_only() interface that returns Memory as before. Work is actually already underway, but it'll take some time, including fixing the breakage in the plugins and Leaf. I'll post updates on progress when logical chunks are ready. For now I'll use a u64 to store the bitmasks -- that's super fast, requires no extra allocations and no access indirection. The only downside is a hard limit on the number of tracked memories. If the need to store more arises, it'll be easy to replace the u64 with a slower BitSet.
  2. I'd add another con: it's not clear yet how decoupling will interact with multi-device support. It feels like it's possible to do without much overhead, but it's better to prototype beforehand.
  3. Reshaping with Tensor/MutTensor will also need more consideration. With straightforward refactoring it'll result in the creation of Vecs for inputs, outputs, weights, etc. on each Layer::forward() step. It looks like most of those Vecs will hold just one element. For most NNs the cost should be insignificant, but there are possible corner cases. It'd be nice to use slices on the stack instead of Vecs, but I don't know how easy that is.
  4. I don't have any opinion on naming; the names in the prototype were selected without much thought.
  5. Ah, so that's the reason for having Backend in addition to Device. It looks like autobalancing logic won't be easy to implement... But it's possible to bind each Layer to a specific device instead and fix the distribution between devices on creation. In this case it'll be easy to create mixed cpu/cuda/opencl networks; that'll also remove the requirement that all layers must be supported by any given backend. As for async, it's possible to start an operation on the SharedTensor -> Tensor transition and wait for completion on the Tensor -> Memory request. But it looks like that won't cover all cases where async can help... I had a brief look with the cuda profiling tool, and it looks like the GPU is idle most of the time because of gpu <-> host transfers; currently the solver requires them in several places. My not-so-fast GTX 960 is typically loaded at 20-30% as reported by nvidia-smi, and something like 14% by the profiler if I remember correctly. Of course, the solver should be cleaned up to require as few transfers as possible, but I wonder how much speedup can be squeezed out by aggressively stuffing the pipeline in advance...

alexandermorozov added a commit to alexandermorozov/collenchyma that referenced this issue Apr 17, 2016
Remove methods `sync()`, `get()`, `get_mut()`, `remove_copy()` of
`SharedTensor` and introduce new set of methods: `read()`, `read_write()`,
`write_only()`, `drop_at()`. Signature of `SharedTensor::new()` has also
changed.

The new API has the following benefits:
 - limited checks of use of uninitialized memory,
 - better memory tracking: several memories can be simultaneously marked as
   up-to-date, so some synchronization operations might be skipped,
 - backing memory is automatically allocated on the first use and added to
   `SharedTensor`, even if it's immutable. Mutability is required only for
   reshaping, modifying actual data and dropping memories.

Rationale and design decisions are discussed at the corresponding bugtracker
issue.

BREAKING CHANGE: sync and memory management API of `SharedTensor`

CLOSE: autumnai#37
alexandermorozov added a commit to alexandermorozov/collenchyma that referenced this issue Apr 17, 2016
Remove methods `sync()`, `get()`, `get_mut()`, `remove_copy()` of
`SharedTensor` and introduce new set of methods: `read()`, `read_write()`,
`write_only()`, `drop_device()`. Signature of `SharedTensor::new()` has also
changed.

The new API has the following benefits:
 - limited checks of use of uninitialized memory,
 - better memory tracking: several memories can be simultaneously marked as
   up-to-date, so some synchronization operations might be skipped,
 - backing memory is automatically allocated on the first use and added to
   `SharedTensor`, even if it's immutable. Mutability is required only for
   reshaping, modifying actual data and dropping memories.

Rationale and design decisions are discussed at the corresponding bugtracker
issue.

BREAKING CHANGE: sync and memory management API of `SharedTensor`

CLOSE: autumnai#37
alexandermorozov added a commit to alexandermorozov/collenchyma that referenced this issue Apr 18, 2016
During refactoring (autumnai#37) several errors were upgraded into panics. Those
errors may happen only if the internal logic of `SharedTensor` is incorrect
and leads to inconsistent state and broken invariants.
alexandermorozov added a commit to alexandermorozov/collenchyma-blas that referenced this issue Apr 23, 2016
Refactor the CUDA and Native backend code to match autumnai/collenchyma#62, which
provides enhanced memory management and synchronization. Since memory
management is now automatic, the `*_plain` variants of functions are removed.

BREAKING CHANGE: *_plain versions of API functions are removed; arguments of
their counterpart functions may have changed in mutability.

REFERENCE: autumnai/collenchyma#37, autumnai/collenchyma#62

refactor/native: convert to the new memory management API

Convert the Native backend. Code now compiles.
alexandermorozov added a commit to alexandermorozov/collenchyma-nn that referenced this issue Apr 24, 2016
Refactor the CUDA and Native backend code to match autumnai/collenchyma#62, which
provides enhanced memory management and synchronization. Since memory
management is now automatic, the `*_plain` variants of functions are removed.

BREAKING CHANGE: *_plain versions of API functions are removed; arguments of
their counterpart functions may have changed in mutability.

REFERENCE: autumnai/collenchyma#37, autumnai/collenchyma#62
alexandermorozov added a commit to alexandermorozov/leaf that referenced this issue Apr 30, 2016
Use .read()/.write_only()/.read_write() instead of .sync()/.add_device()/.get()
calls.

REFERENCE: autumnai/collenchyma#37, autumnai/collenchyma#62
@alexandermorozov
Contributor

Here are links to the related PRs:
#62
autumnai/collenchyma-blas#15
autumnai/collenchyma-nn#48
autumnai/leaf#103
autumnai/leaf-examples#17

In my experience those PRs have introduced no regressions and can be merged. Though it's better to resolve autumnai/collenchyma-nn#46 before that, so I can port those changes into my patches.

@alexandermorozov
Contributor

leaf-examples mnist mlp --batch-size 10 is now 16% faster (run time is down from 31.0s to 26.1s). I haven't changed anything besides the synchronization code; it seems there were extra syncs before...
