Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why FrameArray and TrajReadOnly? #262

Closed
swails opened this issue May 6, 2015 · 31 comments
Closed

Why FrameArray and TrajReadOnly? #262

swails opened this issue May 6, 2015 · 31 comments

Comments

@swails
Copy link
Contributor

swails commented May 6, 2015

I think this is somewhat related to one of your comments in #260, but I wonder what the rationale is to have FrameArray and TrajReadOnly as two separate objects storing the same information.

This adds complexity to the underlying API (that does not exist in mdtraj), and this complexity is not alleviated by having convenience methods to translate between them, IMO. For a library application like this one (or mdtraj), I've found that those in which I can easily memorize the entire object model and data flow design are the easiest (and most enjoyable) to use. This includes even very complex packages, like pandas and matplotlib (they try hard to keep their object models clean and simple). I may have to look up what methods to use to do a certain task, but knowing the object model and data flow makes it easy to know where to look.

Unless I'm missing something, there's nothing that TrajReadOnly can do (functionally) that FrameArray cannot, correct? The only difference is that the former is immutable. So what's the point of TrajReadOnly? Why not always use FrameArray and just get rid of the other one? It's easy to make FrameArray immutable... just don't change it. I don't see why this has to be enforced by the library, and is rather against the design principles of Python itself (i.e., "We're all consenting adults").

@hainm
Copy link
Contributor

hainm commented May 6, 2015

Hi,

this is because of historical issue. :D I started to wrap Trajin_Single in cpptraj's first (TrajReadOnly). But latter I need to have another class to hold a chunk of data in memory, so I created FrameArray (similar to FrameArray in cpptraj but pytraj has more control).

The only reason I want to expose TrajReadOnly in pytraj because it's friendly to memory. when calling load method in TrajReadOnly, only general checking is done (if file exist, n_frames, ..) but the data is not loaded to memory. User need to iterate frame by frame to get the data. This approach is very similar to MDAnalyis program.

While playing with TrajReadOnly, I wanted to slice it, like traj[0:20:2] to get a chunk of frames. There
is no way I can store new frames in TrajReadOnly, so FrameArray was born to do this job. Whenever calling FrameArray(traj_name, top_name), all data will be loaded into memory. In my experience, this is good for small size traj (like the ones people store in h5 file) but quite slow to load for large data. For example this is loading time for 17K atoms, 1000 frames, netcdf

# load whole traj into memory by FrameArray in pytraj
# I don't use %timeit because it's terriblly slow in my computer

%time io.load(filename, top_name)[:] # FrameArray
# using cpptraj's class (Trajin_Single)
%time io.load(filename, top_name) # TrajReadOnly

%time md.load_netcdf(filename, top=top_name) # mdtra's Trajectory

# pytraj fa: -1 (but not that slow)
CPU times: user 2.09 s, sys: 449 ms, total: 2.54 s # FrameArray
Wall time: 2.54 s

CPU times: user 191 ms, sys: 14 ms, total: 205 ms # TrajReadOnly
Wall time: 205 ms

CPU times: user 1.71 s, sys: 330 ms, total: 2.04 s # mdtraj
Wall time: 2.24 s

you can see that FrameArray and mdtraj have similar loading time (1.7-2.1 s), which is much slower
than TrajReadOnly (191 ms). This is the good thing about TrajReadOnly. I've playing with pytraj a lot to test it's functionality so sometimes I am not really patient to wait for 2 s for just testing something simple, TrajReadOnly can offer this job.

For FrameArray, I also like its offers: iterating over frames is really fast http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb

iterating and slicing is fast because no data is copied (not like mdtraj, if I know correctly). All you do is just moving pointer over C++ vector of Frames. Whenever slicing FrameArray, like fa[0:10:20], we only create a new 'view' of the original traj (which mean updating new resulted FrameArray will change coords of original array; we can explicitly use copy method though). Even with fa[0] will result only a 'view' of Frame object in fa.

Why not an object like Trajectory in mdtraj?
Because I want to keep all methods and information for Frame object. cpptraj have Frame class and pytraj wrap it with the same name. Creating a vector of Frames (FrameArray) is easiest way to have all C++ Frames objects without copying data to external object and can keep all Frames' method too.

For example, I am experimenting api.Trajectory, which is really similar to mdtraj (just a wrapper of numpy 3D-array). https://github.com/hainm/pytraj/blob/master/pytraj/api.py
However, whenever want to use cpptraj actions/analyeses, I need to convert api.Trajectory to something cpptraj can understand (Frame object). This is quite expensive because of data copying.

So in summary, TrajReadOnly is like Universe object in MDAnalysis (in term of need frame iterating to do analysis) and FrameArray is like Trajectory object in mdtraj (in term of loading a chunk of data into memory). Each has pros and cons that I am still considering.
(I noted here: http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb). I've been playing with two other packages (MDAnalysis and mdtraj) to find their good and not-good things to make pytraj better.

But you're right that simple API is much more robust than trying to make things very complicated. I will think more about designing it. ( and actually pytraj is much more friendly and efficient than the one I first introduced to you and Dan. :D ; all suggestions were/are always taken).

thanks.

Hai

@hainm
Copy link
Contributor

hainm commented May 6, 2015

btw, I created index file for all notebooks I wrote to keep track myself and for others too. You might want to have a look at it.

http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/index.ipynb

PS: matplotlib is not very easy to use in my opinion. I used to spend hours to just need to find out how to have 4 subplots in the same figures and have nice layout (the most trouble is find out how not to overlap x, y axis labels and make them look nice). It has too many options to choose (rather making default one robust). At that time, I just use my MS ppt slide to organize the figure layout :D.

(for example. even with their official website, the demo plot looks terrible
http://matplotlib.org/examples/statistics/histogram_demo_multihist.html)

Hai

@hainm
Copy link
Contributor

hainm commented May 6, 2015

ah, I really don't like FrameArray and TrajRreadOny's names. :D (Trajectory seems to be better)

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails

oh, two more points about TrajReadOnly

  • suppose you have very large trajectory (few hundreds of GB data), or you have a long list (tens of ) of large trajectories, use TrajReadOnly is more efficient for iterative exploring in ipython notebook.

you simply just call: traj = io.load(your_traj_name, your_top_name). then you are free to get the very end frame, like frame = traj[1000000]. If you want to slice the last chunk, just traj[:1000:2000] and you will get in-memory FrameArray object.

To the best of my limited knowledge about mdtraj, you need to use md.iterload to load a chunk of data. What's about getting a random chunk? call iterload again and use if else statement?.
What's about loading few frames? mdtraj can use md.load(..., indices=...). But for pytraj, you just traj[indices] rather calling load again but still not need to load TB of data to disk.

frame iterator? it's simple: for frame in traj(start, stop, stride, mask, autoimage=True)
chunk iterator? it's simple: for chunk in traj(start=10, chunk=50).

The main idea is to use both frame iterator, chunk iterator and indexing with minimal effort.

  • cpptraj and pytraj handle remd trajectory very well.

traj = io.load_remd(filename, top_name, T=300.0) # note: data is not yet loaded to memory
whenever slicing traj[index], cpptraj will loop all remd.x.$ext to find a frame with T=300.0. You know for sure that loading remd data to memory is not really a good idea if running stuff in TIP3P, 64 replicas. I think this is very cool for interactive work in ipython (notebook too).
Hai

@swails
Copy link
Contributor Author

swails commented May 6, 2015

OK, so correct me if I'm wrong, but it seems to me that the advantage of TrajReadOnly is in trajectory iteration.

Because I want to keep all methods and information for Frame object. cpptraj have Frame class and pytraj wrap it with the same name. Creating a vector of Frames (FrameArray) is easiest way to have all C++ Frames objects without copying data to external object and can keep all Frames' method too.

Do you get a numpy array if you get, for instance, the xyz attribute of FrameArray? If not, does it implement the array API? (For example, pandas.Series is no longer a numpy.ndarray subclass, but still implements the array API so it can be used in place of numpy arrays in most cases by virtue of duck typing, as I understand it.) Also, numpy is quite efficient with memory -- unlike standard Python containers, slicing a numpy array does not copy any data -- it simply exposes a view of the same data. Which is why when you assign to a numpy slice, it changes the underlying object:

>>> import numpy as np
>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> arr[2:8] = 0
>>> arr
array([0, 1, 0, 0, 0, 0, 0, 0, 8, 9])

Since mdtraj uses raw numpy.ndarray objects for its trajectories, all slices are views (as I hope they are on FrameArray)

Anyway, I digress. Let me suggest something for pytraj here:

  • TrajReadOnly becomes TrajectoryIterator or something (to let people know it's an optimized iterator)
  • FrameArray becomes Trajectory or something, which is the main "object" for storing trajectory data
  • pytraj.load() returns a FrameArray (or whatever you name it)
  • pytraj.iterload() returns a TrajReadOnly (or whatever you name it)

This is more consistent with Python's data model, IMO (an iterator -- like a generator -- cannot be assigned to because there's nothing to assign to, since it's not in memory). Consider:

>>> iterable = [x for x in range(10)]
>>> iterable[8] = 0
>>> iterable
[0, 1, 2, 3, 4, 5, 6, 7, 0, 9]
>>> iterable = (x for x in range(10))
>>> iterable[8] = 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object does not support item assignment

The way you've described it to me, this is the difference between FrameArray (top) and TrajReadOnly (bottom) -- and Python supports both because there is a clear use case.

PS: matplotlib is not very easy to use in my opinion. I used to spend hours to just need to find out how to have 4 subplots in the same figures and have nice layout (the most trouble is find out how not to overlap x, y axis labels and make them look nice). It has too many options to choose (rather making default one robust). At that time, I just use my MS ppt slide to organize the figure layout :D.

It's easy to get started. Publication-quality plots are more challenging, but you can uncover the layers of complexity in matplotlib as you need them, and through it all their object model is consistent (Artists, Patches, Line2Ds, Axes, etc.). A Line2D object is used for the axis lines, the tick lines, the error-bar lines, the error bar hats, the plot lines, ... every line on the plot that you see is an instance of the same object, so when you know how to modify one, you can modify them all. And so on. The combination of flexibility and a complicated target application (general "plotting") means there's no way to avoid complexity in the API, but they mitigate that IMO with a minimal core object model.

As for the layout -- you are looking for the tight_layout method :). (http://matplotlib.org/users/tight_layout_guide.html)

@swails
Copy link
Contributor Author

swails commented May 6, 2015

A little clarification -- TrajReadOnly is better than just a pure generator -- it's more like the range object in Python (very cool!).

Consider:

Python 3.3.5 (default, Aug 24 2014, 10:02:17) 
[GCC 4.7.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> range(10)
range(0, 10)
>>> range(10)[5]
5

@swails
Copy link
Contributor Author

swails commented May 6, 2015

Compare that to the much more limited pure generator:

Python 3.3.5 (default, Aug 24 2014, 10:02:17) 
[GCC 4.7.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> (x for x in range(10))
<generator object <genexpr> at 0x7f0e9bdeacd0>
>>> (x for x in range(10))[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable

So it's a subscriptable generator -- I would still stress this as an iterator. Calling it TrajReadOnly detracts from how really cool it is, and why and when you should use it.

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails: you know that I am really bad at naming stuff. For example, I used to have a method called
load_to_ParmEd_object, then I changed to _load_chem.

TrajReadOnly is not only an iterator. You can still iterate it, or slice it. :D. But definitely I need to rename it. Its current's name look ugly to see in my eye.

Hai

@swails
Copy link
Contributor Author

swails commented May 6, 2015

The name should describe what it's good for. TrajReadOnly is good in the way that xrange is good. It's a subscriptable iterator (subscription indicates that it implements __getitem__ to retrieve a single entry or a slice). Its name should reflect that, IMO.

I like load_to_ParmEd_object better than _load_chem (or load_parmed or something more descriptive). load_chem is non-descriptive to the point of obfuscated. For private methods this doesn't matter as much, but in 6 months you'll know what load_parmed does quicker than load_chem ;).

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails

OK, so correct me if I'm wrong, but it seems to me that the advantage of TrajReadOnly is in trajectory iteration.

Yes. (python3 use lots of iterator for memory reason :D, so I follow this)

Do you get a numpy array if you get, for instance, the xyz attribute of FrameArray?

Yes. framearray.xyz and trajreadonly.xyx will return a copy of its coords. And xyz is a numpy array. To get xyx, I need to iterate all frames and copy data to numpy array. https://github.com/hainm/pytraj/blob/master/pytraj/FrameArray.pyx#L283

It's a copy because the memory layout is different (C++ vector of Frames does not have continuous chunk of memory for xyz coords). However, for single Frame, calling xyz will return a numpy array as a view of frame.xyz (thanks Dan to keep frame.xAddress() to return a double* pointer).

you can look here and find that cython is so cool https://github.com/hainm/pytraj/blob/master/pytraj/Frame.pyx#L309

Also, numpy is quite efficient with memory -- unlike standard Python containers, slicing a numpy array does not copy any data -- it simply exposes a view of the same data.

this only happen if you don't break the "contiguous" things for memory. Please check this
http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails

Since mdtraj uses raw numpy.ndarray objects for its trajectories, all slices are views (as I hope they are on FrameArray)

I am experiencing api.Trajectory to use numpy xyz too. The slicing of api.Trajectory is 10000 times faster than mdtraj and iterating it is superfast too. check link here for slicing and iterating section http://nbviewer.ipython.org/github/pytraj/pytraj/blob/master/note-books/speed_test_2_trajs.ipynb

This is very promising. However I still need to know what I really need. cpptraj already have hundreds of kinds of analysis (great thanks to all developers). So if just use FrameArray or TrajReadOnly is more efficient since I don't need to copy back and forth xyz from numpy to cpptraj`' Frame object.

And honesty I don't know what's advantage of using numpy's xyz like the one in mdtraj. if user need raw data? just xyz = traj.xyz. I think most of the time user just need to play with read-only data to do analysis rather changing xyz.

To update xyzy for a specific Frame from numpy array? it's easy too

fa[0].xyz[:] = numpy_xyz_array # fa is `FrameArray`

Another point, since mdtraj use xyz directly for some of calculation, like rmsd, you can use pytraj' objects directly in mdtraj without converting too.

import mdtraj as md
md.rmsd(fa, fa, 0) # fa is `FrameArray` object.

all slices are views (as I hope they are on FrameArray)

I love to have this too but it's impossible with current implementation of cpptraj. As I said before, the memory layout is different.

@hainm
Copy link
Contributor

hainm commented May 6, 2015

Anyway, I digress. Let me suggest something for pytraj here:
TrajReadOnly becomes TrajectoryIterator or something (to let people know it's an optimized iterator)

great. noted.

FrameArray becomes Trajectory or something, which is the main "object" for storing trajectory data

Yes, noted.

pytraj.load() returns a FrameArray (or whatever you name it)
pytraj.iterload() returns a TrajReadOnly (or whatever you name it)

Noted. But what's about we can index TrajectoryIterator too? (TrajectoryIterator[0]. TrajectoryIterator[100:1000]). or this is ok (like the example you show me)

overall, I think this is great suggestion (as pytraj name). I think I am willing to break my design now with the new name (pytraj is still in dev-status anyway)

Hai

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails

The name should describe what it's good for. TrajReadOnly is good in the way that xrange is good. It's a subscriptable iterator (subscription indicates that it implements getitem to retrieve a single entry or a slice). Its name should reflect that, IMO.

I see. Great.

I like load_to_ParmEd_object better than _load_chem (or load_parmed or something more descriptive). load_chem is non-descriptive to the point of obfuscated. For private methods this doesn't matter as much, but in 6 months you'll know what load_parmed does quicker than load_chem ;).

_load_chem is for my personal use in testing. _load_chem(parm_name) will return chemistry object. pytraj has official io.load_ParmEd to convert ParmEd object to pytraj object.

I made _load_chem because I am too lazy to type

import chemistry as chem
parm = chem.load_file(parm_name)

while in pytraj, I just io._load_chem(parm_name) (note: io is already loaded from my very early comment, so I don't need to call import pytraj.io as io again). :D

@hainm
Copy link
Contributor

hainm commented May 6, 2015

@swails

Since mdtraj uses raw numpy.ndarray objects for its trajectories, all slices are views (as I hope they are on FrameArray)

I just realize that slicing FrameArray will get a view too. But in this case, we are not slicing xyz, we slice vector. Check this code (when slicing FrameArray, a pointer is returned)

https://github.com/hainm/pytraj/blob/master/pytraj/FrameArray.pyx#L428

(but of-course, it's impossible to slice and strip atoms at the same time to get a view). For example

fa['@CA'] # return a copy since we slice and strip all but @CA atoms

@hainm
Copy link
Contributor

hainm commented May 7, 2015

@swails

I can wait to change the names, so just push few hundreds of files to pytraj/pytraj :D
thanks for your suggestion.

Hai

@hainm
Copy link
Contributor

hainm commented May 7, 2015

oops, I'just check and it turned out that slicing FrameArray (fa[:3]) will return a copy. :D
(I guess c++ vector make a copy when using v.push_back)

for a single Frame, fa[0] will return a 'view of Frame.

@swails
Copy link
Contributor Author

swails commented May 7, 2015

And honesty I don't know what's advantage of using numpy's xyz like the one in mdtraj. if user need raw data? just xyz = traj.xyz. I think most of the time user just need to play with read-only data to do analysis rather chaning xyz.

I disagree. A large number of analyses require RMSD alignment, centering and imaging, or just stripping out the solvent altogether. If I recall details of previous git log messages (they've become too numerous for me to continue following them), this can't be done with a TrajReadOnly as they're immutable. Most common tasks require this, too -- like PCA, calculating RMSDs, B-factor (atomicfluct), average structure calculations, ... etc.

I can wait to change the names, so just push few hundreds of files to pytraj/pytraj :D
thanks for your suggestion.

FWIW, git is unique in that it separates "content" (i.e., all of the lines of code, git calls them 'blobs') from the files that they're in ("tree" information). So renaming, moving, copying, etc. different files will not actually weigh down a git repository, since a lot of the content does not change.

oops, I'just check and it turned out that slicing FrameArray (fa[:3]) will return a copy. :D
(I guess c++ vector make a copy when using v.push_back)

I'm not sure how pytraj is using C++ behind-the-scenes, but if you pass a std::vector itself (rather than a reference to the vector), then I believe it's the equivalent of doing a "pass-by-value", and a new std::vector will get initialized using the copy constructor, which duplicates the vector contents. If you pass by reference, obviously this doesn't happen. But I don't know enough about C++ to know what you need to do here to enable it to return a view rather than a copy (but @Mojyt might).

As far as stripping, even mdtraj creates a copy of the trajectories in this case.

@hainm
Copy link
Contributor

hainm commented May 7, 2015

And honesty I don't know what's advantage of using numpy's xyz like the one in mdtraj. if user need raw data? just xyz = traj.xyz. I think most of the time user just need to play with read-only data to do analysis rather chaning xyz.

I disagree. A large number of analyses require RMSD alignment, centering and imaging, or just stripping out the solvent altogether. If I recall details of previous git log messages (they've become too numerous for me to continue following them), this can't be done with a TrajReadOnly as they're immutable. Most common tasks require this, too -- like PCA, calculating RMSDs, B-factor (atomicfluct), average structure calculations, ... etc.

Hi, I should be more clearer (due to my English). I meant pytraj does not really need xyz behave like the one in mdtraj (of course, xyz numpy is the center of mdtraj's design, so it's very important). What I meant here is cpptraj already has tons of actions (I am still simplifying them to be more Pythonic), so I am not sure why pytraj really need xyz numpy.

A large number of analyses require RMSD alignment, centering and imaging, or just stripping out the solvent altogether.

Yes, I agree with this. But cpptraj already have all of them and I implemented for Trajectory too (new name of FrameArray) (traj.fit_to(ref), traj.autoimage, ...)

, this can't be done with a TrajReadOnly as they're immutable

I think you're thinking about mdtraj approach when mentioning this, which mean user might want to use numpy to update coordinates for a bunch of frames at the same times (thanks to numpy vectorization). With TrajectoryIterator, user need to iterate to get Frame and modify xyz for each Frame (not really fancy like numpy but is more memory efficient for large data).

for frame in TrajectoryIterator():
    frame.xyz[:] = some_thing_cool # note: `xyz` here is a numpy array view of frame coords

This approach is the same as cpptraj and MDAnalysis. But pytraj does have mutable Trajectory so we can in-place update coordinates

traj.autoimage()
traj.strip_atoms('!CA') # inplace strip atoms. this is different from traj['@CA']
traj.fit_to(ref)

an any cpptraj actions that changing coords applying to traj will inplace-update its coords too.
For example

In [57]: traj = io.load_sample_data ()[:]                                                                                                                 

In [58]: traj[0, 0]
Out[58]: array([  3.32577000e+00,   1.54790900e+00,  -1.60000000e-06])

In [59]: for frame in traj: frame[0] = [2., 3., 4.]                                                                                                                         

In [60]: traj[0, 0] # traj[0, 0] was updated when we update `frame`
Out[60]: array([ 2.,  3.,  4.])

@hainm
Copy link
Contributor

hainm commented May 7, 2015

FWIW, git is unique in that it separates "content" (i.e., all of the lines of code, git calls them 'blobs') from the files that they're in ("tree" information). So renaming, moving, copying, etc. different files will not actually weigh down a git repository, since a lot of the content does not change.

pytraj ship python files with cythonized files so things is a bit more complicated here. When I rename "FrameArray" to Trajectory, I need to update its name in some cython header files (just like C++ header). This force me to recompile whole module and Cython added/removed so many lines to files.
But for other things, you're right. :P

As far as stripping, even mdtraj creates a copy of the trajectories in this case.

No, pytraj use inplace-stripping (traj.strip_atoms('@ca')). The another (traj['@ca']) is slicing to return a new Trajectory.

@hainm
Copy link
Contributor

hainm commented May 7, 2015

I'm not sure how pytraj is using C++ behind-the-scenes, but if you pass a std::vector itself (rather than a reference to the vector), then I believe it's the equivalent of doing a "pass-by-value", and a new std::vector will get initialized using the copy constructor, which duplicates the vector contents. If you pass by reference, obviously this doesn't happen. But I don't know enough about C++ to know what you need to do here to enable it to return a view rather than a copy (but @Mojyt might).

Yes. in pytraj the Trajectory is a wrapper of C++ vector of cpptraj' Frame. So whenever I use v.push_back(frame), vector will create a copy of frame. I did try design differently to use a vector of Frame pointers, from which I can just push_back the pointers to vector. Using pointer is very dangerous and I got segmentation faul all the time since C++ and Python deallocated the memory (in this case) at the same time. So I just gave up and pleased with current design in pytraj.

But I might be wrong since my understanding about C++ is limited too.

@hainm
Copy link
Contributor

hainm commented May 7, 2015

@swails
FYI in case you don't know yet
http://stanford.edu/~mwaskom/software/seaborn/

@drroe
Copy link
Contributor

drroe commented May 7, 2015

On Wed, May 6, 2015 at 9:04 PM, Hai Nguyen notifications@github.com wrote:

I'm not sure how pytraj is using C++ behind-the-scenes, but if you pass a
std::vector itself (rather than a reference to the vector), then I believe
it's the equivalent of doing a "pass-by-value", and a new std::vector will
get initialized using the copy constructor, which duplicates the vector
contents. If you pass by reference, obviously this doesn't happen. But I
don't know enough about C++ to know what you need to do here to enable it
to return a view rather than a copy (but @Mojyt https://github.com/mojyt
might).

Yes. in pytraj the Trajectory is a wrapper of C++ vector of cpptraj'
Frame. So whenever I use v.push_back(frame), vector will create a copy of
frame. I did try design differently to use a vector of Frame pointers, from
which I can just push_back the pointers to vector. Using pointer is very
dangerous and I got segmentation faul all the time since C++ and Python
deallocated the memory in this case at the same time. So I just gave up and
pleased with current design in pytraj.

Jason is right about the vector passing in C++. This creates a copy:

MyFunction(std::vector myvec) {}

While this will just pass the reference with no copy:

MyFunction(std::vector const& myvec) {}

The segfault seems odd to me. I'm not 100% clear on how pytraj is sitting
on top of cpptraj, but if you create a vector of pointers to frames that
exist in cpptraj there shouldn't be any double deallocation. I'm guessing
what happens is that the owner of the Frames (underlying cpptraj) isn't
always present, and when it goes out of scope the destructors for the
Frames are called, leaving you pointing to freed memory. If the current way
is not efficient we can talk about what I can do with cpptraj to make it
more efficient.

-Dan


Daniel R. Roe, PhD
Department of Medicinal Chemistry
University of Utah
30 South 2000 East, Room 307
Salt Lake City, UT 84112-5820
http://home.chpc.utah.edu/~cheatham/
(801) 587-9652
(801) 585-6208 (Fax)

@hainm
Copy link
Contributor

hainm commented May 7, 2015

@Mojyt

what I meant is

vector<Frame> v
v.push_back(frame) # --> does `v` make a copy of `frame`?

@Mojyt: I am pleased with the current implementation of cpptraj. But if there's anything we can make better cpptraj + pytraj, I will discuss with you for sure.

@swails
Copy link
Contributor Author

swails commented May 7, 2015

It's exactly the same concept -- you are passing by value rather than reference (since you declared the std::vector template to be of type Frame, rather than Frame* or Frame&)

So in response: yes, that makes a copy.

@hainm
Copy link
Contributor

hainm commented May 7, 2015

ah, I see. I will try to see if I can use Frame& in cython for pytraj.

Hai

@hainm
Copy link
Contributor

hainm commented May 7, 2015

@swails

after trial and error + google + stackoverflow, It's likely that push_back always make a copy of object (even passing by reference) :D
http://stackoverflow.com/questions/11762474/c-stl-vector-push-back-taking-reference

If I understand correctly, if using vector<Frame> and doing v.push_back(frame), the frame object is copied twice (to pass as a variable to push_back and to make a copy for vector container). If using vector<Frame&>, the frame object is only copied once (to make a copy for vector container).

this is my idea about slicing Trajectory to give a view

vector<Frame&> v_frame1, v_frame2
// create v_frame1
//try to push to v_frame2
v_frame2.push_back(v_frame1[0])
//update coords in v_frame2[0] and expect this will update coords i v_frame1[0] too. 
//But this does not happen since `push_back` always make a copy

I guess I need to create vector<Frame*> rather than vector<Frame>. (But like I said, I tried vector<Frame*> too and got segfault. The memory stuff is very complicated to me now). So, I give up and pleased with current design of pytraj, (which means slicing Trajectory will give a new copied Trajectory)

Hai

@hainm
Copy link
Contributor

hainm commented May 7, 2015

ah,

I make vector<Frame*> working for pytraj now (anyway, still feel that C++ is too complicated :D). Great. So now I can slice Trajectory and return a view of it.

new update:
wow, if using Trajectory as a wrapper of vector<Frame>, I can not subclass it in Python level due to getting double-free memory. However using vector<Frame*> does the trick and I can freely subclass Trajectory at Python level. This is great deal since I can freely update python code without recompiling Cython code.

Hai

@hainm
Copy link
Contributor

hainm commented May 7, 2015

Finally I can get a view when slicing Trajectory. Thank you guys for stimulating discussion.

In [2]: from pytraj import io
In [3]: from pytraj.testing import aa_eq

In [4]: # load sample to `Trajectory` (tz2.ortho.*)
In [5]: traj = io.load_sample_data("tz2")[:]

In [6]: mylist = [1, 5, 8]
In [7]: # slicing to get 3 frames. `aa_eq` = `assert_almost_equal to 7 decimal place
In [8]: fa0 = traj[mylist]
In [9]: fa1 = traj[mylist]

In [10]: aa_eq(fa0.xyz, fa1.xyz)
In [11]: aa_eq(fa0.xyz, traj[mylist].xyz)

In [12]: # update fa0 and make sure fa1, traj are updated too
In [13]: fa0[0, 0] = [100., 101., 102.]

In [14]: assert fa1[0, 0, 0] == fa0[0, 0, 0] == 100.
In [15]: assert traj[1, 0, 0] == 100.

In [16]: # result
In [17]: print (fa0[0 , 0], fa1[0 , 0], traj[1 , 0])
[ 100.  101.  102.] [ 100.  101.  102.] [ 100.  101.  102.]

@hainm
Copy link
Contributor

hainm commented May 7, 2015

just tested slicing speed and pytraj is 4 times faster than numpy version in mdtraj
(and 8 times faster than my older version of FrameArray with vector<Frame>

In [8]: fa
Out[8]: 
<Trajectory with 1000 frames, 17443 atoms/frame>

In [9]: m_traj
Out[9]: <mdtraj.Trajectory with 1000 frames, 17443 atoms, 5666 residues, and unitcells 
at 0x2aaab94eb320>

In [10]: s = slice(0, 1000, 5)

In [11]: %timeit fa[s]
10 loops, best of 3: 44.1 ms per loop

In [12]: %timeit m_traj[s]
10 loops, best of 3: 179 ms per loop

@hainm
Copy link
Contributor

hainm commented May 12, 2015

Cython is so cool. I just implemented very fast slicing by add more types for pytraj in Cython code.
Unbelievably the new slicing method is >= 1600 times my fastest implementation in above test (and therefore about >= 6000 times faster than mdtraj) :D

In [5]: fa = io.load("remd.000.nc", top="myparm.parm7")                                                                                             

In [6]: fa
Out[6]: 
<pytraj.Trajectory with 1000 frames: <Topology with 5634 mols, 5666 residues, 17443 atoms, 17452 bonds, PBC with box type = truncoct>>


In [7]: s = slice(0, 1000, 5)

In [8]: %timeit fa._fast_slice (s)
The slowest run took 4.24 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 27.4 µs per loop

@swails I highly recommend to use Cython if you ever want to speedup your ParmEd. It's good for
others to look at the code if they don't know to write ext in C/C++ like you.

Hai

@hainm hainm mentioned this issue May 22, 2015
Closed
@hainm
Copy link
Contributor

hainm commented Jun 5, 2015

I am closing this since names were changed.

@hainm hainm closed this as completed Jun 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants