
340 Gigabyte data and error while creating .mex from mapped_tensor_shim.c #16

Closed
kerim371 opened this issue Oct 3, 2018 · 4 comments


kerim371 commented Oct 3, 2018

Dylan,
I use Windows 7 x64. When I try to compile the .mex function from mapped_tensor_shim.c I get an error. I deleted line 33, which contains "#define UINT64_C(c) c ## i64", and after that it compiled normally and I could use MappedTensor.

I tried to work with 340 gigabytes of data:
mtVar = MappedTensor([r_path r_file],[8240 42654189], 'Class', 'uint8');
So I have a uint8 matrix with size(mtVar) = [8240 42654189]. I then use the following command to get the data (I need 42654189 elements, which weigh about 42 megabytes):
tic; mtVar(1,:); toc
The elapsed time is almost 9 hours. By the way, it doesn't require much RAM or CPU, unlike memmapfile, which consumes 5 gigabytes of RAM (all my RAM) in 5 minutes and hangs my machine.
Is it possible to speedup access to such data?
By the way, if I try to get 10^5 elements:
tic; mtVar(1,1:10^5); toc
It takes about 66 seconds. If I then rerun this command, the elapsed time is less than 1 second. Why?

dylan

DylanMuir self-assigned this Oct 4, 2018
DylanMuir added the bug label Oct 4, 2018
DylanMuir (Owner) commented

Hi Kerim

I've pushed an attempted fix for the compile problem under MinGW to branch iss16. Please pull that branch and see if you can compile successfully.

DylanMuir (Owner) commented

Regarding access time, MappedTensor does the best it can to read data in contiguous chunks, and in the fewest possible disk accesses. Is the data located on a network drive? That can of course slow down access.

In general, accessing contiguous regions of a file is fast, while accessing bits and pieces is slow. So accessing mtVar(:, 1) will often be much faster than accessing mtVar(1, :), even if mtVar is square. So if there's a way you can store your data transposed, then accessing the elements you need will be much faster.
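
For example, here is a minimal sketch of the difference (a hypothetical square tensor backed by a temporary file; the name mtSq is only illustrative, and the timings will depend on your drive):

mtSq = MappedTensor([1e4 1e4], 'Class', 'uint8');   % square, ~100 MB backing file
tic; c = mtSq(:, 1); toc   % one contiguous 10 kB read from disk: fast
tic; r = mtSq(1, :); toc   % 1e4 small reads scattered across the file: much slower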

Regarding the second run being much faster than the first, this is a disk access caching issue. On the first run the data is actually read from the drive / network. The OS then caches this data in memory, so the second run is reading from memory rather than from disk. This all happens behind the scenes as far as MappedTensor is concerned, so there's no way for me to affect that.


kerim371 commented Oct 4, 2018

Thank you for the reply, Dylan.
You are right; now I can compile all three mex functions without any problem, and MappedTensor no longer has errors highlighted in red. Thank you for fixing this.

I forgot to tell you that my data is stored on an external hard drive. A few minutes ago I launched the same read in a loop, just to see whether access to the data would be faster:
a = zeros(1,10^7,'uint8');
N = 10^4;
tic
for n = 1:4000
    a(1,(n-1)*N+1:n*N) = mtVar(1,(n-1)*N+1:n*N);
end
toc
I tried to transpose the data with mtVar = mtVar'; but that didn't help. But I think you meant that I should transpose the data, write it to disk, and only then use mtVar(1, :) instead of mtVar(:, 1)? If so, I think I could try it (on the external disk) after the loop has completed.

What do you think: if the data were stored on a local hard drive, would access to the same data take a few seconds or minutes? I can't check it now because I don't have enough space on my local hard drive.

Is there a way to store data in a complicated format? I mean my data is recorded as 8 bytes of int16, 4 bytes of int32 and 1024 bytes of single, repeated N times starting from an offset of 5000 bytes. If I used memmapfile, I would write something like:
memVar = memmapfile([r_path r_file], 'Offset', 5000, 'Format', ...
    {'int16',  [4 1],   'a'; ...
     'int32',  [1 1],   'b'; ...
     'single', [256 1], 'c'}, 'Repeat', N, 'Writable', true);
Is it possible to do something similar using MappedTensor?

DylanMuir (Owner) commented

I think the access will definitely not be faster in a loop; MappedTensor does the best it can to read the data efficiently, and looping in Matlab certainly won't be faster than that.
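
For comparison, a single indexed read over the same range lets MappedTensor batch the disk accesses itself (a sketch using the sizes from your loop above):

tic; a = mtVar(1, 1:4000*10^4); toc   % one call instead of 4000 separate reads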

Re transposing, yes you're right. I mean transposing the data when you write it to disk, not transposing from Matlab. If you need to read the data many times, then it might be worthwhile to use MappedTensor to transpose the data and write it back to disk; then you can use the transposed data file from then on. But if you only need to read the data once to process it, then this approach won't help you.
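
If it helps, here is a minimal sketch of that one-off conversion (the output file name, block size and the use of plain fseek/fwrite rather than the MappedTensor API are illustrative assumptions; mtVar is the [8240 x 42654189] uint8 tensor from above):

[R, C] = size(mtVar);                       % [8240 42654189]
nb  = 1e4;                                  % columns per block (~82 MB per block)
fid = fopen('data_transposed.bin', 'w');    % will hold the [C x R] transposed matrix
for j0 = 1:nb:C
    j1    = min(j0 + nb - 1, C);
    block = mtVar(:, j0:j1);                % contiguous read of a column block: fast
    for i = 1:R                             % row i of the original becomes column i
        fseek(fid, (i-1)*C + (j0-1), 'bof');
        fwrite(fid, block(i, :), 'uint8');
    end
end
fclose(fid);
% The transposed file can then be mapped as
% mtT = MappedTensor('data_transposed.bin', [C R], 'Class', 'uint8');
% and the original mtVar(1,:) becomes the fast, contiguous mtT(:,1).

This conversion itself still involves many small writes, so as above it only pays off if you then read the transposed file many times.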

No, there's no way to store complicated data formats using MappedTensor, since it must appear as a single Matlab variable. You can however use several binary files for storing the different fields, and access them together.
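
As a sketch of that split-into-several-files approach, using the record layout from your memmapfile example (the output file names and the one-record-at-a-time conversion loop are only illustrative; the split is done once, after which each field maps cleanly on its own):

% One record: 4 x int16 (8 bytes) + 1 x int32 (4 bytes) + 256 x single (1024 bytes)
fin = fopen([r_path r_file], 'r');
fa  = fopen('field_a.bin', 'w');
fb  = fopen('field_b.bin', 'w');
fc  = fopen('field_c.bin', 'w');
fseek(fin, 5000, 'bof');                    % skip the 5000-byte header
for k = 1:N
    fwrite(fa, fread(fin, 4,   'int16=>int16'),   'int16');
    fwrite(fb, fread(fin, 1,   'int32=>int32'),   'int32');
    fwrite(fc, fread(fin, 256, 'single=>single'), 'single');
end
fclose(fin); fclose(fa); fclose(fb); fclose(fc);
% Each field is now a flat binary file that can be mapped separately:
mtA = MappedTensor('field_a.bin', [4 N],   'Class', 'int16');
mtB = MappedTensor('field_b.bin', [1 N],   'Class', 'int32');
mtC = MappedTensor('field_c.bin', [256 N], 'Class', 'single');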
