File search gets too slow with large databases #60

Closed
ftadel opened this issue Feb 9, 2018 · 5 comments

ftadel commented Feb 9, 2018

The issue is that every file search compares the requested filename with all the filenames in the database, at every call. The function bst_get/findFileInStudies is a bottleneck for processes that do many operations on small files.
Issue reported by Emily: http://neuroimage.usc.edu/forums/t/processes-slowing-way-down/5280/4

Possible solution: maintain a large hash table with all the filenames in the database?
(Maybe this could be managed internally in bst_get.m.)
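
A minimal sketch of what such a cache could look like in Matlab, purely for illustration (the variable names, file names, and indices below are hypothetical, not actual Brainstorm code):

```matlab
% Hypothetical filename -> index cache (illustrative only, not actual Brainstorm code).
% Keys: relative file names; values: [iStudy, iItem] indices into the database.
fileCache = containers.Map('KeyType', 'char', 'ValueType', 'any');

% Fill the cache once, e.g. when the protocol is loaded:
fileCache('Subject01/Condition01/data_trial001.mat') = [3, 12];

% Later, a lookup replaces the linear scan over all studies:
fileName = 'Subject01/Condition01/data_trial001.mat';
if isKey(fileCache, fileName)
    idx = fileCache(fileName);   % idx = [iStudy, iItem]
else
    idx = [];                    % not found: fall back to the existing search
end
```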

ftadel changed the title from "File search gets too slow with databases" to "File search gets too slow with large databases" on Feb 9, 2018

ftadel commented Feb 12, 2018

Attached: results of a profiling session illustrating the problem.
Open file0.html in any browser and start navigating; the issue is self-explanatory.

profile_results.zip

martcous commented

In function bst_get/findFileInStudies, couldn't we check whether the file we're looking for exists in each protocol study, instead of going through every file of the study?

martcous commented

If we decide to go for a major refactoring, I would build a tree structure. I feel a hierarchical structure is more appropriate than a hash table, since files are organized into protocols, studies, subjects, conditions, and so forth.
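
For comparison, a rough sketch of the hierarchical idea with nested Matlab structs (all names below are invented for illustration; they only echo the protocol/subject/condition organization):

```matlab
% Hypothetical hierarchical index: protocol > subject > condition > files.
dbTree.Protocol01.Subject01.Condition01.Files = { ...
    'data_trial001.mat', 'data_trial002.mat'};

% Finding a file then means splitting the relative path and walking the tree:
relPath = 'Subject01/Condition01/data_trial001.mat';
parts   = strsplit(relPath, '/');
files   = dbTree.Protocol01.(parts{1}).(parts{2}).Files;
isFound = any(strcmp(files, parts{3}));
```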


ftadel commented Feb 23, 2018

> In function bst_get/findFileInStudies, couldn't we check whether the file we're looking for exists in each protocol study, instead of going through every file of the study?

You mean checking the actual file on the hard drive?
This would be awfully slow.

> If we decide to go for a major refactoring, I would build a tree structure. I feel a hierarchical structure is more appropriate than a hash table, since files are organized into protocols, studies, subjects, conditions, and so forth.

Indeed, this flat structure is quite bad. It dates back to way before I joined.
But I don't think it is that urgent to redesign everything now... It works OK for most people as it is, so we don't want to introduce any major disruption in the way the database is organized. If we changed the database structure itself, we'd have to handle the update of thousands of old databases. I've done this before; it's painful.

Maybe we could do something relatively light on top of the existing structure: keep all the file names of the current database in a cache, so that we can find the corresponding study/subject/file indices faster.
A Java Hashtable object (keys = filenames, values = indices in the bst database) into which we dump everything could be a fast way to find a file. If we wanted to go for a tree, we'd have to parse the string first (splitting on file separators) and check the number of elements. Getting to the correct information would be faster, but with a terrible overhead.
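
A minimal sketch of the Java Hashtable approach as called from Matlab (the file name and indices are made up; only the put/get lookup pattern matters):

```matlab
% Hypothetical Java hash table: keys = file names, values = database indices.
ht = java.util.Hashtable;
ht.put('Subject01/Condition01/data_trial001.mat', int32([3 12]));

% get() returns [] in Matlab when the key is missing:
val = ht.get('Subject01/Condition01/data_trial001.mat');
if ~isempty(val)
    idx = double(val);   % idx = [iStudy, iItem]
end
```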

You could try benchmarking a few of these techniques (a rough sketch follows the list):

  • write a small function that creates a large hashtable from a large Brainstorm database with tens of thousands of files (I'm sure you can find one somewhere on the BIC network), then try searching for thousands of files in a loop and evaluate the time it takes with the Matlab profiler
  • repeat the same with a different data structure
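
A rough outline of such a benchmark with synthetic file names (the sizes and naming scheme below are arbitrary, just to get an order of magnitude):

```matlab
% Synthetic benchmark: build a large filename -> index map, then time many lookups.
nFiles = 50000;
fileNames = arrayfun(@(i) sprintf('Subject%03d/Cond%02d/data_%06d.mat', ...
    mod(i, 50), mod(i, 10), i), 1:nFiles, 'UniformOutput', false);

% Build the cache
tic
cache = containers.Map(fileNames, num2cell(1:nFiles));
fprintf('Build: %.3f s\n', toc);

% Time a few thousand random lookups
nLookups = 10000;
queries  = fileNames(randi(nFiles, 1, nLookups));
tic
for i = 1:nLookups
    idx = cache(queries{i}); %#ok<NASGU>
end
fprintf('%d lookups: %.3f s\n', nLookups, toc);
```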

Does it make sense?
I'm on a bus now, but we could discuss next week if you need help

martcous commented

Tentative speedup implemented in 6d6a453.

ftadel closed this as completed on Nov 28, 2018