File search gets too slow with large databases #60

Closed
ftadel opened this issue Feb 9, 2018 · 5 comments

ftadel commented Feb 9, 2018

The issue is that every file search compares the requested filename with all the filenames in the database, at every call. The function bst_get/findFileInStudies is a bottleneck for processes that do many operations on small files.
Issue reported by Emily: http://neuroimage.usc.edu/forums/t/processes-slowing-way-down/5280/4

Possible solution: maintain a large hash table with all the filenames in the database?
(Maybe this could be managed internally in bst_get.m.)
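
A minimal sketch of what such a cache could look like in Matlab, purely for illustration (the variable names, file names, and indices below are hypothetical, not actual Brainstorm code):

```matlab
% Hypothetical filename -> index cache (illustrative only, not actual Brainstorm code).
% Keys: relative file names; values: [iStudy, iItem] indices into the database.
fileCache = containers.Map('KeyType', 'char', 'ValueType', 'any');

% Fill the cache once, e.g. when the protocol is loaded:
fileCache('Subject01/Condition01/data_trial001.mat') = [3, 12];

% Later, a lookup replaces the linear scan over all studies:
fileName = 'Subject01/Condition01/data_trial001.mat';
if isKey(fileCache, fileName)
    idx = fileCache(fileName);   % idx = [iStudy, iItem]
else
    idx = [];                    % not found: fall back to the existing search
end
```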

ftadel changed the title from "File search gets too slow with databases" to "File search gets too slow with large databases" on Feb 9, 2018

ftadel commented Feb 12, 2018

Attached: results of a profiling session illustrating the problem.
Open file0.html in any browser and start navigating; the issue is self-explanatory.

profile_results.zip

martcous commented

In function bst_get/findFileInStudies, couldn't we check whether the file we're looking for exists in each protocol study, instead of going through every file of the study?

martcous commented

If we decide to go for a major refactoring, I would build a tree structure. I feel a hierarchical structure is more appropriate than a hash table, since files are organized into protocols, studies, subjects, conditions, and so forth.
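
For comparison, a rough sketch of the hierarchical idea with nested Matlab structs (all names below are invented for illustration; they only echo the protocol/subject/condition organization):

```matlab
% Hypothetical hierarchical index: protocol > subject > condition > files.
dbTree.Protocol01.Subject01.Condition01.Files = { ...
    'data_trial001.mat', 'data_trial002.mat'};

% Finding a file then means splitting the relative path and walking the tree:
relPath = 'Subject01/Condition01/data_trial001.mat';
parts   = strsplit(relPath, '/');
files   = dbTree.Protocol01.(parts{1}).(parts{2}).Files;
isFound = any(strcmp(files, parts{3}));
```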


ftadel commented Feb 23, 2018

> In function bst_get/findFileInStudies, couldn't we check whether the file we're looking for exists in each protocol study, instead of going through every file of the study?

You mean checking the actual file on the hard drive?
This would be awfully slow.

> If we decide to go for a major refactoring, I would build a tree structure. I feel a hierarchical structure is more appropriate than a hash table, since files are organized into protocols, studies, subjects, conditions, and so forth.

Indeed, this flat structure is quite bad. It dates back to way before I joined.
But I don't think it is that urgent to redesign everything now... It works OK for most people as it is, so we don't want to introduce any major disruption in the way the database is organized. If we changed the database structure itself, we'd have to handle the update of thousands of old databases. I've done this before; it's painful.

Maybe we could do something relatively light on top of the existing structure: keep all the file names of the current database in a cache, so that we can find the corresponding study/subject/file indices faster.
A Java Hashtable object (keys = filenames, values = indices in the bst database) into which we dump everything could be a fast way to find a file. If we wanted to go for a tree, we'd have to parse the string first (splitting on file separators) and check the number of elements. Getting to the correct information would be faster, but with a terrible overhead.
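
A minimal sketch of the Java Hashtable approach as called from Matlab (the file name and indices are made up; only the put/get lookup pattern matters):

```matlab
% Hypothetical Java hash table: keys = file names, values = database indices.
ht = java.util.Hashtable;
ht.put('Subject01/Condition01/data_trial001.mat', int32([3 12]));

% get() returns [] in Matlab when the key is missing:
val = ht.get('Subject01/Condition01/data_trial001.mat');
if ~isempty(val)
    idx = double(val);   % idx = [iStudy, iItem]
end
```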

You could try benchmarking a few of these techniques (a rough sketch follows the list):

  • write a small function that creates a large hashtable from a large Brainstorm database with tens of thousands of files (I'm sure you can find one somewhere on the BIC network), then try searching for thousands of files in a loop and evaluate the time it takes with the Matlab profiler
  • repeat the same with a different data structure
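
A rough outline of such a benchmark with synthetic file names (the sizes and naming scheme below are arbitrary, just to get an order of magnitude):

```matlab
% Synthetic benchmark: build a large filename -> index map, then time many lookups.
nFiles = 50000;
fileNames = arrayfun(@(i) sprintf('Subject%03d/Cond%02d/data_%06d.mat', ...
    mod(i, 50), mod(i, 10), i), 1:nFiles, 'UniformOutput', false);

% Build the cache
tic
cache = containers.Map(fileNames, num2cell(1:nFiles));
fprintf('Build: %.3f s\n', toc);

% Time a few thousand random lookups
nLookups = 10000;
queries  = fileNames(randi(nFiles, 1, nLookups));
tic
for i = 1:nLookups
    idx = cache(queries{i}); %#ok<NASGU>
end
fprintf('%d lookups: %.3f s\n', nLookups, toc);
```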

Does it make sense?
I'm on a bus now, but we could discuss next week if you need help

martcous commented

Tentative speedup implemented in 6d6a453.

ftadel closed this as completed on Nov 28, 2018