Indexing a large number of files increases Albert's memory usage drastically #1

Closed
hotice opened this Issue Jan 19, 2015 · 19 comments


9 participants
hotice commented Jan 19, 2015

Albert uses just about 9-10 MiB of RAM on my system, which is really great. But after I set it to index some folders containing a large number of files (roughly 280,000), its memory usage jumped to 280-300 MiB of RAM.

Tested under Ubuntu 14.10, 64-bit.

ManuelSchneid3r (Member) commented Jan 19, 2015
Yes, the reason for Albert's performance is that the files are indexed in exactly the form the search algorithm needs. The word match produces an inverted index, which means there are as many entries as there are words in your file names. The fuzzy index is more complex and, depending on the config, uses about three to five times as much memory.

The key for further development is a space-efficient index structure for the file index, which is planned. Another idea is to reduce the number of files, e.g. by MIME type, which is planned too, but in the far future.

If you really need all of those 280,000 files to be launchable from Albert, then I would say you are an advanced user and have to accept the amount of memory used (although the first idea would still help reduce it). If not, you can still choose which folders should be indexed (and especially which should not).

I will leave this open and come back when one of the ideas is done.
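The two structures described above can be sketched roughly like this. This is a simplified illustration (not Albert's actual code) of why a q-gram fuzzy index holds several times more entries than the plain inverted index:

```python
from collections import defaultdict

def tokenize(name):
    # Naive word splitting on spaces, dots and underscores.
    return name.lower().replace(".", " ").replace("_", " ").split()

def build_inverted_index(filenames):
    """Word match: one entry per distinct word in the file names."""
    index = defaultdict(set)
    for name in filenames:
        for word in tokenize(name):
            index[word].add(name)
    return index

def qgrams(word, q=3):
    """Overlapping q-grams of a word, padded at the front."""
    padded = "#" * (q - 1) + word
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def build_fuzzy_index(filenames, q=3):
    """Fuzzy match: one entry per distinct q-gram, so every word spawns
    roughly len(word) entries -- hence the memory multiplier."""
    index = defaultdict(set)
    for name in filenames:
        for word in tokenize(name):
            for gram in qgrams(word, q):
                index[gram].add(name)
    return index

files = ["holiday photo.jpg", "photo album", "notes.txt"]
inv = build_inverted_index(files)
fuzzy = build_fuzzy_index(files)
print(len(inv), len(fuzzy))  # the q-gram index has far more keys
```

A q-gram lookup tolerates typos because a misspelled query still shares most of its q-grams with the correct word, at the cost of the larger index.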


hotice commented Jan 19, 2015
Thanks for the answer!


anibalardid commented Feb 13, 2015
Hi Manuel. I really love this app. It's cute, fast and functional!

Do you have an ETA for new updates?

Regards!


simrc commented May 2, 2015
Hi Manuel, congratulations on the idea; the software is very well designed! I have a bug: after a few days of use, it begins to show results in Chinese, no longer looks in the directories, and gives no results. I hope this is helpful, and I look forward to new updates on this great application!


ManuelSchneid3r (Member) commented May 2, 2015
Yes, currently this is the main problem, I know. The plugin system is mostly done. At the moment I am working on the "ports" of the modules to plugins. I have spent a lot of time on the files plugin in particular, trying to figure out how to reduce the memory usage. I tried different space-efficient, in-memory data structures like folder maps or radix trees, but all of them are quite cumbersome to handle. Currently I am testing a SQLite database containing the data. I guess this will be the solution for the future, if it does not slow down the lookup too much. It has one essential advantage: no more optimizing for space, and no limit on future ideas that may imply larger memory usage, e.g. metadata, aliases and the like, since the megabytes will not hurt on disk.
The Chinese garbage is somehow related to the serialization. It will not be an issue anymore with the database.

@anibalardid I guess not, this is still a hobby project. If some people volunteer to contribute I might turn it into an organization project. But at the moment my work on Albert and the releases depend heavily on my studies.

Regards
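The SQLite idea above can be sketched as follows. The two-table schema and the function names here are hypothetical, purely to illustrate the tradeoff of moving the index out of process memory and onto disk:

```python
import sqlite3

# In-memory DB for the example; the real benefit comes from an on-disk file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT)")
con.execute("CREATE TABLE words (word TEXT, file_id INTEGER)")
con.execute("CREATE INDEX idx_word ON words (word)")  # makes prefix lookups fast

def index_file(path):
    cur = con.execute("INSERT INTO files (path) VALUES (?)", (path,))
    file_id = cur.lastrowid
    name = path.rsplit("/", 1)[-1].lower()
    for word in name.replace(".", " ").split():
        con.execute("INSERT INTO words (word, file_id) VALUES (?, ?)",
                    (word, file_id))

def lookup_prefix(prefix):
    rows = con.execute(
        "SELECT DISTINCT f.path FROM words w JOIN files f ON f.id = w.file_id "
        "WHERE w.word LIKE ? || '%'", (prefix,))
    return [r[0] for r in rows]

index_file("/home/user/holiday photo.jpg")
index_file("/home/user/notes.txt")
print(lookup_prefix("pho"))  # -> ['/home/user/holiday photo.jpg']
```

The open question the comment raises remains visible even in this toy: every lookup is now a B-tree traversal plus a join rather than a hash lookup in RAM, so whether the disk-backed index is fast enough has to be measured.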


drgibbon commented Jul 16, 2015
@ManuelSchneid3r I wonder if using an established search library might make things easier? For example, Recoll (a nice Qt desktop search program) uses the Xapian backend, and it's extremely fast and efficient, even with very large numbers of files.


ManuelSchneid3r (Member) commented Oct 6, 2015
Since the introduction of selective indexing this should not be an issue anymore. Please reopen if I'm wrong.


ManuelSchneid3r (Member) commented Oct 6, 2015
@drgibbon Yes, Xapian is a nice idea. It will get a dedicated plugin at some point, but there are tons of other things to do.


baltazarortiz commented Apr 7, 2016
Still an issue, unfortunately: it uses about 1.5 GB of memory to index my home folder. Arguably worth the fast search speed compared to other launchers I've tried, though something like Recoll's solution would be interesting. I'll have to look into how that works once I learn a bit more about the Albert code.


ManuelSchneid3r (Member) commented Apr 7, 2016
Recoll's backend is Xapian, which has been discussed. Its scope is different, though no less relevant. Xapian is a software system containing lexers, indexers and searches too, with additional scientific features like stemming and sophisticated search algorithms like BM25. There are reasons why a (future) Xapian extension (XE) and the file extension (FE) will never be merged:

  • FE uses util::offlineIndex, which supports a static tokenizer and two searches, prefix and fuzzy, with their index counterparts, the inverted and the q-gram index. Xapian has its own tokenizer, indexes and searches, which, by the way, means that, afaik, Xapian does not support error-tolerant search at the moment.
  • FE supports usage counters; Xapian does not (afaik).
  • However, Xapian has the advantage of document full-text search, which clearly separates its use case from the FE. FE works on the file level for most MIME types; Xapian works on the document content level, which is fine.

So there are different use cases for each, and there is definitely a need for a Xapian (or any other text analysis tool) integration. But actually I have plenty of core work to do. I'd appreciate it if you took part in development and integrated it, but please communicate it if you do so.

Back to topic: the memory problem is still the same; excluding irrelevant MIME types is just a temporary workaround. The current architecture is completely in memory and (simplified) as follows:

[diagram omitted]

Wherever you see stars there is room for even naive optimizations.

A serious problem is the size of QString. There are some ideas that may or may not be trivial to implement:

  • Use radix/prefix trees
  • Use special maps that do not store the key value (hash maps)
  • Completely outsource indexing, e.g. to databases like SQLite

But all of that is highly involved, both practically and theoretically, and takes a lot of time. Furthermore, getting this huge number of files indexed is an advanced requirement; my 20,000 files do not even have an impact on memory.

@baltazarortiz how many files did you index? Another aspect is that Albert does not require the complete 1.5 GB permanently. It was allocated while indexing and freed afterwards. The kernel may get some of it back if it really needs it. Well, virtual memory management is complicated.
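To illustrate the first bullet above: a prefix trie stores each shared prefix only once, which is what makes it a candidate for shrinking the in-memory index. A minimal sketch (not Albert's code, and a plain trie rather than a compressed radix tree):

```python
class TrieNode:
    __slots__ = ("children", "paths")  # avoids a per-node __dict__

    def __init__(self):
        self.children = {}  # one entry per distinct next character
        self.paths = []     # file paths whose indexed word ends here

def insert(root, word, path):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.paths.append(path)

def prefix_search(root, prefix):
    """Walk down the prefix, then collect every path in the subtree."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    results, stack = [], [node]
    while stack:
        n = stack.pop()
        results.extend(n.paths)
        stack.extend(n.children.values())
    return results

root = TrieNode()
insert(root, "photo", "/home/user/photo.jpg")
insert(root, "photos", "/home/user/photos")
# "photo" and "photos" share one chain of nodes for the prefix "photo".
print(sorted(prefix_search(root, "pho")))
```

The catch mentioned in the comment is visible here too: one node (and its child map) per character is its own overhead, so the savings only materialize with path compression (radix/Patricia style), which is exactly the part that is cumbersome to implement.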


baltazarortiz commented Apr 7, 2016
I'm indexing ~800,000 files with fuzzy search on, so like I said before, I'm not expecting any program to be able to go through all of that without some memory usage. It hasn't noticeably impacted system performance, so my system must either not need more or be able to grab some of the freed memory like you said. I'm pretty new to software development (and especially open-source work), but I've been wanting to learn more about Qt, so this could be something fun to look into in what free time I have right now :)

On a side note, are the indexing information and other settings supposed to be stored anywhere? Every time I reboot, Albert loses any settings changes I've made, whether to the appearance or to any of the plugins.


somas95 commented Jun 8, 2016
Different plugins for the different DEs' backends would be nice: one for Tracker (GNOME) and one for Nepomuk (KDE).
Thanks for this wonderful program!


@ManuelSchneid3r ManuelSchneid3r modified the milestone: v0.8.12 Sep 30, 2016

@ManuelSchneid3r ManuelSchneid3r self-assigned this Oct 5, 2016

@idkCpp idkCpp referenced this issue Oct 18, 2016

Closed

High Ram Usage #269

@ManuelSchneid3r ManuelSchneid3r modified the milestones: v0.8.12, v0.9, 0.9.1 Jan 4, 2017

ayoisaiah commented Jan 28, 2017
Is it possible to add a feature to exclude certain folders from the index no matter which directory they appear in? For example, I have several node_modules/ folders scattered around my home directory, usually containing thousands of files. I would like to ignore them globally.


ManuelSchneid3r (Member) commented Jan 28, 2017
Will come.


PaulBGD commented Mar 12, 2017
Definitely need ignoring node_modules, those folders are huge!


ManuelSchneid3r (Member) commented Mar 12, 2017

PaulBGD commented Mar 12, 2017
@ManuelSchneid3r Oh wow, thanks! I figured this issue would be updated.


ManuelSchneid3r (Member) commented Apr 15, 2017
Does v0.11 help to reduce the size? It does not touch the way things are stored; it just gives the user the opportunity to narrow the indexed files down to those really needed. Unfortunately, everything else would be a tradeoff between space and speed.


@ManuelSchneid3r ManuelSchneid3r modified the milestones: v0.11, v0.12 Apr 15, 2017

ManuelSchneid3r (Member) commented May 14, 2017
I am closing this issue. There is no way to asymptotically reduce the space complexity, only constant factors. Maybe I am wrong; maybe a Patricia trie or some other data structure could help, but those do not come for free either. I have even stopped investigating how to improve the space requirements (it takes too much time). Therefore the best way to reduce space is to set proper filters. There are MIME filters, and a proper implementation of (global) name filters will come soon. For now there are .albertignore files.

I cannot imagine that it is a realistic use case that anybody needs tens or hundreds of thousands of files at hand.
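An ignore filter along those lines can be approximated with glob patterns. This sketch uses hypothetical patterns and standard os.walk pruning, not the plugin's actual .albertignore matching rules, to show how a directory like node_modules can be skipped before its contents are ever visited:

```python
import fnmatch
import os
import tempfile

# Hypothetical ignore patterns; the real .albertignore syntax may differ.
IGNORE_PATTERNS = ["node_modules", ".git", "*.o"]

def ignored(name):
    return any(fnmatch.fnmatch(name, pat) for pat in IGNORE_PATTERNS)

def walk_index(root):
    """Collect indexable file names, pruning ignored directories so their
    contents are never walked at all."""
    indexed = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if not ignored(d)]  # prune in place
        indexed.extend(f for f in filenames if not ignored(f))
    return sorted(indexed)

# Tiny demo tree: one real file plus a node_modules directory to skip.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "node_modules", "pkg"))
open(os.path.join(root, "node_modules", "pkg", "index.js"), "w").close()
open(os.path.join(root, "notes.txt"), "w").close()
print(walk_index(root))  # -> ['notes.txt']
```

Pruning the directory list in place is the important part for the memory complaint in this thread: an ignored subtree costs neither index entries nor traversal time.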

