lyrics: Avoid searches with empty metadata and repeated searches for the same metadata #2635
---
I researched Bloom filters for this recently and found two main implementations: pybloom, which seems abandoned but has an active fork (that doesn't seem to have been picked up by distros, by the way), and pybloomfiltermmap, a Cython-based approach using mmap that is faster by a few orders of magnitude. I'm not sure this is such a critical path that we'd need the mmap version, but I guess this is something to consider...
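For reference, here's a minimal usage sketch of both libraries (constructor signatures taken from the benchmark script later in this thread; the `filter.bloom` path is just an example):

```python
# pybloom: pure Python; the scalable variant grows as elements are added
from pybloom import BloomFilter, ScalableBloomFilter
# pybloomfiltermmap: Cython implementation backed by an mmap'd file
from pybloomfilter import BloomFilter as BloomFilterMmap

sbf = ScalableBloomFilter()                             # grows on demand
fixed = BloomFilter(capacity=100000, error_rate=0.1)    # fixed capacity
mmapped = BloomFilterMmap(100000, 0.1, 'filter.bloom')  # capacity, error rate, backing file

for f in (sbf, fixed, mmapped):
    f.add('someone-some title')
    # membership may be a false positive, but never a false negative
    assert 'someone-some title' in f
```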
---
A Bloom filter would look something like this:

```diff
diff --git a/beetsplug/lyrics.py b/beetsplug/lyrics.py
index 57265a46..871c52b3 100644
--- a/beetsplug/lyrics.py
+++ b/beetsplug/lyrics.py
@@ -32,6 +32,8 @@
 import six
 from six.moves import urllib
 
+from pybloom import ScalableBloomFilter
+
 try:
     from bs4 import SoupStrainer, BeautifulSoup
     HAS_BEAUTIFUL_SOUP = True
@@ -641,6 +643,9 @@ def __init__(self):
         self.config['google_engine_ID'].redact = True
         self.config['genius_api_key'].redact = True
 
+        # bloom filter of non-existent lyrics
+        self.bf = ScalableBloomFilter()
+
         # State information for the ReST writer.
         # First, the current artist we're writing.
         self.artist = u'Unknown artist'
@@ -858,12 +863,17 @@ def get_lyrics(self, artist, title):
         """Fetch lyrics, trying each source in turn. Return a string or
         None if no lyrics were found.
         """
+        # if we already failed to find that pair, just skip the request
+        if artist + '-' + title in self.bf:
+            return
         for backend in self.backends:
             lyrics = backend.fetch(artist, title)
             if lyrics:
                 self._log.debug(u'got lyrics from backend: {0}',
                                 backend.__class__.__name__)
                 return _scrape_strip_cruft(lyrics, True)
+        # remember our failure
+        self.bf.add(artist + '-' + title)
 
     def append_translation(self, text, to_lang):
         import xml.etree.ElementTree as ET
```

It significantly speeds up the startup here, at least.
---
Hmm; good point. Here are a few scattered comments: […]
---
Yep, I definitely got the Bloom filter backwards here, sorry. We should remember successes and bail out when the element is not in the filter. Filtering out empty combinations is one thing, but there are also actual duplicates across compilations and so on that we could catch.
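A hedged sketch of that inverted logic, reusing the `self.bf` attribute and key format from the diff above, and assuming the filter of successful pairs was populated on an earlier complete pass and persisted between runs (a cold, empty filter would skip everything):

```python
    def get_lyrics(self, artist, title):
        """Fetch lyrics, trying each source in turn. Return a string or
        None if no lyrics were found.
        """
        key = artist + '-' + title
        # A Bloom filter has no false negatives: if the key is absent,
        # no backend ever returned lyrics for this pair, so the network
        # requests can safely be skipped. A false positive only costs
        # one redundant round of queries; it can never lose lyrics.
        if key not in self.bf:
            return
        for backend in self.backends:
            lyrics = backend.fetch(artist, title)
            if lyrics:
                self._log.debug(u'got lyrics from backend: {0}',
                                backend.__class__.__name__)
                self.bf.add(key)  # remember the success
                return _scrape_strip_cruft(lyrics, True)
```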
---
Sure, beets does handle lots of large collections. But I'd hazard a guess that it's a minority of one's library that will fail to match, meaning we wouldn't have to keep track of all 50k songs in such a set. It's worth measuring, at least! 🚀
---
Okay, let's see here: let's bump that up to 100k songs, just for fun. Let's use this benchmarking script:

```python
#!/usr/bin/python3

import datetime
import logging
import os
import random
import string
import sys

import humanize
import psutil
from pybloom import ScalableBloomFilter, BloomFilter
from pybloomfilter import BloomFilter as BloomFilterMmap


class Timer(object):
    """this class is to track time and resources passed"""

    def __init__(self):
        """initialize the timestamp"""
        self.stamp = datetime.datetime.now()

    def times(self):
        """return a string describing CPU usage"""
        return 'user %s system %s chlduser %s chldsystem %s' % os.times()[:4]

    def rss(self):
        process = psutil.Process(os.getpid())
        return process.memory_info().rss

    def memory(self):
        return 'RSS %s' % humanize.naturalsize(self.rss())

    def diff(self):
        """a timedelta between the creation of the object and now"""
        return datetime.datetime.now() - self.stamp

    def __str__(self):
        """return a string representing the time passed and resources used"""
        return 'elapsed: %s (%s %s)' % (str(self.diff()),
                                        self.times(),
                                        self.memory())


def randomwords(n, length=10):
    # string.ascii_lowercase works on both Python 2 and 3
    # (string.lowercase is Python 2 only)
    for _ in range(n):
        yield ''.join(random.choice(string.ascii_lowercase)
                      for _ in range(length))


def load_set(n):
    s = set()
    for word in randomwords(n):
        s.add(word)
    return s


def load_bloom(n):
    bf = ScalableBloomFilter()
    for word in randomwords(n):
        bf.add(word)
    return bf


def BloomFilterMmapWrapper():
    # capacity, error rate, backing file
    return BloomFilterMmap(100000, 0.1, 'filter.bloom')


def BloomFilterWrapper():
    return BloomFilter(capacity=100000, error_rate=0.1)


def run_tests(n):
    logging.basicConfig(level=logging.DEBUG, format='%(message)s')
    for tool in (set,
                 BloomFilterMmapWrapper,
                 BloomFilterWrapper,
                 ScalableBloomFilter):
        timer = Timer()
        s = tool()
        for word in randomwords(n):
            s.add(word)
        logging.info('%r loaded %d entries: %s', tool, n, timer)
        timer = Timer()
        for word in randomwords(n):
            if word in s:
                pass
        logging.info('%r %d lookups: %s', tool, n, timer)


if __name__ == '__main__':
    run_tests(int(sys.argv[1]))
```

And let's run it:

```
$ python set-tests.py 100000
<type 'set'> loaded 100000 entries: elapsed: 0:00:00.680757 (user 0.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 26.5 MB)
<type 'set'> 100000 lookups: elapsed: 0:00:00.664798 (user 1.39 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 27.1 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> loaded 100000 entries: elapsed: 0:00:00.670700 (user 2.06 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterMmapWrapper at 0x7f3763fa1848> 100000 lookups: elapsed: 0:00:00.657779 (user 2.72 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> loaded 100000 entries: elapsed: 0:00:01.274390 (user 3.98 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<function BloomFilterWrapper at 0x7f3763fa18c0> 100000 lookups: elapsed: 0:00:01.093647 (user 5.07 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.0 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 100000 entries: elapsed: 0:00:07.461861 (user 12.52 system 0.02 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 100000 lookups: elapsed: 0:00:07.427330 (user 19.91 system 0.03 chlduser 0.0 chldsystem 0.0 RSS 19.3 MB)
```

Observations: if we go up one more order of magnitude, to 1 million songs, results are roughly similar for most tools, except the ScalableBloomFilter, which explodes to 87 seconds of execution time (instead of 6 seconds for […]). Anyways. All this to show that […]
---
Results with 100-character words and 1M entries:

```
$ python set-tests.py 1000000
<type 'set'> loaded 1000000 entries: elapsed: 0:00:56.268277 (user 56.18 system 0.08 chlduser 0.0 chldsystem 0.0 RSS 219.3 MB)
<type 'set'> 1000000 lookups: elapsed: 0:00:55.556960 (user 111.71 system 0.09 chlduser 0.0 chldsystem 0.0 RSS 227.4 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> loaded 1000000 entries: elapsed: 0:00:56.018806 (user 167.63 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterMmapWrapper at 0x7fd285180848> 1000000 lookups: elapsed: 0:00:55.751069 (user 223.31 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.9 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> loaded 1000000 entries: elapsed: 0:01:02.873864 (user 286.16 system 0.11 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<function BloomFilterWrapper at 0x7fd2851808c0> 1000000 lookups: elapsed: 0:01:01.404508 (user 347.55 system 0.12 chlduser 0.0 chldsystem 0.0 RSS 48.8 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> loaded 1000000 entries: elapsed: 0:02:33.855005 (user 501.24 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
<class 'pybloom.pybloom.ScalableBloomFilter'> 1000000 lookups: elapsed: 0:02:30.389974 (user 651.39 system 0.14 chlduser 0.0 chldsystem 0.0 RSS 52.6 MB)
```
---
Awesome! I’m a sucker for quantitative measurement. I would never have guessed that the scalable Bloom filter implementation would be so slow, for example… it’s hard to think of a circumstance where that would be useful. Anyway, sounds like a normal cache with […]
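Presumably the elided suggestion amounts to a plain `set` of failed searches. A minimal sketch of that idea (the `LyricsFetcher` class and its constructor are hypothetical stand-ins for the plugin, not the real beets API):

```python
class LyricsFetcher(object):
    """Hypothetical stand-in for the lyrics plugin, caching failures."""

    def __init__(self, backends, log):
        self.backends = backends
        self._log = log
        # A plain set gives exact membership (no false positives), and
        # per the benchmark above its lookups are as fast as any filter.
        self.failed = set()

    def get_lyrics(self, artist, title):
        key = (artist, title)
        if key in self.failed:
            return  # already failed once this run; skip the network
        for backend in self.backends:
            lyrics = backend.fetch(artist, title)
            if lyrics:
                return lyrics
        self.failed.add(key)  # remember our failure
```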
---
I thought it could be a good idea to optimize lyrics searches. Here, when I want to fetch all the lyrics in my collection, the first results are like this: […]
A few things come to mind here:

- Network access is expensive, way more than disk or memory. We should limit it as much as possible, if only to avoid being blocked at the provider (e.g. #2632).
- While we could try to tweak our searches to avoid certain patterns, like empty artists or titles, it would seem even better to use a Bloom filter for this purpose. It would delegate to the backends the responsibility of filtering out ridiculous requests, but we would at least send them only once. And since the resulting search for items A and B may be the same, we would also avoid sending duplicate requests to the backend (see the sketch below).
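As a sketch of what that pre-filtering might look like (the `search_key` and `should_search` helpers are hypothetical, not existing beets functions): normalize each pair once, refuse obviously empty metadata, and deduplicate identical searches across items:

```python
seen = set()  # searches already sent this run


def search_key(artist, title):
    """Normalize an (artist, title) pair into a cache key, or None
    if the metadata is too empty to be worth sending to a backend."""
    artist = (artist or '').strip().lower()
    title = (title or '').strip().lower()
    if not artist or not title:
        return None  # empty metadata: never query a backend for this
    return (artist, title)


def should_search(artist, title):
    """Return True only for the first occurrence of a non-empty pair."""
    key = search_key(artist, title)
    if key is None or key in seen:
        return False  # ridiculous or duplicate request: skip it
    seen.add(key)
    return True
```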
Opinions?