Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fallback for item access to album's attributes #2988

Open
wants to merge 8 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@FichteFoll
Copy link
Contributor

commented Jul 20, 2018

Allows queries (especially for pathspecs) based on an album's flexattrs
while operating on items.

Fixes #2797.

I'll leave formulating the changelog up to you.

@sampsyo
Copy link
Member

left a comment

Thank you for diving into this! This is a tricky issue, despite being a relatively small diff, and I appreciate your patience while we think through it.

I've left one minor comment inline. I also have a few general discussion points, from minor to major:

  1. We have an existing mechanism for formatting album-level data for items. This would need to be removed if we move the "merging" behavior lower in the abstraction hierarchy.
  2. You asked about caching issues. In our current system, Item and Album objects are not "live," in that changing them through one reference does not magically change their values in another context. They represent a sort of snapshot of the database, which is why they have a load method for updating values from the database. In that sense, caching the associated album might be even more reasonable than reloading the values every time.
  3. This kind of change would completely hide item-level data, and eliminate the possibility for items to differ from each other on album-level fields. This is probably OK, but we have to consider the consequences. For example, in beets currently, it's possible to "work around" album-level fields you don't like and make some values different on a per-track basis. For example, if you decide you want country to vary across the tracks in a certain album, you can accomplish that if you really want to. With this change, however, those differences would be silently hidden. Again, this is probably OK, but we'd need to be clear about what's changing.
return self[key]
else:
return default

This comment has been minimized.

Copy link
@sampsyo

sampsyo Jul 22, 2018

Member

This is really just a matter of aesthetics, but one alternative would be to have get just do the necessary exception handling. That is, __getitem__ would continue to contain all the "real" logic and include the raise at the end. Then, get would change to contain:

try:
    return self[key]
except KeyError:
    return default

That would help keep the docstring for get sensible, for example.

This comment has been minimized.

Copy link
@FichteFoll

FichteFoll Jul 22, 2018

Author Contributor

It might have been a bit of premature optimization, but afaik exception handling in Python isn't exactly performant (creation of exception object, traceback, caching of frames etc.) and combining the code paths was pretty trivial. I also liked that I could use this to have a short implementation in Item.get.

The docstring issue doesn't bother me personally because it properly describes what get can do. It does have a different function signature than dict.get, which we need for the with_album parameter anyway, but it's still compatible. I think I prefer Optionally over Alternatively, however.

I would have made those parameters keyword-only, but that is only possible in Python 3.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2018

  1. Not sure what you mean by this. Formatting works exactly like it did before. However, I missed that self.model_keys is fetched from model.keys(True), which includes the album fallback (and results in the item's formatter for a field being requested rather than the album's). I'll override self.model_keys in FormattedItemMapping's __init__.

  2. I see. So, I could keep an Item-internal reference to an album and just run load on that before trying to access album attributes (or keys()). Is load smart enough not to update itself when there haven't been any changes to the database since the last fetch? If it doesn't, then this change would be more involved and requires some care to actually make the caching useful. (I was thinking about storing the same album reference in multiple items for example, but I'd need to call load for each item individually anyway because we can't be sure the database hasn't been updated in the meantime.) Otherwise, just keeping a lazy-loaded album attribute per item individually would probably already be an improvement.

  3. I'm not sure I understand the problem you're describing because, as far as I'm aware, all of this is still possible. Attribute access on the item itself is prioritized over the album fallback, for their standard fields and even for flexattrs. _setitem was not modified, so you can also still set an item's field or flexattr to override an album's.

@sampsyo

This comment has been minimized.

Copy link
Member

commented Jul 22, 2018

Not sure what you mean by this. Formatting works exactly like it did before. However, I missed that self.model_keys is fetched from model.keys(True), which includes the album fallback (and results in the item's formatter for a field being requested rather than the album's). I'll override self.model_keys in FormattedItemMapping's init.

Thanks for catching that!

What I'm worried about is not a direct conflict or anything—just that we're implementing the same logic ("fallback" between item and album attributes) twice. If evaluating the expression item.field already looks up field in item's album, then ideally we would not need FormattedItemMapping—the plain old FormattedMapping from dbcore would do the trick.

But as you discovered, there's subtlety about which formatter gets used. Maybe there's an elegant way to provide a merged view without the duplication, but maybe this division of responsibilities is OK.

I see. So, I could keep an Item-internal reference to an album and just run load on that before trying to access album attributes (or keys()). Is load smart enough not to update itself when there haven't been any changes to the database since the last fetch? If it doesn't, then this change would be more involved and requires some care to actually make the caching useful. (I was thinking about storing the same album reference in multiple items for example, but I'd need to call load for each item individually anyway because we can't be sure the database hasn't been updated in the meantime.) Otherwise, just keeping a lazy-loaded album attribute per item individually would probably already be an improvement.

No, load always loads the latest data. (Otherwise, we'd need some mechanism on the side for tracking when the database has changed—which likely would be no faster to check than just loading from the database.)

I'm not sure I understand the problem you're describing because, as far as I'm aware, all of this is still possible. Attribute access on the item itself is prioritized over the album fallback, for their standard fields and even for flexattrs. _setitem was not modified, so you can also still set an item's field or flexattr to override an album's.

OK, good point! I had missed that existing values on items take precedence. That means, unless I'm mistaken, that item-level fixed attributes always take precedence—because it's impossible to remove them. Sounds good!

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Jul 22, 2018

just that we're implementing the same logic ("fallback" between item and album attributes) twice

Yes, I noticed that too, but I believe we still need the formatter for the reasons you mentioned (and also because it performs other tasks such as alias mapping).

So, I tried implementing a lazy-loaded and cached album property for internal use, next to get_album, but I quickly realized that this doesn't really improve things a lot and it certainly hurts readability in a way that it makes code complex. You'd think you can turn a couple album = self.get_album(); if album: … into just if self.album:, but the property getter still needs to call load on every access. In the end, it comes down to a load call vs a get_album call, and with the former you'd end up having to juggle an internal-only album object for yet another way to access an item's album.

@sampsyo

This comment has been minimized.

Copy link
Member

commented Jul 23, 2018

So, I tried implementing a lazy-loaded and cached album property for internal use, next to get_album, but I quickly realized that this doesn't really improve things a lot and it certainly hurts readability in a way that it makes code complex. You'd think you can turn a couple album = self.get_album(); if album: … into just if self.album:, but the property getter still needs to call load on every access. In the end, it comes down to a load call vs a get_album call, and with the former you'd end up having to juggle an internal-only album object for yet another way to access an item's album.

I see—thanks for giving it a try, and I can see how that would be less than ideal.

Anyway, this is shaping up well! It might be a good idea to run a few simple performance tests to make sure we aren't doing something terrible to the time required to run beet list, for example.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2018

~ λ hyperfine "beet list" -m 2
Benchmark #1: beet list
  Time (mean ± σ):      7.207 s ±  0.018 s    [User: 5.438 s, System: 0.759 s]
  Range (min … max):    7.194 s …  7.220 s

~/code/beets ∃ hyperfine "python -m beets list" -m 2
Benchmark #1: python -m beets list
  Time (mean ± σ):     17.757 s ±  0.093 s    [User: 12.581 s, System: 2.135 s]
  Range (min … max):   17.691 s … 17.823 s

Well, not looking so bright. It's a >100% slowdown. This'd need some smart caching, probably. I do wonder why the difference is so high, though. I mean, the ItemFormatter needed to access the item's album before as well. Maybe keys is run more often than I expected?

Also, I should probably add some documentation about this change.

Benchmark tool: https://github.com/sharkdp/hyperfine

@sampsyo

This comment has been minimized.

Copy link
Member

commented Jul 26, 2018

Hmm, that is a little worrisome. Let's dig a little deeper and see if we can't mitigate some of the effects. (Thanks for the tip about hyperfine, btw!)

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Jul 26, 2018

(I just found out about hyperfine today as I browsed the fd Readme by the same author.)

This is probably the point where I would start to look into profiling as I'm still not too familiar with the code base and believe this would provide a good starting point. Have you ever done this in python and have some recommendations for tools or other tips? (I haven't.)

@sampsyo

This comment has been minimized.

Copy link
Member

commented Jul 26, 2018

I think that cProfile, in the standard library, is still probably the best profiler out there. One tip I do have, however, is that SnakeViz is a really nice browser-based GUI for viewing/navigating profile data.

@FichteFoll FichteFoll force-pushed the FichteForks:pr/item-album-fallback branch 2 times, most recently from 13c868f to b6bf829 Sep 14, 2018

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2018

Took a look back at this. I used py-spy for some quick effort-less profiling and it was quite obvious that the majority of the time is being spent with database access in get_album (or rather the album property, as I changed it). Just uncommenting the album.load() code removes the entire performance impact, but it also means the albums we're trying to print could be outdated.

I considered the simplest solution forward to be what I suggested earlier:

  1. Make the item cache its album field and provide access through a property. The album returned by this property is read only since it is, well, cached. I decided against preventive measures here and instead made the property "hidden" with an underscore and provided documentation.
  2. Only load database model objects when they have changed by tracking a revision number that I added to the database and increase on each mutating transaction. Had to tweak this for a little while until it passed all tests, but I suppose this is fairly safe going forward now. I added a comment clarifying on the possibility of race conditions, but as long as the _db_lock is aquired, we are fine.

Let me know what you think.

Also, I wasn't sure if I should add a section regarding the API to the changelog. It would mention the fallback of item access on Item and that re-loading is now lazy, although the latter should be transparent.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Sep 14, 2018

Benchmarks: first is with this PR, second is 1.4.7.

~/code/beets ¬ hyperfine "python -m beets list" -m 20 && hyperfine "beet list" -m 20
Benchmark #1: python -m beets list

  Time (mean ± σ):      6.188 s ±  0.216 s    [User: 4.582 s, System: 0.761 s]

  Range (min … max):    6.005 s …  6.948 s

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #1: beet list

  Time (mean ± σ):      6.036 s ±  0.044 s    [User: 4.455 s, System: 0.749 s]

  Range (min … max):    5.956 s …  6.115 s

FichteFoll added a commit to FichteFoll/dotfiles that referenced this pull request Sep 15, 2018

[beets] Update beets path config
Utilize album fields for special formatting of doujin releases.
Requires a currently unmerged PR to beets.

beetbox/beets#2988
@sampsyo

This comment has been minimized.

Copy link
Member

commented Sep 16, 2018

Wow! This is extremely cool! Very nice work!

You’re right that the “revision” trick is fragile, since it requires us to intervene on all updates of the database and only avoids races because we implement our own internal global lock, but this seems like a great trade-off for the performance win it affords. This seems worth doing independent of the new query behavior you’ve introduced here.

As an aside, we’re already doing something similar for memozing %aunique strings, which are otherwise very expensive to recompute:
https://github.com/beetbox/beets/blob/b6bf82933ed3e8233e200079c92e2577a5ad5040/beets/library.py

Perhaps we should move this mechanism to reuse the revision mechanism (to detect when it’s time to invalidate the memoization table).

I do like the idea of adding a note about the API change “for developers” in the changelog. The new fallback behavior is worth documenting, even if _cached_album itself isn’t.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Sep 17, 2018

I see. Yes, that would probably be useful for the aunique feature and might even warrant a proper implementation (i.e. a "public API"), but I'd rather not do this in here.

Regarding the changelog, I can do that. I deliberately made _cached_album an internal property (with the underscore) because I wasn't confident in exposing it. It is, after all, kind of a workaround, although it's the best I could think of. But as long as it's private, it can be changed.

By the way, as you can see from the commit that references this PR, I started using this in production and haven't encountered any issues so far. I probably could have done that before as well, as I wasn't too concerned about the performance, but for this PR it was a must.

@sampsyo

This comment has been minimized.

Copy link
Member

commented Sep 17, 2018

OK, great. Since it’s a sensitive change, it might be wise to put out a call for testing so folks can try it out with funky configurations. I’ll post something to Discourse.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Sep 17, 2018

Before you check: The CI failure is unrelated to the PR and caused by some error with curl when trying to download the Python 3.4 image for flake8 checking.

@sampsyo

This comment has been minimized.

Copy link
Member

commented Sep 18, 2018

(Thanks. I restarted that Travis job and everything’s fine.)

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Oct 30, 2018

Any updates on this? Doesn't seem like the discourse thread attracted much attention.

FichteFoll added some commits Jul 20, 2018

Add fallback for item access to album's attributes
Allows queries (especially for pathspecs) based on an album's flexattrs
while operating on items.

Fixes #2797.
Override FormattedItemMapping.model_keys
model_keys was inferred from `self.keys(True)`, which would include the
fallback album keys. Since FormattedItemMapping has its own algorithm
for album attributes, we only care about the item's *actual* keys.
Cache an item's album with a property
Use Album.load() whenever the album is requested, which causes it to be
reloaded from the database.

Drawback: This adds a slowdown of 100% (6.2s to 12.6s) to `beet list`
on my setup.
Implement database and model revision checking
Prevents reloading a model from the database when it hasn't changed.

Now we're back to almost the same speed as before the addition of album
field fallbacks.

@FichteFoll FichteFoll force-pushed the FichteForks:pr/item-album-fallback branch from db9103d to 29acdd6 Nov 16, 2018

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Nov 16, 2018

Rebased to fix the merge conflict on the changelog. AppVeyor has some errors in the setup phase with chocolatey.

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2019

Still nobody using this, it seems. 😞

Let me know when you intend to merge this, so I only need to fix the changelog conflict once (or you do it 🤷‍♂ ).

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented Mar 25, 2019

Someone was asking for this a few days ago on IRC, but I missed them and couldn't point towards this PR.

Anyway, I've been using this branch for half a year now with exactly 0 issues so far. I don't use the entire feature set of beets, but importing and path styles based on album flexattrs, which is my primary use case, are just fine.

I'll try to remember making a new speed comparison since my library grew a bit over time, but I don't expect it to be much different compared to the last time.

@kergoth

This comment has been minimized.

Copy link

commented May 14, 2019

FYI, I ran into a couple of issues with this, mostly relating to types in the fallback, both in path format queries and in beet ls. See https://discourse.beets.io/t/ranges-not-working-in-beet-ls-with-album-fields-in-item-track-context/

@arcresu

This comment has been minimized.

Copy link
Member

commented May 14, 2019

I wasn't aware of this when I threw together the diff on the discourse thread @kergoth mentioned. I'll just reproduce it here:

diff --git a/beets/library.py b/beets/library.py
index 16db1e97..71b6db22 100644
--- a/beets/library.py
+++ b/beets/library.py
@@ -526,7 +526,17 @@ class Item(LibModel):
 
     @classmethod
     def _getters(cls):
-        getters = plugins.item_field_getters()
+        def atoi(f, ag):
+            def ig(i):
+                a = i.get_album()
+                if a:
+                    return ag(a)
+                else:
+                    return cls._type(f).null
+            return ig
+        getters = {f: atoi(f, g)
+                   for f, g in plugins.album_field_getters().items()}
+        getters.update(plugins.item_field_getters())
         getters['singleton'] = lambda i: i.album_id is None
         getters['filesize'] = Item.try_filesize  # In bytes.
         return getters
diff --git a/beets/ui/__init__.py b/beets/ui/__init__.py
index 327db6b0..c3adc72d 100644
--- a/beets/ui/__init__.py
+++ b/beets/ui/__init__.py
@@ -1145,7 +1145,10 @@ def _setup(options, lib=None):
         plugins.send("library_opened", lib=lib)
 
     # Add types and queries defined by plugins.
-    library.Item._types.update(plugins.types(library.Item))
+    at = plugins.types(library.Album)
+    at.update(library.Item._types)
+    at.update(plugins.types(library.Item))
+    library.Item._types = at
     library.Album._types.update(plugins.types(library.Album))
     library.Item._queries.update(plugins.named_queries(library.Item))
     library.Album._queries.update(plugins.named_queries(library.Album))

This wasn't intended to be a final implementation, but my approach was a little bit different in that I thought the album-item relationship was something beets-specific and therefore should be reflected in library.py rather than dbcore. I used the existing getter mechanism. The atoi function takes an album-level getter and converts it into an item-level one that fetches the item's album and delegates to the original getter. Item-level properties still have precedence, as in this PR.

I did find that it was necessary to also change Item._types in order to get queries to work as intended since otherwise the album-level fields don't have type information when accessed on Items.

Note that we recently picked up a helper for memoisation in another PR:

beets/beets/util/__init__.py

Lines 1037 to 1057 in 909fd1e

def lazy_property(func):
"""A decorator that creates a lazily evaluated property. On first access,
the property is assigned the return value of `func`. This first value is
stored, so that future accesses do not have to evaluate `func` again.
This behaviour is useful when `func` is expensive to evaluate, and it is
not certain that the result will be needed.
"""
field_name = '_' + func.__name__
@property
@functools.wraps(func)
def wrapper(self):
if hasattr(self, field_name):
return getattr(self, field_name)
value = func(self)
setattr(self, field_name, value)
return value
return wrapper

@FichteFoll

This comment has been minimized.

Copy link
Contributor Author

commented May 19, 2019

Thanks for the headsup. I suspect that the problem with ranges is related to me not updating the items' type information, as you did in your diff. I was entirely new to the code base before working on this, so I just never considered that to be relevant.

The lazy_property is similar so something I drafted earlier in the process but ended up scraping because of what I outlined in an earlier comment (#2988 (comment)). The problem here is that the cached album is a snapshot of the database at whatever time it was first accessed, but the db may change during runtime and the lazy property will have no way to consider that fact.

I'll take a closer look at your getter approach when I find some time to work on this again.

(I'd like to mention that I cannot use beets without this feature anymore, so even if there is a huge update going on, I'll continue using my fork until I updated the PR for the changes.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.