Labels plugin #101

Closed
wants to merge 6 commits into
from

Projects

None yet

3 participants

@steinitzu
Contributor

This plugin lets you attach arbitrary text labels to your music and use them to filter the library.

Adds a new command labels which either filters by labels or attaches them to items (or albums) matching a query.

Labels are stored in a pretty straight forward many-to-many tagging system in the database using three new tables: labels which contains the label strings + join tables labels_items and labels_albums.

Just has the most basic features for now, but I thought this might be a nice way to generate playlists and other things.

@sampsyo
Member
sampsyo commented Feb 26, 2013

This is pretty awesome, @steinitzu. Thanks for the idea and creative implementation.

We're actually in the middle of implementing functionality that's very similar to this for a slightly more general purpose. I've been calling the feature "flexible attributes" (although "labels" might be a better name). The idea is very similar to what you have implemented here except that labels are key/value instead of just strings. That is, plugins can add arbitrary fields to items and albums that have a single, defined interface. This would allow us to accomplish what you have here as well as some other important features needed by other plugins.

The initial implementation I've been hacking on is here: https://github.com/sampsyo/beets/compare/flexattr
although it needs a lot of work to be complete.

In short, I'd like to incorporate the schema changes you've done in this plugin into beets core, along with an extension to do full key/value pairs. Then we can change the implementation of this plugin to use that functionality (ignoring the values), thereby simplifying it and opening up opportunities for other plugins.

The main substantive difference with my implementation is that it's a denormalized schema, which will have somewhat higher space overhead but (since I imagine these attributes being small) that shouldn't be a huge problem.

Any chance I can ask for your help in making this a core beets feature that any plugin can use?

(Ping @pedros, who's also interested in hacking on this.)

@steinitzu
Contributor

Yeah, I like that idea, @sampsyo . Would pave the way for all kinds of hackery.

I think labels would need some way to distinguish between labels and attributes from other plugins, they wouldn't really work as actual Item attributes.
Could probably be solved with just one extra table along with your item_attributes, just with item_attributes ids to say "this attr is a label".

Might be cleaner though if item_attributes was normalized, then attrs could be joined with items in any number of ways, and there wouldn't be any conflict on either side, like:

attributes <- attributes_items -> items
attributes <- labels_items -> items
attributes <- attributes_albums -> albums 
 ..... etc...

I may be missing something though, I'll take a better look when I get home.

I'm definitely willing to help with what I can.

@sampsyo
Member
sampsyo commented Feb 27, 2013

Interesting point. I totally agree that it makes sense to avoid conflicts between different plugins' uses of the same name. But, ideally, it would be great to do this without introducing a proliferation of extra tables for each plugin.

In an email conversation with @psilva, we discussed using namespacing to address this. That is, the attributes table would have an additional column and, when looking up attributes, a combination of a namespace and a key is used. The interface on the Python side would looks something like this:

item['labels']['foo'] = True

or maybe:

item.set_flexible_attribute('labels', 'foo', True)

but with a better name.

This, of course, is orthogonal to whether we go normalized or denormalized—I'm still not 100% sure which is the best choice there. (Denormalized might simplify the code; normalized might have space/performance benefits.)

@pedros
Collaborator
pedros commented Feb 27, 2013

Adrian Sampson notifications@github.com writes:

This, of course, is orthogonal to whether we go normalized or
denormalized—I'm still not 100% sure which is the best choice
there. (Denormalized might simplify the code; normalized might have
space/performance benefits.)

The problem with the normalized approach is that it would require, I
think, separate tables for each user of the service (ie plugins), since
then columns would vary across plugin.

A somewhat workable way around the denormalized table is to use a view,
like so:

CREATE VIEW unified_items AS
    SELECT * FROM items
    JOIN item_attributes AS ia
    ON items.id == ia.entity_id
    GROUP BY items.id;

However, the above has at least three problems:

  1. SELECT * FROM unified_items; now returns multiple rows per item,
    and I don't think the existing code accounts for this
  2. We'd therefore need to do something like:

CREATE VIEW unified_items AS
SELECT GROUP_CONCAT(ia.key),
GROUP_CONCAT(ia.value),
items.all_other_fields FROM items
JOIN item_attributes AS ia
ON items.id == ia.entity_id
GROUP BY items.id;

and then parse the concatenated strings in python
3. In SQLite, INSERT, DELETE, and UPDATE operations on a view do
not reflect back on the originating tables, and instead require a
trigger to do their work.

Pedro

@steinitzu
Contributor

The namespaces would be great. I can see how separate join tables per plugin could get really ugly really fast.
This would also pretty much defeat the purpose of a normalized schema, since wouldn't make any sense to keep a namespace column with the attributes AND normalize it.

There are the space benefits and (possibly) some performance benefits of normalizing, but I would think you'd need a colossal library for that to even begin to be noticeable.

@pedros : a LEFT join in that view should also return items which don't have any special attributes so it could be used anywhere where items would normally be queried. Shouldn't cause too much of a performance overhead.

I think the concatenated strings method looks fine. I can't think of an alternative solution that doesn't involve something along the lines of executing a second query for each retrieved item.

But making where clauses over these strings would be awful.
Would probably go with something like this for select queries (or another join):

SELECT * FROM unified_items 
WHERE id IN (
    SELECT entity_id FROM item_attributes
        WHERE (key = 'someattr' AND value = 'somevalue')
            OR (key= 'someotherattr' AND VALUE = 'someothervalue')
            AND namespace = 'someplugin') ;

Could make a great big Query class which searches both the flexattrs and regular fields that way OR integrate it into existing classes.

@sampsyo
Member
sampsyo commented Feb 28, 2013

Thanks for all the thoughts. It sounds like we're converging on a single-table, denormalized approach with namespaces.

I like the string concatenation idea, even if it seems a little bit hacky. And that join for WHERE clauses makes a lot of sense.

I'll keep hacking on this branch. If either of you are interested, I can make you a committer and you can try out some ideas in this repo too.

@pedros
Collaborator
pedros commented Mar 6, 2013

@sampsyo: sorry for the delay; I'm trying to finish up a project this week, let me get back to you early next week about this stuff.

@sampsyo
Member
sampsyo commented Mar 6, 2013

Awesome. No rush at all here, of course. I'll plan on getting 1.1.0 out the door relatively quickly so this can get the attention it deserves for 1.2.0.

See also @steinitzu's last few commits here: https://github.com/steinitzu/beets/commits/flexattr

@steinitzu
Contributor

I put up a pull request with the current progress so you can guys can tinker with it too.
Can probably get back to it in a few days.

@sampsyo
Member
sampsyo commented Mar 9, 2013

I just pushed a few more experiments to the flexattr branch. The main upshot of this version is that no registration is necessary—namespaces and keys are completely arbitrary.

There are also now two different ways to get and set flexattrs. The first way is not too different from @steinitzu's version: item.flexattrs['namespace']['key']. The second makes it somewhat friendlier to use these from the command line: item.__getattr__('namespace-key') (and __setattr__ also works). The hyphen indicates a flexattr; no other attributes should have hyphens in them.

So this now works:

$ beet mod foo-bar=baz
$ beet ls -f '${foo-bar}'
baz
baz
baz
...
@steinitzu
Contributor

Nice!
One problem though. That version of ResultIterator can get incredibly slow if your beets db is on a network share. Each select query adds up to 3ms of latency on my setup.

There might be some way to get consistent group_concat ordering, like subselects mentioned here.
Though I did some tests of this nature and I haven't been able to make it use anything but the insert order. It ignores any ORDER BY clauses but that wouldn't be a problem in this case.

@sampsyo
Member
sampsyo commented Mar 10, 2013

Interesting; that subselect from SO seems like it might be worth pursuing. (I'm somewhat reluctant to rely on undocumented/observed behavior of SQLite since it might change in any version.) We could also get by with a left join and smart grouping in the ResultIterator.

Out of curiosity, what's your use case for putting the DB itself on a network share? I've been toying with ideas about baking in network communication to a remote DB that would, ideally, avoid problems like this. I also hear (anecdotally) that SQLite can run into trouble when accessed over something like NFS.

@steinitzu
Contributor

I'm somewhat reluctant to rely on undocumented/observed behavior of SQLite since it might change in any version.

True, it's iffy at best. Depends how its implemented I guess, not sure whether the ordering is decided on a query level or is "random" for each group_concat in a query. If all the concats in the query can be trusted to have the same order it would be fine regardless of what that order is.
The subselects supposedely work due to some implementation accident, so it may not be the most reliable solution.

We could also get by with a left join and smart grouping in the ResultIterator.

You mean something like the unified_items view without group_concat? Then let ResultIterator merge the duplicate rows into one. Could work.

Or possibly select all the attributes whenever items are retrieved and stick them in a {entity_id : attributes} dict for ResultIterator to pick from. That might be a little faster with large resultsets since the speed of cursor.fetchx seems to be greatly affected by the result's column count, which could become noticable when the item resultset is multiplied.
Example with an unrealistically large resultset (~30.000 rows):

def testfetch(query):
    conn = sqlite3.connect('/tmp/dbeetlib.db')
    cur = conn.cursor()
    cur.execute(query)
    all = cur.fetchall()

In [57]: %timeit testfetch('SELECT id, title, album, artist FROM items')                            
10 loops, best of 3: 67 ms per loop

In [58]: %timeit testfetch('SELECT * FROM items')
1 loops, best of 3: 417 ms per loop    

Out of curiosity, what's your use case for putting the DB itself on a network share?

I import and mess with the database from different devices sometimes, so I just keep it there with the music. I share the lib with my brother who also imports things every now and then. I could just go through ssh or something, but I find this more convenient.
Haven't had any major issues with it (samba, not nfs) but I've heard about some locking weirdness and such. Mostly just the performance overhead, though I've run into some oddly slow queries on local databases too. (also absolute paths in the database which is an issue I was gonna look into at some point).

A network DB could be doable. Shouldn't be too hard to implement with the isolated Transaction class and sqlite having a fairly standard syntax.
I haven't felt a great need for it myself (I'd use it if available) but I can see it being useful for a larger library shared between more people/processes.

@sampsyo
Member
sampsyo commented Mar 10, 2013

To add one more to the torrent of possibilities for this query: here's something somewhat messy that I experimented with:

SELECT *,
GROUP_CONCAT(item_attributes.namespace || "/" ||
             item_attributes.key || "/" ||
             item_attributes.value)
    AS flexattrs
FROM items LEFT JOIN item_attributes
ON items.id = item_attributes.entity_id
GROUP BY items.id;

Then the "flexattrs" result column comes back as "ns/key/value,ns/key/value" and can be safely split back out (as long as we can forbid / and , characters, which is somewhat iffy for values!).

@steinitzu
Contributor

Cool hack.

Who says delimiters have to be single characters?
We could just use some stupid strings like: (^ o ^)/

On Sun, Mar 10, 2013 at 8:36 PM, Adrian Sampson notifications@github.comwrote:

To add one more to the torrent of possibilities for this query: here's
something somewhat messy that I experimented with:

SELECT *,
GROUP_CONCAT(item_attributes.namespace || "/" ||
item_attributes.key || "/" ||
item_attributes.value)
AS flexattrs
FROM items LEFT JOIN item_attributes
ON items.id = item_attributes.entity_id
GROUP BY items.id;

Then the "flexattrs" result column comes back as
"ns/key/value,ns/key/value" and can be safely split back out (as long as we
can forbid / and , characters, which is somewhat iffy for values!).


Reply to this email directly or view it on GitHubhttps://github.com/sampsyo/beets/pull/101#issuecomment-14688700
.

@steinitzu
Contributor

@sampsyo: any progress?

If you don't have anything unpushed, I'd like to do some hacking. :)

Given that your group_concat method works, we could go back to a unified_x view along with respective parsing code ,re-implement my previous mods to CollectionQuery (in a less hacky way) for searching the flexattrs.
,keep the hyphen syntax.
,keep the un-registered fields.

Yes?

@sampsyo
Member
sampsyo commented Mar 16, 2013

I've just pushed one outstanding change and merged with changes from master, so hack away.

Yes, let's go with the string-concatention business. However, if possible, I'd slightly prefer to avoid using a view in SQLite. This is just for future code readability/maintainability—it will be easier to change how we deal with item queries in the future if it only appears in one place (without leaving a permanent change in the user's database). To facilitate this, I just pushed a change that eliminates all but one place that we do a SELECT from the items table. (This only happens in the Library.items() method now.) So it should be feasible now to just do the big, hairy join in that method and handle the result in ResultIterator without making any schema changes.

Does that make sense? (I like views in principle, but I think this will just make for better long-term code maintainability.)

And yes, we have some hacking to do on how to make queries work with this. I'm imagining the relevant change will have to go in AnyFieldQuery. Maybe we'll need a new base class too (FlexibleAttributeQuery?)—not sure about that.

@steinitzu
Contributor

I can see how the view could cause some confusion. We could though make the view writable using triggers, and even call it items and rename the table to _items to nudge everyone towards the view.
It's not crucial though, and adding a view later is probably easier than removing it if it turns out to be a bad decision.

One issue is the column ambiguity when joining. the id column needs to be prefixed with item. in WHERE and ORDER clauses so it doesn't conflict with item_attributes.id. but my dirty fix for that was just wrapping the thing in another SELECT so the query looks like:

 `SELECT * FROM (
      SELECT items.*, GROUP_CONCAT(...) FROM items LEFT JOIN ....) 
   WHERE ...;`  

I made some mockups for FlexibleAttributeQuery classes. They need to be somewhat more aware of their context because of all the subselects. They have to know whether they'll be used with an AND or OR joiner (if any) and what entity is being used to select from the right attributes table. CollectionQuery among other things also need a few more changes to fully support this.

I pushed another commit here earlier: steinitzu@7c2ed97

Gonna write some tests now and see if any of it actually works :)

@sampsyo
Member
sampsyo commented Mar 18, 2013

Interesting; I hadn't thought of that naming conflict issue—thanks for fixing that in my original SQL.

The rest of this looks good too. But maybe we should take those "-||-" and "-;;-" magic strings and turn them into constants?

@steinitzu
Contributor

Aight, got a more functional version now https://github.com/steinitzu/beets/commits/flexattr , just needs some polishing.

I was confused by the new callback system and the lack of REGEXP operator in sqlite (always thought it was built in), so I re-added it (steinitzu@2cb5383) . Could the flexattr one be made to work like the regular RegexpQuery?

@sampsyo
Member
sampsyo commented Mar 19, 2013

Seems like the right direction. Thanks!

Yes, we'll need to merge flexattrs into the new callback system. I'm not sure of the details yet, but we would ideally avoid duplication between the field version and the flexattr version of a given type of query. (That would, for example, let fuzzy queries work for both kinds "for free".)

@steinitzu
Contributor

Yeah, that would be nice.

I don't know how that would work though, cause we need that WHERE EXISTS... clause. Replacing that with a function call would require parsing the concat strings in the callback for every row, at least the way I'm seeing it.
Not sure though, callbacks hurt my brain.

@sampsyo
Member
sampsyo commented May 25, 2016

Closing this for now, since we have flexible attributes these days. It's worth revisiting the set of features, though, which should be much easier now!

@sampsyo sampsyo closed this May 25, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment