Add support for multi-valued tags #505

Open
ryanakca opened this issue Jan 25, 2014 · 63 comments

@ryanakca commented Jan 25, 2014

Please add support for multi-valued tags. For example, this could be used for the multi-valued occurrences of PERFORMER / COMPOSER / ENSEMBLE / etc. recommended by http://age.hobba.nl/audio/mirroredpages/ogg-tagging.html .
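
For context (not beets code): Vorbis comments already allow a field name to appear multiple times, and mutagen exposes repeated fields as plain Python lists. A minimal sketch, assuming a FLAC file exists at the hypothetical path song.flac:

from mutagen.flac import FLAC

# Repeated Vorbis comment fields are read and written as Python lists.
audio = FLAC('song.flac')
audio['performer'] = ['Anne-Sophie Mutter', 'Berliner Philharmoniker']
audio['ensemble'] = ['Berliner Philharmoniker']
audio.save()

print(audio['performer'])  # ['Anne-Sophie Mutter', 'Berliner Philharmoniker']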

@sampsyo (Member) commented Jan 25, 2014

Related to the genre-specific feature request in #119. See also some discussion in a half-baked pull request, #437.

This will be a reasonably large task, involving changing the interface to MediaFile to support lists, making commensurate changes to the database abstraction, and sorting out the mapping for each of the file formats we support.

@pprkut (Collaborator) commented Feb 7, 2014

MusicBrainz now supports multiple values for artist as well. While keeping the names themselves as a list might be cumbersome (you'd need to keep track of the join phrases), supporting lists for the artist IDs should fit right into this use case.
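
For illustration, a rough sketch (not beets code) of how an artist credit could be split into parallel lists, assuming the web-service JSON shape in which each credit entry carries a 'name', a 'joinphrase', and a nested 'artist' object:

def split_artist_credit(artist_credit):
    """Return (names, join_phrases, artist_ids) as parallel lists."""
    names = [credit['name'] for credit in artist_credit]
    join_phrases = [credit.get('joinphrase', '') for credit in artist_credit]
    artist_ids = [credit['artist']['id'] for credit in artist_credit]
    return names, join_phrases, artist_ids

def join_artist_credit(names, join_phrases):
    """Rebuild the single display string, e.g. 'Artist A feat. Artist B'."""
    return ''.join(name + phrase for name, phrase in zip(names, join_phrases))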

@dan-boehm commented Mar 7, 2014

Congratulations to @geigerzaehler for what was obviously quite a lot of work for PR #527. As I understand it, this laid the foundation for multi-value tags, specifically adding support for genre lists.

I am very interested in seeing this support added for the artist tags as well, and I would be willing to spend some time on this myself. I have read through #527's thread and spent a little time looking over those commits. There is obviously a lot to dig through and the information is there for me to find, but I was hoping somebody more familiar with this could point me to a few files or commits I might want to begin with.

@dan-boehm commented Mar 7, 2014

To clarify a bit about what I think @pprkut was talking about, MusicBrainz does support multiple artists in its schema. It also provides access to that through the API. Here are a few examples:

http://musicbrainz.org/ws/2/release-group/38d1077e-c270-49a8-92ca-87e7eb8d9fe4?inc=artists
http://musicbrainz.org/ws/2/release/965f06a9-5e30-4fe5-9601-2989ebdb69f5?inc=recordings+artist-credits
http://musicbrainz.org/ws/2/release-group/bdaeec2d-94f1-46b5-91f3-340ec6939c66?inc=artists
http://musicbrainz.org/ws/2/recording/724aac2d-d800-4979-a4e9-8d9287ba57fb?inc=artists

So the information is there. The only concern I'd have about whether or not to implement this is that not every player supports multi-artist tags. Perhaps the best option would be to provide a config option to either use the multi-artist tags or join them as a single string (as it works now).

@sampsyo (Member) commented Mar 7, 2014

Hi, @Boehemyth—glad you're interested! Yes, the recent work in MediaFile is the first step toward making this work more generally.

The next step—which is arguably even more challenging—is to cleanly extend the database abstraction to store multi-valued fields. All of this should be done in the beets.dbcore package, which is our generic library for representing objects stored in a SQLite database. The documentation there and in Sphinx (see the "developers" section on RTD) should provide a reasonable intro, but please let me know if you have any questions about how things currently work.

I have no idea how best to go about adding this feature set to dbcore. We should somehow allow every kind of field (fixed, flexible, and computed) to have a list value while still—for compatibility and simplicity where lists are less likely to be useful—providing a non-list version of the field. That is, it would be great if code could refer to item.artist and get a single string even if there's another route to get the full list of artists.
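
A hypothetical sketch of that dual access; none of these names are settled, and this is just the shape of the idea, not dbcore code:

class Item:
    """Toy stand-in for a library item with a list-valued artist field."""
    def __init__(self, artists):
        self.artists = list(artists)          # the full list of values

    @property
    def artist(self):
        # Single-valued view for existing code: first element or empty string.
        return self.artists[0] if self.artists else ''

item = Item(['Artist A', 'Artist B'])
print(item.artist)    # 'Artist A'
print(item.artists)   # ['Artist A', 'Artist B']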

It's also not clear what the underlying SQLite schema should be. I'm open to suggestions if anyone has ideas.

Thanks for volunteering!

@dan-boehm commented Mar 7, 2014

That is, it would be great if code could refer to item.artist and get a single string even if there's another route to get the full list of artists.

I agree. MusicBrainz returns a separator for each artist except the last one. If we were to store these separators in a new field as a list, in the same order as the artists in the artist field, we could write a function to construct the string.

It's also not clear what the underlying SQLite schema should be. I'm open to suggestions if anyone has ideas.

I'm not exactly sure how we should go about this either. I think our best bet is to store the list as a single string with a separator character between the individual elements. I think that's how some of the audio tags handle this anyhow, so maybe it makes sense to just stick with that? I'd have to look back at that PR to remember how it was decided to handle it. The icing on the cake is that we wouldn't need any changes to the current database schema.
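
As a toy sketch of that idea, assuming a separator that never appears in real values (a null byte here; real code would need to handle escaping or reject values containing it):

SEPARATOR = '\x00'

def pack_list(values):
    """Serialize a list of strings into one database cell."""
    return SEPARATOR.join(values)

def unpack_list(stored):
    """Deserialize a cell back into a list of strings."""
    return stored.split(SEPARATOR) if stored else []

cell = pack_list(['Rock', 'Blues'])
assert unpack_list(cell) == ['Rock', 'Blues']
assert unpack_list('Rock') == ['Rock']   # a single value round-trips unchanged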

It may also be possible to store entire Python data structures as BLOBs in the database. I'd have to investigate that to be sure, but I personally like the other option better anyway.

We should somehow allow every kind of field (fixed, flexible, and computed) to have a list value while still—for compatibility and simplicity where lists are less likely to be useful—providing a non-list version of the field.

The one limitation to the string idea is that it won't work for other SQLite datatypes. I don't think that will be a problem. From where I stand, there aren't any non-string fixed attributes that could benefit from a list. I'm not as sure about flexible or computed fields. I can't think of a case myself, but maybe somebody else can?

@pedros (Collaborator) commented Mar 7, 2014

From where I stand, there aren't any non-string fixed attributes that could benefit from a list. I'm not as sure about flexible or computed fields. I can't think of a case myself, but maybe somebody else can?

The EchoNest API provides the raw analysis data they use in determining their higher-level attributes like danceability, etc. For example, they do beat detection, the result of which is simply an array of sample or frame indices with confidences or probabilities for each. Of course, this is a massive amount of data that I doubt anyone would be interested in keeping inside a beets database. Anyway, it's a (far-fetched) use case for lists of floats.

@sampsyo (Member) commented Mar 7, 2014

Here are two reasons for adding list-valued fields everywhere:

  • Attachments (see #111) should be built on flexible attributes. We should allow multiple attachments of the same type, which will be much easier if we have list-valued attributes. Attachments need to be stored as SQLite BLOBs (not strings) to avoid Unicode issues.
  • Consistency. It would be really nice not to have to worry about types when storing lists.

Querying would also be complicated by packing multiple values into a string. We'd have to split each value and compare against the query rather than just pattern-matching.

@dan-boehm commented Mar 7, 2014

These are good points. Unfortunately, that means we have to do this the standard way, which means implementing new tables, which I was really hoping to avoid because it really will be a pain.

I'll use album artists as an example. We'd have to create a new table that lists the album/albumartist relationships. Then, for queries, we'd query that table by album to get all of the artists. I know it sounds pretty roundabout, but that method actually works remarkably well. I do understand what you mean now about how much work this will end up being.

The issue is that if we really want to make this work for EVERY field, things are going to get messy very quickly. There are some things we could do to help clean it up a bit. For instance, to use the artist example again, we could have an Artist table that stores relevant info about an artist. We'd still need an album-artist table and item-artist table, but at least now you only have two artist relationship tables instead of 6 (including artist_sort and mbartist_id).
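
For illustration only, the normalized layout described here might look roughly like this (hypothetical table and column names, not a proposed beets schema):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT, sort_name TEXT, mb_artistid TEXT);
CREATE TABLE album_artists (album_id INTEGER, artist_id INTEGER, position INTEGER);
CREATE TABLE item_artists (item_id INTEGER, artist_id INTEGER, position INTEGER);
""")

# All artists credited on one album, in credit order:
names = [row[0] for row in conn.execute("""
    SELECT artists.name FROM artists
    JOIN album_artists ON album_artists.artist_id = artists.id
    WHERE album_artists.album_id = ?
    ORDER BY album_artists.position
""", (1,))]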

Also, I'm not entirely sure how that would work for flexible fields. The question is whether we can assume a one-to-many relationship. For instance, in the case of Attachments, each attachment is related to exactly one Album or Item, right? If that can be assumed about flexible fields, we could probably just create a new table for each flexible field instead of adding the flexible field to the album or item table.

@dan-boehm commented Mar 7, 2014

Perhaps we could get away with not implementing lists for a few of the fixed fields that we know don't need this functionality. Dates, for instance? Or track/disc numbers?

@sampsyo (Member) commented Mar 7, 2014

The issue is that if we really want to make this work for EVERY field, things are going to get messy very quickly.

I actually don't know whether I agree with this—I think creating one-off tables for a few fields where lists are important could be messier in the long run! I'm certain that, in the future, we'll think of new fields (existing and novel) that will need to be listified, and needing to retrofit each of these independently sounds a bit like a nightmare. For maintainability, it would be awesome to have a generic facility for list-valued fields that can be reused with zero (or near-zero) additional code.

Ideally, this would work by devising an ingenious and efficient mechanism for adding lists to every field, whether we need it or not. This would make listification orthogonal to the field set (abstraction FTW!). I realize that this might be needlessly complex, though—in which case we can consider a design that can't provide lists for every field but needs only tiny code changes to support multiple values for a new field. (Stuff like dates could be excluded if that's harder for some reason, as you proposed.)

@geigerzaehler (Collaborator) commented Mar 7, 2014

I would be willing to spend some time on this myself.

That's awesome @Boehemyth. Personally, I think there is not that much work left in the MediaFile API apart from refactoring, refining, and documenting. Adding support for lists in tags was motivated by the native ability of most tagging formats to store them. We should keep out any additional logic that would be a better fit for the database models. But I noticed you have already shifted towards the database issue.

As for the discussion that’s unfolding I just would like to point out that it’s probably best to delay writing code and first put something written up on the wiki. There is already a lot of great conceptual stuff there.

It also might be time to push for a document oriented database.

@geigerzaehler (Collaborator) commented Mar 7, 2014

We also have to consider the user interface: how will we present lists to the user and let them modify those lists?

@dan-boehm commented Mar 7, 2014

I think creating one-off tables for a few fields where lists are important could be messier in the long run! I'm certain that, in the future, we'll think of new fields (existing and novel) that will need to be listified, and needing to retrofit each of these independently sounds a bit like a nightmare. For maintainability, it would be awesome to have a generic facility for list-valued fields that can be reused with zero (or near-zero) additional code.

By listifying every field, we would essentially have a separate table for every field. This is inefficient, messy, and (I believe) unnecessary. However, your concern is valid. I think the best solution would be one where we can incrementally convert fields to lists without having to convert everything at once. This does mean we would have to support both list and non-list fields, which will be a pain up front but shouldn't be too much of a problem in the long run, so long as it doesn't take much effort to move a field to the list system down the road if that is deemed necessary.

I have an idea for this that I'm currently looking at. I'll post again when I have something to present.

We also have to consider the user interface: how will we present lists to the user and let them modify those lists?

@geigerzaehler brings up a really good point, and I honestly don't know yet. I'd rather focus on the backend implementation right now.

@sampsyo (Member) commented Mar 7, 2014

Yes, it would be great to avoid a separate table for every list-valued field! I can only think of two ways around this, both of which have obvious issues:

  • Move everything to flexible fields (key/value schema).
  • Leave SQLite behind entirely, as suggested by @geigerzaehler.

These obvious roadblocks are exactly why this issue has stagnated for so long…

@dan-boehm commented Mar 7, 2014

As for the discussion that’s unfolding I just would like to point out that it’s probably best to delay writing code

I agree about the code-writing part. It's best to have a plan before getting too deep.

first put something written up on the wiki.

I can do that once I have a more concrete suggestion. Where on the wiki would I post such a thing?

It also might be time to push for a document oriented database.

That would probably make things easier, but I have no experience in such a thing.

@dan-boehm commented Apr 7, 2014

I know it has been a while, but I've been looking into this issue in my spare time. Based on what I've learned about Python's sqlite3 package, relational database design, and how beets currently works, I've come to a number of conclusions:

  1. The "correct" way of putting lists into a sqlite database is to create a separate table that uses foreign keys to link back to the main entry, and then have another column in this reference table for items in the list. Obviously this is a problem if we want to implement lists for every field.
  2. It would be possible to define another boolean in the schema definitions for items and albums in library.py that says whether a field is a list field or not. The idea is to convert to lists only the fields that need it, without touching the ones that obviously don't. This would be a very complicated solution: as I kept looking into all the things that would have to change, I frequently got lost and frustrated with just how difficult it would be. I think it's possible, but it would be a real huge PitA.
  3. The final option I came up with is probably the easiest to implement. I noticed that no matter the type of field, the value is always converted to a string before being stored in the database. I was surprised by this, because the sqlite3 package can do that sort of conversion for you, but it's good for us because we can run regular-expression queries on TEXT columns directly within SQLite.

SQLite provides a REGEXP operator that calls a user-defined function to check whether there is a regular expression match.

import re
import sqlite3

def re_fn(expr, item):
    reg = re.compile(expr)
    return reg.search(item) is not None

conn = sqlite3.connect(':memory:')
conn.create_function("REGEXP", 2, re_fn)

Now we can do something like this:

cur = conn.cursor()
# Table and column names cannot be bound parameters; only the pattern can be.
cur.execute('SELECT * FROM {} WHERE {} REGEXP ?'.format(table_name, column_name),
            (pattern,))

And it will return any row where the column matches the pattern.

If we were to use a separator character that we know won't conflict with any character in any value, we could store lists like that. For illustration's sake, let's say we used a semicolon as the separator character. Then we can query using a regular expression like

pattern = '^(.*;)*(' + queryValue + ')(;.*)*$'
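
Putting those pieces together against a throwaway table (illustrative only, reusing the conn and re_fn registration from the snippet above; note that the query value should be regex-escaped before splicing it into the pattern):

conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, genre TEXT)')
conn.execute('INSERT INTO items (genre) VALUES (?)', ('Rock;Blues',))

query_value = re.escape('Blues')   # escape the user's query string
pattern = '^(.*;)*(' + query_value + ')(;.*)*$'
rows = conn.execute('SELECT id FROM items WHERE genre REGEXP ?', (pattern,)).fetchall()
print(rows)  # [(1,)]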

I haven't dug into the Query class enough to understand exactly how querying the database works in beets, but I think this could be an acceptable solution. Can someone with a better understanding of beets think of any major problems or complications with this idea?

@sampsyo (Member) commented Apr 8, 2014

Thanks for all the thoughts on this! We should indeed get moving on list-valued fields.

I'm a little confused about your comment about conversion to strings. I am pretty sure we don't do this—can you explain why you think we're converting all values to strings?

In that light, packing all lists as separated strings is somewhat worrisome. Serializing and deserializing is a lot of overhead to pay for every query. Especially for numeric fields, where this will require a bunch of manipulation on each row to do any comparison.

Another way we could do this would be to effectively make all fields into flexible fields: put it all in one big id/key/value table. There are also indirection overheads there, but at least the possibility of optimization (unlike with serialization to strings).
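
As a sketch of what that one big table could look like (illustrative names, not the actual beets schema); note that repeated keys give multi-valued fields for free:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE item_attributes (item_id INTEGER, key TEXT, value TEXT)')
conn.executemany(
    'INSERT INTO item_attributes (item_id, key, value) VALUES (?, ?, ?)',
    [(1, 'title', 'Some Song'), (1, 'artist', 'Artist A'), (1, 'artist', 'Artist B')],
)

# Reading the field back yields a list without any extra schema:
artists = [row[0] for row in conn.execute(
    "SELECT value FROM item_attributes WHERE item_id = ? AND key = 'artist'", (1,))]
print(artists)  # ['Artist A', 'Artist B']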

@dan-boehm commented Apr 9, 2014

I'm a little confused about your comment about conversion to strings. I am pretty sure we don't do this—can you explain why you think we're converting all values to strings?

I'm sorry; I didn't look at types.py carefully enough. I misinterpreted a comment in there and didn't read the code closely enough to confirm what I thought it said.

Serializing and deserializing is a lot of overhead to pay for every query. Especially for numeric fields, where this will require a bunch of manipulation on each row to do any comparison.

I agree, especially since numeric values aren't being stored as strings. But I still think it's the best option short of a complete redesign of the database schema, and I definitely think it's the best option if we want list support for every field.

Another way we could do this would be to effectively make all fields into flexible fields: put it all in one big id/key/value table. There are also indirection overheads there, but at least the possibility of optimization (unlike with serialization to strings).

How would we optimize such a table? The only thing I can think of would be to index by the item/album id. That would still be extremely inefficient. We'd literally need one SQL statement for every field insert/update/read. Items have almost 30 fields; that's 30 INSERT statements just for adding a single item to the database. Add support for lists and that number gets even bigger. Reading/updating the database would be just as inefficient unless we also indexed by field name, but that would just make inserts take even longer, and it still wouldn't be as fast as a single query to the database.

Unless I'm missing a better way to optimize this, I'm having a really hard time agreeing that this would be preferable to serializing/deserializing strings.

@sampsyo (Member) commented Apr 9, 2014

You make lots of great arguments for why the key/value approach would also be inefficient. Clearly, what we really want here is a document-oriented database and we're working around the fact that we don't have one.

Maybe the most helpful thing to do here would be to hold a "bake-off". It's not completely obvious to me which approach is faster—key-value or string serialization. An ambitious contributor could try implementing tiny, toy versions of each. Both implementations should expose exactly the same interface. They could then write a couple of benchmarks that demonstrate the performance characteristics and argue—with data!—that one is faster than the other.
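
To make that concrete, here is a miniature, purely illustrative shape of such a bake-off: the same fake rows stored two ways in an in-memory SQLite database, queried for one artist, and timed. It is not a meaningful benchmark, just the skeleton of one.

import sqlite3
import timeit

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE packed (id INTEGER PRIMARY KEY, artists TEXT)')
conn.execute('CREATE TABLE kv (item_id INTEGER, key TEXT, value TEXT)')
for i in range(10000):
    conn.execute('INSERT INTO packed (id, artists) VALUES (?, ?)',
                 (i, 'Artist A\x00Artist %d' % i))
    conn.executemany('INSERT INTO kv VALUES (?, ?, ?)',
                     [(i, 'artist', 'Artist A'), (i, 'artist', 'Artist %d' % i)])

def packed_query():
    # Serialized strings: fetch every row and split in Python.
    return [row[0] for row in conn.execute('SELECT id, artists FROM packed')
            if 'Artist 42' in row[1].split('\x00')]

def kv_query():
    # Key/value table: let SQLite do the matching.
    return [row[0] for row in conn.execute(
        "SELECT item_id FROM kv WHERE key = 'artist' AND value = 'Artist 42'")]

print('packed   :', timeit.timeit(packed_query, number=50))
print('key/value:', timeit.timeit(kv_query, number=50))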

Any takers?

@geigerzaehler (Collaborator) commented Apr 10, 2014

I think the effort would be better spent on moving to a document oriented database.

@dan-boehm commented Apr 10, 2014

@geigerzaehler
I looked into both of the possibilities given in the wiki. One of them would result in the same problems we are currently having, because you have to define tables and a schema anyway. The other, which I think was called SQLiteDBM, will not suit our needs. It essentially works by pickling/unpickling Python structures and storing them in an SQLite database. This would result in worse efficiency problems than what we are proposing above, because we would no longer be able to query the database directly.

@dan-boehm commented Apr 10, 2014

@sampsyo
I think that's probably the best thing to do, even if it does seem like a pain. Perhaps we should begin by defining exactly what we want the interface to be?

I can't help but think that there must already be some software out there that can do this the way we want. I did a little looking into the document-oriented databases that were mentioned in the wiki, and while they would be really awesome to work with, I think they're a step back from our current database. Perhaps somebody else would like to look into them to see if they agree.

@sampsyo (Member) commented Apr 10, 2014

Yes, there are suspected performance problems with some of the alternative databases. But if we're going to be comparing with data, it might also make sense to compare against those, just to make sure our intuition is correct!

One project that caught my eye (not finished, but has the right idea for our use case): https://github.com/dw/acid/


And yes, it would totally be a good idea to settle on the interface. We should come up with something clean and regular. I don't exactly know what it should look like, but I like the idea of allowing simple access when you just need a single value alongside list access when you need all the data. Not sure if that's possible to support elegantly, but it's nice to dream. 😉

@dan-boehm commented Apr 10, 2014

Acid does look pretty interesting. At a quick glance, it seems quite a bit more powerful than the other options I looked at. I am having a hard time figuring out exactly where the project is at. Do you know anything about that?

@sampsyo (Member) commented Apr 10, 2014

No, it's a little opaque in terms of project status. Maybe it's worth contacting the author?

@ghost commented Feb 1, 2016

I'm interested in this too as it's one of the issues keeping me from having beets manage my music library.

Right now I just use beets to pre-process / import my music into a temporary folder (it's great for this and still saves a lot of time) then I use a separate tag editor to create multi-value entries for the artist field and manually copy the files into my main library from there.

@m-urban (Contributor) commented Feb 2, 2016

I haven't followed the entire discussion, but I wonder if we are making this too complicated by trying to come up with the perfect solution.
One suggestion was to serialize lists to a string with a unique separator. It was mentioned that this would introduce a lot of overhead because such strings would need to be de-serialized in each query. While this might generally be true, I think that the actual problem is a lot less complex.

Which tags could particularly benefit from the list feature? I think it is mostly string values, which aren't any more difficult to query when they contain a separator for multiple values. I am thinking artists, composers, performers, lyricists, conductors, genres, etc. It was stated before that numbers would be problematic. Maybe I am just lacking imagination, but among the "standard tags", I cannot think of an integer value where a list would make sense. Year? Track? Disc?

@sampsyo (Member) commented Feb 2, 2016

That's a good point—it might be worth at least prototyping a delimiter-based solution that doesn't alter the database schema. A couple of questions worth investigating first:

  • Does SQLite support strings with null bytes in them?
  • How much of a performance hit is there when querying by delimited substring vs. by equality?
@tjconnelly commented Jul 6, 2016

FWIW, this is the feature blocking my total adoption as well. I'm working on a playlist manager, so I can hack a workaround using playlists as tags in the interim, but meanwhile:

General +1, and furthermore a vote to make the tags queryable.

I'm just getting my feet wet, but could the concerns around 4 above be handled in the configuration? I.e., could there be a setting to customize which attributes are available for search in a given library, and construct from there?

Thanks for the eyeballs and the code; I love that beets exists.

@sampsyo (Member) commented Jul 7, 2016

Maybe so! That comes with its own set of strange decisions, of course: what happens when you change a field from single-valued to multi-valued? What if a tag you set as multi-value can only be single-valued in the underlying tag format?

Overall, the ideal solution (for me) would treat all tags as potentially multi-valued, but under most circumstances you could treat them as single-valued with no additional complexity—you'd just be setting and retrieving the 0th element of the value list. Alternative querying and update mechanisms would let you access the whole list.

@antlarr (Contributor) commented Apr 4, 2017

Does SQLite support strings with null bytes in them?

Just for the record, I did a quick test and it seems to be supported:

>>> import sqlite3
>>> conn = sqlite3.connect('test.sqlite')
>>> c = conn.cursor()
>>> c.execute('create table songs (id INTEGER PRIMARY KEY ASC AUTOINCREMENT, genre TEXT );')
<sqlite3.Cursor object at 0x7f4949ad5260>
>>> g=u'Rock\0Blues'
>>> c.execute('INSERT INTO songs(genre) VALUES (?)', (g,))
<sqlite3.Cursor object at 0x7f4949ad5260>
>>> b=c.execute('SELECT genre from songs')
>>> b.fetchall()
[('Rock\x00Blues',)]

The problem is that I don't think this can be used nicely to query the database. Also, if you run the sqlite3 command-line interface, it seems not to be supported:

> sqlite3 test.sqlite 
SQLite version 3.17.0 2017-02-13 16:02:40
Enter ".help" for usage hints.
sqlite> select genre from songs;
Rock
@pprkut (Collaborator) commented Apr 5, 2017

Rather than strings with a specific separation character, maybe storing JSON arrays/objects could be an option.
SQLite does have native support for working with JSON now, although it's probably not very widespread yet: https://sqlite.org/json1.html

@antlarr (Contributor) commented Apr 6, 2017

That's quite interesting; I didn't know about that SQLite extension. Right now the options would be:

  • Store the list joining the elements with a separator. This is what's currently implemented in #2503 .
  • Convert the list to a JSON array and store that. That could be implemented easily, and would allow (less easily, and assuming I haven't got it wrong) complex queries using the json1 extension you linked.
  • Use a separate table (could the flexible attributes table be used for that, or should another table be created for this?) and remove the column from the items/albums tables. In my opinion, this would allow for simpler code in the future.

What do you think? I think I would prefer the third option (the normalized database schema), but I'm open to other opinions.

@pprkut (Collaborator) commented Apr 6, 2017

I think starting off with a simple implementation using a separation character is fine, if there is a forward-migration path (i.e. we could switch to a different implementation later).
That implementation is already (almost) there and would make a lot of users perfectly happy, so it makes sense to consider that.

I mentioned JSON because it's still readable as raw value in the database and slightly more structured than a separation character. It would also open up opportunities to use JSON values for other more complex fields, like for example storing the raw values from acousticbrainz. The downside of JSON is that it's slightly less useful as is in paths, so we'd need to think about how to normalize it for that.

Separate tables would be great for organizing the data properly, but they carry significant performance drawbacks when querying information from the database. We see this already with the flexible attributes. The problem is that values stored in a single cell of the items table allow one result set to cover multiple items, whereas a separate, flexible-attribute-like table would require one query/result set per item. If you query your library for 5000 tracks, that's a difference of 5000 queries. (There are probably ways to cut down the number of queries, but that would likely require larger code changes.)

@sampsyo (Member) commented Apr 6, 2017

I suddenly like the idea of JSON as a near-term solution—it might be less of a headache than null separators. In particular, the path formatting thing isn't really a problem: we already use custom formatting routines to convert internal values into strings before formatting templates, so users shouldn't notice the internal storage format.
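
A hypothetical sketch of that formatting step: whatever is stored internally (a JSON array here), it is flattened to a plain string before it reaches a path template. The function name and separator are made up for illustration.

import json

def format_for_template(stored_value, separator='; '):
    try:
        value = json.loads(stored_value)
    except (TypeError, ValueError):
        value = stored_value            # legacy plain string
    if isinstance(value, list):
        return separator.join(str(v) for v in value)
    return str(value)

print(format_for_template('["Rock", "Blues"]'))  # 'Rock; Blues'
print(format_for_template('Rock'))               # 'Rock'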

Of course, it would make querying a bit harder unless we have a SQLite with JSON support.

@antlarr (Contributor) commented Apr 6, 2017

At least in openSUSE, SQLite has JSON support by default. Do you know if that's a reasonable dependency to add to beets? If so, I'll try to change my PR to use JSON values. BTW, would it make sense to add database schema versioning? I'm thinking we shouldn't mix regular string values with JSON values in a given column, so some code to "migrate" the column values would be needed (that is, in database format v1, genre etc. store plain strings, while in v2 they store JSON values).
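
Purely as an illustration of that migration idea (hypothetical table and column names, with SQLite's user_version pragma standing in for real schema versioning):

import json
import sqlite3

def migrate_genre_to_json(conn):
    """One-shot conversion of plain genre strings into single-element JSON arrays."""
    (version,) = conn.execute('PRAGMA user_version').fetchone()
    if version >= 2:
        return  # already migrated
    for item_id, genre in conn.execute('SELECT id, genre FROM items').fetchall():
        if genre:
            conn.execute('UPDATE items SET genre = ? WHERE id = ?',
                         (json.dumps([genre]), item_id))
    conn.execute('PRAGMA user_version = 2')
    conn.commit()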

Just for the record:

sqlite> CREATE TABLE songs (id INTEGER PRIMARY KEY ASC AUTOINCREMENT, genre TEXT );
sqlite> insert into songs (genre) VALUES (json_array('rock', 'blues'));
sqlite> insert into songs (genre) VALUES (json_array('pop'));
sqlite> select songs.id, json_genre.value from songs, json_each(songs.genre) as json_genre where json_genre.value = 'rock';
id|value
1|rock
sqlite> select songs.id, json_genre.value from songs, json_each(songs.genre) as json_genre where json_genre.value like '%o%';
id|value
1|rock
2|pop

So this looks promising :)

@sampsyo (Member) commented Apr 7, 2017

Good question—I don't know how pervasive that version of SQLite is. That would require some careful searching into the state of various OSes. On the other hand, nothing stops us from storing JSON in our database even if we can't use the fancy queries over it yet.

I suppose you're right that the nice thing about the null-separated encoding is that there is no migration required at all. A single-element list is exactly the same as a plain string.

@sampsyo (Member) commented Apr 7, 2017

About versioning: so far, we've been able to avoid it. The schema benefits from being so incredibly simple that we never really need to grapple with complicated migrations. We might need to explore it now, of course!

@mafrasi2 commented Aug 25, 2017

I don't know how pervasive that version of SQLite is. That would require some careful searching into the state of various OSes

As I'm very interested in this feature, I did some research on this:

Debian (and probably Ubuntu): enabled in stretch since 2015-10-19
                              disabled in jessie and older
Arch    : enabled since 2016-01-19
Fedora  : enabled since 2016-01-22
CentOS  : disabled
OpenSUSE: enabled
FreeBSD : disabled by default, but simple to enable
Windows : enabled
Mac OS X: disabled according to https://docs.python.org/3/library/sqlite3.html#f1

The biggest barrier is probably Mac OS X.

@sampsyo (Member) commented Aug 25, 2017

Hmm; that is tricky. As usual, macOS is behind.

Maybe we can architect this feature so it degrades gracefully on systems without the appropriate SQLite feature?
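
One way that could start (a sketch, not an actual beets API): probe once for json1 and let the query layer choose between json_each()-based SQL and a plain fetch-and-decode-in-Python fallback.

import sqlite3

def has_json1(conn):
    """Return True if this SQLite build exposes the json1 functions."""
    try:
        conn.execute("SELECT json_array('probe')")
        return True
    except sqlite3.OperationalError:
        return False

conn = sqlite3.connect(':memory:')
print('json1 available:', has_json1(conn))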

@divanikus commented Aug 6, 2018

While you guys are thinking of some generic solution, I came up with the following hack in mediafile.py:

class ListMediaField(MediaField):
    """Property descriptor that retrieves a list of multiple values from
    a tag.

    Uses ``get_list`` and ``set_list`` methods of its ``StorageStyle``
    strategies to do the actual work.
    """
    def __get__(self, mediafile, _):
        values = []
        for style in self.styles(mediafile.mgfile):
            values.extend(style.get_list(mediafile.mgfile))
        return [_safe_cast(self.out_type, value) for value in values]

    def __set__(self, mediafile, values):
        for style in self.styles(mediafile.mgfile):
            style.set_list(mediafile.mgfile, values)

    def single_field(self):
        """Returns a ``MediaField`` descriptor that gets and sets the
        first item.
        """
        options = {'out_type': self.out_type}
        return ListMediaSingleField(*self._styles, **options)



class ListMediaSingleField(MediaField):
    """Property descriptor that retrieves a list of multiple values from
    a tag.

    Uses ``get_list`` and set_list`` methods of its ``StorageStyle``
    strategies to do the actual work.
    """
    def __get__(self, mediafile, _):
        values = []
        for style in self.styles(mediafile.mgfile):
            values.extend(style.get_list(mediafile.mgfile))
        return "; ".join([_safe_cast(self.out_type, value) for value in values])

    def __set__(self, mediafile, value):
        values = value.split("; ")
        for style in self.styles(mediafile.mgfile):
            style.set_list(mediafile.mgfile, values)

Combined with

lastgenre:
  canonical: yes
  count: 5
  separator: '; '

Works pretty well for me, at least for MP3s and FLACs.
I know it's not a production-quality solution, but it just works in my case: beets can both read and write multiple genre tags.

@GuilhermeHideki (Member) commented Oct 18, 2018

I was tinkering with a join table for flexible attributes and lists (I guess this is kind of horizontal partitioning). I don't know how much "magic" it would need to build the query from user input plus a "metadata" table, like in CMSs where you can customize the fields. The idea would be something like this:

CREATE TABLE songs (
  id INTEGER PRIMARY KEY AUTOINCREMENT, 
  title TEXT
  --   ...default fields
);

-- list type
CREATE TABLE songs_artists (
  id INTEGER, 
  ord INT DEFAULT 1, 
  name TEXT NOT NULL, 
  joiner TEXT, -- feat
  PRIMARY KEY (id, ord),
  FOREIGN KEY (id) REFERENCES songs(id)
);
# example
from collections import OrderedDict

fields = OrderedDict({
  'title': 'songs.title',
  'artists': 'GROUP_CONCAT(songs_artists.joiner || songs_artists.name, "")'
})

joins = [
  'JOIN songs_artists ON songs.id = songs_artists.id'
]

filter_subselect = f"SELECT id FROM songs_artists WHERE {criteria}"
sql = (f"SELECT {', '.join(fields.values())} FROM songs {' '.join(joins)} "
       f"WHERE songs.id IN ({filter_subselect})")

# fetch and create objects / DTOs etc
@sampsyo (Member) commented Oct 18, 2018

Sure! This seems like about the right design for turning arbitrary fields into lists. I'd be interested to investigate what the potential performance trade-offs are…

@Freso (Member) commented Feb 20, 2019

Maybe we can architect this feature so it degrades gracefully on systems without the appropriate SQLite feature?

For PostgreSQL, I've seen people recommend converting databases with JSON fields to regular string fields and relying on downstream software to decode the JSON when dealing with pgsql versions prior to native JSON support. Maybe beets could also just store the JSON as text and decode it itself if we can't get/store "real" JSON objects from native SQLite?

(This also seems compatible with the json1 extension: "The json1 extension (currently) stores JSON as ordinary text."[1])

@jackwilsdon (Member) commented Feb 20, 2019

@Freso that'd mean we'd need two versions of our queries to degrade nicely for versions of SQLite without json1 support, as the main advantage of using JSON storage is the set of functions supported by the SQLite json1 extension.

@Freso (Member) commented Feb 20, 2019

@jackwilsdon I think there are some other great advantages, e.g., that you can store lists of non-string objects (incl. other lists!), which is not possible with the "CSV" approach (regardless of what we decide our "C" to be). (We haven't found a use case for these though, but it's nice to go down a path that is known to be easily extendable.)

I do agree that it's a bother, and I wouldn't mind if it was decided to be a beets 2.0.0 release feature that simply has a hard requirement for json1 SQLite support. Older distributions and builds that don't have it enabled would then have to use beets <2, which is also working fine for most cases. (Mostly I was just commenting to hopefully get the issue moving along a bit. (Five years ago, @sampsyo said "We should indeed get moving on list-valued fields."…) :))

@pdf (Contributor) commented Feb 20, 2019

What's wrong with just using the standard relational DB option of a join table? Losing indexing and query support seems like a poor trade-off for storing simple tabular data as JSON.

@jackwilsdon (Member) commented Feb 20, 2019

@Freso I'm certainly not against storing it as JSON; my point was mainly about the fact that we wouldn't be able to use any of the advantages of json1 until we deprecated support for non-json1 versions of SQLite, which is less than ideal.

I guess one thing to look at is % of installations that support json1 — we'd probably have to go digging in different distros or maybe outsource some of the statistics to the community if they can't be found elsewhere.
