Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multivalue genre tag support #2503

Open
wants to merge 31 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@antlarr
Copy link
Contributor

commented Apr 4, 2017

These changes allow beets to store multiple genre tags for multivalues, with the intent of fixing #505 . Note that this is not ready to be merged yet, but it's a beginning of discussion on how to properly fix the issue.

Note that these commits store lists in album.genre and item.genre properties, while storing them as text strings in the database using a separator. Also, I checked that if a user has something like this as path definition:

paths:
    default: $albumartist/$genre/$album%aunique/$genre/$track - $title

then $genre is formatted as a string correctly . In my test case, creating a directory called "Rock;Blues" .

This can break some complex configurations. For example, I had this as album_fields definition to add a album_genre_for_path tag that would contain the most used genre in an album:

album_fields:
    album_genre_for_path: |
        if 'album_genre' in globals():
            return album_genre.title()
        if hasattr(items[0], 'album_genre'):
            return items[0].album_genre.title()
        genres = {}
        cnt, candidate = 0, ''
        for item in items:
            genres[item.genre] = genres.get(item.genre, 0) + 1
            if genres[item.genre] > cnt:
                cnt, candidate = genres[item.genre], item.genre.title()
        return candidate

When evaluating this tag, beet broke sine item.genre is no longer hashable. I think this is a step better than silently breaking the directory structure. I fixed it easily by replacing the code above with:

    album_genre_for_path: |
        if 'album_genre' in globals():
            return album_genre.title()
        if hasattr(items[0], 'album_genre'):
            return items[0].album_genre.title()
        genres = {}
        cnt, candidate = 0, ''
        for item in items:
            if isinstance(item.genre, list):
                values = item.genre
            else:
                values = [item.genre]
            for genre in values:
                genres[genre] = genres.get(genre, 0) + 1
                if genres[genre] > cnt:
                    cnt, candidate = genres[genre], genre.title()
        return candidate

Then I also tested what happens when reading a file written by another application with multiple genre tags:

Python 3.6.1 (default, Mar 23 2017, 13:04:44) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import beets.mediafile
>>> mf=beets.mediafile.MediaFile('01 - Big Brother & the Holding Company -  Bye, Bye Baby.flac')
>>> mf.genre
['Rock', 'Album Rock']

Which is what I would have expected.

@mention-bot

This comment has been minimized.

Copy link

commented Apr 4, 2017

@antlarr, thanks for your PR! By analyzing the history of the files in this pull request, we identified @sampsyo, @Kraymer and @brunal to be potential reviewers.

@antlarr

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2017

I tested this with flac and mp3 files (that is, with APE tags, where multiple genre tags are written and ID3 tags, where \0 bytes are used to separate genres in a single ID3 tag). Everything works as expected due to ListMediaField (btw, congratulations to whoever implemented it and the *ListStorageStyle classes, they are really nicely done)

The remaining questions are:

  • Should a global parameter be added to the configuration? If yes, should it be named multivalue_separator or ... ? Also should it default to ';' ? '; ' ? ', ' ? This should be also set as the default value for tmpl_first, so it behaves correctly when handling the genre tag.
  • What to use for queries in the StringList type? I think using SubstringQuery is probably "good enough", but that can be discussed as it could return some unexpected matches . I'm thinking on someone querying for "ck;Bl" for example, but on the other side, maybe someone would like to query for "Rock;Blues" to get items with both values (although that wouldn't match genres stored as "Blues;Rock" which should be considered ).
  • Should a config option be added (in the mood of id3v23), like ... use_single_tags_with_separator (defaulting to False) just in case any user prefers to use the current behavior in case (s)he is using the music files with another application that doesn't support multivalue genre tags yet?
@sampsyo
Copy link
Member

left a comment

Very cool! This is diff is surprisingly small—I expected this to be a far more disruptive change.

A couple of ideas come to mind for improving backwards compatibility, neither of which we necessarily have to use:

  • We could keep a field called genre that just gets the joined string, and add a new genres field for the list. That's what MediaFile previously did.
  • We could somehow allow a string to be assigned into the genre field, which would then be treated as a single-element list.

The only thing that makes me somewhat uncomfortable here is the age-old slippery slope question. How many other fields would it make sense to transform into lists? Should artist be a list? albumartist? If we make too many changes like this, will we disrupt simpler code that just cares about a single value?

Anyway, genre is clearly the best use case for this, so maybe it's not worth worrying too much about other tags.

Show resolved Hide resolved beets/dbcore/types.py Outdated
Show resolved Hide resolved beets/dbcore/types.py Outdated
Show resolved Hide resolved beets/dbcore/types.py Outdated
@antlarr

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2017

Btw, I guess you'll doubt that getall('TIPL') can be replaced with get('TIPL') (mainly because I doubted about it too :) ). So here is the test I did with mutagen to check that it doesn't allow two TIPL frames:

>>> import mutagen
>>> m=mutagen.File('full.mp3')
>>> m.tags.get('TIPL')
>>> frame=mutagen.id3.Frames['TIPL'](encoding=mutagen.id3.Encoding.UTF8, people=[['aa','bb'],['cc','dd']])
>>> m.tags.add(frame)
>>> m.save()

Here, I checked with okteta (an hex editor) that the file included the added TIPL frame.

>>> frame2=mutagen.id3.Frames['TIPL'](encoding=mutagen.id3.Encoding.UTF8, people=[['ee','ff'],['gg','hh']])
>>> frame
TIPL(encoding=<Encoding.UTF8: 3>, people=[['aa', 'bb'], ['cc', 'dd']])
>>> frame2
TIPL(encoding=<Encoding.UTF8: 3>, people=[['ee', 'ff'], ['gg', 'hh']])
>>> m.tags.add(frame2)
>>> m.tags.getall('TIPL')
[TIPL(encoding=<Encoding.UTF8: 3>, people=[['ee', 'ff'], ['gg', 'hh']])]
>>> m.save()

and here, I checked again with an hex editor that the frame wasn't just added, but it replaced the previous TIPL frame.

@sampsyo
Copy link
Member

left a comment

Things are coming along nicely! Here are a few more comments on the latest.

if value is not None:
if value:
formatted[key] = u'; '.join(value)
elif value is not None:

This comment has been minimized.

Copy link
@sampsyo

sampsyo Apr 5, 2017

Member

This, for what it's worth, is one of the things that worries me about making arbitrary fields into lists: now client code will frequently need to have boilerplate for the form if isinstance(value, list):. This sort of thing puts additional weight on the side of having an explicit way to access lists, or some other foolproof conversion to produce a scalar value.

This comment has been minimized.

Copy link
@antlarr

antlarr Apr 5, 2017

Author Contributor

I understand what you mean. Note that list fields (genre, arranger ... ) now are always lists, the problem comes with code like that which tries to work with all fields the same way. Would you prefer that I create a class that inherit from lists and reimplements __str__ so it returns '; '.join(self) ? That way code like that could be changed to:

if value:
    formatted[key] = str(value)

But still, that's not perfect neither.

Show resolved Hide resolved beets/mediafile.py Outdated
@@ -582,7 +582,10 @@ def set(self, mutagen_file, value):
"""Set an individual value as the only value for the field using
this style.
"""
self.set_list(mutagen_file, [value])
if value is None:

This comment has been minimized.

Copy link
@sampsyo

sampsyo Apr 5, 2017

Member

That's odd—I'm a little surprised that ListStorageStyle is allowed to return None from get. That seems like it makes things more complicated for clients in general. Maybe that's worth fixing?

This comment has been minimized.

Copy link
@antlarr

antlarr Apr 5, 2017

Author Contributor

In fact, I'd say it's ok. get is implemented as:

          try:
              return self.get_list(mutagen_file)[0]
          except IndexError:
              return None

So None is returned when get_list returns an empty lists, that is, when the tag doesn't exist in the file. This allows to identify when a tag doesn't exist and allows to remove a tag from a file. Do you think that it would be better to return an empty string and always interpret an empty value as a non-existant tag? (so setting ... say artist to '' would delete the tag entry)

@antlarr

This comment has been minimized.

Copy link
Contributor Author

commented Apr 5, 2017

I just noticed I had this written in a browser tab but forgot to click comment. I'll post it for the records, even if it's probably old now.

Very cool! This is diff is surprisingly small—I expected this to be a far more disruptive change.

Yes, I also was surprised, but most of the work was already implemented in ListMediaField, the *ListStorageStyle classes and mutagen.

A couple of ideas come to mind for improving backwards compatibility, neither of which we necessarily have to use:

We could keep a field called genre that just gets the joined string, and add a new genres field for the list. That's what MediaFile previously did.

Yes, I saw that, but the genres tag is (afaik) not used by any other application, but genre is.

We could somehow allow a string to be assigned into the genre field, which would then be treated as a single-element list.

Have a look at the test_write_genre_list_from_string test I added in test_mediafile.py to test fad5809 :). I think that's even better since it allows assigning a string but automatically converts it to a list.

The only thing that makes me somewhat uncomfortable here is the age-old slippery slope question. How many other fields would it make sense to transform into lists? Should artist be a list? albumartist? If we make too many changes like this, will we disrupt simpler code that just cares about a single value?

Hmm, I don't think that would be too much work. I can give it a try if you want. (This is now done)

def negated_clause(self):
clause, subvals, table = self.col_clause()
_table = self.model_cls._table
primary_key = '%s.id' % (_table)

This comment has been minimized.

Copy link
@antlarr

antlarr Apr 12, 2017

Author Contributor

@sampsyo Do you think it's worth it to add a get_primary_key_field to Model so instead of hardcoding id I call self.model_cls.get_primary_key_field() which would iterate over its fields and search which has a PRIMARY_ID type? I think that would be more elegant, but I'm not sure if it's worth the effort.

@@ -842,11 +882,15 @@ def _fetch(self, model_cls, query=None, sort=None):
"""
query = query or TrueQuery() # A null query.
sort = sort or NullSort() # Unsorted.
where, subvals = query.clause()
where, subvals, new_tables = query.clause()

This comment has been minimized.

Copy link
@antlarr

antlarr Apr 12, 2017

Author Contributor

@sampsyo should I still allow Query objects to maybe return 2-tuples just in case any external plugin still uses the old API? It's easy to check for unpack errors but it's a bit ugly, so I preferred to ask your opinion.

Show resolved Hide resolved .travis.yml Outdated

@antlarr antlarr force-pushed the antlarr:multivalue-genre-tag branch 3 times, most recently from dcf1c93 to 72ab5c9 Apr 12, 2017

antlarr added some commits Apr 4, 2017

Add a StringList type
A StringList type is a list of strings that can be formatted as a string
(using a separator character/string) to be stored in the database or used
as path component.

By default, I used ';' as separator, but I guess that should be made
configurable through a global config parameter (though I guess most people
would use ';' or ', ').

Also, probably query should be something different from
query.SubstringQuery, but that can be changed later.
Use the new StringList type for genre and store it using multiple tags
genres is removed, but that's not standard, nor needed anymore.
Make lastgenre return the genres as a list
This removes the need for a separator parameter, which now should
be made global.
Define a null value for StringList, parse '' correctly and use '; ' a…
…s separator

- Define a null value as []
- Parse '' as the null value (and not as ['']).
- Use the same separator ('; ') as tmpl_first does by default, so tmpl_first
keeps working as usual.
Allow to set a multivalued string to a ListMediaField and a test for it
Allow to set item.genre = u'Rock; Blues' and automatically convert that
to a string, so we keep better backward compatibility. Also added a test
for that
Store empty lists as null in the database
Make to_sql and from_sql return/accept None values
Fix bpd plugin with the new genre as a list
Convert it to a string before adding the genres list to a string
Accept None value in ListStorageStyle.set
ListStorageStyle.get can return None, so set should accept it
Accept [] in MP3ListStorageStyle.store
MP3ListStorageStyle.fetch can return [], so store should accept it.
Inherit MP3PeopleStorageStyle from ListStorageStyle
MP3PeopleStorageStyle now gets/sets lists of people.

Note that MP3PeopleStorageStyle should only be used with TIPL tags
so there's no need to use getall, since it'll always return only one
element (or None).
Don't print fields whose value is an empty list
Also remove an unneeded piece of code I added in a previous commit.

antlarr added some commits Apr 6, 2017

Add `multivalue_separator` config option so '; ' is not hardcoded any…
…more

Use the `multivalue_separator` config option whenever splitting or
joining multivalue fields
Use tuples instead of lists for multivalue fields
This adds a very simple TupleMediaField class that inherits
ListMediaField and just converts the result of __get__ to a tuple
since ListMediaField is also used for ImageListField, and I didn't
know if that should be changed to tuples too.

Changed all tests to use tuples accordingly.
Store/Load StringList type as json and migrate non-json values in the db
The StringList type now uses json to store the list of strings in the database.

Also, since sampsyo expressed he tried to avoid using database scheme
versioning, I added a migration method that I think should be quick enough
to be run on every start (although I think I still would prefer to use
versioning, which would allow for this to be run once). This iterates on
every entry in a StringList field that sqlite recognizes as "not valid json"
 and converts it to json. After the database is migrated once, this should
return 0 rows so the migration would take only the time it takes sqlite
to test JSON_VALID() on StringList types.
Convert valid json strings to json arrays too
There may be cases where genre contains values like `"Symphonic Rock"`
since `JSON_VALID('"Symphonic Rock"')` is True for sqlite, we have
to check too that the json value is actually an array.
Add JSonSubstringListQuery class and other query related changes
The new JSonSubstringListQuery class implements a SubstringQuery over a
StringList type using json sqlite functions in the database. For this,
I had to change the clause() method return tuple to add a new
element which is a list of tables to add to the FROM in the final
sql expression. So now, JSonSubstringListQuery returns something like:

```
("json_genre1.value like ? escape '\\'",
 ('Rock',),
 ('json_each(genre) as json_genre1',)
)
```

Since json_each is a table that has a column named id, This would
collide with the id column from items/albums, so I changed
Database.fetch from `SELECT * ...` to
`SELECT DISTINCT items.id, items.title, items.artist, items.genre ...`
I added a DISTINCT since using multiple tables can repeat results.

Now it's possible to do things like:
`beet ls genre:Classic`
which returns Classical music as well as 'Classic Rock' music
`beet ls genre:Classic genre:Rock`
which returns music tagged as Classic AND Rock (or Classic Rock)
`beet ls genre:Classic , genre:Rock`
which returns all Classical music and all Rock music

All of that, using json data, so:
beet ls 'genre:["Rock'
doesn't return any result.
Implement JSonSubstringListQuery negated clause
Negating json sql queries is not as simple as prefixing the where clause
with NOT, so now NotQuery first checks if the subquery has a
negated_clause function, and uses it if possible.

I implemented JSonSubstringListQuery.negated_clause to return
expressions like:

```
items.id NOT IN (SELECT items.id
                   FROM items,
                        JSON_EACH(items.genre) AS json_genre1
                  WHERE json_genre1.value LIKE 'Rock')
```

This allows to run commands like:

`beet ls genre:Blues ^genre:Rock`
Add JSonRegexpListQuery class and parametrize the regexp uses
Add a new JSonRegexpListQuery class that implements regular expressions
queries on StringList fields.

To use it without hardcoding any class names, I had to add more
flexibility to the query `prefixes` dictionary. Now, a type defines not
only what query class is used with it, but what query class is used for
regular expressions and the `prefixes` dictionary not only associates
each prefix with a Query class, but can associate a prefix with a
function that gets a field name and returns a Query class to use.
Check if sqlite supports json functions and do equivalent code in python
Sqlite includes the json extension since version 3.9.0 and there aren't
many distributions that don't have it, but there are still a few
important distributions that have sqlite < 3.9.0 (Debian stable,
centOS, openSUSE Leap and Ubuntu Trusty), so in those cases, fallback
to use equivalent python code, which is slower, but at least we don't
leave any distribution behind.

Btw, Arch, Debian stable backports|testing|unstable|experimental, Fedora
24|25|26|Rawhide, FreeBSD, Gentoo, Mageia, Manjaro, openBSD, openSUSE
Tumbleweed and Ubuntu 16.04 LTS|16.10|17.04 already contain at least 3.11.0.
Fix documentation related to multivalue_separator
`multivalue_separator` is no longer used to store multivalue fields
in the database, so remove the sentence that says that.
Do not fail migrating the database when iterating on new fields
If a new field is added to the model, the migration happens before
it's added, so it can try to recover the values of a non-existing column.
In that case, just ignore the error since there's obviously nothing
to migrate there.

@antlarr antlarr force-pushed the antlarr:multivalue-genre-tag branch from 7b0b7fc to 06f7c04 May 30, 2017

@arcresu

This comment has been minimized.

Copy link
Member

commented May 19, 2019

@antlarr thanks for your work on this PR. It's been a while since it's had any attention, but it seems that multi-valued tags are still a feature that people would like to see!

Just a heads up that we're planning on splitting MediaFile out into a standalone Python module & git repository during the current beets release cycle, and this is clearly going to affect this PR. Your branch is anyway conflicting with master already, but after the split it'll mean that the changes will also need to be split into two separate sets. Are you still interested in working on this? If so it might be sensible to start with the MediaFile changes at beetbox/mediafile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.