Quality-based Trumping during import #116

Open
ghost opened this Issue Feb 28, 2013 · 54 comments

Projects

None yet
@ghost
ghost commented Feb 28, 2013

This issue was automatically migrated from Google Code.
Original author: telejes...@gmail.com (February 16, 2012 01:16:19)
Original issue: google-code-export/beets#341

@hchapman
hchapman commented Apr 3, 2013

This would be superb. In my spare time I was going to try to write something like this.

@sampsyo
Member
sampsyo commented Apr 3, 2013

Awesome! Now that we have a central place in the importer workflow where replacement decisions are made, this should be relatively straightforward to implement -- it's just a matter of coming up with the right design for how it should work.

On Wed, Apr 3, 2013 at 9:47 AM, Harrison Chapman notifications@github.com
wrote:

This would be superb. In my spare time I was going to try to write something like this.

Reply to this email directly or view it on GitHub:
#116 (comment)

@hchapman
hchapman commented Apr 7, 2013

This is my first working with beets, so mind how naive I am.

I've written a really silly (and still crufty) plugin which tries to use eyed3 to read the Xing header of the mp3 to determine the encoding preset (e.g. v0, cbr 320) and adds a $quality name template variable. It'd be nice to know the presets because if I were quality trumping, I'd personally take v0 over a 320, but a 320 over a v2, and I do not believe the ABR is enough information to do this reliably always.

I'd like to make this into a plugin, since right now it depends on eyed3, which I do not believe beets presently does. Any chance that the plugin docs are out of date, and that the database is now extensible by plugins? It would be nice to be able to beets ls quality:v0 or similar to help sort through the library without having to inspect every single mp3, and I don't if there is another way to implement this type of behavior.

I haven't given any look into modifying the replacement decisions via a plugin yet: my brief scan of the plugin docs didn't seem to include much information on this. I'll have to scan the source.

@mister-walter

Check out this thread on the Google Groups forum for information on storing arbitrary data in the beets library. Sampsyo indicated that such a feature (flexattrs) might be coming in beets version 1.2.

@sampsyo
Member
sampsyo commented Apr 8, 2013

Very cool! Nice work, and not silly at all. 🐤 🌟 I really like this approach.

Regarding the dependency on eyed3: yes, we'll need to replace that with an equivalent parser for Mutagen/MediaFile. Do you know how eyed3 determines the LAME preset? Is it by parsing the "encoder" string? If so, then we can re-implement that in a fairly straightforward way.

As @mister-walter mentioned, yes, we're currently in the middle of building support for arbitrary metadata fields in the database. See also #101 for the current status of this feature. I'm planning to make it the first big feature in 1.2, which will start development in earnest soon when 1.1 is out the door (very nearly ready). This is yet another good use case for that feature.

@hchapman
hchapman commented Apr 8, 2013

I think eyed3 uses the xing header which is in the blank first frame of the mp3. Encoders like LAME use it, and possibly some others as well.

I don't know how eyed3 uses this data to determine/guess the encoding preset! It's open source though...

@sampsyo
Member
sampsyo commented Apr 8, 2013

Cool. That's worth investigating to see how complex the logic involved is.

@hchapman
hchapman commented Apr 8, 2013

By lifting the LameHeader class and three utility functions from eyed3, I've added super rudimentary LAME header support to mutagen in this branch (super rudimentary means that it just tries to find the LAME header, and spits it out while it opens an mp3 file, with no actual functionality). It's understandable that this code isn't in mutagen, which doesn't seem to be concerned with this extra data (especially since it may or not have an equivalent in any other type of media file), but it was interesting to see how it's certainly not impossible to add it.

To test:

import mutagen.mp3
mutagen.mp3.Open(MP3_FILE_NAME)
@sampsyo
Member
sampsyo commented Apr 8, 2013

Cool; thanks. Looks like this header is actually substantially more complex than I originally imagined—the Xing header is yet another arcane binary format. FWIW, there's discussion (years old now) on the Mutagen bug tracker for including similar functionality:
http://code.google.com/p/mutagen/issues/detail?id=66

There's a patch but it needs some cleaning up. Maybe the best course of action here would be to pick up that patch, which seems to have been abandoned by the original author, clean it up, add tests, and land it in Mutagen for the next release. (The Mutagen maintainers have been pretty receptive lately.)

Who knew this little bit of data would be so onerous to recover?

@hchapman
hchapman commented Apr 8, 2013

That explains a lot; I didn't even consider mutagen wouldn't be hosted on github, and I didn't think to look at googlecode/bitbucket. I'll give that a look over.

@hchapman

The patch applies to present master branch! It works for one of my songs, but none of the test files included with mutagen at present have any preset data :(.

@hchapman

I've put the functionality merged into today's master/default (whatever it's called with hg!) with testing on my Bitbucket fork of mutagen. I've also told the mutagen issues tracker. I haven't heard anything on the issue yet, though.

@sampsyo
Member
sampsyo commented Apr 15, 2013

Awesome, thanks for taking action! Let's see what the authors think; if necessary, we can email the Quod Libet mailing list in a few days.

@hchapman

Well- it does not seem like there is any "way" for a plugin to, at present, influence the importer in checking for duplicates, or even to modify the manual duplicate resolver in the UI (e.g. make it say "we have this album as MP3 in the database, but you're importing a FLAC"). Correct me if I'm wrong! I'm going to think about how to add this broad functionality in.

@sampsyo
Member
sampsyo commented Apr 17, 2013

That's correct; the hook doesn't exist yet. We'll need to add it in the resolve_duplicate method of ImportSession in importer.py.

@hchapman

Ideally the plugin (at least one which knows the user's preference to replace MP3 with FLAC) would be UI-agnostic, and as of now resolve_duplicate just throws a NotImplementedError (and the implementation in ui doesn't call the super method as of yet). My initial thought was to add the hook right before resolve_duplicate is called.

I'm not used to thinking this much about software infrastructure. It's fun! Please let me know if the issues thread is not an appropriate place for my musing...

@sampsyo
Member
sampsyo commented Apr 17, 2013

Yes, absolutely—you're right that invoking the plugin before the call to resolve_duplicate makes more sense. That way, subclasses can override that method without replacing this new functionality. Good call.

And yes, absolutely, this issue tracker is the perfect place for this kind of discussion! To me, this is more organized and therefore useful than the mailing list or ad-hoc emails... there's some hope of coming back to the discussion later when we need to remember the decisions we made.

@hchapman

Moral conundrum: I'm not used to parallel programming so... It'd be nice to add task duplicate data into Importer task, so that plugins could add or remove duplicates automatically by some score calculation... It's easy to me to just carry along duplicate data as, say, a list, rather than calling _check_duplicates twice. But this crumbles under the completely possible scenario where a user has e.g. FLAC and MP3 in the import query for the same album, and the user would like the FLAC to auto trump the MP3. But it's not clear to me that we can just reliably hold on to any ad-hoc album data added to recents in user_query and expect it to still be correct by the time the better quality comes through...

I originally hoped this wouldn't need any sweeping infrastructure changes. Just saying where I'm stumped, and how the coroutine solution is pretty brilliant and still magic to me..

@sampsyo
Member
sampsyo commented Apr 23, 2013

Yes, the recent check is a complicated wrinkle in all of this.

One simplification that may help: all tasks hit the database in order. That means that, in the apply_choices coroutine, duplicate albums are already guaranteed to be in the database, even if they were just imported. That means that if the plugin can make all of its decisions there, without user input, then its job is simplified.

So a plugin could prompt the user (or not) based on the limited data available in recent during user_query and then do the heavyweight comparisons later, in apply_choices. Would that make sense?

@hchapman

Here's a branch with a preliminary solution. I've moved everything dealing with duplicates (including the user query...) to the apply_changes. It's pretty much entirely untested, and I'm presently not really happy with the implementation.

This beetsplug uses it and it also uses my branch of mutagen (about which I've still heard nothing, although I've still only just posted on the googlecode issue) to get preset data. It's a rudimentary quality trumper which prefers FLAC > V0 > 320 > [etc...]

@hchapman

I've updated my branch so that all tests pass, but I haven't written any new tests. Curious question: there was an issue in _duplicate_check where cur_artist threw an AttributeError... But the same never happened in the item counterpart...

@hchapman

It just hit me why moving a user query for the duplicate check is not a good solution. I'm also trying to work in a score system for duplicate replacement calculation instead of a binary take-it-or-leave-it

@sampsyo
Member
sampsyo commented Apr 24, 2013

Great; I'm glad that makes sense. Yes, all communication with the user has to be done in the same pipeline stage.

@hchapman

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.

@sampsyo
Member
sampsyo commented Apr 24, 2013

Interesting approach! I like the change of recent from a set to a dict so we keep around the album itself rather than just its identifier. We should definitely use this (but ensure that it doesn't leak memory...).

And the scoring system is interesting too, but I can also see a more specific interface (which might ask "given these albums, which is the best?" instead of "what's the score for this album?") being somewhat simpler. It would push some of the intelligence into the plugins rather than burdening the import flow more, which is already frighteningly complex. Does that make sense?

On Apr 24, 2013, at 1:33 PM, Harrison Chapman notifications@github.com wrote:

I've changed my strategy to a "trump score" system paralleling the distance that's in now for MB matching. Progress on this branch is preliminary, but works for a really rough manual test I've run myself. Automated tests don't fail, but there are no new ones to test my code.


Reply to this email directly or view it on GitHub.

@hchapman

The primary reason behind the score would be an (immutable, i.e. not changing after the tracks are imported into the library) number which wouldn't require holding on to any further data of items being imported. Perhaps plugins would be able to declare which properties about recent imports to hold on to? This doesn't really parallel anything in beets that I can think of at present however... Comparing relatively would involve holding on to some of this recent import information specifically, and since the changes haven't been applied yet, I'm not sure how messy this could get.

An example reason for this would be quality trumping-- right now quality trumping involves having the mutagen file info somewhere and I don't know for sure if the actual import/moving into library could cause some sort of rare race conditions when trying to grab/parse this information from tasks long finished.

Certainly with a score system the idea would be to present the user with a choice if the new import score doesn't exceed some threshold (which would ideally only happen if the new import completely eclipses an old; better quality, all the same tracks, same length, same release year...), perhaps with some delta information to explain to the user why we're scoring duplicate choices as such.

@sampsyo
Member
sampsyo commented Apr 24, 2013

Scoring seems just fine as a general approach to this problem; I'm just questioning whether the logic that handles scores should be done in importer.py or in qualitytrump.py. If at all possible, it would be great to move as much as possible (e.g., finding the max, totaling up track_scores and album_scores, etc.) to plugins to simplify the importer itself as much as possible. Of course, it's possible there's no clean way to do that.

To avoid races, one approach you could take in the plugin is to calculate scores when import tasks are first created (handling the import_task_start event) so you have all scores available for later comparisons. You can keep an internal map from tasks to scores or just assign your scores right on the task objects (thanks, Python!).

@telejester

Trump attachment processing is a bit less open-ended by adding the semantic framework of #111. In my original ticket for Trumping on gcode, I mentioned image handling and cue and log files.

In that vein, I just added some more specific ideas to the attachment ticket (gcode 109). In the name of redundancy prevention, please see https://code.google.com/p/beets/issues/detail?id=109#c21

Trump processing needs to consider the attachments of each candidate album and item as they are moved. Images ought to be trumpable as well, with higher-res images replacing lower-res versions of the same image. I pointed at a resolution and scale-independent image comparison library which ought to help a lot in determining "sameness" although I can't vouch for how well it works in real life.

Cue and log files are interesting in that they are only valid if an album consists of the complete set of items they refer to. Any item or album trumping means the cue and log have to go, even if the new version doesn't have replacements.

@telejester

I also have no idea what to do with assets which have been trumped. I suppose the options are "delete" (which I find terrifying) and "mark and move", but to where? A "discard" path?

@hchapman

beets already seems to be content skipping/removing albums at user query, so perhaps the trumping could (possibly unless the user informs otherwise) provide a summary of/query for the destructive actions it will take at the end of an import? I am not entirely clear if this fits into the structure/morals of the project however.

@telejester

I wouldn't want to change the ui assumptions that much.

The more I think this through, the more I like adding a 'discard' path
which has the same organization as the library tree except items and
attachments which are moved there are removed from the beets db.

There ought to be "delete" config option for the brave (or
disk-space-limited) but I think a nondestructive default is wise.

I like this option because the discarded assets would be kept organized
and tagged, but outside the library tree and database. That way, if
there is a bug or some error is made, the affected assets could be found
and re-imported easily but as far as beets is concerned, once discarded
they are out of the db, don't exist any more, and are no longer beets'
responsibility.

@sampsyo
Member
sampsyo commented May 14, 2013

Interesting! I like the "discard pile" idea. We could even use the system trash for this purpose, as in #149.

@sampsyo
Member
sampsyo commented May 24, 2013

A couple of notes on this:

  • The project called Dnuos has code that detects LAME presets in MP3 headers.
  • See also #287, which proposes using album-level fields in path formats—specifically quality/bitrate information.
@hchapman

Dnuos is, I believe, the source of the original mutagen patch proposed
years ago, which I tried to make admissible (but have since heard nothing).
On May 24, 2013 12:14 PM, "Adrian Sampson" notifications@github.com wrote:

A couple of notes on this:

  • The project called Dnuos has code that detects LAME presetshttps://bitbucket.org/brodie/dnuos/src/01f6eb7e49bf3bf8fce964e3630e202754e21334/dnuos/audiotype.py?at=default#cl-487in MP3 headers.
  • See also #287 #287, which
    proposes using album-level fields in path formats—specifically
    quality/bitrate information.


Reply to this email directly or view it on GitHubhttps://github.com/sampsyo/beets/issues/116#issuecomment-18414680
.

@sampsyo
Member
sampsyo commented May 24, 2013

Aha, good to know.

Maybe we should ask on the mailing list about merging that patch? Or find a way to do the detection in MediaFile instead?

@squisher

Ping, how come this patch is collecting dust? :-) I need something like this before I can attempt to actually put beets to use...

@sampsyo
Member
sampsyo commented Nov 18, 2013

There are a few outstanding issues awaiting a champion:

  • Add LAME preset detection. This is actually orthogonal and should maybe have its own ticket.
  • Add logic for short-circuiting the duplicate detection. This should be in core. In its most basic form, a config option makes the decision.
  • Either in core or in a plugin, develop a flexible/complex set of rules for making decisions automatically. This can come after the basic work is done.

Please feel free to chip in on any of the steps.

@daks
daks commented Dec 8, 2013

This issue (the original one on googlecode) resume exactly what I want. I'll track it.

@johtso
Collaborator
johtso commented Mar 5, 2014

This would be a very useful feature to have!

@Alderian

Hi,

This would be very nice, I'm doing this manually with 'duplicates' plugin using "beet dup -F -f '$path $bitrate'"... and then a manual rm... very simple, but painful

It would be useful to have the list of duplicates any time an import finds them. I mean, program will find a file already in library and have the newcomer. What if the program just detail the tags, that are different, of the two of them, this way I could choose wisely. This could be done the same way the program tells whenever its correcting tags. Ie:

/foo/bar/Megadeth/Rust in Peace/06 Lucretia.mp3
Tagging track: Megadeth - Lucretia
(Similarity: 100.0%)
This item is already in the library!
Candidates:

  1. Old (mp3, 128kbps)(encoder, bitrate)
  2. New (mp3, 196Kbps)(encoder, bitrate)
    1, 2 to keep only that file, **[S]**kip new, Keep all, Remove old?

I'm considering that I might want to have lots of versions of any file.

With that information, one can easily decide what to do... also, I guess that adding this would be easier, just to wait for the other options. I don't know how to program in python to add this functionality myself :( I might learn, fork and contribute :)

Just my 2 cents

Thanks for this great program...

@aristidesfl

I'm also interested in having V0 V2 320kb available to use in paths

@aristidesfl

Would it be possible to specify which identifiers are used to detect duplicates?
Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different.

@sampsyo
Member
sampsyo commented Nov 16, 2014

@aristidesfl Sounds like a reasonable extension, yes.

@untitaker untitaker referenced this issue in google-code-export/beets May 9, 2015
Open

Quality-based Trumping during import #341

@Schweinepriester
Contributor

push :)

@lazka
Contributor
lazka commented Aug 13, 2015

mutagen trunk (next version) will now expose lame version and bitrate mode. Feedback welcome.

See https://bitbucket.org/lazka/mutagen/issues/66

@sampsyo
Member
sampsyo commented Aug 13, 2015

Fantastic! Awesome work. I'll look into exposing this in beets.

@ekjaker
ekjaker commented Oct 9, 2015

Fantastic idea this addition to have this at import!

"Would it be possible to specify which identifiers are used to detect duplicates?
Just like %aunique identifiers, but for duplicate management, so that beet importdoesn't ask about duplicates if, for example, the formats are different."

+1 on that!!

And also, at import the option to set default actions would be very helpful, like always take new. Ofcourse, it this gets implemented to its full extent with a scoring system based on bitrate, availability of log/cue,... would be even better than a basic new/old option.

@Joshfindit

Sorry to complicate this, but I have something to add:

Joint stereo vs Separate stereo.
Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

@lazka
Contributor
lazka commented Nov 30, 2015

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

Why?

@smlx
smlx commented Nov 30, 2015

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

The mid-side stereo transform is lossless, and reduces redundant encoding of common information between channels.

@Joshfindit

I notice a difference.

On Nov 30, 2015, at 8:15 AM, smlx notifications@github.com wrote:

Personally, I'd much rather have a 192kbps separate stereo copy than a 320kbps joint stereo copy.

The mid-side stereo transform is lossless, and reduces redundant encoding of common information between channels.


Reply to this email directly or view it on GitHub.

@aristidesfl

@Joshfindit did you double blind test / abx?

@Joshfindit

Yes, many times over the years.

For those that are skeptical, I'll put these on the table:

  • at higher bitrates, I hear it much less. At 320, it might be one part of one song on an album, for example.
  • I always use headphones that go above 20k, and seek out clarity of sound (yes, I don't always use devices that output that high, but I do when I really want to enjoy the music)
  • I have hearing damage in the mid-vocal range. This may cause my brain to overcompensate with the higher frequencies. (Have not tested it, but it's a point for the skeptics, so there it is)
  • there is some social aspect to this as well; many tracks that I have had in the past have just been poorly ripped; the joint stereo is more of a canary in these cases

On Nov 30, 2015, at 1:31 PM, Aristides notifications@github.com wrote:

@Joshfindit did you double blind test / abx?


Reply to this email directly or view it on GitHub.

@m8ram
m8ram commented Feb 6, 2016

I'm experimenting with beets for the first time and noticed that when I import a directory containing both ogg vorbis and MP3 files it imports only the MP3 files (exactly the opposite of what I want).
Is there a configuration option to change this or do I need to wait for this feature to be implemented?

@lazka lazka referenced this issue in quodlibet/mutagen Jun 25, 2016
Closed

Lame VBR Preset #66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment