
DDP adding too much magic to hash containers? #75

Closed
Timbus opened this Issue Nov 18, 2015 · 11 comments

@Timbus

Timbus commented Nov 18, 2015

So, I have a strange issue using DDP in combination with JSON::XS. I'm not entirely sure who is at fault here, but I thought I'd have a better chance bugging you than MLehmann...

Here's the issue:

use feature 'say';
use DDP;
use JSON;

my %teh = map { $_ => 1 } qw(apple banana xylophone zephyr);
np %teh;    # merely passing the hash through DDP attaches magic to it
say JSON->new->canonical->encode(\%teh);

# Output:
{"banana":1,"xylophone":1,"apple":1,"zephyr":1}

Note the hash is unordered, even though canonical() should order the keys alphabetically. This problem does not occur if I don't pass the hash to DDP first, which means something in DDP is changing it (and Devel::Peek confirms that DDP adds a ton of magic to the hash).

Looking at JSON::XS source shows that it sorts differently based on SV magic, but I'm not enough of a wizard to know what it's actually doing. All I know is that someone is doing something wrong, somewhere.
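For contrast, here is what canonical() does on a plain, magic-free hash. This sketch uses JSON::PP (core since perl 5.14) rather than JSON::XS, purely so it runs without non-core deps; the documented canonical() behavior is the same:

```perl
use strict;
use warnings;
use JSON::PP;    # core module; canonical() sorts keys just like JSON::XS documents

# A plain hash that no module has touched: canonical() sorts the keys.
my %h = map { $_ => 1 } qw(apple banana xylophone zephyr);
print JSON::PP->new->canonical->encode(\%h), "\n";
# {"apple":1,"banana":1,"xylophone":1,"zephyr":1}
```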

@dur-randir


Contributor

dur-randir commented Nov 18, 2015

This is mostly JSON::XS's fault. Since version 2.25 it has given up sorting tied hashes, but it never implemented that check completely. Since you rarely encounter non-tied SvRMAGICAL hashes, this went unnoticed until now.

But it's still a strange decision on Data::Printer's part to use Hash::Util::FieldHash, which eats an enormous amount of memory for the task it's used for and also permanently slows down any access to the structures you pass to it.

@Timbus


Timbus commented Nov 19, 2015

Oh... That's not good to hear on both points.
So should this issue be raised with Marc Lehmann? Or would it be better to report it to the Cpanel fork of JSON::XS?

Regardless, I'm probably going to have to drop DDP after reading your second paragraph. I'm currently trying to use it to format data in log output, but I think I'd take a significant performance hit, as the code I'm working with already bottlenecks fairly heavily on logging.

@Timbus Timbus closed this Nov 19, 2015

@dur-randir


Contributor

dur-randir commented Nov 19, 2015

I'll provide a suitable patch to the Cpanel folks in a couple of days. As for Lehmann, you can try, but I'm quite skeptical about that)

As for the memory/performance hit - it only matters in really high-pressure applications. If you examine just some structures from time to time, it's OK (as we do at $work, and that's why I'm so paranoid about performance). If you just throw away what you've DDP'ed (like request vars), it's also OK. The only case where it really matters is when you constantly dump long-lived objects. For me, Data::Printer is nevertheless a tool of great convenience.

@Timbus


Timbus commented Nov 19, 2015

Sadly, the exact use case is examining pieces of a very large tree of database structures that sometimes get dumped to disk as JSON (hence this issue). And it's done in logs, not only for debugging -- I think this is the worst possible scenario for DDP :(
I'll definitely keep using it when I need to debug chunks of data, but not for logging.

Thank you very much for submitting a patch for JSON though, it's very appreciated :)

@garu


Owner

garu commented Jan 14, 2016

@Timbus thanks for reporting, and apologies for not saying anything back when you did.

Perl itself does not guarantee that the internal representation of a variable stays the same, and other people have bumped into a similar issue (read the comments below the article for a complete explanation). The gist of it is just like @dur-randir++ said: fieldhash() is used to store each object's unique id and thus avoid traversing circular structures forever, but that also changes some internals of the object itself - all (mostly) transparent to the developer, except when deep XS magic is involved.

Again, as @dur-randir said, this is mostly a JSON::XS issue (last I checked, mlehmann argued that it is actually a bug in Perl itself and refuses to patch his code to work around it).

Finally, regarding any noticeable slowdowns or bumps in memory usage: @dur-randir I am very open to a PR that replaces fieldhash() usage completely, if you (or anyone, for that matter) can do it - together with some sort of benchmark showing us the gains, and of course with some tests so we can make sure everything still runs smoothly :D

Thank you both again!
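For whoever picks this up, a minimal micro-benchmark sketch comparing the raw per-call cost of the two id schemes discussed in this thread (Scalar::Util's refaddr() vs Hash::Util::FieldHash's id()). This measures only id computation, not fieldhash's extra allocations, and the iteration count is arbitrary:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Scalar::Util qw(refaddr);
use Hash::Util::FieldHash qw(id);

my $ref = { a => 1 };

# Compare the cost of computing an object id both ways. This deliberately
# ignores fieldhash's backref/magic allocations; measuring those would need
# a memory profiler such as Devel::Size.
cmpthese(200_000, {
    refaddr   => sub { my $x = refaddr $ref },
    fieldhash => sub { my $x = id $ref },
});
```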

@dur-randir


Contributor

dur-randir commented Jan 26, 2016

@garu, it's great that you're concerned about the quality of your modules. I'll try to summarize what's wrong with fieldhash().

  • While it's true that the internal representation of a variable can change, perl itself almost exclusively looks at a variable's public flags to determine its contents. So most XS modules don't care about the internal representation, only about the public flags.
  • fieldhash() not only changes the internal representation, but also sets the SVs_RMG flag on the SV, thus pessimizing core codepaths related to hash/array access and confusing XS modules (as you've seen in Clone and JSON::XS). But setting this flag could easily be skipped - the attached magic vtable has no valid callbacks and is used only as private storage.
  • fieldhash()'s increased memory consumption comes not only from the magic marker attached to the variable; it also stores a weak backref - so after a call to p(), for each reference in the printed structure an additional 2*MAGIC + HE + 3*SV are allocated and hang around forever. That seems MAGIC + HE + 3*SV too much for me)
  • Even after you remove a variable from a fieldhash'd hash (or destroy that hash completely), the variable still retains all MAGIC pointers with their additional data, so you free just HE + SV.

I can write a custom module that will be much closer to a direct 'exists $p->{_seen}->{_object_id($item)}' equivalent, with two available trade-offs:

  • store the MAGIC pointer in the variable forever, but avoid a garbage-collection pass afterwards
  • clean the MAGIC pointer after the p() traversal, consuming a bit more cpu+memory in the process, but releasing it afterwards

References will still be upgraded to the PVMG type, but without any flags set or other observable effects.

But this has one problem - it will be an XS module, and DDP currently has only pure-perl or core deps. I can make it optional and provide a common layer over both it and fieldhash() usage, if you approve this path.

@dur-randir


Contributor

dur-randir commented Jan 26, 2016

Furthermore, the following code leaks 1 SV per iteration:

use Clone qw/clone/;
use Hash::Util::FieldHash qw/id/;

my $foo = {a=>1};
id $foo;

id(clone $foo) while 1;

@Timbus


Timbus commented Mar 22, 2016

Uh, I know this is pretty late and isn't related to my issue anymore, but I personally think @dur-randir's addition would be fantastic and I'd certainly be happy to use it, although I sadly can't speak on behalf of @garu about approving it or not.

I think I'll reopen the issue. At the very least the title is still relevant.

@Timbus Timbus reopened this Mar 22, 2016

@nrdvana


Contributor

nrdvana commented Apr 17, 2016

Usually when I walk a possibly recursive data structure I use a %seen hash of refaddr(). Is there a problem that FieldHash solves that refaddr can't?

@garu


Owner

garu commented Apr 22, 2018

Sorry for the long wait!

@nrdvana the problem with using refaddr is that Perl will reuse the addresses of destroyed objects:

use strict;
use warnings;
use Scalar::Util qw(refaddr);
{
    package Foo;

    sub new {
        my $class = shift;
        return bless {}, $class;
    }
}

for (1..10) {
    my $obj = Foo->new;
    print "Object's reference is " . refaddr($obj) . "\n";
}

$ perl poc.pl
Object's reference is 140320331281432
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472
Object's reference is 140320331312472

So if we want to make sure we are looking at the right object at all times, we need something like fieldhashes. That said, since the main usage of Data::Printer expects to forget about variables after they are printed, and variables are not supposed to go away while we are printing them (we hold extra references), we should be OK with refaddr, so I dropped fieldhashes altogether. Let's see how it goes!
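The %seen-of-refaddr approach can be sketched like this. Note that dump_structure() is a hypothetical helper written for illustration, not Data::Printer's actual traversal code; the %seen hash lives only for the duration of a single traversal, which is exactly why plain refaddr() is safe here:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr reftype);

# Walk a (possibly circular) structure, tracking each reference already
# visited by its refaddr() in %$seen - scoped to this one traversal.
sub dump_structure {
    my ($data, $seen) = @_;
    $seen ||= {};
    return '<scalar>' unless ref $data;

    my $addr = refaddr $data;
    return '<circular>' if $seen->{$addr}++;

    if (reftype($data) eq 'HASH') {
        return '{' . join(',',
            map { "$_=>" . dump_structure($data->{$_}, $seen) } sort keys %$data
        ) . '}';
    }
    if (reftype($data) eq 'ARRAY') {
        return '[' . join(',',
            map { dump_structure($_, $seen) } @$data
        ) . ']';
    }
    return '<ref>';
}

my $cycle = { a => 1 };
$cycle->{self} = $cycle;
print dump_structure($cycle), "\n";    # {a=><scalar>,self=><circular>}
```

No magic is attached to the dumped structure, so JSON::XS (and everything else) sees it untouched afterwards.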

@garu garu closed this Apr 22, 2018

@dur-randir


Contributor

dur-randir commented Apr 25, 2018

t/011-class.t fails on all perls except 5.18 for exactly this reason - ICanHazStringMethodTwo->new returns the same object address as ICanHazStringMethodOne->new. I see two options for the API:

  • reset $seen at the start of each Data::Printer::Object->parse() call (I like this one more)
  • say that if you want to parse unrelated objects, you should construct separate Data::Printer::Object instances (and change the tests)
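The address-reuse hazard behind this test failure can be shown in isolation. first_visit() below is a hypothetical stand-in for a persistent %seen check, written for illustration only; whether two short-lived objects actually collide on one address is allocation-dependent, so the collision itself is printed rather than asserted:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# A %seen hash that outlives the objects it tracks - the failure mode
# dur-randir describes: a freed address can be handed to a new object,
# which then looks "already seen".
my %seen;

sub first_visit {
    my $ref = shift;
    return !$seen{ refaddr $ref }++;
}

my $addr1 = do { my $x = {}; first_visit($x); refaddr $x };   # $x destroyed here
my $addr2 = do { my $y = {}; refaddr $y };                    # may land on the same slot

print $addr1 == $addr2
    ? "address reused: a brand-new object would look 'seen'\n"
    : "no reuse this time (allocation-dependent)\n";
```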