Use code region boundaries for ParseAPI::Function::less comparator #1668

batuzovk · 2024-01-11T11:22:16Z

Function::region() returns a pointer to CodeRegion class. Comparing pointers directly with 'less' has a lot of issues, ranging from nondeterministic order of functions in the resulting array to it not always having a well defined behavior according to standards.

Here I tried to recreate what I believe was the intended implementation of the predicate: sort based on region address first, function address second.

hainest · 2024-01-12T15:18:47Z

@kupsch In the (extremely) degenerate case that an archive has two .o files with the same function definition that the linker hasn't thrown away yet (e.g., an ODR violation), do we care that this changes the sort for those?

parseAPI/h/CFG.h

bigtrak · 2024-01-12T15:39:18Z

The way I read ODR ...

That is not an ODR violation -- yet -- the c++ linker removes the duplicates.
ODR is per translation unit, not globally. So multiple .o in an archive with identical
symbols are not a violation. During linking when the linker identifies
all the identical "sets of tokens" and uses just one is the resolution point. There isn't a violation .. yet
it is valid. If two non-identical things exist, the linker should create the violation
at that point. If it comes from the linker or later then ODR should apply. At the .o and
archive member level ODR doesn't yet apply across files / translation units.

hainest · 2024-01-12T15:50:40Z

@bigtrak It applies between translation units in a with some added caveats. ODR isn't the real issue I was pointing to, but it was a quick example I could think of. My question would apply to a C program.

n4950: C++23 draft (2023-05-10)

6.3 One-definition rule [basic.def.odr]

...

14 For any definable item D with definitions in multiple translation units,

  (14.1)— if D is a non-inline non-templated function or variable, or
  (14.2)— if the definitions in different translation units do not satisfy the following requirements,

the program is ill-formed;

batuzovk · 2024-01-12T16:11:49Z

Excuse me if I'll say something stupid, but I don't follow how functions being 'same' from high-level language point of view applies here. Comparator treats them as just addresses inside regions. Before this change functions within the same region were sorted by address, functions from different regions were sorted based on where CodeRegion class landed in host memory (i.e. randomly from the user perspective). Now functions within the same region are still sorted by their addresses, and function from different regions are sorted based on their respective region boundaries. What was working reliably still works the same way, what was not reliable before now is.

hainest · 2024-01-12T16:22:32Z

Excuse me if I'll say something stupid, but I don't follow how functions being 'same' from high-level language point of view applies here. Comparator treats them as just addresses inside regions. Before this change functions within the same region were sorted by address, functions from different regions were sorted based on where CodeRegion class landed in host memory (i.e. randomly from the user perspective). Now functions within the same region are still sorted by their addresses, and function from different regions are sorted based on their respective region boundaries. What was working reliably still works the same way, what was not reliable before now is.

Dyninst can process object files (on Linux '.o') both in isolation and as part of an archive file ('.a' on Linux). Since these binaries have not been passed through the linker, they all possibly start at the same base offset. If a function were defined in two .o files, then it's possible to have two separate ParseAPI::Functions representing the same function[1]. This would be a degenerate case that is very unlikely to happen, but the old sort by pointer comparison could have returned a different result than the one now. That said, I think the updated comparison is more correct.

[1] Actually, it could be any two functions that reside at the same offset.

batuzovk · 2024-01-12T16:28:30Z

the old sort by pointer comparison could have returned a different result than the one now

The biggest problem with the old sort was that it returned different results from run to run (because memory was allocated differently). Basically this is how the issue got spotted in the first place. So the code that relied on the order of these functions wouldn't have worked in the first place.

kupsch · 2024-01-12T16:30:26Z

For statically linked archives (.a files) all the member object files have an offset of 0. So functions in different object files could have the same offset without a region there is ambiguity.

batuzovk · 2024-01-12T16:48:35Z

For statically linked archives (.a files) all the member object files have an offset of 0. So functions in different object files could have the same offset without a region there is ambiguity.

So you are saying that there can be a case where region boundaries are the same, addresses are the same, but the functions are different? From the get go I don't see any reasonable sorting order for those that can be easily implemented.

Sorting by pointers groups functions for the same region together (but the order of the regions will still change from run to run), but it makes situation very confusing for ELFs with multiple code sections.

kupsch · 2024-01-12T18:00:46Z

WIthout a region, then the offset of function in different object files may have the same offset. The pair (region, offset) is unique for every function, but two functions without a region may have the same offset and therefore (null, offset) may map to more than one location/block/function. If all the code is in the same ELF file then (null, offset) is unique, but if you have multiple ELF files loaded then (null, offset) is not guaranteed to be unique.

bigtrak · 2024-01-12T18:36:02Z

At the C / assembler level Tim wants to ensure it works OK with ...

The replicated symbols in multiple archive members are completely legal.
That is because if a object module is selected from a library archive due
to a dependency, which selects only that archive member .o will be added to the linkage.

However, if two archive members, which have competing symbols are requested
via unique symbol dependency, then that is when the final linkage will detect symbol
duplication between the object files in the archive, and the link fails.

The offsets of the duplicate symbols in each archive member can differ, and they
are unique to that .o in the archive. No conflict until archive members with conflicting
symbols are requested via symbol lookup. You can force a conflict by just linking
object files with conflicting symbols, as inclusion of those object files then become mandatory, unlike
the selection of libraries, which only pull out object files with requested symbols.

As far as dyninst goes, everything is just symbols in a object file, which may or may not be
a library member. As long as each object file is tagged uniquely in dyninst, there should not
be any conflict between them. If a symbol that is multiple library members is requested
by a dyninst API use, then there is an issue, or potential issue depending upon linkage vs
analysis. As long as dyninst follows the normal ordering, there isn't a conflict until one is forced.

Perhaps we need a better / different way of identifying regions to resolve this issue.

batuzovk · 2024-01-12T19:26:12Z

Suggestion:

If f1_region == f2_region then order by addr(). This also handles the case where both regions are null.
Null region is less than any non-null region. No null regions in later cases possible.
Order by CodeRegion::low()
Order by CodeRegion::high()
Order by numeric value of the pointer.

The expected result:

For normal ELF functions will be sorted by their address.
For archives they will be grouped by object files, object files will be sorted from smallest to largest.
Object files of the same size would always compare the same within one run (i.e. any data structure should work), but might compare differently on different runs (i.e. there will be some nondeterminism in this corner case).

hainest · 2024-01-12T22:18:11Z

Allowing for non-existent regions is more complicated that I had originally considered. Here is my suggestion.

struct less {
  /*  
   *  In general, functions are not orderable because code can be shared
   *  in overlapping regions (e.g., shared blocks) or distributed
   *  among different regions (e.g., cold sections).
   * 
   *  This ordering satisfies the strict weak ordering required by the
   *  C++ Compare concept, so Function objects can be stored in
   *  standard containers like std::set.
   * 
   *  It is assumed that the functions being compared come from the
   *  same binary file (e.g., .so or exe). Creating a meaningful
   *  comparison of functions from different binary files encounters
   *  too many ambiguities.
   */

  bool operator()(const Function * f1, const Function * f2) const {
    auto *r1 = f1->region();
    auto *r2 = f2->region();

    if(r1 && r2) {
      // In the same region
      if(r1->low() == r2->low()) {
        // Not at the same entry point
        if(f1->addr() != f2->addr())
          return f1->addr() < f2->addr();
        }
        /*
         * Two functions in the same region at the same entry
         * point is nonsense, but we account for it to provide
         * a strict weak ordering- just in case.
        */
        return f1 < f2;
      }
      
      // Functions in different regions are ordered by the offset of
      // their containing region.
      return r1->low() < r2->low();
    }
    
    /* It may be possible that a function does not have a region.
     * This is likely a coding error, but we account for it.
    */

    // Functions from an empty region come first
    if(!r1 &&  r2) return true;
    if( r1 && !r2) return false;
    
    // If neither has a region, then ordering is arbitrary.
    // This satisfies the reflexive requirement of the Compare concept.
    return f1 < f2;    
  }
};

batuzovk · 2024-01-13T09:23:39Z

 // In the same region
 if(r1->low() == r2->low()) {

Unfortunately having the same low() doesn't mean it is the same region (hence this entire discussion).

I've turned my own suggestion into code

    struct less
    {
        /**
         * If there are more than one guest binary file loaded, multiple
         * functions may have the same entry point address in different
         * code regions. And regions themselves may use the same
         * address ranges.
         *
         * We order functions by their regions first, by their addresses second.
         *
         * We order regions by their start first, by their end second,
         * by the numeric value of their pointers third. We consider NULL
         * to be less than any non-NULL region.
         *
         * For typical shared libraries and executables this should order
         * functions by their address. For static libraries it should group
         * functions by their object files and order object files by their
         * size.
         */
        bool operator()(const Function * f1, const Function * f2) const
        {
            CodeRegion *f1_region = f1->region();
            CodeRegion *f2_region = f2->region();

            /* Same region or both regions are NULL => order by addr() */
            if (f1_region == f2_region) {
                if (f1->addr() < f2->addr()) return true;
                if (f1->addr() > f2->addr()) return false;
                if (f1 == f2) return false;
                /* Same region, same address, different functions.
                   Realistically we should never get here, but just in case... */
                return uintptr_t(f1) < uintptr_t(f2);
            }

            /* Only one region can be NULL by this point */
            if (!f1_region) return true;
            if (!f2_region) return false;

            if (f1_region->low() < f2_region->low()) return true;
            if (f1_region->low() > f2_region->low()) return false;

            /**
             * Multiple regions with the same low() can happen when processing
             * static libraries (i.e. archives of object files)
             * Try ordering by high()
             */
            if (f1_region->high() < f2_region->high()) return true;
            if (f1_region->high() > f2_region->high()) return false;

            /**
             * Corner case: different regions with the same address and the same size,
             * probably from different files. Order by numeric value of their pointers.
             */
            return uintptr_t(f1_region) < uintptr_t(f2_region);
        }
    };

I believe the code above correctly handles all cases with non-existing regions, gives correct ordering to be used with standard data structures, and is mostly deterministic across multiple runs (except for the last return).

If this looks good from the algorithm point of view, I'll update the code under review.

kupsch · 2024-01-17T16:45:03Z

@batuzovk this look good, you can update the code. One addition to make it slightly more deterministic would be to add the following before the final region pointer comparison

        if (f1->addr() < f2->addr()) return true;
        if (f1->addr() > f2->addr()) return false;

Then the only non-determinism for two functions ordering is if their regions are different, but there region's low and high and the function's addr are identical.

hainest · 2024-01-17T17:32:52Z

 // In the same region
 if(r1->low() == r2->low()) {
Unfortunately having the same low() doesn't mean it is the same region (hence this entire discussion).

That comment was too short. It should really be "Regions are at same offset". When comparing two regions in the same file, they can be differentiated by low. If they are from two different files with relocatable code, then they might not. I noted in the code comments that I explicitly ignored this case because of the ambiguities.

I've turned my own suggestion into code

    struct less
    {
        /**
         * If there are more than one guest binary file loaded, multiple
         * functions may have the same entry point address in different
         * code regions. And regions themselves may use the same
         * address ranges.
         *
         * We order functions by their regions first, by their addresses second.
         *
         * We order regions by their start first, by their end second,
         * by the numeric value of their pointers third. We consider NULL
         * to be less than any non-NULL region.
         *
         * For typical shared libraries and executables this should order
         * functions by their address. For static libraries it should group
         * functions by their object files and order object files by their
         * size.
         */
        bool operator()(const Function * f1, const Function * f2) const
        {
            CodeRegion *f1_region = f1->region();
            CodeRegion *f2_region = f2->region();

            /* Same region or both regions are NULL => order by addr() */
            if (f1_region == f2_region) {

Could you add a note here that the pointer comparison is intentional? Just to keep future readers from thinking it was a mistake since it doesn't compare regions directly.

            if (f1->addr() < f2->addr()) return true;
            if (f1->addr() > f2->addr()) return false;
            if (f1 == f2) return false;
            /* Same region, same address, different functions.
               Realistically we should never get here, but just in case... */
            return uintptr_t(f1) < uintptr_t(f2);
        }

        /* Only one region can be NULL by this point */
        if (!f1_region) return true;
        if (!f2_region) return false;

        if (f1_region->low() < f2_region->low()) return true;
        if (f1_region->low() > f2_region->low()) return false;

        /**
         * Multiple regions with the same low() can happen when processing
         * static libraries (i.e. archives of object files)

Two functions from different binaries with relocatable code (.o, DSO, PIE, ...) can possibly have the same low(), high(), and addr().

         * Try ordering by high()
         */
        if (f1_region->high() < f2_region->high()) return true;
        if (f1_region->high() > f2_region->high()) return false;

        /**
         * Corner case: different regions with the same address and the same size,
         * probably from different files. Order by numeric value of their pointers.
         */
        return uintptr_t(f1_region) < uintptr_t(f2_region);
    }
};


I believe the code above correctly handles all cases with non-existing regions, gives correct ordering to be used with standard data structures, and is mostly deterministic across multiple runs (except for the last return).

If this looks good from the algorithm point of view, I'll update the code under review.

This includes more cases than what I proposed and is closer to what SymtabAPI::Function::operator== does, so I would go with this one. +1

batuzovk · 2024-01-18T11:52:16Z

I've updated the code and adjusted comments slightly.

I've run SortChecker++ on this comparator, and it didn't find any problems. The caveat is that dynamic checkers are only as good as your input data, and my input wasn't robust at all.

One addition to make it slightly more deterministic would be to add the following before the final region pointer comparison
   if (f1->addr() < f2->addr()) return true;
   if (f1->addr() > f2->addr()) return false;

I've passed on this suggestion for now. It doesn't eliminate possible non-determinism completely. Without it functions are always grouped by region in the results. With this change they will be grouped, but with one rare exception where functions from regions with the same boundaries get interleaved. I think making "intuitive sense" in common cases is more convenient for users then slightly less determinism in the case of multiple binary files (where there is no presumed order on functions to begin with).

With current implementation there is a short and concise description of the resulting order for common use cases that doesn't involve any exceptions or going into details of how functions are represented in Dyninst

For typical shared libraries and executables this should order functions by their address. For static libraries it should group functions by their object files and order object files by their size.

With the proposed suggestion there will be an exception.

But I don't insist.

batuzovk · 2024-02-01T11:06:12Z

Any more comments on this?

kupsch · 2024-02-02T19:34:15Z

Sorry for the delay, other tasks were taking my time. This looks good, I'll merge it shortly after running it through our testsuite.

review is on outdated code

hainest changed the title ~~Bugfix to function sorting predicate~~ Use code region boundaries for ParseAPI::Function::less comparator Jan 12, 2024

hainest added enhancement parseAPI This issue is directly related to parseAPI user-reported This issue was reported by someone outside of the core Dyninst developers labels Jan 12, 2024

hainest previously requested changes Jan 12, 2024

View reviewed changes

parseAPI/h/CFG.h Outdated Show resolved Hide resolved

batuzovk force-pushed the func_sort branch from 6479f97 to e1536ef Compare January 12, 2024 15:51

batuzovk force-pushed the func_sort branch from e1536ef to 57fd4b8 Compare January 18, 2024 11:29

Use code region boundaries for ParseAPI::Function::less comparator

062234f

batuzovk force-pushed the func_sort branch from 57fd4b8 to 062234f Compare January 19, 2024 09:59

kupsch self-requested a review February 5, 2024 16:57

kupsch approved these changes Feb 5, 2024

View reviewed changes

kupsch merged commit e76e036 into dyninst:master Feb 5, 2024
13 checks passed

hainest mentioned this pull request Feb 27, 2024

Inconsistencies in data structures produced by Dyninst #1675

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use code region boundaries for ParseAPI::Function::less comparator #1668

Use code region boundaries for ParseAPI::Function::less comparator #1668

batuzovk commented Jan 11, 2024

hainest commented Jan 12, 2024

bigtrak commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 12, 2024

kupsch commented Jan 12, 2024

batuzovk commented Jan 12, 2024

kupsch commented Jan 12, 2024

bigtrak commented Jan 12, 2024

batuzovk commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 13, 2024

kupsch commented Jan 17, 2024

hainest commented Jan 17, 2024

batuzovk commented Jan 18, 2024 •

edited

batuzovk commented Feb 1, 2024

kupsch commented Feb 2, 2024

Use code region boundaries for ParseAPI::Function::less comparator #1668

Use code region boundaries for ParseAPI::Function::less comparator #1668

Conversation

batuzovk commented Jan 11, 2024

hainest commented Jan 12, 2024

bigtrak commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 12, 2024

kupsch commented Jan 12, 2024

batuzovk commented Jan 12, 2024

kupsch commented Jan 12, 2024

bigtrak commented Jan 12, 2024

batuzovk commented Jan 12, 2024

hainest commented Jan 12, 2024

batuzovk commented Jan 13, 2024

kupsch commented Jan 17, 2024

hainest commented Jan 17, 2024

batuzovk commented Jan 18, 2024 • edited

batuzovk commented Feb 1, 2024

kupsch commented Feb 2, 2024

batuzovk commented Jan 18, 2024 •

edited