New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use code region boundaries for ParseAPI::Function::less comparator #1668
Conversation
@kupsch In the (extremely) degenerate case that an archive has two .o files with the same function definition that the linker hasn't thrown away yet (e.g., an ODR violation), do we care that this changes the sort for those? |
The way I read ODR ... That is not an ODR violation -- yet -- the c++ linker removes the duplicates. |
@bigtrak It applies between translation units in a with some added caveats. ODR isn't the real issue I was pointing to, but it was a quick example I could think of. My question would apply to a C program. n4950: C++23 draft (2023-05-10)
|
Excuse me if I'll say something stupid, but I don't follow how functions being 'same' from high-level language point of view applies here. Comparator treats them as just addresses inside regions. Before this change functions within the same region were sorted by address, functions from different regions were sorted based on where CodeRegion class landed in host memory (i.e. randomly from the user perspective). Now functions within the same region are still sorted by their addresses, and function from different regions are sorted based on their respective region boundaries. What was working reliably still works the same way, what was not reliable before now is. |
Dyninst can process object files (on Linux '.o') both in isolation and as part of an archive file ('.a' on Linux). Since these binaries have not been passed through the linker, they all possibly start at the same base offset. If a function were defined in two .o files, then it's possible to have two separate [1] Actually, it could be any two functions that reside at the same offset. |
The biggest problem with the old sort was that it returned different results from run to run (because memory was allocated differently). Basically this is how the issue got spotted in the first place. So the code that relied on the order of these functions wouldn't have worked in the first place. |
For statically linked archives (.a files) all the member object files have an offset of 0. So functions in different object files could have the same offset without a region there is ambiguity. |
So you are saying that there can be a case where region boundaries are the same, addresses are the same, but the functions are different? From the get go I don't see any reasonable sorting order for those that can be easily implemented. Sorting by pointers groups functions for the same region together (but the order of the regions will still change from run to run), but it makes situation very confusing for ELFs with multiple code sections. |
WIthout a region, then the offset of function in different object files may have the same offset. The pair (region, offset) is unique for every function, but two functions without a region may have the same offset and therefore (null, offset) may map to more than one location/block/function. If all the code is in the same ELF file then (null, offset) is unique, but if you have multiple ELF files loaded then (null, offset) is not guaranteed to be unique. |
At the C / assembler level Tim wants to ensure it works OK with ... The replicated symbols in multiple archive members are completely legal. However, if two archive members, which have competing symbols are requested The offsets of the duplicate symbols in each archive member can differ, and they As far as dyninst goes, everything is just symbols in a object file, which may or may not be Perhaps we need a better / different way of identifying regions to resolve this issue. |
Suggestion:
The expected result:
|
Allowing for non-existent regions is more complicated that I had originally considered. Here is my suggestion. struct less {
/*
* In general, functions are not orderable because code can be shared
* in overlapping regions (e.g., shared blocks) or distributed
* among different regions (e.g., cold sections).
*
* This ordering satisfies the strict weak ordering required by the
* C++ Compare concept, so Function objects can be stored in
* standard containers like std::set.
*
* It is assumed that the functions being compared come from the
* same binary file (e.g., .so or exe). Creating a meaningful
* comparison of functions from different binary files encounters
* too many ambiguities.
*/
bool operator()(const Function * f1, const Function * f2) const {
auto *r1 = f1->region();
auto *r2 = f2->region();
if(r1 && r2) {
// In the same region
if(r1->low() == r2->low()) {
// Not at the same entry point
if(f1->addr() != f2->addr())
return f1->addr() < f2->addr();
}
/*
* Two functions in the same region at the same entry
* point is nonsense, but we account for it to provide
* a strict weak ordering- just in case.
*/
return f1 < f2;
}
// Functions in different regions are ordered by the offset of
// their containing region.
return r1->low() < r2->low();
}
/* It may be possible that a function does not have a region.
* This is likely a coding error, but we account for it.
*/
// Functions from an empty region come first
if(!r1 && r2) return true;
if( r1 && !r2) return false;
// If neither has a region, then ordering is arbitrary.
// This satisfies the reflexive requirement of the Compare concept.
return f1 < f2;
}
};
|
Unfortunately having the same low() doesn't mean it is the same region (hence this entire discussion). I've turned my own suggestion into code struct less
{
/**
* If there are more than one guest binary file loaded, multiple
* functions may have the same entry point address in different
* code regions. And regions themselves may use the same
* address ranges.
*
* We order functions by their regions first, by their addresses second.
*
* We order regions by their start first, by their end second,
* by the numeric value of their pointers third. We consider NULL
* to be less than any non-NULL region.
*
* For typical shared libraries and executables this should order
* functions by their address. For static libraries it should group
* functions by their object files and order object files by their
* size.
*/
bool operator()(const Function * f1, const Function * f2) const
{
CodeRegion *f1_region = f1->region();
CodeRegion *f2_region = f2->region();
/* Same region or both regions are NULL => order by addr() */
if (f1_region == f2_region) {
if (f1->addr() < f2->addr()) return true;
if (f1->addr() > f2->addr()) return false;
if (f1 == f2) return false;
/* Same region, same address, different functions.
Realistically we should never get here, but just in case... */
return uintptr_t(f1) < uintptr_t(f2);
}
/* Only one region can be NULL by this point */
if (!f1_region) return true;
if (!f2_region) return false;
if (f1_region->low() < f2_region->low()) return true;
if (f1_region->low() > f2_region->low()) return false;
/**
* Multiple regions with the same low() can happen when processing
* static libraries (i.e. archives of object files)
* Try ordering by high()
*/
if (f1_region->high() < f2_region->high()) return true;
if (f1_region->high() > f2_region->high()) return false;
/**
* Corner case: different regions with the same address and the same size,
* probably from different files. Order by numeric value of their pointers.
*/
return uintptr_t(f1_region) < uintptr_t(f2_region);
}
}; I believe the code above correctly handles all cases with non-existing regions, gives correct ordering to be used with standard data structures, and is mostly deterministic across multiple runs (except for the last return). If this looks good from the algorithm point of view, I'll update the code under review. |
@batuzovk this look good, you can update the code. One addition to make it slightly more deterministic would be to add the following before the final region pointer comparison if (f1->addr() < f2->addr()) return true;
if (f1->addr() > f2->addr()) return false; Then the only non-determinism for two functions ordering is if their regions are different, but there region's low and high and the function's addr are identical. |
That comment was too short. It should really be "Regions are at same offset". When comparing two regions in the same file, they can be differentiated by
Could you add a note here that the pointer comparison is intentional? Just to keep future readers from thinking it was a mistake since it doesn't compare regions directly.
Two functions from different binaries with relocatable code (.o, DSO, PIE, ...) can possibly have the same low(), high(), and addr().
This includes more cases than what I proposed and is closer to what SymtabAPI::Function::operator== does, so I would go with this one. +1 |
I've updated the code and adjusted comments slightly. I've run SortChecker++ on this comparator, and it didn't find any problems. The caveat is that dynamic checkers are only as good as your input data, and my input wasn't robust at all.
I've passed on this suggestion for now. It doesn't eliminate possible non-determinism completely. Without it functions are always grouped by region in the results. With this change they will be grouped, but with one rare exception where functions from regions with the same boundaries get interleaved. I think making "intuitive sense" in common cases is more convenient for users then slightly less determinism in the case of multiple binary files (where there is no presumed order on functions to begin with). With current implementation there is a short and concise description of the resulting order for common use cases that doesn't involve any exceptions or going into details of how functions are represented in Dyninst
With the proposed suggestion there will be an exception. But I don't insist. |
Any more comments on this? |
Sorry for the delay, other tasks were taking my time. This looks good, I'll merge it shortly after running it through our testsuite. |
Function::region() returns a pointer to CodeRegion class. Comparing pointers directly with 'less' has a lot of issues, ranging from nondeterministic order of functions in the resulting array to it not always having a well defined behavior according to standards.
Here I tried to recreate what I believe was the intended implementation of the predicate: sort based on region address first, function address second.