
Add Linux perf JIT support (/tmp/perf-$pid.map) #1432

Merged
merged 3 commits into dolphin-emu:master on Dec 28, 2014

Conversation

@randomstuff (Contributor)

'perf' is the standard built-in tool for performance analysis on recent
Linux kernels. Its source code is shipped within the kernel repository.

'perf' has basic support for JIT. For each process, it can read a file
named /tmp/perf-$PID.map. This file contains mappings from address
ranges to function names in the format:

41187e2a 1a JIT_PPC_b804a33fc

with the following entries:

  1. address of the start of the range (hexadecimal);
  2. size of the range (hexadecimal);
  3. name of the function.

Currently, we supply the address of the basic block instead of a
function name.
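The map-file format above can be illustrated with a short sketch (helper names are hypothetical, this is not Dolphin's actual code) that formats and appends such entries:

```cpp
// Sketch only: produce /tmp/perf-<pid>.map entries in the format perf expects:
//   <start-addr-hex> <size-hex> <symbol-name>
#include <cstdint>
#include <cstdio>
#include <string>
#include <unistd.h>  // getpid()

// Hypothetical helper: build one perf map line.
std::string PerfMapLine(uintptr_t start, uint32_t size, const std::string& name)
{
    char buf[128];
    std::snprintf(buf, sizeof(buf), "%llx %x %s\n",
                  (unsigned long long)start, size, name.c_str());
    return buf;
}

// Hypothetical helper: append one entry to the file perf looks for.
void AppendPerfMapEntry(uintptr_t start, uint32_t size, const std::string& name)
{
    std::string path = "/tmp/perf-" + std::to_string(getpid()) + ".map";
    if (std::FILE* f = std::fopen(path.c_str(), "a"))
    {
        std::fputs(PerfMapLine(start, size, name).c_str(), f);
        std::fclose(f);
    }
}
```

For example, AppendPerfMapEntry(0x41187e2a, 0x1a, "JIT_PPC_b804a33fc") would emit exactly the sample line shown above.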

Usage:

DOLPHIN_PERF=1 dolphin-emu &
perf record -F99 -p $(pgrep dolphin-emu) --call-graph dwarf
perf script | stackcollapse-perf.pl | c++filt | flamegraph.pl > profile.svg

Issues:

  • perf does not have support for region invalidation. It reads the
    file during postprocessing. It does not work very well if a JIT
    region is reused for another basic block. Can we avoid this?

To be fixed:

  • Add a GUI option.

Long term evolutions:

  • add the function name as well in the file;
  • add support for ARM JIT;
  • add support for DSP JIT as well;
  • stack unwinding for JIT functions.

Generated FlameGraph:

(Here we are sampling on time but it is possible to sample cache-misses …)


#ifdef USE_LINUX_PERF
if (getenv("DOLPHIN_PERF")) {
char filename[500];


@phire (Member) commented Oct 28, 2014

Looks good. I really like the chart it generates.

Currently, we supply the address of the basic block instead of a
function name.

A function name is possibly less useful. We probably care more about which basic block is profiling badly than which PowerPC function is profiling badly.

perf does not have support for region invalidation. It reads the file in preprocessing. It does not work very well if a JIT region is reused for another basic block. Can we avoid this?

The only way we can avoid this is by avoiding using address space twice.
We could create a custom allocator which will return memory at a unique address each time it is called. Unfortunately due to the < 4gb limitation, we will still eventually run out of address space. But it's probably enough space to get decently sized profiles before it fails.

In the future we may be able to lift the < 4gb limitation which will allow us to dedicate a huge chunk of the 64bit address space just for jitted code.

Another solution might be to hash the basic block before compiling it. If the code and the JIT parameters are the same, the exact same code will be generated as before, so we can compile it again at the same address and avoid wasting address space (or even keep a cached version around).
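The never-reuse idea above can be sketched as a bump allocator that only models the address accounting (class and method names are hypothetical; real code would mmap executable pages):

```cpp
// Sketch of the idea above: hand out each code region at a fresh address
// and never recycle, so stale /tmp/perf-<pid>.map entries can't be
// re-bound to new blocks. Only the address bookkeeping is modeled here.
#include <cstdint>

class NonReusingCodeSpace
{
public:
    NonReusingCodeSpace(uintptr_t base, uintptr_t size)
        : m_next(base), m_end(base + size) {}

    // Returns a unique, never-reused address, or 0 once the space is
    // exhausted (the < 4 GB limitation mentioned above).
    uintptr_t Alloc(uintptr_t bytes)
    {
        if (m_end - m_next < bytes)
            return 0;
        uintptr_t addr = m_next;
        m_next += bytes;  // never rewinds, so addresses are never reused
        return addr;
    }

private:
    uintptr_t m_next;  // next free address
    uintptr_t m_end;   // one past the end of the reserved region
};
```

The trade-off is exactly as described: profiles stay accurate until the reserved region runs out.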

@phire (Member) commented Oct 28, 2014

@dolphin-emu-bot rebuild

@comex (Contributor) commented Oct 28, 2014

Just FYI, I'm interested in writing a Python script or something to convert this data into DWARF info (eww, I know) for use in Instruments etc. However, for this purpose I'd also like to emit a .s file for the JITted blocks and output "line number" info that maps native instructions to PPC instructions in the .s file. I can do this myself if I ever get around to it, but just mentioning it in case it interests anyone.

Jit64() : code_buffer(32000)
{
#ifdef USE_LINUX_PERF
perf_map_file = NULL;


@randomstuff (Contributor, Author)

@comex: Could you elaborate on the DWARF thing? You would like to generate DWARF info about the block ranges? How do you expect to use it?

I thought about generating DWARF info in order to use this debugger JIT interface: this way GDB/LLDB would be JIT aware as well. We might even (?) be able to add stack unwinding DWARF info in order to be able to unwind the stack.

One possible application would be to use GDB to sample the stack instead of perf, with support for unwinding the stack past JITed code.
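For reference, the debugger JIT interface mentioned here is GDB's documented JIT Compilation Interface: the JIT builds an in-memory symbol file (e.g. DWARF in an ELF container), links it into a list, and calls a magic function GDB sets a breakpoint on. The struct layouts and the __jit_debug_register_code hook are fixed by GDB; the RegisterSymfile helper is a hypothetical sketch:

```cpp
#include <cstdint>

extern "C" {

typedef enum { JIT_NOACTION = 0, JIT_REGISTER_FN, JIT_UNREGISTER_FN } jit_actions_t;

struct jit_code_entry
{
    jit_code_entry* next_entry;
    jit_code_entry* prev_entry;
    const char* symfile_addr;  // in-memory object file with debug info
    uint64_t symfile_size;
};

struct jit_descriptor
{
    uint32_t version;      // must be 1
    uint32_t action_flag;  // a jit_actions_t value
    jit_code_entry* relevant_entry;
    jit_code_entry* first_entry;
};

// GDB puts a breakpoint here; must not be inlined or optimized away.
void __attribute__((noinline)) __jit_debug_register_code() { __asm__ __volatile__(""); }

jit_descriptor __jit_debug_descriptor = { 1, JIT_NOACTION, nullptr, nullptr };

}  // extern "C"

// Hypothetical helper: announce one generated symbol file to the debugger.
void RegisterSymfile(jit_code_entry* entry, const char* symfile, uint64_t size)
{
    entry->symfile_addr = symfile;
    entry->symfile_size = size;
    entry->prev_entry = nullptr;
    entry->next_entry = __jit_debug_descriptor.first_entry;
    if (entry->next_entry)
        entry->next_entry->prev_entry = entry;
    __jit_debug_descriptor.first_entry = entry;
    __jit_debug_descriptor.relevant_entry = entry;
    __jit_debug_descriptor.action_flag = JIT_REGISTER_FN;
    __jit_debug_register_code();  // GDB inspects the descriptor here
}
```

With this in place a debugger attached to the process can resolve JITed symbols live, which is what would make GDB/LLDB JIT-aware.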

// This is very slow, can I do this efficiently?
// Symbol* symbol = g_symbolDB.GetSymbolFromAddr(b->originalAddress);
std::fprintf(perf_map_file, "%llx %x JIT_PPC_b%x\n",
(long long int) b->normalEntry,


@comex (Contributor) commented Oct 28, 2014

@randomstuff Hmm, I suppose being able to debug the code live would be an advantage of generating it directly in Dolphin rather than some script. My plan, however, was to generate basic info about function spans and line info, and dump it into a Mach-O file to load after the fact into Instruments.app, which does not support any nice JIT-specific API.

I doubt using GDB as a profiler would offer anything other than significantly increased overhead compared to Dolphin's built-in profiler. Though it would be cool to be able to break into JIT code and randomly see the PPC stack... :p

@randomstuff (Contributor, Author)

When using a high sampling rate, a GDB based profiler will not be very efficient. When using a low sampling rate, a GDB-based profiler might have lower overhead than the builtin profiler (which needs to instrument each basic block entry and exit).

@randomstuff (Contributor, Author)

@comex: I might be interested in the DWARF generation thing. (I already wrote some DWARF consuming code.)

@randomstuff (Contributor, Author)

I should probably move this into JitCache.cpp alongside its USE_OPROFILE and USE_VTUNE friends (in FinalizeBlock). It would be shared with the ARM JIT and JITIL.

if (perf_map_file.is_open())
{
perf_map_file << StringFromFormat(
"%" PRIx64 " %" PRIx32 " EmuCode_%" PRIx32 "\n",


@comex (Contributor) commented Oct 29, 2014

Do we really need a compile time option at all?

@degasus (Member) commented Oct 29, 2014

@randomstuff Is it possible to group by PPC instruction type instead of by PPC address? Is it fine to just reuse the same name? Configurable, of course :) I think it would also be nice to see how much time is spent in e.g. floating-point operations.

I'm still surprised by how little code this PR is, good job.

@randomstuff (Contributor, Author)

@degasus: What do you mean by "group by PPC instruction"? You would like to see how much time is spent in each type of PowerPC opcode? We'd have to generate target IP ranges mapping to each individual PowerPC instruction. The main difficulty is that it would generate a lot of ranges. I'm not sure it would scale so well at this level of granularity, but it might be worth trying. It might be more scalable if we can ask Dolphin at runtime to only generate ranges for the categories of opcodes we are interested in ("only generate ranges for FP instructions"). Is that what you have in mind?

@randomstuff (Contributor, Author)

I don't think there is any problem with associating the same name with different IP ranges. We could as well generate unique names (such as OP_$opcodename_$powerpcip). Then, with the same dataset, we could group either by both (opcode name + PowerPC IP) or by opcode name alone (e.g. with sed 's/OP_\([a-z.]*\)_[a-z]*/OP_\1/').
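The naming scheme discussed here can be sketched as follows (the helper names and the exact OP_ scheme are hypothetical, taken from the comment above): emit a unique symbol per (opcode, PPC address) pair, then strip the address suffix in postprocessing to regroup samples by opcode alone.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical: one unique symbol per (opcode, PowerPC address) pair,
// e.g. "OP_fmuls_804a33fc".
std::string MakeRangeName(const std::string& opcode_name, uint32_t ppc_addr)
{
    char buf[64];
    std::snprintf(buf, sizeof(buf), "OP_%s_%x", opcode_name.c_str(), ppc_addr);
    return buf;
}

// Postprocessing equivalent of the sed expression in the comment: drop
// the trailing address so samples for the same opcode merge.
std::string GroupByOpcode(const std::string& range_name)
{
    size_t last = range_name.rfind('_');
    return (last == std::string::npos) ? range_name : range_name.substr(0, last);
}
```

Grouping "OP_fmuls_804a33fc" and "OP_fmuls_80123450" both down to "OP_fmuls" is what would make a per-opcode-type FlameGraph possible from the same dataset.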

@degasus (Member) commented Oct 29, 2014

Yeah, that's what I meant. So if I use the same name twice, will these blocks be combined in perf?
But I think this is likely out of scope for this PR. Basic usage first.

@randomstuff (Contributor, Author)

@comex: Yes, the compile-time option is not really necessary. I initially wanted to use #ifdef __linux__ but in fact the only requirement is getpid() and, as you pointed out, it could be used on other OSes. I thought about #ifndef _WIN32 but it would be useful for Windows as well; I just need different code for the filename there. Maybe something like:

std::string filename = StringFromFormat("perf-%d.map", GetCurrentProcessId());
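The cross-platform filename idea can be sketched with standard calls only (the helper name is hypothetical; Dolphin's StringFromFormat is replaced by std::to_string here so the sketch stays self-contained):

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>  // GetCurrentProcessId()
#else
#include <unistd.h>   // getpid()
#endif

// Hypothetical helper: build "<dir>/perf-<pid>.map" on any OS.
std::string PerfMapFilename(const std::string& dir)
{
#ifdef _WIN32
    unsigned long pid = GetCurrentProcessId();
#else
    unsigned long pid = static_cast<unsigned long>(getpid());
#endif
    return dir + "/perf-" + std::to_string(pid) + ".map";
}
```

Passing "/tmp" as the directory recovers the exact path perf hardcodes; other consumers could point it elsewhere.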

{
if (original_address) {
perf_map_file << StringFromFormat(
"%" PRIx64 " %x %s_%x\n",


@Sonicadvance1 (Contributor)

Can we change this over to a namespace instead of a bunch of functions declared in a header, polluting the global namespace? That would also allow us to remove the ugly extern-declared perf_map_file in the header.
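The suggested shape might look like this (a sketch only; the interface is hypothetical and not the PR's final JitRegister.h). The point is that the file handle becomes an implementation detail of the .cpp rather than an extern global visible to every includer:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

namespace JitRegister
{
// Only this translation unit sees the stream; nothing leaks into others.
static std::ofstream s_perf_map_file;

void Init(const std::string& filename)
{
    s_perf_map_file.open(filename);
}

// Emit one "<start-hex> <size-hex> <name>" line, the format perf reads.
void Register(const void* start, uint32_t size, const std::string& name)
{
    if (!s_perf_map_file.is_open())
        return;
    s_perf_map_file << std::hex << reinterpret_cast<uintptr_t>(start) << ' '
                    << size << ' ' << name << '\n';
}

void Shutdown()
{
    s_perf_map_file.close();
}
}  // namespace JitRegister
```

Callers then write JitRegister::Register(...) from any JIT (x86, ARM, DSP, vertex loader) without touching a shared global.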

@randomstuff (Contributor, Author)

I brought the VTune support into JitRegister. It seems the VTune JIT profiling API makes a copy of the method_name for itself, so we don't need to keep this string around.

@lioncash (Member) commented Nov 4, 2014

You would need to add JitRegister.cpp/.h to the Visual Studio project file in Common as well.

if (original_address)
{
snprintf(buf, 100, "%s_%x", name, original_address);
symbol_name = buf;


@Stevoisiak (Contributor)

@dolphin-emu-bot rebuild

@randomstuff (Contributor, Author)

Fixed the include order.

@Stevoisiak (Contributor)

@dolphin-emu-bot rebuild

@randomstuff (Contributor, Author)

I was expecting check-includes.py to exit with 0 when everything is OK; I was wrong. This time it should really be OK.

'perf' is the standard built-in tool for performance analysis on recent
Linux kernels. Its source code is shipped within the kernel repository.

'perf' has basic support for JIT. For each process, it can read a file
named /tmp/perf-$PID.map. This file contains mappings from address
ranges to function names in the format:

  41187e2a 1a EmuCode_804a33fc

with the following entries:

 1. beginning of the range (hexadecimal);
 2. size of the range (hexadecimal);
 3. name of the function.

We supply the PowerPC address of the basic block as function name.

Usage:

    DOLPHIN_PERF_DIR=/tmp dolphin-emu &
    perf record -F99 -p $(pgrep dolphin-emu) --call-graph dwarf
    perf script | stackcollapse-perf.pl | grep EmuCode__ | flamegraph.pl > profile.svg

Issue: perf does not have support for region invalidation. It reads
the file in postprocessing. It probably does not work very well if a
JIT region is reused for another basic block: wrong results should be
expected in this case. Currently, nothing is done to prevent this.

Move the JITed function/basic-block registration logic out of the CPU
subsystem in order to add JIT registration to JITed DSP and
Video/VertexLoader code.

This is necessary in order to add /tmp/perf-$pid.map support to other
JITed code as they need to write to the same file.

@Tilka (Member) commented Nov 24, 2014

"/tmp" is hardcoded into perf.

@randomstuff (Contributor, Author)

@Tilka: Yes, the idea was that it might be useful for people who are not using perf (see @comex) and might not need to store it in /tmp (and even for people on Windows, without /tmp).

@skidau (Contributor) commented Dec 7, 2014

@dolphin-emu-bot rebuild

@randomstuff (Contributor, Author)

Is there anything holding this back from master?

degasus added a commit that referenced this pull request Dec 28, 2014
Add Linux perf JIT support (/tmp/perf-$pid.map)
@degasus degasus merged commit c5a0b6b into dolphin-emu:master Dec 28, 2014
@degasus (Member) commented Dec 28, 2014

Sorry, I had just missed this PR. Though there are still some unsupported features:
We have a far code cache, used for rarely executed code and generated per block, so adding such an entry would be useful as well. We also have a vertex loader JIT which could get a symbol name. Maybe also our shared assembly.

@randomstuff (Contributor, Author)

I'll try to look at this. I have a WIP branch jit-register (I'm not sure it's correct) with support for the vertex loader JIT and the shared assembly JIT as well.
