OpenGL 3.3 lets you query a timer object to get the real time spent on the GPU, instead of the inaccurate time measured on the CPU, when measuring performance. In short, the extension can work in two ways:
1/ You can get the delta (in ns) between glBeginQuery(GL_TIME_ELAPSED, ...) and glEndQuery(GL_TIME_ELAPSED).
2/ You can get an absolute time (in ns) with glQueryCounter. You can emulate the first mode with two glQueryCounter calls and a manual diff.
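For reference, the two modes could look roughly like this in C. This is only a sketch: it assumes a current OpenGL 3.3+ context and a function loader such as GLEW that is already initialised. The function names and the `draw` callback are made up for illustration.

```c
/* Assumption: a GL 3.3+ context is current and GLEW (or another loader)
   has already been initialised by the caller. */
#include <GL/glew.h>

/* Mode 1: the GPU computes the delta between begin and end. */
GLuint64 time_elapsed_ns(void (*draw)(void))
{
    GLuint query;
    GLuint64 ns = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED, query);
    draw();                               /* commands being measured */
    glEndQuery(GL_TIME_ELAPSED);

    /* GL_QUERY_RESULT waits until the GPU is done: simple, but it stalls. */
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);
    glDeleteQueries(1, &query);
    return ns;
}

/* Mode 2: two absolute timestamps and a manual diff. */
GLuint64 timestamp_diff_ns(void (*draw)(void))
{
    GLuint q[2];
    GLuint64 t0 = 0, t1 = 0;

    glGenQueries(2, q);
    glQueryCounter(q[0], GL_TIMESTAMP);   /* timestamp before the work */
    draw();
    glQueryCounter(q[1], GL_TIMESTAMP);   /* timestamp after the work */

    glGetQueryObjectui64v(q[0], GL_QUERY_RESULT, &t0);
    glGetQueryObjectui64v(q[1], GL_QUERY_RESULT, &t1);
    glDeleteQueries(2, q);
    return t1 - t0;                       /* elapsed GPU time in ns */
}
```

Both helpers read the result back immediately, so they block until the GPU has caught up. That is fine for a quick experiment but not for per-frame profiling.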
An example tutorial: http://www.lighthouse3d.com/cg-topics/opengl-timer-query/
One additional note: the commands go into the GPU pipeline, so the result is not available right away. If you read the value back too early, the CPU has to wait and that could stall the pipeline. Alternatively, you can use a ping-pong buffer and read the data of the previous frame (which is done, and therefore so are its query counters).
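The ping-pong bookkeeping could be sketched as follows. To keep it runnable without a GL context, the GL calls are hidden behind a hypothetical `TimerBackend`. With real GL, `issue` would call `glQueryCounter(id, GL_TIMESTAMP)` and `fetch` would call `glGetQueryObjectui64v(id, GL_QUERY_RESULT, &ns)`. All type and function names here are assumptions for illustration, and the `fake_*` backend is only a stand-in clock.

```c
#include <stdint.h>

/* Hypothetical backend hiding the actual GL query calls. */
typedef struct {
    void     (*issue)(unsigned id);  /* record a GPU timestamp into query id */
    uint64_t (*fetch)(unsigned id);  /* read a timestamp back, in ns */
} TimerBackend;

typedef struct {
    unsigned start[2];  /* start-of-frame query ids, one per buffer */
    unsigned end[2];    /* end-of-frame query ids, one per buffer */
    unsigned frame;     /* number of frames recorded so far */
} FrameTimer;

/* Issue the start timestamp into the buffer owned by the current frame. */
void frame_begin(const FrameTimer *ft, const TimerBackend *be)
{
    be->issue(ft->start[ft->frame % 2]);
}

/* Issue the end timestamp, then read the OTHER buffer, which holds the
   previous frame. Returns 1 and stores that frame's duration in *ns,
   or returns 0 on the very first frame when nothing is available yet. */
int frame_end(FrameTimer *ft, const TimerBackend *be, uint64_t *ns)
{
    unsigned cur = ft->frame % 2, prev = 1 - cur;
    int have_result = ft->frame > 0;

    be->issue(ft->end[cur]);
    if (have_result)
        *ns = be->fetch(ft->end[prev]) - be->fetch(ft->start[prev]);
    ft->frame++;
    return have_result;
}

/* Simulated backend (not GL): a fake clock advanced by the caller. */
static uint64_t fake_clock_ns;
static uint64_t fake_stamps[4];
static void     fake_issue(unsigned id) { fake_stamps[id] = fake_clock_ns; }
static uint64_t fake_fetch(unsigned id) { return fake_stamps[id]; }
```

Each buffer is read one frame after it was written and reused one frame after that, so the readback always targets queries the GPU has most likely finished already.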
That leaves the question of what to profile. I feel it would be too heavy to annotate every GL command, and maybe not useful. For the moment my ideas are:
1/ Use glBeginQuery/glEndQuery around the dump of a call.
2/ Use glQueryCounter (with a ping-pong buffer) around the swap buffer command to get the performance per frame.
There is already some work in this direction:
But it's not yet complete.
James is working on this in https://github.com/exjam/apitrace
This has now been merged into master. Let us know if you run into any issues.