Knowledge Dump Debugging Epoch Programs
This page is continually undergoing revisions and updates as we progress with the debug support project. Be sure to check back frequently for the latest developments. Also, be sure to read to the end of the page for the most complete and up-to-date status information.
One of the key usability traits of a new programming language is the debugging experience. Without a solid set of tools for debugging, any novel language faces a serious uphill battle for adoption. Epoch has been since its inception a pragmatic language first and foremost; if it doesn't help get work done, it isn't doing its job. Debugging is no exception (pardon the pun) so we want to have a first-class debugging experience ready for future Epoch programmers.
One option for great debuggability of new languages is to build the debug tools by hand. In fact this was largely the plan for Epoch for a long time. We need to emit comprehensive metadata about the code anyways for garbage collection purposes, so why not hitchhike on that data and deliver a custom debugger?
Of course the problem with this is that building a world-class debugger is a monumental undertaking, and is not even guaranteed to hit the bar for quality. More specifically, a home-grown debugger is likely to offer a very different UX than the tools developers already know. So the gold standard is to integrate with existing tools cleanly.
Since Epoch is primarily targeting Windows (for now!) this means that the ideal debugging experience is to work seamlessly with tools like Visual Studio and WinDbg. Moreover, it means adopting the PDB debug file format so that things like DbgHelp.dll can generate stack traces, minidumps, and so forth.
It doesn't take much research into the PDB format to discover that very little is actually publicly known about how these files work. There are a tiny number of projects that have interfaced successfully with PDB files, most notably cv2pdb. The strategy used by this tool is to talk directly to MSPDB140.dll (or a similarly named file depending on local Visual Studio version) and use its APIs to build up and emit a PDB.
Based on analysis of this project as well as some minimal code open-sourced by Microsoft, we've discovered enough of the format's peculiarities to at least make a convincing sketch of a valid PDB file for a test program written in Epoch.
As of July 2016, the Epoch 64-bit compiler emits debug symbols that can be used with Visual Studio and WinDbg. A number of additional tools have been used to reconstruct the details of how a PDB comes to be.
- DBH.exe can dump the function names and source mapping correctly from our generated PDB. This tool comes with the Windows SDK and can be found alongside WinDbg.
- The cvdump.exe tool, which can be found in the Microsoft GitHub repo microsoft-pdb, emits a chunk of data which is useful for validation and sanity checking.
- The DIA SDK included with Visual Studio has a tool called DIA2Dump which also provides useful details about PDB files.
Interestingly, none of those tools appear to offer a comprehensive dump of symbol data, so using all three was necessary to engineer a working symbol generation pipeline. The current status of debugger support for Epoch follows:
- Visual Studio 2015 correctly generates callstacks with function names
- Visual Studio 2015 correctly shows source code for Epoch programs during debugging
- WinDbg correctly generates callstacks with function names
- WinDbg correctly shows source code from a given instruction in the disassembly
- DbgHelp generates correct callstacks
There are several notable holes in the current PDB generation code:
- Type metadata is not emitted yet; this limits a number of debugger features
- PDB data is generated using the AddSymbols API of MSPDB140.dll fed with CodeView data generated by LLVM; we currently manually crack this blob and tweak it a bit to make the debug files work, prior to handing over the stream to MSPDB140.dll
- The raw debug data being generated by LLVM is in some cases bogus, because we feed it hack data for laziness reasons. For example we don't track actual line numbers or source files because the compiler front-end is not set up to track that information yet.
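To make "cracking the blob" a little more concrete, here is a minimal sketch of walking the symbol records in a CodeView `.debug$S` section of the kind LLVM emits. It assumes the C13 layout (a 4-byte signature of 4, then subsections tagged with a 32-bit kind and length, with symbol records living in kind 0xF1). The function name `iter_symbol_records` is hypothetical and this is not the Epoch compiler's code; it only illustrates the general shape of the data.

```python
import struct

CV_SIGNATURE_C13 = 4      # assumed leading magic of a .debug$S section
DEBUG_S_SYMBOLS = 0xF1    # assumed subsection kind that carries symbol records


def iter_symbol_records(debug_s: bytes):
    """Yield (record_kind, payload) for each symbol record in a .debug$S blob."""
    (signature,) = struct.unpack_from("<I", debug_s, 0)
    if signature != CV_SIGNATURE_C13:
        raise ValueError("not a C13 CodeView section")

    offset = 4
    while offset + 8 <= len(debug_s):
        # Each subsection starts with a 32-bit kind and a 32-bit byte length.
        kind, length = struct.unpack_from("<II", debug_s, offset)
        body_start = offset + 8
        body = debug_s[body_start:body_start + length]

        if kind == DEBUG_S_SYMBOLS:
            pos = 0
            while pos + 4 <= len(body):
                # reclen counts the bytes that follow the length field itself,
                # so it includes the 2-byte record kind.
                reclen, rectyp = struct.unpack_from("<HH", body, pos)
                yield rectyp, body[pos + 4:pos + 2 + reclen]
                pos += 2 + reclen

        # Subsections are padded out to a 4-byte boundary.
        offset = body_start + ((length + 3) & ~3)
```

A walker like this is enough to locate individual records for inspection or patching before the stream is handed to another consumer; relocation fixups are out of scope for the sketch.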
Ultimately the project is moving forward and we are very close to supporting a moderately good debug experience on Windows. As time goes on we can fill in the remaining gaps and generate enough debug data to be competitive.
- For about a week we had a problem where Visual Studio would show function names, but WinDbg wouldn't. More interestingly, WinDbg would show source code, and Visual Studio wouldn't! It turns out this is down to messing up two things: section contributions in the PDB, and a lack of a section symbol describing the .text PE section for the compiled binary. Fixing up the contributions logic and adding a symbol to map addresses into the correct space resolved this weird behavior.
- A huge amount of insight was gained by reading the code for cv2pdb, DIA2Dump, and most recently, cvdump. We plan to try and coalesce this information into the Epoch compiler and document it for posterity. Ideally any novel language built on LLVM should be able to benefit from this PDB emission pipeline, even though it does technically require a Visual Studio installation to work.
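As an illustration of the missing section symbol, the sketch below builds one S_SECTION record for a `.text` section. The field layout and the record kind 0x1136 follow the SECTIONSYM definition in Microsoft's published cvinfo.h as best we understand it; the helper name `build_section_symbol`, the example RVA and size, and the 4-byte padding are assumptions made for the example rather than a copy of what the Epoch compiler emits.

```python
import struct

S_SECTION = 0x1136  # record kind, per cvinfo.h (assumed here)


def build_section_symbol(index, rva, size, characteristics, name, align=12):
    """Build one S_SECTION symbol record describing a COFF section.

    align is the log2 of the section alignment (12 means 4096 bytes).
    """
    body = struct.pack(
        "<HHBBIII",
        S_SECTION,        # rectyp
        index,            # 1-based section number in the image
        align,            # alignment as a power of two
        0,                # reserved
        rva,              # section RVA
        size,             # section size in bytes
        characteristics,  # same flags as the PE section header
    ) + name.encode("ascii") + b"\0"

    # Pad so the whole record, including its 2-byte length, is 4-byte aligned.
    body += b"\0" * (-(len(body) + 2) % 4)

    # The leading length field does not count itself.
    return struct.pack("<H", len(body)) + body


# Example: section 1 is .text, mapped at RVA 0x1000, code + execute + read.
text_symbol = build_section_symbol(1, 0x1000, 0x2000, 0x60000020, ".text")
```

The point of the record is simply to give the debugger a section number, RVA, and size it can use to translate section-relative symbol addresses into addresses in the loaded image.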
Another tool came to light from poking the LLVM mailing lists - llvm-pdbdump. This is by far the most comprehensive resource we've found yet for PDB emission. It includes the capability to emit a complete MSF (Multi-Stream Format) file, which is the parent/container format for PDB data. Based on this tool we are now writing a raw PDB emission pipeline that integrates with the compiler to generate PDB debug data for Epoch programs as they are compiled, rather than a post-hoc second step.
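For readers unfamiliar with the container, the sketch below packs and parses the MSF superblock, the fixed header at the start of the file. The magic string and field order follow the description in LLVM's PDB documentation; the helper names are hypothetical and none of this is taken from the Epoch source, so treat it as a rough picture of the header rather than a reference.

```python
import struct

# 32-byte magic, as described in LLVM's documentation of the MSF format.
MSF_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"

# magic, BlockSize, FreeBlockMapBlock, NumBlocks,
# NumDirectoryBytes, Unknown, BlockMapAddr
SUPERBLOCK = struct.Struct("<32sIIIIII")


def pack_superblock(block_size, free_block_map_block, num_blocks,
                    num_directory_bytes, block_map_addr):
    """Serialize a superblock; the unknown/reserved field is written as zero."""
    return SUPERBLOCK.pack(MSF_MAGIC, block_size, free_block_map_block,
                           num_blocks, num_directory_bytes, 0, block_map_addr)


def parse_superblock(data):
    """Return the superblock fields as a dict, or raise if the magic is wrong."""
    magic, block_size, fpm, num_blocks, dir_bytes, _unknown, map_addr = \
        SUPERBLOCK.unpack_from(data, 0)
    if magic != MSF_MAGIC:
        raise ValueError("not an MSF 7.00 file")
    return {
        "block_size": block_size,
        "free_block_map_block": fpm,
        "num_blocks": num_blocks,
        "num_directory_bytes": dir_bytes,
        "block_map_addr": map_addr,
    }


def blocks_needed(stream_size, block_size):
    """Number of whole blocks a stream of stream_size bytes occupies."""
    return -(-stream_size // block_size)
```

Every stream in the file, including the stream directory itself, is stored as a list of fixed-size blocks, which is why the block-count arithmetic shows up throughout any MSF writer.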
It seems that the following components are necessary and sufficient to get a usable debug experience from Visual Studio 2015 and WinDbg:
- A usable MSF file to host the data. Currently we do this with MSPDB140.dll integration but as noted above we also have a raw MSF generator in the works.
- A PDB Information stream that contains a GUID and "age" value matching those in the .debug COFF section of the image (.EXE) to be debugged (see the WriteDebugStub function implementation for details).
- A DBI stream with associated contents:
  - A Section for the .text section in the final .EXE image. It is not clear if additional sections are necessary, but setting up valid COFF section data is highly recommended as it makes it easier to align the addresses of code in the final image with addresses as computed by the PDB consumer.
  - A Module for code. Only one is needed; multiple modules may be useful for separate compilation, but this seems to be pretty flexible so far.
  - The Module contains symbol data. From the perspective of a consumer of MSPDB140.dll this is just a matter of feeding CodeView data from LLVM/etc. directly into the AddSymbols function. There is some fixup needed to handle relocations, and one additional S_SECTION symbol should be added to help the debugger map the code addresses to a COFF section.
  - Each publicly visible CodeView symbol from the module should be fed through a call to AddPublic2 as well. This ensures that the debuggers will see the symbol, but it doesn't seem to prevent non-publicized symbols from occasionally working as well.
  - The Machine Type of the DBI stream should probably be set but it doesn't seem to be a problem if it's bogus or zero.
- Notably, TPI (Type) information is not necessary although it will severely limit the debugging experience to not have it.
- The IPI stream is still a mystery although it apparently mimics the TPI stuff in a lot of ways?
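The GUID and age requirement in the list above can be sketched as follows. The record layout shown is the widely documented "RSDS" CodeView debug record (a 4-byte signature, a 16-byte GUID, a 32-bit age, and a null-terminated PDB path) that an image's debug data points to. The function names are hypothetical and this is not the WriteDebugStub implementation; it only shows the pairing a debugger is expected to check.

```python
import struct
import uuid


def build_rsds_record(guid: uuid.UUID, age: int, pdb_path: str) -> bytes:
    """Build the CodeView record stored in the image's debug data."""
    return (
        b"RSDS"
        + guid.bytes_le               # GUID in Windows in-memory byte order
        + struct.pack("<I", age)
        + pdb_path.encode("utf-8") + b"\0"
    )


def parse_rsds_record(data: bytes):
    """Return (guid, age, pdb_path) from an RSDS record."""
    if data[:4] != b"RSDS":
        raise ValueError("not an RSDS record")
    guid = uuid.UUID(bytes_le=data[4:20])
    (age,) = struct.unpack_from("<I", data, 20)
    path = data[24:].split(b"\0", 1)[0].decode("utf-8")
    return guid, age, path


def pdb_matches_image(image_record: bytes, pdb_guid: uuid.UUID, pdb_age: int) -> bool:
    """True when the PDB's GUID and age agree with the image's record."""
    guid, age, _path = parse_rsds_record(image_record)
    return guid == pdb_guid and age == pdb_age
```

In practice the compiler would generate one GUID per build and write it to both the executable and the PDB Information stream, so that a stale PDB from an earlier build is rejected rather than silently loaded.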
The implementation against MSPDB140.dll can be seen here. The Epoch implementation of "raw" emission can be seen in this folder.
The first version of completely fabricated PDBs is now checked in to version control (see "raw" emission link above). Most of the work involved stepping through the disassembly of MSPDB140.dll and hand-correlating the code to the microsoft-pdb repo. This proved necessary because the code for the PDB handlers does not compile as it stands in the Microsoft repo. This makes understanding the runtime flow of the code mildly awkward, but it is nothing that can't be solved with a healthy familiarity with x64 ASM and a willingness to spend a lot of time in a debugger.
In any event, we now have a PDB that loads cleanly in multiple tools (as cited previously in this article) and also serves up working debug information for both Visual Studio 2015 and WinDbg. It is highly hard-coded and many hacky workarounds remain. Getting this PDB generation up to production quality will still take some time, but it is a very encouraging and promising milestone that we can completely bypass black-box tools to generate this data.