Discussion on reorganizing gc.cpp #408

Open
ala53 opened this Issue Mar 6, 2015 · 27 comments

Projects

None yet
@ala53
ala53 commented Mar 6, 2015

Per (the nightmarish) PR #403, it was recommended I open an issue to discuss reorganizing gc.cpp.

For those that aren't familiar, gc.cpp is a 36,000 line file which seems to contain the entire garbage collector and gcpriv.h is a 4,000 line header which seems to correspond to gc.cpp.

So, what would be the best way to split gc.cpp into its components.

We could split by function: allocator/ mark phase / sweep phase / deallocator / etc.
We could split by class: gcstatistics / gcmechanisms / gcallocator / etc.
We could split into allocator, deallocator, and garbage collector.
Or something not mentioned here...

What I've identified so far:

  • There are 11 commented #defines for the same function (dprintf).
  • max_generation, something seemingly globally needed, is a private enum hidden in gc.h.
@brianrob
Contributor
brianrob commented Mar 6, 2015

FYI - @Maoni0

@Maoni0 Maoni0 self-assigned this Mar 8, 2015
@lucasmeijer

fwiw, imo it would be very beneficial to the understanding of newcomers to the code to split up gc.cpp into different seperately digestible parts. It's a complicated beast, and a 1mb+ cpp file is really not helping :). @ala53 is not the only one who has run into this problem.

@richlander
Member

@lucasmeijer, @ala53 suggested a few different approaches. Do any of those resonate with you, or you'll wait and see as the conversation progresses?

@lucasmeijer

@richlander, I don't have a good enough overview yet to have an opinion on how to best split it up. I mostly added my comment because #403 was closed with the reason "we wont do PR's that just move stuff around". this code could definitely use some moving around :). how to do it in a way that conceptually makes sense, and does not regress on performance is something I hope others will have good opinions on.

@Maoni0
Member
Maoni0 commented Mar 9, 2015

Hi @ala53, I really appreciate your interest in the GC, but we would prefer not to take these kinds of source file refactoring changes. We’ve had the current structure since CLR started so it is familiar and efficient for those who have worked on it the most. From the guidelines:

please do give priority to the current style of the project or file you're changing

Also splitting up code into multiple files, as Brian pointed out in your PR, would fall under ‘formatting changes’ guidelines:

Formatting changes

Because the code is mirrored with our internal source control system we would like to keep these kinds of changes to a minimum to avoid unnecessary merge conflicts … formatting changes are likely going to be turned down.

The dprintf macros you pointed out are strictly only for logging. I left multiple versions there because I use different versions for different things all the time. If you want to add a comment that explains why these different versions are there, that type of change could improve clarity while having very low impact on other branches and we could take it.

I hope this isn’t too discouraging. In the future we can discuss other proposed changes in github issues to avoid investing time in changes we are unlikely to accept.

@Maoni0 Maoni0 closed this Mar 9, 2015
@hbarrington

@Maoni0 not to beat a dead horse, but is there any kind of internal effort at Microsoft to break this up? I understand merge conflicts increase if lots of outside developers are reorganizing code, but it is incredibly hard to read. I'd be surprised if internal teams actually preferred this one-file structure.

@Eyas
Eyas commented Mar 9, 2015

@Maoni0 also, this is not necessarily a task item or work item, nor is it an active pending change (say, a PR), but rather a place to brainstorm about what should happen to GC code.

The arguments for closing that I see here are arguments against accepting a PR right now, not arguments against having this discussion, which would be actionable at a future time.

Would you be able to provide an argument why a discussion is harmful? It seems like the answers are (1) we never want to split up GC, or (2) we already have a plan.

I'd be curious about the arguments/reasonings for both.

@ala53
ala53 commented Mar 9, 2015

@Maoni0

I hope this isn't too discouraging. In the future we can discuss other proposed changes in github issues to avoid investing time in changes we are unlikely to accept.

This discussion was my way of doing so: I wanted to discuss possible ways to clean up gc.cpp. The issue-pull request-issue actually stemmed from me trying to understand how garbage collectors work but the current organization of the GC provides a high barrier to entry/it isn't very easy to understand.

I have no problem with you wanting to avoid merge conflicts and such, but I (as well as some others) think that splitting the functionality of the GC into different files (not reorganizing functionality) would be beneficial. As @Eyas said, why is it a bad idea to discuss how to do so in the future? That's why this is an issue, not a pull request: the best method can be discussed without changing the code until a good solution (refactoring strategy) is found.

We’ve had the current structure since CLR started so it is familiar and efficient for those who have worked on it the most.

I understand not wanting to restructure old code, but I'm not talking about changing the structure. If you look at the PR, I wanted to move existing classes (such as the allocator) into their own files so it would be easier to work on one part without having to deal with others, not changing class names and such (which would ruin the current structure of the code).

Because the code is mirrored with our internal source control system we would like to keep these kinds of changes to a minimum to avoid unnecessary merge conflicts … formatting changes are likely going to be turned down.

On this part, I'd think it would actually reduce merge conflicts. Say me and a fictional "Bob" both have changes that improve the performance of the GC. Mine focuses on better root tracing, while his focuses on more efficient allocation. In the current structure, once mine is accepted, his will be invalid (merge conflict) as the entire 36,000 line file will have changed. If it is restructured to split functionality, then both changes will be non-interacting and be able to be merged.

@karelz
Member
karelz commented Mar 9, 2015

Apologies for closing the issue. As you know we are new to open source, so things are a bit bumpy and we are learning as we go. We didn't realize right away that issues are used also for design-style discussions ... I am reopening the issue for the purpose of continued discussion.

@karelz karelz reopened this Mar 9, 2015
@karelz
Member
karelz commented Mar 9, 2015

@Eyas discussion is not harmful. Currently we do not have plans to split up the GC code (and we do not even have plans for the plans). Top need for GC work is currently x-plat (see e.g. #138) - any help there would be very appreciated.

@shahid-pk
Contributor

@karelz this is encouraging and constructive we all want the gc to be cross platform first then other things but this discussion could help the community to know why the structure of the gc is what it is and why it is staying that way. We understand the current priority is cross platform first.

@karelz
Member
karelz commented Mar 10, 2015

@shahid-pk Makes sense. Let me try to answer your questions to my best knowledge (and some guess work):
Why it is this way? ... Partly historical reasons (it is this way since the start). Partly because devs working on it didn't feel the urge to refactor it. Partly because splitting of gc.cpp is non-trivial and risky and because it does not bring too big value (ramp up in the code base can be gained also in the combination of reading BOTR and debugging the code).
Why it is staying this way? ... Cost/benefit/risk ratio is IMO not in favor of a change here.

Few additional thoughts:
Am I happy that there is only 1 large file? No, but it doesn't hurt me much either.
Do I see the disadvantages of large file? Yes, but I don't think they are huge. More like minor annoyances with easy workarounds.
And to turn it around: Do you see the risk of any changes here? Do you see the cost of extra careful code reviews to mitigate the risk?

Strictly technically, we truly believe this is a formatting change. If it was simple to split it up and if it would be low risk and if it would be very easy to review, it might be worth the 'minor' improvements mentioned above ... but I don't see that combo happening (not on a noticeable scale in gc.cpp).
On a personal note: I also trust CLR team that if all these three things were true, the refactoring would have happened long time ago.

Hopefully this helps to give some insight into our view point ...

@Eyas
Eyas commented Mar 10, 2015

@karelz your points are valid and I expected as much.

To provide a perspective on why continuing the conversation is useful, I think it is illuminating to look over at what the folks over at CoreFX are doing with their GitHub issues. This might be an interesting place to get towards. Especially, see how issues are tagged with different priority levels, one of which is '0 - Backlog' (idea "good to go but unplanned"), and some are not tagged with anything yet.

Here's one way I see the discussion can move forward. The question "Should the GC code be reorganized" will either by answered by "No, never" or "Yes" (now/later/maybe never). If No, then an informative answer on "Why" is incredibly useful (for the community to contribute to a project owned by the CoreCLR team, understanding subtle aspects of their reasoning is helpful in making sure the community can actually help, rather than slow you guys down).

If Yes, there are two interesting questions that come up at this point: (1) How would one ideally go about this? (2) How important/urgent is it? Note that (1) and (2) inform each other. If this is urgent but the ideal way is difficult, then we look for another way, but a way of going about it that is less elegant would also make it less desirable.

In the case of this discussion, @karelz mentions some reasons that oppose action and encourage caution that apply for any attempt. It seems that (2) is settled as "not now, and not soon"-- which is great. Others might continue to disagree and they might have a point, but this answer is totally valid as is.

"Not now, and not soon" is sometimes an exciting answer for some, I feel; it means you can discuss an ideal solution without having to worry about implementing it under any constraints!

Every time someone works on the GC, whether from the CoreCLR team or an external contributor, that person might notice a trend, a line of abstraction, or another structural unit that could be factored around, and they would be able to document it here.

@Eyas
Eyas commented Mar 10, 2015

Onto the specifics here, I do agree that reorganizing and refactoring is a formatting change in this case.

@d-kr
d-kr commented Mar 10, 2015

In #401 @cnblogs-dudu referenced a post explaining that the GC was machine-generated from LISP. Since gc.cpp is not a real source code (just intermediate code), but the lisp code is, should it be more useful to publish the lisp source and the transpiler (Source-to-source compiler) for LISP -> CPP?

@hbarrington

@Eyas 👍

@karelz
Member
karelz commented Mar 10, 2015

@d-kr The LISP conversion might have happened before v1, just once (most likely). Current gc.cpp is "the source". There is no additional magic we use.

@karelz
Member
karelz commented Mar 10, 2015

@Eyas Thanks for understanding.
The tagging is interesting idea and I bet that @richlander @terrajobst will make it happen here once it settles in corefx project.

@ala53
ala53 commented Mar 10, 2015

@karelz That makes sense. It's an issue of cost-benefit. Reorganizing the GC may cause a bunch of subtle bugs and wouldn't help the people who work on it a lot. Like @Eyas, I vote the discussion be left open so that (a) people won't post more issues of "WHY IS GC ONE HUGE FILE?!?!?!!," as the reasoning is documented here and (b) so people can document little things they find that could be factored out eventually.

I wonder if adding something to the BOTR documentation would help - specifically a mention that the GC is by design in one file and see (link to here) for the reasoning.

Anyways, I see the point behind not touching the GC and it is definitely not a functional change so it should not (even remotely) be a focus for the people working on it right now.

@Eyas: Virtual +💯 to you. On your questions, (2) is definitely settled: get x-platform working well and then come back to readability changes. (1) is what this discussion should eventually converge on. Like you said, people who work on the GC can document any ideas here, without having to worry about any implementation details so that (maybe) a perfect solution is found.

I do, however, question if some of the work could be done as part of the general changes to the GC. For example, if someone makes a really big change to the allocator, perhaps they could split it out as they do so. This way, we avoid doing unnecessary formatting changes and since the component would be mostly rewritten (like I said, a big change), it would not be much extra work to factor out into its own file.

A small note: GitHub could use an "internet arguments and discussions" section for things that don't focus on implementation.

@Maoni0 Maoni0 removed their assignment Mar 10, 2015
@tzwlai tzwlai added this to the Future milestone Oct 20, 2015
@tzwlai tzwlai added the GC label Oct 20, 2015
@xoofx
Member
xoofx commented Jun 16, 2016 edited

Maintaining the status quo regarding this issue is frankly disappointing.

the current structure since CLR started so it is familiar and efficient for those who have worked on it the most

efficient only because familiar. For those who are not familiar, the current structure of the code breaks any attempt at efficiency.

For such a critical component in the runtime, I would really love to hear that there is a roadmap/work planned to progressively restructure the code. Consider restructuring as just a formatting process which would not bring any value is seriously misleading. Among the things this could bring:

  • lower the cost for people not familiar with the GC code to jump more easily into it: CoreCLR is now open-source, and the key to its success is to welcome contributions, as long as they improve the quality of the product. Disallowing this because there is some internal/legacy way of working is not really constructive (to say the least) and a welcoming attitude to the .NET community.
  • improve the cost of maintenance (by getting more people familiar with it), Easier merge conflicts on small files rather than a big blob/locked file.
  • allow and encourage for improvements, new implementation ideas
  • make it a reference code for all students and professors, to help them study it, experience with it, understand and challenge the trade-off between theory and practicality
  • there are many performance challenges ahead for .NET to stay relevant, and the GC is a key component of this challenge: making the code more modular and manageable will help more people to tackle this challenge

So please, reconsider this issue, we are willing to help!

@shahid-pk
Contributor

Adding to @xoofx the gc is the only feature/code currently which seems like no one outside Microsoft understands because other parts of coreclr are getting active contributions for example ports of the run time like Linux arm port , Mac port , FreeBSD port , NetBSD port and for RyuJit @mikedn and others deep understanding of it and contributing to it. Its actually a surprise no one from the community understands gc's code that much well which shows that the .net team need to come up with a plan to make it easy for people to understand gc at least if contributions to it understandably is more risky.

@jakesays

@xoofx @shahid-pk well do something about it. Start refactoring and submit some PR's. The code is open. You just might learn something in the process ;)

@xoofx
Member
xoofx commented Jun 17, 2016

@xoofx @shahid-pk well do something about it. Start refactoring and submit some PR's. The code is open. You just might learn something in the process ;)

@jakesays no, we can't do something about it, Have you carefully read all the posts above? It is stated formatting changes are likely going to be turned down. and a refactoring is considered as a formatting. So usually, you don't start to work on something when you know that It will get rejected, unless you want to fork the whole project.

The code is certainly open, but currently not enough open for contributions, that's the whole point of the discussion.

@alexandrnikitin
Contributor

@jakesays FYI, there were few attempts to refactor the code but they were declined: e.g. #403 (comment)

@Maoni0
Member
Maoni0 commented Jun 20, 2016

We do appreciate your interest in our GC. Our goal is to enable people who work on this code the most to be efficient and to find ways to make a larger group of people efficient working on the code base over time. We’re going to wait to make GC codebase usability changes until after other people start contributing to it and have specific feedback based on making multiple changes. This way we have a better chance of maintaining a positive cost/benefit ratio for our efforts. We hold the belief that the primary barrier to making GC changes is experience with the domain not the structure of our codebase. That’s the primary motivation for our approach.

The GC is written in portable C++ code and relies on the PAL for OS services. It doesn’t need to be updated to port CoreCLR to FreeBSD, for example.

We took a look at other challenging parts of CoreCLR to see how they compare to the GC.

Assembly loading, for example, is spread over many different files and is very non-approachable. The community effort to implement collectible assemblies was unfortunately unsuccessful, see issue 552. Whether something is monolithic or split into multiple files doesn’t seem to directly inform approachability.

Another case is the JIT. @mikedn is a codegen/compiler person and has had success contributing to the JIT. JIT has files that have the same order of magnitude of lines as gc.cpp, e.g. https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp. Thanks @mikedn!

We do want to expand the set of folks contributing to the GC. For example, we would love to see academics using this GC implementation for their research. I believe if there are GC folks who’re interested in contributing to our GC, they would be able to find their way around our GC code just fine, like @mikedn did with our JIT. Once devs make contributions to the GC, we’ll naturally take their feedback and preferences into consideration and will move forward.

@FransBouma

@Maoni0 That's all well and good, and sorry if I sound harsh, but your arguments sound like you keep it this way as a bit of a filter on who will work on it. If the code is incomprehensible to a person who isn't familiar with the inner workings (but other than that a skilled engineer) it can be a serious hurdle to work on the codebase. If the code is easy to understand for such a person, more people are likely to be able to contribute. The easier it is to work on a codebase, the more maintainable the codebase is. Sure it's maintainable for the people who work on it now, but that group is very small, and what if they move on to another team within Microsoft or if they're a 3rd party contributor lose interest, or don't have enough time? Who will take over? As it's not easy to participate in the project.

Of course the domain is complex too, but that's a given regardless of the codebase state: to be able to contribute to a GC, you have to be familiar with research on that topic. That already takes some time and effort, and if the codebase is complex on its own and hard to maintain, it'll take even more time and effort to contribute. Who will do that? I mean, who has so much time and energy as a 3rd party, non-paid volunteer(!), to contribute to this project?

Keeping the codebase as one big file will not help getting more input from the community, on the contrary: how is someone from the community going to contribute if a deep study to how the complete code is even organized is needed before a single statement can be added? If a codebase is properly organized, it's not needed to know the entire codebase before a contribution can be made. Isn't the goal for Open Source software to make it as much as possible for others to contribute? If only a select view are able to contribute (namely the people who invest serious time to learn the entire codebase), the amount of contributions worth having is only coming from those select view, and we then thus all depend on them to come up with noteworthy contributions.

I don't think that's a sustainable future.

@Maoni0
Member
Maoni0 commented Jun 22, 2016

I personally find that when you have the source and can build and run it, you are in an excellent position of understanding the code. What I did, with the GC code or anything else where I had these available, was just to build it, run it with a small test program and step into the code. I outlined the code flow in the GC BotR chapter (in “Physical Architecture” section) and you can set breakpoints on the top level functions; when those are hit you can step in and see what they do. I believe this should be very helpful to understanding the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment