Permalink
Find file Copy path
3901b98 Jul 24, 2018
2 contributors

Users who have contributed to this file

@s1monw @archanid
113 lines (55 sloc) 22.7 KB

Elasticsearch Team Development Constitution

Preamble

We, the team of Elasticsearch core developers, want to move as fast as we can toward a system that is reliable, robust, secure, scalable, and straightforward to use. We want to strive for innovation, replace legacy constructs and features, remove fragile code, and work toward a better user experience while keeping our users onboard with our rapid changes.

It’s crucial for us to have a shared vision of where the team is heading and maybe even more importantly why the team is going down a certain path. When Elasticsearch was the new kid on the block it shone with endless flexibility, ease of use, and rich APIs. We formed a company around that young kid and suddenly its user base shot through the roof. The support organization could barely keep up with the growing number of customers, which is a good problem to have. Yet as the number of users grew, so did the chance of things going sideways, unfortunately much more quickly than we could ever hire support engineers. We learned that much of the flexibility came from leniency, from features that worked in most cases but not all. For instance, having scripts that users can send with the request is basically a remote code execution engine and if it goes wrong it’s fatal. Even the most basic features, like settings, were very flexible but enormously fragile. Specifying a number without a unit was perfectly fine except that many users didn’t know what the default unit was. We just tried to do the right thing, which turned out to not be the right thing all the time.

Today we are in a different position. Our user base is much larger than it was in 2013 but our support organization hasn’t grown at the same rate. Yes, we handle an order of magnitude more support cases than in 2013, but this would not have been possible with the system we had back then. Now we’ve moved from a fragile but flexible system toward software that is narrower in scope. We have defined many more boundaries: stricter input validation, a security model that allows us fine grained control over permissions, and even a plugin model that provides great flexibility to add riskier features.

But hold on, we are not there yet! There are still endless problems that can have fatal consequences. Aggregations can blow up servers with a single request. Users feel the need to run Elasticsearch with 30+ GB heaps. We are still offering 27 different ways of specifying a boolean value. And this list continues…

We have a massive responsibility to our users, the support organization, the cloud hosting teams, and third party providers to offer a system that is reliable, robust, secure, and straightforward to use. For this reason we all should strive for innovation, replace legacy constructs and features, remove fragile code, and improve the user experience. Our advantage over other companies is our innovation, and innovation requires velocity. We must move and embrace change to innovate without leaving the user behind.

The following sections are a collection of principles and guidelines for designing, refactoring, or removing code from the Elasticsearch codebase. These points are unordered and mostly uncategorized and should be seen as a constitution of software development within the Elasticsearch team.

Designing features

  • Progress over perfection. We have followed this approach for many years now which allows us to make large changes over time without big bang commits originating from massive pull requests. For example, the completion suggester was added in the early days of Elasticsearch without support for realtime updates and specifically deletes. This means that deleting a document in Elasticsearch wasn’t immediately reflected in the suggestions. It was a hard problem at the time and about three years later we added support for bitset filters to the Lucene suggester as well as to Elasticsearch. Meanwhile it’s been an acceptable solution for many users out there, with many bugs fixed and the evolution toward a document based suggester. It was all about progress over perfection.

  • Design for today! Use abstractions with care. Computer Science professors teach students to make extensive use of abstraction layers in the name of flexibility and information hiding. Certainly Elasticsearch makes extensive use of abstractions; no project involving several million lines of code could do otherwise and survive. But experience has shown that excessive or premature abstraction can be just as harmful as premature optimization. Abstraction should be used to the level required and no further.

As a simple exercise, consider a function which has an argument which is always passed as zero by all callers. One could retain that argument just in case somebody eventually needs to use the extra flexibility that it provides. By that time, though, chances are good that the code never noticed — because it has never been used. Or when the need for extra flexibility arises, it does not do so in a way which matches the programmer's early expectation. We should routinely submit patches to remove unused arguments; in general they should not be added in the first place. (adopted from https://www.kernel.org/doc/Documentation/development-process/4.Coding)

  • Start simple; don’t be smart. Everybody wants to write code that is top-notch, cutting edge, fast as hell, robust to the end of days, elegant, and efficient. Unfortunately, this doesn’t happen overnight. As with everything we’ve got to learn to walk before we can run. A new feature should start in the simplest way possible. Even if it’s desirable to prevent moving all shards to shrink an index from 8 to 4 shards, it’s better to have a solid shared infrastructure that requires all shards to be on the same node. Special allocation logic can come later, in subsequent releases.

  • Beware: removing code is hard. Removing even the smallest feature can be extremely hard. Be aware of this when you add code to the code base and choose wisely what gets added. We might need to stick with it for years or break many users out there when we attempt to remove it.

  • Strict, unambiguous, reliable, and simple. Elasticsearch has a history of adding lenient, ambiguous, unreliable, and complex options. The direction along those lines has changed a while ago. At first glance it seems to help with being friendly to users but it comes with a high price. It comes with a combinatorial explosion of options and code paths which are not tested and which hide bugs. A perfect example is the gazillion options to specify a boolean value. Someone would think it’s as simple as comparing the value to the string "true" or “false” and if it doesn’t match we throw an exception. No, it accepts the values “false”, “0”, “no”, and “off” and if it doesn’t match any of those it’s interpreted as “true”. What could possibly go wrong? If you add code, do it in the simplest way possible that is also strict, unambiguous, and reliable.

  • Stick to core responsibilities. It is crucial to our system that we build solid and reliable features. To do this, we need to remember our core responsibilities as a distributed scalable search engine. For example, we used to provide a limited web server called site-plugins, but it didn’t represent our core responsibilities, so we removed it. When features align with making Elasticsearch a better distributed scalable search engine, it becomes more solid and reliable. (The same principle applies to all our products.)

  • You are the expert; act like it. Elasticsearch has become popular. The user base is huge and growing exponentially. The subset of power users is shrinking such that one of our core responsibilities now is to streamline API usage and reduce the risk of "shooting yourself in the foot". Our core APIs offer a lot of flexibility which makes them easy to misuse. The result is often slow performance, cluster outages, and wrong results. Going forward, we should use our experience and deeper knowledge of the system to prevent these pitfalls. Build APIs and features that do their one thing very well. Don’t design it to be a workaround for other problems.

  • Build features in isolation. Always prefer adding a feature as a plugin rather than adding the feature to core. The best way to make clear APIs and extension points is by using them. To arrive at a maintainable core we have to keep it lean. Our plugin model allows classloader isolation as well as dedicated permissions to third party components. An isolated implementation is always preferable. If it needs to ship with the distribution it can be a module.

  • Remove first, fix later. Often the removal of dangerous or trappy features stalls because there is no replacement for its functionality. We will remove those features and prioritize reimplementing them as though they were new features. If the features were important we’ll make the reimplementation a blocker for the next release. If they weren’t it might be several releases until they are reimplemented. Or they might never be reimplemented, becoming a relic of time long gone by. Removing dangerous features is crucial to the success of the organization. For instance, delete-by-query was repeatedly causing massive outages that took days to debug and fix. Its removal has potentially saved us a large amount of money and time that we didn’t spend with our customers. Now given the fact that our user base is growing fast, it’s our responsibility to make the right decision for our users even if that decision is not popular. The remove first approach is mandatory when it comes to security, cluster stability, and data corruption.

  • Be fast by default; slow is optional. Performance is key in our business. Slow is considered esoteric. This is a very difficult topic since for instance O(n) can be acceptable with 10k documents but not for 10m. A perfect example is *_source* access in scripts. There are some scripts (search for instance) that should simply not allow the access of *_source* since it loads the JSON source of every scored document. That is a round-trip to disk per doc the script is executed for. Features like this must be disabled by default or should not be available on performance critical parts of the system. There is always the argument of prototyping, small documents sets, little webshops etc. Yet our message here should focus on reindexing and future improvements to our defaults to make these APIs obsolete.

  • Focus* on upgrade experience.* With time-based release the upgrade experience is crucial for us since we want users to move to new releases ASAP. We had several problems in the past with breaking too much and users suffered from long cluster restarts. Our feature development and ideas for improvement should focus on smoothing the path forward.

  • Break on majors, not on minors. Breaking changes must be done on major releases. We try to scope the changes to not break too much at the same time especially if we are getting closer to the GA date. It’s OK to push out a breaking change to the next major if we are close.

  • Test bottom up. If you write code, write unit tests first. Write many of them. Write code so you can write many of them. Integration testing is the last step. Focus on adding more tests that execute fast and are easy to debug, like unit tests. This is crucial for developer velocity.

  • Consider Java APIs expert APIs. With the exception of the HTTP-Client and its dedicated APIs, all APIs in Elasticsearch, all extension points and plugins are expert APIs. Accordingly, expert users can handle API changes and removals. The most reliable way to make sure folks don’t use deprecated APIs is to remove them. Don’t hesitate, especially when it’s an internal API. Non-expert users should always go through a REST interface.

  • Be critical, doubt all the code, and embrace mistakes. Everybody writes code that must be fixed, refactored, or removed at some point. But most of the time the effective half-life of code is pretty short. Add comments describing why things are done in a certain way. We can never know the full extent of a problem or all the use cases when we develop a feature.

When someone criticizes the code, they are not criticizing you, so don’t take it personally. Help them understand why you wrote it that way. When someone rewrites something you wrote, it’s not a rejection of your ideas. At one point Mike pushed a change to Lucene that Adrien obsoleted just two days later! It’s great when other people take interest in code you wrote: it means the code is alive. See improvements to your code as that code becoming a growing, thriving being.

Don’t fear making a mistake and, more importantly, don’t let fear paralyze you from adding something that might not be totally right. See mistakes and failures as feedback, *discovery *and knowledge that can make our product better.

  • Don’t be afraid of big changes. Often the solution to a problem is hard. The hardest part is to solve it correctly. It comes together with tons of work, risk, and changes in the system that will affect others — mostly users. Prefer incremental changes (see Progress over perfection), but be willing to make big changes in big chunks where incremental change is impossible.

  • Don’t be afraid to say no. Elasticsearch is at a point where it can’t accept every change. If we tried to make everybody happy we would stall and be paralyzed. There are certain things that just don’t work with a system like Elasticsearch. Think of joins or real type isolation. To these we have to say thanks, but no thanks!

  • Only accept features that scale. We often get feature requests that would work fine in the single-node case (e.g., joins or precise cardinality aggregations) but would be a disaster given the distributed nature of Elasticsearch. These feature requests should always be rejected because they violate our core Elasticsearch responsibilities of scalability and distributed nature. In principle we do not add features that only work in the single-node case.

  • Always start with a dream. It’s wise to start with an idea of what a perfect solution would look like, even if it involves backward breaks or removing core features and replacing them later with better solutions. Sometimes it’s possible to implement the ideal even if it takes time. By the same token, it’s wise to think about the simplest possible solution, and in many cases the biggest bang for the buck is pretty close to the simplest solution.

  • Focus on error reporting. In software development lots of things are binary. If it didn’t work as expected it should fail fast and hard. Focus on good error reporting; avoid swallowing exceptions, rather declare checked exceptions and forcing the caller to handle the situation, Exceptions are part of a method’s contract! Guard preconditions for methods with actual checks. If we don’t know or document what exceptions can happen, and when, then we have no idea really how the method behaves. Add checks/javadocs and try to make this better. Look at the JDK code for examples of this, even Lucene code which we use as a reference often is not a good example here. Understand how this can make your code faster, e.g. checking for array index up front in the method is not only more clear, it fails fast and hard, and can also eliminate bounds checks ("dominating test"). When reporting errors, ask yourself a) what message you would want to see if you are debugging a problem, and b) what information would give the operational production-support team enough insight to diagnose problems.

  • Document the code. You may think your code is obvious, but it may not be. Give a high level overview of it to someone who is unfamiliar with those thousands of lines of code, such that you both would be able to divide and conquer. Document a summary of what things do at package, class, and method level. If you think your own code is tricky or hairy, do even more to try to compensate. Long living code is written once but read and reread many times!

  • Private by default. Java’s access levels are a good way to encapsulate code: separate the interface/contract from the implementation detail. Private is best, package-private is good, public is last resort. Be careful about what you expose, so that your class or API is simple easy to use.

  • Every change deserves review. Our system is complex and every change can have potential side effects. We expect everyone to work hard and think things through but there will be times when implications are missed. Every change should be proposed and receive at least one LGTM. For complex changes, two reviewers are better. On some teams two is the minimum number of LGTMs, three for complex changes. Coders and reviewers share responsibility for failures associated with a change; this encourages careful review. Sometimes a feature fails unexpectedly because something it depends on has changed or broken. We should all take ownership for failures and unexpected problems for customers, rather than blaming a few people.

  • Rules are meant to be broken. Sometimes the code has to break the rules. Maybe it’s five times easier to understand if your comment contains a table 150 characters wide. Nasty abstractions to try to enforce "DRY" can end out far worse than a simple duplication of code!

Interacting with people

  • Voice your opinion with precision and respect. Always share what you have to say but leave room for another opinion. Always explain your reasons. Ultimatums kill conversation. Phrases like "This will never work" and “This is stupid” are lazy and imprecise. Say “I think this will be problematic in the case that ... because … “. Don’t say “This is wrong”; say “I think this is wrong because…”. Don’t say “Is this really needed?”; ask “Why is this needed?” Don’t say “I am not open to anything else” or “There is nothing to discuss”. See the point about vetoes instead.

  • Be kind. The written form is hard. What you intend with your words may not be obvious to the reader. Make the effort to explain your reasoning clearly. Be quick to apologise if you haven’t made a good job of explaining. Assume misunderstanding rather than malice. When in any doubt about communicating your idea, get on video or voice chat.

From Being Kind:

Being kind is fundamentally about taking responsibility for your impact on the people around you. It requires you be mindful of their feelings and considerate of the way your presence affects them. It isn’t the same as being nice. It isn’t about superficial praise. It doesn’t mean dulling your opinions. And it shouldn’t diminish the passion with which you present them.

  • Thank people. Say it when someone has done a good job. Take the time to include some detail on why you think it was a good job to make it sincere and specific.

(Beware of the bad audio of the recording: How to Delegate, Like a Boss)

  • With power comes responsibility. You have the power to veto. A veto or in other words a -1 is a strong call. Only use it as a last resort when you are 100% convinced that a certain change should not be made. Don’t use it if you only disagree or do not like a change. Beware that a veto will kill progress of the issue/change and won’t be overruled unless retracted, so be conscious of the seriousness of a veto.

Use wise words to explain your objection. Your veto must come with a technical reason. Be prepared to discuss and explain. The vetoed change is guaranteed to be seen as good by the persons who proposed it and they deserve that discussion. Of course, they also deserve the chance to convince you of their reason. In that past, such healthy discussion actually ended with the a veto being pulled back.

  • Empathy for passion. Some of your coworkers have unlimited passion. Unfortunately it doesn’t always come with unlimited patience. It’s always good to change communication channels if discussions go sideways. Face to face communication is crucial in such a development environment. For example, some of our code cleanup efforts last several months. Think of settings refactoring or removing google guice. If you argue on an issue, keep in mind that the other person might have spent months on this and you may be missing something in the big picture. It’s enormously hard to ensure both people are on the same page! If in doubt do /zoom on slack or go on aOn.

  • Empathy for the pressured. You will face situations where an argument goes sideways. You will see situations where people do not use a polite voice. Don’t accept it but try to look behind the scenes, speak about it, and forgive.

  • Report abusive comments to our Code of Conduct team. If you see abusive comments, even if you are not part of the conversation, please report it (push back). Don’t fuel a discussion, and don’t provide a forum for more abusive comments. End the discussion and notify others to back you up. If you don’t feel like doing this, it’s fine to ping others directly to do this for you. Use github emoji (and reaction emoji) to amplify pushback from others. Then move on with technical arguments. Be the bigger personality and help others to reduce the impact of abusive or aggressive comment.

  • Ask questions if in doubt. There are many complex areas in Elasticsearch. If you are in doubt or you are not sure how to fix a certain problem, if you don’t even know how to approach the problem or you are stuck, go and get help. Having zoom sessions to explain the problem to others might even help in realizing that the solution to the problem is a totally different one.

  • Solve conflicts by speaking to each other. Accept decisions, even if not yours and move on. We are all passionate and opinionated people. This is what makes us good at our job and moves the code forward. It also means we will not always agree. Talk things through and try to see the other side. Almost always there is also a third way that will make both parties happy. In the worst case, there will be times that consensus is not reached and leadership has to make a call. Disagree and commit. Remember: nothing is final and things can be changed if they have been proven wrong.

Inspired by:

Zen of Python

Contributor Covenant

Amazon’s Leadership Principles

Rust’s Code of Conduct, Rust video on Conduct