
AUDIOBOOK.md


A word from the author

Thank you for acquiring this audiobook. You will soon know all there is to know to successfully start or pursue your digital transformation with DevOps.

Some parts of the original book were adapted to be better understood when read aloud by a narrator. If you know GitHub, this book also has its own repo: if you spot an error or want to support the author, leave an issue or a star.

I wish you very pleasant listening.

Acknowledgments

Dear colleagues, dear friends,

There are those adventures that could not succeed without the rigor and dedication of a few. You have been those few. By agreeing to read and reread this manuscript, and by sharing your constructive criticism and informed suggestions during our enriching discussions, you have largely contributed to its success.

Through your expert eyes, I was able to sharpen every line, polish every word. Your support has been a pillar, a silent and unwavering force. Many thanks for your time, expertise and camaraderie.

It is because this writing is also partly yours that I wanted to mark it with the names of the most tireless contributors. However, in order to preserve your anonymity, I have chosen to only include your initials.

Thank you B-A, F-T, M-R, F-C, F-P, N-K, A-B, A-C, N-P, N-M, C-H, E-L, P-B, A-N and T-F. Yea, that's quite a lot of people.

Disclaimers

This book includes numerous references and web links to people, products, companies and organizations. The opinions expressed in this book are those of the author and do not in any way reflect the opinions of the entities mentioned.

The author has no affiliation with any companies mentioned in this book, whether by partnership, sponsorship, or any other arrangement. Any mention of a company or product is strictly informative and should not be construed as promotion in any way.

Transparency seems essential to me in any research and writing work, and I want my readers to be informed of my lack of affiliation with the organizations cited in my work.

Introduction

The constant evolution of technology requires organizations to reinvent themselves. They are forced to respond ever more quickly - and often without increasing resources - to their operational requirements. Strategists are mobilizing to stay ahead of the curve in the face of ever-fiercer competition.

Many organizations have already started their digital transformation to master the complexity of interdependent and fragmented information systems. DevOps is one of the approaches to achieving this goal and working more efficiently.

Appearing in 2007, this cultural and organizational movement allows an organization's stakeholders to work more effectively to achieve its objectives more quickly.

Thanks to several theorized methods, DevOps constitutes a means of responding to this efficiency challenge. Each aims to improve the relevance and reliability of the services offered by the organization. To enable it to be more agile, DevOps takes full advantage of Cloud technologies: mostly open, proven, standardized, and attractive.

According to the consulting and research company Gartner, more than 85% of organizations will adopt a Cloud strategy by 2025. Atlassian's survey revealed that 99% of companies believe DevOps positively impacts their organization.

Several initiatives to create sovereign Cloud platforms are taking shape around the world. This is for example the case of MeghRaj in India (2014), Bundescloud in Germany (2015), JEDI in the United States (2017), Nimbus in Israel (2020), GAIA-X in Europe (2020), the Riigipilv in Estonia (2020), Outscale, Athea, and S3NS in France (2010, 2017 and 2021), the Government Cloud in Japan (2021), and the National Strategic Hub in Italy (2022). At the heart of these infrastructures, there is unanimous agreement on an organizational structure to unify practices and orchestrate these technologies: DevOps.

In the private sector, where it is more widely used, the major cloud providers (Amazon Web Services, Google Cloud Platform, Microsoft Azure, Alibaba Cloud) practice this organizational model internally, promote it, and provide the technologies to adopt it.

South Korea has historically favored the use of private cloud technologies, especially since its 2015 law that facilitated outsourcing. Owing to multiple overlapping investments, outdated information systems, and a shortage of cybersecurity experts within the country, South Korea established national data centers in 2007. These centers now accommodate the information systems of 45 government agencies. In the wake of the COVID-19 crisis, in 2021, the country announced an ambitious digital transformation plan for its administration: the Digital Government Master Plan 2021-2025. This strategic plan introduces a technical framework named eGovFrame designed for the development and management of government information systems. One of its primary objectives is to enhance their interoperability, and it intrinsically incorporates DevOps principles.

In an effort to regain sovereignty, other governments display a clear desire to adopt these technologies and practices, without necessarily describing their initiatives in public. These desires take shape within documents mentioning the Cloud, AI, or data strategy of the countries.

For example, Canada released its "Blueprint 2020" report in 2013 to modernize the manner in which public services operate. It later released the "Cloud Adoption Strategy" in 2018.

In the UK, the Ministry of Defence announced in 2022 its intention to become an "AI-ready" organization, in its "Defence Artificial Intelligence Strategy". In the way it describes its transformation, it perfectly captures the essence of DevOps.

UK Ministry of Defence quotes: « We must change into a software-intensive enterprise, organised and motivated to value and harness data, prepared to tolerate increased risk, learn by doing and rapidly reorient to pursue successes and efficiencies. We must be able to develop, test and deploy new algorithms faster than our adversaries. We must be agile and integrated. »

As early as 2018, the UK Ministry of Defence launched the NELSON program to equip itself with a big-data platform for the benefit of the Royal Navy. This technical environment, based on Cloud technologies, also incorporates DevOps practices.

Across the Atlantic, the United States had already recognized the need in 2011 to manage information in a unified and agile manner, accessible via a single access point. The Department of Defense (DoD) outlines this vision in its "DoD I-T Enterprise Strategy and Roadmap".

United States Department of Defense quotes: « Twenty-first century military operations require an agile information environment to achieve an information advantage for personnel and mission partners. To meet this challenge, DoD is undertaking a concerted effort to unify its networks into a single information environment that will improve both operational effectiveness and information security posture. »

In 2019, it published its first reference guide for the industrialization of DevSecOps practices: a methodology emphasizing security. Aimed at providers, buyers, and managers of modern information systems, this institutional guide describes best practices for the implementation and maintenance of such systems. The stated goal is to deploy software at the "speed of operations". In the economic environment, the parallel is that of the "speed of stock markets".

In the private sector, Microsoft historically launched its new products every 3 to 4 years (e.g., Windows or Office). As early as 2014, its CEO Satya NADELLA warned his teams about the risk posed by the long duration of this development cycle. By continuing with the same organizational model, Microsoft would become obsolete. The teams responsible for developing each product worked independently from one another, with their own organizational methods and their own tools. NADELLA reorganized the company based on the DevOps methodology. He unified the tools and practices of the teams so that they would interact with each other.

Faced with increasingly aggressive economic or military competitors, transformation is an imperative necessity to stay in the race and prevail in the next confrontations. For institutions, it is no longer a question of "if" but "when" they will need to embark on a transformation journey, or risk being left behind.

However, the majority of organizations still struggle to pragmatically implement these new practices. The main obstacle is finding the talents capable of implementing the techniques and tools suitable for DevOps operation.

There are numerous studies to refer to on DevOps, which is primarily a topic of cultural transformation for technical and management teams. These studies draw upon the experiences of many players and allow us to avoid common mistakes in a transformation approach.

For instance, Google Cloud's DORA research program (DevOps Research & Assessment) has been conducted since 2014 with over 33,000 professionals in the Cloud sector. Each year, its report on the state of DevOps worldwide is published. Therefore, this field is far from new, and the initial risk is now much more moderate for newcomers. However, the industry continually finds ever-more effective ways to transform, to keep pace with the rapid advancements in the digital sector.

This book aims to demystify both the organizational and technical aspects of DevOps. These concepts are accessible to all and will provide you with an overview of the Cloud's challenges for a successful transformation. It offers guidance for a first-time DevOps experiment or to refine an ongoing transformation.

We will explore the reasons for the emergence of this methodology, its content, and how to inspire your organization to transform. Every organization has its own needs, its own maturity level, and there is no one-size-fits-all solution. Nevertheless, the industry's successive experiences have led to the creation of standards that will be presented throughout this book.

The experience of pioneering companies now ensures that efforts invested in DevOps will make your organization more efficient, agile, and sustainable.

The five pillars of DevOps

According to the renowned American company Atlassian, the DevOps movement was born between 2007 and 2008. It was a time when software development professionals (those who develop) and system administrators (those who deploy) were each concerned about their poor ability to collaborate. They viewed this situation as a critical dysfunction, stemming from their lack of closeness.

Initially, DevOps focused on how to improve the efficiency of software development and deployment. More than a decade later, this methodology has evolved to address other areas such as security, Cloud infrastructures, and corporate culture. Around 2015, the DevOps methodology was primarily employed by major American tech companies (GAFAM and NATU) or businesses that were already using the agile methodology.

Now widespread, the DevOps methodology is used by organizations of all sizes worldwide and across various sectors (healthcare, finance, transportation, government, heavy industry...).

The term DevOps is attributed to Belgian engineer Patrick DEBOIS. As a consultant in 2007 for the Belgian government, he was tasked with migrating a data center. After spending significant time discussing with developers and system administrators, he observed what renowned engineers Andrew CLAY SHAFER and Lee THOMPSON would theorize two years later as the "wall of confusion". This metaphor can be summed up as stakeholders not understanding each other.

The community coined a term for a real phenomenon that hinders communication and collaboration between teams, resulting in inefficiencies and delays. This led DEBOIS to co-author "The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations", published in 2016. In it, the authors describe how organizations can increase profitability, improve their corporate culture, and surpass their goals using DevOps practices.

Google theorizes five pillars of DevOps:

  1. Reducing organizational silos: by fostering engagement and sharing the sense of responsibility among stakeholders in both successes and failures (engineers, project managers, users/business teams). Everyone should feel involved and validated at their level.
  2. Embracing failure: with the understanding that failure is a result of the organization's lack of procedures and methods.
  3. Reducing the cost of change: by implementing changes incrementally, deploying quickly and failing quickly to iterate.
  4. Leveraging automation: by automating to save time and improve the maintainability of the infrastructure.
  5. Measuring everything: by establishing performance indicators and system reliability metrics, to better understand the behavior of deployed services and to respond to incidents more quickly or even predict them.

DevOps vs Site Reliability Engineering

To fully understand how DevOps can benefit your organization, let's start by defining two of the most important terms in the field: DevOps and SRE.

DevOps

It bridges the gap between development and production.

"Dev" stands for "development", while "Ops" refers to the administration of I-T systems in production.

"DevOps" - which means Development and Operations - denotes the organizational and cultural movement aimed at streamlining the software development and deployment cycle.

To achieve this goal, engineers practicing DevOps are tasked with facilitating communication and collaboration among stakeholders. Stakeholders are the developers, system administrators, security teams, project managers, and users.

They identify the most relevant I-T practices and tools for an organization and study their implementation. As a team, they ensure the alignment of developments with deployment requirements. Today, these professionals primarily focus on the use of Cloud technologies.

DevOps practices span the entire technical chain, emphasizing automated mechanisms for development (e.g., continuous integration), deployment (e.g., continuous deployment), and maintenance (e.g., monitoring). Both internal teams and clients benefit: internal teams collaborate more effectively and securely, while clients receive higher-quality software more promptly.

This role involves the responsibility of aligning all stakeholders on a common working method. Hence, possessing strong communication and teaching skills is vital, especially in transforming organizations.

DevOps engineering aims to make the entire organization aware of system reliability issues. The most experienced engineers can establish practices that meet resilience requirements without affecting development velocity.

The main challenge lies in striking a balance between complexity induced by reliability and security requirements and the need to develop new features.

In the next part of this book, we'll see that the implementation of DevOps is unique to each organization. To reach these goals, methods and tools adjust according to the organization's technical maturity level. There's no "one-size-fits-all" approach, but there are "best practices" to know and follow.

Just as there isn't a single recipe, there isn't a unique "DevOps engineer" profession. We'll touch upon this topic in the chapter "Between SRE and DevOps".

While the term DevOps is becoming increasingly popular and is starting to appear commonly in job offers, Site Reliability Engineering (SRE) is less well-known, particularly in France.

The written book shows here a Google Trends graph from 2014 to 2022 illustrating the popularity of the terms DevOps and Site Reliability Engineering among countries.

Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is an older discipline than DevOps. It traces back to 2003 when Ben TREYNOR SLOSS, then an engineer at Google, founded a team bearing this name. He is recognized as the founding father of SRE and of the earliest practices considered as "DevOps".

The Site Reliability Engineer is responsible for designing, deploying and maintaining the infrastructure that makes the company's services available. They keep the technical base on which the software is deployed running properly, secure the services and guarantee their availability to customers.

The SRE team, therefore, is responsible for your I-T infrastructure, typically comprising several environments: development, testing, pre-production (or staging), and production.

SREs employ software engineering practices to manage their infrastructures. They develop and deploy tools aimed at achieving a resilience objective. In this regard, SRE encompasses many facets of DevOps, but focuses on the automation of administration, as well as the measurement of system reliability.

Companies primarily hire them to honor their service contract (Service Level Agreement). In the private sector, if service availability drops below the value stipulated in the contract (e.g., below 99% monthly availability), the company is obligated to pay penalties.
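
To make this availability threshold concrete, here is a minimal sketch of how a monthly availability figure translates into allowed downtime, and how a breach could be detected. The 99% target comes from the example above; the 30-day month, the function names and the sample downtime values are illustrative assumptions, not the terms of any real contract.

```python
# Illustrative only: the 30-day month and all names below are assumptions.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Maximum downtime per month that still honors the SLA."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


def measured_availability(downtime_minutes: float) -> float:
    """Availability actually delivered over the month, as a percentage."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_MONTH)


def sla_breached(downtime_minutes: float, sla_percent: float) -> bool:
    """True when delivered availability falls below the contractual value."""
    return measured_availability(downtime_minutes) < sla_percent


if __name__ == "__main__":
    print(f"Allowed downtime at 99%: {allowed_downtime_minutes(99.0):.0f} minutes")
    print(f"Breached with 500 minutes down: {sla_breached(500, 99.0)}")
```

Under these assumptions, a 99% monthly target leaves roughly 432 minutes of downtime, a little over 7 hours, before penalties would apply.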

In simpler terms, companies task SREs with making their infrastructure more resilient, meaning ever more available and stable. SREs seek to answer the following question: "what are the things (tools, procedures, machines) that we don't have, but need to achieve our resilience goal?"

DevOps practices are an excellent means to reach this goal, which is why SREs often employ them in their daily work.

Between SRE and DevOps

Definitions vary depending on who you ask. While some leaders like Google and AWS officially define DevOps as a "methodology" and the role of SRE as its "implementation", most job listings still use the title "DevOps Engineer": a title that is incomplete in the strict sense of the historical definition.

The fact is that both disciplines have evolved and overlap in many areas today: they share the goal of rapidly deploying reliable and efficient software.

However, they don't focus entirely on the same things. While DevOps leans more towards development efficiency and deployment speed (things such as CI/CD, automated tests or cross-team collaboration...), SRE focuses on system reliability, adopting a more methodical approach (things such as SLI-SLO-SLA, blue/green deployments or postmortems...).

Today, you might find "DevOps Engineers" who don't do SRE, but the reverse is rare. As DevOps is a philosophy, this term should be used as an adjective. For instance: "DevOps Software Engineer" or "DevOps System Administrator".

However, let's see what the market has to say. By observing job listings in the field, it's noticeable that those titled "DevOps Engineer" involve a wide range of tasks. They might be:

  • Development-oriented, with roles such as software engineering, system engineering or quality assurance engineering.
  • Operations-oriented, with roles such as system administration, Cloud engineering or network engineering.
  • Oriented towards both, with roles such as SRE, automation engineering or platform engineering.

In reality, all these roles enable the practice of DevOps. However, the presence of each within an organization depends on its maturity and resources.

In summary, it's said that SRE utilizes DevOps methods. DevOps and SRE are neither opposing nor identical methods, but two disciplines that will help you break down barriers between your teams. This will allow you to deploy higher-quality services faster and more securely.

In this book, you will discover best practices from these unified disciplines, tailored to institutions.

DevSecOps

The term DevSecOps is gaining in popularity. It describes a DevOps organizational approach that integrates Information Systems Security (ISS) teams from the software design phase and throughout its lifecycle.

More specifically, it ensures compliance with security standards set by the organization, using automated rules that verify the compliance of developed software.

Perhaps you've heard of "shift left security"? This term emphasizes the importance of incorporating security efforts into a software project as early as possible (best practices, vulnerability analyses, audits).

Organizationally speaking, this method places the ISS teams at the heart of the exchanges between developers and production teams. These teams will support the developers to integrate the organization's security requirements into their software as seamlessly as possible.

From the design phase, the DevSecOps ISS teams define and provide tools that monitor the presence of privacy and security features in the software. For example, they will check for GDPR functionalities in a piece of software or the proper functioning of the "need-to-know" mechanism for data access. This can also include the implementation of automatic vulnerability detectors in the code.

Nicolas CHAILLAN, former Director of Software Engineering at the United States Air Force, defines it this way:
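
As an illustration of what such an automated check might look like, here is a deliberately simplified sketch that scans source code for hardcoded credentials. The two rules and every name in it are hypothetical; real scanners rely on much larger, curated rule sets, and this is not the tooling of any specific organization.

```python
import re
from typing import List, Tuple

# Hypothetical rules, kept tiny on purpose for the sake of the example.
SECRET_PATTERNS = {
    "hardcoded password": re.compile(
        r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
    "private key header": re.compile(
        r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"
    ),
}


def scan_source(source: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line matching a rule."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


if __name__ == "__main__":
    sample = 'user = "alice"\npassword = "hunter2"\n'
    for number, rule in scan_source(sample):
        print(f"line {number}: {rule}")
```

A check of this kind would typically run automatically on every code change, so that developers get feedback before the software reaches production.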

"DevSecOps is the evolution of software engineering. It's the balance between development velocity and the time allocated to security considerations. We want security to be integrated to ensure it's not overlooked but added to the software development cycle. It's about using modern cybersecurity processes to ensure the software is both efficient and built securely, ensuring it remains problem-free over time. This is what will allow companies and organizations to remain competitive and move forward at the necessary speed against their competitors."

Today, the term "DevSecOps" is often favored with the sole aim of making the discipline more attractive. However, it can help Security teams and their managers understand they have a concrete role to play in this type of organization. It's the "Sec" in the middle of "DevSecOps".

Author's Note: I consider security to be inherent to any information system, so I see the "Sec" in "DevSecOps" as implied. That's why I will rarely use this term throughout this book.

We will discuss the paradigm of this organizational structure and its security techniques in the following chapters. But before that, let's learn more about the organizational challenges of DevOps.

DevOps in Practice

A DevOps initiative is a significant transformation for an organization. If it hasn't yet transitioned to agile mode, it involves every layer of the company to foster shared synergies.

DevOps doesn't just bring together "Devs" (the software engineers) and "Ops" (the system administrators): it also, and primarily, involves the "management layer". Management needs support in understanding the opportunities presented by a change that's often perceived as challenging because it's unfamiliar. In most cases, this transformation requires a significant evolution of the organization's I-T systems in the long run, as it involves the adoption of new tools.

Empathy is the key skill for a successful transformation. For some, these new work methods and tools are in stark contrast to their traditional practices.

That's why it's crucial to frequently educate the hierarchy on the benefits of shifting to DevOps: demonstrate it to them, answer all their questions, and support them until they fully grasp its implications.

Every organization benefits from addressing new technological challenges. In the face of ever more modern and swift competition, your entity will not dominate by resting on its laurels.

Why DevOps?

Research reports support the theory that the benefits of efforts invested in SRE become apparent in the medium term.

According to these reports, practicing SRE does not affect an organization's resilience until a certain maturity level has been reached. This means that one needs to achieve a critical mass before being able to reap the benefits of these tools and practices.

The DORA 2022 report highlights the need to adopt a substantial number of SRE practices before harvesting "significant" resilience benefits. This phenomenon can deter decision-makers from transitioning to a DevOps model.

What confirms the value of the approach is that the benefits yielded by DevOps exceed its costs once the initial investments are made.

This trend is precisely where DevOps shines: even though traditional infrastructures initially require little investment to provide a service, the cost of maintaining them increases proportionally to the number of services deployed. This makes their management unsustainable in the long run. DevOps, on the other hand, requires a higher initial investment but provides the ability to manage exponential activity with a logarithmic cost trend.
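
This cost argument can be illustrated with a toy model. Every coefficient below is invented purely to show the shape of the two curves; none of them is a measurement, and the function names are assumptions made for this sketch.

```python
import math
from typing import Optional

# All coefficients are hypothetical and only illustrate the two trends.


def traditional_cost(services: int, setup: float = 10.0,
                     per_service: float = 5.0) -> float:
    """Low setup cost, then maintenance grows in proportion to service count."""
    return setup + per_service * services


def devops_cost(services: int, setup: float = 100.0,
                scale: float = 20.0) -> float:
    """High setup cost, then cost grows logarithmically with service count."""
    return setup + scale * math.log(1 + services)


def break_even(max_services: int = 1000) -> Optional[int]:
    """Smallest service count at which the DevOps model becomes cheaper."""
    for count in range(1, max_services + 1):
        if devops_cost(count) < traditional_cost(count):
            return count
    return None


if __name__ == "__main__":
    for count in (1, 10, 100, 1000):
        print(count, round(traditional_cost(count)), round(devops_cost(count)))
    print("Break-even at", break_even(), "services")
```

With these made-up numbers, the DevOps curve starts well above the traditional one and ends far below it. The crossover point depends entirely on the coefficients chosen, which is why the initial investment phase can be hard to justify.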

This organizational mode aims to make infrastructures more reliable, reduce manual tasks to make the most of engineers' time, deploy software faster, and ultimately, deliver a better-quality service.

DevOps to traditional infrastructures is what assembly line construction is to craftsmanship: by constructing on an assembly line, costs are reduced, and demand is met. The added advantage in software is that one can adjust the product to be delivered within a few hours. This action can be repeated several times a day!

While historic practices deserve credit for running information systems for years, more agile methods exist today. To use a military analogy: bows and arrows served their purpose, but since then, armies have invented the AR-15.

The challenge of transformation is to get your hierarchy to buy into this initial investment, even when the benefits might initially be hard to see. This is a common challenge that we will address in the chapter "How to convince and keep the faith".

Skeptics and over-optimists

Companies are typically aware of the changes they need to implement. However, they either hesitate or are unable to immediately commit to the necessary efforts to achieve this transformation.

The most skeptical or overly optimistic believe they can get by with a low-cost initiative:

They say: "I only need one SRE-DevOps engineer."

Sorry, but no.

Let's use an example to illustrate this scenario. You start with a 2-person team developing software. Several issues are already identified, especially if you're operating in a regulated sector:

  • Who sets up the infrastructure to properly develop this software? (the software factory, dependency mirrors or library registries)
  • Who secures this infrastructure?
  • Who handles backups?
  • Who defines development rules and ensures their consistency for maintaining software over time?

If you rely solely on your software engineers to manage infrastructure, they will inevitably increase technical debt since it's not their core competency. This debt incurs costs and maintenance efforts, which worsen as your team grows. Developers won't focus on development but will spread themselves thin on tasks meant for an SRE. This scenario already calls for at least 1 SRE/DevOps engineer.

What if you hired more and now have a team of 6 engineers? They need machines set up and configured. Some encounter bugs, others request library updates... If there are security mandates (e.g., approvals, event logs), time must be spent setting up tools and infrastructure properly. This calls for at least one additional SRE/DevOps engineer.

Two engineers leave your company? Unfortunately, you still have to maintain the evolved infrastructure for your remaining 4 engineers and all the machines or servers you've set up.

Understand that you need to achieve a critical mass of SRE/DevOps profiles to sustain a robust foundation. This foundation enables your engineers to have the necessary tools to work efficiently. This critical mass should adjust based on staff size and can't be reduced without facing significant operational challenges.

The debate often circles back to "quality or quantity?". The history of armed conflicts shows that both are often necessary. Armies need a critical mass of soldiers and equipment to establish favorable power dynamics and compensate for losses to keep advancing. Though high-quality equipment can accomplish more, it can't handle everything simultaneously. The same goes for an engineering team. No matter how brilliant, a critical size is needed to meet the basic requirements for an efficient and resilient service.

For instance, Google, with tens of thousands of engineers, maintains its SRE-to-developer ratio at about 10%. This ratio and its associated costs are high at the outset of your initiative but tend to level off as the number of deployed services grows. This is due to the strong infrastructure needs at the start of your initiative, which decrease as administrative tasks become automated.

It's proven that transitioning a traditional structure to DevOps demands significant investment. Establishing the foundation to reap its benefits also takes time. However, remember that the essence of DevOps practices is to manage exponential growth with logarithmically trending costs.

Too big, too soon

The failure of a project is often due to a poorly defined scope, with overly ambitious objectives or unclear planning. This mismanagement leads to uncontrolled increases in timeframes and costs. It then becomes common to seek an "interim solution" while hoping that the initial plan might eventually come to fruition.

A DevOps initiative is built upon what already exists within your institution: the key is to start small to accurately understand the needs of the business and to bring the entire organization on board. This approach is the Kaizen method, which originated in Japan during the 1950s within Toyota factories. In France, it's known as the "strategy of small steps".

Dare to start small and iterate as both you and your institution become more familiar with the challenges and nuances of these new technologies. Ensure that each team becomes an advocate for your initiative. We will discuss the theories behind this recommendation in the chapter "How to convince and keep the faith".

Changing the culture of an organization takes time, but taking shortcuts may offend or demotivate your teams and, ultimately, cause your project to fail. Since DevOps is based on the principle of successive iterations, you'll be taking fewer risks.

Initiatives within Organizations

Has your management been convinced by your transformation initiative and granted you all the necessary resources? If so, move on to the next chapter. If not, let's delve deeper to understand why.

It may happen that newly appointed decision-makers ask their subordinates to "quickly" find turnkey solutions to the problems they encounter. Rather than adopting an investigative approach, the urgency of obtaining immediate results leads them to make hasty decisions. After all, a leader is expected to quickly and effectively find cost-effective solutions. Most of the time, however, initiatives - of varying maturity - already exist within the organization.

Technical solutions are easy to design and delegate. Instead of considering historical proposals, purchasing "off-the-shelf" technologies or launching a brand new project may seem more efficient. However, choosing a solution without considering the organization's inherent constraints can be risky.

Moreover, these constraints are often already recognized and have been expressed for years by internal experts. They lead to the birth of projects initiated by employees, in response to observed needs or their own frustrations. Instead of being encouraged to find a solution, these employees are often reprimanded for insubordination. In reality, these projects often get lost in middle management and rarely reach the decision-maker who can sustain them.

Indeed, decision-makers seldom have the time to meet each of their teams. As a result, they tend to favor their own opinions or seek those of their deputies instead of their experts. The resulting decision, therefore, reflects the perspective of a single person, isolated from operational realities. The more layers of management there are, the more pronounced this isolation becomes. This leads to a concentration on poorly researched and non-inclusive projects. Paired with inherently low-impact communication, it inevitably produces frustration within the company.

A prime example is the U.S. Department of Defense. It launched a new DevSecOps initiative named Vulcan, four years after the Platform One initiative, which had the same aim. Beyond causing frustrations within the Platform One teams, the Vulcan program has experienced delays and cost overruns.

In other instances, the skepticism of some leaders leads them to question proposals made by their internal experts. Taken to an extreme, this mindset negates the benefit of hiring experts who interact daily with the company's issues. The external expert (e.g., a consulting firm or a third party) then becomes indispensable, viewed as objective and impartial.

When faced with leaders who do not share our vision, one can express outrage and leave or try to understand their reactions and improve practices. As the leader of an internal initiative, you need to understand the decision-makers' fears: entrusting a groundbreaking project that disrupts organizational practices carries multiple risks.

If your organization is large and longstanding, it's because it has consistently met a need. If leaders come to believe it needs transformation and nothing has been started, the organization might be facing the innovator's dilemma.

Conceptualized by Clayton M. CHRISTENSEN in 1997, this dilemma describes a scenario where a pioneering company, attempting to maintain its competitive edge, inadvertently misses out on a major innovation. A previously unforeseen competitor then offers this innovation and upends the market. For instance, in 2023, Microsoft stunned everyone by integrating a ChatGPT-based assistant into its search engine before Google did. At that time, Google was the dominant player in internet search and was investing billions of euros annually in AI research. How could Google let a competitor get the upper hand?

The answer is simple: the risk for Google - with 84% of search engine market share - of releasing an unfinished product - which might return false information, for example - is much higher than for a startup like OpenAI or for Microsoft's Bing - with 9% search engine market share. This is evident as, at the time of writing this chapter, few online articles questioned the Bing Chat launch compared to Bard, despite having similar issues. In summary: Microsoft had everything to gain, while Google had everything to lose.

That said, Google saw the pitfalls of not taking risks and has been working on a competitor, Bard. To avoid this dilemma, organizations should:

  1. Stay informed. Keep an eye on emerging trends and new customer needs by organizing partner visits, attending trade shows, or consulting experts on market developments. For instance, you can ask your experts to draft a quarterly newsletter for management, highlighting current technological trends.
  2. Regularly re-evaluate their strategy. Using this knowledge, seek new growth opportunities by addressing new use cases. Propose new products and employ new technologies. In an institution, insights are also internal: it's essential to connect with different departments to understand their daily challenges and align innovations to address them.
  3. Encourage risk-taking and experimentation. Motivate teams to propose new ideas and pilot projects to explore new technologies. Reward risk-taking.
  4. Invest in innovation, by allocating sufficient resources for research and development. For instance, grant one day of remote work per week to your experts to explore innovative technology. Provide funding for teams to purchase equipment for experimentation or grant access to a cloud hosting provider.

More practically, if you decide to form your own team, members might leave your organization at any time. Given the depth of the discussions they were involved in, they might leave behind work that is challenging to pick up. That's why many organizations prefer to engage a third party, with a clearly defined scope, ensuring the leader gets a result. We will explore in a later chapter how this approach can have long-term negative consequences for the organization.

Organizational changes always entail a cultural shift that must be managed. This cultural gap can sometimes be too challenging to bridge for the entire organization, indicating it might be too early to introduce your plan. Cultivate awareness through presentations and success stories. Leaders must clearly understand the transformation's impact and associated risks: service disruption, HR strategy changes, staff training, or equipment purchases. Support your leaders in visualizing the transformation as you work on building your evidence.

The rest of this book will address understanding the psychology of change to ensure your project's success.

Chronic Reorganizations

"Another one!" your most loyal team members might exclaim. How many reorganizations has your organization undergone? When overdone, they muddle the message and breed confusion for your teams.

In most cases, technical teams already exist within your organization, already serving business needs that necessitate their existence.

Leaders with limited understanding of business and technical stakes often feel the urge to alter the roles of certain teams. They do this in favor of a new project, based on the current skills present within the team. However, a team always forms around a project that shaped its culture, making it so efficient for the company today. Decision-makers should consider this before thinking of breaking this hard-earned culture by imposing a transformation.

Drastically changing a team's roles is risky, so you should be well prepared to support its members. Often, this isn't the case, as you likely don't have the time. Their current operational methods are the result of several restructurings, which probably already affected their ideals and the reasons they joined your organization in the first place.

Changing a team's roles without considering its culture and history risks losing team members: either they'll be demotivated by your project, or they'll resign. You need to provide them with a clear vision, convince them with solid arguments, but most importantly, involve them.

Given their history in your organization, your teams' knowledge can help you grasp concepts you haven't fully understood yet. Be open to their suggestions and feedback to discern how best to reorganize the team - and only if necessary - based on its aspirations. An excellent way to gain a team's trust and better understand its challenges is to do its job for a few days. This can be done when a decision-maker first joins the organization.

If you believe you don't have the necessary internal resources, don't hesitate to recruit. It's risky to affect the established teams if they're serving a need expressed by your organization. The essence of a transformation is to ensure service continuity while changing its practices.

Be more nuanced than announcing a "major transformation plan." Such practices invariably frustrate many team members, fail to gain the support of all your teams, and risk undermining your credibility. They can also make you a hostage to your predecessor by associating you with past failed transformations.

As discussed in the chapter "Too big, too soon," adopt a step-by-step strategy and gradually develop your intuition about who needs to be reorganized. Gain team buy-in by showcasing the realm of possibilities to inspire them. Then let them convince their peers on your behalf. We will delve deeper into these strategies in the chapter "How to convince and keep faith".

Refusing technological lag

"It's normal, we'll always be behind here."

If this statement sounds familiar to you, it probably evoked a sense of dismay.

It's understandable for a company to face delays due to its size, resources, and safety requirements. However, organizations must not tolerate such lag. Under no circumstances should the statement "it's normal here" become an acceptable response.

If the speaker is genuine, this mindset merely stems from a lack of knowledge about how to achieve the goal. Otherwise, it might indicate a lack of courage or even intellectual laziness.

If the majority of an organization's employees believe that it is behind, there is a severe issue at hand. Maintaining the status quo in such situations inevitably leads to the organization's decline and an irrevocable loss of credibility among its employees and partners.

In one of his articles, speaker and transformation expert Philippe SILBERZAHN gives the example of a man waiting for his train scheduled at 9:30. The screen reads "On Time," but it's already 9:35 on his watch. The man thinks about photographing the sign but wonders, "what's the point?". Many observers would downplay this five-minute difference, express irritation, or simply blame a display malfunction. "After all, no one can do anything about it," they might conclude. Philippe SILBERZAHN argues that this is how organizations decline: they grow accustomed to mediocrity.

While initially considered unacceptable, the malfunction becomes increasingly acceptable to the organization over time, without anyone realizing the cost in time and money. The effort to rectify the issue becomes less and less justifiable, and silence becomes the default choice to conserve energy. That is, until an irreversible situation arises, or until a few brave souls shake up the structure!

However, it's essential to know when to unveil innovations. Preston DUNLAP, the first Chief Technical Officer (CTO) of the USAF, describes in his public letter Defying Gravity how "bureaucratic forces" can hinder an innovation that is revealed prematurely.

"Some have asked me what my recipe for success was over the past three years. I haven't spoken much about it because I knew that if I revealed the elements too early, the natural forces of bureaucracy would come back stronger, rejecting at every turn all the potential of innovation." - Preston DUNLAP, Defying Gravity.

To prevent technological lag, organization leaders can adopt several practices:

  • Continuously train staff, including decision-makers;
  • Maintain an internal innovation capability to stay critical;
  • Accept controlled risk-taking and promote open communication;
  • Measure and implement indicators to avoid complacency.

Prerequisites

Even the best-designed service won't help your organization unless you provide easy access, uninterrupted service, and support. DevOps will enable you to structure and maintain this source of value.

This book doesn't even require your team to be especially large, nor does it require your leaders to already be convinced. However, it does require your team to be convinced that they can drive the project forward. Of course, over time, support from other teams in your organization will become a valuable argument to showcase the success of your initiative.

A leader only asks to be convinced by an initiative from their subordinates. Help them visualize and understand the added value of what you're proposing.

This will require you to regularly present the progress of your project: both so they remember and so they understand. It's always risky to assume a project is understood after the first presentation, especially when introducing a new paradigm.

Plan to set up an internal team: there will always be bugs to fix, configurations to adjust, and features to add. Whether your software is developed internally or by a contractor, you'll face the phenomenon of software erosion. This refers to the issues software may face over time when left unattended. For instance, critical security updates may go unapplied, disks may fill up, or processes may stop working.
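To make software erosion more concrete, here is a minimal sketch of the kind of routine check an internal team might automate, limited to the "full disk" case mentioned above. The function names and the 90% threshold are illustrative assumptions, not a recommendation from this book; a real team would more likely rely on a dedicated monitoring tool.

```python
import shutil


def disk_usage_ratio(total_bytes, used_bytes):
    """Return the fraction of the disk in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def erosion_warnings(total_bytes, used_bytes, threshold=0.9):
    """Return a list of warning messages (hypothetical helper).

    The 0.9 threshold is an arbitrary illustrative value.
    """
    warnings = []
    ratio = disk_usage_ratio(total_bytes, used_bytes)
    if ratio >= threshold:
        warnings.append(f"disk is {ratio:.0%} full (threshold {threshold:.0%})")
    return warnings


if __name__ == "__main__":
    # shutil.disk_usage returns a (total, used, free) tuple in bytes.
    usage = shutil.disk_usage("/")
    for message in erosion_warnings(usage.total, usage.used):
        print("WARNING:", message)
```

Run on a schedule, a check like this turns a silent failure into a visible alert. What matters is less the script itself than having someone in-house who owns it and acts on the result.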

Don't think that a contractor can solve all your problems: you'll lose money and won't achieve your goals. The result of a contractor will only be the product of your ability to synthesize your challenges. Yet, during a transformation phase, you'll discover new issues every week. Unlike you and your team, the contractor probably won't be continuously present in your organization to capture all stakeholder challenges.

Starting your DevOps initiative means planning to recruit several profiles:

  1. A team leader whose engineering skills are recognized and who has excellent communication abilities.
  2. Software engineers who will develop solutions to business or user needs.
  3. SRE/DevOps engineers who will develop your foundation and manage the software development/deployment cycle.

Whether you're a senior manager or a mission officer aiming to enhance the services your organization offers, you will need to justify your initiative to your superiors and the rest of your organization. It's therefore essential to understand how to communicate effectively so everyone buys into your project. Let's explore some strategies for doing this in the next chapter.

How to convince and keep faith

First and foremost, it's not about convincing. You can't just walk up to someone and say, "you're wrong, I'm right." Instead, you need to inspire your audience to align with your vision or project. In this way, they'll be convinced on their own.

Gaining the support of your superiors or colleagues for an initiative isn't always straightforward. William MORGAN - the leader of a renowned tech startup - recommends 4 rules to follow:

  1. Identify who is affected (these are the stakeholders);
  2. Determine what the new solution will bring them (these are the benefits);
  3. Understand what their concerns are (these are the worries);
  4. Alleviate concerns, highlight the benefits, and communicate.

According to William MORGAN, once you reach a certain level of technical engineering, the roles of "salesperson" and "engineer" become indistinguishable. He says that "Advanced engineering work is indistinguishable from sales work."

Here's how these rules could be applied to security and management teams:

  • For security teams, the proposed technology might automatically manage and audit the encryption of flows between services. Their primary concerns could be: "Will this technology make my infrastructure more secure?" or "What new attack vectors could this technology introduce?"
  • For management teams, the proposed technology might speed up the development pace and reduce service interruptions. Their main concern would be understanding the hardware or human resources the company would rely upon after implementing this new technology.

The theory of mental models helps us better understand the decision-making process. Each individual perceives a situation through their own mental model. Transformation, then, is about collectively agreeing on an alternative mental model.

Even though DevOps might be backed by studies and is evident in the private sector, institutional initiatives are still not widespread enough. Therefore, you find yourself in a position where you're certain about the direction to take, but you're not fully able to justify it with data or examples. Presented with your forward-thinking transformation proposal, the decision-maker thus faces a risk. And as a matter of survival:

  • "It's better to be wrong with the group than to be right against the group."

To assist the decision-maker in making their choice, you need to work on minimizing this risk. But how? The idea is to rally early adopters to your cause without announcing it to the collective.

  • "The first one to step forward takes a massive risk. The 150th takes none."

Besides enhancing your value proposition, you'll have examples to reference and support: you won't be the "first" taking the risk, and neither will your organization.

Act with finesse

General LAGARDE once said: "Initiative is the most refined form of discipline."

Operating your project without telling anyone in your organization requires understanding the potential repercussions. Even though you may want to make improvements in good faith, you might misjudge the overall situation of your organization. Thus, your project could disrupt established power dynamics, making you undesirable in the eyes of some.

For instance, a team lacking resources comes to you for help. Seeing their distress, you design a brand new tool for them quickly using your DevOps platform. You choose not to inform your superiors, fearing they might reject this innovation.

What you don't realize is that the team you're supporting hasn't been doing the work required by management for several weeks. While the leaders are trying to rebalance the situation, a new player (your team) suddenly starts doing favors for the delinquent team.

Upon hearing the news, the leaders find themselves in an awkward position: they appreciate the support you provide but resent you for meddling in their affairs.

And thus, your initiative gets caught in a vicious cycle. On one hand, your team sees no harm in helping and stops reporting to the management team. On the other hand, the leadership gives up on collaborating with you and trusting you.

The problem is primarily cultural: the organization isn't trained to support innovation, making it challenging to innovate. Innovators must then find indirect ways to make a difference. On the flip side, innovators are often not well-versed in the structures where they are asked to innovate. This highlights the need to train these profiles so they better understand how the organization operates. By implementing the 5 pillars of DevOps, you will help your organization transform its culture and promote innovation.

Therefore, make sure you fully grasp the political dynamics between the leadership team and your initial experimenters before acting covertly, or you risk complicating your progress.

Approaches in Facing Opposition

Keep in mind that if things are the way they are today, there are valid reasons for it: you might not have a comprehensive understanding of these past reasons, and it's not your role to blame those involved. These reasons may include the time allocated to projects, HR or financial resources, and power plays.

Also, be aware that during a transformation, leaders must continue to deliver the same services as before. Decision-makers then have to manage the transforming environment in parallel with the current one, ensuring the former doesn't overshadow the latter.

Furthermore, don't get demotivated by the first person who resists. Every innovation initially faces mockery and goes through three phases: it is ridiculed, then perceived as dangerous, and finally seen as self-evident. Having experienced this firsthand, I can vouch for its accuracy, and there are historical examples:

  • Women's suffrage. Initially deemed ridiculous, then seen as dangerous as some suffragettes lost their lives in the 1910s, and now it's a given in our contemporary societies.
  • Henry FORD had a vision that every American should own an affordable car. Back then, cars were seen as a luxury item for the wealthy: "it's not clear what it's for, but it looks nice." He created the first moving assembly line in 1913, and Ford is still an industry leader today.
  • Elon MUSK believed in creating reusable rocket launchers. Initially mocked or highly doubted by the Russian and American space industries, he's now respected by the latter and seen as a threat by the European space industry.

If you face direct opposition, you may need to rethink your communication strategy by tailoring your speech. Start with understanding opposing viewpoints. If you feel that some are deliberately trying to end discussions, consider the following tactics:

  • First, invoke shared values. Even if you and your counterpart have different beliefs, you might still have common values. Show how your initiative aligns with them. If both of you value innovation, explain how your approach promotes it and the new opportunities it offers. If both are keen on enhancing the day-to-day experience for a certain profession or user, provide use cases on how your solution can assist.
  • Second, put them in the spotlight. Be it a decision-maker or a client, anyone will support your idea if it lets them shine. Identify how your project can help them achieve their goals and make this clear to them. A misplaced ego often arises from a disconnect between the project's stated objectives and the individual's personal goals. If your counterpart seeks to stand out and gain influence in their organization, show them how your project could bolster their reputation as an innovative leader committed to improving the lives of their team.
  • Third, build a coalition. Gather people who share your transformation vision. These individuals often agree with you on the organization's inefficiencies. By creating a supporting community, you show stakeholders that your approach is legitimate and backed by many. Also get official testimonials: a letter or email signed by a recognized leader from an entity you've worked with, vouching for your methods or services.

Finally, accept that you might not be a permanent fixture in the organization. If your initiative doesn't find its place there, it's the organization's loss! The same effort could have a different impact elsewhere. And only you set your own boundaries.

Tailoring Your Message

Successful transformation requires impeccable communication from its initiator. It's crucial to know how to present based on the target audience, all the while keeping in mind certain common organizational phenomena.

"Why don't they seem convinced?" you may tell yourself.

Perhaps you've found yourself in this situation after one of your presentations. Although validated by many of your peers and seemingly polished after rehearsals, the presentation still didn't achieve the desired impact. The person you addressed didn't ask the right questions, or seemed bored or even irritated.

Presenting to different audiences requires tailoring your presentation style, examples, and arguments to their roles, constraints, and needs. Don't expect anyone to understand the "so what" of your presentation if you haven't first understood why it was beneficial for them to attend. Typically, two presentations suffice: one for professionals (who are your "clients") and another for senior officials (who are the "policy makers").

However, it's important to differentiate between senior officials (who are the executives) and managers. The latter often have a stronger bond with their teams, making them more receptive to business-related arguments. Senior officials, on the other hand, operate at a strategic level, where they set the organization's vision and major directions. Operational, tactical, and technical considerations are delegated. As such, messages passed up the hierarchy might get distorted or altered.

That's why you shouldn't assume that the leaders are always aware of what you observe at your level. Don't hesitate to remind your audience about the effort required for even the most common tasks. For instance, emphasize that 80% of n individuals' work is dedicated to a certain task. With your approach, you could save x hours per day for each employee, equating to y euros in savings or z times increased productivity.
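The arithmetic behind such an argument can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 220 working days per year, and every figure in the example are hypothetical placeholders to replace with numbers measured in your own organization.

```python
def estimate_savings(n_employees, hours_saved_per_day, hourly_cost_eur,
                     working_days_per_year=220):
    """Return (hours saved per year, euros saved per year).

    working_days_per_year=220 is an assumed default, not a universal value.
    """
    hours = n_employees * hours_saved_per_day * working_days_per_year
    return hours, hours * hourly_cost_eur


def productivity_factor(task_hours_per_day, hours_saved_per_day):
    """Return how many times faster the task gets done (the "z" above).

    If a task took 6.4 hours a day and now takes 3.2, the factor is 2.
    """
    remaining = task_hours_per_day - hours_saved_per_day
    if remaining <= 0:
        raise ValueError("hours saved must be less than the task duration")
    return task_hours_per_day / remaining


if __name__ == "__main__":
    # Hypothetical figures: 50 employees, 1 hour saved per day, 40 euros/hour.
    hours, euros = estimate_savings(50, 1.0, 40.0)
    print(f"{hours:.0f} hours and {euros:.0f} euros saved per year")
    # 80% of an 8-hour day is 6.4 hours spent on the task.
    print(f"productivity factor: {productivity_factor(6.4, 3.2):.1f}")
```

Keeping the calculation this simple is deliberate: a decision-maker should be able to check each input and reuse the result in their own presentations.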

Decision-makers seek arguments they can use to persuade others. Endeavor to grasp the mandates they must adhere to, and provide them with communication tools they can reuse. For instance, the CEO of a multinational might prioritize economic profitability, while a high-ranking politician might weigh social impact more heavily. However, both will be keen to align with their organization's strategy.

Just like you, a decision-maker newly introduced to a topic can only retain a few key pieces of information. Ensure you focus on a maximum of 2 or 3 main ideas you want to convey. Conclude your presentation with a call to action, guiding them on how they can support your project's realization.

Lastly, you cannot completely rule out the possibility that your counterpart might have conflicts with other stakeholders in your organization. This could hinder them from making seemingly beneficial decisions, in a bid to maintain their status or protect their career. In such scenarios, try finding equally or higher-ranked influencers to champion your vision among decision-makers. Once multiple top executives back your message, it becomes challenging for anyone to reject what the rest of the organization sees as essential.

More often than not, lackluster communication results from misunderstandings rather than ill intent. Unless you're sure, always assume the issue isn't with the person in front of you.

Internal team model

Now that we understand the techniques for addressing common situations of resistance to change, we can move forward with greater confidence. Let's explore how to structure our approach and strengthen our arguments for effectively launching our initiative.

In-house development as a real alternative

In the chapter "Refusing technological lag", I discuss internal innovation as a means to prevent an organization's decline. However, it's crucial to clarify how in-house development, beyond being effective, becomes essential if a company wants to remain competitive.

Which company responsible for a major I-T project would claim, "We don't need an I-T expert"? Due to a lack of technical acculturation or previously mentioned psychological phenomena, decision-makers sometimes chronically turn to consulting firms.

Much like global organizations such as the World Health Organization or the United Nations, French national entities like the National Center for Scientific Research, the National Education, and the Public Health agency have an internal scientific council. This ensures they stay updated on the latest scientific knowledge, enabling decision-makers to make informed choices. In the private sector, this role is filled by the Chief Technical Officer and their senior managers (or VPs).

While a scientific council can help an organization remain at the forefront of scientific knowledge, it isn't enough to make it innovative, especially if its members aren't periodically renewed. To innovate, practice is key.

If you want to effectively address the challenges facing your organization, only an internal team practicing the technologies related to your topics can help. Thus, boldly setting up your technical team offers numerous benefits. Daily contact with business sectors or clients enables the creation of tailor-made tools, finely tuned to meet their needs effectively.

This close proximity to the requester also facilitates real-time support service, eliminating the additional costs and delays usually associated with external support. This leads to shorter improvement cycles and faster delivery of requests.

Having the project roadmap under their direct control allows decision-makers to ensure developments perfectly match their needs and vision. This in-house management significantly reduces costs by pooling investments for several simultaneous projects.

One of the primary strengths of an internal team lies in data security, with data strictly confined to the organization's infrastructure, accessible only to authorized members. This minimizes the risk of data breaches.

Furthermore, an internal team has a unique ability to quickly and relevantly evaluate technological innovations, placing them in the organization's business challenges context. They are also positioned to promote the assimilation of these new technologies within the organization through presentations suitable for all levels.

Relying solely on an external resource for your I-T projects will inevitably lead to prohibitive costs. Without internal expertise, you're at the mercy of talented sales teams from companies eager to sell you services your organization will never use.

The main reason decision-makers are cautious about in-house developments is maintenance. They're right: paying a service provider can be expensive, but they're contractually bound to deliver. This contract often comes with a maintenance provision. A single internal developer - poorly equipped due to limited support - might fail at the same task, ultimately calling the decision-maker's responsibility into question.

Therefore, hiring two or three engineers won't be enough to sustain your developments. To successfully offer a useful solution, which can be a viable, maintainable alternative and credible to your superiors, you'll need to assemble a much larger team.

By equipping this team with a proper development environment and incorporating best DevOps practices, they'll have time to focus on the quality of your software. While this requires a time investment and might be a challenging step with your superiors, they haven't yet realized how invaluable this advancement will be in the future! Stay the course.

At one of the companies I worked for, the in-house development of software by an engineer saved several million euros. Equivalent industrial programs were stagnating, and the business units remained helpless. It took just one engineer - albeit a brilliant one - to solve a problem that had persisted for over 6 years.

Thanks to DevOps rules demanding software quality standards, over ten developers in the last three years have contributed to this project to maintain and enhance it. It still receives numerous weekly updates today.

Beyond providing a pragmatic solution to a problem, this engineer especially succeeded in acclimating the entire hierarchy to modern development concepts and machine learning techniques. Invited to major strategic meetings alongside traditional external providers, he became the organization's machine learning reference. Without him, no one internally would be able to specify a need or evaluate a machine learning solution with full knowledge.

"Innovative" Teams and Data Science

Many organizations have sought to invigorate their structures by creating "innovation teams." Yet, many have not truly managed to deploy into production what was developed therein.

Use cases often revolve around data and artificial intelligence. Buzzwords such as "data scientists", "deep learning", and "artificial intelligence" have led to numerous false hopes. Many organizations hired data science profiles only to find them unable to deploy their algorithms to interfaces designed for non-expert users.

The problem isn't with the data scientists but rather with decision-makers who, until recently, didn't understand what responding to business needs entails: a reliable development foundation, clean data, massive data, model tracking, and a deployment team. In essence, many thought (and continue to think) that "AI" could solve any problem with just a few lines of code. These individuals are unaware of the infrastructure and technical support required by these technologies.

A typical data science example concerning DevOps is the need for computational power, storage capacity, and services to develop and monitor the training of models. Yet, most data scientists aren't equipped to set up their machine, their GPU drivers, and their Jupyter Notebook environment, especially within the complex environments characteristic of large organizations.

Staying close to business needs

What will set your team apart is the support you provide to your operators. Compared to traditional development teams or external service providers, your advantage is the potential to have close interactions with your organization's business operations.

This is the renowned "agile" methodology, in contrast to the "V-cycle" (another name for the waterfall methodology).

In many organizations, the "V" approach is still employed: the service provider meets the business team to collect a requirement, produces a PowerPoint presentation a month later, and unveils the development outcome between 6 months and 6 years later. In software, the delivered product is often already outdated by then, and the teams that made the request might have changed.

In manufacturing—like designing a warship—it's legitimate to ensure that the vessel will float correctly and that its rudder will steer it properly before launching. The ship's features are often set: its range, missile capacity, service duration, etc. One wouldn't alter the hull composition at the last minute or adjust the shaft line bearing. The "V" cycle is appropriate here.

However, in software, a more agile approach is feasible. Software behavior can be assessed and simulated in near real-time. This flexibility ensures the software can be adapted at any point, ensuring it meets set objectives.

Within an armament program, the onboard computer systems of a ship can follow agile methodology, while the carrier's production can be governed by the "V" methodology. While the hull may undergo few changes, the software can be updated as rapidly as operations require.

Beyond the technical solutions you offer, your business teams will notice that your more agile organizational mode is efficient for them. Consequently, they will support your initiative. As a team leader, your goal should be to have representatives from business teams that you've aided with your tools testify during crucial presentations. Such testimonials will bolster your credibility and prevent your teams from merely being seen as "technical development providers."

This proximity to business operations will enable your teams to feel more involved in your organization's missions. It's a win-win dynamic for both your engineers and clients. Both parties benefit from each other's expertise: the engineer gains a deeper understanding of the issue, and the operator specifies their need as precisely as possible.

Agile coach Henrik KNIBERG's illustration effectively conveys the essence of the agile methodology: the preference is to deliver a functional product at each stage, gather user feedback, and iterate.

Throughout your career, you've likely noticed: clients often struggle to articulate their exact needs. Agile and ultimately DevOps methodologies allow for adaptation to the ever-evolving business realities, ensuring a deep understanding and delivery of a product truly aligned with their requirements.

By automating tedious processes, DevOps techniques will free up time, allowing you to spend more with your clients, understand their needs better, and effectively address their feedback and suggestions.

Bringing technical profiles and business teams together adds value by promptly and accurately addressing internal challenges. This is also a key to staff retention. Remember: your teams seek purpose. They don't merely come to work to follow orders but to employ their expertise to devise the best technical solution for a business problem. An engineer's work culmination is witnessing the business use the solution they've crafted.

Unleashing Communication and Breaking Down Data Silos

One of the cornerstones of DevOps is to break down silos, including access to data.

If you want your technical teams to best respond to your needs, they require privileged access to your company's data.

When the legal framework allows, forego "anonymized samples". Engineers need a precise understanding of the data they are supposed to process. Trying to develop a tool based on "anonymous" data is akin to developing a tool that only partially addresses a use case.

Otherwise, you can be sure a bug will occur as soon as an "unknown" piece of data passes through the software. Provide your teams with the production data intended to be used in the tools: you'll spend less time on bug fixes and improve the quality of service provided by your software.

If you don't have the necessary permissions, perhaps hiring in-house isn't essential. A service provider can just as effectively build the software from open-source data. However, consider the risks of proceeding this way.

Security: a new paradigm with the DevOps approach

The idea that DevOps bridges different professions for collaboration is not easy to implement. Traditional roles in Information System Security found themselves confronted with practices they weren't used to and sometimes didn't have the time to grasp.

In large organizations, company rules or even the law itself require that a specific version of a piece of software be defined before it can be qualified or approved. Imagine having the responsibility to enforce these conditions when DevOps methods involve dozens of software updates daily: it's quite daunting! Therefore, understanding the makeup of a cloud infrastructure to correctly define its "security" is essential.

Security affects all pillars of DevOps. This chapter focuses on a high-level description of security concepts within a DevOps approach.

In this organizational mode, security practices are automated to be systematically verified. The aim is to minimize so-called "documentary" security in favor of programmed rules. Indeed, using standardized technologies (e.g., containers, Kubernetes) facilitates implementing security rules, ensuring they are applied.

Culture of Security

The DORA report "State of DevOps 2022" focuses on security challenges in corporate DevOps transformation initiatives. It states that a company promoting trust and psychological safety is 60% more likely to adopt innovative security practices. This culture reduces the number of burnouts by 40% and increases the likelihood of an employee recommending the company.

Security has always been a matter of culture. However, the DevOps methodology introduces all the techniques that will allow an organization not to overlook good practices, previously neglected or lost in voluminous and cumbersome archives.

The key is to understand that in DevOps mode, we operate on a principle of iterative improvement cycles. Projects are never set in terms of technology used, and deployments are continuous without human interaction. This ensures that innovation remains agile and always addresses the client's needs most accurately.

But it's not a free-for-all: there are technological standards and procedures that control what's deployed, according to the security standards your organization demands.

We'll delve deeper into the cultural aspects of the DevOps methodology in the chapter "Embracing failure".

Qualification, Certification, and Accreditation

Governments are hungry for new innovative technologies. However, they need to strike a balance between the risks they may entail and the benefits they may get. This is why they create frameworks to manage this risk.

France's cybersecurity agency defines three ways to assess the risk of using a technology : qualification, certification and accreditation. Most western countries have adopted similar processes and have signed an agreement that makes the trade of secure I-T solutions between members easier : the Common Criteria Recognition Arrangement.

As a declarative approach to managing security risks, traditional approval processes are not well-suited for continuous deployment practices. They freeze risk for a specific moment or architecture. Yet, threats emerge daily: a vulnerability in a library, for example, could be detected a day after approval is granted. Even though the approval is temporary and a periodic assessment might be required, the vulnerability might persist during this time, leading to a risk of exploitation.

For Cloud service providers, the United States established the Federal Risk and Authorization Management Program (or FedRAMP). It adds a new layer of security compared to traditional approaches by enforcing a demanding continuous monitoring process.

Assuming security flaws might emerge at any moment must be part of your cybersecurity posture. You must have actionable tools to quickly respond to threats and preserve your Authorizations to Operate (or ATOs). To address this challenge, it's recommended to adopt continuous integration techniques.

Continuous Integration and Security

Continuous integration allows automatic monitoring of changes made to software or infrastructure.

Whenever even a single line of code is altered, tests are triggered. If a code modification doesn't meet the defined security standards, the contribution is rejected. The developer is automatically notified in their software factory (e.g., GitLab). They can see an error message explaining the issue, enabling them to immediately make the necessary adjustments.

This is where the expertise of security managers is needed. These professionals must explain to DevOps engineers and SREs what specifically needs to be monitored. These rules are then translated into code, forming automated tests, within a continuous integration pipeline used by all company projects.

Because these rules are versioned as code, they can be updated as needed, instantly impacting all projects.

They might consist of antivirus checks, vulnerability scans in used Docker images, or ensuring that no passwords are inadvertently left in a public file.
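As a rough illustration of a security rule "translated into code", here is a minimal Python sketch of a check that refuses a contribution when a line looks like a hard-coded secret. The patterns and the function names (`find_secrets`, `check_contribution`) are assumptions made for this example; a real pipeline would rely on a maintained rule set rather than two regular expressions.

```python
import re

# Hypothetical patterns chosen for illustration only.
SECRET_PATTERNS = [
    # e.g. password = "hunter2" or token: 'abc'
    re.compile(r"(?i)\b(password|passwd|secret|token)\b\s*[:=]\s*['\"][^'\"]+['\"]"),
    # General shape of an AWS access key ID.
    re.compile(r"AKIA[0-9A-Z]{16}"),
]


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like hard-coded secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            findings.append((number, line.strip()))
    return findings


def check_contribution(text: str) -> bool:
    """Return True when the contribution may be accepted."""
    return not find_secrets(text)
```

In a continuous integration job, a script like this would exit with a non-zero status when `check_contribution` returns `False`, which is what causes the contribution to be rejected and the developer to be notified.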

In the illustration provided in the book, you can observe a 5-stage continuous integration chain (build, test-code, test-lint, test-security, and deploy). The column of interest is test-security. It contains various security tests that are initiated. They can either pass (with a green checkmark), fail (with a red cross), or fail with a simple warning (with a yellow exclamation mark).

An exclamation point means the test did not pass but was not deemed critical (e.g., an outdated software dependency with no security flaws).

For engineers, the ultimate goal is to see their project accompanied by a green checkmark, signifying all tests have been successfully passed.

In a DevOps approach, developers don't start from scratch. They begin with a template that they copy, which integrates - in addition to development files - all security rules. Ensure that security teams co-contribute to these templates so every new project incorporates your security standards to save time for everyone.

Continuous integration chains aren't limited to security tests. Consider them as scripts automatically triggered with each code modification. Although the traditional trigger is "code modification", cloud hosts like AWS might offer their triggers (e.g., adding a file to an S3 bucket). We'll delve deeper into the workings of continuous integration in the chapter "Continuous Integration (CI)".

Code Reviews

In an ideal world, all verification is automated. However, it's sometimes challenging to "code" advanced security checks, and you might not have the human resources to develop them.

In DevOps, the GitOps methodology is practiced: everything is based on code (software, infrastructure, architecture diagrams, presentations, etc.).

Each developer works on their own branch and develops their feature. They test that everything works as expected, then create a "merge request" into the main branch.

Code review takes place at this juncture. It's an opportunity for engineers to approve others' changes, providing an external perspective before it gets merged into the main branch. This is the time when various stakeholders involved in reviewing the quality of a contribution can write their comments.

The goal is to ensure the developer hasn't made significant errors in the code's functionality or is not adding technical debt. For instance, at Google, a merge request requires approval from at least two engineers before it can be validated.

Releasing a new version of software in production is the ideal time for security teams to audit the code. This practice is known as "security review". Every new software release is subject to previously mentioned continuous integration rules with additional automated security tests and optionally the validation from the security team.

For security teams, the code review aims to ensure that the maximum security criteria are met, such as:

  • Presence of activity logs documenting user actions ;
  • Access to authorized data sources ;
  • No data being sent to an unauthorized service ;
  • Password/cookie storage techniques ;
  • GDPR functionality compliance.

GitLab, for example, allows you to mandate the approval of a merge request by specific teams (e.g., the security team) before a contribution can be merged into the main branch.
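The approval rule itself is simple to reason about. The following Python sketch is not GitLab's API: it is a hypothetical model (the `MergeRequest` class, team names and thresholds are all invented for this example) showing the logic of blocking a merge until every required team has approved.

```python
from dataclasses import dataclass, field


@dataclass
class MergeRequest:
    title: str
    # Maps a team name to the set of its members who approved.
    approvals: dict[str, set[str]] = field(default_factory=dict)


def missing_approvals(mr: MergeRequest, required: dict[str, int]) -> list[str]:
    """Return the teams that have not yet given enough approvals."""
    return sorted(
        team
        for team, minimum in required.items()
        if len(mr.approvals.get(team, set())) < minimum
    )


def can_merge(mr: MergeRequest, required: dict[str, int]) -> bool:
    """A merge is allowed only when no required approval is missing."""
    return not missing_approvals(mr, required)
```

For instance, with the hypothetical rule `{"security": 1, "maintainers": 2}`, a merge request approved by two maintainers only would still be blocked until one member of the security team approves it.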

Tools like ReviewDog, Hound, and Sider Scan assist engineers during code reviews. For instance, these tools run linters and automatically add comments on the relevant line.

Securing your software supply chain

In May 2021, the White House released an executive order describing new strategies for "improving the nation's cybersecurity". Among the 7 described priorities, enhancing the security of the software supply chain is mentioned. It states there's an "urgent need to implement stricter techniques, allowing for quicker anticipation, to ensure software purchased by governments operates securely and as intended". This commitment was renewed in January 2022 with Joe BIDEN's signing of the U.S. National Security Memorandum.

DevSecOps Maturity Models like those from OWASP, DataDog, AWS, or GitLab offer general techniques to enhance DevSecOps practices. They help in breaking down an organization's maturity progression into more accessible steps, aiming to achieve better security practices.

First, we'll explore the techniques and tools used to secure the software supply chain. Then, we'll see how they're integrated through frameworks. The vast majority of tools mentioned in this chapter are run within continuous integration chains, serving to validate an organization's entire security rules with every code change.

Techniques and Tools

Software Composition Analysis (SCA)

Information Security practices within large organizations often require that any deployed software be accredited. The accreditation document must list the dependencies used in the software: the third-party libraries it relies on. This list is called the Software Bill of Materials (or SBOM).

The SBOM allows for quick answers to questions like "Are we affected?" or "Where is this library used in our software?", when a new vulnerability is discovered. In a DevOps approach, the libraries used in software change over time. A library or technology used today might be replaced tomorrow. Hence, developers cannot be asked to manually list these hundreds (or even thousands) of dependencies used in their software.

The SBOM is one of the techniques of Software Composition Analysis (or SCA). SCA encompasses techniques and tools that determine which third-party components a piece of software contains (e.g., its dependencies, their code, and their licenses) to ensure they do not introduce security risks or bugs.

The advantage of the DevOps methodology is that all code is centralized within the software factory. This allows us to use tools to analyze the composition of each project and prevent security vulnerabilities.

It's possible to generate the SBOM of software using tools like Syft, Tern, or CycloneDX. SPDX is a standardized format for SBOM files, and CycloneDX defines its own format as well. The common practice is to store this file in an artifact signed by your software forge with each new version of the software you wish to deploy.

The goal remains to determine whether a library in use is vulnerable, so that it can be updated or replaced. Apart from meeting regulatory constraints, just leaving this file as a simple document isn't very useful. That's why it's now necessary to analyze the SBOM.

A lightweight analysis tool like OSV-Scanner can be easily integrated into your continuous integration pipelines and provide a first level of protection. However, it won't provide an overview of all the affected software within your infrastructure. Tools like Dependency Track, Faraday, or Snyk Open Source are then required. They can ingest multiple SBOM files and display an overview of threats to alert engineers if necessary.
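To make the "Are we affected?" question concrete, here is a minimal Python sketch that searches several SBOMs for a vulnerable library. The SBOM structure is deliberately simplified (a `components` list of `name` and `version` entries, loosely modeled on the CycloneDX JSON layout), and the project and library names are hypothetical.

```python
def affected_projects(
    sboms: dict[str, dict],
    library: str,
    vulnerable_versions: set[str],
) -> dict[str, list[str]]:
    """Map each project to the vulnerable versions of `library` it ships."""
    hits = {}
    for project, sbom in sboms.items():
        versions = sorted(
            component["version"]
            for component in sbom.get("components", [])
            if component["name"] == library
            and component["version"] in vulnerable_versions
        )
        if versions:
            hits[project] = versions
    return hits


# Hypothetical SBOMs collected from two projects of the software factory.
SBOMS = {
    "billing-api": {"components": [{"name": "examplelib", "version": "1.2.0"}]},
    "intranet": {"components": [{"name": "examplelib", "version": "2.0.1"}]},
}
```

Calling `affected_projects(SBOMS, "examplelib", {"1.2.0"})` would single out `billing-api`. Dedicated tools do the same kind of cross-project lookup at scale, with vulnerability databases instead of a hand-written set of versions.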

Tools like Renovate or GitHub Dependabot detect dependencies with vulnerabilities and automatically propose an update in the software forge by opening a merge request.

In summary: Instead of just listing dependencies, the aim is to set up continuous detection of used libraries for all projects. It's essential to alert about threats as early as possible and refuse contributions that could bring risks before they are deployed in production.

Static Application Security Testing (SAST)

While SCA tools allow you to analyze the composition of your project, SAST tools aim to analyze the software code you develop. However, some SAST tools also cover SCA features. Both fall under the domain of source code analysis.

Static Application Security Testing focuses on techniques and tools intended to find vulnerabilities in your source code before it's run. They represent a form of white box testing. For instance, SAST tools will identify insecure configurations, SQL injection risks, memory leaks, path traversal risks, and race conditions.

A comprehensive list of open-source and commercial code analysis tools is available on the OWASP foundation website.

While SAST significantly improves software supply chain security, it doesn't replace other security practices. Indeed, static analyses can produce false positives or miss vulnerabilities that only manifest during software execution. Therefore, complementing SAST with techniques like DAST (Dynamic Application Security Testing) or IAST (Interactive Application Security Testing) is recommended. We will cover these in the following chapters.

In summary: SAST is a so-called "proactive" security approach, allowing for the identification and rectification of vulnerabilities before they can be exploited. Integrated within the development process, it reduces security risks and ensures better code quality. The aim is to keep a keen eye on the security of the source code throughout its lifecycle, avoiding errors that could be exploited in production by malicious actors.

Dynamic Application Security Testing (DAST)

Dynamic Application Security Testing (DAST) is an analysis technique that focuses on detecting vulnerabilities in a running application.

Essentially, it's an automated black box intrusion test that identifies potential vulnerabilities attackers might exploit once the software is in production. These vulnerabilities can be SQL injections, Cross-Site Scripting attacks, or issues with authentication mechanisms.

One advantage of DAST is that it doesn't require access to the application's source code. When used in conjunction with SAST, it provides more comprehensive security coverage. Indeed, DAST can detect vulnerabilities that might go unnoticed in a static analysis and vice versa.

Numerous products with overlapping features exist. They generally allow for automated vulnerability scanning that includes: fuzzing (random inputs), traffic analysis between a browser and API, brute force attacks, and vulnerability analysis in JavaScript code. The go-to DAST tool is OWASP ZAP, but others include Burp Suite, W3af, SQLMap, Arachni, Nikto, and Nessus.
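Fuzzing is the easiest of these techniques to sketch. The Python example below sends inputs to a function and records the ones that make it crash. It is a simplification: a real DAST tool targets a running application over the network, whereas this sketch calls a hypothetical in-process handler (`parse_age`) to keep the example self-contained.

```python
import random
import string


def random_inputs(rng: random.Random, count: int, max_length: int = 12):
    """Yield `count` random printable strings drawn from `rng`."""
    for _ in range(count):
        length = rng.randint(0, max_length)
        yield "".join(rng.choice(string.printable) for _ in range(length))


def fuzz(target, inputs) -> list[tuple[str, str]]:
    """Call `target` on every input and collect the ones that raise."""
    crashes = []
    for value in inputs:
        try:
            target(value)
        except Exception as error:  # a crash is a finding worth reporting
            crashes.append((value, type(error).__name__))
    return crashes


def parse_age(raw: str) -> int:
    """Hypothetical handler under test: it trusts its input too much."""
    return int(raw)
```

A run such as `fuzz(parse_age, random_inputs(random.Random(0), 1000))` would exercise the handler with a thousand random strings. Seeding the generator makes a campaign reproducible, which helps when a crash needs to be replayed.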

An extensive list of open-source and commercial dynamic analysis tools is available on the OWASP foundation website.

However, DAST isn't a magic solution: tests can sometimes produce false positives or negatives, it cannot detect vulnerabilities or poor practices at the source code level, and advanced knowledge might be needed to configure the tests. Therefore, DAST tools should be used in conjunction with other security techniques, such as SAST and IAST.

In summary: DAST encompasses tools that analyze applications in real-time to detect potential vulnerabilities. It complements static analysis (SAST). By integrating DAST into one's software pipeline, it's possible to ensure the security of applications throughout the software lifecycle: in development and in production.

Interactive Application Security Testing (IAST)

Interactive Application Security Testing (or IAST) encompasses tools that identify and diagnose security issues in applications, whether they are running in production or still in the development phase.

According to OWASP, IAST tools are mainly designed for analyzing web applications and web APIs. However, some IAST products can also analyze non-web software.

IAST tools have access to the application's entire codebase - just like SAST tools - but can also monitor the application's behavior during execution - like DAST tools. This gives them a more comprehensive view of the application and its environment, allowing them to identify vulnerabilities that might be missed by SAST and DAST.

So you might be tempted to say: "IAST tools are fantastic! Can I just throw away my SAST and DAST tools?"

Of course not. Each has its pros and cons:

  • SAST tools are generally easier to set up than DAST and IAST. They're smaller, faster programs that are simpler to integrate into the development cycle. They quickly improve the security level of your software pipeline ;
  • DAST tools operate in a black box mode, allowing them to analyze applications without source code access. They can also be run intermittently, without the integration cost that IAST tools require. Moreover, your organization's security policies might prohibit tool access to software source code. DAST still allows you to evaluate the security of third-party software in such cases ;
  • IAST tools connect to both the source code and the running application. They can combine SAST and DAST analyses but might be slower. Running an IAST tool isn't trivial; it impacts application performance in production. Some prefer these tests in an isolated environment. However, the tested software might not represent the version available to attackers, potentially missing some vulnerabilities.

Whether DAST or IAST, tools typically require a solid understanding of the application to perform and interpret tests effectively. This often relies on engineers with deep expertise in the software being tested and, more broadly, solid security knowledge. Lastly, open-source solutions are rare in this domain, inevitably incurring costs. Both tool types are valuable but demand investment in time, human resources, and money.

In a mature DevSecOps infrastructure, SAST, DAST, and IAST approaches combine. All-in-one software dedicated to securing the development cycle and integrated within software repositories exists. Examples include Snyk, Acunetix, Checkmarx, Invicti, and Veracode.

Security Frameworks

Today, standards describe how one can properly secure their software pipeline. These are grouped under what's known as security frameworks.

Each of the frameworks presented in this chapter (SLSA, SSCSP, SSDF) contains a list of recommendations on security techniques to implement in one's software pipeline. They advocate for the use of SCA, SAST, IAST, and DAST techniques.

Supply-chain Levels for Software Artifacts (SLSA)

The Supply-chain Levels for Software Artifacts framework (SLSA, pronounced "salsa") focuses on the integrity of data and artifacts throughout the software development and deployment cycle.

SLSA originated from Google's internal practices. The company developed techniques to ensure that employees, acting individually, cannot directly or indirectly access or manipulate user data in any way without appropriate authorization and justification.

In software development, you use and produce artifacts. These can represent a development library used in your code, a machine learning binary, or the product of compiling your software. SLSA operates on the principle that each stage of software creation involves a different vulnerability and that these artifacts are a prime vector of threats.

Its rules revolve around the automatic verification of the integrity of the handled data. Some examples of vulnerabilities addressed by SLSA include:

  • ensuring that the source code used in software compiling scripts has not been altered ;
  • verifying the origin of development dependencies ;
  • ensuring that the software factory has minimal network connectivity.

Based on a team's technical maturity, it's possible to apply SLSA rules across four levels of security and complexity. The idea is to progressively enhance the security of one's software chain over time.

SLSA consists of two parts:

  1. requirements: a set of security rules, varying in complexity depending on the desired SLSA level (from 1 to 4) that an organization aims to achieve
  2. threats and mitigations: detailing threat scenarios, known public examples, and ways to mitigate them

The FRSCA project is a pragmatic example of a software factory implementing SLSA prerequisites. Integrations within GitHub's continuous integration chains, like the "SLSA Build Provenance Action", are also available.

SLSA documentation is regularly updated by the community and available on its official website.

Software Supply Chain Security Paper (SSCSP)

The Software Supply Chain Security Paper specifications from the renowned Cloud Native Computing Foundation complement the SLSA. Historically, they cover a broader range of topics, but many recommendations overlap today.

Although SLSA offers more interactive, well-illustrated documentation (with examples of tools to use or threats for each rule) and is almost gamified with its "security level badges", SSCSP appears - at the time of this book's writing - to provide a more high-level view of threats within a software chain.

Author's note: For beginners, I recommend starting your software factory security project with SSCSP, then advancing with SLSA.

This document is also collaborative and broadly belongs to the standards adopted by the CNCF's Technical Advisory Group (or TAG). The TAG writes various reference documents aimed at enhancing the security of the cloud ecosystem.

Secure Software Development Framework (SSDF)

The Secure Software Development Framework (or SSDF) is a document drafted by the National Institute of Standards and Technology (NIST) of the US Department of Commerce for all software publishers and buyers, regardless of their affiliation with a government entity.

NIST deserves recognition for the variety and quality of their reports on cutting-edge technologies and techniques. Their works often result from collaboration with numerous institutions and private companies, such as Google, AWS, IBM, Microsoft, the Naval Sea Systems Command, and the Software Engineering Institute.

More comprehensive than the previous two, SSDF acts as a directory consolidating recommendations from dozens of other frameworks. It categorizes them into four major themes: preparing the organization, protecting software, producing securely developed software, and addressing vulnerabilities.

The framework lists general concepts progressively associated with more concrete rules. Each theme encompasses broad practices to follow, which in turn include tasks with examples linked to the relevant frameworks.

For example, under the "protect software" theme, the "protect all forms of code from unauthorized access and tampering" practice suggests using "commit signing", referencing the SSCSP in its "Secure Source Code" chapter.

This document can be found on the NIST website. The online library of the Chief Information Officer from the US Department of Defense is also an excellent source of inspiration.

GitHub's example

GitHub is the most popular code-sharing platform on the Internet. It hosts over 100 million projects with more than 40 million developers contributing to it. As a pillar in the open-source field, it offers security tools natively integrated into its platform.

GitHub's aim is to ensure that securing one's code requires just a few clicks to activate the appropriate tools.

The company made a strategic move by acquiring Semmle in 2019, a tool for analyzing vulnerabilities in code. Since then, it has offered several means to secure a codebase:

  • SCA and SAST are automated vulnerability analysis tools for the source code and its dependencies (e.g., SQL injections, XSS flaws, configuration errors, and other common vulnerabilities). GitHub also has a marketplace that allows adding code analyzers from third parties. You can add your custom rules by writing CodeQL files. You can deploy these tools on your infrastructure, for instance with GitHub Code Scanning, Klocwork, or Checkov.
  • The secret scanning tool analyzes, detects, and alerts on potential passwords or tokens inadvertently left in the source code.
  • Dependabot is a dynamic analysis tool for risks associated with the dependencies in use. Dependabot automatically opens a code modification proposal (pull request) on the project and suggests updating the dependency or switching to an alternative.

All security flaws related to a project are centralized in an overview, allowing threats to be easily detected and addressed.

GitHub relies on the international Common Vulnerabilities and Exposures (CVEs) repository to recognize vulnerabilities. CVEs are a list of identified vulnerabilities in I-T systems described in a specific format. You can add additional verification mechanisms using GitHub Actions, GitHub's continuous integration mechanism.

Pre-Approved Resources

To mitigate risks, it is possible to base the software developed on pre-approved resources made available to developers. Every external component of the software is checked. This might include Python, NPM, or Go packages, or even Docker images that have been analyzed to ensure no known vulnerabilities are present.

This is exemplified by the Iron Bank service set up by the U.S. Department of Defense within Platform One. Docker images must undergo a rigorous validation process before approval. These steps combine manual checks with automated ones, but initially, only automated procedures may be employed. Manual actions are necessary to justify adding a new image. This is what Platform One teams call "continuous accreditation of approved images".

In organizations dealing with highly sensitive data (i.e., data that can jeopardize a country's security or credibility if disclosed), the default policy is to authorize only the use of pre-approved libraries and images (which are called "hardened images"). However, consider the impact of such a choice on development velocity. Ensure your security and SRE teams can keep up with provisioning libraries.

Since it's nearly impossible to manually analyze each development library to ensure it's flawless, software factories can rely on file signatures. Trusted publishers sign each of their libraries, so continuous integration pipelines or system administrators can verify they haven't been altered during transfer. Each trusted publisher issues a certificate that the SRE team can integrate into its continuous integration pipelines to ensure downloaded packages haven't been tampered with.

A simpler method is to rely only on the hash of each file. Each file is identified by a character string called a hash, which the computer can easily compute from the file's content.

During installation, if the downloaded dependency has a different hash from the one provided by the publisher, the installation is refused. This mechanism is mostly already implemented by programming language package managers (e.g., package-lock.json for NPM or poetry.lock for Python).
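The hash comparison can be sketched in a few lines of Python with the standard `hashlib` module. SHA-256 is assumed here as the hash function, and `verify_download` is a name invented for this example; package managers implement their own variant of this check.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hash of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, expected_hash: str) -> bool:
    """Return True only if `data` matches the hash given by the publisher."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sha256_of(data), expected_hash.lower())
```

Changing a single byte of the downloaded file produces a completely different hash, so `verify_download` returns `False` and the installation can be refused.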

Managing Infrastructure through Code

Humans are the primary vector for security risks. To minimize errors or deliberate system compromises, modern infrastructures are deployed as "code".

This means that for everyday infrastructure operation, every administrative action is coded, published, and verified in the software factory before deployment. This allows for standardizing, documenting, replaying, and optimizing administrative actions over time.

The field encompassing production management techniques through code is commonly called Infrastructure as Code (or IaC). We will later detail this domain.

IaC can also describe how machines are instantiated and configured. An IaC setup can fully configure a machine from scratch. Once again, the idea is to minimize human intervention to avoid mistakes.
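To make the idea more concrete, here is a small Python sketch of the declarative principle behind many IaC tools: the desired infrastructure is described as data, and a program computes the actions needed to reach it. The `Machine` type and the `plan` function are hypothetical and do not correspond to any real tool's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical description of a machine, as it could be stored in a repository."""
    name: str
    cpus: int
    packages: frozenset[str]


def plan(desired: dict[str, Machine], actual: dict[str, Machine]) -> list[str]:
    """List the actions needed to make the actual state match the desired state."""
    actions = []
    # Machines described in code but not yet running must be created.
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(f"create {name}")
    # Machines running but no longer described in code must be removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"destroy {name}")
    # Machines whose configuration drifted from the description must be updated.
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(f"update {name}")
    return actions
```

Because the description lives in the software factory, a change to the infrastructure becomes a reviewed change to this data rather than a manual command typed on a server.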

Fundamentals of Zero Trust Network Architecture

The zero trust concept can be summarized in one phrase: "Never trust, always verify." This practice has become imperative, with 55% of companies reporting having implemented a zero trust initiative in 2022, up from 24% in 2021.

Traditionally, network security was based on defining a "trusted perimeter" drawn around an organization's software and data. Various tools and technologies were then implemented to protect them. This network architecture, also known as "castle-and-moat" or "perimeter-based," assumed that any activity inside the perimeter was trustworthy (e.g., network access via a VPN or based on a machine's MAC address) and, conversely, that any activity outside it was not.

Zero trust assumes that no user is "trusted" by default, whether inside or outside the perimeter. Users must be authenticated and authorized to access data and software. Their activity should be monitored and recorded. This approach is more effective at defending information systems against sophisticated attacks, as it doesn't assume that all activities inside the perimeter are trustworthy. This network security model has especially evolved due to the massive shift to remote work.

Consider an example: Sophie is a colleague you've known for 3 years. She badges in every day and settles at her workstation. Days later, you learn Sophie was terminated a month ago. She might have accessed strategic company information, which she could use at her new job in a competitor company. Merely being "used to" seeing an employee allowed the company's precious information to be stolen. With zero trust technologies providing centralized access management, Sophie couldn't have logged in.

Three pillars constitute a zero trust network architecture:

  1. Identity: Who the user is, established through identification (who are you?), authentication (are you who you claim to be?) and authorization (are you allowed to access this resource?).
  2. Context: How the user tries to access the resource, following the Least Privilege Principle or need-to-know. We grant resource access only as necessary (e.g., hiding inaccessible apps/data sources or setting access expiration dates).
  3. Security: The hardware through which the user connects. We ensure the connecting machine meets security requirements (e.g., an active antivirus or an updated OS).

In zero trust, each request involves a fresh security check. The trust broker, or CASB (Cloud Access Security Broker), verifies these criteria, relying on technologies such as OpenID, Active Directory, PKI or SAML.
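The per-request check described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `AccessRequest` fields, the `ALLOWED` table and the `authorize` function are hypothetical, and a real trust broker evaluates many more signals.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AccessRequest:
    user: str
    authenticated: bool            # identity: did the user prove who they are?
    resource: str                  # context: what the user is trying to reach
    device_os_up_to_date: bool     # security: state of the connecting machine
    device_antivirus_active: bool


# Hypothetical need-to-know table: which user may reach which resources.
ALLOWED = {"sophie": {"wiki"}, "admin": {"wiki", "payroll"}}


def authorize(request: AccessRequest, allowed: dict[str, set[str]] = ALLOWED) -> tuple[bool, str]:
    """Evaluate every request from scratch, whatever network it comes from."""
    if not request.authenticated:
        return False, "identity not verified"
    if request.resource not in allowed.get(request.user, set()):
        return False, "resource not granted to this user"
    if not (request.device_os_up_to_date and request.device_antivirus_active):
        return False, "device does not meet security requirements"
    return True, "access granted"
```

In this sketch, removing Sophie's entry from the central table on the day she leaves is enough for all her later requests to be refused, even if she is still physically inside the building.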

CASBs are integral to technologies known as "Zero Trust Network Access" (ZTNA) in implementing zero trust architecture. Cloudflare, Cato, Fortinet, and Palo Alto are examples of ZTNA technologies. Think of them as advanced proxy servers that continuously check multiple security criteria set by your organization. If you're looking to adopt zero trust, refer to the SASE framework.

Because of the many tools involved, setting up a zero trust model is less straightforward than perimeter-based security but overcomes its limitations.

Beyond the pressing need to bolster resource access security, zero trust architecture offers peace of mind from a secured infrastructure. It simplifies device and network equipment administration, reduces costs, and standardizes identity and user rights management interfaces.

Technological innovation demands swift adaptation. Zero trust enables organizations to quickly and securely adapt to environmental changes without revisiting their security stance.

Reference documents such as Google's BeyondCorp research paper, NIST's publications, or the US Department of Defense's papers provide specifications for deploying state-of-the-art zero trust networks.

Zero Trust Based Development

In the context of a Research & Development environment, the topic becomes more intricate. To remain innovative, your teams require flexibility. They make use of cutting-edge libraries, install the latest GPU drivers for machine learning experiments, and even test the performance of their software, fully utilizing the resources of their machines.

In essence, your teams need complete access to their machine's configuration for effective development.

However, as mentioned earlier, the third pillar of a zero trust architecture is to ensure the user's machine is secure. If you grant a developer administrative rights, they might be tempted to disable their machine's security settings. So, what's the solution?

Development workstations are a unique component of our zero trust infrastructure. They entail integrating external resources into the company's infrastructure. At the same time, the software factory's source code or the company's data is copied onto these machines. With libraries downloaded carelessly or code editors with unchecked extensions, there's an added risk of data leakage outside.

We are faced with a dilemma here. We can choose to grant our developers full permissions, but risk them disabling our security measures. Alternatively, we can limit these permissions, potentially slowing down their development velocity and ability to innovate, while also needing to invest more time in training them for an unconventional work environment.

Several factors should be considered:

  • Is the physical security of your installations guaranteed?
  • Have your staff undergone a security clearance?
  • Is your infrastructure connected to the Internet?
  • Does your infrastructure have high bandwidth?
  • Is your infrastructure prone to frequent disconnections?
  • Is the data being handled massive in volume?
  • Could the data harm the organization if disclosed?
  • Can you provide machines for your employees?
  • Do you have teams capable of administering these machines?

There are several ways to address development environment challenges. Here are 6, categorized by user flexibility, implementation complexity, and associated risk:

  1. BYOD: Bring Your Own Device. The user develops on their own computer with their own tools. You have no control over the machine.
  2. Semi-controlled machines. A specific user has administrative rights on their machine, while others don't.
  3. Fully controlled machines with ephemeral cloud development environments like GitHub Codespaces, Coder, or Eclipse Che.
  4. Fully controlled machines with remote development VMs (e.g., Shadow, Azure VM).
  5. Fully controlled machines with local development VM.
  6. Fully controlled and equipped machines.

The more you want to reduce risk while increasing flexibility, the more your infrastructure teams will have work or incur costs (e.g., outsourcing). Consider the factors inherent to your organization and its operating mode to choose the solution that best fits it.

Protecting Your Secrets

Administrators of an infrastructure regularly handle "secrets": passwords or tokens. It's common to exchange them among administrators. In other cases, we might need to share an account password with the relevant person. Password managers are a great way to centralize and share these resources.

You can manage your passwords in them and share them granularly with other users. Each user has their own account to access the secrets they have the right to see. It is recommended to use them as much as possible.

Working on a network allows you to use these tools. Here are some collaborative password management services: Vaultwarden, Bitwarden, LastPass.

A foundation for your resilience

The foundation of an I-T infrastructure comprises the set of technologies that allow software to be deployed on it. Typically associated with it are core services essential for the proper functioning of the infrastructure: a PKI, a centralized authentication server (e.g., LDAP), an NTP time server, or an Active Directory.

For engineers responsible for deploying software, the foundation provides common services to avoid deploying them with each piece of software. With these services centralized, administrative tasks are simplified. In a Cloud foundation, this concept is extended beyond a traditional foundation. This is made possible by the use of standardized technologies that simplify interactions with deployed software (e.g., Docker containers, Kubernetes).

Let's compare a traditional foundation with a Cloud foundation to better understand the latter's added value.

In a traditional foundation, a virtual machine (VM) is assigned to each piece of software to logically isolate it. Each piece of software manages its own logs, certificates, and secrets, and generates its own metrics. The foundation might host services centralizing this data, but the software developer would then have to modify their code to comply with the foundation's services.

This type of foundation has well-established robustness and is still widely used today among major institutions. The isolation is highly effective.

However, the maintenance needs for this kind of foundation increase proportionally with the number of deployed software products. Each one comes with its installation instructions, supplemented by foundation compliance documentation. Installation and configuration are often manual. Yet organizations tend to install more and more services over time to continue meeting business needs.

In summary, here the deployed software is forced to adapt to the foundation. This generates technical debt. Moreover, centralized foundation services, like those for logs or metrics management, might not always exist.

This type of foundation is effective with a reasonable number of deployed services, but it doesn't scale easily without a proportional increase in staff.

In a Cloud foundation, the interaction between deployed software and the foundation is inherently stronger. Standardized container interfaces allow foundation services of an orchestrator (e.g., Kubernetes) to "connect" to it, while still maintaining logical resource isolation.

For instance, application logs or performance metrics can be automatically retrieved and stored in a centralized tool, then set up with alerts. An antivirus that continuously checks for threats in a container can be installed. Kubernetes' sidecar mechanism makes these capabilities possible.

Data flows between containers can be encrypted by default. Secrets (which are passwords or tokens) can be supplied by the foundation without an administrator seeing them. Persistent data are managed uniformly, and backups can be automated.

The benefit of this kind of foundation is that it integrates all these services automatically, without ever touching the application code, nor even requiring the integrator to know about your infrastructure. Thus, you are guaranteed that all deployed software conforms to your monitoring and security requirements. It's the foundation that adapts to the deployed software.

Installation mechanisms standardized by Kubernetes (e.g., Kubernetes manifests, Helm) require just a few commands for software deployment. Kubernetes can automatically instantiate new containers or nodes if user load is too high. We will discuss the technical aspects of these technologies in the chapter "Extensions to simplify infrastructure".
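To give an idea of how such automatic scaling can be decided, here is a Python sketch of the rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired number of replicas is the current number multiplied by the ratio between the observed metric and its target, rounded up. The function name and the default bounds are assumptions, and the sketch deliberately ignores refinements such as tolerance thresholds and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Simplified scaling rule: ceil(current_replicas * current_metric / target_metric)."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Keep the result within the bounds chosen by the administrator.
    return max(min_replicas, min(max_replicas, raw))
```

For example, with 3 containers running at 90% CPU and a target of 60%, the rule asks for 5 containers; with 4 containers at 30% for the same target, it scales back down to 2.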

If your organization consists of staff already trained in ESXi technologies, or if your organization's information systems security rules are not ready for a Cloud foundation, it's still possible to set up a Kubernetes cluster on your traditional ESXi infrastructure. This could be considered in a transformation plan, at the cost of temporarily increased technical debt while your historic teams get trained in Cloud technologies.

In terms of security, containerized technology interfaces are standardized. It's no longer about checking the container's contents since the infrastructure takes care of it. It's about ensuring the security of containerization technology (e.g., Docker, CRI-O), as well as orchestration technologies (e.g., Kubernetes, Rancher, OpenShift).

For example, you may have ensured that Microsoft Word is secure through certification, but you don't certify every Word file separately. It's the same for a containerized application: whether coded in Python, Go, PHP, or embedding the latest libraries, it's the container technology running it that needs certification.

In conclusion, treat your foundation as a product serving your engineers. The more you centralize and automate the use of this foundation's services, the less technical debt you'll have to maintain. Ultimately, this effort results in better service availability for your customers.

Abandoning VMs?

With microservices at the heart of Cloud DevOps infrastructures, containers seem like the ultimate deployment solution. Virtual machines become redundant, as orchestrators (e.g., Kubernetes) can be installed directly onto the machine. This is referred to as a "bare-metal" installation. However, during a transformation, it's imprudent to outright discard VMs.

Only rarely can you shift overnight from your existing production infrastructure to a cloud setup. If your teams are accustomed to managing VMs, they need time to familiarize themselves with these new technologies. Similarly, applications require time to migrate to a compatible format.

To progress, set a goal to reduce VM usage. For example: "In 1 year, at least 80% of our software should run in containers." Or: "Any new software must be containerized for deployment." This involves setting up the prerequisites mentioned in the "Prerequisites" chapter, as well as implementing DevOps tools: a software factory, container image registries, and a containerized deployment environment.

Here are several complementary situations where VMs remain valuable:

  • Legacy or critical software of your company cannot be deployed as containers ;
  • The industry partners you work with aren't yet using containers ;
  • Your I-T security rules force you to use a specific operating system. To save time on making your installation compatible with this OS, you might install virtualization software (e.g., KVM) to use the OS of your choice and kickstart your DevOps infrastructure setup in a familiar environment ;
  • Infrastructure installation scripts are intended to be shared with various entities. In some cases, these entities may have strict security rules demanding the use of VMs ;
  • If you don't have a dedicated machine to deploy your infrastructure, VMs can be helpful to separate new software from existing installations. Similarly, if you have limited resources, VMs can isolate the workload brought by your DevOps software chain ;
  • As long as your teams aren't ready for a 100% cloud shift, the backup/restore process for a VM might be simpler to handle.

However, remember that maintaining this abstraction layer (VMs) to solely manage Kubernetes on top adds complexity to your infrastructure. As practices evolve in your organization, consider removing these layers. But maintain the flexibility to instantiate them when necessary.

Within a Cloud DevOps infrastructure, tools like KubeVirt or Virtlet can be used to spin up VMs within a Kubernetes cluster. This can facilitate a smooth migration of your legacy applications while getting your teams hands-on with cloud technologies. More visual tools like OpenStack can also assist the transition to this ecosystem, more seamlessly than the traditional command lines in a terminal.

Open-source: Risks and Strategic Advantages

Open-source technologies represent 77% of the libraries used in proprietary (or "closed-source") software. Among the top 100,000 websites, Linux - an open-source operating system - is used in nearly 50% of cases.

A European Union report states that in 2018, contributions from Europeans to GitHub - the world's largest open-source contribution platform - equated to 16,000 full-time positions. That's close to a billion euros for companies in Europe. These contributions offer a cost/benefit ratio of 1 to 4, allowing businesses to remain cutting-edge, develop quality code, and reduce maintenance efforts.

For instance, software like the Firefox browser, the Python programming language, or the Android operating system wouldn't exist without open-source. Even the proprietary software icon, Microsoft, began its open-source contributions to the Linux kernel in 2009. In 2014, its new CEO, Satya Nadella, proclaimed, "Microsoft loves Linux". Despite criticism, the company even acquired GitHub in 2018 and seems to continue delivering satisfaction to the community. They continue contributing to numerous open-source projects listed on opensource.microsoft.com.

However, while the use of open-source in the private sector is a no-brainer, technical teams in large organizations sometimes face skepticism from wary project managers. These teams are challenged regarding their use of open-source technologies based on security concerns.

Such skepticism isn't without merit. The idea of importing a third-party library into one's I-T system without examining its contents can seem risky. Potential risks include:

  • A library that arbitrarily deletes data ;
  • A library transmitting data to a remote server (such as software data or telemetry) ;
  • An updated library that no longer works (due to bugs or deliberate sabotage with protestware) ;
  • Legally, the use of open-source tech might be governed by license terms (e.g., prohibiting selling software developed using the library).

So, it's about striking a balance between the productivity provided by open-source libraries/software and the trust we place in them.

Yet, it's a mistake to believe that simply buying software will ensure its security. Although responsibility is outsourced, the damage, if it occurs, is done. Google's engineering heads predict that by 2025, 80% of businesses will use open-source technologies maintained by salaried individuals (e.g., GitHub Sponsors).

Historically, the official policy for approving certain libraries went through a certification cycle, aimed at mapping the risks associated with using a technology to decide whether to accept it. This decision could be supported by a code audit.

For proper protection, maintain an active and systematic watch for security threats introduced into the code. In DevOps mode, your software factory is equipped with tools to detect vulnerable dependencies or malicious code. You minimize risks by securing your software chain.

For example, if you can't set up a secure software forge yourself, you can use GitHub features. More broadly, security practices at GitLab are a great starting point.

Joining a bug bounty platform is common among large enterprises, to analyze both their websites and the open-source software they use. A bug bounty system rewards individuals for identifying vulnerabilities, aiming to detect and fix them before they're exploited by malicious hackers. Popular platforms in this area include HackerOne, Bugcrowd or Open Bug Bounty.

In a mature organization, you could even open an Open Source Program Office (OSPO), responsible for defining and implementing strategies for using and securing the open-source technologies employed in your organization.

Finally, major tech companies often release new software as open-source. These quickly become standards used by tens of thousands of developers worldwide. This facilitates the onboarding of engineers to their technologies without incurring training costs. These companies thus find themselves with candidates already proficient in their technologies.

Far from benefiting only these companies, this practice benefits the entire sector, which now has a pool of candidates familiar with the same tools and practices.

Assessing security and training

To excel in system resilience, as in any field, training is necessary. That's why one of the recommended practices in Site Reliability Engineering is to train to handle incidents. The objectives are as follows:

  1. Assess the quality of incident response ;
  2. Evaluate the resilience of the infrastructure ;
  3. Train engineers to better understand their infrastructure and the tools at their disposal to respond to incidents.

To ensure that its teams are well organized in case of an incident, Google has designed two types of training. The goal is to reduce the Mean Time To Mitigation (the average time to resolve an incident), as a long one would impact the company's service contracts.

  1. Disaster Recovery Testing is an exercise in which a group of engineers plans and causes an actual failure over a defined period to test the effectiveness of its incident response. It is recommended to perform these trainings at least once a year on your critical services ;
  2. The Wheel of Misfortune is a fictional scenario drawn at random, in the form of a role-playing game similar to Dungeons and Dragons, where a team of engineers faces an operational emergency. They interact with a "game master" who invents consequences for the actions that the engineers announce they will take. Engineers take this opportunity to review their incident investigation procedures. This practice is particularly useful for newcomers but requires that the game master be particularly experienced.
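The Mean Time To Mitigation that these exercises aim to reduce is a simple average, which can be sketched as follows. This Python example assumes that each incident is recorded as a pair of timestamps (detection, mitigation); the function name and this data layout are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_mitigation(incidents: list) -> timedelta:
    """Average duration between the detection and the mitigation of each incident."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Each incident is a (detected_at, mitigated_at) pair of datetime objects.
    total = sum((mitigated - detected for detected, mitigated in incidents), timedelta())
    return total / len(incidents)
```

Computing this value before and after a series of trainings gives a rough indication of whether incident response is actually improving.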

Amazon Web Services offers an approach similar to Google's Wheel of Misfortune, named Game Days. The company lists its critical services and the threats that can be associated with them (e.g., data loss, overload, unavailability) to determine a "disaster" scenario. Subsequently, the idea is to provision an infrastructure identical to production and cause the desired failure. The company then observes how its teams and production tools react to the incident.

These trainings are more commonly called fire drills. Again, their goal is to practice incident response in an urgent situation. During these trainings, someone should note any incomplete or missing elements in existing procedures or tools, with a view to improving them.

Netflix goes even further with its tool Chaos Monkey, which automatically and randomly stops production services at any time. The goal is to ensure that customers continue to have access to Netflix, even with one or more internal services down. Their tool Chaos Gorilla even goes as far as simulating the shutdown of a complete AWS availability zone (e.g., a datacenter that would be taken out of service) to observe its consequences on the platform's availability. These practices are part of what is called chaos engineering.
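The principle can be illustrated with a toy simulation in Python. This is not Netflix's implementation: the services are only counters in memory, and the names `chaos_round` and `platform_available` are hypothetical.

```python
import random


def chaos_round(services: dict, rng: random.Random) -> str:
    """Stop one running instance of a randomly chosen service and return its name."""
    # Only services that still have at least one running instance can be hit.
    candidates = sorted(name for name, count in services.items() if count > 0)
    if not candidates:
        raise ValueError("no running instance left to stop")
    victim = rng.choice(candidates)
    services[victim] -= 1
    return victim


def platform_available(services: dict, critical: list) -> bool:
    """The platform stays available while every critical service keeps one instance."""
    return all(services.get(name, 0) > 0 for name in critical)
```

With two instances of each critical service, one round of chaos leaves the platform available; with a single instance, the same round takes it down, which is exactly the kind of weakness these exercises are meant to reveal.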

Finally, apart from audit software that can detect some vulnerabilities (e.g., Lynis or Kube Hunter), there are other exercises for your security and SRE teams to practice.

The most popular way to assess the security of your infrastructure is the "blue team / red team" exercise. Inspired by military training, it consists of a face-off between a team of cybersecurity experts trying to compromise an information system (represented by the red team), and the incident response teams (represented by SREs - the blue team) who will identify, evaluate, and neutralize the threats. The idea is not to rely on the theoretical capabilities of your security systems, but to confront them with concrete threats to assess their usefulness and weaknesses. Variants exist with a purple team, a white team, or even a gold team. But start by setting up a simple scenario. For example: one of your developers introduces a tainted Docker image or tainted code.

This topic is vast and practices vary depending on the size of the organization you are employed in. This chapter gives you some references to start your training practices. Structure them subsequently according to your objectives and resources.

The Pillars of DevOps in Practice

Here we reach the heart of the matter. In this chapter, we will discover the different pillars of DevOps, describing the various practices and technologies that can meet our challenges.

In terms of organization, see DevOps as a way to apply "healthy pressure" on your teams, encouraging everyone to move in the same direction. It is about getting everyone to communicate optimally through standardized technical tools.

Breaking Down Organizational Silos

The desire to eliminate silos within an organization is a common mistake. Let's see it differently: a silo is a concentration of knowledge; it's expertise.

It's fortunate that your organization has silos. For instance, they might consist of experts in organic chemistry, experts on the political science of a specific world region, or masters of a particular technology.

The creation of a silo is often essential to cater to an expertise requirement. This silo becomes necessary because this specialized expertise requires a structure tailored to its tasks. The key is to have tools and practices to let this silo communicate with the rest of your organization.

Undesirable silos emerge when the company doesn't provide teams with the tools they need to work effectively. Individuals then take initiatives to find more efficient alternatives. This is a predictable "immune" response when employees face deteriorating work conditions.

For instance, your expertise center is aging and doesn't renew its tools. Faced with an ever-increasing workload and the company's inaction, long-standing employees become frustrated. Some become disheartened at the thought of discussing issues with unresponsive managers. Others try introducing new practices but face outright refusals. Then, new employees join and find that their working conditions are below expectations. Knowing of an extremely efficient piece of software, one new employee introduces it. Its compelling efficiency makes it popular, and it spreads throughout the center and then the company. Of course, the employee won't discuss this with the management, for fear of criticism and of having this new tool banned.

Having failed to anticipate the decline or heed internal feedback, management eventually initiates a transformation project. Concurrently, the isolated initiative, undertaken without informing management, leads to scope conflicts and unclear objectives. Management becomes increasingly disconnected from their teams, unaware that they've adopted new practices. The lack of communication with other employees and duplicated efforts become noticeable. This is the result of a lack of overall coherence, a fertile ground for resistance to change in the face of the leadership's belated reaction.

To prevent the decline of a silo that would spread company-wide, there needs to be effective communication among these silos. Silos can include the Executive Committee, expertise centers or teams. Management must provide tools to ensure the entire company speaks the same language. They should also instill a shared vision, allowing teams to collaborate towards a unified goal.

The DevOps movement believes in using common methodologies and tools to facilitate these exchanges. This chapter outlines the methodologies to adopt to achieve this objective.

Mapping the existing

To achieve a successful transformation, one must have a comprehensive view of the starting environment. Mapping it is a vital step that lets you understand the reality you're operating within and gauge the investments required.

Your environment's map should answer the following questions:

  • What are the company's mission(s)?: This might seem basic, but not all organizations clearly define this goal. Ensure you understand the company's objectives, i.e., the problems it addresses. Clearly grasp its business model to better formulate your transformation plan.
  • Is there an existing strategy?: What directions were given during the last transformation, and what can you learn from them? You might need to adjust your plan according to an already-implemented strategy or start afresh or in isolation.
  • Which teams are working on which mission?: List the existing teams in the organization and their contacts: whom do they serve? Who do they need? Perhaps some teams aren't collaborating with those they should be or can't communicate effectively.
  • What kind of profiles exist in the teams?: List the number of employees and their expertise. Maybe there's a data scientist in one team who'd be more valuable elsewhere. Perhaps there are too many project managers and not enough software engineers. Maybe the company doesn't yet have the profile you need.
  • How do teams exchange information?: List their communication tools. Some employees might be using the new internal cloud service implemented recently, while others might still rely on email.
  • Which teams have access to which data?: List team data access. Are there silos where teams hoard information? Is there an inadequately monitored database risking data leaks? Is a data source especially used or strategic?

By having a clear, substantiated overview of how the company is organized, you'll pinpoint critical areas to address. Use this document as your starting point and iterate on actions to take.

A Unified Network

Imagine for a moment data-scientist teams within each of your organization's offices. Fantastic! Every department has dedicated technical support to process their data. However, soon these teams of engineers begin communicating and realize they are working on the same topics. They notice they're developing the same things. This is frustrating for them, but it primarily means the company is wasting money.

If no one is aware of what the other is working on, efforts will naturally be duplicated. In large organizations, needs are often systemic: offices encounter the same problems with minor variations. Technical solutions can often address these problems in 90% of use cases.

By working on a unified network, all your documents and data are shared. Gone are the questions like "Has the Marketing department provided all the stats?", "Where's the latest version of this presentation?" or "Where is the procedure to apply for holidays?". Everyone virtually works in the same place. No more wondering whether a folder's content is up-to-date.

Engineers can share technical environments instead of redeploying infrastructure in every office. For instance, there's no need to duplicate a mirror of development libraries on two machines a few offices apart. For machine learning, a network allows computational power to be shared by utilizing resources from a central supercomputer.

In many large organizations, the main obstacle to adopting internally developed software is the network it's deployed on. Teams are forced to deploy on a different network from the departments due to information systems security concerns.

To make their software accessible on the departmental network, validation is often required. For any developed software, this process can take several months to a year. If these teams deploy dozens of updates every day, enduring such delays is impractical. In the end, the users you have the least time to assist will abandon your tools because the delays will become too irritating for them.

Using a unified network is key in adopting your new tools. It allows your organization to save money, and your collaborators to be less frustrated by delays.

In the next chapter, we will see how a software factory is organized and how, thanks to a unified network, it greatly increases the productivity of the organization.

The lifecycle of modern software

Software factory

The software factory is at the heart of your DevOps infrastructure. This is where your engineers will spend most of their time: if they're not coding in their IDE, they will be managing their projects in the software factory.

A software factory consists of a software forge and services allowing your engineers to develop and deploy software on your infrastructure: container image registries, dependency mirrors, artifact/binary repositories. Nowadays, most of these features are directly available within software forges.

The most popular software forges are GitLab and GitHub. GitLab is more commonly found in large organizations since it can be deployed easily and free of charge on isolated networks. Other platforms like Froggit, Gitea, and Bitbucket also exist.

As we will see in the chapter "GitOps", in DevOps, all software source code and all production configurations are stored as code within a software forge. It's thus said to be the "single source of truth" of your infrastructure.

Without a software forge, development teams would each work in their local folder and exchange code via network folders or USB sticks. This would waste significant time when merging code from multiple different contributors and would create numerous security issues. Needless to say, there's no way to trace actions or recover files in case of accidental deletion. Additionally, project management teams would be entirely left out of the software development cycle.

The most popular software forges rely on the git technology, allowing for tracking every contribution. Thanks to git, it's possible to know who made which change and when. You can track the history of contributions and easily manage merging these contributions. We will delve deeper into these mechanisms in the next chapter.

Today, development, sysadmin, InfoSec, and management teams collaboratively work on such platforms, capitalizing on:

  • The list of features to develop for software ;
  • Discussions on designing a feature;
  • User and technical documentation for software ;
  • Software source code ;
  • Infrastructure documentation ;
  • Infrastructure administration scripts ;
  • Security rules ;
  • Software quality rules.

The goal is to store as much knowledge as possible in one place, ensuring the most up-to-date documentation is always consulted.

Consider git for capitalizing on guides, tutorials, and even administrative procedures for your teams. If someone spots an error or outdated information in documentation, they can directly suggest the modification in git to keep the document current.

Teams adopting DevOps are replacing traditional Word or Excel with Markdown. This format, designed to be intuitive for both humans and machines, is independent of any proprietary technology (e.g., Microsoft Word).

It's even possible to create presentations in code form with tools such as Markdown-Slides or reveal.js. They can then be viewed in a simple browser.

Conversely, git is not designed to store large files. One should avoid storing large images, videos, binaries, or archives in it. Other technologies can store these types of files (such as Amazon S3, MinIO, HDFS or CephFS) with or without a reference to a git project (e.g., DVC).

However, the software factory is not just about capitalizing knowledge. It also serves as a control point for all contributions. An initial level of control is established by adding users to projects they have permission to contribute to.

But a second level of control can be set up: through continuous integration mechanisms, automated scripts can validate a contribution based on rules defined by your organization (they can include rules for software quality or information security compliance). If the contribution doesn't meet your standards, it's rejected. The contributor sees it instantly, knows why, and can suggest a correction within minutes.
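To picture this second level of control, here is a minimal sketch of the kind of script a continuous integration chain might run. Everything in it is an assumption made for illustration: the `Contribution` fields, the rule wording and the 80% coverage threshold are hypothetical and do not come from any particular software forge.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    """Hypothetical summary of a contribution, as a CI job might see it."""
    author: str
    test_coverage: float     # percentage of the code covered by tests
    contains_secrets: bool   # result of a secret-scanning step
    linter_errors: int       # number of quality-rule violations


def validate(contribution: Contribution, min_coverage: float = 80.0) -> list[str]:
    """Return the violated rules; an empty list means the contribution is accepted."""
    violations = []
    if contribution.test_coverage < min_coverage:
        violations.append(
            f"coverage {contribution.test_coverage:.0f}% is below {min_coverage:.0f}%"
        )
    if contribution.contains_secrets:
        violations.append("a secret (password, token) was found in the code")
    if contribution.linter_errors > 0:
        violations.append(f"{contribution.linter_errors} linter error(s)")
    return violations


if __name__ == "__main__":
    request = Contribution("caroline", test_coverage=72.0,
                           contains_secrets=False, linter_errors=2)
    problems = validate(request)
    print("rejected" if problems else "accepted", problems)
```

Returning the list of violated rules, rather than a simple yes or no, is what lets the contributor know why the contribution was rejected.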

Since software factories can manage access to resources based on a user's profile, it's entirely possible to open yours to external partners (e.g., contractors). They can then add their software following the rules established by your organization and will immediately know how to comply. These rules are defined internally by security engineers.

This is already the case with Platform One, which opens its software factory to manufacturers contracting with the U.S. Department of Defense. Similarly, NATO operates its own software factory, the NATO Software Factory.

However, transformation offers an opportunity to develop internal expertise before being able to define rules for others. You must master the technologies discussed in this chapter to ensure your platform's security. So, work first on your internal projects before overseeing external ones. Each organization is unique but should have its internal experts to provide the best advice.

As described in the chapter "Code Reviews", these reviews offer an opportunity to provide feedback on a contribution before it's deployed. It's possible to set rules so that specific teams (e.g., security team) must approve the contribution before it can be accepted. This mechanism can be seen as a "seal of approval". Software factories contain all these contribution validation features to best ensure the software supply chain's security.

Lastly, the software factory is where software developed by your teams will be built and then deployed on your infrastructure. Analogous to the continuous integration principle, continuous deployment chains are responsible for deploying software according to rules defined in code.

Caution: under no circumstances does a software factory allow your teams to develop software per se. The software factory provides resources for engineers to develop their software (such as dependencies, packages or binaries) but doesn't allow for code writing or execution within it.

The software factory to a developer is what a brush set is to an artist: the set contains all the tools to paint, but the artist spends their time working on their easel. The developer's easel is their IDE on their computer: they code and run their code to test it as they write it. Options for setting up development environments are described in the chapter "Zero Trust-Based Development".

All these technologies help to bring teams closer together and unify practices within the organization. In the next chapter, we will explore how technical teams can organize themselves to collaborate effectively within a software factory.

GitOps

GitOps is a methodology for Cloud applications based on continuous deployment. It uses git projects as a "single source of truth" for infrastructure and application configurations. Once capitalized in this way, the configuration is termed "declarative". In other words, you "code" configuration files to define how to deploy your infrastructure.

The idea behind GitOps is to rely on code to determine the system's desired state.

Synchronizing the desired state is achieved through specific technologies (e.g., ArgoCD or FluxCD). This approach provides a single source of truth for the entire system, facilitating change tracking, configuration auditing, and ensuring the infrastructure meets the company's requirements.

Example: if you need to create a backup mechanism, you can code an Ansible playbook, push it to a git project, and a continuous deployment chain will deploy the change. The target end state is described by code.
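The idea of a desired state can be sketched in a few lines. The example below is a deliberately simplified picture, not how ArgoCD or FluxCD are implemented: it assumes the desired state (read from git) and the actual state (observed on the platform) are both available as dictionaries, and the resource names are hypothetical.

```python
def plan_changes(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the state declared in git with the state observed on the platform."""
    return {
        # Declared in git but not running yet.
        "create": sorted(name for name in desired if name not in actual),
        # Running, but with a configuration that differs from git.
        "update": sorted(
            name for name in desired
            if name in actual and desired[name] != actual[name]
        ),
        # Running, but no longer declared in git.
        "delete": sorted(name for name in actual if name not in desired),
    }


if __name__ == "__main__":
    # Hypothetical content of the git project (the single source of truth).
    desired = {
        "website": {"image": "website:1.4", "replicas": 3},
        "backup-job": {"image": "backup:2.0", "schedule": "0 2 * * *"},
    }
    # Hypothetical state currently running on the infrastructure.
    actual = {
        "website": {"image": "website:1.3", "replicas": 3},
        "old-cron": {"image": "cron:0.9"},
    }
    print(plan_changes(desired, actual))
```

A synchronization tool repeats this comparison continuously and applies the resulting plan, so that the platform converges toward what the code describes.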

You can start by writing manually launchable IaC scripts and then choose an automated solution after maturing on the topic (e.g., an Ansible script automated by a continuous integration (CI) chain with deployment handled by ArgoCD).

Git Workflows

A git workflow is a method to organize contributions to a software's code.

git allows for easy collaboration on code by providing a file history mechanism. But for effective collaboration, organization is essential.

Imagine several engineers working on a car on an assembly line. Robert works on the starter while Caroline ensures the headlights respond to commands. But when Robert turns off the car, Caroline can't measure the current. They can't work together simultaneously.

Due to a personal emergency, Robert has to leave quickly and is replaced by Marie. Unfortunately, Robert didn't have time to tell Marie his progress. So, she has to guess based on what she sees.

It's the same with software. Working in the same place at the same time leads to collisions. In git, when two people work on the same file and try to merge it, it causes a "conflict". Some structure is necessary when developing a project. Otherwise, there's a risk of accumulating technical debt and ending up with unmanageable software.

git operates on a branching principle. By default, only the main branch, main or master, exists. It's considered "stable". If an integrator has to deploy software in production, they will choose the code from this branch.

A developer wanting to design a new feature will create a new branch from the main branch. This results in a copy of the code where changes (commits) are at their discretion, without disturbing others. Once the feature is finalized, the developer can make a "merge request" towards the main branch.

There are three questions to answer when determining a "good" git workflow:

  1. Will this method adapt to the size of my team?
  2. Does this method easily allow for reverting to a previous version of the code in case of an error?
  3. Is this method not too complex to use on a daily basis?

Several methods have emerged over time, but there are 4 main ones:

  1. Release Branching. Suited for periodic software deployment or release, this method involves creating a new branch from the main branch and then stabilizing it with bug fixes and other changes before publishing.
  2. Gitflow. An extension of the Release Branching method, it uses 6 parallel branches addressing specific needs: release, hotfix, feature, support, bugfix, in addition to the main branch. This method is historically used for managing very large projects.
  3. GitHub or GitLab flow. This method reduces the complexity brought by Gitflow by eliminating the 5 branches that run parallel to the main one. A developer should create a branch for each new feature from the main branch. A release can be created at any time from the main branch. Beyond its simplicity, the benefit is having a branch that always contains functional code and knowing it's up-to-date at all times.
  4. Trunk-based. This method promotes continuous deployment of software. Unlike GitHub flow, there's only one branch here. Each developer pushes their code directly to the main branch (the trunk). This encourages making small contributions that are easy to revert in case of bugs, while reducing time spent on conflicts, since developers sync their code more frequently.

According to Atlassian, the state-of-the-art git workflow today is trunk-based development. Google's codebase is a good example: despite tens of thousands of daily contributions, they chose this method.

However, you may not have Google's engineering teams. Trunk-based development requires a specific rigor that only a seasoned technical team can handle. This method needs continuous integration chains that ensure the pushed code is valid. It also involves creating optimized continuous deployment chains. Writing these chains takes time and requires experience.

If you don't have a properly equipped team, it's recommended to stick to the GitHub flow.

But adhering to a common contribution methodology isn't enough. While collaboration might now be easier, you're not equipped to understand everyone's status. In the next chapter, we'll explore a project management method.

Flexible flow: a balanced git workflow

You might not have a large team at hand but want to benefit from best organizational practices on your git project. This chapter provides insights into a method suitable for most development teams. It's the one I use for the vast majority of my projects, whether professional or personal.

Drawing from the best of various Agile methodologies, it borrows their pragmatism without their organizational overhead. This methodology will be more suitable for a transforming hierarchy compared to the more demanding trunk-based development. I-T managers also prefer it because it establishes software versions and facilitates long-term project maintenance. Finally, it allows both project managers and developers to easily track developments.

Named "Flexible flow", it is based on GitHub flow but adds a link between project management teams and technical teams.

To bridge project management and technical contributions, GitLab or GitHub projects use issues. These are tasks assignable to a collaborator, describing which feature to develop or which bug to fix.

With the Flexible flow, every contribution must refer to an issue that describes the task's genesis, how it can be resolved, and centralizes stakeholders' thoughts. Anyone can create these tasks : project managers, developers or users. The project manager then prioritizes them. Any newly assigned developer should know: what to do, where to start, and why by consulting the issue. Each is automatically numbered by the software forge.

For Project Management:

  1. Use the Kanban view of your software forge (e.g., GitLab).
  2. Create four columns: Open, To do, Doing, Done.
  3. Create and document the issues in the Open column.
  4. Create contribution type labels. They help teams understand the nature of the contribution to be made.
  5. Create contribution domain labels. They allow specialized teams to know which task to handle. For instance, on GitLab, any team concerned by a label can "subscribe" to it to know when a new task is added.
  6. Create commercial value labels. They enable prioritizing tasks based on the commercial value the completed task brings: p1 indicates high commercial priority, while p4 denotes low commercial priority.
  7. Create complexity labels. They allow prioritizing tasks based on complexity and the time the task demands: 1 is for a simple task, while 4 is for a highly complex or lengthy task.
  8. Create priority labels. They help prioritize tasks based on the current project development context (political or client priorities).
  9. For each issue, assign a label from each category (e.g., domain, type, commercial value, complexity and priority).
  10. Order the issues by priority: the higher an issue is in the Open column, the more important it is.
  11. Assign an issue to a team member and move it to "To do".
  12. Optionally, create a milestone to group issues meant for a specific software version. A milestone is often tied to a software release date (deadline).

Contribution Management:

  1. Trunk branch: a singular "main" branch from which developers can branch off to add a contribution.
  2. Feature branching: 1 code change = 1 branch = 1 issue.
  3. Mention the issue number in every commit.
  4. Once the contribution is ready, create a merge request to the trunk branch.
  5. Release: update the software version (in files like package.json) and make a release from the trunk branch.
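
The rule "mention the issue number in every commit" lends itself to an automated check. The sketch below is only one possible way to do it, and it assumes a convention that the method itself does not impose: issues are referenced as a hash sign followed by a number, such as `#42`. The function names are hypothetical.

```python
import re

# Assumed convention: an issue is referenced as "#<number>", e.g. "#42".
ISSUE_REFERENCE = re.compile(r"#(\d+)\b")


def referenced_issues(commit_message: str) -> list[int]:
    """Return the issue numbers mentioned in a commit message."""
    return [int(number) for number in ISSUE_REFERENCE.findall(commit_message)]


def commits_without_issue(messages: list[str]) -> list[str]:
    """Return the commit messages that do not mention any issue."""
    return [message for message in messages if not referenced_issues(message)]


if __name__ == "__main__":
    history = [
        "Fix login redirect (#42)",
        "Add CSV export #7",
        "typo",
    ]
    print(commits_without_issue(history))
```

Run inside a continuous integration pipeline, such a check can reject a contribution whose commits cannot be traced back to an issue.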

Other Recommendations:

  1. Do not make a merge request from the trunk branch to a feature branch ;
  2. Limit the size of contributions, preferring to create multiple smaller tasks ;
  3. Limit the time spent on a review to avoid time-consuming conflicts;
  4. Every contribution must pass a CI pipeline ;
  5. As your team matures, consider adding a Continuous Deployment pipeline for every contribution that passes CI on the trunk branch.

Author's Note: This method has proven effective over time in the projects I've contributed to. It is easy to grasp, even for beginners. I've refined it over time to make it less cumbersome while still addressing technical debt and team turnover.

To introduce this methodology to your teams and easily access references, view its full-resolution illustration.

12-Factor methodology

Cloud technologies offer undeniable flexibility and allow for serving an increasing number of clients compared to traditional technologies. However, transitioning from a monolithic software to a scalable application requires adhering to certain design principles.

The 12-factor methodology (Twelve-Factor Methodology) encompasses a list of best practices for creating applications suitable for Cloud platforms. It summarizes the experiences of Adam WIGGINS and his engineers at Heroku. The aim is to prevent "software erosion", a phenomenon defined by the slow degradation of software over time, which eventually becomes faulty or unusable. In other words, it helps in creating applications that are easier to maintain, deploy, scale, and more resilient to failures.

The website 12factor.net, created by Adam WIGGINS, lists and details these principles:

  1. Single Codebase. Centralize the code in one place (e.g., GitHub or GitLab) and assign a unique directory/project for each software. It should be adaptable to different environments. For instance, avoid creating separate "production" and "development" projects for the same software.
  2. Declared and Isolated Dependencies. All dependencies should be declared in a file - not implicitly loaded based on their presence or absence in a machine's folder. For example, using package.json for NPM and requirements.txt for Python. They should be isolated during execution to ensure no dependencies are pre-installed on the machine. This can be achieved using Python's virtualenv, Ruby's bundle exec, or Docker for any language.
  3. Environment-based Configuration. The software should adapt to the deployment environment, not the other way around. Utilize environment variables and avoid constants in your applications to tailor your software's behavior to its deployment setting.
  4. Access Third-party Services via Connection Credentials in Variables: Databases, queue systems, SMTP-type emailing services, caching, or other APIs used by the application should be interchangeable based on the deployment environment. The software relies on URLs or credentials set as environment variables. For instance, the software should be able to connect to databases from two different Cloud providers without any code changes, as long as the technologies are the same.
  5. Distinct Build and Run Stages. Freeze the code at runtime, making alterations impossible. Assign a unique identifier for each software release (e.g., a timestamp) and make the code immutable for that version. Any code change necessitates a new release.
  6. Create Stateless Applications. Every application should be self-sufficient and connect to an external service when it needs to interact with data (e.g., the databases mentioned in point 4). For example, each request to an API route shouldn't include a caching mechanism for a user session. The API should solely rely on parameters present in the request to respond. This also encompasses the concept of "microservices." The goal is to separate software functionalities into independent modules, each scalable on its own. This is referred to as horizontal scaling. This contrasts with "monolithic" architectures.
  7. Access Services through Port Binding. An application shouldn't require adding a web server to operate. Each application should come with a means to serve its content and expose its port. It should be possible to assign a port for one specific environment and a different one in another environment.
  8. Concurrency. Allowing application concurrency means having the ability to instantiate multiple clones without the need for coordination or shared state among them. This concept aligns with point 6, implying that various application instances rely on third-party services to manage data. This facilitates scaling individual components of the application (microservices) independently based on user load (e.g., the Unix Process Model, Horizontal Pod Autoscaling in Kubernetes).
  9. Disposability and Restart Control. An unexpected application shutdown shouldn't impact its restart; it should continue to operate as before, adapting to the current infrastructure state. The application startup should be quick (within a few seconds). The software shutdown should be controlled upon receiving a SIGTERM signal.
  10. Environment Parity. Different environments (e.g., development, pre-production and production) should be as similar as possible. Using platform services (e.g., databases, caching services) with different versions can lead to incompatibilities and errors once the software is in production.
  11. Treat Application Logs as Streams. An application should never handle redirection or storage of activity logs. It shouldn't attempt to write or manage log files. Instead, it should write logs to stdout as promptly as possible. This enables the Cloud platform to easily process logs from deployed applications.
  12. Execute Administrative Tasks with One-off Commands. Applications should include scripts or tools for executing administrative actions. For instance, initiating a database migration with a Python script, accessing a console to investigate a production database with the P-S-Q-L command line, or triggering a backup with a command. The idea is to facilitate script execution in the same environment where the software is deployed.
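
Several of these principles translate into only a few lines of code. The sketch below illustrates environment-based configuration (points 3, 4 and 7), logs written to stdout (point 11) and a controlled shutdown on SIGTERM (point 9). It is one possible reading, not a reference implementation: the variable names `DATABASE_URL` and `PORT`, the default port and the example address are assumptions.

```python
import logging
import os
import signal
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int


def load_settings(env=os.environ) -> Settings:
    """Points 3, 4 and 7: credentials and port come from the environment."""
    return Settings(
        database_url=env["DATABASE_URL"],    # raises KeyError if missing
        port=int(env.get("PORT", "8080")),   # assumed default port
    )


def configure_logging() -> logging.Logger:
    """Point 11: write logs to stdout and let the platform collect them."""
    logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                        format="%(levelname)s %(message)s")
    return logging.getLogger("app")


def install_sigterm_handler(on_shutdown) -> None:
    """Point 9: react to SIGTERM so that the shutdown is controlled."""
    signal.signal(signal.SIGTERM, lambda signum, frame: on_shutdown())


if __name__ == "__main__":
    log = configure_logging()
    # A plain dictionary stands in for the real environment in this example.
    settings = load_settings({"DATABASE_URL": "postgres://db.example/app",
                              "PORT": "9000"})
    install_sigterm_handler(lambda: sys.exit(0))
    log.info("listening on port %d", settings.port)
```

Because nothing is hard-coded, the same code can run in development, pre-production and production with only the environment variables changing.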

Implementing these criteria - particularly breaking software into microservices - combined with continuous deployment pipelines, increases the chances of anticipating software incidents by 43% according to research. Containerization is especially suited to these practices. Concepts of isolation are recurrent, and technology like Docker aptly addresses them.

Even though these are now standard practices in the industry, it might be beneficial to include them in a guide for newcomers.

Instant messaging

A simple yet especially effective way to bridge the gaps between silos is to implement a shared instant messaging system. Through this medium, team members can quickly communicate without overloading their email inboxes, engage in group discussions about the next feature to develop, share snippets of code or documents effortlessly, foster cohesion by sharing memes, make general announcements, or even conduct polls to decide on a list of options. Beyond facilitating collaboration, messaging enables remote work or collaboration with decentralized teams.

In the context of system reliability improvement, messaging is the ideal location to centralize production alerts for SRE teams. System monitoring tools can be set up to raise alerts in a single location. SREs are immediately notified when an alert is issued. When set up properly, these alerts provide all the information the SRE needs to promptly address the issue. Production teams can also easily inform their users of actions they are taking on the production system.
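
As an illustration, an alert that "provides all the information the SRE needs" could be formatted as follows before being posted to a channel. This is a sketch under assumptions: many messaging platforms accept incoming webhooks as a small JSON document, but the `text` field used here should be checked against your platform's documentation, and the `Alert` fields and the URL are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str      # e.g. "critical" or "warning"
    summary: str
    runbook_url: str   # where the SRE finds the remediation procedure


def to_chat_message(alert: Alert) -> str:
    """Serialize an alert as a JSON payload for a chat webhook (assumed format)."""
    text = (
        f"[{alert.severity.upper()}] {alert.service}: {alert.summary}\n"
        f"Runbook: {alert.runbook_url}"
    )
    return json.dumps({"text": text})


if __name__ == "__main__":
    alert = Alert(
        service="checkout-api",
        severity="critical",
        summary="error rate above 5% for 10 minutes",
        runbook_url="https://wiki.example/runbooks/checkout-api",
    )
    print(to_chat_message(alert))
```

Including the link to the procedure in the message itself spares the SRE a search at the moment every minute counts.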

Messaging platforms like Mattermost, Element, Zulip, and Slack come with built-in VoIP and video conferencing capabilities. Most also natively support integration with tools used in production (e.g., automatic notifications for every GitLab release, incident reports in chats, updates on the status page or postmortem timelines).

Several companies, like Scaleway, open their corporate messaging to their customers. This forms a community of mutual assistance and a knowledge base for new users. It fosters engagement and reassures potential users, knowing there will be someone to answer in case of problems. Users facing issues can ask their questions, to which another user or a company expert can respond. At Canonical and Prefect, there are even "Community Engineers" whose specific role is to help the community with issues they may encounter. Some companies opt to charge entirely for this user support.

Remote work

Large organizations are often hesitant about offering remote work to their employees. They fear that employees might not focus on company tasks.

If you need to convince your superiors, clearly list the objectives for the remote-working employee. If that doesn't suffice, you might suggest the employee submits a daily work summary. However, this approach essentially tells the employee, "I don't trust you to be diligent." Think twice before proposing it.

Research has shown that a flexible work environment is linked to a reduction in burnout and an increased likelihood of an employee recommending their company.

Software architectures and agility

Understanding different software architectures will help you grasp how software is deployed in cloud architectures.

Depending on your organizational maturity and team sizes, certain architectures will make your software easier to maintain, update, or ensure longevity.

This chapter introduces three renowned architectures and describes their pros and cons. Finally, we will explore how to progressively transition your legacy software into microservices.

Monolithic architecture and microservices

A monolithic application is developed as a single, indivisible entity where every function or module is interconnected. The software components depend on each other.

While initially easier to develop and use, software designed as monoliths complicates the addition of new features as they grow.

Updates to one part of the system impact the entire application, necessitating extensive testing to ensure it functions as expected upon deployment. The potential "blast radius" of a bug is significant in such an architecture.

Many well-known applications, such as WordPress and Magento, still employ a monolithic architecture today. However, the trend is shifting towards microservices architectures, which are more scalable and resilient.

An application designed with microservices breaks down each software functionality into isolated services (e.g., email sending management, login management, order management). Each operates independently. Every microservice communicates with others using a predefined exchange format (through an API). Updates can be deployed without disrupting the whole system.

By splitting your software into microservices, you can parallelize team work on each part of your software. Each team develops and deploys independently.

But one of the major benefits of microservices is the ability to scale easily: the most in-demand services can be instantiated multiple times simultaneously to distribute the load. Some service orchestrators, like Kubernetes, allow automating this behavior.
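
To give an idea of how this automation decides, the sketch below applies the basic rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired number of instances is the current number multiplied by the ratio between the observed metric and its target, rounded up. The real autoscaler adds tolerances and stabilization delays that are left out here, and the minimum and maximum bounds are hypothetical values.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Simplified scaling rule: ceil(current * observed / target), kept within bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 instances at 90% CPU for a 60% target: more instances are needed.
    print(desired_replicas(3, current_metric=90.0, target_metric=60.0))
    # 4 instances at 20% CPU for a 60% target: instances can be removed.
    print(desired_replicas(4, current_metric=20.0, target_metric=60.0))
```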

However, this architecture requires advanced tools to maintain hundreds of intercommunicating microservices. DevOps teams facilitate the implementation of such architectures. For instance, they provide developers with application templates (or boilerplates), containing everything needed to kick-start a microservices application on its own infrastructure.

Serverless architectures and functions as a service

To enable fine-grained scaling of isolated functionalities, so-called "serverless" architectures have emerged. A serverless architecture has several advantages over traditional microservices approaches:

  • No longer having to manage the underlying infrastructure ;
  • Only pay when the service is used ;
  • Automatically provision resources during high traffic ;
  • Automatically remove unused resources.

Included in this are Function as a Service or FaaS technologies, Container as a Service, serverless compute platforms or SCP, self-managed storage services, and self-managed messaging services.

By only charging for resources when they are used, serverless represents a significant economic and ecological argument. These technologies can reduce your bill from tens of euros to a few cents every month.

For instance, Functions as a Service each represent an isolated feature of your microservice. If you know that only 10% of your functions are used 90% of the time, there's no need to pay for 100% of the resources constantly.

Consider a specific case: you decide to start an email marketing campaign. The email sending function will be in heavy demand for a short period. There's no need to scale the function that lists your products. The infrastructure will only provision instances for the email sending function.
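
The economic argument comes down to simple arithmetic. The figures below are entirely hypothetical (they are not the prices of any provider) and serve only to show how the two billing models compare for a function that is rarely called.

```python
def always_on_cost(hourly_price: float, hours_per_month: float = 730.0) -> float:
    """Monthly cost of an instance that runs permanently."""
    return hourly_price * hours_per_month


def pay_per_use_cost(invocations: int, seconds_per_invocation: float,
                     price_per_second: float) -> float:
    """Monthly cost when only the execution time is billed."""
    return invocations * seconds_per_invocation * price_per_second


if __name__ == "__main__":
    # Hypothetical prices, in euros.
    permanent = always_on_cost(hourly_price=0.02)
    on_demand = pay_per_use_cost(invocations=100_000,
                                 seconds_per_invocation=0.2,
                                 price_per_second=0.00002)
    print(f"always on: {permanent:.2f} EUR / pay per use: {on_demand:.2f} EUR")
```

The comparison reverses when a function is called constantly: the per-second price is typically higher than that of a permanent instance, which is one way costs can explode if the use case isn't suitable.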

However, serverless architectures require specific skills to maintain. They might also tie you to a Cloud provider's proprietary technologies or explode costs if the use case isn't suitable.

From monolithic to microservices

The leap to switch from monolithic software to a microservices architecture is often significant. However, this approach provides unprecedented flexibility in development and makes scaling drastically more efficient. But how do you make this transition without disrupting your entire operation?

Deciding to switch to microservices is tempting but involves compromises. British software engineer and author Martin FOWLER enlightens us on the prerequisites your team must have before starting this journey:

  • Ability to quickly provision machines
  • Ability to deploy quickly
  • Have the tools to monitor your services

In essence, we're talking about Cloud technologies and DevOps techniques. At this point, you just want to validate the development process of a microservice and deploy it automatically.

To practice creating microservices, start by decoupling a feature that can be extracted without requiring changes throughout your software. For example, an app's authentication mechanism is often centralized in a class or function: extract it into a microservice and expose it through an interface.

  1. Set up a development environment with automated tests, continuous deployment, and supervision tools, to grasp a generalizable first microservice.
  2. Implement a proxy around your application to control the flow.
  3. Minimize callbacks to the monolith.
  4. Segment the software by logical domain and prioritize the most complex functions at the beginning.
  5. Consider rewriting capabilities rather than extracting and reusing the code.
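
The proxy mentioned in the second step can be pictured as a simple routing rule: requests for features already extracted go to the new microservice, and everything else still goes to the monolith. The sketch below only shows the routing decision, not a working proxy, and the addresses and path prefixes are hypothetical.

```python
# Hypothetical addresses: adapt them to your own infrastructure.
MONOLITH = "http://monolith.internal"
EXTRACTED_SERVICES = {
    "/auth": "http://auth-service.internal",
}


def route(path: str) -> str:
    """Return the backend that should handle a request path."""
    for prefix, backend in EXTRACTED_SERVICES.items():
        # Match the prefix itself or anything below it, but not "/authors".
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return MONOLITH


if __name__ == "__main__":
    for path in ("/auth/login", "/products/12", "/authors"):
        print(path, "->", route(path))
```

Each time a new feature is extracted, you add one entry to the table. The monolith shrinks progressively without its users noticing the change.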

These insights and tips will allow you to confidently approach the task of rewriting your software to better integrate it into a Cloud infrastructure, taking advantage of the agility it offers your organization.

Embracing failure

You should be prepared to see failure as an opportunity to correct your course toward a better direction. If you face a significant failure, it indicates that you lacked the means to control the situation.

Drawing on essential tools and methodologies from the field, this chapter aims to help you understand the importance of a company culture that accepts failure. It will enable you to better anticipate risks, embrace them more confidently, and increase your velocity.

Instilling this mindset is a cultural shift that an organization must implement at all hierarchical levels.

Psychological safety

Amy C. EDMONDSON, Professor of Leadership and Management at Harvard Business School, says: "Psychological safety is the shared belief that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes."

An organization's culture is foundational to its potential. A positive culture promotes team collaboration and communication, essential for the successful implementation of a DevOps initiative. This idea isn't new and was theorized in 2004 by sociologist Ron WESTRUM in his article "A Typology of Organizational Cultures".

By ensuring your employees' psychological safety, you encourage shared responsibility for successes and failures. Outcomes are shared rather than attributed to specific teams or individuals.

In a work environment that overlooks psychological safety, collaborators:

  • Keep their concerns or ideas to themselves ;
  • Fear appearing incompetent or ignorant ;
  • Are scared of being ridiculed.

As mentioned by Kiran VARMA in her course on SRE culture at Google, research identified two primary drivers fueling the human tendency to blame others: hindsight bias and "discomfort discharge."

Hindsight bias is when an individual overestimates their ability to have foreseen an event. People often struggle to grasp that something only seemed obvious after it happened. In a professional setting, this might lead to blaming a person responsible for a task, claiming they should've "seen the obvious thing" coming.

The concept of "discomfort discharge" refers to the neurobiological phenomenon where we blame others to alleviate mental pain. Sociologist Brené BROWN suggests that humans do this involuntarily but blaming hampers our ability to learn from mistakes.

In organizations uncomfortable with failure, team members might hide information or not report incidents for fear of punishment. Similarly, out of fear of appearing foolish, they might hesitate to ask questions that could identify a problem's root cause. However, mistakes only become opportunities for improvement if their true causes are identified. This can only happen in a psychologically safe work environment.

A psychologically safe organization believes:

  • Failure should be seen as an opportunity for growth ;
  • New ideas are welcomed and should be discussed ;
  • Failure results from lacking methods and procedures, not from individual fault.

This mindset fosters trust. Organizations should shift from asking "Who did this?" to "What happened?" Focus on methods and procedures rather than on individuals. It's best to assume employees act in good faith, making decisions based on the most relevant information available. Investigating the source of misinformation is more beneficial for the company than blaming someone.

Innovation requires risk-taking. No product or strategy comes with a 100% guarantee of success. So, if everyone's afraid of taking risks, nobody will, and your organization will stagnate.

There are numerous other decision-making models and project management methods available. Don't hesitate to explore and adopt them.

Responsibilities in a DevOps model

When discovering the plethora of experimental technologies to implement for operating in a DevOps mode, you might be intimidated by the idea of becoming responsible for this vast, new system.

This chapter aims to juxtapose a traditional responsibility model with the DevOps model. It also offers a tool to guide decision-makers, the DACI. It's up to you to pick and choose the methodologies from each that seem most appropriate for your organization. However, be bold enough to avoid reverting to a traditional model, which would only give you the illusion of a transformation.

The RACI model

One of the responsibility-sharing models is "RACI", which stands for Responsible, Accountable (the Owner), Consulted, and Informed. It ensures all stakeholders are aware of their roles and responsibilities in a project.

Let's take the creation of a new website as an example. For each activity, an executor, an owner, consulted parties, and informed parties are designated.

  • "R" for the Executor, which is the person who does the work to complete a deliverable
  • "A" for the Owner, which is the person who delegates the work and inspects the finished tasks
  • "C" for the Consulted, which is the person who contributes to a deliverable based on their expertise or responsibilities
  • "I" for the Informed, which is the person who needs to be kept in the loop about the project's progress

An extension of RACI is RACI-V-S, which includes a validator (who is an authority) and a signer (who is the person in charge of the official approval of the deliverable).

The RACI model is built on a clear separation of roles and responsibilities. This can be counterproductive in a DevOps initiative, which seeks to foster collaboration among teams. Moreover, RACI doesn't consider the dynamic and ever-changing nature of project development.

In a DevOps model, everyone can contribute to a project. Everyone is accountable at the same level based on their expertise. A significant change—like launching a new software version—requires the approval of each team (from the design team to marketing, engineering, and security).

Of course, you wouldn't ask the I-T security team for their opinion on changing a button's color... But the point is, there isn't just one owner or executor: everyone is responsible and can validate or invalidate a change based on current rules and constraints.

RACI, as used in large organizations, plays a referee role. A specific team is designated for specific tasks. However, DevOps is a collective set of practices that are interconnected. If you aim to evolve multiple organizational teams, RACI can have a discouraging, perverse effect. It might allow Team A to ignore Team B's issues under the pretense that it's not their responsibility.

To mitigate this, while still reassuring your managers, RACI's deliverables can initially be broadened. For instance, instead of referring to "a mechanism that collects application logs," you might refer to "a set of tools for observability." Specific sub-objectives can still be defined, but responsibilities should remain broad. If some don't play along, RACI can force them to do the work, but they might not be the right members for your project.

As a leader of an initiative involving new technologies and practices, your superiors will ask you to take on many of the roles in the table above. Take on this responsibility to reassure your authorities. There's no need to fear since you know the methodology you want to implement is collective and iterative.

The DevOps model

Most of the time, it's not advisable to immediately abandon a RACI-like model. It's a matter of evolving culture and implementing tools. But that's the goal: a cultural shift in your organization so that authorities overcome their fears.

However, by assuming shared responsibilities without placing blame, you focus on improving the service to achieve the desired end result (e.g., a more stable infrastructure) rather than finding a culprit. With this principle in mind, let's analyze a common concern.

As you've realized, DevOps encourages not blaming stakeholders. It might seem logical to argue that if no one is personally accountable, teams might be less diligent in their daily duties. How can we imagine a production head deleting the entire client database without consequences? Leaders must realize their actions have repercussions. DevOps addresses this challenge in two ways:

  1. If your procedures are sound, the engineer should not have been able to execute that command. If a mistake was made, it indicates that the rules governing the security of your production infrastructure weren't robust enough (for example, due to manual access to production machines, lack of command validation, no backups, poorly described procedures or communication gaps...).
  2. You hired an employee because they know their job (you did interview them, after all). If you're worried they won't take responsibility, speak with them or consider letting them go and revising your hiring policy. Trust your experts. If you have doubts, ask them to strengthen control rules and reassure you with typical scenarios.

That's why you should start with the resources you have and be willing to start small in your transformation journey. Your company should gradually implement procedures based on available human and financial resources. Once these techniques have been tested, iterate on a larger scale.

The DACI model

DACI is not a means to define responsibilities for a project. Instead, it's a document and a method, used occasionally, for making a group decision when faced with multiple options. It's typically used to ensure a decision is made by the end of a meeting. Consider it a tool that supports the approach of shared responsibilities in DevOps mode.

A meeting based on the DACI model involves defining four roles:

  1. The driver. It is the person guiding the committee towards a decision. They ensure that everyone is well-informed about the meeting's progress and answer questions. While they ensure a decision is made, they don't necessarily influence the process. Often a program manager.
  2. The approver. It is the individual with the final say during decision approval. Usually, a manager or a company executive with decision-making power. Consider inviting a stakeholder or a customer for whom the project was designed to take on this role.
  3. The contributors. They are those with the knowledge needed to shed light on the decision-making process. These are the experts and professionals.
  4. The informed stakeholders. They are the individuals affected by the decision without being directly involved in making it. Those who might need to revise their work following the decision, like legal, sales, or logistics teams. Limit their number: perhaps just sending them a brief email at the end of the meeting with actions to take is sufficient.

Next, the goal is to collaboratively list the considered options in a few words. For each, indicate its cost, time required, and other pros or cons.

In the last 5 minutes of the meeting, set a date for when the decision should be made (if not immediately). If any of these preliminary options need further details, assign someone to flesh them out.

Once the options are consolidated, the approvers make the decision, and tasks are delegated accordingly.

Once your decision on which option to choose is made, it's time to communicate it so everyone is on the same page. Send the document to those who need to be informed and then archive it.

Once archived, it will help new stakeholders understand why specific decisions were made. By conducting this collective reflection, individual cognitive biases are also avoided.
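
The DACI document described above can be kept as a small structured record and rendered to Markdown for archiving. The following is only a minimal sketch: the class names, field names, and table layout are hypothetical choices made for illustration, not part of the DACI method itself.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    """One option under discussion (all fields are free text)."""
    name: str
    cost: str
    time_required: str
    pros: list = field(default_factory=list)
    cons: list = field(default_factory=list)


@dataclass
class DaciDecision:
    """Hypothetical container for the four DACI roles and the options."""
    topic: str
    driver: str
    approver: str
    contributors: list
    informed: list
    options: list = field(default_factory=list)
    decision: str = "Pending"

    def to_markdown(self) -> str:
        """Render the decision record as a Markdown document."""
        lines = [
            f"# Decision: {self.topic}",
            "",
            f"- Driver: {self.driver}",
            f"- Approver: {self.approver}",
            f"- Contributors: {', '.join(self.contributors)}",
            f"- Informed: {', '.join(self.informed)}",
            "",
            "| Option | Cost | Time | Pros | Cons |",
            "| --- | --- | --- | --- | --- |",
        ]
        for o in self.options:
            lines.append(
                f"| {o.name} | {o.cost} | {o.time_required} "
                f"| {'; '.join(o.pros)} | {'; '.join(o.cons)} |"
            )
        lines += ["", f"Decision: {self.decision}"]
        return "\n".join(lines) + "\n"
```

Once the approver has decided, set the `decision` field, render the document, and send it to the informed stakeholders before archiving it.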

Investigating incidents

You receive an alert message from customer support on Slack. They inform you that your file-sharing platform is down.

This marks the beginning of the problem investigation. The most common technique is root cause analysis, inspired by quality control techniques in manufacturing. It helps understand the factors behind the incident and determine its source. The goal then is to establish procedures to prevent the incident from happening again.

In RCA, your first goal is to restore the services. This is followed by a solution to permanently resolve the issue. Lastly, preventative action is implemented to prevent future recurrence.

For instance, in the case of a faulty coffee maker:

  • the immediate action is to replace the broken part ;
  • the permanent solution is to redesign the coffee maker, accounting for manufacturing disparities ;
  • the preventive (or "systemic") action is to change the design process, integrating a study of manufacturing disparities among suppliers.

To make upper management understand the value of this method, present it as an investment in saving time and money. RCA reduces the risks of expensive software redesign and is time-efficient. Focus your RCA efforts on the incidents that cost your organization the most. Establishing procedures and preserving knowledge on incident resolution also improves team communication. Instead of merely applying patches, the idea is to find a lasting solution.

Here are the 5 steps of Root Cause Analysis:

  • Identify the problem ;
  • Contain and analyze the issue ;
  • Determine the cause of the problem ;
  • Resolve the issue permanently ;
  • Validate the fix and ensure the incident does not happen again.

  1. Identify the problem.

    Analyze the situation to ensure that it's indeed an incident and not just a harmless alert. The company should set a threshold to classify an event as an incident, such as an anomaly lasting more than 1 minute. If the event threatens the stability of your resilience indicators, treat it as an incident.

    When in doubt, the best practice is to report incidents early and often. It's better to report an incident, quickly find a fix, and then close it, rather than allowing it to persist and worsen. If a major incident arises, you'll likely need to handle it as a team. You can distinguish a major incident from a minor one if you answer "yes" to any of these questions:

    • Do you need to call in a second team to solve the problem?
    • Is the outage visible to customers?
    • Does the problem still affect the system even after an hour of intense investigation?

    As soon as the incident starts, begin taking notes on what you observe and the actions you plan to undertake. This will be helpful for your postmortem. Next, classify the problem using the "5W2H" method (5 Ws and 2 Hs: who, what, where, when, why, how, and how much).

    On mature infrastructures, problem identification is made easier by monitoring systems. They notify about detected anomalies. For example, these anomalies can be detected by tools like Statping, which trigger an alert when a service goes down. But they can also be detected by machine learning mechanisms, revealing unusual trends. The advantage is that alerts are not just triggered when a simple threshold is breached, but when something unusual occurs.

    Tools like OpenRCA, OpenStack Vitrage, and Datadog can help identify the root cause of a problem by highlighting anomalies within your infrastructure.

    At this stage, you only recognize the symptoms of the problem, not its severity.

  2. Contain and analyze the problem.

    Always start by containing the problem. Restore service as soon as possible to prevent further escalation, even if the solution is temporary or not deemed "clean."

    The trust your users place in your service is linked to your responsiveness during incidents. Users don't expect 100% uptime but do expect clear communication during outages. Transparency is vital.

    A service status page is an excellent way to inform your users about an incident's progress. You can also notify about upcoming maintenance.

    With each update on the incident's status, communicate:

    • The current situation and the measured impact ;
    • What's known about the problem and what has changed ;
    • Ongoing affected services.

    To analyze the issue in more detail and locate the source of the malfunction, use your observability tools.

    At this stage, you must find an immediate action. For instance, a manufacturer could decide to re-inspect ready-to-ship parts, rework them, or issue a recall. For software, the idea is to find a way to restore service, often by pushing a quick fix (known as a "hotfix").

    Your SRE team must ensure deployed fixes work. They can do this by running pilot tests prepared in advance.

  3. Determine the cause of the problem.

    With the incident's impact now managed, it's time to investigate the root cause.

    As a team, list likely factors contributing to the problem. Structure your hypotheses using a cause and effect diagram.

    Choose by majority vote the causes that seem most likely to reoccur. According to Pareto's principle, 80% of effects come from 20% of the causes. You now have a focus for your investigation.

    Use the "5 Whys" method. The idea is to identify cascading symptoms until the root cause of a problem is found. "5" is an arbitrary number; it can be less or more, depending on the situation.

  4. Address the problem in a sustainable manner.

    After identifying the root cause of the issue, it's time to design a solution to address it.

    Validate that the solution works by performing a proof of concept before deploying it to production.

    Define and outline the steps to rectify the problem, assign responsibility, and set a timeline for completion.

    Determine how the solution's effectiveness will be measured, such as through telephone surveys, online polls, automated feedback, manual measurements, etc. Set a timeframe for monitoring the action, then implement the fix.

  5. Validate the fix and ensure the incident doesn't recur.

    Using the measures set in step 4, ensure that the actions taken have had the desired effect.

    The incident is now resolved and the cause understood. Now is the time to inform your users on the status page: "Root cause analysis complete, incident resolved, incident documented."

    All that remains is to draft your post-mortem, noting what measures you've put in place to prevent a recurrence. Use the notes from previous steps to structure this document.

    Publish and share this document internally or publicly. This keeps customers informed and satisfied, and recognizes the hard work of the operations teams.
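
The triage rules from step 1 (an incident threshold, then three questions to tell a major incident from a minor one) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function and field names are hypothetical, and the 60-second threshold is only the example value given in step 1. It does not cover the resilience-indicator rule, which depends on your own indicators.

```python
from dataclasses import dataclass

# Example threshold from step 1: an anomaly lasting more than 1 minute.
# Hypothetical default; every organization should choose its own value.
INCIDENT_THRESHOLD_SECONDS = 60


@dataclass
class Event:
    anomaly_duration_seconds: float
    needs_second_team: bool = False     # Do you need to call in a second team?
    visible_to_customers: bool = False  # Is the outage visible to customers?
    investigation_hours: float = 0.0    # Time spent investigating, still unresolved


def classify(event: Event) -> str:
    """Return 'alert', 'minor incident' or 'major incident'."""
    if event.anomaly_duration_seconds <= INCIDENT_THRESHOLD_SECONDS:
        return "alert"
    # A single "yes" to any of the three questions makes the incident major.
    is_major = (
        event.needs_second_team
        or event.visible_to_customers
        or event.investigation_hours >= 1
    )
    return "major incident" if is_major else "minor incident"
```

For example, `classify(Event(120, visible_to_customers=True))` yields a major incident, while a 30-second anomaly stays a simple alert.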

Just as airplane pilots train for emergency scenarios, your SRE teams should practice to save time when an incident occurs.

Your incident response procedure should be easily accessible and written for all audiences. It should provide guidance not just for SRE teams but also for non-SREs who might encounter an incident. Regardless of your organization's size, you need an incident response procedure.

Postmortems

A postmortem is an incident investigation technique. Its purpose is to determine corrective actions to prevent recurrences. Your SRE team should draft this document based on information gathered during the root cause analysis.

In the military and aerospace fields, the After Action Review (abbreviated AAR) or "debrief" is systematically practiced following an event to learn from it. The postmortem, on the other hand, is only initiated when an incident or failure has occurred. That is what distinguishes them.

It's recommended to store these documents in a git project to track changes over time. My personal recommendation is to draft them in Markdown format.

Historically, the Latin term "post mortem" means "after death" and refers to investigations carried out by law enforcement to understand how a crime occurred. They analyze evidence, identify the cause of death (e.g., through an autopsy), then attempt to apprehend the perpetrator or amend the law to prevent future occurrences. The concept is similar for I-T incidents.

Postmortem structure

The structure recommended by Google serves as a good example. The document consists of two parts: one that describes what happened and another that details the steps to be taken following the incident.

Create a new Markdown document and prefix its name by the date, a short title and the duration of the event.

For the first part, set the following 14 headings:

  • Title
  • Alert Date
  • Incident Start Date
  • Incident End Date
  • Incident Duration
  • Authors
  • Status
  • Summary
  • Impact
  • Detection
  • Problem Sources
  • Triggering Event
  • Resolution
  • Lessons Learned

The second part describes what your team might do differently next time. As a conclusion to your postmortem, it lists actions to prevent recurrence of the issues. Focus not only on bug fixes but also include necessary procedure changes to mitigate similar future incidents.

Define a table with four columns and as many rows as desired:

  • The person responsible for the action ;
  • The actions to be taken ;
  • The priority of this action ;
  • The issue or associated ticket.
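
To avoid retyping this structure for every incident, the skeleton can be generated by a short script. The sketch below is only an illustration: the file-naming pattern, the `##` heading level, and the short column labels are assumptions, since the text only asks for a name prefixed by the date, a short title, and the duration.

```python
from datetime import date

# The 14 headings of the first part, in the order listed above.
HEADINGS = [
    "Title", "Alert Date", "Incident Start Date", "Incident End Date",
    "Incident Duration", "Authors", "Status", "Summary", "Impact",
    "Detection", "Problem Sources", "Triggering Event", "Resolution",
    "Lessons Learned",
]

# The four columns of the action table (labels shortened for illustration).
ACTION_COLUMNS = ["Owner", "Action", "Priority", "Ticket"]


def postmortem_filename(day: date, title: str, duration_minutes: int) -> str:
    """Build a file name from the date, a short title and the duration.

    The exact pattern is a hypothetical convention.
    """
    slug = "-".join(title.lower().split())
    return f"{day.isoformat()}-{slug}-{duration_minutes}min.md"


def postmortem_skeleton() -> str:
    """Return an empty postmortem as a Markdown string."""
    lines = []
    for heading in HEADINGS:
        lines += [f"## {heading}", "", "TODO", ""]
    lines += ["## Action items", ""]
    lines.append("| " + " | ".join(ACTION_COLUMNS) + " |")
    lines.append("|" + " --- |" * len(ACTION_COLUMNS))
    lines.append("|" + "  |" * len(ACTION_COLUMNS))  # one empty row to fill in
    return "\n".join(lines) + "\n"
```

Writing the result of `postmortem_skeleton()` to the path returned by `postmortem_filename(...)` gives a file that is ready to be committed to the git project holding your postmortems.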

As your team or projects grow, a more formal structure for your postmortems might be needed. The postmortem model proposed by Atlassian is a good example.

For minor incidents or daily bugs, use a Q&A service like Scoold or question2answer. This can address technical problems (e.g., "How to resolve a dependency conflict") or more general queries (e.g., question: "I can't connect to service X" with answer "Did you try registering at this URL?").

With such software, your SREs will have a list of problems that are easily solvable in the future. As a private alternative to Stack Overflow, it also allows your developers to ask their colleagues questions confidentially.

Postmortem for retention and attraction

As discussed in the chapter "Investigating incidents", publicly sharing one's work allows it to be recognized by the community. This practice also enhances retention by enabling collaborators to build their reputation.

Video creator Bastien MARÉCAUX (known as "Basti UI") introduced the concept of "teletra-live", a blend of "telework" and "live streaming". He broadcasts live work sessions on the Twitch platform, with permission from his clients. This highlights the significance of publicizing one's work. This is a trend that might gain traction in the future.

Beyond personal perception, showcasing one's work to an informed audience motivates the individual behind the published name to deliver quality work. Initially, it might just involve sharing internally with company colleagues. A mere notification about the postmortem in the company's messaging platform could suffice.

Transparency is also an excellent way to attract talent. In the industry, companies brave enough to document and publish their incidents are deemed trustworthy. Publishing without hesitation signals robust procedures and meticulous work, which inspires confidence.

Many companies like Spotify, LinkedIn, Meta, Airbnb, and Capgemini share articles on their respective blogs. Topics range from postmortems to internal best practices or challenges they have overcome.

For instance, Cloudflare is renowned for its high-quality postmortems regularly published on its blog. Newsletters such as SRE Weekly also list public incidents every week.

A public postmortem is typically less detailed than an internal one. The former often summarizes the latter, excluding sensitive parts.

Organizing Incident Response

For significant incidents, effective organization is crucial. A productive technique is the 3 Commanders system (abbreviated 3Cs). Initially theorized in 1968 by firefighters as the Incident Command System, it was later adapted for I-T incidents. Today, it's employed by Google's SRE teams.

When a major incident occurs, the immediate challenges are task coordination, incident resolution, and communication—all simultaneously. Imagine driving while navigating using a map.

Ouch! The server handling employee authentication for the intranet just crashed.

To manage, designate 3 individuals for the following roles:

  1. Incident Commander (IC). They coordinate tasks and delegate roles, including designating the OL and CL. Initially, the IC is the person who discovers the incident. If someone more experienced arrives, the role can be passed on, allowing the initial IC to return to work or become the new CL. If needed, the IC calls for backup and instructs the rest of the team on continuity.
  2. Communications Lead (CL). They manage the status page and inform employees, customers, and management about the incident's progress. They also create a dedicated internal communication channel for the incident and invite stakeholders, act as the interface between the incident management team and external parties, and shield the OL from external interruptions.
  3. Operations Lead (OL). They resolve the issue and draft notes for the postmortem, refer to the IC when additional help is needed, and update the CL on the incident's progress.

In smaller teams, the IC often assumes all three roles. However, preparation for delegating these tasks during severe incidents is essential.

Defining and organizing roles should be a part of your incident response procedure. Ensure clarity in your knowledge base so teams know how to act. Regularly train your teams for potential incidents. Setting a low alert threshold can help expose teams to your incident response procedures more frequently.

The importance of communication

Communication is vital, whether with customers or internal teams. During a significant incident, Datadog emphasized in its postmortem the need for early communication about outages to both customers and internal teams. Here are some insights:

For incidents not fully identified and impacting clients differently (e.g., based on location or product used), the rule is to report the "worst symptoms". Instead of potentially frustrating clients or spending excessive time communicating for each region or product, decide to communicate promptly about the most affected area or product. For instance, if the "EU" region shows more severe symptoms than the "US" data centers, report the EU's issues, even if the exact scope of the outage is unclear. Clearly state that the "US" might be similarly affected as the "EU".

During major incidents, many customers might open tickets. Assigned "support" engineers then address these distressed clients. Without clear internal communication for support engineers, both your teams and customers can grow impatient. Ensure a dedicated internal communication channel (e.g., Slack channel, Google Docs) so support engineers can provide consistent responses to clients.

More broadly, the company learned over time that updating clients every 30 minutes was optimal. This frequency allows technical teams to focus on problem resolution without being interrupted too often for updates.

Anticipating incidents

In this chapter, we will explore two techniques to proactively anticipate potential incidents: the premortem and the cause-and-effect analysis.

  • The premortem answers the question: "What elements could cause this architecture/approach to fail?" ;
  • The cause-and-effect analysis answers the question: "What incidents could arise from this architecture/approach?".

If multiple approaches are being considered, start with a DACI. Once a decision has been made, the team should have an intuition about which approach to pursue: this is the moment to test it with a premortem.

The cause-and-effect analysis (or FMEA) focuses on technical considerations taking place after the decision regarding which approach to use has been made.

Premortems

Before the start of a project, project managers and engineers should gather to list out potential reasons for its failure.

The premortem is also known as the "study of nonconforming cases". In his writing "The Art of War", the military strategist Sun Tzu already advocated, in the fifth century before Christ, for planning as many war scenarios as possible (or "nonconforming cases") before a battle.

The premortem is a project management methodology that involves imagining that the project has failed even before it has started. The result is a document listing the incidents the team needs to prepare for to ensure project success.

For example: "Our team currently manages its infrastructure using traditional methods. We want to establish a plan to work in a DevOps mode."

  1. Organize a meeting with stakeholders. Ask them to imagine that a year has passed and the transformation plan has failed.
  2. Create a collaborative document (e.g., with Google Docs) with the following headings. Include potential Failure Factors, Solutions, Most Dangerous Factors and an Action Plan.
  3. Potential Failure Factors. Include reasons that might lead to project failure (e.g., lack of support from management, difficulty integrating DevOps practices with existing processes and systems, insufficient team training or expertise in Cloud technologies and resistance to change from certain members...).
  4. Solutions. For each failure factor, brainstorm solutions that could be implemented now to reduce the project's risk of failure (e.g., conduct awareness presentations, start with a proof of concept for a specific use case, prepare a training plan and identify early adopters...).
  5. Most Dangerous Factors. List out the riskiest factors that the team can still influence.
  6. Action Plan. List solutions to the most dangerous factors and turn them into an action plan. Each solution becomes a task assigned to a member with a deadline.

Here's a more technical example: "Our team deploys its software using Docker Compose. They now want to deploy using Kubernetes."

  1. Organize a meeting with stakeholders. Ask them to imagine that, a few months from now, Kubernetes has required a lot of effort without offering significant benefits.
  2. Set up the collaborative document.
  3. Note down the potential failure factors (e.g., team's insufficient knowledge or expertise in Kubernetes, online documentation not sufficient for our use cases, complexity of integration into our development environment, security vulnerabilities due to maintenance complexity or increased HR costs during the transition...).
  4. List the solutions (e.g., prepare a training plan, invest in cloud-specialized consultants, set up an automatic cluster update service or hire an intern to create the initial cluster version...).
  5. Identify the most dangerous factors (e.g., team's insufficient knowledge of Kubernetes or security vulnerabilities due to maintenance complexity).
  6. Create your action plan (e.g., prepare a training plan or contract with Company X for specialized cloud support).

Cause and effect analysis

While the RCA is a "reactive" method employed after an issue has occurred, the failure modes and effects analysis (abbreviated FMEA) is a "proactive" method to attempt to anticipate failures before they occur. Introduced by the U.S. military in 1949 and later adopted by the automotive industry, it lists out product or software error states, prioritized by risk. Based on the potential consequences of a risk, design teams prioritize the development of mechanisms to prevent its occurrence.

In FMEA, one can visually represent a cause that might lead to an error situation. A chain of causes and effects can be established to better visualize the consequences of a problem.

You can do the same with malfunction scenarios of software or infrastructure. Set up a table with 7 columns. For each hypothesized incident, the author should determine:

  • Error Situation (e.g., "The software update failed on one of the servers")
  • The effects (e.g., "Client requests reaching this server will fail. This represents 20% of our requests due to our load-balancing architecture")
  • Probability. Rate from 1 to 10 the likelihood of the event happening.
  • Severity. Rate from 1 to 10 the severity of the problem should the event occur.
  • Detection Difficulty. Rate from 1 to 10 the likelihood that the event won't be detected.
  • Risk Level. Product of probability, severity, and detection difficulty.
  • Countermeasures. Describe how to respond should the event happen (e.g., "Configure the load-balancer to exclude the server where the update failed. Roll back the software version. Restore the load-balancer to its initial configuration")

From this table, prioritize tasks for your teams to work on anticipating the most critical situations.
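
The risk level and the resulting prioritization can be computed mechanically. The sketch below assumes the three ratings are integers from 1 to 10, as in the table above. The class name, field names, and the sample scores are invented for illustration. Only the formula (the product of the three ratings) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    situation: str
    probability: int           # 1-10: likelihood of the event happening
    severity: int              # 1-10: severity should the event occur
    detection_difficulty: int  # 1-10: likelihood the event goes undetected

    def __post_init__(self):
        # Reject ratings outside the 1-10 scale used in the table.
        for name in ("probability", "severity", "detection_difficulty"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def risk_level(self) -> int:
        """Product of probability, severity and detection difficulty."""
        return self.probability * self.severity * self.detection_difficulty


def prioritize(modes):
    """Return the failure modes sorted from highest to lowest risk level."""
    return sorted(modes, key=lambda m: m.risk_level, reverse=True)


# Invented sample scores, for illustration only.
modes = [
    FailureMode("Software update failed on one server", 4, 6, 2),
    FailureMode("Disk full on the database server", 3, 8, 5),
    FailureMode("Backup export failed", 2, 9, 8),
]

for mode in prioritize(modes):
    print(mode.risk_level, mode.situation)
```

With these sample scores, the failed backup export comes first: it is unlikely, but severe and hard to detect, which is exactly the kind of situation the detection column is meant to surface.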

For infrastructure maintenance, a best practice is to create "incident sheets". Each includes a breakdown scenario, coupled with potential solutions. Typical breakdowns include running out of disk space, a poorly migrated database, or a failed backup export. Catalog them in your knowledge base (e.g., GitLab, Confluence).

Structuring incident responses

If you are a small organization, start by formalizing your procedures to conduct an RCA and write postmortems. Then, gradually start your projects with premortems and conduct FMEAs periodically.

The decision to invest time in conducting premortems, FMEAs, or postmortems is governed by your priorities in terms of resilience. Research shows that service downtime can be costly for large organizations, averaging $500,000 to $1,000,000 per hour of unavailability.

Reducing the cost of change

Don't disrupt

DevOps is often portrayed as a disruptive organizational approach, meaning a paradigm shift in technologies and practices. To avoid intimidating the stakeholders of your transformation, instead present DevOps as an evolution of traditional technologies.

For instance, Windows 10 (released in 2015) is merely an evolution of Windows NT 3.1 (released in 1993) and still contains code from the early days of the Windows NT architecture (designed in 1988).

Here are some parallels concerning the Cloud:

  • A container is just a more flexible tiny VM. It is managed with different commands, the nomenclature is different, but the concepts remain the same: an OS from which the container is created, a configurable network, and the ability to add storage ;
  • An orchestrator is just a hypervisor managed with different commands. But its components remain the same: configurable network policies between containers/VMs, storage management with VMware's datastores in place of Kubernetes' PersistentVolumes, or VMware's NSX Controller in place of Kubernetes' Ingress Controller ;
  • However, there are specific evolutions that one must simply accept. For instance, the use of good practices mentioned in the chapters "A Foundation for Your Resilience" and "12-Factor methodology": favoring stateless services, using only micro-services, exposing one's activity logs differently... ;
  • Micro-services are merely a division of traditional software into multiple independent blocks. Each block can be scaled according to user load.

Traditional VMs also have their place in a Cloud DevOps infrastructure; they can be part of it.

Along with these technological evolutions come methodologies to manage technical debt, accelerate deployments, and maintain a high level of resilience: a software forge, gitops, continuous integration, continuous deployment, postmortems... That's DevOps.

By implementing the methodologies covered in this book and using standardized administration technologies (e.g., Kubernetes), you will ultimately reduce administrative costs.

Avoiding design mistakes

As discussed in the chapter "Staying close to business needs", it's common to not meet the initially expressed need using traditional methods. Setting the requirement at a fixed point is not a reliable way to deliver the expected product. Needs continually evolve, and clients often can't articulate exactly what they need.

Agile methodology aims to reduce this risk by offering several short delivery cycles (sprints). After each cycle, the client provides feedback. This loop continues until the project suits the client or the contract ends. DevOps provides the tools for a company to streamline these interactions. In the most efficient companies, sprints are merely a contractual detail to discuss progress: the software is already in production and ready to use.

This methodology also avoids being trapped by clients who are too specific in their requests. Some are convinced of how software should be designed to best meet their needs. However, their suggestion might not be the most suitable option. Throughout your deliveries, the client will always have feedback or details they forgot to communicate. These details, big or small, accumulate over time and can lead to excessive delays if they are only discovered at the end of the project.

If software is meant to deeply change its recipient's habits, delivering it early is necessary for them to gradually adapt to the imposed changes. They can, for instance, modify their internal procedures, hire the right profiles, and prepare their communication strategy. This will prevent frustrations and ensure a deliverable that closely meets business needs.

Avoiding this approach, especially with a very demanding client, can in extreme cases lead to projects that drag on for years. Or even worse, to abandoned projects. This will inevitably cause mutual frustrations among the team leader, the development team, and the client.

Avoiding development errors

Most mistakes stem from human error. That's why automation is a fundamental component of an organization working in DevOps mode. Continuous integration and deployment chains are particularly effective in streamlining the software delivery cycle.

If you currently feel friction in your production cycle, you likely need to invest time in automation. In mature companies, teams dedicated to developing automation tools for development teams exist. Their mission is to listen to developers to enhance their development experience. For example, they might develop internal tools that analyze added code to suggest readability or security improvements. At Google, an internal platform handles this kind of suggestion: if the code doesn't conform, a click is enough to reformat it. If a library is considered vulnerable, an alternative is suggested.

These tools generally speed up the development process and expedite code reviews to get the software into production as quickly as possible. These methods are especially effective when you regularly onboard new staff unfamiliar with your development practices. Inexperienced newcomers, without explicit and restrictive rules (such as CI/CD pipelines), can quickly impact the quality of your codebase. A slight oversight, and a bug can quickly emerge.

Deployment techniques like blue/green also reduce the risk of software regressions.

Design Thinking

Companies with a strong SRE/DevOps culture encourage innovations put forward by their team members. Thanks to techniques discussed in the previous chapter (CI/CD, blue/green, premortems, FMEA), it's fortunately possible to control the risks brought by these innovations.

To keep your employees motivated to achieve great things, it's crucial to avoid limiting their creativity or ideas. That's why design thinking and the creation of prototypes are key techniques for an efficient organization.

Design thinking is an innovation technique that merges creativity and method to try to solve complex problems. It comprises 5 phases:

  1. Empathize. Start by meeting the end-user and immerse yourself in their environment to understand their challenges. This helps to set aside any preconceptions and gain an authentic perspective ;
  2. Define the Problem. Clearly outline the problem you're trying to solve. Express it from the user's viewpoint, rather than describing what you'd like to achieve ;
  3. Ideate. Now that the problem is identified, you can start brainstorming solutions ;
  4. Prototype. Bring your idea to life with a prototype. Spot the weak points and find solutions, or move to another idea if the one you're testing isn't viable ;
  5. Test. Evaluate your prototype in an environment that mirrors your target user's setting.

In summary, you need to put yourself in the user's shoes, and techniques like continuous deployment help streamline this process. When faced with reality, innovation isn't stifled by the organization.

If it fails, the organization learns more about its customer and environment. If it succeeds, it's a win for everyone: the innovation team, the organization, and the customer.

This prototype-driven culture is vital because a company that doesn't prototype launches fewer ideas, thus experiences fewer successes and takes longer to fail. Conversely, a company accustomed to testing its prototypes will fail faster and consequently achieve more successes.

You don't have to build the software from scratch before presenting it to the customer. You can create a mockup on Figma or Penpot, use a low-code/no-code solution, or find someone to play the customer role.

Continuous training

A good culture is nurtured by knowledge of cutting-edge techniques. The technical skills of your teams are the foundation of your organization and bolster your reputation as a resilient structure.

Continuous training is a straightforward way to prevent your organization from losing millions each year. Indeed, if your staff stays updated with the latest technologies, they're less likely to be deceived by third parties. These third parties often promise "the ideal solution" through impressive and ambitious presentations, which, more often than not, hide an underdeveloped or wholly deficient service. By staying updated, your team will make better decisions for your budget and the organization's future.

However, keeping pace isn't easy, especially considering how fast technology evolves. All the more reason to implement good training practices right from your employees' onboarding.

For instance, at Google, interns start with a full week dedicated to training. They get briefed on security best practices, administrative tasks they need to complete, and are introduced to internal technical tools. Later, like all employees, they must periodically complete awareness modules on a dedicated platform with written or video courses.

The United States Air Force has, since 2019, invested heavily in self-learning solutions. In a podcast, its former Chief Software Officer, Nicolas CHAILLAN, explains how he deployed this system for over 100,000 developers. A web platform was launched with educational content specially selected or created by his teams. He added that an hour a day was allotted to employees to "catch up and stay updated on the latest technologies."

Nicolas CHAILLAN says : "Training is an investment for the company and for themselves. People who don't want to learn by themselves don't have much chance of succeeding in I-T. Anyway, the industry moves so fast they don't have a choice."

Following in the USAF's footsteps, here is one successful approach I witnessed in a previous position: we managed to get one day of remote work per week. It wasn't easy to get approval from our managers, but they finally granted it after understanding its benefits. This day was dedicated to our continuous training as AI, data, and DevOps experts. We were well equipped, and our progress was measurable: we had almost unlimited access to a Cloud service and to an e-learning platform. The latter allowed our management to see statistics on our training time and completed courses. The cost of these services was negligible compared to the knowledge they imparted.

If you already have technical teams, allow them to experiment and practice. From what I've observed, the most effective approach for an organization is investing time in training its staff. For instance, provide them access to machines or cloud hosting services to experiment with the latest innovations from the private sector or open-source projects. Your teams will be thrilled to have access to these services, while the management will be assured of receiving the best advice from updated employees.

It might be tempting to think that training staff in innovative technology - making them attractive to competitors - might encourage them to switch companies. Firstly, leaving just because of acquiring a new skill indicates limited prospects within their current company, reflecting already demotivated, thus less productive, personnel. Secondly, research suggests that staff who train in their free time tend to look for other jobs more often. The opposite is true when the company provides the training.

In any case, present your transformation as a career growth opportunity. And be honest with those who need to upskill: yes, it will require personal effort and time. But developing these new skills is worth it.

Leveraging automation

In increasingly complex information systems, it is essential to automate recurring tasks. Humans are the primary source of errors within an information system. Any seasoned engineer will confirm this. That's why Google teams try to minimize operator interactions when managing their systems.

Carla GEISSER, SRE at Google, says : "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."

If you want to make your I-T system an integral tool within your company, you must first automate repetitive and time-consuming actions: manual tasks (or toil).

This notion of toil describes all manual, repetitive, and automatable tasks. Essentially, these are all the intellectually uninteresting tasks that a robot would be far better suited to do than your brilliant engineers.

Google's SRE teams aim to keep operational work below 50% of the time for each SRE. At least 50% of each SRE's time should be devoted to engineering projects that will reduce the future amount of manual tasks or add functionalities to the infrastructure.

This process can begin with small things within your existing infrastructure. In this chapter, we'll categorize them based on organizational maturity levels. It's up to you to determine which level of automation is most suitable for your organization, based on the technological acculturation of your engineering teams and the time you want to allocate to implementing these technologies.

Keep this in mind: automation is the action that reduces technical debt. Make sure you give your teams enough time to work on it.

Infrastructure as Code (IaC)

This popular term is easy to understand: it encompasses practices and technologies that make your infrastructure configuration explicit, in the form of computer code.

Here are some configuration examples:

  • Setting the new time server for all your machines ;
  • Updating software in production ;
  • Updating the wallpaper of all your machines ;
  • Adding a new domain name.

Of course, when I mention "all your machines", IaC scripts allow you to specify which machines exactly, so changes are applied only to specific groups of machines.

This practice offers several benefits:

  • Documentation. IaC scripts are written in programming languages or using standardized configuration files. The engineer reviewing the project can directly see how the configuration works and how to use or modify it.
  • Reliability. IaC scripts follow algorithmic rules and can be launched by machines or humans, depending on the target environment. Code executed by a machine is far more reliable than the same steps performed by hand. It's also possible to implement security checks depending on who runs these scripts.
  • Replayability. Every IaC script should be idempotent, meaning running the same script one or more times should produce the same effect on the infrastructure. This makes it faster to develop and modify compared to traditional scripts.
  • Versioning. IaC scripts - like any other algorithm - can be versioned. This allows their changes to be tracked and peer-reviewed by all technical teams over time.
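To make the notion of idempotence more concrete, here is a minimal Python sketch. The function name `ensure_line`, the file `ntp.conf` and the server `time.example.org` are hypothetical and only serve the illustration; in practice, a tool such as Ansible would provide this behavior for you.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already compliant.
    Running it once or ten times leaves the file in the same final state.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # Desired state already reached: do nothing.
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


# Hypothetical example: declare the time server in a configuration file.
with tempfile.TemporaryDirectory() as workdir:
    config = Path(workdir) / "ntp.conf"
    first_run = ensure_line(config, "server time.example.org")
    second_run = ensure_line(config, "server time.example.org")
    final_content = config.read_text()
```

The first run changes the file, the second one detects that nothing needs to be done. A naive script that simply appended the line would instead duplicate it at every execution.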

Common technologies for these tasks include: Ansible, Terraform, Puppet, and SaltStack.

Each has its pros, cons, and community. Some complement each other. The key is to adopt a standardized format so your SRE teams can navigate it. A newcomer will greatly benefit from these practices, and your most seasoned engineers can incrementally improve these algorithms.

You can start by automating your infrastructures with basic scripts (e.g., Bash, PowerShell) and then move on to more advanced technologies like Ansible that will standardize your configurations.

For supervising and automating these admin tasks, advanced tools like Ansible AWX, Ansible Tower or Palantir Apollo might be worth considering, depending on your organization's maturity level.

Remember that maintaining infrastructure is complex, so keep it simple! Don't rush to adopt the latest technology just because it's "sexy": the more technologies and abstraction layers you add, the larger and more experienced your team needs to be to maintain and fix it.

Test-driven development

Test-driven development (or TDD for short) is a software development practice that dates back to the early 2000s. The objective is to control software erosion, which means preventing regressions and managing technical debt over time. Put simply: it's about avoiding bugs as contributions accumulate.

The idea is to write tests before developing the actual functionality. The TDD development cycle is as follows:

  1. Add a test. Introducing a new feature begins with writing a test that passes if and only if the feature's specifications are met. By the way, my personal recommendation is to write at least one passing test and one test that is supposed to fail. This helps understand the bounds of the use-case a test should cover.
  2. Run all software tests. Your new test should fail at this point, since the corresponding function hasn’t been written yet.
  3. Develop an initial version of the function: Whether crude or hard-coded, the goal is to have a function that meets the test as simply as possible. It will be refined in step 5.
  4. Run all software tests. Every test, including yours, should pass at this point.
  5. Refactor the code if needed. Use tests after each change to ensure the functionality remains intact. Now that you’re confident the initial code meets the requirements, you can enhance it by breaking up functions, removing duplicated code or improving naming conventions.
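The cycle above can be sketched with Python's standard `unittest` module. The function `parse_port` and its rules are hypothetical, chosen only to show one test that expects a success and one that expects a rejection.

```python
import unittest


# Steps 1-2: the tests are written first; they fail until `parse_port` exists.
class ParsePortTest(unittest.TestCase):
    def test_valid_port_is_accepted(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_out_of_range_port_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_port("70000")


# Steps 3 and 5: the simplest implementation that satisfies the tests,
# which can then be refactored safely.
def parse_port(text: str) -> int:
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


# Step 4: run the whole suite; every test should now pass.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests and the function would live in separate files, and the suite would be launched by the test runner rather than from the same script.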

This approach is common in tech companies, especially within giants like Google, Apple, Meta or Amazon. They rely on it to manage their technical debt, despite having thousands of developers contributing in parallel to their software daily. Elsewhere, however, most software is developed with few or no tests: it can be challenging to justify to non-technical superiors the time spent on test development instead of focusing on new features. While working with TDD might reduce short-term productivity, it significantly enhances code quality.

For legacy software, it's advisable to at least adopt the TLD (test-last development) approach, which means developing tests after the functionality has been created. Then, progressively transition to TDD to improve code quality and reduce complexity. For new projects, prioritize TDD.

In all scenarios, the goal is to test your code to prevent unpleasant surprises in production. According to Atlassian, it's recommended to have 80% of your code covered by tests (known as "code coverage").

TDD is recommended in certain scenarios but not all. For instance, if you operate in a regulated industry like banking or healthcare, it is imperative to test your code. Software malfunctions can impact your organization's legal liability. If your software is designed for long-term use and maintenance - such as in defense - TDD is advised. However, if you're a start-up in the proof-of-concept phase, software malfunction consequences might be less severe, allowing you to prioritize productivity. As your organization grows and the number of contributors increases, testing becomes essential. For a new contributor, a test can serve as an example of how a function operates, aiding code understanding.

In essence, it's about striking a balance between productivity, functionality assurance, and technical debt control.

This chapter serves as an introduction to the importance of testing your code. Many complementary approaches and best practices concern software engineering more generally (such as the YAGNI, KISS or DRY principles) rather than DevOps specifically. For instance, TDD can be supplemented with BDD (behavior-driven development) or ATDD (acceptance test-driven development), if your organization's maturity and team size allow.

All these tests can be automatically verified before any production release. Let’s explore what this entails and how to implement it in the next chapter.

Continuous Integration

Continuous Integration (abbreviated CI) is a development practice within the software factory. The idea is as follows: with every code change, automated scripts are triggered to check the conformity of the contribution. This conformity can relate to security standards, verify software quality, or check prerequisites for production deployment.

For instance, your security teams may not have the time to validate the conformity of every contribution. They can then delegate part of these checks to scripts that will automatically and consistently ensure the codebase meets your security standards. The benefits are threefold:

  1. Your security engineers can work on higher value-added tasks
  2. The compliance with your security rules is no longer "dictated" but guaranteed by "coded" checks
  3. Developers see directly if their code is compliant and can immediately modify it if it isn't

Thus, in a DevOps approach, security managers are no longer individuals setting rules on paper but engineers "coding" security rules in the form of automated scripts, within the software forge. This ensures these rules are respected by developers and production.

Here are some examples of algorithms that can be executed to automatically check rules or take actions upon a triggering event:

  • Ensure the presence of documentation
  • Ensure documentation follows the organization's defined formatting
  • Verify that the documentation is up-to-date
  • Ensure all environment variables are declared in the appropriate files
  • Check that passwords haven't been mistakenly added
  • Ensure the presence of a required configuration file
  • Ensure code adheres to development and formatting standards (e.g., PEP 8, Black, Pylint)

All these tasks contribute to reducing the technical debt of your codebase and facilitate the deployment of your projects, ensuring the effectiveness of the standards defined by your DevOps teams.
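As an illustration, one of these checks (detecting passwords added by mistake) could look like the following Python sketch. The pattern and the names `SECRET_PATTERN` and `find_hardcoded_secrets` are assumptions made for the example; dedicated secret scanners are far more thorough.

```python
import re

# Hypothetical rule: flag assignments such as `password = "..."` or `api_key: '...'`.
SECRET_PATTERN = re.compile(
    r"(password|passwd|secret|api_key)\s*[:=]\s*['\"][^'\"]+['\"]",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source: str) -> list[int]:
    """Return the 1-based line numbers that look like hard-coded secrets."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


# Hypothetical contribution submitted by a developer.
contribution = '''
db_host = "db.internal"
db_password = "hunter2"
timeout = 30
'''
offending_lines = find_hardcoded_secrets(contribution)
# A CI job would exit with a non-zero status when this list is not empty.
```

Because the rule is code, it is applied identically to every contribution, and the developer gets the feedback within minutes instead of waiting for a manual review.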

It's common to hear about a so-called continuous integration "pipeline", which accompanies other terms in the CI/CD tech universe. Let's define the most common ones:

  • A Job is a task or script triggered automatically upon an event ;
  • A Pipeline is a sequence of jobs ;
  • Stages are generally the three steps of a continuous integration pipeline (build, test, deploy) ;
  • The "Build" stage includes jobs ensuring the code compiles correctly, and the Docker image builds properly with the directory contents ;
  • The "Test" stage includes jobs checking the code/contribution's conformity ;
  • The "Deploy" stage includes jobs executing actions impacting the infrastructure or production.
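To fix this vocabulary, here is a deliberately simplified Python model of a pipeline. It is not how GitLab or GitHub implement theirs; the `run_pipeline` function and the placeholder jobs are hypothetical.

```python
from typing import Callable

Job = Callable[[], bool]  # A job returns True on success, False on failure.


def run_pipeline(stages: dict[str, list[Job]]) -> list[str]:
    """Run stages in order; stop at the first stage containing a failed job.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage_name, jobs in stages.items():
        results = [job() for job in jobs]  # Every job of the stage runs.
        if not all(results):
            break  # Later stages (e.g. deploy) are skipped.
        completed.append(stage_name)
    return completed


# Hypothetical jobs standing in for real scripts.
pipeline = {
    "build": [lambda: True],
    "test": [lambda: True, lambda: False],  # One failing check.
    "deploy": [lambda: True],
}
completed_stages = run_pipeline(pipeline)
```

Here the failing job in the "test" stage prevents the "deploy" stage from running, which is exactly the protection expected from a pipeline.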

As mentioned earlier, the advantage of a continuous integration pipeline is also to test the pushed code across multiple environments automatically: your development and pre-production environments before deploying to production. However, these multi-environment pipelines introduce additional complexity, which requires a larger technical team to manage.

Within a software factory, technologies such as GitLab Runners, GitHub Actions, or services like CircleCI are used to execute continuous integration tasks.

Continuous deployment

Continuous deployment (abbreviated CD) is a DevOps practice that allows the triggering of administrative actions or the deployment and updating of software in production. The triggering is not necessarily automated, but the applied actions are coded. This means they are predictable, traceable, and replicable. This reduces the time to provide a new feature to its users, minimizing manual intervention and the risk of errors by administrators.

This practice aligns with the principle of "continuous delivery", which encompasses steps prior to deployment. For instance, publishing the binaries or images of the software's latest version, or creating the latest release of the project in the software factory.

Most of the time, continuous deployment pipelines are technically similar to continuous integration pipelines. For example, they replay tasks from continuous integration pipelines before deploying the software. However, they might require more specific parameters, such as environment variables or secrets (e.g., stored in HashiCorp Vault or Conjur). Indeed, deployed software often relies on environment variables to run correctly on a target infrastructure.

It is common to encounter different staging and production environments. These validate the proper functioning of software before its production release. Continuous deployment pipelines automate all or part of this process, optionally adding smoke tests or functional tests.

Initially, the goal is to at least automate the update of your software in production. You can do this similarly to continuous integration pipelines, using GitLab Runners or GitHub Actions.

More advanced practices exist for seasoned users. As discussed in the "GitOps" chapter, our git repository is the "single source of truth" for software. Therefore, infrastructure should ideally rely on it to determine the expected state of software in production. For instance, ArgoCD continually checks for changes in a git repository on a specific branch (often "main" or "master"). When ArgoCD detects a change, it attempts to deploy the very latest version of the monitored software.
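The principle of comparing what the repository declares with what actually runs can be sketched as follows. This is a conceptual illustration only, not ArgoCD's implementation; the function `plan_sync`, the service names and the version numbers are hypothetical.

```python
def plan_sync(desired: dict[str, str], running: dict[str, str]) -> dict[str, str]:
    """Compare the versions declared in git with those running in production.

    Returns the services to (re)deploy, mapped to their target version.
    """
    return {
        service: version
        for service, version in desired.items()
        if running.get(service) != version
    }


# Hypothetical state: what the repository declares vs what is deployed.
declared_in_git = {"api": "1.4.0", "frontend": "2.1.0"}
running_in_production = {"api": "1.3.2", "frontend": "2.1.0"}
to_deploy = plan_sync(declared_in_git, running_in_production)
```

Only the "api" service differs from the declared state, so it is the only one to be redeployed. Run in a loop, such a comparison keeps production aligned with the repository.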

Tools like ArgoCD or Jenkins X allow visual tracking of software deployment status. They shine in a Cloud environment, observing the state of each micro-service.

Built on the same mechanics, it's possible to deploy multiple instances of software simultaneously. For instance, during a code review in a merge request, you can configure ArgoCD to temporarily and independently deploy this "under evaluation" version of the software. This technique allows engineers to quickly test software, rather than deploying it themselves. The URLs often look like xyz.staging.myapp.com.

Using these same tools, you can adopt and automate a blue/green deployment strategy. The idea is to instantiate the new software version (the green one) alongside the current one (the blue one), and to move users from one to the other once the new version is known to function properly. The switch can be made all at once or, as described here, gradually (a variant often called a canary release): the system first directs a limited proportion of users to the new software (e.g., 10%). This proportion is gradually increased over a set period, while measuring the error rate of requests. If the rate is the same as or lower than that of the previous version, the software is rolled out to all users. Otherwise, the deployment is canceled, and the old version remains in production.
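The decision logic behind such a gradual shift can be sketched in a few lines of Python. The function `progressive_rollout`, the traffic steps and the error rates are hypothetical; real tools measure the error rate from live traffic.

```python
def progressive_rollout(steps, baseline_error_rate, measure_error_rate):
    """Shift traffic step by step; cancel as soon as errors exceed the baseline.

    `steps` lists the share of users (in %) sent to the new version.
    `measure_error_rate(share)` returns the error rate observed at that share.
    Returns the final share of users on the new version (0 if rolled back).
    """
    for share in steps:
        if measure_error_rate(share) > baseline_error_rate:
            return 0  # Roll back: the old version keeps all the traffic.
    return 100  # Every step was healthy: the new version serves everyone.


# Hypothetical measurements: error rates observed at each traffic share.
healthy = {10: 0.010, 50: 0.009, 100: 0.010}
degraded = {10: 0.010, 50: 0.040, 100: 0.050}

healthy_result = progressive_rollout([10, 50, 100], 0.010, healthy.get)
degraded_result = progressive_rollout([10, 50, 100], 0.010, degraded.get)
```

In the degraded scenario, the rollout stops at the 50% step, so half of the users never see the faulty version.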

Even more advanced tools exist to address large-scale deployment challenges. We'll explore Palantir's Apollo as an example in the chapter "Deploying simultaneously in different environments".

Moreover, continuous deployment pipelines are not limited to software deployment or administrative task launches. They can be the starting point for monitoring your software. For instance, a continuous deployment pipeline can set up a Prometheus / Grafana instance and start collecting the software's metrics. Deploying your software doesn't mark the end of your infrastructure's resilience cycle: now you need to monitor it. We'll delve into these techniques in the chapter "Measuring everything".

Measuring everything

In the previous chapter - "Leveraging automation" - we saw how automation greatly saves time in managing our infrastructure and enhances its security and resilience.

In this chapter, we'll discuss a significant dimension of automation: observability. It's through measurements that systems can be massively automated, and better decisions can be made organization-wide. Measuring everything achieves three objectives:

  1. Technical and commercial teams can know the state of a service at any time (whether it is operational, partially accessible or down) ;
  2. Technical teams can analyze data to pinpoint issues and attempt to resolve them ;
  3. With these data insights, technical teams can assist commercial teams in making better decisions for the organization.

When an organization trusts decisions based on its own data, its DevOps transformation has reached its culmination. This is commonly known as "data-driven decision making."

The 3 pillars of observability

Activity logs, metrics, and traces are regarded as the three pillars of observability. These three types of data can be generated by software to identify and address issues that might arise once deployed.

Observability is a vast topic in the realm of system reliability. In this chapter, we will only touch upon the essentials.

The field of observability can be summarized as the set of tools and practices that allow engineers to detect, diagnose, and resolve system issues (e.g., bugs, latencies and availability) as swiftly as possible. Beyond the need for resilience, the collection of some of this data is sometimes legally required.

Let's delve deeper into what each of these data types can tell us:

  • logs are immutable and timestamped records describing specific events over time. The code generating a log entry is usually manually added by a developer within the software ;
  • metrics are numerical representations of phenomena measured over time, such as the number of requests, response times, or resource usage (such as RAM, CPU, disk or network usage) ;
  • traces are a kind of log that follows the path of an operation (e.g., a request). A trace is a set of logs with additional information to trace an operation across the various services it traverses. Each stage and sub-operation traversed is termed a span. Logs for a trace are typically generated automatically.

Let's focus on traces to better grasp their implications. Consider an application sending a request to a REST API. A trace consists of spans and metrics, associated with a unique identifier. This ID differentiates the path of our request as it moves through all the services it touches.

Traces are relayed independently by libraries like OpenTelemetry's SDK. They are then sent to a trace collector such as Jaeger, Tempo, or Zipkin for validation, cleansing, and/or enrichment. They are subsequently stored in a centralized backend like Elasticsearch or Cassandra. The trace identifier allows us to retrieve the chronology of the operations the request underwent.
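To picture how a trace identifier ties spans together, here is a simplified Python model. It does not use the OpenTelemetry API; the `Span` class, its fields and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str    # Shared by every span of the same request.
    name: str        # Operation performed by one service.
    start_ms: int    # Start time, in milliseconds.
    duration_ms: int


def chronology(spans: list[Span], trace_id: str) -> list[str]:
    """Return the operations of one trace, ordered by start time."""
    selected = [span for span in spans if span.trace_id == trace_id]
    return [span.name for span in sorted(selected, key=lambda s: s.start_ms)]


# Hypothetical spans as they could arrive, unordered, from several services.
collected = [
    Span("req-42", "database.query", start_ms=12, duration_ms=30),
    Span("req-42", "gateway.receive", start_ms=0, duration_ms=50),
    Span("req-07", "gateway.receive", start_ms=3, duration_ms=8),
    Span("req-42", "api.handle", start_ms=5, duration_ms=42),
]
request_path = chronology(collected, "req-42")
```

Filtering on the identifier "req-42" isolates one request among all the others and rebuilds its path: gateway, then API, then database.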

The biggest challenge of tracing is its integration within an existing infrastructure. To fully utilize traces, every component the request goes through must emit a log and propagate the tracing info. Tracing via a service mesh might be a quick way to avail tracing features without altering the software code. We'll explore what a service mesh is and how this technology works in the chapter "Service Mesh".

Within vast infrastructures, logs and traces might be too extensive for timely ingestion by their log server. Data can then be lost. To avert this, it's common to use services that stagger the indexing. A server like Kafka can be positioned in front of the log server to gradually absorb logs. Then, a tool like the Jaeger Ingester steadily indexes them. For rsyslog logs, protocols like RELP might be necessary to ensure proper storage.

Whether using Logstash or Loki for logs, or Jaeger or Tempo for traces, normalizing your data is crucial for proper storage and processing. To address this challenge, the OpenTelemetry project defines semantic conventions, which are widely adopted.

By implementing observability mechanisms, you'll be better equipped to answer questions like "what caused this bug to happen?". Your engineers can rely on comprehensive data to fix bugs faster. This data will enable us to construct our resilience indicators, leading to more informed decisions.

Knowing When to Innovate and When to Stop

At first glance, it's not clear where to draw the line between resilience and innovation. The idea is to measure the state of services to determine when it's appropriate to innovate.

Measurement is one step, but it's crucial to measure the right things, at the right level. In a distributed infrastructure, one of the servers can fail without necessarily affecting the availability of software for your customers. Measuring a server's availability might be interesting for your technicians, but it may not be the right metric to determine the impact of a malfunction on the user. This is something your organization needs to define:

  • What metrics indicate a service that is functioning "properly"?
  • What downtime percentage do you allow?

For the second question, you can't answer "0%". If you put all your efforts into keeping the service available 100% of the time, you'll slow down the release of new features. And it's these features that propel your project forward. That's where the concept of an "error budget" comes in.

The error budget is the amount of time, within a given period, during which your company accepts that its services may be unavailable. As long as your accumulated downtime stays below this budget, you can take the opportunity to deploy a significant new service, highly interactive with others, or even update a critical system. This budget also covers hardware malfunctions requiring replacement and interventions on a system during a planned outage.

For instance, if your error budget is 54 minutes per week, and you haven't exceeded 10 minutes in the past three weeks, allow yourself to take more risks. If it's the opposite, work on making your infrastructure more resilient.
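The arithmetic behind an error budget is simple enough to be sketched in Python. The function names and the 99.5% availability target are assumptions made for the illustration; the last line reuses the figures of the example above.

```python
def error_budget_minutes(availability_target: float, period_days: int) -> float:
    """Downtime allowed over the period for a given availability target (in %)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_target) / 100


def remaining_budget(budget_minutes: float, downtime_minutes: float) -> float:
    """Budget still available; a negative value means it is overspent."""
    return budget_minutes - downtime_minutes


# Hypothetical target: 99.5% availability measured over one week.
weekly_budget = error_budget_minutes(99.5, period_days=7)

# Figures from the text: 54 minutes allowed per week, 10 minutes consumed in three weeks.
left_over_three_weeks = remaining_budget(3 * 54, downtime_minutes=10)
```

A 99.5% weekly target leaves a little over 50 minutes of allowed downtime. With the figures from the text, 152 of the 162 minutes granted over three weeks are still available, which is a clear signal that more risk can be taken.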

In short, the error budget is an agreement between management and the technical teams, helping to prioritize innovative efforts versus work to enhance infrastructure resilience.

It enables engineering teams to reassess goals that might be too ambitious concerning the acceptable risk. This way, they can set realistic goals. The error budget allows teams to share responsibility for a service's resilience: infrastructure failures impact developers' error budget. Conversely, software failures affect the SRE teams' error budget.

Be mindful of your error budget consumption peaks: if an engineer spends ten hours instead of one to fix an incident, it's advisable to open a ticket with someone more experienced. This will prevent consuming the entire error budget.

To answer the first question, let's look at the possible indicators to monitor in the next chapter.

Resilience indicators

The 4 golden signals

Monitoring distributed systems presents a real dilemma. SRE teams need to monitor them effortlessly - allowing quick interventions - even though their architecture is often complex. Indeed, various technologies make up these systems. The 4 golden signals offer a unified method of characterizing the most vital phenomena to watch.

Let's explore the four metrics that will allow us to create our resilience indicators:

  1. Latency represents the time the system takes to respond to a request. It is essential to distinguish between successful requests and failed ones. For instance, if your systems return server errors quickly, it doesn't mean your system is healthy. Therefore, you should filter your latency measurements by excluding error responses.
  2. Traffic represents the number of requests coming into a system. It is usually expressed in requests per second, or in MB/s for data streams.
  3. Errors represent the rate of failed requests. Requests can fail "explicitly" or "implicitly". Explicit errors might return an HTTP 500 code, for example. Implicit errors could return an HTTP 200 code, but the content is not as expected.
  4. Saturation represents the extent to which your system's resources are used, that is, the resource utilization rate compared to the maximum load your system can handle. It helps answer questions like "Can my server handle client requests if the traffic doubles?" or "When is my hard drive likely to be full?". It's based on measurements of RAM, CPU, network, and I/O usage.
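To make the four signals concrete, here is a minimal sketch that computes them from a handful of request records. The `Request` structure, the field names and the sample values are assumptions made for the example. In practice these figures come from your monitoring tooling rather than from hand-written code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to answer the request
    status: int         # HTTP status code
    valid_body: bool    # False models an "implicit" error (HTTP 200, wrong content)


def is_error(request: Request) -> bool:
    """Explicit errors (HTTP 5xx) and implicit errors both count as failures."""
    return request.status >= 500 or not request.valid_body


def golden_signals(requests: list[Request], window_s: float, used: float, capacity: float) -> dict:
    """Summarize the four signals over one observation window."""
    successes = [r for r in requests if not is_error(r)]
    latency = sum(r.duration_ms for r in successes) / len(successes) if successes else None
    return {
        "latency_ms": latency,  # error responses are excluded on purpose
        "traffic_rps": len(requests) / window_s,
        "error_rate": (len(requests) - len(successes)) / len(requests) if requests else 0.0,
        "saturation": used / capacity,  # e.g. GB of RAM used over GB available
    }


sample = [
    Request(120.0, 200, True),
    Request(80.0, 200, True),
    Request(5.0, 500, True),     # fast, but an explicit error
    Request(100.0, 200, False),  # implicit error
]
print(golden_signals(sample, window_s=2.0, used=6.0, capacity=8.0))
```

Note how the very fast HTTP 500 response does not improve the latency figure, since failed requests are filtered out.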

Within a Cloud infrastructure, a service mesh automates the collection of these measurements. We'll explore this technology in the "Service mesh" chapter. We'll also discuss tools available to gather and visualize these metrics. But before that, let's see how to create our resilience indicators in the next chapter.

SLI, SLO, and SLA

The value of your error budget stems from your Service Level Objectives.

An SLO defines a target resilience level for a system. It is represented as a ratio of "good" events to be honored, out of all monitored events, over a specific time period. For instance, your SRE team may set the following objective: "99% of pages should load in under 200 milliseconds over 28 days."

The "right" objective of an SLO is determined by the threshold of tolerance your customer can bear against an annoying issue. For example, quantify what it means for them to experience a "slow" website (e.g., through an SEO study). If your clients typically leave your pages after waiting more than 200 milliseconds, set your SLO to "99.9% of responses should be returned in under 200 milliseconds, over 1 month."

A good SLO should always be close to 100% without ever reaching it; we discussed the reasons for this in the chapter "Knowing When to Innovate and When to Stop." As for how often you should achieve this target (99.9% over 1 month), there's no initial rule to define it. You can base it on the average of your past measurements or experiment. This value should match the workload your team can handle.

SLOs are built upon one or more "Service Level Indicators" (abbreviated SLI). The SLI is the current rate of good events measured, out of all considered events, over a given period. Built on one or more measurements, it captures one aspect of a system's resilience. It characterizes a phenomenon that can negatively impact your user: the response time to a query, the proportion of returned data that is up to date, or even read and write latency for data storage.

An SLI can be composed of one or more measurements. However, avoid creating overly complex SLIs or SLOs, as they might represent vague or misleading phenomena.

An SLO sets a service quality to maintain, which means a certain value for an SLI. An SLO takes a format like: "SLI X should be maintained Y% of the time over Z days."
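The relationship between an SLI and an SLO can be illustrated as follows. This is a sketch under assumptions: the function names and the sample of load times are invented, and treating a period without traffic as compliant is an arbitrary choice.

```python
def sli(good_events: int, total_events: int) -> float:
    """Ratio of good events to all monitored events over the period."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def slo_met(load_times_ms: list[float], threshold_ms: float, target: float) -> bool:
    """True when the share of loads faster than the threshold reaches the target."""
    good = sum(1 for t in load_times_ms if t < threshold_ms)
    return sli(good, len(load_times_ms)) >= target


# Hypothetical sample: 995 fast page loads and 5 slow ones.
sample = [120.0] * 995 + [450.0] * 5

print(sli(995, 1000))                  # 0.995
print(slo_met(sample, 200.0, 0.99))    # True: "99% of pages under 200 ms" holds
print(slo_met(sample, 200.0, 0.999))   # False: a 99.9% objective is missed
```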

While SLOs should primarily represent your users' pain points, you can also establish them for your internal teams. For instance, your infrastructure could ensure each server responds "99% of the time in under 500 milliseconds to ICMP requests over 1 week." In this case, determine your SLOs based on your historical measurements. For instance, if 99% of your ICMP requests responded in under 300 milliseconds last month, set the SLO to "99% of ICMP requests should respond in under 300 milliseconds over one month."

SLOs should be defined in collaboration with decision-makers. Their involvement in this definition is essential for them to understand how their decisions (e.g., prioritizing tasks or workload demands) impact the resilience of the infrastructure. If a decision-maker wants engineers to frequently and quickly roll out new features, SLOs help understand when the imposed pace is too demanding. Conversely, consistently exceeding SLOs suggests that your company can move faster without compromising service quality. The participation of decision-makers is all the more critical as they sometimes put the company's reputation on the line. The SLA is the reason for this.

The Service Level Agreement (abbreviated SLA) is a contract between your organization and a client. If your service quality falls below what your SLAs dictate, your organization faces penalties. An SLA is built on one or more SLOs, setting intentionally lower resilience rates for safety. Here are some examples:

  • Below 99.9% availability, Google starts refunding its Google Workspace clients. Between 99.9% and 99% availability, 3 extra access days are added to the client's account. Below 95%, it's 15 days ;
  • Below 99.5% availability, AWS begins refunding its EC2 instance clients. Between 99.5% and 99% availability, the client is refunded 10% of their expenses. Below 95%, they're refunded in full (100%) ;
  • Below 99.9% availability, Microsoft starts refunding its Teams customers: the client first receives a credit amounting to 25% of their expenses. Below 95%, they get a credit for 100% of their expenses.

SLAs are not mandated by law. However, they can be part of your service contract to clarify your commitments and avoid disputes. Indeed, it's always preferable to list clear terms both you and your client have agreed to. SLAs also serve as a competitive edge: your company commits to a certain quality of service, whereas your competitors might not. Implementing an SLA in your governance approach holds stakeholders accountable and shares expectations. The company now pivots based on the metrics it gathers and interprets as SLOs. This is referred to as being data-driven.

Within an institution, you can use SLOs and SLAs as a means to gain credibility among your superiors or specific teams. An SLA might justify hiring necessary personnel to maintain a certain service level. Or it could warrant a budget increase to enhance the team's operations. Conversely, superiors might demand a certain service quality level from your teams, reflecting in the annual objectives of staff members. For pilot projects, establishing SLOs is sufficient. Setting reliable SLOs is a challenge in itself. Maintaining those objectives is another.

Alerts and percentile aggregation

Your alerting mechanisms must continuously monitor your SLIs to ensure they stay within your SLOs. And most importantly, that they don't breach your SLAs! But how do you raise an alert before your SLOs or SLAs are breached?

Take this SLO as an example: 99% of pages must load in under 200 milliseconds over 28 days.

The simplest method is to calculate the average page load time over a short period. For instance, the average load time over 5 minutes. When set to this approach, your alerting mechanism triggers if, over the past 5 minutes, the average load time exceeds 200 milliseconds.

However, basing alerts on averages or medians is not ideal. This approach can hide failures that affect only a fraction of your requests. Google recommends another method using percentiles. This distribution method highlights trends among the top X% of gathered measurements.

Imagine your infrastructure serving millions of users, handling billions of requests. A faulty page, affecting only a few hundred users on your site, might go unnoticed if you rely on averages or medians. But with percentile aggregation, you can spot these anomalies more clearly.
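The difference between the two approaches can be seen on a small, made-up sample. The sketch below assumes a 5-minute window in which 2% of page loads are very slow, and uses a simple nearest-rank percentile. The `percentile` helper is written for the example and is not taken from a monitoring product.

```python
import math
from statistics import mean, median


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical 5-minute window: 9,800 fast page loads and 200 very slow ones.
load_times_ms = [100.0] * 9800 + [3000.0] * 200
threshold_ms = 200.0

print(mean(load_times_ms))            # 158.0  -> below the threshold, no alert
print(median(load_times_ms))          # 100.0  -> no alert either
print(percentile(load_times_ms, 99))  # 3000.0 -> the slow tail is visible

should_alert = percentile(load_times_ms, 99) > threshold_ms
print(should_alert)                   # True
```

With the objective "99% of pages must load in under 200 milliseconds", the 99th percentile is the value to watch: the average and the median both stay under the threshold even though the objective is being missed.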

The original book includes illustrations to better understand SLIs and SLOs.

To develop your intuition regarding these indicators, start with classic SLIs and SLOs. Once your infrastructure matures – especially in user count – you can shift to advanced SLIs and SLOs.

M-T-T-x

M-T-T-x refers to metrics that qualify the average time it takes for an event to occur or conclude. The "x" in the acronym M-T-T-x represents the multiplicity of these types of metrics. For instance, M-T-T-R (an abbreviation for mean time to recovery) is used to track how long a team takes to restore a failing system.

Monitoring these metrics over time allows you to gauge the effectiveness of your resilience efforts. It also helps in assessing the efficiency of your teams in responding to incidents. If the metrics deteriorate, you will need to examine the reasons and possibly reshuffle your priorities so as not to compromise your SLOs. The advantage is that you will know what to focus on.

There are numerous M-T-T-x in literature, each with their particularities and nuances.

You can begin tracking your M-T-T-x using collaborative spreadsheets (e.g., Baserow, NocoDB or Google Sheets) and later transition to more integrated tools like Jira Service Management or Odoo. The idea is to be able to compute and visualize the trend of your M-T-T-x over time.

If you choose a spreadsheet, you can use a defined structure. The original book shows a table of successive M-T-T events associated with a start date, end date and a link to the detailed incident.

  • The metric column denotes the M-T-T-x name ;
  • The start date indicates when the event began ;
  • The end date represents when the event ended ;
  • The incident column can reference an incident ID or link to the postmortem.

Calculate your M-T-T-x by averaging the differences between start and end dates for each incident. Sample over a calendar month period.
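If your spreadsheet can be exported, the calculation is easy to script. Here is a minimal sketch, assuming rows shaped like the columns described above. The dates, incident identifiers and function name are invented for the example.

```python
from datetime import datetime, timedelta

# Rows mirror the spreadsheet columns described above; the values are made up.
events = [
    {"metric": "MTTR", "start": "2024-03-02 10:00", "end": "2024-03-02 11:30", "incident": "INC-1"},
    {"metric": "MTTR", "start": "2024-03-15 08:00", "end": "2024-03-15 08:30", "incident": "INC-2"},
    {"metric": "MTTR", "start": "2024-04-01 09:00", "end": "2024-04-01 12:00", "incident": "INC-3"},
]


def mean_duration(rows, metric, year, month):
    """Average (end - start) for one metric over the events that started in a given month."""
    fmt = "%Y-%m-%d %H:%M"
    durations = []
    for row in rows:
        start = datetime.strptime(row["start"], fmt)
        end = datetime.strptime(row["end"], fmt)
        if row["metric"] == metric and (start.year, start.month) == (year, month):
            durations.append(end - start)
    if not durations:
        return None  # no event recorded for that month
    return sum(durations, timedelta()) / len(durations)


print(mean_duration(events, "MTTR", 2024, 3))  # 1:00:00
print(mean_duration(events, "MTTR", 2024, 4))  # 3:00:00
```

Running it for each month gives the trend to plot over time.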

Most of the metrics can be derived from your postmortem. They are intrinsically linked to it and complement it. Be sure to keep your M-T-T-x updated to quantify your resilience level and pinpoint critical areas affecting it.

Service mesh

Despite its very tangible and practical application, the service mesh (literally a "mesh of services") can seem complex at first glance.

Let's approach it through some challenges that illustrate its significance:

  • "Our software is written in 6 different languages, and we don't have a unified way to gather telemetry (which are application logs, error logs and metrics)." ;
  • "We have 70 system administration teams, and getting them to implement TLS between all their services would be an organizational nightmare." ;
  • "We have hundreds of containers running on multiple geographically distributed machines with no unified way to analyze network latencies." ;
  • "We're experiencing slowness in our service usage and can't determine if it's a network or software issue." ;
  • "We have no means of assessing if a newly deployed software version introduces slowdowns."

Thanks to the standardized deployment mechanisms offered by container orchestration systems (e.g., Kubernetes), a service mesh can address these challenges by "plugging into" your orchestration system. It can enhance the security, stability, and observability of your infrastructure by:

  • Managing security certificates in one place ;
  • Handling advanced authorizations in the administration of network flows ;
  • Controlling network flows with rules (with A/B testing, canary or blue/green deployments and request rate limits) ;
  • Distributing network load equally among services (with load balancing) ;
  • Automatically gathering network metrics based on the "4 golden signals", regardless of where the pods are deployed ;
  • Collecting application access logs ;
  • Providing details on the routing of requests across pods distributed over multiple nodes.

As these metrics are standardized, most service meshes allow for their use in setting up automatic rules based on the network activity of the infrastructure.

In summary, a service mesh manages all or part of the following aspects: network traffic management, flow security, and network observability. This leads to better infrastructure security, improved auditability, and reduced service disruption.

An overview of the workings of a service mesh: "proxy" containers are added to each pod to manage interactions with the service mesh.

Technically, a service mesh installs itself on your orchestration software (e.g., Kubernetes) and attaches a container called a sidecar to each pod. This sidecar acts as a network proxy, managing the above-mentioned interactions with the service mesh.

However, a service mesh is not lightweight technology: it requires internal administration and training before you can reap its benefits. Don't expect a technology that reduces your system administrators from 50 to 5 to be manageable by only 2 people. Service meshes are undoubtedly beneficial, but ensure you're sized to manage them.

Several service meshes are available, each with its strengths and weaknesses. Take your time to compare them before selecting one. For instance, Linkerd is easier to deploy than Istio but offers fewer features. Consul is another alternative.

Extensions to simplify infrastructure

As described in the chapter "A Foundation for Your Resilience", Cloud platforms offer the advantage of including a variety of services that cater to common security and monitoring needs. These services automatically handle features that historically were tedious to develop individually for each software or for the infrastructure itself.

Using CRDs or by deploying the Helm configurations of Cloud native tools, it is possible to easily "install" foundational services within a Kubernetes cluster. Here's a non-exhaustive list of services that can be natively supported in your cluster and managed centrally:

  1. Centralization of application and network logs and traces (with technologies such as Filebeat, Fluentd, OpenTelemetry or Jaeger).
  2. Centralization of performance metrics of cluster nodes and containers (with technologies such as Mimir or Metricbeat).
  3. Antivirus analysis of node and container content (with technologies such as Docker Antivirus Exclusions or Kubernetes ClamAV).
  4. Detection of suspicious Linux system call behaviors (with technologies such as Sysdig Falco).
  5. Control and audit of cluster configurations (with technologies such as Gatekeeper or OpenSCAP).
  6. Management of application secrets (such as passwords and tokens) (with technologies such as Vault or Sealed Secrets).
  7. Automated backup of persistent volumes (with technologies such as Velero).
  8. Encryption of network traffic between containers.
  9. Management of security certificates.
  10. Management of web service authentication (with technologies such as Istio Ingress Gateway or Keycloak).

Integration and automation are the fundamental characteristics of a Cloud foundation. Once again, in DevOps, it's believed that something not automated won't be used.

The technologies mentioned above automatically interface with the deployed software. In the Cloud, it's not the software's job to interface with the foundation's technologies, but rather the foundation interfaces with the software.

Leveraging available resources

Finding ambassadors for your project

The project manager is responsible for doing everything necessary to ensure the project meets its objectives. They often play the role of a product owner - a term defined in the Agile methodology - who acts as a liaison between technical and business teams. They are the one who "sells" your project to its users.

It's vital for this role to be close to both the end-users, to understand business challenges, and to the technical team to grasp engineering stakes.

Sometimes, project managers tend to "over-promise" timelines. This practice creates stress for teams and ultimately leads to client frustration. Indeed, clients are promised a tool that will only be delivered later. Thus, it's essential to manage expectations.

Always keep in mind: "Under-promise, over-deliver"

To accelerate the adoption of your solutions, invite a business representative to your presentations. If this person is convinced by your product, they might be inclined to present it themselves, explaining its significance in their daily work.

Getting clients to vouch for you is the best way to gain credibility. It proves that your solution meets a current need. By illustrating a use case, your audience can quickly envision how they might use your tool. If you want to convince a hard-to-reach audience, a client testimonial is your best ally.

Try to establish a strong network of a few "ambassadors" within your organization to assert your legitimacy and support your initiative. Besides this support, the ambassador will help capture user feedback or provide it themselves to refine your value proposition.

Reservists or "20% project"

In the private sector, especially among the GAFAM, it's common for employees to get one day a week dedicated to participating in a different project within the company. One day out of five, they choose to work for another team. This option benefits both the employee and the company: the employee explores different technologies and practices, enhances skills in those areas, and then leverages this knowledge for other projects they handle.

Another example is the "10% program" run by the French governmental organizations DINUM and INSEE. Based on volunteering, the aim is for public service agents to dedicate 10% of their working time to common interest projects.

Try to offer your hierarchy this possibility so that each employee can benefit from this program: this will encourage exchanges, bring the teams closer together and build loyalty among your employees by allowing them to discover and work on new subjects.

To take advantage of all the resources at your disposal, consider employing reserve personnel within your team if your organization allows it. Even if they are only present a few days a year, they can support you on specific tasks. For example, an information systems security reservist can help you complete a certification, and a data scientist can evaluate an artificial intelligence solution or provide one-off support on a complex dataset.

Public/private synergy: a win-win approach

Major organizations today primarily rely on services provided by industrial partners for their technical projects. This might be due to a lack of in-house experts, a lack of human resources, or both. It's a mistake to simply trust the industrial partner thinking, "they are the experts, everything will work, I just need to pay." Anyone who has led an industrial program has faced challenges with stakeholders understanding the business stakes and has seen that a project never goes 100% according to the planned blueprint.

It's a strategic error to believe that merely paying a service provider will get you the solution you expect. If you're not a technical expert in the field who has practiced recently, you'll never be at a level to effectively challenge your provider's proposals. You risk either not addressing your business challenges, losing money, or likely both.

This is why it's crucial to have in-house, within your own teams, experts who are practitioners on the topic you want to develop. They are the only ones capable of critically assessing your service provider's proposals to save you time and prevent you from being tricked with features that have exorbitant costs or unrealistic promises.

Every DevOps and SRE engineer knows: it's impossible for a system to function 100% of the time. That's why you cannot expect a service provider, regardless of the price you pay, to deliver something 100% functional.

For instance, even Google does not promise more than 99.9% availability with its capitalization of over 1.6 trillion dollars and its approximately 150,000 rigorously selected employees. Amazon (more exactly AWS) with its capitalization of more than 1.4 trillion dollars does not guarantee more than 99.5%.

Better organization to avoid failure

The traditional approach of institutions working with industrial partners resembles "waterfall" developments: a major meeting is set up to gather requirements, a technical and functional specification document is drafted to structure the contract, developments are then undertaken, and the final product is delivered, concluding the contract.

Given the intense dynamics of the digital realm, this method is suboptimal. According to Procter & Gamble, the average lifespan of software doesn't exceed 3 to 5 years, even if periodic updates are provided.

Now consider this scenario: you are tasked with equipping your organization with a new digital tool.

  • If you've reached the point of initiating the project, it's likely that the need for this tool arose several months or even years ago ;
  • You then gear up to compare existing solutions and engage with an industrial partner. This takes between 1 and 3 months ;
  • Once you've chosen your industrial partner, you arrange a meeting between stakeholders and the industry experts to help them understand the challenges and your expectations ;
  • Drafting the specification takes an additional month. Some back-and-forths to refine it and that's 1 month more ;
  • You'll likely need to get approval for this new tool to comply with your organization's I-T security policies: even if conducted concurrently, this will likely add another month ;
  • Contract finalization also takes 1 month. Development lasts between 3 and 6 months ;
  • Presentations and operational verification: 1 month ;
  • Deployment adds another 2 weeks to 2 months, depending on your I-T security policies and available networks.

In the end, the entire process might take roughly a year, and you've yet to place the tool in the users' hands. At this point, you can't even be sure it meets the need, considering that the stated requirement often differs from the actual requirement.
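The durations listed above can be added up to check this estimate. The sketch below simply sums the minimum and maximum of each step. It assumes that 2 weeks count as half a month and that the initial stakeholder meeting, for which no duration is given, adds no delay.

```python
# (step, minimum months, maximum months), taken from the list above
steps = [
    ("compare solutions and engage a partner", 1, 3),
    ("draft the specification", 1, 1),
    ("refine the specification", 1, 1),
    ("security approval", 1, 1),
    ("contract finalization", 1, 1),
    ("development", 3, 6),
    ("presentations and operational verification", 1, 1),
    ("deployment", 0.5, 2),  # assumption: 2 weeks counted as 0.5 month
]

shortest = sum(low for _, low, _ in steps)
longest = sum(high for _, _, high in steps)
print(shortest, longest)  # 9.5 16
```

Between nine and a half and sixteen months: "roughly a year" is, if anything, on the optimistic side.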

Unfortunately, when the users get their hands on the tool, it might not fully meet the needs. The tool might be impractical, and your colleagues might prefer the old one they are familiar with.

Such an approach is untenable today. One of the tenets of DevOps is the ability to "fail fast," iterate frequently, and swiftly arrive at a version that meets requirements. In this context, the DevOps methodology advises against rushing into a "fully fleshed out" specification. Start with an initial version, fail, iterate, and perfect the tool alongside your customer.

Remember this principle? "Break down organizational silos by involving everyone": it's vital to engage your customers throughout the project lifecycle. If you don't frequently consider their feedback, the end product might be misaligned. Even if it does meet their needs, it might be too complex and thus unappealing.

Thus, if you aim to work efficiently with an external company, you should bring all project-related stakeholders closer. Ensure everyone's voice is heard by establishing an easy and practical communication tool for feedback and suggestions. For instance, you could ask the industrial partner to grant access to their software factory (e.g., GitLab, Bitbucket or GitHub) for your teams to provide comments, and for engineers to address them in a feedback loop.

GitLab also supports continuous deployment, allowing the industrial partner to provide clients with a URL to access the latest version of the software. This way, you avoid lengthy meetings and achieve flexibility. The goal is met: you iterate, quickly.

The figure showcases the Kanban view in GitLab, where comments on software (such as tasks, feedback or bugs...) are consolidated.

If you can't influence your collaboration practices with the industrial partner, at the very least, internally organize to have a collaborative project management tool. For instance, use software like Atlassian Confluence to create an internal knowledge base.

Kanban view in Atlassian Confluence showing consolidated comments on software (such as tasks to be done, feedback or bugs...).

For instance, the ITZBund (German Federal Center for Information Technology) has been using the open-source software Nextcloud in its Bundescloud (inter-ministerial cloud) since 2018. It enables file sharing and collaboration on a unified platform. Roughly 300,000 institutional and industrial users employ it. A year later, the French Ministry of the Interior also adopted it.

This practice is a win-win for everyone: clients experience shorter delivery times, end-users get a tool that better fits their needs, the industrial partner sees the potential for renewed contracts with satisfied clients, and taxpayers get value for their money. Overall, everyone saves time, is pleased with the outcomes, and feels more engaged in every interaction.

Measuring the success of your transformation

It is crucial to measure the efforts you invest in your initiative. This allows for a factual assessment of the effectiveness of your decision-making. Initially, it's not uncommon to witness a degradation in performance since you are altering routines and your organization's equilibrium. If you notice a decline in the metrics over time, you know you need to adopt a different strategy to reverse the trend.

According to research, an organization's technical maturity can quadruple its team's performance. Let's explore some indicators used in the industry. These indicators are frequently debated but still seem to be the widely accepted reference.

The success of a DevOps initiative is measured using 4 theorized measures. An additional fifth measure reflects the organization's operational performance. These metrics showcase results at the overall scale of your I-T systems and your organization rather than just software measures. The latter might stem from local improvements, compromising overall performance. Let's dive into them:

  • Deployment Frequency: For the primary software or service you are working on, how often does your organization deploy code to production or make it available to its users?
  • Lead Time for Changes: For the primary software or service you're working on, how long does it take to get a change into production? (i.e., the time from validated code to functioning code in production)
  • Time to Restore Service: For the primary software or service you are focusing on, how long does it typically take to restore the service when an incident or fault impacting users occurs? (e.g., an unplanned outage or degraded service)
  • Change Failure Rate: For the primary software or service you're working on, what percentage of production updates or new version releases lead to service degradation (e.g., deterioration or service interruption) and subsequently require fixes (e.g., a hotfix, a rollback, a fix delay or a patch)?
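If you record your deployments and incidents, these four measures can be computed with very little code. The following is a sketch under assumptions: the record layout, the field names and the sample data are invented, and using the median for lead time and the mean for restore time is a choice made for the example.

```python
from datetime import datetime
from statistics import mean, median


def hours(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[dict], incidents: list[tuple], period_days: int) -> dict:
    """Compute the four measures; assumes at least one deployment in the period."""
    lead_times = [hours(d["validated"], d["deployed"]) for d in deployments]
    restore_times = [hours(start, end) for start, end in incidents]
    failures = sum(1 for d in deployments if d["failed"])
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_h": median(lead_times),
        "mean_time_to_restore_h": mean(restore_times) if restore_times else 0.0,
        "change_failure_rate": failures / len(deployments),
    }


# Hypothetical one-week log: validation time, production time, degraded service or not.
deployments = [
    {"validated": datetime(2024, 5, 1, 9), "deployed": datetime(2024, 5, 1, 15), "failed": False},
    {"validated": datetime(2024, 5, 3, 10), "deployed": datetime(2024, 5, 4, 10), "failed": True},
    {"validated": datetime(2024, 5, 6, 8), "deployed": datetime(2024, 5, 6, 10), "failed": False},
    {"validated": datetime(2024, 5, 8, 14), "deployed": datetime(2024, 5, 8, 18), "failed": False},
]
# Hypothetical incident log: (start of user impact, service restored).
incidents = [(datetime(2024, 5, 4, 10), datetime(2024, 5, 4, 12))]

metrics = dora_metrics(deployments, incidents, period_days=7)
print(metrics["median_lead_time_h"])      # 5.0
print(metrics["mean_time_to_restore_h"])  # 2.0
print(metrics["change_failure_rate"])     # 0.25
```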

All these measures are based on the infrastructure's availability rather than its resilience. DORA report researchers subsequently posed a new question to organizations in 2021. This led to the introduction of a fifth metric:

  • "Operational Performance" or "Resilience". This evaluates the ability to meet or exceed resilience targets. The expected responses regarding resilience goals for this measure are: "often meets them", "meets them most of the time", "always exceeds them". This can be gauged, among other things, by SLOs or a user satisfaction rate.

If you are starting your initiative from scratch, comparing yourself to industry performance might not be relevant. Keep industry figures in mind to know what goals to aim for, but don't judge your success based on them. Gauge it based on the progression of your own measures over time. Everyone starts from an initial state with the aim to improve it.

The DORA 2022 report classified the surveyed organizations into three performance categories (low, medium, and high) for its four key measures.

GitLab even allows for real-time visualization of these metrics starting from version 12.3.

If you have a relatively recent version of GitLab or have set up continuous integration pipelines, you can measure most phenomena. Otherwise, ask your teams to record events on a collaborative interface (e.g., Google Sheets, Airtable or Atlassian Confluence).

Added to these measurements is one I call the "resilient collaboration trend". It captures the essence of a DevOps initiative in my view: succeeding in continuous innovation while maintaining low technical debt and providing the most available service possible. The following factors are multiplied together to obtain the resilient collaboration index (RCI):

  1. Number of days since the software's creation ;
  2. Number of contributors to the software since its creation: These first two factors determine the company's ability to maintain software that is maintainable over time, easy to grasp and modify. That is, its ability to maintain low technical debt ;
  3. Number of successful deployments in the quarter: This factor determines the company's ability to innovate regularly, from code writing to production ;
  4. Quarterly software availability in production (in percent): This factor determines the company's ability to provide a stable service to its users.

We then observe and compare the trend of this index over time. It is this trend that can be compared to other projects.
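As an illustration, the index and its trend can be computed as follows. The function names and the figures for the two quarters are invented for the example. Only the multiplication of the four factors comes from the definition above.

```python
def rci(days_since_creation: int, contributors: int,
        successful_deployments: int, availability_pct: float) -> float:
    """Resilient collaboration index: the product of the four factors."""
    return days_since_creation * contributors * successful_deployments * availability_pct


def trend_pct(previous: float, current: float) -> float:
    """Relative change of the index between two successive periods, in percent."""
    return (current - previous) / previous * 100


# Hypothetical project observed over two successive quarters.
q3 = rci(days_since_creation=1000, contributors=20, successful_deployments=30, availability_pct=99.5)
q4 = rci(days_since_creation=1090, contributors=22, successful_deployments=33, availability_pct=99.9)

print(q3)
print(q4)
print(f"{trend_pct(q3, q4):+.1f}% between Q3 and Q4")
```

The absolute value of the index depends heavily on the age and size of the project, which is why only the trend is meant to be compared.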

For example, the GitLab project - one of the largest collaborative open-source projects - displayed a resilient collaboration index that is 10.5% lower in Q3 than in Q4.

This index should be updated every quarter. This time interval can be shortened or extended depending on the maturity of your organization: the more confident you are in your ability to deploy regularly, the shorter your measurement interval can be. For instance, over a semester, a quarter, a month, or a week.

Unlike the SRE, who relies on specific measurements (e.g., "The 4 golden signals" or "Resilience indicators"), the DevOps lead has some freedom to choose the measurements that seem most relevant to them. That is, those that best assess the service they provide to internal teams. However, the modus vivendi between DevOps and SRE is the "deployment lead time": both strive to make this parameter as satisfactory as possible.

Integrated DevOps platform

Deploying simultaneously in different environments

Your organization is sometimes tasked with deploying software in environments as diverse as they are unique. If you're lucky, these environments are few and connected. But things get complicated when the number starts to grow and they're isolated. It becomes essential to find a standardized way to deploy updates while minimizing delays.

Built on Kubernetes, Apollo is the product used by Palantir to deploy and keep its services up to date across all its client bases. With hundreds of engineers, over 400 software products, and thousands of deployments every day, Palantir boasts deploying its services across a hundred different computing environments (e.g., AWS, GCP, Azure, classified private clouds disconnected from the internet or edge servers with intermittent connections...).

Driven by the constraint of regularly deploying on varied infrastructures, work on Apollo began in early 2015. It was progressively rolled out to its clients from 2017 and has been commercially available since the start of 2022, powering Palantir's internal infrastructure. The product's philosophy is to interface with your existing infrastructure and services (e.g., your software forge, continuous integration engine or artifact registry...).

The company operates under the belief that software engineers and SREs each have their areas of expertise. On one hand, software engineers have a better understanding of how and when the software they develop should be updated. On the other, SREs are more familiar with the specifics and constraints of the environments in which they deploy. Thus, software engineers develop the code, Apollo deploys it, and the SREs monitor to ensure everything went as planned.

That's why Apollo's interface primarily showcases two menus: "Environments" (which is SRE-oriented) and "Products" (which is developer-oriented).

  • The "Environments" menu allows connection to different environments, defining deployment strategies across multiple environments (e.g., AWS, Azure or on-premise) and channels, setting software quality and security criteria, and approving infrastructure changes;
  • The "Products" menu ensures that a software's new version is correctly deployed: Apollo automatically manages blue/green deployments and rollbacks. It enables the declaration of update strategies by specifying which service needs updating before another.

Connected to git repositories, it allows tracking and approving any code modifications before deployment.

Lastly, Apollo offers centralized monitoring of the status of services deployed across all your environments from a single platform. Whether connected to your favorite observability service (e.g., Datadog, Prometheus or PagerDuty) or operating independently via the Apollo Observability Platform, it incorporates feedback from all sorts of measurements to investigate incidents in detail.

Constraint-based deployment

With Apollo, Palantir introduces the concept of constraint-based continuous deployment. Apollo deploys an agent in each environment, reporting the real-time status of that environment to determine how updates should be deployed. This means Apollo knows both the expected state of deployments on infrastructure and the real-time, up-to-date state of these deployments.

Considering modern applications often rely on external services, this mechanism helps avoid incompatibilities between different versions of an app deployed across varied environments.

For instance, if application "foo" requires the service "bar" to be deployed, Apollo won't update "foo" until "bar" is available and deployed. Without such a mechanism, deploying a new version of an application that depends on a specific version of another is often managed manually, even if continuous deployment is in place: teams first ensure the dependency "bar" is available and deployed before deploying the new version of "foo". With Apollo, these dependencies are recorded in a specific file within the same project as the application's source code.
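The "foo" and "bar" example can be sketched in a few lines. This is not Apollo's actual format or API, only an illustration of the principle under simple assumptions: a declared set of dependencies (the expected state) is compared with what an agent reports from the environment (the observed state). The dictionary layout, the status values, and the function name are hypothetical.

```python
# Expected state: each application declares the services it requires.
# In practice such declarations would live in a file next to the source code;
# this dictionary layout is hypothetical.
constraints = {
    "foo": {"requires": ["bar"]},
    "bar": {"requires": []},
}


def can_deploy(app, constraints, observed):
    """Return True only if every service required by `app` is reported
    as deployed in the environment's observed state."""
    return all(
        observed.get(dependency) == "deployed"
        for dependency in constraints[app]["requires"]
    )


# Observed state, as an agent might report it for one environment.
observed = {"bar": "pending"}
print(can_deploy("foo", constraints, observed))  # False: "bar" is not ready yet

observed["bar"] = "deployed"
print(can_deploy("foo", constraints, observed))  # True: "foo" may now be updated
```

Because the check is evaluated per environment, the same release of "foo" can be held back in one environment and rolled out in another, depending on what each one reports.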

Another example is database schema migration. By declaring a database schema version compatible with a specific application version, Apollo prevents deploying an app incompatible with a database yet to be updated.
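The same reasoning applies to schema versions. The sketch below assumes a hypothetical declaration that maps each application version to the database schema versions it supports; the version numbers and names are invented for the illustration and do not reflect Apollo's own syntax.

```python
# Hypothetical declaration: schema versions supported by each application version.
compatible_schemas = {
    "1.4.0": {3, 4},
    "2.0.0": {5},
}


def is_release_allowed(app_version, current_schema, compatible_schemas):
    """Allow a release only if the database schema currently deployed
    is one the application version declares itself compatible with."""
    return current_schema in compatible_schemas.get(app_version, set())


# The database is still at schema version 4: the new release must wait.
print(is_release_allowed("2.0.0", 4, compatible_schemas))  # False

# Once the migration to schema version 5 has run, the release can proceed.
print(is_release_allowed("2.0.0", 5, compatible_schemas))  # True
```

An unknown application version is refused by default, which is the cautious choice when the compatibility declaration is missing.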

Distribution of initiatives

In 2016 surveys, 47% of organizations claimed to adopt a DevOps approach. This number rose to 74% in 2021.

From 2019 to 2022, the distribution of DevOps initiatives by industry remained roughly the same: primarily dominated by the tech sector (at around 40%), followed by the financial sector (at around 12%) and e-commerce (at around 8%). The institutional sector accounted for 2% to 4% of these initiatives, indicating ample room for innovation in this domain.

Here's a breakdown of companies practicing DevOps in 2022: large companies account for around 30%, medium-sized companies for around 38%, small companies for around 26%, and very small ones for around 6%.

The 2019 crisis accelerated digital transformation initiatives, leading to a 23% growth in DevOps team sizes during that period.

In 2022, the geographical distribution of organizations adopting DevOps practices is still challenging to pinpoint. However, North America seems to be a major hub, accounting for about 33% of DevOps initiatives. Europe and Asia follow closely with approximately 33% each (and India at 21%). In 2019, North America accounted for 50% of these initiatives, Europe 29%, and Asia 9%. This indicates a growing interest in the subject among Asian countries.

DevOps teams remain small, averaging around 8 members.

This positions DevOps as a methodology primarily adopted by companies that have reached a critical mass, and one that has yet to gain traction in non-tech businesses.

Conclusion

Transforming an organization, regardless of its size, is a complex task involving significant political, technical, and human challenges. Should this transformation fail, the consequences can be severe. At the same time, it's crucial for your organization to consider the long-term implications of continuing with its current model. DevOps aims to minimize these risks through standardized methodologies and tools.

Research and the experience of thousands of businesses today allow us to understand the challenges related to transitioning organizations to the Cloud. Now that DevOps has proven its effectiveness, institutions are gradually shifting their focus to it, although few have fully embraced it yet. One major hurdle remains sourcing talent in this area, but the foremost challenge is persuading the leadership.

Several strategies can be adopted depending on your hierarchical and technical position. The most common is to start with a pilot project that addresses internal needs (e.g., deploying software co-developed with your business teams). This can attract initial internal partners.

Provide services promptly to demonstrate the efficiency of your approach compared to traditional methods (e.g., software better suited to needs, streamlined deployment or quick response to incidents...). Once the early adopters are convinced, have them testify during your presentations to decision-makers. Business teams often agree to do this, feeling indebted for the services you've provided. With such a powerful impact, you can gradually rally a community to elevate your vision.

Facilitating change is primarily about minimizing the risks taken. Starting small and iterating is the best approach to success. Moreover, by understanding the psychological and technical realities behind a transformation project, you'll have all the tools and arguments for a quicker and less perilous transition. Presenting Cloud technologies and DevOps as evolutionary rather than disruptive techniques is an effective way to persuade.

Like major corporations that constantly invest in new technologies, every organization must be willing to take risks to remain competitive. Your executive committee should remain open to surprising perspectives and encourage experimentation.

For instance, it's vital not to underestimate the potential of employees deemed challenging to manage. Some might be the visionaries that will define your future. Seriously considering the impact of their ideas is essential, lest you miss critical opportunities for the organization's future.

While the business teams you assist see immediate benefits, this value is often more abstract for the leadership. As the instigator of a transformation, you need to invest time in familiarizing organizational decision-makers with your approach. Don't hesitate to start with basic Cloud concepts and gradually clarify the implications of DevOps for stakeholders. It's crucial to provide examples of how you've addressed internal dysfunctions with your approach.

The initiator should always be prepared to answer the following questions:

  1. Why do we need to change? Provide specific examples of dysfunctions within the organization.
  2. What's the benefit of this approach for our organization and my mandate? Quantify the amount of time or money this approach could save. Also explain how the image of the decision-maker could be enhanced by your project.
  3. What will this transformation cost us and what is its ROI? Quantify the investment required for this transformation. Also present your transformation plan: training schedule, contracting plan, equipment purchasing plan.
  4. What does the rest of the organization think? List the strengths and weaknesses of the approach. This requires having consulted internal teams for their perspectives.

With the leadership convinced and granting you both technical and political resources, the journey is only beginning! Don't advertise capabilities you don't yet master. Start by providing access only to a subset of willing users and establish your procedures.

Your initiative will inevitably face challenges initially. Welcome feedback graciously and enhance your services. Once confident in the service reliability, expand its deployment and communicate extensively.

You'll soon notice that operational or business priorities often sideline infrastructure work in favor of product developments. Yet, research shows that structuring around these proven methods enhances long-term efficiency. Ensure you allocate time for resilience work in your engineers' schedules.

A DevOps infrastructure realizes its full potential once connected to your organization's main network. This is when it can deploy frequent updates, respond quickly to incidents, and consolidate your teams' work. If your project began on an isolated platform, focus now on connecting where your users are present.

Measuring the effectiveness of one's initiative over time is critical: both to ensure that one is moving in the right direction without dogmatism, and to provide quantifiable arguments to superiors or teams that still need convincing. Make sure to maintain a clear dashboard of these indicators.

LLM-based tools such as ChatGPT offer as many new opportunities (e.g., GitLab Duo or GitHub Copilot) as they introduce new threats. Concurrently, security standards will continue to evolve at a breakneck pace. This advocates for a transformation of organizations towards a more agile digital universe. The future is being shaped today, and the companies that will succeed best are those that manage to leverage the latest technologies and integrate them into their software development cycle.

Beyond the speed at which technology evolves and as with any area of expertise, this type of infrastructure requires the maintenance of the skills necessary to administer it.

We can easily imagine that a fighter pilot maintains his or her flying skills. Why would it be any different for engineers who maintain critical software vital to the institution's operation? You and your teams must continue to stay ahead by training regularly.

In DevOps mode, organizations can afford to fail faster, with controlled risk, to innovate ahead of their competitors.

"Ops" Terminology

Now that you understand the array of challenges in DevOps, it's insightful to explore some terms that one might come across in the field.

You've probably already heard numerous terms suffixed with "Ops": in industrial proposals, job offers, or online services. All these terms describe specialties in computer system operations using various techniques and methodologies. Let's define a few:

  • Dev-Ops (abbreviation for Development and Operations) is a methodology aimed at bringing developers and engineers managing production together to accelerate software release and resilience.
  • Dev-Sec-Ops (abbreviation for Development, Security, and Operations) is a subset of DevOps focusing on integrating security principles from the onset of a new software or infrastructure design. The goal is to organize the company in such a way that the Security of I-T Systems teams are involved in all project discussions with your development teams.
  • I-T-Ops is a set of practices centered on the maintenance and management of I-T systems. This is subtly distinct from DevOps, which concentrates more on improving the software development and deployment process. The term is synonymous with system administration.
  • Fin-Ops (abbreviation for Financial Operations) is a collection of practices to better understand and manage the financial costs of a cloud infrastructure. This includes monitoring and optimizing expenses, as well as managing billing and payments, possibly using dashboards or automated algorithms.
  • ML-Ops (abbreviation for Machine Learning Operations) is a set of practices for collaboration and communication between data science teams and production teams for effective development and deployment of machine learning models. The aim is to enhance the speed, quality, and resilience of ML models by automating and standardizing the technologies used.
  • Git-Ops is a set of guidelines for using git as the single source of truth to standardize development and deployment practices, and to bolster a company's I-T resilience. In other words, it covers IaC, CI/CD and all the tools that help maintain the lifecycle of modern software.
  • Emp-Ops (abbreviation for Employee Operations) is a set of tools to manage a company and its employees (such as projects, vacations, one-to-one interviews or a knowledge base) on a unified platform (e.g., CRMs).
  • Data-Ops (abbreviation for Data Operations) is a set of practices that assist in managing data, considering it a strategic asset. They emphasize collaboration between "data" teams and other I-T teams, automating data management processes, and regular feedback to ensure data meets business needs.
  • Dev-Data-Ops (abbreviation for Development and Data operations) is a variation of DataOps tailored for organizations adopting a DevOps approach for their software developments. In a DevDataOps approach, data management practices are integrated into the software development lifecycle, facilitating coordinated and efficient management of data and code.
  • Edge-Ops (abbreviation for Edge Computing Operations): Edge computing is a decentralized I-T architecture model where data management or transformation occurs close to where it's collected or generated. Unlike the traditional approach, where data is processed only on a remote server, this optimizes network bandwidth. Edge-Ops incorporates certain DevOps principles into this infrastructure (e.g., zero trust or air-gapped monitoring).
  • Chat-Ops (abbreviation for Chat Operations) is a domain advocating for the use of instant messaging tools to facilitate software development and maintenance. The idea is to quickly and easily converse with peers (e.g., easy-access messaging, file or image import capabilities or visibility of time flows...).
  • Live-Ops (abbreviation for Live Game Operations): Refers to all activities ensuring the smooth operation and maintaining excitement around a video game. Informally, it's about "keeping the hype" for the game. Activities include: monitoring player count, playtime or reviews, fostering customer engagement, organizing tournaments, and providing player support.

The emergence of these terms denoting specialties or practices in I-T infrastructure administration is likely tied to the maturity the industry has achieved thanks to Cloud services. These services have greatly streamlined infrastructure administration, paving the way for advanced optimization discussions.

Each of these specialties is a way to optimize your DevOps practices and should adapt to the maturity of the company. Don't rush to implement all of them before you've thoroughly understood and practiced DevOps in your organization.