Skip to content

About Alibaba cluster and why we open the data

Zhen Zhang edited this page Sep 6, 2017 · 1 revision

With the Internet popularization of the past 20 more years, especially with the Mobile Internet and Internet+ wave of the recent 10 more years, the Internet technology penetrated into all walks of life, influenced all aspects of people's lives. This brought the the substantial increase of Internet service scale and the data size. Increased service scale and data size bring about a dramatic expansion of the data center. In the large-scale data center, the traditional operation and maintenance approach can not meet the needs of large-scale data center, thus many cluster management systems based on automated scheduling have emerged.

These systems often have a common goal, that is to improve the data center’s machine utilization. For the huge amount of data center machines, it will bring a very considerable cost savings even with a small increase of the average utilization rate. We can intuitively feel this point through a simple calculation. Assuming that a data center has N servers, with the utilization rate rising from R1 to R2, how many machines will be saved? if the number is X, without considering other practical constraints in this case, we have the ideal equation

   N*R1 = (N-X)*R2   
=> X*R2 = N*R2 – N*R1
=>    X = N*(R2-R1)/R2

Suppose we have 100,000 servers, the utilization rate rise from 28% to 40%, then according to the above formula we have:

N = 100000,  
R1 = 28%, 
R2 = 40%  
X= 100000* (40%-28%)/40% = 30000

That means with 100,000 servers, utilization rate rise from 28% to 40%, we can save 30,000 machines. Assuming one machine cost of 5,000$, then the total cost savings will be 150,000,000$

But unfortunately, according to the research data of Geithner and McKinsey several years ago, the global server utilization is very low, only 6% to 12%. Even if optimized by virtualization technology, the utilization rate is still only 7% -17%; This is the biggest problem that the traditional operation and coarse-grained use of resources bring to us. The main goal of the scheduler and cluster management system is to resolve this problem.

Through the fine grained scheduling of resources, as well as the virtualization practice, such as virtual machine and container technology, we can consolidate different service instances together to form deployment of higher density. This can effectively improve resource utilization. But this model may have a significant impact on the online business application. Because of the sharing of resources between online services, high-density deployment can bring serious contention at all levels of resources, as a result, it will increasing the latency of online services, especially the delay of long tail requests.

For the online business, the increase in the response time often impact immediately to the loss of the user and the decline in revenue, which is unacceptable. In recent years, with the popularity of the Large Data technology and application, the scale of batch jobs which do not need real-time response is growing. The amount of machines used by the batch jobs has gradually caught up or even surpassed that of the online services. So it is natural to consider that how about to deploy batch jobs and online services together? Is it possible to take full advantage of machine resources without compromising the response time of online services at the expense of some delays of batch job?

Alibaba made this attempt since 2015. Prior to this, there were two resource scheduling systems in Alibaba, for batch jobs and online services respectively. The Fuxi systems which build since 2010 is for batch jobs, and is process-based; The Sigma system which build since 2011 is for online services, and is based on Pouch container. From the middle of 2015, we tried to deploy latency insensitive batch computing jobs and latency critical online services on same machines, and made the resources which online services can not fully used dynamically, to be fully used by batch jobs, thereby to improve the overall utilization of the machines.

sigma arch

This approach has been practiced for more than two years, with architecture adjustment and resource isolation optimization, and has now deployed to large-scale production environment. The mix-running cluster now served the core trade business applications and large data computing batch jobs. The average resource utilization rate of the cluster increased from about 10% to more than 40%, and at the same time guaranteed the SLO goal of online service.

We know that there is significant development recent years in academic research to address specific issues in resource scheduling and cluster management. But there is a big difference between the academic research and the actual real production environment. First, the size of the machines used for academic research is relatively small and may not be able to expose the actual problems in large production scale. Second, the data used in academic research are often not produced by the actual production environment, and may be impact the accuracy and comprehensiveness of the study.

So we hope to share the data of this core mix-running cluster from Alibaba internal to help academic research. We hoped that the academic community can use this relatively large scale and real production environment data to find better models or approaches for resource scheduling and cluster management, and the data can guide the optimizion of the actual production scenario and improve the machine utilization and service quality to a higher level.

We’ll first open a data of 1000 server over 12 hours. The data format description and data download link can be found in the github

Any questions or suggestions can be send to us by email: alibaba-clusterdata@list.alibaba-inc.com

Clone this wiki locally