Catla is a self-tuning system for Hadoop parameters, designed to improve the performance of MapReduce jobs on Hadoop clusters. It is template-driven, which makes it flexible enough to handle complicated job execution, monitoring, and self-tuning of MapReduce performance.
- Task Runner: Submits a single MapReduce job to a Hadoop cluster and retrieves its analysis results and logs after the job completes.
- Project Runner: Submits a group of MapReduce jobs organized in a project folder and monitors their status until completion. All analysis results and logs, which include the running time of every MapReduce phase, are then downloaded to a specified location in the project folder.
- Optimizer Runner: Creates a series of MapReduce jobs with different combinations of parameter values, as defined in parameter configuration files, and reports the parameter values with the lowest time cost once the tuning process finishes. Two tuning methods are supported: exhaustive search and derivative-free optimization (DFO).
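
The exhaustive search idea can be illustrated with a minimal Java sketch. It is not Catla's actual implementation: the class name, the helper methods, the candidate values, and the `fakeRunningTime` cost function are all hypothetical. The cost function stands in for submitting a job with a given combination and measuring its running time. The two parameter names are standard Hadoop 2.x settings chosen only as examples.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of exhaustive search; not Catla's actual implementation.
class ExhaustiveSearchSketch {

    // Builds every combination (Cartesian product) of the candidate values.
    static List<Map<String, Integer>> combinations(Map<String, int[]> candidates) {
        List<Map<String, Integer>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>());
        for (Map.Entry<String, int[]> entry : candidates.entrySet()) {
            List<Map<String, Integer>> next = new ArrayList<>();
            for (Map<String, Integer> partial : result) {
                for (int value : entry.getValue()) {
                    Map<String, Integer> extended = new LinkedHashMap<>(partial);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    // Returns the combination with the lowest cost. The 'cost' function stands in
    // for running a job with these settings and measuring its running time.
    static Map<String, Integer> best(Map<String, int[]> candidates,
                                     ToDoubleFunction<Map<String, Integer>> cost) {
        Map<String, Integer> bestCombo = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Map<String, Integer> combo : combinations(candidates)) {
            double c = cost.applyAsDouble(combo);
            if (c < bestCost) {
                bestCost = c;
                bestCombo = combo;
            }
        }
        return bestCombo;
    }

    // Hypothetical stand-in for a measured running time in seconds.
    static double fakeRunningTime(Map<String, Integer> p) {
        int reduces = p.get("mapreduce.job.reduces");
        int sortMb = p.get("mapreduce.task.io.sort.mb");
        return 60 + Math.abs(reduces - 4) * 5 + Math.abs(sortMb - 200) / 10.0;
    }

    public static void main(String[] args) {
        Map<String, int[]> candidates = new LinkedHashMap<>();
        candidates.put("mapreduce.job.reduces", new int[] {1, 2, 4, 8});
        candidates.put("mapreduce.task.io.sort.mb", new int[] {100, 200, 300});
        // Evaluates all 4 x 3 = 12 combinations and prints the cheapest one.
        System.out.println(best(candidates, ExhaustiveSearchSketch::fakeRunningTime));
    }
}
```

The number of jobs grows multiplicatively with each added parameter, which is why a derivative-free optimizer that samples only some combinations can be attractive for larger search spaces.
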
- Run Catla on a Windows computer on the same network as the Hadoop cluster, so that Catla can reach the master host over the network.
- A standard Java environment must be properly installed on that computer.
- Hadoop must have YARN log aggregation enabled by setting 'yarn.log-aggregation-enable' to true.
- You must know the master host's connection details, such as the username, password, and SSH port, because Catla needs them to run MapReduce jobs.
- Before running any of the examples, update the master host's information in the env_* files in the example folder.
- On the master host (Ubuntu), use the 'sudo mkdir' command to create a new folder /usr/hadoop_apps, and change the folder's permissions so that every user can access it.
- This project is built on Hadoop 2.7.2, so it may work with all Hadoop 2.x.x versions.
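
For the log aggregation prerequisite, the property is normally set in the cluster's yarn-site.xml. The Java sketch below embeds a minimal fragment of that file as a string and reads the property back, which can serve as a quick sanity check on a copy of your own configuration. The class and method names are hypothetical and are not part of Catla. The parsing assumes every `<property>` element has both a `<name>` and a `<value>` child.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for checking a Hadoop-style configuration file.
class LogAggregationCheck {

    // Returns the value of the named <property>, or null if it is not present.
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String key = property.getElementsByTagName("name").item(0)
                        .getTextContent().trim();
                if (key.equals(name)) {
                    return property.getElementsByTagName("value").item(0)
                            .getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Minimal yarn-site.xml fragment with log aggregation enabled.
        String yarnSite =
              "<configuration>"
            + "<property>"
            + "<name>yarn.log-aggregation-enable</name>"
            + "<value>true</value>"
            + "</property>"
            + "</configuration>";
        String value = propertyValue(yarnSite, "yarn.log-aggregation-enable");
        System.out.println("yarn.log-aggregation-enable = " + value);
    }
}
```

To check a real cluster, replace the embedded string with the contents of the yarn-site.xml file copied from the master host.
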
- Copy Catla.jar from '/catla-dist' in the GitHub repo to the 'examples' folder, so that the example folders and Catla.jar are in the same folder.
- Edit the master host's information in the file 'HadoopEnv.txt' to match your actual Hadoop cluster: the master's IP, username, password, and port, the Hadoop bin path, and the root folder of the App (the same folder as in item 6 of Prerequisites).
- Open a Windows Command Prompt and change the current directory to the '/examples' folder using the 'cd' command.
- Run the following Java command: 'java -jar Catla.jar -tool task -dir task_wordcount'.
- After the job finishes, a new 'downloaded_results' folder should appear inside the 'task_wordcount' folder; it stores the analysis results of the WordCount MapReduce job.
- The steps above are a simple demonstration. Advanced example?
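
If you prefer to script the run instead of typing the command, the steps above can be sketched in Java with `ProcessBuilder`. This is only an illustration: the class name is hypothetical, the 'examples' path is assumed to be relative to the current directory, and the only Catla arguments used are the ones shown in the command above. How Catla sets its exit code is not documented here, so the sketch prints the code without interpreting it.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher for the WordCount task example.
class RunTaskExample {

    // Builds the same command line as the one typed in the Command Prompt.
    static List<String> command(String tool, String dir) {
        return Arrays.asList("java", "-jar", "Catla.jar", "-tool", tool, "-dir", dir);
    }

    public static void main(String[] args) throws Exception {
        // Assumed location of the examples folder; pass another path as the first argument.
        File examples = new File(args.length > 0 ? args[0] : "examples");
        if (!new File(examples, "Catla.jar").isFile()) {
            System.out.println("Catla.jar not found in " + examples.getPath() + "; nothing to run.");
            return;
        }
        Process process = new ProcessBuilder(command("task", "task_wordcount"))
                .directory(examples)   // run from the folder that holds Catla.jar
                .inheritIO()           // show Catla's console output directly
                .start();
        int exitCode = process.waitFor();
        File results = new File(new File(examples, "task_wordcount"), "downloaded_results");
        System.out.println("exit code: " + exitCode
                + ", results folder present: " + results.isDirectory());
    }
}
```
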
Fig. 2 Three-dimensional surface plot of running time of a MapReduce job over two Hadoop configuration parameters using the exhaustive search method

Fig. 3 Change of running time of a MapReduce job over number of iterations when tuning using a BOBYQA optimizer
This project builds on Apache Hadoop, Apache Commons Math3, and Apache MINA SSHD, which are distributed under the Apache License, Version 2.0.
See the LICENSE file for license rights and limitations (GNU GPLv3).