Apache Hadoop Use case
CaT's repository contains the folder experiments/cat-hdfs with the necessary scripts to install and run the experiments performed with the Apache Hadoop application.
Next, we detail how to install HDFS and the benchmark BigDataBench, and how to use CaT to trace the activity of the benchmark.
- Update the variables in the file "conf-files/vars.sh" and execute the script before running any other script.
- Run the command "installation/utils.sh all" to install some required packages.
- In "conf-files/hosts", add the hostname and IP address of each machine that will host the NameNode or a DataNode.
cloud124 192.168.112.124 <-namenode
cloud125 192.168.112.125 <-datanode1
cloud126 192.168.112.126 <-datanode2
cloud128 192.168.112.128 <-datanode3
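Checking the entries before installing can catch typos early. The sketch below assumes each line of "conf-files/hosts" holds a hostname followed by an IPv4 address, separated by whitespace, and that the "<-namenode" style annotations above are explanatory rather than part of the file. `check_hosts` is a hypothetical helper, not part of CaT.

```shell
# Hypothetical helper: report entries whose second field is not an IPv4-like address.
# $1: path to the hosts file (assumed format: "<hostname> <ip>" per line).
check_hosts() {
    awk '
        NF == 0 { next }   # skip blank lines
        $2 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            printf "line %d: invalid or missing IP for %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }   # non-zero status if any entry was rejected
    ' "$1"
}
```

For example, `check_hosts conf-files/hosts` prints nothing and returns 0 when every entry looks well formed. The check is purely syntactic and does not verify that the machines are reachable.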
- Go to the installation directory:
cd installation
- For installing HDFS run the following commands:
- For the DataNodes and Client:
./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files
- For the NameNode:
./hadoop-distributed-setup.sh generate_cluster_ssh_key
./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files
./hadoop-distributed-setup.sh format_namenode
./hadoop-distributed-setup.sh start_dfs
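The steps above can also be run in sequence from a small wrapper that stops at the first failure. This is a minimal sketch: it assumes "hadoop-distributed-setup.sh" returns a non-zero exit status when a step fails, which the source does not state. `run_setup_steps` and the `SETUP_SCRIPT` variable are hypothetical names introduced here.

```shell
# Steps for the NameNode, in the order listed above.
NAMENODE_STEPS="generate_cluster_ssh_key download_hadoop_source set_hadoop_environment set_site_files format_namenode start_dfs"

# Run each given step in turn and stop at the first one that fails.
# SETUP_SCRIPT (hypothetical) defaults to the script in the current directory.
run_setup_steps() {
    for step in "$@"; do
        echo ">> $step"
        "${SETUP_SCRIPT:-./hadoop-distributed-setup.sh}" "$step" || {
            echo "step '$step' failed, aborting" >&2
            return 1
        }
    done
}
```

On the NameNode, from the installation directory, this would be invoked as `run_setup_steps $NAMENODE_STEPS` (unquoted so the list splits into separate steps). For the DataNodes and the client, pass only the three steps listed for them.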
- Go to the installation directory:
cd installation
- For installing BigDataBench run the following command:
./install-bigdatabench.sh
The script "bdb-bayes-run.sh" runs the Naive Bayes algorithm (a classification algorithm used in data mining) with the Amazon movie review dataset.
Note: Do not forget to update the file "whitelist.txt" with the correct path to the BigDataBench folder.
- The first argument is the name of the function to run: "run_bdb"
- The second argument is the size of the data to generate (in GiB)
- The third argument is the number of runs to execute
- The fourth argument is the deployment:
  - vanilla: for training without tracing
  - catbpf: for tracing the training with the CatBpf tracer
  - catstrace: for tracing the training with the CatStrace tracer
./bdb-bayes-run.sh run_bdb 1 1 catbpf <- 1 GiB, 1 run, tracing with catbpf
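Because a full run can take a long time, it may help to check the four arguments before launching the benchmark. The sketch below assumes the size and the number of runs are positive whole numbers, since the source does not say whether fractional sizes are accepted. `validate_bdb_args` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical pre-flight check for the four arguments described above.
# Returns 0 if they look valid, 1 otherwise (with a message on stderr).
validate_bdb_args() {
    if [ "$#" -ne 4 ]; then
        echo "usage: <function> <size-GiB> <runs> <deployment>" >&2
        return 1
    fi
    [ "$1" = "run_bdb" ] || { echo "unknown function: $1" >&2; return 1; }
    # Size and run count: assumed to be positive integers.
    for n in "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*|0) echo "not a positive integer: $n" >&2; return 1 ;;
        esac
    done
    case "$4" in
        vanilla|catbpf|catstrace) ;;
        *) echo "unknown deployment: $4" >&2; return 1 ;;
    esac
}
```

One way to use it is `validate_bdb_args run_bdb 1 1 catbpf && ./bdb-bayes-run.sh run_bdb 1 1 catbpf`, so the benchmark only starts when the arguments pass the check.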
The results are saved to the path specified by the variable $RESULTS_PATH (on each machine).
$ tree bdb-bayes-results (results on the client machine)
bdb-bayes-results/
└── 1G
    ├── catbpf-genData-1G-client-1.txt <- catbpf log for the gen phase
    ├── catbpf-run-1G-client-1.txt <- catbpf log for the running phase
    ├── dstat-genData-1G-client-1.csv <- dstat output for the gen phase
    ├── dstat-run-1G-client-1.csv <- dstat output for the running phase
    ├── time-genData-1G-1 <- time info for the gen phase
    ├── time-run-1G-1 <- time info for the running phase
    ├── trace-genData-1G-client-1.json <- catbpf trace for the gen phase
    └── trace-run-1G-client-1.json <- catbpf trace for the running phase
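To confirm that a catbpf run produced all of its outputs on the client, the file names can be derived from the size and run number. The naming pattern in this sketch is inferred only from the listing above. Other deployments and the other machines may use different names. Both functions are hypothetical helpers.

```shell
# Print the file names expected on the client for one catbpf run.
# $1: data size in GiB, $2: run number (pattern inferred from the listing above).
expected_client_files() {
    size="$1"; run="$2"
    for phase in genData run; do
        echo "catbpf-${phase}-${size}G-client-${run}.txt"
        echo "dstat-${phase}-${size}G-client-${run}.csv"
        echo "time-${phase}-${size}G-${run}"
        echo "trace-${phase}-${size}G-client-${run}.json"
    done
}

# Print the expected files that are missing under a results directory.
# $1: results directory, $2: data size in GiB, $3: run number.
missing_client_files() {
    expected_client_files "$2" "$3" | while read -r f; do
        [ -e "$1/${2}G/$f" ] || echo "$f"
    done
}
```

For the run shown above, `missing_client_files bdb-bayes-results 1 1` prints nothing when all eight files are present.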
The CaT prototype was used to intercept network and disk I/O calls across the HDFS client, the NameNode, and the DataNodes. The setup included 1 client, 1 NameNode, and 3 DataNodes. Experiments were executed three times for each deployment, with dataset sizes of 16 GiB and 32 GiB.
./bdb-bayes-run.sh run_bdb 16 3 vanilla
./bdb-bayes-run.sh run_bdb 32 3 vanilla
./bdb-bayes-run.sh run_bdb 16 3 catbpf
./bdb-bayes-run.sh run_bdb 32 3 catbpf
./bdb-bayes-run.sh run_bdb 16 3 catstrace
./bdb-bayes-run.sh run_bdb 32 3 catstrace
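The six commands above form a small experiment matrix (three deployments, two dataset sizes, three runs each), which can also be generated with a loop. This is a minimal sketch, and `print_experiments` is a hypothetical helper. It only prints the commands, so they can be reviewed before anything is executed.

```shell
# Print one command per (deployment, size) pair, in the same order as above.
print_experiments() {
    for deployment in vanilla catbpf catstrace; do
        for size in 16 32; do
            echo "./bdb-bayes-run.sh run_bdb $size 3 $deployment"
        done
    done
}
```

Piping the output to a shell (`print_experiments | sh`) from the directory that contains "bdb-bayes-run.sh" would run the experiments one after another.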