Apache Hadoop Use case

Installation and evaluation steps

CaT's repository contains the folder experiments/cat-hdfs with the necessary scripts to install and run the experiments performed with the Apache Hadoop application.

Next, we detail how to install HDFS and the benchmark BigDataBench, and how to use CaT to trace the activity of the benchmark.

Setup environment

Update the variables in the file "conf-files/vars.sh" and execute that script before running any other script.
Run the command "installation/utils.sh all" to install the required packages.

Install HDFS

In "conf-files/hosts", add the hostname and IP address of each machine that will host the NameNode or a DataNode.

Example:

cloud124 192.168.112.124               <-namenode  
cloud125 192.168.112.125               <-datanode1
cloud126 192.168.112.126               <-datanode2
cloud128 192.168.112.128               <-datanode3
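
Before running the installation scripts, it can help to sanity-check the hosts file. The following is a minimal sketch, assuming one "hostname IP" pair per line as in the example above; check_hosts_file is a hypothetical helper and is not part of CaT's scripts.

```shell
# Hypothetical helper (not part of CaT): count the entries in a hosts file
# and fail on any line whose second column does not look like an IPv4 address.
check_hosts_file() (
    [ -r "$1" ] || { echo "cannot read '$1'" >&2; exit 1; }
    count=0
    lineno=0
    while read -r name ip _rest || [ -n "$name" ]; do
        lineno=$((lineno + 1))
        [ -z "$name" ] && continue              # skip blank lines
        case "$ip" in
            ''|*[!0-9.]*|*.*.*.*.*) ok=no ;;    # empty, stray characters, or too many dots
            *.*.*.*)                ok=yes ;;   # loose dotted-quad check only
            *)                      ok=no ;;
        esac
        if [ "$ok" = no ]; then
            echo "line $lineno: bad or missing IP for '$name'" >&2
            exit 1
        fi
        count=$((count + 1))
    done < "$1"
    echo "$count"                               # number of valid entries
)
```

Calling check_hosts_file conf-files/hosts prints the number of entries found. Anything after the second column on a line is ignored, and the check does not verify that the octets are in range or that the hosts are reachable.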

Go to the installation directory: cd installation
To install HDFS, run the following commands:

For the DataNodes and Client:

./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files

For the NameNode:

./hadoop-distributed-setup.sh generate_cluster_ssh_key
./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files
./hadoop-distributed-setup.sh format_namenode
./hadoop-distributed-setup.sh start_dfs
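
Since each step depends on the previous one, the sequences above can be wrapped so that execution stops at the first failure. This is a sketch only: run_setup_steps is a hypothetical wrapper, and it assumes each step of hadoop-distributed-setup.sh returns a non-zero exit status on failure, which the source does not state.

```shell
# Hypothetical wrapper (not part of CaT): run the given steps of a setup
# script in order and stop at the first step that fails.
# Usage: run_setup_steps <script> <step>...
run_setup_steps() (
    script="$1"
    shift
    for step in "$@"; do
        echo "==> $step"
        "$script" "$step" || { echo "step '$step' failed" >&2; exit 1; }
    done
)

# DataNodes and client:
#   run_setup_steps ./hadoop-distributed-setup.sh \
#       download_hadoop_source set_hadoop_environment set_site_files
# NameNode:
#   run_setup_steps ./hadoop-distributed-setup.sh \
#       generate_cluster_ssh_key download_hadoop_source set_hadoop_environment \
#       set_site_files format_namenode start_dfs
```

After start_dfs returns on the NameNode, the standard Hadoop command hdfs dfsadmin -report (assuming hdfs is on the PATH) should list the live DataNodes.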

Install BigDataBench

Go to the installation directory: cd installation
To install BigDataBench, run the following command:

./install-bigdatabench.sh

How to run

Script "tools/bdb-bayes-run.sh"

This script allows running the Naive Bayes algorithm (a classification algorithm used in data mining) with the Amazon movie review dataset.

Note: Do not forget to update the file "whitelist.txt" with the correct path to the BigDataBench folder.

Arguments:

First argument is the name of the function to run: "run_bdb"
Second argument is the size of the data to generate (in GiB)
Third argument is the number of runs to execute
Fourth argument is the deployment:
- vanilla for training without tracing
- catbpf for tracing the training with the CatBpf tracer
- catstrace for tracing the training with the CatStrace tracer
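
The argument rules above can be checked before launching a long run. The sketch below is a hypothetical pre-flight check, not part of bdb-bayes-run.sh; it assumes the size and the number of runs are positive integers, as in the examples in this page.

```shell
# Hypothetical pre-flight check (not part of bdb-bayes-run.sh).
# Usage: validate_bdb_args <function> <size in GiB> <runs> <deployment>
validate_bdb_args() (
    [ "$#" -eq 4 ] || { echo "usage: <function> <size GiB> <runs> <deployment>" >&2; exit 1; }
    [ "$1" = run_bdb ] || { echo "unknown function: $1" >&2; exit 1; }
    # Assumption: size and run count are positive integers.
    for n in "$2" "$3"; do
        case "$n" in
            ''|*[!0-9]*|0) echo "expected a positive integer, got '$n'" >&2; exit 1 ;;
        esac
    done
    case "$4" in
        vanilla|catbpf|catstrace) ;;
        *) echo "unknown deployment: $4" >&2; exit 1 ;;
    esac
)
```

For instance, validate_bdb_args run_bdb 1 1 catbpf && ./bdb-bayes-run.sh run_bdb 1 1 catbpf only starts the benchmark when the four arguments are well formed.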

Example:

./bdb-bayes-run.sh run_bdb 1 1 catbpf            <- 1 GiB, 1 run, tracing with catbpf

Results:

The results are saved to the path specified by the variable $RESULTS_PATH (on each machine).

$ tree bdb-bayes-results (results on the client machine)
bdb-bayes-results/
└── 1G
    ├── catbpf-genData-1G-client-1.txt                         <- catbpf log for the gen phase
    ├── catbpf-run-1G-client-1.txt                             <- catbpf log for the running phase
    ├── dstat-genData-1G-client-1.csv                          <- dstat output for the gen phase
    ├── dstat-run-1G-client-1.csv                              <- dstat output for the running phase
    ├── time-genData-1G-1                                      <- time info for the gen phase 
    ├── time-run-1G-1                                          <- time info for the running phase 
    ├── trace-genData-1G-client-1.json                         <- catbpf trace for the gen phase
    └── trace-run-1G-client-1.json                             <- catbpf trace for the running phase
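
To confirm that a run produced every file shown in the tree above, the file names can be generated from the data size and run number and compared against the results folder. This sketch covers only the catbpf layout on the client, as listed above; the names used on other machines or for other deployments are not documented here, and both helpers are hypothetical.

```shell
# Hypothetical helper (not part of CaT): print the file names the tree above
# shows for one catbpf run on the client.
# Usage: expected_client_files <size in GiB> <run number>
expected_client_files() (
    size="$1"
    run="$2"
    for phase in genData run; do
        echo "catbpf-$phase-${size}G-client-$run.txt"
        echo "dstat-$phase-${size}G-client-$run.csv"
        echo "time-$phase-${size}G-$run"
        echo "trace-$phase-${size}G-client-$run.json"
    done
)

# Print the expected files that are missing under <results dir>/<size>G.
# Usage: missing_client_files <results dir> <size in GiB> <run number>
missing_client_files() (
    dir="$1/${2}G"
    expected_client_files "$2" "$3" | while read -r f; do
        [ -e "$dir/$f" ] || echo "$f"
    done
)
```

An empty output from missing_client_files bdb-bayes-results 1 1 means all eight files for that run are present.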

Content-aware Tracers Evaluation

The CaT prototype was used to intercept network and disk I/O calls across the HDFS client, NameNode, and DataNodes. The setup included 1 client, 1 NameNode, and 3 DataNodes. Experiments were executed three times for each deployment, with dataset sizes of 16 GiB and 32 GiB.

For vanilla deployment (without tracing):

./bdb-bayes-run.sh run_bdb 16 3 vanilla

./bdb-bayes-run.sh run_bdb 32 3 vanilla

For CatBpf deployment (tracing with CatBpf):

./bdb-bayes-run.sh run_bdb 16 3 catbpf

./bdb-bayes-run.sh run_bdb 32 3 catbpf

For CatStrace deployment (tracing with CatStrace):

./bdb-bayes-run.sh run_bdb 16 3 catstrace

./bdb-bayes-run.sh run_bdb 32 3 catstrace
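
The six commands above form a 3 x 2 matrix of deployments and dataset sizes. As a sketch, they can be generated with a nested loop; print_eval_commands is a hypothetical helper that only prints the commands (a dry run) and does not execute anything.

```shell
# Hypothetical helper (not part of CaT): print the evaluation commands
# listed above, in the same order, without running them.
print_eval_commands() (
    for deployment in vanilla catbpf catstrace; do
        for size in 16 32; do
            echo "./bdb-bayes-run.sh run_bdb $size 3 $deployment"
        done
    done
)
```

Piping the output to sh from the directory that contains bdb-bayes-run.sh would run the experiments back to back. Whether the cluster needs to be reset between deployments is not stated in this page, so treat that as something to verify before automating the full matrix.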
