Apache Hadoop Use case

Tânia Esteves edited this page Oct 27, 2021 · 6 revisions

Installation and evaluation steps

CaT's repository contains the folder experiments/cat-hdfs with the necessary scripts to install and run the experiments performed with the Apache Hadoop application.

Next, we detail how to install HDFS and the benchmark BigDataBench, and how to use CaT to trace the activity of the benchmark.

Setup environment

  1. Update the variables in the file "conf-files/vars.sh" and execute the script before running any other script.
  2. Run the command "installation/utils.sh all" to install some required packages.

Install HDFS

  1. In "conf-files/hosts", add the hostname and IP address of each machine that will host the NameNode or a DataNode.
Example:
cloud124 192.168.112.124               <-namenode  
cloud125 192.168.112.125               <-datanode1
cloud126 192.168.112.126               <-datanode2
cloud128 192.168.112.128               <-datanode3
  2. Go to the installation directory: cd installation

  3. To install HDFS, run the following commands:

  • For the DataNodes and Client:
./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files
  • For the NameNode:
./hadoop-distributed-setup.sh generate_cluster_ssh_key
./hadoop-distributed-setup.sh download_hadoop_source
./hadoop-distributed-setup.sh set_hadoop_environment
./hadoop-distributed-setup.sh set_site_files
./hadoop-distributed-setup.sh format_namenode
./hadoop-distributed-setup.sh start_dfs
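
The NameNode steps above depend on each other, so it can help to stop at the first one that fails. The following is a minimal sketch of such a wrapper, not part of CaT's repository. The names `run_steps`, `SETUP`, and `namenode_steps` are hypothetical and introduced only for illustration; the step names themselves are the ones listed above.

```shell
# Sketch only: run setup steps in order and stop at the first failure.
# SETUP and run_steps are hypothetical names, not part of CaT's scripts.
# SETUP can be overridden (e.g. SETUP=true) to try the loop without a cluster.
SETUP="${SETUP:-./hadoop-distributed-setup.sh}"

# Step names exactly as listed for the NameNode above.
namenode_steps="generate_cluster_ssh_key download_hadoop_source set_hadoop_environment set_site_files format_namenode start_dfs"

run_steps() {
    # $1: space-separated list of step names
    for step in $1; do
        echo "==> $step"
        "$SETUP" "$step" || { echo "step failed: $step" >&2; return 1; }
    done
}
```

From the installation directory, the NameNode sequence would then be started with `run_steps "$namenode_steps"`.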

Install BigDataBench

  1. Go to the installation directory: cd installation

  2. To install BigDataBench, run the following command:

./install-bigdatabench.sh

How to run

The script bdb-bayes-run.sh runs the Naive Bayes algorithm (a classification algorithm used in data mining) on the Amazon movie review dataset.

Note: Do not forget to update the file "whitelist.txt" with the correct path to the BigDataBench folder.

Arguments:

  • First argument is the name of the function to run: "run_bdb"
  • Second argument is the size of the data to generate (in GiB)
  • Third argument is the number of runs to execute
  • Fourth argument is the deployment:
    • vanilla for training without tracing
    • catbpf for tracing the training with the CatBpf tracer
    • catstrace for tracing the training with the CatStrace tracer

Example:

./bdb-bayes-run.sh run_bdb 1 1 catbpf            <- 1 GiB, 1 run, tracing with catbpf
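
Since a mistyped deployment name or a missing argument may only surface after a long run has started, the arguments can be checked up front. The helper below is a hypothetical sketch and is not part of CaT. It assumes that the size and the number of runs are given as whole numbers, which the original instructions do not state explicitly.

```shell
# Hypothetical helper, not part of CaT: validates the four arguments
# described above before they are passed to bdb-bayes-run.sh.
# Assumption: size and number of runs are whole numbers.
check_bdb_args() {
    [ "$#" -eq 4 ] || { echo "expected 4 arguments, got $#" >&2; return 1; }
    [ "$1" = "run_bdb" ] || { echo "unknown function: $1" >&2; return 1; }
    case "$2" in
        ''|*[!0-9]*) echo "size (GiB) must be a whole number: $2" >&2; return 1 ;;
    esac
    case "$3" in
        ''|*[!0-9]*) echo "number of runs must be a whole number: $3" >&2; return 1 ;;
    esac
    case "$4" in
        vanilla|catbpf|catstrace) ;;
        *) echo "unknown deployment: $4" >&2; return 1 ;;
    esac
}
```

A possible use is `check_bdb_args run_bdb 1 1 catbpf && ./bdb-bayes-run.sh run_bdb 1 1 catbpf`.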

Results:

The results are saved to the path specified by the variable $RESULTS_PATH (on each machine).

$ tree bdb-bayes-results (results on the client machine)
bdb-bayes-results/
└── 1G
    ├── catbpf-genData-1G-client-1.txt                         <- catbpf log for the gen phase
    ├── catbpf-run-1G-client-1.txt                             <- catbpf log for the running phase
    ├── dstat-genData-1G-client-1.csv                          <- dstat output for the gen phase
    ├── dstat-run-1G-client-1.csv                              <- dstat output for the running phase
    ├── time-genData-1G-1                                      <- time info for the gen phase 
    ├── time-run-1G-1                                          <- time info for the running phase 
    ├── trace-genData-1G-client-1.json                         <- catbpf trace for the gen phase
    └── trace-run-1G-client-1.json                             <- catbpf trace for the running phase
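
To confirm that a catbpf run on the client produced every file shown in the tree above, the file names can be generated from the size and run number and compared against the results directory. This is a sketch under the assumption that the naming pattern in the tree holds for other sizes and run numbers; the functions `expected_catbpf_files` and `check_results` are hypothetical and not part of CaT. Other deployments and other machines may produce a different set of files.

```shell
# Hypothetical helpers, not part of CaT.
# Prints the file names shown in the tree above for a given
# size (GiB) and run number of a catbpf deployment on the client.
expected_catbpf_files() {
    size="$1"; run="$2"
    for phase in genData run; do
        echo "catbpf-${phase}-${size}G-client-${run}.txt"
        echo "dstat-${phase}-${size}G-client-${run}.csv"
        echo "time-${phase}-${size}G-${run}"
        echo "trace-${phase}-${size}G-client-${run}.json"
    done
}

# Prints "missing: <file>" for each expected file that is absent from
# <results dir>/<size>G and returns non-zero if any file is missing.
check_results() {
    dir="$1"; missing=0
    for f in $(expected_catbpf_files "$2" "$3"); do
        if [ ! -e "$dir/${2}G/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For the example above, `check_results bdb-bayes-results 1 1` would print nothing and return zero when all eight files are present.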

Content-aware Tracers Evaluation

The CaT prototype was used to intercept network and disk I/O calls across the HDFS client, NameNode, and DataNodes. The setup included 1 client, 1 NameNode, and 3 DataNodes. Experiments were executed three times for each deployment, with dataset sizes of 16 GiB and 32 GiB.

For vanilla deployment (without tracing):

./bdb-bayes-run.sh run_bdb 16 3 vanilla

./bdb-bayes-run.sh run_bdb 32 3 vanilla

For CatBpf deployment (tracing with CatBpf):

./bdb-bayes-run.sh run_bdb 16 3 catbpf

./bdb-bayes-run.sh run_bdb 32 3 catbpf

For CatStrace deployment (tracing with CatStrace):

./bdb-bayes-run.sh run_bdb 16 3 catstrace

./bdb-bayes-run.sh run_bdb 32 3 catstrace
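
The six commands above cover every combination of deployment and dataset size, so they can also be expressed as a loop. The following is a minimal sketch, not part of CaT's scripts. `DRY_RUN` and `run_matrix` are hypothetical names introduced here; by default the loop only prints the commands it would run.

```shell
# Sketch only: iterate over the evaluation matrix described above.
# DRY_RUN and run_matrix are hypothetical names, not part of CaT.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
RUNS=3

run_matrix() {
    for deployment in vanilla catbpf catstrace; do
        for size in 16 32; do
            if [ "$DRY_RUN" = "1" ]; then
                echo "./bdb-bayes-run.sh run_bdb $size $RUNS $deployment"
            else
                # Stop at the first failing experiment.
                ./bdb-bayes-run.sh run_bdb "$size" "$RUNS" "$deployment" || return 1
            fi
        done
    done
}

run_matrix
```

Setting `DRY_RUN=0` before running the sketch would execute the experiments in the same order as listed above.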