# Data-driven Simulation: Fill Chameleon Resource Usage Gaps Using HTC Workloads

This is the experiment entry point for the SC2021 SRC poster: Yin and Yang: Balancing Cloud Computing and HTC Workloads.

The code repository is available at: https://github.com/VUZhuangweiKang/CHISim

CHISim is a data-driven simulator for evaluating strategies that co-locate High Throughput Computing (HTC) workloads on Chameleon Cloud. CHISim replicates the components and processing logic of OpenStack Blazar, the Chameleon resource manager. This Jupyter Notebook shows how to set up the experimental environment for CHISim (create and configure a bare-metal instance using the Blazar API and Ansible scripts) and how to run experiments.

The figure below shows the main workflow of CHISim. In this experiment, CHISim takes Chameleon [trace data](https://www.scienceclouds.org/cloud-traces) (March 2018 to May 2020) and HTC workloads (a replayed 3-day OSG log file) as inputs.

<img src="../images/arch.png" width="600">

The Request Forecaster estimates Chameleon on-demand requests under the following strategies:
- Baseline: run Chameleon user requests only (no advance notice for preemption);
- Greedy algorithm: fill lease gaps with HTC jobs on all available resources (no advance notice for preemption);
- Predictive filling: use forecasters and preemption policies (12 experiments, covering 4 preemption policies and 3 prediction models).
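
The 12 predictive-filling runs are the cross product of the preemption policies and the prediction models. As a minimal sketch, the snippet below enumerates them using the option names given in the comments of `exp-config.yaml`. The function name and the dictionary layout are illustrative assumptions, not part of CHISim.

```python
from itertools import product

# Option names as listed in the comments of exp-config.yaml.
TERMINATION_POLICIES = ["random", "least_core", "least_resubmit", "recent_deployed"]
# 'baseline' is omitted here because it is not a prediction model.
REQUEST_PREDICTORS = ["rolling_mean", "rolling_median", "lstm"]


def predictive_filling_runs():
    """Return one settings dict per (policy, predictor) pair (hypothetical helper)."""
    return [
        {
            "termination_policy": policy,
            "request_predictor": predictor,
            "enable_osg": True,  # HTC jobs are enabled in every non-baseline run
        }
        for policy, predictor in product(TERMINATION_POLICIES, REQUEST_PREDICTORS)
    ]


runs = predictive_filling_runs()
print(len(runs))
for run in runs:
    print(run["termination_policy"], run["request_predictor"])
```

Each dictionary corresponds to one edit of the `simulation` section of the experiment profile before a run.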
    
The Resource Manager supports four preemption policies:
- Random: preempt nodes from the HTC pool arbitrarily;
- Recent-Deployed: preempt the nodes most recently assigned to HTC;
- Least-Core-Used: preempt the nodes with the fewest cores assigned to HTC jobs;
- Least-Resubmit: preempt the nodes with the fewest re-submissions.
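
The four policies differ only in how they rank the nodes in the HTC pool. The sketch below illustrates that idea. The `HTCNode` record, its field names, and `select_for_preemption` are hypothetical and do not mirror CHISim's actual data model. In particular, it assumes that re-submissions are counted per node.

```python
import random
from dataclasses import dataclass


@dataclass
class HTCNode:
    """Hypothetical record for a node currently lent to the HTC pool."""
    name: str
    deployed_at: float  # time at which the node was assigned to HTC
    cores_used: int     # cores currently assigned to HTC jobs
    resubmits: int      # assumed per-node count of job re-submissions


def select_for_preemption(pool, count, policy, rng=None):
    """Pick up to `count` nodes to return to Chameleon under `policy`."""
    count = min(count, len(pool))
    if policy == "random":
        rng = rng or random.Random()
        return rng.sample(pool, count)
    if policy == "recent_deployed":
        ranked = sorted(pool, key=lambda n: n.deployed_at, reverse=True)
    elif policy == "least_core":
        ranked = sorted(pool, key=lambda n: n.cores_used)
    elif policy == "least_resubmit":
        ranked = sorted(pool, key=lambda n: n.resubmits)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return ranked[:count]


pool = [
    HTCNode("n1", deployed_at=100.0, cores_used=24, resubmits=2),
    HTCNode("n2", deployed_at=300.0, cores_used=4, resubmits=5),
    HTCNode("n3", deployed_at=200.0, cores_used=12, resubmits=0),
]
for policy in ("recent_deployed", "least_core", "least_resubmit"):
    chosen = select_for_preemption(pool, 1, policy)
    print(policy, "->", chosen[0].name)
```

With this sample pool, Recent-Deployed and Least-Core-Used both pick `n2`, while Least-Resubmit picks `n3`.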

Three experiments were conducted in the

__Requirements:__ This experiment was packaged to run on the [Chameleon testbed](https://www.chameleoncloud.org), using Jupyter Notebook. To run this, you'll need a Chameleon account and an active project allocation.

__Estimated Time:__ depends on the number of configurations (preemption policy × request forecasting algorithm) you want to evaluate.


__Steps:__     
1. __Create a Chameleon Lease and Launch a Bare-metal Instance:__      
    a. Obtain a lease using the Blazar API    
    b. Create a stack of OpenStack services based on the template: _chameleon/jupyterhub_heat_template.yml_    
    c. Install and configure Ansible and JupyterHub   
2. __Prepare and Run Experiment:__   
    a. Install dependent packages for CHISim: ansible/chisim-setup.yml  
    b. Define the experiment profile: [config](exp-config.yaml)  
    c. Start experiments using SSH; experiment logs are in: logs/xxx.log  
3. __Visualize Experiment Process:__   
    a. Create Grafana dashboard: ansible/chisim-visual.yml  
    b. Visualize the experimental metrics: http://$fip_addr:3000

__Contact:__ 
[Zhuangwei Kang](mailto:zhuangwei.kang@vanderbilt.edu)

## Step 1. Create a Chameleon Lease and Launch a Bare-metal Instance

Replace the values of `use_site` and `OS_PROJECT_NAME` with your own. The suggested runtime environment for CHISim is a ComputeHaswell node running Ubuntu 20.04.

In [None]:
use_site "the Chameleon site you are using" # for example: "CHI@TACC"
export OS_PROJECT_NAME="your project name"  # e.g. "CHI-000000"

NODE_TYPE=compute_haswell
IMAGE=CC-Ubuntu20.04
EMAIL="$(openstack user show $USER -f value -c email)"

# A unique name for most provisioned resources to avoid collisions
RESOURCE_NAME="${USER}-jupyterhub-$(date +%b%d)"

[[ -n "$EMAIL" ]] || {
  echo >&2 "Could not look up your user, check your OS_PROJECT_NAME"
}

### 1.1 Obtain a lease using Blazar API    

Execute the commands in the cell below and wait until an ACTIVE lease appears on the page: https://chi.tacc.chameleoncloud.org/project/leases/.

In [None]:
lease_name="$RESOURCE_NAME"
network_name="$RESOURCE_NAME"
public_network_id=$(openstack network show public -f value -c id)

blazar lease-create \
  --physical-reservation min=1,max=1,resource_properties="[\"=\", \"\$node_type\", \"$NODE_TYPE\"]" \
  --reservation resource_type=network,network_name="$network_name",resource_properties='["==","$physical_network","physnet1"]' \
  --reservation resource_type=virtual:floatingip,network_id="$public_network_id",amount=1 \
  --start-date "$(date +'%Y-%m-%d %H:%M')" \
  --end-date "$(date +'%Y-%m-%d %H:%M' -d'+2 day')" \
  "$lease_name"

# Wait for lease to start
timeout 30000 bash -c 'until [[ $(blazar lease-show $0 -f value -c status) == "ACTIVE" ]]; do sleep 1; done' "$lease_name" \
    && echo "Lease started successfully!"

#
# Fetch information about which resources were reserved for later use
#

reservations=$(blazar lease-show "$lease_name" -f json \
  | jq -r '.reservations')
host_reservation_id=$(jq -rs 'map(select(.resource_type=="physical:host"))[].id' <<<"$reservations")
fip_reservation_id=$(jq -rs 'map(select(.resource_type=="virtual:floatingip"))[].id' <<<"$reservations")

fip=$(openstack floating ip list --tags "reservation:$fip_reservation_id" -f json)
fip_id=$(jq -r 'map(.ID)[0]' <<<"$fip")
fip_addr=$(jq -r 'map(.["Floating IP Address"])[0]' <<<"$fip")

### 1.2 Create a stack of OpenStack services

The cells below create a stack of OpenStack services based on a set of input settings and a template file. More details about OpenStack Orchestration can be found [here](https://docs.openstack.org/mitaka/user-guide/dashboard_stacks.html).

The stack status is trackable [here](https://chi.tacc.chameleoncloud.org/project/stacks/).

This step usually takes about 20 minutes. The instance status will become ACTIVE if the creation is successful. To check the instance status, please see: https://chi.tacc.chameleoncloud.org/project/instances/.

In [None]:
# Ensure your Jupyter keypair is present
key_pair_upload

In [None]:
stack_name="$RESOURCE_NAME"
export OS_KEYPAIR_NAME="your account name-jupyter"

openstack stack create "$stack_name" --wait \
  --template chameleon/jupyterhub_heat_template.yml \
  --parameter floating_ip="$fip_id" \
  --parameter reservation_id="$host_reservation_id" \
  --parameter key_name="$OS_KEYPAIR_NAME" \
  --parameter network_name="$network_name" \
  --parameter image="$IMAGE" && wait_ssh "$fip_addr"

### 1.3 Install and configure Ansible and JupyterHub

The underlying base image does not have JupyterHub installed. To install and configure it, this example uses [Ansible](https://www.ansible.com/). First, some configuration of Ansible is required:

In [None]:
# Install Ansible dependencies
ansible-galaxy install -r ansible/requirements.yml

# Configure Ansible to run against provisioned nodes
sudo mkdir -p /etc/ansible
sudo tee /etc/ansible/hosts <<EOF
[jupyterhub]
$fip_addr ansible_user=cc ansible_become=yes ansible_become_user=root
EOF

In [None]:
export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook --extra-vars floating_ip="$fip_addr" --extra-vars email_address="$EMAIL" ansible/bootstrap.yml

In [None]:
ansible-playbook ansible/configure.yml

## Step 2. Prepare and Run Experiment

### 2.1 Install dependent packages of CHISim

This step clones the CHISim git repository onto the remote node you just created. It then installs and configures several fundamental components: InfluxDB, RabbitMQ, and MongoDB.

In [None]:
ansible-playbook ansible/chisim-setup.yml

### 2.2 Configure Experiment

We use a [YAML file](exp-config.yaml) to define the experiment profile in CHISim. The listing below explains the meaning of each field.

The input temporal data files include:
- [cloud user requests](https://github.com/VUZhuangweiKang/CHISim/blob/main/simulator/datasets/user_requests/compute_haswell.csv)
- [machine events](https://github.com/VUZhuangweiKang/CHISim/blob/main/simulator/datasets/machine_events/compute_haswell.csv)
- [osg jobs](https://github.com/VUZhuangweiKang/CHISim/blob/main/simulator/datasets/osg_jobs/osg_jobs.csv)

```yaml
---
simulation:
  termination_policy: random  # Options: ['random', 'least_core', 'least_resubmit', 'recent_deployed']
  request_predictor: baseline  # Options: ['baseline', 'rolling_mean', 'rolling_median', 'lstm']
  scale_ratio: 10800  # the scaling ratio applied to the timestamps of the input time-series data
  credential:  # username and password for influxdb, mongodb, and rabbitmq connections
    username: chi-sim
    password: chi-sim
  enable_osg: yes  # whether to enable OSG jobs; set this to no for the Baseline experiment

framework:
  global_mgr:
    clean_run: yes  # whether to clear the old rabbitmq queues and exchanges.
  rsrc_mgr:
    host: localhost
  frontend:
    request_forecaster:
      window: 168  # the length of the sliding window in hours
      steps: 3  # the length of the current time slot in hours; for example, 3 means the Forecaster predicts the total cloud user requests in the upcoming 3 hours.
      retrain:
        enabled: yes  # whether to retrain the LSTM model periodically; only applies to the LSTM-based Forecaster
        length: 30000   # the retraining period of the LSTM model, in number of samples
  databus:
    rabbitmq: 127.0.0.1
  database:
    influxdb: 127.0.0.1
    mongodb: 127.0.0.1

workloads:  # the workload entry is composed of the path of the data file and the index of the temporal column.
  machine_events:
    payload: ../datasets/machine_events/compute_haswell.csv
    timestamp_col: 0
  osg_jobs:
    payload: datasets/osg_jobs/osg_jobs.csv
    timestamp_col: 13
  chameleon_requests:
    payload: datasets/user_requests/lease_info.csv
    timestamp_col: -1
 ```
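
To make the `window` and `steps` fields concrete, here is a minimal sketch of a rolling-mean or rolling-median forecaster. It is not CHISim's implementation. It assumes that the history is a list of hourly request counts, that the last `window` hours are grouped into slots of `steps` hours, and that the forecast is the mean or median of those slot totals. The function name and signature are hypothetical.

```python
from statistics import mean, median


def forecast_next_slot(hourly_requests, window=168, steps=3, method="rolling_mean"):
    """Estimate the total number of requests in the next `steps` hours.

    hourly_requests: request counts per hour, oldest first (assumed layout).
    """
    if steps <= 0 or window < steps:
        raise ValueError("need 0 < steps <= window")
    recent = hourly_requests[-window:]
    # Drop the oldest hours that do not fill a complete slot.
    usable = len(recent) - len(recent) % steps
    recent = recent[len(recent) - usable:]
    slot_totals = [sum(recent[i:i + steps]) for i in range(0, usable, steps)]
    if not slot_totals:
        return 0.0  # not enough history for a single slot
    if method == "rolling_mean":
        return float(mean(slot_totals))
    if method == "rolling_median":
        return float(median(slot_totals))
    raise ValueError(f"unknown method: {method}")


# Nine hours of history form three 3-hour slots with totals 3, 9, and 30.
history = [0, 1, 2, 3, 3, 3, 10, 10, 10]
print(forecast_next_slot(history, window=9, steps=3, method="rolling_mean"))    # 14.0
print(forecast_next_slot(history, window=9, steps=3, method="rolling_median"))  # 9.0
```

The example shows why the two models can disagree: a single busy slot pulls the mean up to 14.0 while the median stays at 9.0.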

In [None]:
# copy the experiment profile to the remote machine
ansible-playbook ansible/chisim-config.yml

### 2.3 Start Experiments

Launch the CHISim components on the remote machine over SSH and redirect their output to the [logs](./logs) directory. Each component runs as an independent process, and the process ID of each local SSH session is appended to [pid.txt](pid.txt).

In [None]:
nohup bash -c "ssh -i ~/work/.ssh/id_rsa cc@$fip_addr \"cd /home/cc/CHISim/simulator/ && python3 global_manager.py\"" > logs/global_manager.log & echo $!>> pid.txt

In [None]:
nohup bash -c "ssh -i ~/work/.ssh/id_rsa cc@$fip_addr \"cd /home/cc/CHISim/simulator/ && python3 resource_manager.py\"" > logs/resource_manager.log & echo $!>> pid.txt

In [None]:
nohup bash -c "ssh -i ~/work/.ssh/id_rsa cc@$fip_addr \"cd /home/cc/CHISim/simulator/ && python3 frontend.py\"" > logs/frontend.log & echo $!>> pid.txt

In [None]:
nohup bash -c "ssh -i ~/work/.ssh/id_rsa cc@$fip_addr \"cd /home/cc/CHISim/simulator/ && python3 backfill.py\"" > logs/backfill.log & echo $!>> pid.txt

In [None]:
nohup bash -c "ssh -i ~/work/.ssh/id_rsa cc@$fip_addr \"cd /home/cc/CHISim/simulator/ && python3 workload.py\"" > logs/workload.log & echo $!>> pid.txt

## Step 3. Visualize Experiment Process

### 3.1 Install Grafana and customize the dashboard

The dashboard is defined as a [json file](ansible/chisim-deploy/grafana-dashboard.json).

In [None]:
ansible-playbook --extra-vars floating_ip="$fip_addr" ansible/chisim-visual.yml

In [None]:
# Monitor the experiment
echo "Grafana: http://$fip_addr:3000"

## Step 4. Shut Down Experiments

In [None]:
kill -9 $(cat pid.txt)
rm pid.txt
ansible-playbook ansible/chisim-shutdown.yml

After the execution, you can easily export measured metrics as CSV files through the Grafana dashboard. Since the data source of Grafana is InfluxDB, an alternative way of extracting data is querying InfluxDB directly.
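
As a rough sketch of the direct-query route, the snippet below builds a request against the InfluxDB 1.x HTTP `/query` endpoint and asks for CSV output. It assumes that the deployed InfluxDB exposes the 1.x API on the default port 8086. The database name `chisim` and the measurement `node_usage` are hypothetical placeholders, and the credentials are the sample values from the experiment profile. Check the names actually used by your deployment before running it.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_query_request(host, database, query, username, password, port=8086):
    """Build a GET request for the InfluxDB 1.x /query endpoint, asking for CSV."""
    params = urlencode({"db": database, "q": query, "u": username, "p": password})
    url = f"http://{host}:{port}/query?{params}"
    return Request(url, headers={"Accept": "application/csv"})


def export_csv(request, path):
    """Send the request and write the raw CSV response to `path`."""
    with urlopen(request, timeout=30) as response, open(path, "wb") as out:
        out.write(response.read())


# 'chisim' and 'node_usage' are placeholders, not names taken from CHISim.
req = build_query_request(
    "127.0.0.1", "chisim", 'SELECT * FROM "node_usage"', "chi-sim", "chi-sim"
)
print(req.full_url)
# export_csv(req, "node_usage.csv")  # run where InfluxDB is reachable
```

The `export_csv` call is left commented out because it only succeeds on a host that can reach the InfluxDB instance, such as the bare-metal node itself.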