### Explanation of the Script

- **Purpose**: This script automates benchmarking of a large language model (LLM) hosted on a remote server using LLMPerf. It runs one benchmark per concurrency level, from 1 up to a configurable maximum, and stores the results for further analysis.

- **Setup**:
  - You can modify key parameters such as:
    - **Max Concurrent Requests**: Upper bound of the concurrency sweep; every level from 1 up to this value is tested (set to 18 in the code below).
    - **Remote Server IP**: IP address where the LLM API is hosted.
    - **Model**: Specify the model to use for the benchmark (default: `meta-llama/Meta-Llama-3.1-8B-Instruct`).
    - **Input/Output Token Distribution**: Define the mean and standard deviation for input and output tokens to simulate realistic request sizes.

- **Repository Setup**:
  - The script clones the `llmperf` repository into `llmperf_dir` if that directory does not already exist. This repository contains the benchmarking tools.
  - The working directory is then switched to the `llmperf` directory, and the package is installed in editable mode with `pip install -e .` (after pinning `setuptools==65.5.0`).

- **Environment Configuration**:
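  - The script sets the two environment variables used to reach the LLM API: `OPENAI_API_KEY` (set to the placeholder value `EMPTY`) and `OPENAI_API_BASE`.
  - `OPENAI_API_BASE` is built from `remote_ip` as `http://<remote_ip>:8000/v1`, so the server is expected to listen on port 8000.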

- **Benchmark Execution**:
  - The script runs benchmarks in a loop, starting from 1 concurrent request and increasing up to the specified `max_concurrent_requests`.
  - For each iteration:
    - A subdirectory named `concurrent_<N>` is created under the timestamped base results directory to store that run's output.
    - The script builds the `token_benchmark_ray.py` command for the current concurrency level and prints it for verification. Each run is capped at as many completed requests as there are concurrent requests.
    - The command is then executed through the notebook's shell escape (`!`).

- **Results**:
  - After running all benchmarks, the script lists all files and directories inside the base results directory, allowing you to see all results from each benchmark run.
  - The results are saved in a structured manner within subdirectories, categorized by the number of concurrent requests.

- **Customizable Parameters**:
  - `max_concurrent_requests`: Maximum concurrent requests to test.
  - `remote_ip`: IP address of the remote server where the LLM is hosted.
  - `model`: LLM model name for benchmarking.
  - `mean_input_tokens`, `stddev_input_tokens`, `mean_output_tokens`, and `stddev_output_tokens`: Mean and standard deviation of the input and output token counts.
  - `timeout`: Maximum duration of each benchmark run, in seconds (7200 in the code below).
  - **Destination Folder for Results**: The `base_results_dir` specifies the directory where all benchmark results will be stored. This path is timestamped and organized by concurrent request numbers.
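
To make the token-distribution parameters concrete, the sketch below draws request sizes from a normal distribution with the mean and standard deviation used in this notebook. It is an illustration only, not LLMPerf's own sampling code: the helper name `sample_positive_token_count` and the choice to redraw non-positive values are assumptions made for this example.

```python
import random


def sample_positive_token_count(mean, stddev, rng):
    """Draw a token count from a normal distribution, truncated to an int.

    Hypothetical helper for illustration: non-positive draws are rejected
    and redrawn so that every simulated request has at least one token.
    """
    while True:
        count = int(rng.gauss(mean, stddev))
        if count > 0:
            return count


rng = random.Random(0)  # fixed seed so repeated runs give the same draws

# Same parameters as the notebook: inputs ~ N(6000, 200), outputs ~ N(150, 50)
input_counts = [sample_positive_token_count(6000, 200, rng) for _ in range(1000)]
output_counts = [sample_positive_token_count(150, 50, rng) for _ in range(1000)]

print("input tokens  min/max:", min(input_counts), max(input_counts))
print("output tokens min/max:", min(output_counts), max(output_counts))
```

With a standard deviation of 50 around a mean of 150, output lengths vary widely between requests, while input lengths stay within a few percent of 6000.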


In [None]:
import os
from datetime import datetime

# User Variables - Modify these as needed

# Maximum number of concurrent requests to test
max_concurrent_requests = 18

# IP address of the remote server
remote_ip = 'x.x.x.x'

# Model to use for benchmarking
model = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

# Mean and standard deviation of input tokens
mean_input_tokens = 6000
stddev_input_tokens = 200

# Mean and standard deviation of output tokens
mean_output_tokens = 150
stddev_output_tokens = 50

# Absolute path to clone the llmperf repository
llmperf_dir = '/home/ec2-user/SageMaker/llmperf'

# Maximum time to wait for each benchmark run (in seconds)
timeout = 7200

# Additional sampling parameters for the benchmark
additional_sampling_params = '{}'

# Clone the llmperf repository if it doesn't exist
if not os.path.exists(llmperf_dir):
    !git clone https://github.com/ray-project/llmperf.git {llmperf_dir}
else:
    print(f"'llmperf' directory already exists at {llmperf_dir}")

# Change the current working directory to the llmperf directory
os.chdir(llmperf_dir)
print(f"Current working directory: {os.getcwd()}")

# Install required packages
!pip install setuptools==65.5.0
!pip install -e .

# Set environment variables for the OpenAI API
os.environ['OPENAI_API_KEY'] = 'EMPTY'
os.environ['OPENAI_API_BASE'] = f"http://{remote_ip}:8000/v1"

# Generate a timestamped directory name for results
date_str = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

# Modify this to change the results directory
base_results_dir = f"vllm_bench_results/tp32-b18/{date_str}"

print(f"Base Results Directory: {base_results_dir}")

# Create the base results directory
os.makedirs(base_results_dir, exist_ok=True)

# Run benchmarks for concurrent requests from 1 to max_concurrent_requests
for concurrent_requests in range(1, max_concurrent_requests + 1):
    print(f"\nRunning benchmark with {concurrent_requests} concurrent requests...")
    
    # Total completed requests for this run: one per concurrent slot.
    # Raise the multiplier to send several requests per slot.
    max_requests = concurrent_requests * 1
    
    # Create a subdirectory for this run's results
    results_dir = os.path.join(base_results_dir, f"concurrent_{concurrent_requests}")
    os.makedirs(results_dir, exist_ok=True)
    print(f"Results Directory: {results_dir}")
    
    # Construct the benchmark command
    benchmark_command = f"""
    python3 ./token_benchmark_ray.py \\
     --model {model} \\
     --mean-input-tokens {mean_input_tokens} \\
     --stddev-input-tokens {stddev_input_tokens} \\
     --mean-output-tokens {mean_output_tokens} \\
     --stddev-output-tokens {stddev_output_tokens} \\
     --max-num-completed-requests {max_requests} \\
     --timeout {timeout} \\
     --num-concurrent-requests {concurrent_requests} \\
     --results-dir "{results_dir}" \\
     --llm-api openai \\
     --additional-sampling-params '{additional_sampling_params}'
    """
    
    # Print the command for verification
    print("Executing Benchmark Command:")
    print(benchmark_command)
    
    # Execute the benchmark command
    !{benchmark_command}
    
# List the contents of the base results directory
print("\nAll Benchmark Results:")
!ls -l "{base_results_dir}"
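
Once the sweep has finished, the per-run output can be collected for comparison across concurrency levels. The sketch below is one possible way to do that, under two assumptions: run directories follow the `concurrent_<N>` naming used above, and each run leaves a JSON file whose name ends in `_summary.json`. The function name `load_summaries` is hypothetical, and no particular metric keys are assumed.

```python
import json
import re
from pathlib import Path


def load_summaries(base_results_dir):
    """Return {concurrency level: parsed summary JSON}, ordered by level.

    Hypothetical helper: directories that do not match `concurrent_<N>`
    or that contain no summary file are skipped.
    """
    summaries = {}
    for run_dir in Path(base_results_dir).glob("concurrent_*"):
        match = re.fullmatch(r"concurrent_(\d+)", run_dir.name)
        if match is None or not run_dir.is_dir():
            continue
        # Assumption: the run wrote a single file ending in "_summary.json".
        summary_files = sorted(run_dir.glob("*_summary.json"))
        if not summary_files:
            continue
        with open(summary_files[0]) as f:
            summaries[int(match.group(1))] = json.load(f)
    # Sort numerically so that level 10 comes after level 2.
    return dict(sorted(summaries.items()))
```

In the notebook, `load_summaries(base_results_dir)` would return one dictionary per completed run, which can then be inspected or passed to a plotting or tabulation step.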

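
The benchmark command in the loop above is assembled as a multi-line shell string, which relies on correct quoting of the model name, results path, and JSON sampling parameters. As an alternative sketch, the same flags can be built as an argument list and passed to `subprocess.run`, which avoids shell quoting altogether. The function name `build_benchmark_args` and its defaults are assumptions for this example; the flags are the ones already used in the notebook.

```python
import json
import shlex


def build_benchmark_args(model, concurrent_requests, results_dir,
                         mean_input_tokens=6000, stddev_input_tokens=200,
                         mean_output_tokens=150, stddev_output_tokens=50,
                         timeout=7200, sampling_params=None):
    """Return the benchmark command as an argv list (hypothetical helper)."""
    return [
        "python3", "./token_benchmark_ray.py",
        "--model", model,
        "--mean-input-tokens", str(mean_input_tokens),
        "--stddev-input-tokens", str(stddev_input_tokens),
        "--mean-output-tokens", str(mean_output_tokens),
        "--stddev-output-tokens", str(stddev_output_tokens),
        # One completed request per concurrent slot, as in the loop above.
        "--max-num-completed-requests", str(concurrent_requests),
        "--timeout", str(timeout),
        "--num-concurrent-requests", str(concurrent_requests),
        "--results-dir", str(results_dir),
        "--llm-api", "openai",
        "--additional-sampling-params", json.dumps(sampling_params or {}),
    ]


args = build_benchmark_args(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    concurrent_requests=4,
    results_dir="vllm_bench_results/example/concurrent_4",  # illustrative path
)

# Print a shell-safe rendering of the command for verification.
print(shlex.join(args))

# To execute from the llmperf directory:
#   import subprocess
#   subprocess.run(args, check=True)
```

Passing a list means each value reaches the script as a single argument, so paths containing spaces or JSON containing quotes need no escaping.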