## Use Slurm to Perform Hyperparameter Optimization with Grid Search

This script (`gridsearch.py`) evaluates every combination in a predefined grid of hyperparameters for a Support Vector Machine (SVM) classifier trained on the Iris dataset.

### Why do we use a cluster?

The cluster allows us to distribute computationally intensive tasks across multiple computers to save time and increase efficiency. In hyperparameter optimization, where potentially thousands of combinations need to be tested, using a cluster can significantly speed up the process.

### Why do we store results in a database?

Since jobs on a cluster can be interrupted for various reasons (such as system failures, maintenance, or time limits), we store the results in a PostgreSQL database. This allows us to resume the optimization process without having to start over: each parameter combination and its result are stored, so combinations that have already been evaluated are skipped when the script is restarted.
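
The resume behaviour can be tried locally with a minimal sketch. It assumes an in-memory `sqlite3` database as a stand-in for PostgreSQL, and the names `run_search` and `dummy_score` are hypothetical helpers that are not part of the script:

```python
import json
import sqlite3


def run_search(conn, param_list, evaluate):
    """Evaluate only combinations that are not stored yet; return how many were run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gridsearch_results (params TEXT, mean_score REAL)"
    )
    # Keys of all combinations that a previous run already finished
    done = {row[0] for row in conn.execute("SELECT params FROM gridsearch_results")}
    evaluated = 0
    for params in param_list:
        key = json.dumps(params, sort_keys=True)
        if key in done:
            continue  # result survived an earlier run, skip it
        conn.execute(
            "INSERT INTO gridsearch_results (params, mean_score) VALUES (?, ?)",
            (key, evaluate(params)),
        )
        conn.commit()  # commit per combination so an interruption loses at most one result
        evaluated += 1
    return evaluated


def dummy_score(params):
    """Placeholder for cross-validation; not a real model evaluation."""
    return 1.0 / (1.0 + params["C"])


conn = sqlite3.connect(":memory:")
grid = [{"C": c} for c in (0.1, 1, 10)]

first_run = run_search(conn, grid[:2], dummy_score)  # simulates a job that stopped early
second_run = run_search(conn, grid, dummy_score)     # restart with the full grid
print(first_run, second_run)
```

The second call only evaluates the one combination that the first call did not reach.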

### Key components of the script

- **Establishing a database connection**: We use `psycopg2` to connect to our PostgreSQL database. 
- **Creating a data table**: If not already present, a table is created in the database to store the results.
- **Defining the parameter space**: We define a space of possible values for the hyperparameters `C`, `gamma`, and `kernel` to be assigned to the SVM.
- **Evaluation**: For each combination of parameters in the `ParameterGrid` that is not yet in the database, we run 3-fold cross-validation and save the mean score in the database.
- **Retrieving results**: After all computations are complete, we query the database for the best parameter combination.
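
To get a feel for the size of the search, the sketch below enumerates the same parameter space with `itertools.product` from the standard library. This is a stand-in for illustration, not scikit-learn's `ParameterGrid` itself; the fold count of 3 is taken from the `cv=3` argument in the script.

```python
from itertools import product

# Same value lists as the param_grid in gridsearch.py
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "linear"],
}

# Build one dict per combination (Cartesian product of all value lists)
keys = sorted(param_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[k] for k in keys))
]

n_fits = len(combinations) * 3  # each combination is fitted once per fold (cv=3)
print(len(combinations), n_fits)
```

Note that `gamma` only affects the `rbf`, `poly`, and `sigmoid` kernels in scikit-learn's `SVC`, so the eight `linear` combinations reduce to four distinct models.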

In [20]:
%%file gridsearch.py

import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score, ParameterGrid
from sklearn.svm import SVC
import psycopg2
import json

# Connect to the database
conn = psycopg2.connect(
    dbname='postgres',
    user='postgres.ymxgukzcysicyvmmouxx',
    password='PW',
    host='aws-0-eu-central-1.pooler.supabase.com',
    port='5432'
)
cur = conn.cursor()

# Create the results table if it does not exist yet
cur.execute("""
CREATE TABLE IF NOT EXISTS gridsearch_results (
    id SERIAL PRIMARY KEY,
    params JSON,
    mean_score FLOAT
)
""")
conn.commit()

# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Define the parameter space
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear']
}

# Check which parameter combinations have already been evaluated
cur.execute("SELECT params FROM gridsearch_results")
evaluated_params = {json.dumps(row[0]) for row in cur.fetchall()}

# Build the ParameterGrid and evaluate each combination in turn
param_list = list(ParameterGrid(param_grid))


for params in param_list:
    params_json = json.dumps(params)
    if params_json not in evaluated_params:
        model = SVC(**params)
        score = np.mean(cross_val_score(model, X, y, cv=3, n_jobs=-1))
        # Write the result to the database
        cur.execute(
            "INSERT INTO gridsearch_results (params, mean_score) VALUES (%s, %s)",
            (params_json, score)
        )
        conn.commit()
        print(f"Evaluated: {params}, Score: {score}")

# Retrieve the best result
cur.execute("SELECT params, mean_score FROM gridsearch_results ORDER BY mean_score DESC LIMIT 1")
best_result = cur.fetchone()
print(f"Best parameters: {best_result[0]}, best score: {best_result[1]}")

# Close the database connection
cur.close()
conn.close()

Overwriting gridsearch.py
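
The skip check in the script compares JSON strings, which works as long as the keys are serialized in the same order when a result is written and when it is read back. A minimal sketch of an order-independent key, assuming a hypothetical helper named `param_key` that is not part of the script:

```python
import json


def param_key(params):
    """Serialize a parameter dict with sorted keys so equal dicts give equal strings."""
    return json.dumps(params, sort_keys=True)


a = {"C": 1, "gamma": "scale", "kernel": "rbf"}
b = {"kernel": "rbf", "C": 1, "gamma": "scale"}  # same content, different key order

print(json.dumps(a) == json.dumps(b))  # plain serialization follows insertion order
print(param_key(a) == param_key(b))    # sorted keys make the comparison stable
```

Using such a key both for the `INSERT` and for the set of already evaluated combinations would keep the check reliable even if the dictionaries were built in a different order.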


## Slurm Script

The Slurm script below configures the job scheduling parameters and the computational environment needed to execute the grid search.
 
- **Shebang (`#!/bin/bash`)**: Specifies that the script should be run in the Bash shell.
- **`#SBATCH` Directives**: These lines configure the resources and settings for the Slurm job scheduler:
  - `--job-name=gridsearch`: Sets the job's name to 'gridsearch', which helps in identifying the job within the job queue.
  - `--output=./logs/gridsearch_%j.out`: Directs the standard output (stdout) of the job to a file in the `logs` directory. The `%j` is replaced by the job ID, so each job execution gets its own log file. The `logs` directory must already exist when the job is submitted, because Slurm does not create it.
  - `--error=./logs/gridsearch_%j.err`: Similar to the output directive, this directs standard error (stderr) to a file, helping in debugging if the job encounters issues.
  - `--time=01:00:00`: Limits the job's maximum running time to one hour. If the job exceeds this time, it will be terminated by Slurm.
  - `--cpus-per-task=4`: Allocates 4 CPU cores to each task of the job (here a single task), which the cross-validation in `gridsearch.py` can use in parallel through `n_jobs=-1`.
  - `--mem=4G`: Assigns 4 gigabytes of RAM to the job, ensuring sufficient memory is available for processing.
  - `--partition=base`: Specifies the partition where the job should run. The 'base' partition is a general-purpose queue configured on the cluster.
  
- **Environment Setup**:
  - `module load python/3.8`: Loads the Python 3.8 module, setting up the software environment required to run the Python script.
  - `pip install --user scikit-learn pandas psycopg2-binary`: Installs the necessary Python packages in the user's home directory.

- **Script Execution**:
  - `python gridsearch.py`: Runs the grid search script inside the allocation configured above.
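
Inside a job, Slurm exports the value of `--cpus-per-task` as the environment variable `SLURM_CPUS_PER_TASK`. A Python script can read it to size its worker pool explicitly, for example as the `n_jobs` argument. The sketch below assumes a hypothetical helper named `allocated_cpus`; it is not part of `gridsearch.py`.

```python
import os


def allocated_cpus(environ=os.environ, default=1):
    """Return the core count granted via --cpus-per-task, or `default` outside a job."""
    value = environ.get("SLURM_CPUS_PER_TASK")
    return int(value) if value else default


# Simulated environments: inside a job with --cpus-per-task=4, and outside Slurm
print(allocated_cpus({"SLURM_CPUS_PER_TASK": "4"}))
print(allocated_cpus({}))
```

Falling back to a default keeps the same script usable on a laptop, where the variable is not set.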

In [17]:
%%file run_gridsearch.slurm
#!/bin/bash
#SBATCH --job-name=gridsearch
#SBATCH --output=./logs/gridsearch_%j.out
#SBATCH --error=./logs/gridsearch_%j.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --partition=base

module load python/3.8
pip install --user scikit-learn pandas psycopg2-binary

python gridsearch.py

Overwriting run_gridsearch.slurm


## Prepare execution

The `sinfo` command reports the state of the cluster's partitions and nodes. Checking it before submitting large jobs shows which partitions are least busy and which limits apply.

### Understanding `sinfo` Output

When you run `sinfo`, it provides several key pieces of information:

- **PARTITION**: The name of the partition. Partitions are subsets of the cluster, often configured with specific types of nodes or for particular groups or job types. A trailing `*` marks the default partition.
- **AVAIL**: Indicates whether the partition is available for job submission.
- **TIMELIMIT**: Shows the maximum duration that jobs are allowed to run in the partition, in the format `days-hours:minutes:seconds`.
- **NODES**: Lists the number of nodes in each state within the partition.
- **STATE**: Displays the current state of nodes. Common states include `idle` (available for new jobs), `alloc` (allocated to a job), `comp` (a job on the node is completing), `down` (not operational), and `drain` (being removed from active duty, usually for maintenance). A trailing `*` on the state means the node is not responding.
- **NODELIST**: Identifies the specific nodes within each state.




In [18]:
!sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 14-00:00:0      3  down* lab-aicl-n[1,11,17]
base*        up 14-00:00:0      1   comp lab-aicl-n16
base*        up 14-00:00:0      6  alloc lab-aicl-n[2-3,5-6,12,19]
base*        up 14-00:00:0      9   idle lab-aicl-n[4,7-10,13-15,18]
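
The table above can also be summarized programmatically. The sketch below parses the text shown in this cell and adds up the node counts per state; the function name `nodes_by_state` is a hypothetical helper, and the parsing assumes the default column layout shown here.

```python
from collections import Counter

# The sinfo output from the cell above, copied verbatim
SINFO_OUTPUT = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 14-00:00:0      3  down* lab-aicl-n[1,11,17]
base*        up 14-00:00:0      1   comp lab-aicl-n16
base*        up 14-00:00:0      6  alloc lab-aicl-n[2-3,5-6,12,19]
base*        up 14-00:00:0      9   idle lab-aicl-n[4,7-10,13-15,18]
"""


def nodes_by_state(text):
    """Sum the NODES column per STATE, ignoring the trailing '*' flag on states."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    nodes_col = header.index("NODES")
    state_col = header.index("STATE")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        counts[fields[state_col].rstrip("*")] += int(fields[nodes_col])
    return counts


counts = nodes_by_state(SINFO_OUTPUT)
print(dict(counts))
```

For this snapshot, 9 of the 19 nodes in the `base` partition are idle, so a 4-core job should not have to wait long.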


## Submitting a script to the cluster

When you run `sbatch run_gridsearch.slurm`, the Slurm scheduler queues the job, waits until the resources requested in the script are available, and then executes it. `sbatch` prints the job ID on submission; this ID is the value that replaces `%j` in the log file names.

In [19]:
!sbatch run_gridsearch.slurm

Submitted batch job 202646
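
Submission can also be driven from Python, for example to keep the job ID for later monitoring. The following is a minimal sketch under stated assumptions: `parse_job_id` and `submit` are hypothetical helpers, and `submit` is only defined here, not called, because it needs a machine with `sbatch` installed.

```python
import re
import subprocess
from pathlib import Path


def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script="run_gridsearch.slurm", log_dir="logs"):
    """Create the log directory, submit the script, and return the job ID."""
    Path(log_dir).mkdir(exist_ok=True)  # the --output/--error directory must exist
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# Parsing the message shown in the cell output above
print(parse_job_id("Submitted batch job 202646"))
```

With the job ID in hand, the log files for this run are `logs/gridsearch_202646.out` and `logs/gridsearch_202646.err`.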
