# Overview
This notebook shows how to use the distributed environment in the `DistributedJobs` module, which leverages the base Julia `Distributed` package to run jobs across multiple processes. Each job is just a function that takes an `Array` input and produces an `Array` output. 

# Distributed Environment Setup 
First define the number of processes to be used and include the required modules in each of the processes.

In [1]:
# Set the number of processes
NUM_PROCS = 4

# Load Distributed and launch the worker processes
using Distributed
addprocs(NUM_PROCS)

4-element Vector{Int64}:
 2
 3
 4
 5

In [2]:
# Add packages to all processes
@everywhere include("../src/industrial_stats.jl")
@everywhere using .IndustrialStats

# Example Distributed Job
Suppose I wish to run an optimization algorithm against $10{,}000$ randomly initialized designs on the simplex, spreading the computation over all four of my available processes. To do this, I can generate large batches of designs and then optimize the batches in parallel, using a different process for each batch. If I choose a batch size of $500$, for example, there are $10{,}000/500 = 20$ batches in total, so each process will handle $5 = 10{,}000/(4\cdot 500)$ batches, or $2500$ design matrices.
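
The batch arithmetic above can be checked with a few lines of Julia. This is only a sketch: the variable names are illustrative, and it assumes the batches are spread evenly across the processes.

```julia
# Worked arithmetic for the batching scheme (illustrative names, even split assumed)
num_samples = 10_000   # total number of designs to optimize
batch_size  = 500      # designs per batch
num_procs   = 4        # worker processes

num_batches         = num_samples ÷ batch_size          # batches in total
batches_per_process = num_batches ÷ num_procs           # batches handled by each process
designs_per_process = batches_per_process * batch_size  # designs handled by each process

println((num_batches, batches_per_process, designs_per_process))
```

Integer division (`÷`) keeps every count an `Int`, which matters when the counts are later used as range endpoints or in job names.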

In addition to choosing the batch size, I must create a list of jobs to run. A job is a pair of functions $(f, g)$:
- $f: \mathbb N \to \mathbb R^{n \times N_1\times K_1}$ is a generator function that takes the batch size $n$ as input and produces a tensor of shape $n \times N_1\times K_1$, where $N_1$ and $K_1$ are the first and second dimensions of each matrix in the batch.
- $g: \mathbb R^{n \times N_1\times K_1} \to \mathbb R^{n \times N_2\times K_2}$ is a processing function that takes a batch and produces a new tensor, possibly with a different shape.

Each job uses the generator function to create the input, then applies the processing function. In other words, the output of a job is $(g \circ f)(n) = g(f(n))$.

These functions must be defined inside `@everywhere` blocks so that every worker process has them available.
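
As a minimal sketch of this pattern, independent of the `DistributedJobs` API, the snippet below defines a toy generator and a toy processor on every process and evaluates their composition with `pmap`. The names `toy_generator`, `toy_processor`, and `toy_job` are hypothetical and exist only for illustration; `pmap` falls back to the current process when no workers have been added.

```julia
using Distributed

# Define both halves of the job on every available process
@everywhere begin
    # f: hypothetical generator returning an n×N1×K1 tensor of uniform samples
    toy_generator(n, N1, K1) = rand(n, N1, K1)

    # g: hypothetical processor that sums over the last dimension,
    # so the output shape (n×N1×1) differs from the input shape
    toy_processor(batch::Array) = sum(batch; dims = 3)

    # A job applies g to the output of f
    toy_job(n, N1, K1) = toy_processor(toy_generator(n, N1, K1))
end

# Run four independent jobs; each one is evaluated on whichever process is free
outputs = pmap(_ -> toy_job(5, 7, 3), 1:4)
output_sizes = size.(outputs)
```

Without the `@everywhere` block, the workers would not know `toy_job` and the `pmap` call would fail on any process other than the one where the functions were defined.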

For this example, I will use some existing code from the repository to generate mixture designs for a first-order Scheffé model; the actual $f$ and $g$ implementations can be anything as long as they operate on Julia `Array`s.

## Implement Data Generator & Processing Function

In [6]:
@everywhere begin 
    # Returns a function that produces batch_size randomly initialized mixture designs
    # This is the f batch generator function
    function create_design_generator(N1, K1, model_builder; batch_size=50)
        # Init is a function that produces (batch_size)xN1xK1 tensors
        init = IndustrialStats.DesignInitializer.initializer(N1, K1, model_builder; type = "mixture")

        # Each time the generator is invoked, it will in turn invoke the init function to create a new batch
        return () -> init(batch_size)
    end

    # Simpler generator example for basic uniform samples 
    # Another example of f 
    function create_random_generator(N1, K1, batch_size)
        return () -> rand(batch_size, N1, K1)
    end

    # Stand-in optimizer function
    # Replace with any function that accepts Array input
    # This is a g batch processing function
    function my_optimizer(data::Array)
        return data
    end
end

## Create & Run Jobs

In [None]:
# Configure experiment settings
N = 7
K = 3

# Use the first-order Scheffé model from the ModelBuilder module
model_builder = IndustrialStats.ModelBuilder.scheffe(1)

# Define the total number of samples and the batch size
num_samples = 10_000
batch_size = 500

# Define data output location
# Data is saved to the disk and can be loaded later on
path_prefix = "../data"

# Function that maps indices to jobs
job_creator = (idx) ->
    IndustrialStats.DistributedJobs.create_job(
        my_optimizer,                                                           # processing function (g)
        create_design_generator(N, K, model_builder; batch_size = batch_size);  # tensor-generating function (f)
        name = "compute_job_$idx"
    )

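# Create vector of jobs (integer division keeps the job indices as Ints,
# so the names are "compute_job_1" rather than "compute_job_1.0")
jobs = map(job_creator, 1:(num_samples ÷ batch_size))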

# Run jobs
results = IndustrialStats.DistributedJobs.run_jobs(jobs; path_prefix=path_prefix)

## Load & Analyze
The stored data can be loaded for analysis using HDF5.

In [8]:
results = IndustrialStats.DistributedJobs.load_and_concatenate(results)
size(results)

(10000, 7, 3)
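
One simple sanity check on a tensor of this shape is to confirm that every design point lies on the simplex. The sketch below uses a small stand-in array rather than the real results, and it assumes that the third dimension indexes the mixture components, so each row of each design should be non-negative and sum to one.

```julia
# Stand-in for the loaded results: 4 designs of size 7×3,
# normalized so that each row sums to one (assumed layout)
stand_in = rand(4, 7, 3)
stand_in ./= sum(stand_in; dims = 3)

# Sum over the component dimension; every entry should be (numerically) one
row_sums = sum(stand_in; dims = 3)

# A design point is on the simplex if it is non-negative and sums to one
on_simplex = all(x -> isapprox(x, 1.0; atol = 1e-8), row_sums) && all(>=(0), stand_in)
println(on_simplex)
```

Replacing `stand_in` with the loaded `results` array would apply the same check to the actual designs, provided the assumed layout holds.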