# Objective
This notebook proposes a parameter searching class for the StarFISH pipeline.

# Overview
In short, the ParamSearch class runs a StarFISH analysis pipeline and evaluates the results over a grid of parameters. The pipeline is defined using a list of dictionaries and a custom objective function can be applied. ParamSearch should be able to test any pipeline component that adheres to the StarFISH standards. In addition to being useful for visualizing and gaining an intuition for the effect of parameters, ParamSearch may also be a core component in a parameter learning/optimization tool.

### Requirements
The ParamSearch class should...
* evaluate analysis results across a parameter space
* be compatible with all filter and spot detection modules in the StarFISH pipeline
* support concurrent evaluation of parameter sets

### Approach

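**Inputs**
* stack \[Stack\]: Stack object containing the images to be processed
* components \[list\]: a list of dictionaries describing the modules to include and the parameters to test (in order of execution).
* objective \[function\]: a function that takes the processed Stack and returns a value to record for that set of parameters.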

**Outputs**
* results \[list\]: all of the outputs of the objective function.

**Public Methods**
* constructor: saves the inputs, creates the parameter grid, and attaches the objective function
* run(): loops through the grid of parameters and returns the results of the objective function

**Private Methods**
* _attachObjective(objective): saves the objective function to a property. May include some checks in the future.
* _createParamGrid(): generates the iterable grid of parameters
* _runPipeline(stack, params): executes one iteration of the pipeline using a set of parameters. Putting this in its own method should allow for parallelization.
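
Because `_runPipeline` handles one parameter set at a time, the grid can in principle be mapped over a pool of workers. The following is only a minimal sketch of that idea, not the ParamSearch implementation: `make_grid` and `run_pipeline` are hypothetical stand-ins (the latter for `_runPipeline`), the data is a toy list rather than a Stack, and a thread pool is assumed here for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def make_grid(params):
    """Expand {'name': [values, ...]} into a list of {'name': value} dicts."""
    names = sorted(params)
    combos = product(*(params[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def run_pipeline(data, params):
    """Hypothetical stand-in for _runPipeline: process a copy, then score it."""
    scaled = [x * params['scale'] for x in data]
    return sum(1 for x in scaled if x > params['threshold'])


data = [1, 2, 3, 4]
grid = make_grid({'scale': [1, 2], 'threshold': [2, 5]})

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in the same order as the grid,
    # so results[i] belongs to grid[i]. Each call gets its own copy of data.
    results = list(executor.map(lambda p: run_pipeline(list(data), p), grid))

print(results)
```

A process pool would follow the same pattern, but the function and its arguments would then need to be picklable.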

# Discussion
* Currently, we use scikit-learn's ParameterGrid to create an iterable grid of parameters. The input argument is a dict, which works well, but if two pipeline steps have the same parameter names, we cannot add both to the same dict. Should we prepend each step's parameters with some sort of step ID to prevent this? It's pretty straightforward to strip the ID again when we look up the params for each step during a run, but perhaps a bit messy.
* I haven't implemented the spot detectors in the _runPipeline() method because the API hasn't been set. I suspect the way we pass the results to the objective function will be a bit different after spot detection. The plan is to branch on the component's base class, since the pipeline asserts that all spot detectors and filters inherit from their respective base classes.
* What do you think is the best way to load the appropriate modules? Currently, we dynamically load them, which ensures only the relevant modules are loaded. I think the Python interpreter is smart and will not reload the module if it has already been loaded, but we should probably verify there isn't significant overhead to this implementation.
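
On the module-loading question, a quick check with a standard-library module suggests that repeated `import_module` calls return the module already cached in `sys.modules` rather than reloading it. The sketch below uses a hypothetical `load_component` helper and a stand-in component dictionary pointing at `collections.OrderedDict`; it does not measure the overhead for the StarFISH modules themselves.

```python
import sys
from importlib import import_module


def load_component(component):
    """Resolve a component dictionary to the class it names."""
    module = import_module(component['module'])
    return getattr(module, component['class'])


# Hypothetical component that points at a standard-library class
component = {'module': 'collections', 'class': 'OrderedDict'}

first = load_component(component)
second = load_component(component)

# Both lookups return the very same class object
same_class = first is second
# The module is held in the interpreter's module cache
same_module = import_module('collections') is sys.modules['collections']

print(same_class, same_module)
```

If the remaining lookup cost ever matters, the classes could be resolved once in the constructor instead of on every grid point.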

In [1]:
from importlib import import_module
from copy import deepcopy
from tqdm import tqdm

class ParamSearch:
    def __init__(self, stack, components, objective):
        # todo: add error checking on inputs
        
        # Save the input parameters
        self.stack = stack
        self.components = components
        
        # Create the parameter grid
        self._createParamGrid()
        
        # Attach the objective function
        self._attachObjective(objective)
        
    def run(self):
        # Check that the param grid has been made
        if getattr(self, 'param_grid', None) is None:
            raise RuntimeError('The parameter grid has not been created.')
            
        # TODO: add option to use concurrent futures module
        # Run through the pipeline
        results = []
        
        for params in tqdm(self.param_grid):    
            results.append(self._runPipeline(deepcopy(self.stack), params))
        
        #results = self._runParallel()
        
        return results
    
    def _runParallel(self):
        # TODO: test/fix
        import concurrent.futures
        from itertools import repeat
        
        results = []
        
        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Pass the method itself (not its result) and collect the results inside the with block
            results = list(executor.map(self._runPipeline, repeat(deepcopy(self.stack), len(self.param_grid)), self.param_grid))
                
        return results
            
    def _attachObjective(self, fcn):
        self.objective = fcn
        
    def _createParamGrid(self):
        from sklearn.model_selection import ParameterGrid
        
        # Create a dictionary of all params from the components list
        # TODO: prepend UID for each pipeline step to prevent collision of keys
        self._params = {}
        for component in self.components:
            self._params.update(component['parameters'])
        
        # Create the iterable grid of parameters
        self.param_grid = ParameterGrid(self._params)
    
    def _runPipeline(self, stack, params):
        # Run through the pipeline
        
        for idx, component in enumerate(self.components):
            # Get the parameters for this module
            # TODO: add UID for each pipeline step to prevent collision of keys
            keys = list(component['parameters'].keys())
            component_params = dict((k, params[k]) for k in keys)
            
            # Get the pipeline component class
            cls = getattr(import_module(component['module']),
                         component['class'])
            
            # Instantiate the pipeline component object
            obj = cls(**component_params)
            
            # Perform the operation on the image
            # TODO: add ability to run spot detectors - need interface to be unified
            obj.filter(stack)
            
        # Evaluate the result
        res = self.objective(stack)
        
        return res
            

# Pipeline definition
The pipeline is defined as a list of dictionaries that describe each component. Each dictionary must have the following keys:
* module: a string with the module to load
* class: the name of the class to load
* parameters: a dictionary where each key is a parameter name and each value is a list of values to create the grid across.

In [2]:
pipeline_step1 = {'module': 'starfish.pipeline.filter.white_tophat',
                 'class': 'WhiteTophat',
                 'parameters': {'disk_size': [1, 2, 3]}}


components = [pipeline_step1]
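
With more than one step, two components may use the same parameter name and would overwrite each other in a single dict. One possible fix, sketched below under assumptions, is to prefix each key with the step index and strip the prefix again when a step is run. `build_grid` and `params_for_step` are hypothetical helpers, the second component is invented for illustration, and `itertools.product` stands in for ParameterGrid so the example is self-contained.

```python
from itertools import product


def build_grid(components):
    """Build a grid whose keys are prefixed with the step index."""
    flat = {}
    for idx, component in enumerate(components):
        for name, values in component['parameters'].items():
            flat[f'{idx}:{name}'] = values
    keys = list(flat)
    return [dict(zip(keys, combo)) for combo in product(*flat.values())]


def params_for_step(idx, params):
    """Recover the un-prefixed parameters of one step from a grid point."""
    prefix = f'{idx}:'
    return {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}


# Two steps that share the parameter name 'disk_size' (second step is hypothetical)
example_components = [
    {'module': 'starfish.pipeline.filter.white_tophat',
     'class': 'WhiteTophat',
     'parameters': {'disk_size': [1, 2]}},
    {'module': 'hypothetical.filters',
     'class': 'HypotheticalFilter',
     'parameters': {'disk_size': [5, 7]}},
]

grid = build_grid(example_components)
print(len(grid))
print(params_for_step(1, grid[1]))
```

The prefixed keys would also work as input to ParameterGrid, since it only requires a dict of lists.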

# Objective function definition
Here we define a function to evaluate the results of each set of parameters. For now, I have set it as the number of spots detected. In the future if we have some reference data (e.g., a crowd-sourced annotation), we can calculate precision/recall. I look forward to discussing the best metrics to use.

In [3]:
from starfish.constants import Indices
from trackpy import locate
from showit import image

def testObjective(stack):
    ch1 = stack.image.max_proj(Indices.Z)[0, 0]
    
    results = locate(ch1, diameter=3, minmass=7500, maxsize=3, separation=5, preprocess=False, percentile=10) 
    results.columns = ['y', 'x', 'intensity', 'r', 'eccentricity', 'signal', 'raw_mass', 'ep']

    return len(results)
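
If reference annotations become available, the objective could report precision and recall instead of a raw count. The sketch below is one simple way to do that, not a settled metric: `precision_recall` is a hypothetical helper, spots are assumed to be `(y, x)` tuples, the coordinates are made up, and the greedy nearest-neighbour matching with a 2-pixel tolerance is an assumption chosen for brevity.

```python
import math


def precision_recall(detected, reference, tolerance=2.0):
    """Greedily match detected spots to reference spots within `tolerance` pixels."""
    unmatched = list(reference)
    true_positives = 0
    for y, x in detected:
        # Closest reference spot that has not been matched yet (None if none remain)
        best = min(unmatched,
                   key=lambda r: math.hypot(r[0] - y, r[1] - x),
                   default=None)
        if best is not None and math.hypot(best[0] - y, best[1] - x) <= tolerance:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up coordinates for illustration only
reference = [(10, 10), (20, 20), (30, 30), (40, 40)]
detected = [(10, 11), (21, 20), (50, 50)]

precision, recall = precision_recall(detected, reference)
print(precision, recall)
```

Here two of the three detections match a reference spot, and two of the four reference spots are found.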

# Executing the run method
To run the search, load an image into a Stack, pass the Stack object, pipeline components, and objective function to the ParamSearch constructor, and call the run() method.

In [4]:
from starfish.io import Stack
from starfish.constants import Indices

# Load the test images
experiment_json = './output/experiment.json'

s = Stack()
s.read(experiment_json)

In [5]:
searcher = ParamSearch(s, components, testObjective)
n_spots = searcher.run()

print(n_spots)

100%|██████████| 3/3 [00:03<00:00,  1.17s/it]

[60, 331, 1400]
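
Since run() returns only the objective values, it can help to pair each value with the parameters that produced it. The sketch below uses a plain list of dicts as a stand-in for `searcher.param_grid` and assumes the grid is iterated in the order the values were listed; with the real objects, `zip(searcher.param_grid, n_spots)` should give the same pairing because run() iterates over the same grid.

```python
# Stand-in for searcher.param_grid: one dict per grid point, in run order (assumed)
param_grid = [{'disk_size': 1}, {'disk_size': 2}, {'disk_size': 3}]
n_spots = [60, 331, 1400]  # the values printed above

# Pair every objective value with the parameters that produced it
paired = list(zip(param_grid, n_spots))

for params, count in paired:
    print(params, '->', count)
```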



