# Maxwell A. Fine 2024-07-10

This notebook tries running `predict.py` from [fetch](https://github.com/devanshkv/fetch/tree/tf2) from within a Python script, then moving the 'good' candidates into another directory and making diagnostic `.png` files.

### Fetch
`Fetch` is a machine learning approach to classifying pulse candidates as real or RFI; see the paper [here](https://arxiv.org/abs/1902.06343).

The `Fetch` package on GitHub comes with a script, `predict.py`, that works on `.h5` files of a predetermined shape. It needs a data directory argument (`-c`, `--data_dir`) and a model argument (`-m`, `--model`) to run. The output is a `.csv` file listing each candidate file with the label assigned by fetch (1 = good, 0 = bad).
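
As a minimal sketch of consuming that output, the snippet below reads a results file with the standard `csv` module and splits the candidates by label. The file name `results_a.csv` and the column names `candidate`, `probability` and `label` are assumptions here, not something this notebook confirms; check them against the file `predict.py` actually writes. `split_by_label` is a hypothetical helper.

```python
import csv
import os
import tempfile


def split_by_label(csv_path, name_col="candidate", label_col="label"):
    """Return (good, bad) lists of candidate names from a fetch-style results CSV.

    The column names are assumptions; adjust them to match the real file.
    """
    good, bad = [], []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            # Labels may be written as "1" or "1.0", so parse via float first.
            is_good = int(float(row[label_col])) == 1
            (good if is_good else bad).append(row[name_col])
    return good, bad


# Self-contained demo on a made-up results file (not real fetch output).
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "results_a.csv")
with open(demo_csv, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["candidate", "probability", "label"])
    writer.writerow(["cand_001.h5", 0.98, 1.0])
    writer.writerow(["cand_002.h5", 0.03, 0.0])
    writer.writerow(["cand_003.h5", 0.71, 1.0])

good, bad = split_by_label(demo_csv)
print(f"{len(good)} good, {len(bad)} bad")
```

Reading by column name rather than by position keeps the sketch working if the real file carries extra columns.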

`Fetch` runs on a GPU, using CUDA. 


### Task:
I want to run `fetch` from inside another Python script. This gives us two possibilities:
- write a simple `subprocess` command to execute the script
- import the required functions to run the `main` function in `predict.py` and use it natively in python


### Notes from my use of `fetch` so far:
- There is a comparatively long start-up time when running; I think it is spent setting up the GPU and moving the data over?
- Fetch runs pretty fast per iteration, ~800-1000 candidates / minute.


### Idea:
- It is probably more **pythonic** to import the functions to run `fetch` inside of another script, but running it via a `subprocess` call is faster to implement and will perform about the same. Since the pipeline is already making `subprocess` calls, I will use this method.
    - Using a `subprocess` call might also be better for other users and for future improvements, as it uses all the same commands as `predict.py`.

- We could cut out much of the start-up time by running `fetch` just once per observation (as opposed to once per `.fil` file). But then we couldn't run the pipeline in **real** time. I favor running it in real time, so we will run it once per `.fil` file.
    - The pipeline is running in parallel; how does our GPU handle it?

- There already exists a `move_candidates.py` script in the [pipeline GitLab repo](https://gitlab.camras.nl/dijkema/frbscripts).
    - I can maybe make a wrapper function around this, and then just import it in the `check_frb.py` script.
- Similarly, use the `plot_h5` code (more of a call to another function) to make the diagnostic plots.
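
The sorting step can be sketched as below. This is not the code in `move_candidates.py`, whose contents are not shown in this notebook; `sort_candidates` and its `labels` argument are hypothetical names, and the `good`/`bad` subdirectory layout simply follows the description above.

```python
import os
import shutil
import tempfile


def sort_candidates(process_dir, labels):
    """Move each candidate file into process_dir/good or process_dir/bad.

    labels: dict mapping file name -> 1 (good) or 0 (bad).
    Hypothetical helper, not the real move_candidates.py.
    Returns the number of files moved as (n_good, n_bad).
    """
    counts = {"good": 0, "bad": 0}
    for sub in counts:
        os.makedirs(os.path.join(process_dir, sub), exist_ok=True)
    for name, label in labels.items():
        base = os.path.basename(name)
        src = os.path.join(process_dir, base)
        if not os.path.isfile(src):
            continue  # already moved or missing: skip rather than fail
        sub = "good" if int(label) == 1 else "bad"
        shutil.move(src, os.path.join(process_dir, sub, base))
        counts[sub] += 1
    return counts["good"], counts["bad"]


# Demo with empty placeholder files in a temporary directory.
demo_process_dir = tempfile.mkdtemp()
for fname in ("cand_001.h5", "cand_002.h5"):
    open(os.path.join(demo_process_dir, fname), "w").close()

demo_labels = {"cand_001.h5": 1, "cand_002.h5": 0, "cand_404.h5": 1}
n_good, n_bad = sort_candidates(demo_process_dir, demo_labels)
print(n_good, n_bad)
```

Taking the directory as an argument, rather than relying on the current working directory, would make such a wrapper easier to import into `check_frb.py`.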




In [41]:
import subprocess
import os
from move_candidates import move_candidates # modified 
from plot_h5 import process_files_in_parallel # modified 


In [42]:
def run_predict_and_move(data_dir, model='a'):
    '''
    Runs the `fetch` program `predict.py` using a subprocess call in /process, and then moves
    the candidate files into a /process/good and a /process/bad directory based on the prediction.
    Diagnostic png plots are produced from the `.h5` files in the /process/good dir.

    Args:
    data_dir: (str), the data_dir argument for `predict.py`, it should be the /process dir if using `check_frb.py`

    model: (str), the model argument for `predict.py`, default is 'a'.
    '''
    # predict.py and move_candidates() have to be run in the /process dir.
    # Change into it here, so the function does not depend on a global
    # `data_dir` having been defined by an earlier cell.
    data_dir = os.path.abspath(data_dir)
    os.chdir(data_dir)

    # Define the path to the script
    predict_script = "predict.py"

    # Arguments for predict.py
    predict_args = [
        predict_script,
        "-c", data_dir,
        "-m", model,
        "--verbose"
    ]

    # Run predict.py
    print('Running predict.py (fetch)')
    result = subprocess.run(predict_args, capture_output=True, text=True)
    # TODO Max Fine Jul 10 2024, this works but it would be nice if stdout was printed as it was made
    if result.returncode != 0:
        print(f"Error running predict.py: {result.stderr}")
        return
    print(result.stdout)

    # Move candidates into /good and /bad
    move_candidates()  # Has to be run in the /process dir

    # Convert the .h5 files in the good directory to .png
    good_dir = os.path.join(data_dir, "good")
    if not os.path.isdir(good_dir):
        print("No good directory found, nothing to convert.")
        return
    good_h5_files = [os.path.join(good_dir, f) for f in os.listdir(good_dir) if f.endswith('.h5')]
    if good_h5_files:
        print(f'Converting {len(good_h5_files)} .h5 files to .png')
        process_files_in_parallel(good_h5_files)
    else:
        print("No .h5 files found in the good directory to convert.")
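
The TODO in the cell above (printing stdout as it is produced) could be handled with `subprocess.Popen` instead of `subprocess.run`. A minimal sketch, where `run_streaming` is a hypothetical helper and a small inline Python command stands in for `predict.py`:

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echoing each output line as it arrives.

    Returns (returncode, lines). stderr is merged into stdout so that
    error messages appear in order with the normal output.
    """
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    ) as proc:
        for line in proc.stdout:
            print(line, end="", flush=True)
            lines.append(line.rstrip("\n"))
    # Leaving the `with` block waits for the child, so returncode is set.
    return proc.returncode, lines


# Stand-in for predict.py: a child process printing three lines with a delay.
child_code = (
    "import time\n"
    "for i in range(3):\n"
    "    print('batch', i, flush=True)\n"
    "    time.sleep(0.1)\n"
)
returncode, lines = run_streaming([sys.executable, "-c", child_code])
```

If the child is itself a Python script, it may buffer its own output when writing to a pipe; running the interpreter with `-u` or setting `PYTHONUNBUFFERED=1` in the child's environment is the usual remedy.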


    
    



In [43]:
# Example usage
data_dir = '/data/frb/maxfinetmp/process/bad'
run_predict_and_move(data_dir, "a")

Running predict.py (fetch)


No .h5 files found in the good directory to convert.


In [3]:
help(subprocess.run)

Help on function run in module subprocess:

run(*popenargs, input=None, capture_output=False, timeout=None, check=False, **kwargs)
    Run command with arguments and return a CompletedProcess instance.
    
    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
    or pass capture_output=True to capture both.
    
    If check is True and the exit code was non-zero, it raises a
    CalledProcessError. The CalledProcessError object will have the return code
    in the returncode attribute, and output & stderr attributes if those streams
    were captured.
    
    If timeout is given, and the process takes too long, a TimeoutExpired
    exception will be raised.
    
    There is an optional argument "input", allowing you to
    pass bytes or a string to the subprocess's stdin.  If you use this argument
    

In [10]:
ls

fetch_python.ipynb  move_candidates.py*


In [34]:
pwd

'/data/frb/maxfinetmp/process/bad'

In [35]:
cd ..

/data/frb/maxfinetmp/process



In [36]:
cd /home_local/frb/max

/


In [39]:
cd /home_local/maxfine/subproccess_fetch

/home_local/maxfine/subproccess_fetch



In [40]:
pwd

'/home_local/maxfine/subproccess_fetch'