Mode solver run_batch #1685

Merged 1 commit from tyler/mode/batchsolver2 into pre/2.7 on May 24, 2024

Conversation

@tylerflex (Collaborator)

@prashkh I took the changes from your branch (minus all of the black formatting) and put them into this branch.

If you'd like, I can edit your branch to put it in this state so you can wrap it up?

btw the basic idea was (rough git commands sketched below):

  1. I checked out a new branch off of prash/mode/batchsolver2.
  2. I rebased this branch against pre/2.7 and "dropped" the 2nd commit (the one with the changes and also the black formatting).
  3. Then I did git checkout prash/mode/batchsolver2 -- file1 file2 ... for all of the mode solver files you changed.
  4. Committed this and pushed it here.
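
For reference, a rough sketch of that sequence as git commands; the new-branch name and commit message below are illustrative, the file list stays elided as in step 3, and the commit drop happens in the interactive rebase editor:

  git checkout -b tyler/mode/batchsolver2 prash/mode/batchsolver2
  git rebase -i pre/2.7                                   # mark the black-formatting commit as "drop"
  git checkout prash/mode/batchsolver2 -- file1 file2 ...
  git commit -m "mode solver run_batch changes"
  git push origin tyler/mode/batchsolver2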

You could either try something similar on your branch, or I could do it manually for you. Hopefully that makes sense; let me know how you want to proceed.

(And no worries about this, I've done it many times.) Usually before I run black, I make sure I have all of my actual changes committed; that way, if this ends up happening, I can simply drop the black commit.

@prashkh (Contributor) left a review comment

Added a few comments so that we remove any changes to the local mode solver.

@prashkh (Contributor) commented May 7, 2024

Thanks for explaining the detailed process and for taking care of this. Next time it will be a lot smoother :)

@tylerflex (Collaborator, Author)

No problem. I think I forgot to check out the CHANGELOG; should be all set now. Can you do one more check to make sure I got everything? Then maybe we can simply merge this one.

@prashkh (Contributor) left a review comment

Looks great after a final check! Once it is merged to pre/2.7, I will pull it and run some mode solver tests with it just to be 100% sure before the release. Thank you.

@tylerflex (Collaborator, Author)

Hey @prashkh, I noticed the mode solver test is failing on this branch. Any idea why?

@prashkh (Contributor) commented May 7, 2024

It seems like it is failing at this step. Not sure why, since this works fine on my local machine.

  assert all(isinstance(x, ModeSolverData) for x in results)

E assert False
E + where False = all(<generator object test_mode_solver_web_run_batch.. at 0x33085f300>)

@prashkh (Contributor) commented May 7, 2024

Could it be that we need to do import ...web as msweb, so that it references the local web.py file, rather than doing import tidy3d.plugins.mode.web as msweb in the test file?

@tylerflex (Collaborator, Author)

That's weird. It also passed for me, but there were 125 warnings, so I'm not sure about that. Let's see if it passes this time; hopefully it doesn't indicate some deeper issue.

@momchil-flex (Collaborator)

I wonder if something is wrong with the as_completed behavior, in that it doesn't actually wait for the result on some systems? Just a random guess... It seems like the result is not yet returned, as the print statement right above the failing assert shows NoneType for all of the results. Maybe worth a try to see if an explicit .result will fix this?
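
A minimal, runnable sketch of the pattern being suggested (generic Python, not tidy3d code; fake_solve is a hypothetical stand-in for the per-solver web call), using as_completed plus an explicit .result() so the final assert cannot run before the results exist:

from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_solve(index: int) -> str:
    # stand-in for the real per-mode-solver web call
    return f"result_{index}"

with ThreadPoolExecutor(max_workers=4) as executor:
    # map each submitted future back to its index so results keep their order
    futures = {executor.submit(fake_solve, i): i for i in range(8)}
    results = [None] * len(futures)
    for future in as_completed(futures):
        # .result() blocks until this particular future has finished
        results[futures[future]] = future.result()

assert all(isinstance(x, str) for x in results)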

@prashkh (Contributor) commented May 20, 2024

> I wonder if something is wrong with the as_completed behavior, in that it doesn't actually wait for the result on some systems? Just a random guess... It seems like the result is not yet returned, as the print statement right above the failing assert shows NoneType for all of the results. Maybe worth a try to see if an explicit .result will fix this?

In the test_mode_solver_web_run_batch function we set verbose=False in the run_batch call, so the as_completed behavior used for the progress bar is not even triggered. Also, I don't see any error coming from the run function. Maybe the ThreadPoolExecutor used for multithreading does not behave well in different execution environments during testing?

Another idea is to try doing the same thing with joblib. I can implement this today.

@tylerflex (Collaborator, Author)

@prashkh One other thing I noticed is that the default folder handling here is probably not ideal: it automatically creates a directory and a separate folder on the web UI when running a batch. I'd say maybe it's better to leave the defaults? Just my preference / impression though; maybe some users would like this.

@momchil-flex (Collaborator)

> > I wonder if something is wrong with the as_completed behavior, in that it doesn't actually wait for the result on some systems? Just a random guess... It seems like the result is not yet returned, as the print statement right above the failing assert shows NoneType for all of the results. Maybe worth a try to see if an explicit .result will fix this?

> In the test_mode_solver_web_run_batch function we set verbose=False in the run_batch call, so the as_completed behavior used for the progress bar is not even triggered. Also, I don't see any error coming from the run function. Maybe the ThreadPoolExecutor used for multithreading does not behave well in different execution environments during testing?

> Another idea is to try doing the same thing with joblib. I can implement this today.

I think if as_completed doesn't get triggered, then there's nothing to make the threads wait for the result before the assert statement runs? It's surprising that the test would then not fail for you locally, but there could be some differences I don't understand. But yeah, either joblib, or making sure that you explicitly wait for the result (through as_completed or in some other way), would be good to try.

@prashkh (Contributor) commented May 20, 2024

@momchil-flex Converted to joblib with 'threading' as the backend. It works fine locally, so let's see if this fixes it. @tylerflex I also left the save directory at its default, as per your suggestion. Could you please check whether this looks reasonable and whether it passes the tests now when merging? Only this function and the import need to be changed.


from joblib import Parallel, delayed

def run_batch(
    mode_solvers: List[ModeSolver],
    task_name: str = "BatchModeSolver",
    folder_name: str = "BatchModeSolvers",
    results_files: List[str] = None,
    verbose: bool = True,
    max_workers: int = DEFAULT_NUM_WORKERS,
    max_retries: int = DEFAULT_MAX_RETRIES,
    retry_delay: float = DEFAULT_RETRY_DELAY,
    progress_callback_upload: Callable[[float], None] = None,
    progress_callback_download: Callable[[float], None] = None,
) -> List[ModeSolverData]:
    """
    Submits a batch of ModeSolver objects to the server concurrently, manages progress, and retrieves results.

    Parameters
    ----------
    mode_solvers : List[ModeSolver]
        List of mode solvers to be submitted to the server.
    task_name : str
        Base name for tasks. Each task in the batch will have a unique index appended to this base name.
    folder_name : str
        Name of the folder where tasks are stored on the server's web UI.
    results_files : List[str], optional
        List of file paths where the results for each ModeSolver should be downloaded. If None, a default path based on the folder name and index is used.
    verbose : bool
        If True, displays a progress bar. If False, runs silently.
    max_workers : int
        Maximum number of concurrent workers to use for processing the batch of simulations.
    max_retries : int
        Maximum number of retries for each simulation in case of failure before giving up.
    retry_delay : float
        Delay in seconds between retries when a simulation fails.
    progress_callback_upload : Callable[[float], None], optional
        Optional callback function called when uploading file with ``bytes_in_chunk`` as argument.
    progress_callback_download : Callable[[float], None], optional
        Optional callback function called when downloading file with ``bytes_in_chunk`` as argument.


    Returns
    -------
    List[ModeSolverData]
        A list of ModeSolverData objects containing the results from each simulation in the batch. ``None`` is placed in the list for simulations that fail after all retries.
    """
    console = get_logging_console()
    
    if not all(isinstance(x, ModeSolver) for x in mode_solvers):
        console.log("Validation Error: All items in `mode_solvers` must be instances of `ModeSolver`.")
        return []

    num_mode_solvers = len(mode_solvers)

    if results_files is None:
        results_files = [f"mode_solver_batch_results_{i}.hdf5" for i in range(num_mode_solvers)]

    def handle_mode_solver(index, progress, pbar):
        retries = 0
        while retries <= max_retries:
            try:
                result = run(
                    mode_solver=mode_solvers[index],
                    task_name=f"{task_name}_{index}",
                    mode_solver_name=f"mode_solver_batch_{index}",
                    folder_name=folder_name,
                    results_file=results_files[index],
                    verbose=False,
                    progress_callback_upload=progress_callback_upload,
                    progress_callback_download=progress_callback_download,
                )
                progress.update(pbar, advance=1)
                return result
            except Exception as e:
                console.log(f"Error in mode solver {index}: {str(e)}")
                if retries < max_retries:
                    time.sleep(retry_delay)
                    retries += 1
                else:
                    console.log(f"The {index}-th mode solver failed after {max_retries} tries.")
                    progress.update(pbar, advance=1)
                    return None

    if verbose:
        console.log(f"[cyan]Running a batch of [deep_pink4]{num_mode_solvers} mode solvers.\n")
        with Progress(console=console) as progress:
            pbar = progress.add_task("Status:", total=num_mode_solvers)
            results = Parallel(n_jobs=max_workers, backend='threading')(
                delayed(handle_mode_solver)(i, progress, pbar) for i in range(num_mode_solvers)
            )
            # Make sure the progress bar is complete
            progress.update(pbar, completed=num_mode_solvers, refresh=True)
            console.log("[green]A batch of `ModeSolver` tasks completed successfully!")
    else:
        results = Parallel(n_jobs=max_workers, backend='threading')(
            delayed(handle_mode_solver)(i, None, None) for i in range(num_mode_solvers)
        )

    return results
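
For context, a hypothetical usage sketch of the function above. The import path follows the mode solver web module discussed in this thread, the simulation and mode-solver parameters are purely illustrative, and actually running it would need a configured API key:

import tidy3d as td
from tidy3d.plugins.mode import ModeSolver
from tidy3d.plugins.mode.web import run_batch

# minimal waveguide simulation to host the mode planes (illustrative parameters)
sim = td.Simulation(
    size=(4, 3, 3),
    grid_spec=td.GridSpec.auto(min_steps_per_wvl=15, wavelength=1.55),
    structures=[
        td.Structure(
            geometry=td.Box(center=(0, 0, 0), size=(td.inf, 0.5, 0.22)),
            medium=td.Medium(permittivity=3.48**2),
        )
    ],
    run_time=1e-12,
)

# two mode solvers that differ only in the number of modes requested
mode_solvers = [
    ModeSolver(
        simulation=sim,
        plane=td.Box(center=(0, 0, 0), size=(0, 2, 2)),
        mode_spec=td.ModeSpec(num_modes=num_modes),
        freqs=[td.C_0 / 1.55],
    )
    for num_modes in (1, 2)
]

# submit both to the server concurrently; results line up with mode_solvers,
# with None for any solver that failed after all retries
results = run_batch(mode_solvers=mode_solvers, task_name="BatchModeSolver", max_workers=2)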

@tylerflex (Collaborator, Author)

Thanks @prashkh, I implemented the changes in 7b7384a. Unfortunately, I get the same error about None when running locally.

Feel free to check out this branch tyler/mode/batchsolver2 and add whatever commits you need.

@prashkh (Contributor) commented May 21, 2024

@tylerflex I found the problem. When verbose=False, we don't need to update the pbar since it is None :). I committed this change after checking that it runs fine locally. Could you please try to merge and see if it passes? Thank you!
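
A minimal illustration of the kind of guard described above (not the actual commit; the real run(...) call is replaced by a placeholder string so the snippet runs standalone):

def handle_mode_solver(index, progress, pbar):
    result = f"data_{index}"  # stand-in for the real run(...) call
    if progress is not None:  # progress and pbar are None when verbose=False
        progress.update(pbar, advance=1)
    return result

# verbose=False path: no progress bar objects at all, and no update attempted
results = [handle_mode_solver(i, None, None) for i in range(3)]
assert all(r is not None for r in results)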

tylerflex force-pushed the tyler/mode/batchsolver2 branch 2 times, most recently from 1c16469 to c5b3166 (May 22, 2024 11:44)
@tylerflex (Collaborator, Author)

Thanks @prashkh! Yeah, these rich errors can be a bit hard to debug.

I rebased this branch against the latest pre/2.7, updated the changelog, ran black and ruff, and then squashed all of the commits into one. It should be ready to merge once all of the tests pass, so I'll check in later.

@tylerflex (Collaborator, Author)

@prashkh Shoot, looks like it's still failing:
[screenshot of the failing test check]

Could you look into it?

Important note: I force-pushed to this branch, so to get the changes you'd need to run

git checkout tyler/mode/batchsolver2
git reset --hard origin/tyler/mode/batchsolver2

to reset your local copy to the one I pushed here.

@prashkh (Contributor) commented May 22, 2024

@tylerflex I reset my local copy and ran it locally, and it works fine. I think the problem is that when running under pytest it is asking for API access to run the mode solver. How does this work for pytest?

    Error in mode solver 0: API key not found. To get your API key, sign into
    'https://tidy3d.simulation.cloud/' and copy it from your 'Account' page.
    Then you can configure tidy3d through command line 'tidy3d configure' and
    enter your API key when prompted. Alternatively, especially if using
    windows, you can manually create the configuration file by creating a file
    at their home directory '~/.tidy3d/config' (unix) or '.tidy3d/config'
    (windows) containing the following line: apikey = 'XXX'. Here XXX is your
    API key copied from your account page within quotes.

@tylerflex (Collaborator, Author)

@dbochkov-flexcompute @lucas-flexcompute

Sorry to bother you, but we're trying to get this mode solver run_batch functionality in, and the web tests aren't working because the web API mocking is not kicking in for the batch mode solvers.

I spent a good amount of time trying to fix it, but I think there's something I'm not understanding. It looks like both of you worked on these mocked tests for the regular mode solver run. Do you mind taking a look and modifying them to work with the batch mode solver if you get a chance?

18a26d1

@lucas-flexcompute (Collaborator)

I've added the missing response to the tests, but I'm not sure whether you'll want to keep it separate from the existing one or change the run_batch parameters to match the old one (if that is even possible).

@prashkh (Contributor) commented May 23, 2024

I also don't see an explicit test function for the run function under test_mode_solver? If the mode solver run function works, then the only thing that is different about run_batch is that it uses joblib to call the run function in parallel. run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. I wonder if it is better to actually test against the web API rather than the mock API in this case.

@tylerflex (Collaborator, Author)

Thanks @lucas-flexcompute!

@prashkh there is a set of tests for run and run_local, I think. See here for an example.

> run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. I wonder if it is better to actually test against the web API rather than the mock API in this case.

Unfortunately, in my experience, you never really know: e.g. if joblib upgrades and then changes behavior slightly, things could still break. So while it's a bit of extra effort, mocking the web API at least helps us isolate an issue to either the run_batch internals or the web API. That being said, it would be nice if one of the notebooks used the new run_batch feature so we can test it with the real web API from time to time.

@tylerflex (Collaborator, Author)

Another consideration is testing speed: even with mock responses and data for everything, these tests take almost 10 minutes to finish. We also need to run them probably a dozen times a day, pretty much whenever a PR changes. If we used the real web API, they would probably take hours to finish and use a ton of credits.

@prashkh (Contributor) commented May 23, 2024

> Thanks @lucas-flexcompute!

> @prashkh there is a set of tests for run and run_local, I think. See here for an example.

> > run_batch is quite simple, and in my opinion it doesn't make sense to complicate it just so the test works with the mock API. I wonder if it is better to actually test against the web API rather than the mock API in this case.

> Unfortunately, in my experience, you never really know: e.g. if joblib upgrades and then changes behavior slightly, things could still break. So while it's a bit of extra effort, mocking the web API at least helps us isolate an issue to either the run_batch internals or the web API. That being said, it would be nice if one of the notebooks used the new run_batch feature so we can test it with the real web API from time to time.

That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch? I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run(), which I'd need to understand in order to write this test.

@tylerflex (Collaborator, Author)

> That makes sense! So, if test_mode_solver_simple is testing the run function, could joblib parallelization that just calls this function be enough to test run_batch?

So test_mode_solver_simple is also using the mock web API to test the run function; maybe I wasn't super clear about that. It's mostly testing that the mode solver Python front-end interface logic works as expected, given some inputs and outputs from the web API call.

In principle it would be enough to just test the joblib parallelization, except we need to ensure we're calling joblib in exactly the same way as the source code does. In the end it just makes sense to test run_batch itself, for consistency and automation.

> I don't quite understand how mock_remote_api inside test_mode_solver_simple(mock_remote_api, local) is related to msweb.run(), which I'd need to understand in order to write this test.

So when we call the web API, we basically send a dictionary to some URL and get a dictionary back from the server. The web API mocking interrupts this process, so whenever you send a specific dictionary it just immediately returns some "mock" data without actually making the call. The mode solver mocking does this for all of the sub-steps of msweb.run(): upload, download, etc.

I think the problem was that the dictionary sent out by the batch version had different fields from the single mode solver one (e.g. the folder name), so the mocking was getting confused. Basically, we couldn't use the same mocking function for both the single and the batch mode solver. This is what I was trying, unsuccessfully, to figure out yesterday.

You can think of the two mocks as the same, except that they match on the different payloads sent out by the batched mode solver and the regular one. Does that clear it up?
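
An illustrative sketch of that idea only, not the actual tidy3d test code; the URL, payload fields, and the responses-library mechanism here are assumptions used to show how a mock that matches on the outgoing payload can miss a payload with different fields:

import json

import requests
import responses

MOCK_URL = "https://example.invalid/api/modesolver"  # placeholder URL

def mocked_solver_callback(request):
    payload = json.loads(request.body)
    # match only the payload shape the "single" mock expects; a batch call
    # with different fields (e.g. another folder name) would need its own mock
    if payload.get("folder_name") == "Mode Solver":
        return (200, {}, json.dumps({"status": "success", "data": "mock"}))
    return (404, {}, json.dumps({"error": "no mock matched this payload"}))

@responses.activate
def test_mocked_call():
    responses.add_callback(responses.POST, MOCK_URL, callback=mocked_solver_callback)
    reply = requests.post(MOCK_URL, json={"folder_name": "Mode Solver"})
    assert reply.json()["status"] == "success"

test_mocked_call()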

tylerflex changed the title from Tyler/mode/batchsolver2 to Mode solver run_batch (May 23, 2024)
tylerflex added the 2.7 label (will go into version 2.7.*) on May 23, 2024
@tylerflex (Collaborator, Author)

@prashkh I'm going to try to get this feature in soon. Could you take one last look over the changed files and verify it looks OK? Then, assuming all tests pass, I'll merge it.

tylerflex requested a review from prashkh (May 24, 2024 15:25)
@prashkh (Contributor) commented May 24, 2024

> @prashkh I'm going to try to get this feature in soon. Could you take one last look over the changed files and verify it looks OK? Then, assuming all tests pass, I'll merge it.

Thanks for the explanation of the mock API, it makes sense! Also, it seems like the tests passed! Let me check it one more time locally and then we are all good :)

@prashkh (Contributor) commented May 24, 2024

@tylerflex I just tested it locally and the code looks good. Seems like all checks passed as well, so all good now. Thank you!

tylerflex merged commit 50f5e3e into pre/2.7 on May 24, 2024 (16 checks passed)
tylerflex deleted the tyler/mode/batchsolver2 branch on May 24, 2024 16:21
@tylerflex (Collaborator, Author)

Awesome, congrats on your first contribution! 🎉 Sorry it was a little rocky with the testing there.

@prashkh (Contributor) commented May 24, 2024

Thanks for all the help!
