
257 Adding averaging time using timeit #285

Merged
merged 19 commits from 257_timing into master on Nov 12, 2019

Conversation

wathen
Collaborator

@wathen wathen commented Nov 5, 2019

Description of Work

Fixes #257

Allows FitBenchmarking to call a minimizer multiple times and thus calculate an average elapsed time using timeit.
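
For context, a minimal standalone sketch of the averaging approach (fit() below is a toy stand-in for controller.fit(); the actual change is shown in the diff excerpts later in this thread):

import timeit

def fit():
    # toy stand-in for controller.fit()
    sum(range(10000))

num_runs = 5
# average elapsed time per call, in seconds
runtime = timeit.timeit(fit, number=num_runs) / num_runs
print(runtime)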

Testing Instructions

  1. Check the accuracy tables are still the same
  2. New timing results are consistent between runs

Function: Does the change do what it's supposed to?

Tests: Does it pass? Is there adequate coverage for new code?

Style: Is the coding style consistent? Is anything overly confusing?

Documentation: Is there a suitable change to documentation for this change?

@wathen wathen added the Enhancement (New feature or request) label Nov 5, 2019
@wathen wathen requested a review from TOFarmer November 5, 2019 10:48
Contributor

@TOFarmer TOFarmer left a comment

The problem with where the default for num_runs is defined comes back to our discussions around where default options should be defined in general. It is not necessarily a big concern for this PR, but if there are any other PRs where we might be changing the default options, it would be a good idea to refactor the default definitions first.

``num_runs``
-------------------

Number of runs is used to define how many times FitBenchmarking calls a minimizer and thus calculates an average elapsed time using `timeit`.
Contributor

Suggested change
Number of runs is used to define how many times FitBenchmarking calls a minimizer and thus calculates an average elapsed time using `timeit`.
Number of runs defines how many times FitBenchmarking calls a minimizer and thus calculates an average elapsed time using `timeit`.

Collaborator Author

Corrected

"""
Fit benchmark one problem, with one function definition and all
the selected minimizers, using the chosen fitting software.

@param controller :: The software controller for the fitting
@param minimizers :: array of minimizers used in fitting
@param num_runs :: number of times controller.fit() is run to
Contributor

I think the docstring standardisation should include types for parameters (see #236), particularly if functions are part of the API (which this might be). Probably a good opportunity to add types for existing parameters as well.

Collaborator Author

I have gone through fitbenchmarking/fitbenchmark_one_problem.py to standardise the docstrings. I was not sure what to do with multiple return arguments, so let me know what you think of how I did it.

Contributor

It's a bit of a tricky one using the sphinx docstring style, as it doesn't deal with multiple return types as well as the numpy style does. However, I would be inclined either to put :rtype :: tuple and then define the individual types within :return :: (as is already the case), or at least to put brackets around what you currently have (so :rtype :: (list of FittingResult, plot_helper.data instance)). The reason for this is to differentiate it from the situation where you actually have a single object returned but the type can vary (e.g. if you had a conditional where the if branch returned list and the else branch returned str). What do you think?

Collaborator Author

I think :rtype :: (list of FittingResult, plot_helper.data instance) is the best option for this; it makes it clearer.

Contributor

Sounds good to me.
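
For illustration, a docstring in the agreed style on a hypothetical function might read as follows (the function, parameter names and exact field syntax here are illustrative only, not the PR's code):

def fit_one_problem(controller, minimizers, num_runs):
    """
    Fit one problem with all the selected minimizers.

    :param controller: the software controller for the fitting
    :type controller: Controller
    :param minimizers: minimizers used in fitting
    :type minimizers: list of str
    :param num_runs: number of times controller.fit() is run
    :type num_runs: int

    :return: the fitting results and the data used for plotting
    :rtype: (list of FittingResult, plot_helper.data instance)
    """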

"""
Sets up the controller for a particular problem and fits the models
provided in the problem object. The best fit, along with the data and a
starting guess is then plotted on a visual display page.

@param user_input :: all the information specified by the user
@param problem :: a problem object containing information used in fitting
@param num_runs :: number of times controller.fit() is run to
Contributor

Same as comment about docstrings below.

Collaborator Author

See above

start_time = time.time()
controller.fit()
end_time = time.time()
runtime = timeit.timeit(controller.fit, number=num_runs) / num_runs
Contributor

Returning from timeit.timeit here does not reveal whether any of the calls to controller.fit have been cached. While this may not be likely, it is possible that we could inadvertently cache the result from the controllers we have implemented, and it is definitely possible that a contributor might make this mistake; this would result in significantly reduced average runtimes. I tested this with a toy problem to confirm.

timeit.repeat returns a list of n repeats (each of which is the total time for number executions, so you might want to set number to 1 rather than the default of 1000000). This could then be used to suggest whether caching has occurred, which could result in a warning. It would also allow the uncertainty on the runtime to be estimated, which might be a useful additional property for a user to have access to (although probably not by default). That is not for this PR, but it would at least make it easier to implement in the future.
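
A minimal standalone sketch of that suggestion (fit() is a toy stand-in for controller.fit(); num_runs plays the role of repeat):

import timeit

def fit():
    # toy stand-in for controller.fit()
    sum(range(10000))

num_runs = 5
# number=1 means each entry in the list is the time for a single call
runtime_list = timeit.repeat(fit, number=1, repeat=num_runs)
runtime = sum(runtime_list) / num_runs  # average runtime
spread = max(runtime_list) / min(runtime_list)  # a large spread may indicate caching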

Collaborator Author

Good point. I checked the timings using timeit.timeit and timeit.repeat and they seem to be pretty much the same. However, I have changed the code to use timeit.repeat.

Contributor

Yeah, they should be exactly the same without caching; this just allows us to catch it. So maybe we should provide a warning if the stdev is 'large'? I don't really mind what is determined to be 'large'.

Collaborator Author

I've added a check to see if the standard deviation is high:

cv = np.std(runtime_list) / np.mean(runtime_list)
if cv > 0.2:
    raise Warning('The ratio of the standard deviation to the mean'
                  ' is larger than {}, which may indicate that caching'
                  ' is used in the timing results.'.format(0.2))

Do you think 0.2 is an OK threshold?

start_time = time.time()
controller.fit()
end_time = time.time()
runtime = timeit.timeit(controller.fit, number=num_runs) / num_runs
except Exception as e:
print(e.message)
Contributor

Will every e have a message attribute? Otherwise this could raise an AttributeError. Do we want to create an issue for this? Are there associated tests for the exceptional behaviour?

Collaborator Author

I thought it did, but I'm not 100% sure. At the moment there are no specific tests for this file other than checking that the example scripts run.

Contributor

Ah ok, I've found the problem: Exception.message was deprecated in Python 2.6, so while it still works in 2.7 it is undocumented, and in 3+ it is unsupported. The new form is that casting the Exception to str returns the message, so this should be changed to print(str(e)). Do you want to do that here, or would you prefer me to create a separate issue?

Collaborator Author

I think it's best if we do that here, otherwise it might be put at the bottom of the issue pile in the future. Do you mind changing it?

Contributor

No problem, I've done it and made it PEP 8 compliant (albeit in one case by disabling a pylint warning, although I think that is probably justified here).
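
For reference, the change amounts to the pattern below (a sketch; fit() stands in for the controller.fit() call in the excerpt above):

def fit():
    # stand-in for controller.fit(); raise to exercise the except branch
    raise RuntimeError('minimizer failed')

try:
    fit()
except Exception as e:
    # str(e) returns the message in both Python 2.7 and 3+,
    # unlike the deprecated e.message
    print(str(e))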

"""
Gather the user input and list of paths. Call benchmarking on these.

@param group_name :: is the name (label) for a group. E.g. the name for the group of problems in
Contributor

If it's not too irritating, I'd suggest adding types for these parameters and the return.

Collaborator Author

done

minimizers, software = misc.get_minimizers(software_options)
num_runs = software_options.get('num_runs', None)

if num_runs is None:
Contributor

This seems like a lot of work just to set a single variable. I could be wrong, but it appears that some of this is redundant. Could you please explain the process you are following to set num_runs?

Also it appears that the default of 5 for num_runs gets set in two places: here and in the default options json. This probably shouldn't be the case, as it might produce unexpected behaviour for a user if they created their own default without num_runs (or with num_runs=None).

Collaborator Author

At the moment, users can define software_options as a dictionary. If num_runs is not defined in the dictionary, then fitbenchmarking.utils.options is used to look up the default from the json file. This is similar to how comparison_mode is set:

minimizers, software = misc.get_minimizers(software_options)
comparison_mode = software_options.get('comparison_mode', None)
if comparison_mode is None:
    if 'options_file' in software_options:
        options_file = software_options['options_file']
        comparison_mode = options.get_option(options_file=options_file,
                                             option='comparison_mode')
    else:
        comparison_mode = options.get_option(option='comparison_mode')
    if comparison_mode is None:
        comparison_mode = 'both'

I have removed the second default option of 5 for num_runs.
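
For illustration, the equivalent num_runs lookup following that pattern would look roughly like this (a sketch mirroring the excerpt above, assuming the same software_options dictionary and options module, and not necessarily the exact merged code):

num_runs = software_options.get('num_runs', None)
if num_runs is None:
    if 'options_file' in software_options:
        options_file = software_options['options_file']
        num_runs = options.get_option(options_file=options_file,
                                      option='num_runs')
    else:
        num_runs = options.get_option(option='num_runs')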

Contributor

Ok, so I think I've figured out why this is confusing me:

Line 42 can only be True if software_options['num_runs'] is None, which seems like a strange situation to occur. However, if it did occur, then line 51 always sets options_file to None (based on the excerpt above, I think your addition should be options_file = software_options['options_file'] rather than software_options['num_runs']).

Contributor

Ah, and if you follow the excerpt above, then your line 50 needs to be if 'options_file'... rather than if 'num_runs'...

Collaborator Author

Done


return prob_results, results_dir
def _benchmark(user_input, problem_group, num_runs=5):
Contributor

Is this another place where the default for num_runs is specified? This one should probably go as it is a 'private' function.

Collaborator Author

num_runs is defined in the json file, so I have now removed the default from this function.
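
So the private signature simply becomes (sketch):

def _benchmark(user_input, problem_group, num_runs):
    ...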


@param user_input :: all the information specified by the user
Contributor

Again, types if possible

Collaborator Author

I've updated the docstrings.


parsed_problems = [parse.parse_problem_file(p) for p in problem_group]
if not isinstance(user_input, list):
Contributor

Not a problem with this PR, but this code could really do with a refactor - do you know if there is an issue for this?

Collaborator Author

I don't think there is an issue specifically for refactoring this code. As Andrew cleans things up (such as the fitting and parsing), he is refactoring those areas.

@TOFarmer
Contributor

TOFarmer commented Nov 5, 2019

Just before you make any changes, I've done some refactoring for PEP 8 compliance which I need to commit - it's not letting me push at the moment so I just need to work out why.


if not controller.success:
    chi_sq = np.nan
    status = 'failed'
else:
    cv = np.std(runtime_list) / np.mean(runtime_list)
Contributor

Apologies, my original suggestion was a bad metric for determining whether caching has occurred. It would be better to test the ratio of the max to the min (as, if a large number of runs were cached, the stdev could be low despite a single outlier). I've had a look at the timeit source code in CPython and the ratio they use is 4, so we could pick this to be consistent.

Also, rather than raising the Warning, I think maybe we should use warnings.warn, so just:

warnings.warn('The ratio of the max time to the min is larger than {}, which may indicate that caching has occurred in the timing results.'.format(tolerance))

Collaborator Author

Updated
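
For reference, the resulting timing and caching check might look roughly like this (a sketch assembled from the excerpts in this thread, with fit() as a toy stand-in for controller.fit(); the exact merged code may differ):

import timeit
import warnings

def fit():
    # toy stand-in for controller.fit()
    sum(range(10000))

num_runs = 5
tolerance = 4
runtime_list = timeit.repeat(fit, number=1, repeat=num_runs)
runtime = sum(runtime_list) / num_runs
if max(runtime_list) / min(runtime_list) > tolerance:
    warnings.warn('The ratio of the max time to the min is larger than {},'
                  ' which may indicate that caching has occurred in the'
                  ' timing results.'.format(tolerance))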

@@ -89,7 +88,15 @@ def main(argv):

# Processes software_options dictionary into Fitbenchmarking format
minimizers, software = misc.get_minimizers(software_options)

num_runs = software_options.get('num_runs', None)
Contributor

Does this pattern need to be included here as well as in fitbenchmark_group? Is it not dealt with there if the user doesn't define it correctly here? If it is required then maybe it would be worth adding a comment or two explaining what it does?

Collaborator Author

From our meeting on Friday I thought the idea was to remove the expert script, so this would not be a problem?

Contributor

Yes, that's fine.

@tyronerees tyronerees moved this from To do to In progress in ALC Maintenance Nov 2019 Nov 11, 2019
@tyronerees tyronerees removed this from In progress in ALC Maintenance Nov 2019 Nov 11, 2019
@TOFarmer TOFarmer self-requested a review November 11, 2019 14:44
@TOFarmer
Contributor

I discovered that when running the example_runScripts.py, on some occasions (maybe 40% of the time) I got a UserWarning about the ratio of the maximum to minimum runtimes (see line 149 of fitbenchmark_one_problem.py). These were commonly associated with the cubic problem being run on scipy, which is the first problem run.

While @wathen did not experience the same warnings, @AndrewLister-STFC saw the same behaviour while running on a clean cloud instance. The origin of this behaviour (max runtime > 4 x min runtime) is not obvious, and for the vast majority of problem/minimizer combinations the ratio is < 2.

We decided that the current determination of anomalous run times (max/min ratio) is appropriate, and this warning should occur. We recommend that the html cell for each timing result (which will be the average) is hyperlinked to the individual results. This has the added benefit that the user can calculate the uncertainty on the runtimes.

NB: For context, the max/min ratio is the same criterion that timeit uses for warning the user.

Contributor

@TOFarmer TOFarmer left a comment

Check the accuracy tables are still the same

Accuracies are consistent with master.

New timing results are consistent between runs

Other than the timing variations discussed in the previous comment, timings are consistent.

Function: Does the change do what it's supposed to?

Yes, tabulated runtimes are now the average of n runs (default 5). Rather than discarding the highest and lowest (as suggested in #257), the ratio of these is calculated and the user is warned if it is greater than some tolerance (set to 4).

Tests: Does it pass? Is there adequate coverage for new code?

There are currently no tests associated with either of the python modules modified, and so no new coverage has been added. This will be addressed in #290.

Style: Is the coding style consistent? Is anything overly confusing?

Coding style is fine, other than a problem with how options are set, which is consistent with existing code; this will be rectified in #260.

Documentation: Is there a suitable change to documentation for this change?

Yes.

@TOFarmer TOFarmer merged commit bec98d5 into master Nov 12, 2019
@TOFarmer TOFarmer deleted the 257_timing branch November 12, 2019 11:40
Labels
Enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

Inconsistent timing results