
Update to pygad.py to use multiprocessing of generations #80

Closed

wants to merge 2 commits into from

Conversation

@windowshopr commented Dec 30, 2021

Makes use of concurrent.futures ProcessPoolExecutor to go through the generations faster.

NOTE: if using multiprocessing, the fitness_func must return the fitness score, as well as the solution_idx passed into it!

This was tested using the example PyGAD script given in the tutorial, with a slight modification to the fitness function as described above and with the 2 new parameters added to the ga_instance. This was the script used for testing:

import pygad
import numpy

"""
Given the following function:
    y = f(w1:w6) = w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
    where (x1,x2,x3,x4,x5,x6)=(4,-2,3.5,5,-11,-4.7) and y=44
What are the best values for the 6 weights (w1 to w6)? We are going to use the genetic algorithm to optimize this function.
"""

function_inputs = [4,-2,3.5,5,-11,-4.7] # Function inputs.
desired_output = 44 # Function output.

def fitness_func(solution, solution_idx):
    # Calculating the fitness value of each solution in the current population.
    # The fitness function calculates the sum of products between each input and its corresponding weight.
    output = numpy.sum(solution*function_inputs)
    fitness = 1.0 / numpy.abs(output - desired_output)
    return fitness, solution_idx

fitness_function = fitness_func

num_generations = 100 # Number of generations.
num_parents_mating = 7 # Number of solutions to be selected as parents in the mating pool.

# To prepare the initial population, there are 2 ways:
# 1) Prepare it yourself and pass it to the initial_population parameter. This way is useful when the user wants to start the genetic algorithm with a custom initial population.
# 2) Assign valid integer values to the sol_per_pop and num_genes parameters. If the initial_population parameter exists, then the sol_per_pop and num_genes parameters are useless.
sol_per_pop = 50 # Number of solutions in the population.
num_genes = len(function_inputs)

last_fitness = 0
def callback_generation(ga_instance):
    global last_fitness
    print("Generation = {generation}".format(generation=ga_instance.generations_completed))
    print("Fitness    = {fitness}".format(fitness=ga_instance.best_solution()[1]))
    print("Change     = {change}".format(change=ga_instance.best_solution()[1] - last_fitness))
    last_fitness = ga_instance.best_solution()[1]

# Creating an instance of the GA class inside the ga module. Some parameters are initialized within the constructor.
ga_instance = pygad.GA(num_generations=num_generations,
                       num_parents_mating=num_parents_mating, 
                       fitness_func=fitness_function,
                       sol_per_pop=sol_per_pop, 
                       num_genes=num_genes,
                       on_generation=callback_generation,
                       use_multiprocess=True,
                       max_workers=5)

# Running the GA to optimize the parameters of the function.
ga_instance.run()

# After the generations complete, a plot is shown that summarizes how the fitness values evolve over generations.
ga_instance.plot_fitness()

# Returning the details of the best solution.
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))
print("Index of the best solution : {solution_idx}".format(solution_idx=solution_idx))

prediction = numpy.sum(numpy.array(function_inputs)*solution)
print("Predicted output based on the best solution : {prediction}".format(prediction=prediction))

if ga_instance.best_solution_generation != -1:
    print("Best fitness value reached after {best_solution_generation} generations.".format(best_solution_generation=ga_instance.best_solution_generation))

# Saving the GA instance.
filename = 'genetic' # The filename to which the instance is saved. The name is without extension.
ga_instance.save(filename=filename)

# Loading the saved GA instance.
loaded_ga_instance = pygad.load(filename=filename)
loaded_ga_instance.plot_fitness()

Makes my generations go by much faster :D

Now, because I'm using -numpy.inf placeholders in the pygad script, the per-generation fitness numbers might not be reported correctly. I'm simply using -numpy.inf as a very small starting fitness so that each member's returned score overwrites the placeholder, but an unfilled slot could still report as -inf, which isn't desirable. Maybe a different default value or some other logic could be used, either way :)
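For illustration, here is a minimal sketch of the placeholder idea described above (an assumed reconstruction, not the exact code from this pull request; the helper name cal_pop_fitness_parallel is hypothetical). The population fitness array is pre-filled with -numpy.inf, each solution is submitted to a concurrent.futures.ProcessPoolExecutor, and each completed result overwrites its own slot using the solution_idx returned by fitness_func:

import concurrent.futures
import numpy

def cal_pop_fitness_parallel(population, fitness_func, max_workers=5):
    # Pre-fill with -inf so any computed fitness overwrites the placeholder.
    pop_fitness = numpy.full(len(population), -numpy.inf)
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        # One job per solution; fitness_func is expected to return (fitness, solution_idx).
        futures = [executor.submit(fitness_func, solution, idx)
                   for idx, solution in enumerate(population)]
        for future in concurrent.futures.as_completed(futures):
            fitness, solution_idx = future.result()
            pop_fitness[solution_idx] = fitness
    return pop_fitness

Note that ProcessPoolExecutor requires fitness_func to be picklable (defined at module level), and on some platforms (e.g. Windows) the calling script must be guarded by if __name__ == "__main__":.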

Makes use of concurrent.futures ProcessPoolExecutor to go through the generations faster. NOTE: if using multiprocessing, the fitness_func must return the fitness score, as well as the solution_idx passed into it!
I don't know if using -numpy.inf is the best way to go, but the example code runs without error. Maybe there are other values to test it with; I'm just using -numpy.inf as a really small number so that any fitness score greater than that will make it through to the next generation. I will update my example script as well.
@ahmedfgad
Owner

Thanks @windowshopr! I will review it.

@ahmedfgad added the enhancement (New feature or request) label on Jan 1, 2022
@nico1996it

Please merge it! That would be awesome.

@ahmedfgad
Owner

After testing, it seems that all the fitness values are set to -inf. As a result, there is no evolution at all and the library is doing nothing.

Even if things worked properly, another issue is the processing time, which becomes a bottleneck compared to the normal (serial) case.

@ahmedfgad closed this on Jul 4, 2022
@windowshopr
Author

Oh darn, I will look at it later when I get some time. Keep in mind, some objective functions aren't just doing a simple calculation. For example, one of my objective functions tests technical indicators with varying time periods and then backtests a trading strategy, all within the objective function. One backtest can take up to 5 minutes to perform, so using multiprocessing speeds that up CONSIDERABLY with no bottleneck issues. So don't be scared of overhead issues, because everyone's use case is different. Would be cool to see an option for "use_multiprocessing" or something similar in PyGAD someday, because it's super useful on my end. (And I'm not getting -inf's on my end, so I'll have to verify.)

@ahmedfgad
Owner

I totally agree! Some functions are intensive in their calculations and parallel processing will make a difference. Your code is also easy for users, and I like it.

@ahmedfgad
Owner

In the code, you wrote this comment:

So, if using multiprocess, the fitness_func must "return fitness, solution_idx"

This means the user must also return the index in addition to the fitness. I do not agree with that. We should make things easier for the user instead of adding more overhead.

Instead of using the submit() method, I used map(). It returns the results in the order they were submitted rather than the order in which they complete.
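
For illustration, a minimal sketch of the map()-based alternative (an assumed reconstruction, not the library's final code; the helper name cal_pop_fitness_map is hypothetical). Because map() preserves submission order, fitness_func can keep returning only the fitness value and no solution_idx has to be returned:

import concurrent.futures
import numpy

def cal_pop_fitness_map(population, fitness_func, max_workers=5):
    indices = range(len(population))
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() yields fitness values in the same order the solutions were passed in.
        fitness_values = list(executor.map(fitness_func, population, indices))
    return numpy.array(fitness_values)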

But similar to the experiments I did before with parallel processing, there is little or no speed improvement even when the fitness function calculations become intensive. For light calculations, parallel processing is far slower (e.g. 0.4 seconds without parallelization compared to 12 seconds with it).

Back to the question I asked months ago: is there an example where there would be a big enough time difference to convince us that parallelization makes a difference? Please give me an example if you have one.

@windowshopr
Author

Lol didn't you just agree that:

I totally agree! Some functions are intensive in their calculations and parallel processing will make a difference. Your code is also easy for users, and I like it.

Sounds like it's something that could make a difference? :P

And I already touched on my particular use case: an objective function that can run X trading strategy backtests in parallel, instead of one at a time without multiprocessing! Backtesting trading strategies (generally) involves iterating through a dataframe or array from top to bottom. This is done in the objective function because I'm trying to maximize a return score. That can take as long as the DF/array is long, causing a bottleneck. If you can run X backtests at once in parallel, then your genetic algorithm run will be done roughly X times faster! @nico1996it expressed interest in it too, so I think it's a worthwhile investment.

As per the other point, yes, I think map() would be better to use; I was just more familiar with submit() and needed a way to know what the current index number was, but also just to prove the concept was doable. If you have a better way of doing it (as you know the code a bit better than I do), it would be awesome to see it implemented!

For more inspiration, I've created my own genetic algorithm that uses (or doesn't use, whichever the user has specified) multiprocessing, and the respective section of code looks like this:

[screenshot of the relevant multiprocessing code, not reproduced here]

Hope that helps!! :)

@ahmedfgad
Owner

Thank you. I am sure that parallel processing makes a difference. It is just a question of whether it makes a big difference.

I already built fitness functions that use simple linear equations and also complex machine learning algorithms. Unfortunately, the difference was not that large.

I just need an example that shows how beneficial parallel processing would be.

@windowshopr
Author

Well, backtesting technical trading strategies is a pretty big example. For some inspiration, I would check out these two backtesting libraries and give ‘em a run through:

Backtrader

Backtesting

Running one backtest with one set of technical indicator parameters takes as long as it takes to loop through your dataset. Being able to run multiple backtests at a time would benefit these traders in a huge way, especially if PyGAD was at the forefront :P
