phase stagger doesn't work with madmax? #890

Closed
ghost opened this issue Jul 24, 2021 · 18 comments
Labels
bug Something isn't working

Comments

@ghost

ghost commented Jul 24, 2021

Describe the bug
Plot jobs start while bypassing the stagger major:minor phase limits.

To Reproduce

Steps to reproduce the behavior, e.g.:

  1. Set up the config as shown in the attached configuration below.
  2. Run plotman.

Expected behavior
The number of jobs before stage N:N is limited.

System setup:

  • OS: Ubuntu

Config

full configuration
logging:
        plots: /root/.chia/plotman/logs

user_interface:
        use_stty_size: False

commands:
        interactive:
                autostart_plotting: False
                autostart_archiving: False
  
directories:
        tmp:
                - /plotting01
                - /plotting02
                - /plotting03
                - /plotting04

        dst:
                - /plots
                - /plots01
                - /plots02

scheduling:
        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        tmpdir_stagger_phase_limit: 2

        tmpdir_max_jobs: 2

        global_max_jobs: 8

        global_stagger_m: 1

        polling_time_s: 20

        type: madmax

        chia:
                k: 32               
                e: False            
                n_threads: 2   
                n_buckets: 128      
                job_buffer: 3389    

        madmax:
                n_threads: 12        
                n_buckets: 256    
                n_buckets3: 256   
                n_rmulti2: 1            

ghost added the bug label Jul 24, 2021

@altendky
Collaborator

You have a limit of two jobs per tmp dir. Since your tmp dir stagger limit of four is greater than that, it won't do anything.

Side note, when filing this issue were you not provided with a template to fill out as shown below?

image

@ghost
Author

ghost commented Jul 25, 2021

> You have a limit of two jobs per tmp dir. Since your tmp dir stagger limit of four is greater than that, it won't do anything.
>
> Side note, when filing this issue were you not provided with a template to fill out as shown below?

        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        # Optional: default is 1
        tmpdir_stagger_phase_limit: 2

        # Don't run more than this many jobs at a time on a single temp dir.
        # Increase for staggered plotting by chia, leave at 1 for madmax sequential plotting
        tmpdir_max_jobs: 2

        # Don't run more than this many jobs at a time in total.
        # Increase for staggered plotting by chia, leave at 1 for madmax sequential plotting
        global_max_jobs: 8

It still doesn't work with 4:0, 4:1, etc.

@altendky
Collaborator

You still have the phase limit set the same as the overall limit. tmpdir_max_jobs means that any individual tmpdir can have a maximum of 2 jobs in any phase. tmpdir_stagger_phase_limit says that any individual tmpdir can have a maximum of 2 jobs in phases less than 4:1. Configuring the phase limit like that doesn't provide any further restriction.

Your bug report seems to claim that plotman is not enforcing the tmpdir_stagger_phase_limit. You have it set to a limit of 2 with phase 4:1. Do you have more than 2 jobs in a phase less than 4:1 on a single tmpdir? If so, share however it is that you see that.
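
To illustrate why a phase limit equal to `tmpdir_max_jobs` adds no restriction, here is a minimal sketch of the two per-tmpdir checks as they are described in this comment. It is not plotman's actual code: the function name `tmpdir_permits_new_job`, its parameters, and the tuple representation of phases are assumptions made for illustration.

```python
from typing import List, Tuple

# A phase as (major, minor), e.g. (4, 1) for phase 4:1. Hypothetical representation.
Phase = Tuple[int, int]


def tmpdir_permits_new_job(
    phases: List[Phase],
    milestone: Phase,
    phase_limit: int,
    max_jobs: int,
) -> bool:
    """Return True if one more job may start on this tmp dir (illustrative only)."""
    # Jobs that have not yet reached the stagger milestone.
    early = [p for p in phases if p < milestone]
    if len(early) >= phase_limit:
        return False
    # Total jobs on this tmp dir, in any phase.
    if len(phases) >= max_jobs:
        return False
    return True


# One job at phase 2:3 on the tmp dir, milestone 4:1, tmpdir_max_jobs: 2.
print(tmpdir_permits_new_job([(2, 3)], (4, 1), phase_limit=2, max_jobs=2))  # True
print(tmpdir_permits_new_job([(2, 3)], (4, 1), phase_limit=1, max_jobs=2))  # False
```

With `phase_limit=2` and `max_jobs=2`, any situation the phase check would reject is already rejected by the total-jobs check, so only a phase limit below `tmpdir_max_jobs` changes the outcome.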

@ghost
Author

ghost commented Jul 25, 2021

> You still have the phase limit set the same as the overall limit. tmpdir_max_jobs means that any individual tmpdir can have a maximum of 2 jobs in any phase. tmpdir_stagger_phase_limit says that any individual tmpdir can have a maximum of 2 jobs in phases less than 4:1. Configuring the phase limit like that doesn't provide any further restriction.
>
> Your bug report seems to claim that plotman is not enforcing the tmpdir_stagger_phase_limit. You have it set to a limit of 2 with phase 4:1. Do you have more than 2 jobs in a phase less than 4:1 on a single tmpdir? If so, share however it is that you see that.

Okay, what do I need to do to have only 4 jobs before they reach phase 4:1, and then start 4 more once the old ones start transferring to dst? I only have 4 plotting NVMe drives.

@altendky
Collaborator

Do you want one job in a phase less than 4:1 on each disk? Also, why do you want to align all of the plots rather than letting them be staggered?

@ghost
Author

ghost commented Jul 25, 2021

> Do you want one job in a phase less than 4:1 on each disk? Also, why do you want to align all of the plots rather than letting them be staggered?

  1. Yes, and after the plot on a disk reaches 4:1, start a new one without waiting for the previous one to complete.
  2. Because I get more plots per day when running 4 madmax plots in parallel with 12 threads each.

@altendky
Collaborator

  1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
  2. "in parallel" and "aligned" are not the same thing. What is better about 4x plots started every 40 minutes than 1x started every 10 minutes? At this point with madMAx I have both my plotters set with a phase 1 stagger since even with --rmulti2 I can't get CPU usage maxed out outside of phase 1. Though, I also have my 4x tmp drives raid0 on the system with multiple.

@ghost
Author

ghost commented Jul 25, 2021

>   1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
>   2. "in parallel" and "aligned" are not the same thing. What is better about 4x plots started every 40 minutes than 1x started every 10 minutes? At this point with madMAx I have both my plotters set with a phase 1 stagger since even with --rmulti2 I can't get CPU usage maxed out outside of phase 1. Though, I also have my 4x tmp drives raid0 on the system with multiple.

With 4 plots in parallel, the longest plot job takes 3200 s, so 3200/4 = 800 ± 100 s per plot. When I run a single job, it takes about 1100-1200 s. That's with 256 buckets and without the multiplier, so...
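
The arithmetic in this comment can be checked with a short calculation. This is only a sketch using the figures quoted above (3200 s for four parallel plots, 1100-1200 s for a single plot); the helper name `plots_per_day` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def plots_per_day(seconds_per_plot: float) -> float:
    """Plots finished per day at a steady effective rate."""
    return SECONDS_PER_DAY / seconds_per_plot


# Four plots finishing every 3200 s is an effective 800 s per plot.
parallel = plots_per_day(3200 / 4)
# A single plot at a time, using the 1100-1200 s range from the comment.
single_fast = plots_per_day(1100)
single_slow = plots_per_day(1200)

print(f"4 in parallel: {parallel:.1f} plots/day")
print(f"1 at a time:   {single_slow:.1f} to {single_fast:.1f} plots/day")
```

Note that this compares running four jobs against running one job. It says nothing about whether the four jobs should start together or be staggered.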

@altendky
Collaborator

Again, I am not suggesting you run only a single job. I'm only questioning why you want all four to start at the same time rather than staggered. Why start 4x every 40 minutes rather than 1x every 10 minutes? The stagger just avoids aligning resource usage peaks and valleys in an effort to make smoother continuous 100% usage.

Here's my monitoring dashboard in case a visualization helps. I always have about 3x running but I only start one at a time. The left side is a dual Xeon v0 and the right is an i5 NUC.

image
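
The difference between "4x every 40 minutes" and "1x every 10 minutes" can be sketched as below. The start schedules come from the comment above; the 10-minute phase 1 duration is a purely hypothetical figure chosen to make the pattern easy to see, and the function names are made up for illustration.

```python
from typing import List


def start_times(n_jobs: int, batch_size: int, interval_min: int) -> List[int]:
    """Start time in minutes of each job when batch_size jobs start every interval_min."""
    return [(i // batch_size) * interval_min for i in range(n_jobs)]


def jobs_in_phase_one(starts: List[int], t: int, phase1_min: int) -> int:
    """Count jobs that are in phase 1 at minute t, assuming phase 1 lasts phase1_min."""
    return sum(1 for s in starts if s <= t < s + phase1_min)


PHASE1_MIN = 10  # hypothetical phase 1 duration, not a measured value

aligned = start_times(8, batch_size=4, interval_min=40)
staggered = start_times(8, batch_size=1, interval_min=10)

for t in (5, 20, 45):
    print(
        t,
        jobs_in_phase_one(aligned, t, PHASE1_MIN),
        jobs_in_phase_one(staggered, t, PHASE1_MIN),
    )
```

Both schedules start the same number of plots per 40 minutes. Under these assumptions the aligned schedule alternates between four jobs competing in phase 1 and none, while the staggered schedule keeps one job in phase 1 at each sampled time.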

@ghost
Author

ghost commented Jul 25, 2021

> Again, I am not suggesting you run only a single job. I'm only questioning why you want all four to start at the same time rather than staggered. Why start 4x every 40 minutes rather than 1x every 10 minutes? The stagger just avoids aligning resource usage peaks and valleys in an effort to make smoother continuous 100% usage.
>
> Here's my monitoring dashboard in case a visualization helps. I always have about 3x running but I only start one at a time. The left side is a dual Xeon v0 and the right is an i5 NUC.

This makes no sense. There will be a "natural" staggering anyway, when some plots are in phase 4 and some new ones are just starting, so in the long run only 1-3 will be actively running while the last is in phase 4.

@altendky
Collaborator

I'm not sure what is "natural" about plotman launching plots, but alrighty. Did the question you were asking get a workable answer here?

@ghost
Author

ghost commented Jul 26, 2021

> I'm not sure what is "natural" about plotman launching plots, but alrighty. Did the question you were asking get a workable answer here?

nope.

@ghost
Author

ghost commented Jul 26, 2021

I need to have only 4 plots in stage <4 and up to 8 in stage >=4, spread across the 4 NVMe drives.

@altendky
Collaborator

There is no globally applied phase limit. There is a per tmpdir phase limit which is what I addressed above.

  1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.

I think perhaps instead of "up to 8 in stage >= 4" you mean "up to 8 total"? It seems unlikely that you would need to allow twice as many in stage >=4 than in stages <4. Also, it seems there wouldn't be any such limit needed anyways mostly. But, if we go with "up to 8 total", you can achieve that either via global_max_jobs: 8 or tmpdir_max_jobs: 2 depending on your intent. You seem fairly focused on the tmp drives so I'm guessing the latter would be more representative of your intent. Though, all limits must be satisfied so global_max must still be at least the number of total jobs you want to limit to regardless of phase and tmpdir.

@ghost
Author

ghost commented Jul 26, 2021

> There is no globally applied phase limit. There is a per tmpdir phase limit which is what I addressed above.
>
>   1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
>
> I think perhaps instead of "up to 8 in stage >= 4" you mean "up to 8 total"? It seems unlikely that you would need to allow twice as many in stage >=4 than in stages <4. Also, it seems there wouldn't be any such limit needed anyways mostly. But, if we go with "up to 8 total", you can achieve that either via global_max_jobs: 8 or tmpdir_max_jobs: 2 depending on your intent. You seem fairly focused on the tmp drives so I'm guessing the latter would be more representative of your intent. Though, all limits must be satisfied so global_max must still be at least the number of total jobs you want to limit to regardless of phase and tmpdir.

I mean no more than 4 TOTAL in stage <4, with a new job starting only while there are fewer than 4 jobs in stage <4. So:
max total 8
but only 4 in stages 1-3

@altendky
Collaborator

You can limit to one process in a phase less than 4 on each of your individual tmp drives. That is four total processes in a phase less than 4. There is no feature to phase limit globally, independent of any tmp dir. But, I thought you wanted one process in phase < 4 on each tmp drive so it seems like that should be ok for you.
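
As a rough check of how the per-tmpdir limits add up, the sketch below computes the upper bounds implied by the limits as they are described in this thread. The function `job_bounds` is hypothetical and not part of plotman; it only counts jobs and ignores timing settings such as `global_stagger_m`.

```python
from typing import Tuple


def job_bounds(
    n_tmpdirs: int,
    tmpdir_max_jobs: int,
    tmpdir_stagger_phase_limit: int,
    global_max_jobs: int,
) -> Tuple[int, int]:
    """Return (max jobs before the stagger phase, max jobs in total). Illustrative only."""
    # Every limit must be satisfied, so the total is capped by the smaller bound.
    max_total = min(n_tmpdirs * tmpdir_max_jobs, global_max_jobs)
    # Early-phase jobs are capped per tmp dir and can never exceed the total.
    max_early = min(n_tmpdirs * tmpdir_stagger_phase_limit, max_total)
    return max_early, max_total


# Four tmp dirs, tmpdir_max_jobs: 2, global_max_jobs: 8.
print(job_bounds(4, 2, 1, 8))  # phase limit 1 -> (4, 8)
print(job_bounds(4, 2, 2, 8))  # phase limit 2 -> (8, 8)
```

With a phase limit of 1 on each of four tmp dirs, at most four jobs can be before the stagger phase while up to eight run in total, which matches the "4 in early stages, 8 total" request.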

@ghost
Author

ghost commented Jul 26, 2021

> You can limit to one process in a phase less than 4 on each of your individual tmp drives. That is four total processes in a phase less than 4. There is no feature to phase limit globally, independent of any tmp dir. But, I thought you wanted one process in phase < 4 on each tmp drive so it seems like that should be ok for you.

Okay, yes, now what variables do I need to change to get the result? I'm getting overwhelmed with that stuff atm

@altendky
Collaborator

From the config presently listed in the OP, I set tmpdir_stagger_phase_limit: 1.

logging:
        plots: /root/.chia/plotman/logs

user_interface:
        use_stty_size: False

commands:
        interactive:
                autostart_plotting: False
                autostart_archiving: False
  
directories:
        tmp:
                - /plotting01
                - /plotting02
                - /plotting03
                - /plotting04

        dst:
                - /plots
                - /plots01
                - /plots02

scheduling:
        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        tmpdir_stagger_phase_limit: 1

        tmpdir_max_jobs: 2

        global_max_jobs: 8

        global_stagger_m: 1

        polling_time_s: 20

        type: madmax

        chia:
                k: 32               
                e: False            
                n_threads: 2   
                n_buckets: 128      
                job_buffer: 3389    

        madmax:
                n_threads: 12        
                n_buckets: 256    
                n_buckets3: 256   
                n_rmulti2: 1            

Personally, I would set the global_stagger_m: to a bit less than a quarter of the time it takes a plot to get to phase 4. This is an iterative process since each change can affect how long the plots take. Basically, approximately evenly stagger the four "really calculating stuff" plots. This helps to smooth out the overall resource usage (CPU, bus usage to RAM and disk, etc) across the plots. In my experience with madMAx it doesn't really want to actually use full cpu in phases other than 1, even if you specify --rmulti2 2. Certainly this could vary per computer. But, if that's the case for you and you align all four of your parallel plots then you end up with them all battling for cpu in phase 1 and when they all hit phase 2 at about the same time you have cores sitting idle. If you instead always have a plot in phase one and others in phases 2 and 3 then you would always be able to fully utilize your cpu.

Yes, staggering introduces a ramp-up period where you aren't using your full resources. If you are doing 10 plots then this matters, but at that scale tuning plotman like we are going through here doesn't matter. If you are going to leave the system plotting for days and weeks, then an hour of ramp up or such is irrelevant compared to maximizing overall throughput.
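
The suggestion above to set `global_stagger_m` to "a bit less than a quarter" of the time to reach phase 4 could be turned into a starting value like this. It is a sketch only: the helper name, the 0.9 margin, the integer rounding, and the 40-minute example are all assumptions, and the result is meant as a first guess to refine iteratively as described.

```python
def suggested_global_stagger_m(
    minutes_to_phase_4: float,
    parallel_jobs: int = 4,
    margin: float = 0.9,
) -> int:
    """Return a first guess for global_stagger_m in whole minutes (illustrative only)."""
    # Spread the early-phase jobs evenly, then shave off a little ("a bit less").
    even_spacing = minutes_to_phase_4 / parallel_jobs
    return max(1, int(even_spacing * margin))


# Hypothetical measurement: a plot takes about 40 minutes to reach phase 4.
print(suggested_global_stagger_m(40))
```

Since changing the stagger can change how long each plot takes, re-measure the time to phase 4 after each adjustment and recompute.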

@ghost ghost closed this as completed Oct 9, 2021
This issue was closed.