Regarding Runtime Parameter #1217

Open
hosunhc opened this issue Jun 28, 2023 · 19 comments

Comments

@hosunhc

hosunhc commented Jun 28, 2023

The device that I am using has three clusters as shown in the device_config below:

  device_config:
    adb_server:
    big_core:
    core_clusters:
    core_names: ['A55', 'A55', 'A55', 'A55', 'A76', 'A76', 'X1', 'X1']

If I try to change the frequency of the A76 cluster while CPU4 is offline, WA returns an error saying this is not possible because CPU4 is off, even though CPU5, a CPU in the same cluster, is online:

  runtime_parameters:
    A55_frequency: 1328000
    A76_frequency: 1328000
    X1_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 0

Is there no way around this? Is it because CPU5's frequency is tied to CPU4's? Any advice is appreciated. The device is a Pixel 6.
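For reference, the mapping from the core_names list above to clusters can be sketched in a few lines of Python. This is only an illustration of the grouping, not WA's actual code; `cpus_by_cluster` is a hypothetical helper name.

```python
from collections import OrderedDict

# core_names copied from the device_config above.
CORE_NAMES = ['A55', 'A55', 'A55', 'A55', 'A76', 'A76', 'X1', 'X1']


def cpus_by_cluster(core_names):
    """Group CPU indices by core name, in order of first appearance.

    Hypothetical helper for illustration; not part of WA.
    """
    clusters = OrderedDict()
    for cpu, name in enumerate(core_names):
        clusters.setdefault(name, []).append(cpu)
    return clusters


clusters = cpus_by_cluster(CORE_NAMES)
for name, cpus in clusters.items():
    print(name, cpus)
```

Under this grouping the A76 cluster is CPU4 and CPU5, which is why an A76_frequency setting involves CPU4 at all.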

@marcbonnici
Contributor

Hi, thanks for reporting this. That should not be the case, so it sounds like we might have a bug somewhere.

As a workaround, could you try explicitly specifying the frequency of each enabled core you are interested in and see if that allows you to make progress?

e.g.

cpu2_frequency: 1328000
cpu5_frequency: 1328000
cpu6_frequency: 1745000

@hosunhc
Author

hosunhc commented Jun 28, 2023

Thanks for the quick response. Still does not seem to work:


workloads:
- name: stress-ng
  iterations: 10
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method callfunc --taskset 6,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    # A55_frequency: 1328000
    # A76_frequency: 1328000
    # X1_frequency: 1745000
    cpu2_frequency: 1328000
    cpu5_frequency: 1328000
    cpu6_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 1

With the output as below:

INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. Max retries exceeded.

@marcbonnici
Contributor

Hmm.. I see. It seems like this is happening because WA resolves the cluster to its first cpu instead of checking for the first "online" cpu in the cluster.
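To make the difference concrete, here is a minimal sketch of the two resolution strategies. It is an assumption about the behaviour described above, not WA's actual implementation; `first_online_cpu` is a hypothetical helper.

```python
def first_online_cpu(cluster_cpus, online):
    """Return the first CPU in the cluster that is online, or None.

    Hypothetical helper for illustration; not part of WA.
    """
    for cpu in cluster_cpus:
        if online.get(cpu, False):
            return cpu
    return None


# A76 cluster with the online state from the failing agenda:
# cpu4 is off, cpu5 is on.
a76_cpus = [4, 5]
online = {4: False, 5: True}

naive_choice = a76_cpus[0]                           # always the first cpu
robust_choice = first_online_cpu(a76_cpus, online)   # first *online* cpu

print(naive_choice, robust_choice)
```

With the naive choice the frequency write targets cpu4, which is offline, matching the "Cannot configure frequencies for CPU4" error in the log.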

If you don't need particular cpus to be online, only a certain number per cluster, one potential workaround may be to online the first cpu of each cluster, which should allow WA's resolution to function as intended.
E.g. for your first example:

    sysfile_values:
      /sys/devices/system/cpu/cpu0/online: 1
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 1
      /sys/devices/system/cpu/cpu5/online: 0
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 0

@hosunhc
Author

hosunhc commented Jun 29, 2023

Ahhhh I see. I was hoping that wasn't the case, as I would prefer having the flexibility of choosing particular cpus.

marcbonnici added a commit to marcbonnici/workload-automation that referenced this issue Jun 29, 2023
As per [1], any attempt to configure a core's frequency will fail
if the first core of the associated frequency domain is offline.
Remove the assumption that the first cpu should be used when committing
the change to the device to resolve this.

[1] ARM-software#1217
@marcbonnici
Contributor

I think I've found the problem (and a few others in the process). Would you be able to try out this branch [1] on your setup and let me know if it resolves the issue for you?

[1] https://github.com/marcbonnici/workload-automation/tree/cpu_domain_fix

@hosunhc
Author

hosunhc commented Jun 30, 2023

Okay, so I switched branches and reinstalled using setup.py:

cd workload-automation
sudo -H python setup.py install

The reported version is 3.4.0.dev1+7c432d74, but the issue still seems to occur.

@marcbonnici
Contributor

Hmm.. thanks for trying that out.
Do you have your run.log available to see if there are any further hints in there?

@hosunhc
Author

hosunhc commented Jun 30, 2023

run.log
The workload agenda is here:

workloads:
- name: stress-ng
  iterations: 5
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method gcd --taskset 5,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    A55_frequency: 1328000
    A76_frequency: 1328000
    X1_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 0
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 0
      /sys/devices/system/cpu/cpu7/online: 1

marcbonnici added a commit to marcbonnici/workload-automation that referenced this issue Jun 30, 2023
As per [1], any attempt to configure a core's frequency will fail
if the first core of the associated frequency domain is offline.
Remove the assumption that the first cpu should be used when committing
the change to the device to resolve this.

[1] ARM-software#1217
@marcbonnici
Contributor

Thanks, would you be able to pull my branch again and see if this resolves this problem for you?

@hosunhc
Author

hosunhc commented Jul 1, 2023

Still seems to be happening.
run.log
Also in case you need the agenda:
stressng_w_10iter.txt

@scojac01
Contributor

scojac01 commented Jul 3, 2023

Hi hosunhc - what happens if you try to explicitly set the frequency for each online CPU, rather than the cluster frequency?

e.g.

  runtime_parameters:
    cpu0_frequency: 1328000
    cpu5_frequency: 1328000
    cpu7_frequency: 1745000

@marcbonnici
Contributor

Right, it looks like the next issue here is that WA queries the device at the time it validates the input parameters, and the device state can change before the parameters are committed to the device.

At that point the A76 cluster (for example) resolves to both cpus 4 and 5 (if both are online at that time), so WA picks the first cpu. The error is generated later because the sysfile settings turn that cpu off before WA can actually commit the frequency.

I think Scott's workaround should work as it does not rely on this resolution. However, I've also updated my branch again to change the order in which the runtime parameters are applied to the device, so that any frequency configuration happens before we offline cpus. Would you be able to check if this one gets things working for you?
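The ordering problem can be sketched with a toy model. Everything here is hypothetical and only mimics the behaviour described above: `FakeTarget` is a stand-in for the device, and frequency is tracked per cpu rather than per frequency domain to keep the example short.

```python
class FakeTarget:
    """Toy stand-in for a device; not WA or devlib code."""

    def __init__(self, num_cpus):
        self.online = {cpu: True for cpu in range(num_cpus)}
        self.frequency = {}

    def set_online(self, cpu, state):
        self.online[cpu] = bool(state)

    def set_frequency(self, cpu, freq):
        # Mimic the failure mode: an offline cpu cannot be configured.
        if not self.online[cpu]:
            raise RuntimeError('CPU{} is offline'.format(cpu))
        self.frequency[cpu] = freq


def apply_steps(target, steps):
    """Apply ('online' | 'frequency', cpu, value) steps in the given order."""
    for kind, cpu, value in steps:
        if kind == 'online':
            target.set_online(cpu, value)
        else:
            target.set_frequency(cpu, value)


# cpu4 was chosen at validation time, while it was still online.
RESOLVED_CPU = 4
HOTPLUG_FIRST = [('online', 4, 0), ('frequency', RESOLVED_CPU, 1328000)]
FREQUENCY_FIRST = [('frequency', RESOLVED_CPU, 1328000), ('online', 4, 0)]


def try_order(steps):
    """Run the steps on a fresh target and report whether they succeeded."""
    target = FakeTarget(num_cpus=8)
    try:
        apply_steps(target, steps)
        return 'ok'
    except RuntimeError as error:
        return 'failed: {}'.format(error)


print('hotplug first:  ', try_order(HOTPLUG_FIRST))
print('frequency first:', try_order(FREQUENCY_FIRST))
```

In this model, offlining cpu4 before the frequency write reproduces the failure, while committing the frequency first succeeds.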

@hosunhc
Author

hosunhc commented Jul 4, 2023

So I tried both Scott's method and the normal cluster method, and they both work great! There was one instance using the A76 cluster method where the first iteration ran fine but the remaining four iterations hit the same CPU issue, but this only happened once. If that error persists, I'll open a new issue, but at the moment I think it's fixed! Thanks!

@marcbonnici
Contributor

Thanks for confirming, I'm glad we finally have a working setup for you.

I think I might know what could cause the issue with the cluster approach, but I would need to look into it further, so I'll keep this issue open for now.

@hosunhc
Author

hosunhc commented Jul 9, 2023

So it seems that this could be a more persistent issue.
I attached the run log below:
run.log

@marcbonnici
Contributor

I think the issue here is the combination of cluster names, hotplugging and multiple iterations. The resolution of the cpus is still performed once at the start of the run, so when WA tries to configure the device on subsequent iterations we run into the same problem.
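As a rough model of that symptom (an assumption for illustration, not WA's code), compare resolving the cluster once at the start of the run with resolving it again on every iteration. `run_iterations` is a hypothetical function.

```python
def run_iterations(cluster_cpus, online_per_iteration, resolve_each_time):
    """Return, per iteration, whether the chosen cpu was online.

    Hypothetical model: the cluster is resolved to its first online cpu,
    either once (cached) or afresh on every iteration.
    """
    cached = None
    results = []
    for online in online_per_iteration:
        if resolve_each_time or cached is None:
            cached = next((c for c in cluster_cpus if online[c]), None)
        results.append(cached is not None and online[cached])
    return results


a76_cpus = [4, 5]
# Iteration 1 starts with both cpus online; the agenda then offlines cpu4,
# so later iterations start with only cpu5 online.
states = [
    {4: True, 5: True},
    {4: False, 5: True},
    {4: False, 5: True},
]

print(run_iterations(a76_cpus, states, resolve_each_time=False))
print(run_iterations(a76_cpus, states, resolve_each_time=True))
```

The cached variant succeeds on the first iteration and fails afterwards, which matches the pattern reported earlier in the thread; the cpuX_frequency notation avoids the resolution step entirely.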

Does using the cpuX_frequency notation still work here?

@hosunhc
Author

hosunhc commented Jul 15, 2023

Yep, using cpuX_frequency works great.

@marcbonnici
Contributor

OK, thanks for confirming. It looks like supporting the cluster parameters in combination with hotplugging would require some more invasive changes to the runtime parameter mechanism.

Just to double check, are you still using my topic branch to get things working on your end rather than the upstream implementation? If so I'll look at merging those changes so we at least have a workable solution upstream as well.

@hosunhc
Author

hosunhc commented Jul 21, 2023

Yep, I've been using your branch rather than the upstream implementation.

3 participants