Clearing Thermal Events may freeze Intel processor families #225

Closed
cyring opened this issue Mar 22, 2021 · 2 comments

@cyring
Owner

cyring commented Mar 22, 2021

A list of Intel processors is required on which the PROCHOT bit can be safely written in the following MSRs, per Core and per Package:

WRMSR(ThermStatus, MSR_IA32_THERM_STATUS);

WRMSR(ThermStatus, MSR_IA32_PACKAGE_THERM_STATUS);

@cyring cyring added the bug label Mar 22, 2021
@cyring cyring self-assigned this Mar 22, 2021
@cyring
Owner Author

cyring commented Apr 10, 2021

IA32_THERM_INTERRUPT

rdmsr -ax 0x19b
13
13
13
13
13
13
13
13
13
13
13
13

High-Temperature Interrupt Enable = 1
Low-Temperature Interrupt Enable = 1
Critical Temperature Interrupt Enable = 1
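
The three enable flags above can be read directly off the hexadecimal value printed by rdmsr. A minimal sketch of that decoding follows. The helper name decode_therm_interrupt is hypothetical, and the bit positions (0, 1 and 4) are my reading of the IA32_THERM_INTERRUPT layout, so check them against the manual for your processor.

```python
# Assumed bit positions within IA32_THERM_INTERRUPT (MSR 0x19B).
# They match the values reported in this thread, but verify them
# against the Intel SDM for your processor family.
THERM_INTERRUPT_BITS = {
    "High-Temperature Interrupt Enable": 0,
    "Low-Temperature Interrupt Enable": 1,
    "Critical Temperature Interrupt Enable": 4,
}


def decode_therm_interrupt(hex_value):
    """Decode one line of `rdmsr -ax 0x19b` output into {flag name: 0 or 1}."""
    value = int(hex_value, 16)
    return {name: (value >> bit) & 1 for name, bit in THERM_INTERRUPT_BITS.items()}


if __name__ == "__main__":
    # "13" is the per-CPU value shown above.
    for name, state in decode_therm_interrupt("13").items():
        print(f"{name} = {state}")
```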

  • Kernel
mce: CPU0: Thermal monitoring enabled (TM1)
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11
 TRM:         50         50         50         50         50         50         50         50         50         50         50         50   Thermal event interrupt
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11
 TRM:       1690       1690       1690       1690       1690       1690       1690       1690       1690       1690       1690       1690   Thermal event interrupts
  • Now attempting to boot with "nomce" so that mcheck_cpu_init() aborts before reaching __mcheck_cpu_init_vendor() > mce_intel_feature_init() > intel_init_thermal() > smp_thermal_vector = intel_thermal_interrupt;
  • Booting with nomce
mce: Unable to init MCE device (rc: -5)
rdmsr -ax 0x19b
10
10
10
10
10
10
10
10
10
10
10
10

High-Temperature Interrupt Enable = 0
Low-Temperature Interrupt Enable = 0
Critical Temperature Interrupt Enable = 1

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      
 TRM:          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
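
To compare TRM counters between two snapshots without reading the columns by eye, something like the following sketch could be used. The function names parse_trm_counts and trm_delta are hypothetical, and the sample text is a shortened two-CPU stand-in for the real /proc/interrupts output, not data from this report.

```python
def parse_trm_counts(interrupts_text):
    """Return the per-CPU counters of the 'TRM' line in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.strip().partition(":")
        if label != "TRM":
            continue
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # the trailing description ("Thermal event interrupts") starts here
            counts.append(int(field))
        return counts
    return []  # no TRM line found


def trm_delta(before_text, after_text):
    """Per-CPU increase of the TRM counters between two snapshots."""
    before = parse_trm_counts(before_text)
    after = parse_trm_counts(after_text)
    return [b - a for a, b in zip(before, after)]


if __name__ == "__main__":
    # Illustrative two-CPU snapshots (assumed layout, shortened).
    first = "            CPU0       CPU1\n TRM:         50         50   Thermal event interrupts\n"
    second = "            CPU0       CPU1\n TRM:       1690       1690   Thermal event interrupts\n"
    print(trm_delta(first, second))
```

A delta that stays at zero on every CPU is what one would expect when no thermal interrupt handler is being triggered.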

CoreFreq_NO_MCE_Interrupt_Handler

Clearing log bits

Power Limitation

CoreFreq_Clear_Power_Limitation

Thermal Threshold

CoreFreq_Clear_Thermal_Threshold

Thermal Sensor

CoreFreq_Clear_Thermal_Sensor

Conclusions

  • Don't clear the log bits if the Kernel or an SMI handler has set IA32_THERM_INTERRUPT[1:0] (the High- and Low-Temperature Interrupt Enable bits) for an interrupt handler.
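
The rule above amounts to a mask test on the IA32_THERM_INTERRUPT value of every CPU. The sketch below illustrates it in user space on `rdmsr -ax 0x19b` output. It is not CoreFreq's driver code, the name may_clear_log_bits is hypothetical, and treating an empty reading as unsafe is my own conservative assumption.

```python
# IA32_THERM_INTERRUPT[1:0]: Low- and High-Temperature Interrupt Enable.
HANDLER_ENABLE_MASK = 0b11


def may_clear_log_bits(therm_interrupt_values):
    """Return True only if no CPU has IA32_THERM_INTERRUPT[1:0] set.

    `therm_interrupt_values` is a list of hexadecimal strings, one per CPU,
    as printed by `rdmsr -ax 0x19b`. An empty list is treated as unsafe
    (assumption: without a reading, do not write the log bits).
    """
    values = [int(v, 16) for v in therm_interrupt_values]
    return bool(values) and all((v & HANDLER_ENABLE_MASK) == 0 for v in values)


if __name__ == "__main__":
    print(may_clear_log_bits(["13"] * 12))  # default boot: handler enabled on all CPUs
    print(may_clear_log_bits(["10"] * 12))  # nomce boot: bits 1:0 clear on all CPUs
```

With the values reported in this issue, the first call returns False and the second returns True.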

Verifying

  1. Stress the Processor with Conic Compute > Hyperboloid of two sheets
    CoreFreq_Conic_Hyperboloid_of_two_sheets
    and observe throttling
  2. Clear all log bits
    CoreFreq_Clear_All_Thermal_Events
  3. Confirm that no CPU freezes

@cyring
Owner Author

cyring commented Apr 10, 2021

  • A fix has been available for testing since CoreFreq version 1.84.5

  • Booting the Kernel with the nomce parameter prevents the interrupt handler from being installed, so the thermal log bits can then be cleared.

  • Thank you for your test reports.

@cyring cyring added bugfix and removed bug labels Apr 10, 2021
@cyring cyring changed the title from "Clearing PROCHOT can freeze some processor family" to "Clearing Thermal Events may freeze Intel processor families" Apr 10, 2021
@cyring cyring closed this as completed Apr 24, 2021
@cyring cyring mentioned this issue May 21, 2021