Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcelog --client output nothing when inject error to memory,But the messages can catch the error log. #74

Open
xiaochunlee opened this issue Feb 22, 2019 · 10 comments

Comments

@xiaochunlee
Copy link

Lately, We tested RAS function about memory inject error on the Purley platform of lenovo SR630, The OS is RHEL8,kernel version kernel-4.18.0-67.el8, mcelog version is 159.
The test steps list as below:

  1. Mount the einj module
    linux-1rz0:~ # modprobe einj param_extension=1
    linux-1rz0:~ #

  2. Start the mcelog daemon
    linux-1rz0:~ # mcelog --daemon
    linux-1rz0:~ #

  3. Check whether the einj module loaded successfully
    linux-1rz0:~ # cd /sys/kernel/debug/apei/einj/
    linux-1rz0:/sys/kernel/debug/apei/einj #
    linux-1rz0:/sys/kernel/debug/apei/einj # ls
    available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags
    linux-1rz0:/sys/kernel/debug/apei/einj #

4.Inject uncorrectable error to memory mirror range
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x10 > error_type
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x12345 > param1
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0xfffffffffffff000 > param2
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > error_inject
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > notrigger
linux-1rz0:/sys/kernel/debug/apei/einj #

Below is some informations about the outcome:

[root@rhel8-ose-test rastools]# systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-01-30 02:22:29 EST; 51min ago
Main PID: 1177 (mcelog)
Tasks: 1 (limit: 26213)
Memory: 856.0K
CGroup: /system.slice/mcelog.service
└─1177 /usr/sbin/mcelog --ignorenodev --daemon --foreground
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

[root@rhel8-ose-test rastools]#tail -f /var/log/dmesg
Jan 30 02:29:26 rhel8-ose-test kernel: mce: Uncorrected hardware memory error in user-access at 6696d1040
Jan 30 02:29:26 rhel8-ose-test kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: Killing einj_mem_uc:8974 due to hardware memory corruption
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: recovery action for dirty LRU page: Recovered
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Hardware event. This is not a software error.
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCE 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPU 21 BANK 1 TSC 8a4ce5d5aa0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: RIP 33:403c4b
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MISC 86 ADDR 6696d1040
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: TIME 1548833366 Wed Jan 30 02:29:26 2019
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCG status:RIPV EIPV MCIP LMCE
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi status:
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Uncorrected error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

But when we execute the mcelog --client ,there is nothing output. So I research the code about the mcelog and found that it was blocked by mce status bit settings,The partial resource code of the mcelog tool list as below:

125 static int intel_memory_error(struct mce *m, unsigned recordlen)
126 {
127 u32 mca = m->status & 0xffff;
128 if ((mca >> 7) == 1) {
129 unsigned corr_err_cnt = 0;
130 int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
131 int dimm[2] = { -1, -1 };

@xiaochunlee
Copy link
Author

For investigating by UEFI expert, They got a conclusion that the Firmware always handle the error first, So that the OS cant capture that. Such as the handling of PFA threshold and mirror area uncorrected errors. So, I have a question, does the mcelog need to handle that while enabled the firmware first on UEFI? Looking forward to your reply, many thanks!

@xiaochunlee
Copy link
Author

Because if the BIOS enable the Firmware First function, the OS can't get the error count correctly, So that the mcelog can't trigger the threshold events that defined by '/etc/mcelog/mcelog.conf ' as usual. The UEFI engineers only want to know how the OS will do when the UEFI handle error first, or whether the UEFI should enable the Firmware First, if enable it, whether it is reasonable.
Is there anyone give a help? Appreciate it!

@xiaochunlee
Copy link
Author

As we know, The mcelog daemon accounts memory errors, mcelog --client can be used to query the count of corrected and uncorrected errors. The mcelog daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms: including bad page offlining and automatic cache error handling. User defined actions can be also configured.
If the UEFI enable the FIRMWARE_FIRST function(defined by ACPI spec), UEFI handles errors first, not even notifying OS, it result in the thresholds event which defined in ‘/etc/mcelog/mcelog.conf’ can’t be triggered, So I think it's an abnormal behavior.
Anyway, the mcelog daemon accounts inaccurate memory error while test RAS features after enable FIRMWARE_FIRST in UEFI.So if enable FIRMWARE_FIRST in UEFI, when we run the command ‘mcelog –client’ , there is nothing output.
After researching the code about the mcelog, it was blocked at MCE status bit checking,The partial resource code of the mcelog tool list as below:

125 static int intel_memory_error(struct mce *m, unsigned recordlen)
126 {
127 u32 mca = m->status & 0xffff;
128 if ((mca >> 7) == 1) {
129 unsigned corr_err_cnt = 0;
130 int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
131 int dimm[2] = { -1, -1 };

It seems that after enable the FIRMWARE_FIRST, UEFI handle errors first when errors occurred, But This behavior causes a change in the status of the MCA register, so mcelog can’t get the correct information, Ultimately, the user-defined event could not be triggered.

@ZSN666
Copy link

ZSN666 commented Aug 8, 2022

hi [xiaochunlee]:
I ran into the same problem,How do you solve it now ?mcelog --client shows nothing but /var/log/mcelog records mce event,finally triggers not invoked。@xiaochunlee

@xiaochunlee
Copy link
Author

hi [xiaochunlee]: I ran into the same problem,How do you solve it now ?mcelog --client shows nothing but /var/log/mcelog records mce event,finally triggers not invoked。@xiaochunlee

Command "mcelog --client" shows logs huge possibility due to the MCA status register bit 7(start from bit 0). Eg. In the latest Intel Birch Stream, Sierra Forest CPU, when UEFI don't gather MCA data first into GHES (WHEA table) to notify OS, the MCA status register always is 0xbd80000000070134 in SRAR testing, the bit 7 is 0, so nothing shows up while running "mcelog --client", follow "SFR Registers Specification reversion 1.2", chapter "Machine Check Architecture (MCA) Encoding". bit 0-15 is MCACOD, the 0x0134 means "[CE/UC/SRAR] Data Cache Level 0 - Data Read Error", so it looks everything good, but "mcelog --client" show log prerequisite is bit 7 of MCA status must be 1, I don't know why have this design, @tony Zhu tony.zhu@intel.com, Could you please help to let me know the really reason or the background? Many thanks! When UEFI does fill the WHEA (Windows Hardware Error Architecture) table and notify OS via GHES (Generic Hardware Error Source) notify type, such as NMI or SCI, the MCA register always be 0xbc0000000000009f, obvious, the bit 7 is 1, so the "mcelog --client" have message output under this situation.
Now, In Lenovo ThinkSystem SR650V4 with Intel Birch Stream, Sierra Forest CPU, Lenovo UEFI designs don't follow Intel new design to store error info to Extend log, and drop original design that store error info to WHEA as well, so the MCA status is 0xbd80000000070134 after SRAR testing and no output while running command "mcelog --client". @tony Zhu tony.zhu@intel.com Is that a normal behavior, do you have any idea on that? Appreciate it!

@xiaochunlee
Copy link
Author

xiaochunlee commented Jun 14, 2024

I've got more information about those, update as follows.
Intel Architecture spec "Table 16-10. IA32_MCi_Status [15:0] Compound Error Code Encoding" says that only Memory Controller Errors and "Extended Memory Errors" (memory of 1-level memory, slow far memory of 2-level memory and fast near memory of 2-level memory) has bit#7 be set to 1. So both EDAC driver in kernel and mcelog in APP have logical judgement of MCi status bit#7, they just parse memory error. In my case, I use SRAR (Software Recoverable Action Required) test case, that triggered DCU error is a "Data CACHE Level-0 Data-Read Error", it belongs to the category of "Cache Hierarchy Errors", not the "Memory Controller Errors" we mentioned. That's why EDAC driver and mcelog show nothing while SRAR injected in. The kernel annotation listed as below.
But I'm still confuse that why WHEA be filled by UEFI in FFM (Firmware First Mode) expose to OS MCi Status have changed to 0xbc0000000000009f from 0xbd80000000070134 while SRAR tested, Maybe your UEFI guys or Intel could tell us.
-->>
/*

  • According to Intel Architecture spec vol 3B,
  • Table 15-10 "IA32_MCi_Status [15:0] Compound Error Code Encoding"
  • memory errors should fit one of these masks:
  •  000f 0000 1mmm cccc (binary)
    
  •  000f 0010 1mmm cccc (binary)    [RAM used as cache]
    
  • where:
  •  f = Correction Report Filtering Bit. If 1, subsequent errors
    
  •      won't be shown
    
  •  mmm = error type
    
  •  cccc = channel
    

/
#define MCACOD_MEM_ERR_MASK 0xef80
/

  • Errors from either the memory of the 1-level memory system or the
  • 2nd level memory (the slow "far" memory) of the 2-level memory system.
    /
    #define MCACOD_MEM_CTL_ERR 0x80
    /
  • Errors from the 1st level memory (the fast "near" memory as cache)
  • of the 2-level memory system.
    */
    #define MCACOD_EXT_MEM_ERR 0x280
    <<--

@aegl
Copy link
Collaborator

aegl commented Jun 14, 2024 via email

@xiaochunlee
Copy link
Author

xiaochunlee commented Jun 15, 2024

@aegl, Thanks for your clarification. Yes, is a fake mca bank number here, and just the CPER section in the Generic Error Status Block sent to OS.
But I was just wondering in function apei_mce_report_mem_error(), if set the mca status register MCACOD to 0x009f, and then it notify the decoders that registered on "x86_mce_decoder_chain", include those EDAC notifier and mcelog notifier, so the EDAC and mcelog find that it is a valid memory error via bit#7 in mca status register and start to decode the mce information, but in fact, it is a cache error, such as SRAR DCU test in access application address, the output looks like below. The MCA status register is 0xbd80000000100134 is a "Data CACHE Level-0 Data-Read Error" but become 0xb40000000000009f via GHES notify OS, due to EDAC driver and mcelog be notified and start decoding to "MEMORY CONTROLLER RD_CHANNELunspecified_ERR" for mcelog, and "EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0)" for EDAC. Are those interpreted output right or not? I mean other interpreted results except bank number (-1). Actually, it is a Cache Hierarchy Errors, but in the UEFI FFM scennario, kernel treat is as a Memory error, is that correct, do we need to enhance them? Many thanks! @aegl

-->>
2019-08-19T04:17:35.088436-04:00 linux-dgwd mcelog: mcelog server already running
2019-08-19T04:17:35.096032-04:00 linux-dgwd kernel: [ 198.093202] EINJ: Error INJection is initialized.
2019-08-19T04:18:34.201005-04:00 linux-dgwd kernel: [ 257.199597] Disabling lock debugging due to kernel taint
2019-08-19T04:18:34.201256-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error.
2019-08-19T04:18:34.201302-04:00 linux-dgwd kernel: [ 257.199615] mce: Uncorrected hardware memory error in user-access at e16677000
2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199687] mce: [Hardware Error]: Machine check events logged
2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199882] Memory failure: 0xe16677: Killing mca-recover:4801 due to hardware memory corruption
2019-08-19T04:18:34.201305-04:00 linux-dgwd kernel: [ 257.199885] Memory failure: 0xe16677: recovery action for dirty LRU page: Recovered
2019-08-19T04:18:34.201505-04:00 linux-dgwd mcelog[2340]: MCE 0
2019-08-19T04:18:34.201595-04:00 linux-dgwd mcelog[2340]: CPU 3 BANK 1 TSC c76107cf84
2019-08-19T04:18:34.201677-04:00 linux-dgwd mcelog[2340]: RIP 33:404c36
2019-08-19T04:18:34.201753-04:00 linux-dgwd mcelog[2340]: MISC 86 ADDR e16677000
2019-08-19T04:18:34.201832-04:00 linux-dgwd mcelog[2340]: TIME 1566202714 Mon Aug 19 04:18:34 2019
2019-08-19T04:18:34.201913-04:00 linux-dgwd mcelog[2340]: MCG status:RIPV EIPV MCIP LMCE
2019-08-19T04:18:34.201990-04:00 linux-dgwd mcelog[2340]: MCi status:
2019-08-19T04:18:34.202066-04:00 linux-dgwd mcelog[2340]: Uncorrected error
2019-08-19T04:18:34.202143-04:00 linux-dgwd mcelog[2340]: Error enabled
2019-08-19T04:18:34.202220-04:00 linux-dgwd mcelog[2340]: MCi_MISC register valid
2019-08-19T04:18:34.202295-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid
2019-08-19T04:18:34.202371-04:00 linux-dgwd mcelog[2340]: SRAR
2019-08-19T04:18:34.202449-04:00 linux-dgwd mcelog[2340]: MCA: Data CACHE Level-0 Data-Read Error
2019-08-19T04:18:34.202525-04:00 linux-dgwd mcelog[2340]: STATUS bd80000000100134 MCGSTATUS f
2019-08-19T04:18:34.202599-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 6 SOCKETID 0
2019-08-19T04:18:34.202673-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923
2019-08-19T04:18:34.202748-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010
2019-08-19T04:18:34.202826-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85

From log, we know that we got two interrupts in a row when SRAR injected, the above one is from MCE (int 18) interrupt. EDAC doesn't decode mca info, because it is not a memory error (bit#7 is 0), and "mcelog --client" doesn't either.
The below one is NMI interrupt that defined in GHES, there are EDAC decode output, and command "mcelog --client" did show up something.

2019-08-19T04:18:38.274652-04:00 linux-dgwd kernel: [ 261.273226] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2019-08-19T04:18:38.274661-04:00 linux-dgwd kernel: [ 261.273227] {1}[Hardware Error]: event severity: recoverable
2019-08-19T04:18:38.274663-04:00 linux-dgwd kernel: [ 261.273228] {1}[Hardware Error]: Error 0, type: recoverable
2019-08-19T04:18:38.274664-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: fru_text: Card01, ChnA, DIMM0
2019-08-19T04:18:38.274690-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: section_type: memory error
2019-08-19T04:18:38.274692-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: error_status: 0x0000000000000000
2019-08-19T04:18:38.274694-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: physical_address: 0x0000000e16677000
2019-08-19T04:18:38.274695-04:00 linux-dgwd kernel: [ 261.273232] {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 2 device: 0 row: 56265 column: 696
2019-08-19T04:18:38.274698-04:00 linux-dgwd kernel: [ 261.273233] {1}[Hardware Error]: DIMM location: CPU 1 DIMM 5
2019-08-19T04:18:38.274699-04:00 linux-dgwd kernel: [ 261.273240] Memory failure: 0xe16677: already hardware poisoned
2019-08-19T04:18:38.274702-04:00 linux-dgwd kernel: [ 261.273244] mce: [Hardware Error]: Machine check events logged
2019-08-19T04:18:38.274705-04:00 linux-dgwd kernel: [ 261.273256] EDAC skx MC0: HANDLING MCE MEMORY ERROR
2019-08-19T04:18:38.274715-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: CPU 4: Machine Check Event: 0 Bank 255: b40000000000009f
2019-08-19T04:18:38.274716-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: TSC 0
2019-08-19T04:18:38.274718-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: ADDR e16677000
2019-08-19T04:18:38.274719-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: MISC 0
2019-08-19T04:18:38.274721-04:00 linux-dgwd kernel: [ 261.273264] EDAC skx MC0: PROCESSOR 0:50655 TIME 1566202718 SOCKET 0 APIC 8
2019-08-19T04:18:38.274722-04:00 linux-dgwd kernel: [ 261.273269] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0)
2019-08-19T04:18:38.275994-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error.
2019-08-19T04:18:38.276274-04:00 linux-dgwd mcelog[2340]: MCE 0
2019-08-19T04:18:38.276500-04:00 linux-dgwd mcelog[2340]: CPU 4 ACPI/APEI reported error
2019-08-19T04:18:38.276722-04:00 linux-dgwd mcelog[2340]: ADDR e16677000
2019-08-19T04:18:38.276946-04:00 linux-dgwd mcelog[2340]: TIME 1566202718 Mon Aug 19 04:18:38 2019
2019-08-19T04:18:38.277172-04:00 linux-dgwd mcelog[2340]: MCG status:
2019-08-19T04:18:38.277393-04:00 linux-dgwd mcelog[2340]: MCi status:
2019-08-19T04:18:38.277612-04:00 linux-dgwd mcelog[2340]: Uncorrected error
2019-08-19T04:18:38.277831-04:00 linux-dgwd mcelog[2340]: Error enabled
2019-08-19T04:18:38.278048-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid
2019-08-19T04:18:38.278267-04:00 linux-dgwd mcelog[2340]: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
2019-08-19T04:18:38.278485-04:00 linux-dgwd mcelog[2340]: Transaction: Memory read error
2019-08-19T04:18:38.278709-04:00 linux-dgwd mcelog[2340]: STATUS b40000000000009f MCGSTATUS 0
2019-08-19T04:18:38.278928-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 8 SOCKETID 0
2019-08-19T04:18:38.279147-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923
2019-08-19T04:18:38.279363-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010
2019-08-19T04:18:38.279580-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85
<<--

@aegl
Copy link
Collaborator

aegl commented Jun 17, 2024 via email

@xiaochunlee
Copy link
Author

Thanks for your patient explanation. so it sounds normal of EDAC decoder and "mcelog --client" show nothing in SRAR test case with no UEFI involved. Whereas, with UEFI help to collect data and notify OS via GHES table with a fake mca bank , the EDAC decoder and "mcelog --client" recorded data and output. it is normal as well. Is my understanding right? many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants