Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

double-bit ECC failures that occur during load from ICCM do not set cptra_hw_error_fatal::iccm_ecc_unc #295

Closed
korran opened this issue Nov 13, 2023 · 5 comments · Fixed by #299
Labels
bug Something isn't working

Comments

@korran
Copy link
Contributor

korran commented Nov 13, 2023

To reproduce:

Corrupt a word in ICCM, and try to load it with the lw instruction.

The only symptoms that occur when this happens are an NMI with generic MCAUSE 0xf000_0001 (NMI_CAUSE_DBUS_NON_BLOCKING_LOAD_ERROR):

mcause: 0xf0000001,
mscause: 0x0,
mstatus: 0x1800,
mtval: 0x0,
cptra_hw_error_fatal: 0x0
cptra_hw_error_enc: 0x0
cptra_hw_error_non_fatal: 0x0
error_internal_intr_r: 0x0

If this happens in the field, it will difficult to root-cause it, as there doesn't seem to be any indication to the uC an ECC error occurred.

Screenshot from 2023-11-13 01-01-10

@korran korran added the bug Something isn't working label Nov 13, 2023
@Nitsirks
Copy link
Contributor

Digging into VeeR logic here. Looks like the iccm ecc double error fires only on IFU requests to ICCM. When LSU does a DMA to the ICCM it just flags it as a bus error (same with DMA to DCCM as far as I can tell).

Found this in the PRM:
image

@bharatpillilli
Copy link
Collaborator

That makes sense because the data retrieval is not an IFU operation. This would be similar to operation on the bus where there is a parity error (I understand that our bus doesnt support parity errors :-)).

@korran
Copy link
Contributor Author

korran commented Nov 14, 2023

@bharatpillilli I'm fine with the fault being a generic fault, but we need some mechanism to tell that ECC was the cause of the fault. When this happens, I would like to see the hardware set the iccm_ecc_unc bit in the soc_ifc.cptra_hw_error_fatal register.

@calebofearth
Copy link
Collaborator

This error could be communicated to the SoC by writing to the cptra_fw_error_fatal register, which will also trigger the cptra_error_fatal output pin. Firmware should probably be doing this for any NMI it encounters already.

I understand that an NMI may be unrecoverable and firmware might be unable to set the register - but this is the best option with current hardware.

@korran
Copy link
Contributor Author

korran commented Nov 15, 2023

The firmware is already setting the cptra_fw_error_fatal register in response to the NMI, but it looks like a fault that could be caused by buggy firmware. AFAICT only internal CPU signals know that this was actually caused by an ECC error.

calebofearth added a commit that referenced this issue Nov 30, 2023
* Fix the GH issue #295, which describes a scenario where ECC errors on reads from ICCM may not trigger the error signal and cause cptra_error_fatal to assert.
* Fix a minor UVM issue in uvmf_soc_ifc that causes intermittent errors during nightly regression.

Related work items: #597603, #597604, #597607
calebofearth added a commit that referenced this issue Dec 1, 2023
* Fix the GH issue #295, which describes a scenario where ECC errors on reads from ICCM may not trigger the error signal and cause cptra_error_fatal to assert.
* Fix a minor UVM issue in uvmf_soc_ifc that causes intermittent errors during nightly regression.

Related work items: #597603, #597604, #597607
howardtr pushed a commit that referenced this issue Dec 5, 2023
* Add test for Mailbox JTAG accesses with clock gating

Internal-tag: [#47908]

* Update on Readme

* SHA 256 final

* update on readme

* Added HMAC

* Updated coprights

* Create Security and Response policy

* Merged PR 124577: Fix WDT NMI prediction

Fixes the case where timers restart and then t2 times out in cascade mode and NMI is triggered. Expected txn needs to occur in the same clk as NMI interrupt

Related work items: #545794

* Merged PR 125379: added message reduction

added message reduction

Related work items: #554637

* Merged PR 125576: Increase WDT timer timeout value

Enforce a minimum of 5 clks on timer timeout values for WDT tests to avoid transaction mismatches

Related work items: #555119

* Merged PR 125587: KV clear prediction fix, debug x AHB sequence, convert rst cross cps to cover props

1. Emulate the 1 cycle delay in clear function
2. Add more scenarios to debug sequence
3. Rst cross coverpoints never get hit, converting to cover properties for better coverage

Related work items: #532848

* Merged PR 126129: added failure in signing if generated signature has s=0

added s_out_of_range failure for signing output
added ecc test vector for seed/nonce/iv full range

Related work items: #561868

* Merged PR 125683: Scan mode dft override and synchronizer removal

- fix for scan coverage
Removing synchronizer from scan mode signal on reset override.
Adding warm reset pin to drive reset during scan mode for max coverage.

Related work items: #555293

* Merged PR 126213: Check for pending t1 interrupt before changing timeout values in RT fw

Related work items: #562658

* Merged PR 126577: Coverage merge all duts

Pipeline to merge all duts coverage into one database

Related work items: #563789

* Merged PR 126835: Fixing +COVERAGE compiles that broke after coverage bind additions

Addition of coverage binds to top level caliptra benches broke compilation when +COVERAGE was added. Missed this due to pipeline and manual testing not includingt he option.

Related work items: #563789

* Merged PR 127057: added a missing default case to hmac_drbg_interface

added a missing default case to hmac_drbg_interface

Related work items: #565291

* Remove MSFT internal collateral files

* Merged PR 127071: UVM validation FW fix - check/clear error interrupts at MBOX flow entry

Firmware fix to clear any error interrupts held over after previous mailbox flow handling exited, but before the mailbox returned to idle state. Resolves a UVM regression edge case.

Also, force firmware randomization seed to match hardware seed by extracting seed value from the yml test file (which accounts for manual override in local runs).

Related work items: #565323

* Merged PR 127106: Fix rd_data cg instantiation

Related work items: #565386

* Merged PR 127097: More fixes to coverage merging

Fixing how coverage is merged

Related work items: #563789

* Merged PR 127448: MSFT sync: Manual file-copy from GH dev-integrate to MSFT internal repo

MSFT sync: Manual file-copy from GH dev-integrate to MSFT internal repo

Related work items: #566127

* Merged PR 127773: Adding caliptra top tb directed regression to coverage roll up

Adding coverage switches to caliptra top directed regression
Changes to coverage pipeline to roll up caliptra top directed and random regressions
Added symlink to latest merged coverage directory

Related work items: #563789

* Merged PR 127232: UVM fix for soc_ifc_rand_test deadlock edge case

Fix for multi-threaded reg accesses resulting in deadlock on uvm_reg

Related work items: #565702

* Merged PR 127980: Fixing MBOX spurious double ecc error

- removing extra pointer resets so that we don't generate spurious interrupts on unnecessary reads of dword 0
- fixing tests added to directed test list, bad path

Related work items: #567016

* Merged PR 128205: UVM regression fix for multi-agent arb issue, force_unlock deadlock issue

Fixes two issues:
- Known UVM bug (described here: https://forums.accellera.org/topic/7037-register-write-clobbers-simultaneous-access-in-multi-threaded-testbench/) that causes uvm_reg arbitration to fail (access semaphore has a bug). This causes failures during the multi-agent sequence when multiple agents are trying to access mbox_datain. Solved with an additional application-layer semaphore custom to our reg-block.
- A recent fix to solve an unreturned semaphore in the register layer (unrelated to the above) added a bug that may cause deadlock in the error injection CPTRA-side handler sequence when an error occurs.

Related work items: #566556, #567666

* Merged PR 127470: Disable timers after first timeout before NMI check

Randomized timeout values can be small enough that timer1 times out a second time before NMI testing is done in RT. The intr check helps but needs to happen just before the timer is restarted.

Related work items: #566167

* Merged PR 128247: [UVM] Fixes in val env. for several regression failures related to timeouts or Mailbox FSM edge cases

When an illegal transfer occurs concurrent with a legal mailbox interaction, the erroneous access should take precedence and flag a protocol violation instead of continuing with the normal flow.
* Bug issue: https://dev.azure.com/ms-tsd/AHA_POC/_workitems/edit/519733
* Will not fix for 1p0. Instead, this PR adds an explicit print message when the known failure scenario occurs, to aid regression review.

Add a fix for a UVM sequence-specific failure where double-bit ECC error injection can result in a timeout (by corrupting the "expected" response dlen value to a large number).

Related work items: #568733, #568736

* Merged PR 128855: [Bug fix] Mailbox rd_valid_f signal rst/init value; [UVM] validation fw fix for error intr handling

RTL fix:
- Add reset value for mbox_rd_valid_f, resolving a potential issue with mbox_dataout containing X values (resolves #250)

UVM Validation fix:
- Clear cmd_fail/inv_dev error interrupts at Mbox flow entry (val runtime firmware)

Related work items: #569091, #569460

* Remove integ spec PDF as we migrate to Markdown format

* README updates:

Add
 - firmware regression list description
 - enhance UVM run steps
 - describe test list selection
 - describe Verilog file list generation and usage

* Formatting

* Formatting

* Formatting

* Formatting

* Formatting

* README: Tool version info

* initial markdown conversion

* minor updates based on feedback

* add images and image references

* Update Caliptra_rtl.md

Adding specific signal names for flops to remove from scan chain to protect obfuscation key leakage.

* Merged PR 129340: KV debug test update and other misc items

1. KV debug test update to issue reset every time debug/scan mode is toggled
2. Update rst window assertions to disable during scan mode
3. Add common_defines to clk_gate.sv (#248)
4. Add WDT + rst test
5. WDT regression fix

Related work items: #574347, #574348

* Merged PR 130632: Remove common_defines from clk_gate

common_defines inclusion in clk_gate.sv causes compilation issues

* Merged PR 130640: Small coverage improvements and adding directed tb to merge pipeline

- cleaning up coverage merge pipeline
- updating map file for directed testbench mappings
- Adding streaming case from caliptra side for sha accelerator to hit stall condition
- Removed coverpoints that can't be hit
  - soc ifc reg doesn't stall, so ip signal isn't valid
  - Boot done -> idle arc is tied off. State is terminal

Related work items: #486758, #563789

* Merged PR 130547: UVM val FW bug fix to resolve regression failure

Do an FSM check again upon detecting error interrupt - to catch late-asserting error transition

Related work items: #575104

* Merged PR 131583: Regression test list typo fix

Fix a typo in the firmware test list providing testsuite for nightly directed regression

Related work items: #582436

* Merged PR 131385: fix for req hold bug, issue 259

fix for req hold bug, issue 259
sha mbox accesses would always assert req_hold the entire time, even if there was no conflict
fixed by qualifying the hold correctly, now soc or uc can read mbox registers without getting held

Related work items: #579190

* Merged PR 131836: Include config_defines in clk_gate

Include config_defines in clk_gate

Related work items: #583044

* Merged PR 131898: RDL register description updates and fix for UVM prediction issue

Add enhanced text descriptions for all of the error fatal/non-fatal registers and their internal mask registers that explains how the register contents and transitions are related to the assertion of output interrupt signals, cptra_error_fatal and cptra_error_non_fatal.
Add text description for CPTRA_DBG_MANUF_SERVICE_REG, which resolves github issue #261
Add a fix in the UVM class soc_ifc_predictor to correct the prediction of cptra_error_non_fatal based on triggering events instead of directly calculating the interrupt pin based on register contents.

Related work items: #583195

* Camel case for markdown docs

* Merged PR 132089: fixing wait count, worst case is actually 33 for direct read conflicts

fixing wait count, worst case is actually 33 for direct read conflicts

Related work items: #583732

* Merged PR 132153: TB Fix: soc_ifc_tb standalone test is incorrectly checking WO trigger regs.

Fix the reg stimulus/checking for WO intr blk regs (triggers)

Update the text description in register RDL files to add clarity on usage of trigger register.

Related work items: #583997

* Fix mv path

* Clarify eTRNG usage

* Language: RNG -> TRNG

* Update integ spec to 0.9 version; rollback release notes to reflect 0.9 release

* Language, simplification

* Version docs at 1.0-rc1

* Update CaliptraIntegrationSpecification.md

Updating integration parameter table to include the file where the define/parameter is present as well as updating the names to match RTL.

* Update CaliptraIntegrationSpecification.md

Moving "defines" to defines table

* RDL: Add RNG unavail bit to dbg manuf reg description (#283)

Add Random Number Generator Unavailable bit to dbg manuf register description

* Merged PR 132462: [UVM] Fix for regression failure caused by soc_ifc error injection sequence

Adjust wait methodology when pausing the rand reg access routine to avoid errant termination of the task and subsequent deadlock

Related work items: #584641

* Merged PR 133196: KV test content for coverage

Added some interleaved operations like writes to random clients during debug/scan modes, after clear/locks, etc. Earlier, only a few of the clients were being exercised. Added a task to randomly select a combination of write clients

Related work items: #586448

* Merged PR 132944: UVM val firmware bug fix: solve a possible error race condition

* In UVM validation firmware for caliptra_top, fix the mbox_unlock procedure so internal firmware interrupts in the data structure are cleared before asserting mbox_unlock. This allows subsequent errors immediately after the unlock to trigger a whole new error handling flow instead of being masked.

Related work items: #585880

* Merged PR 133433: Add SV assertions to uvmf_caliptra_top testbench

Add SV assertions to uvmf_caliptra_top testbench

Related work items: #586843

* Merged PR 133575: Remove top port TODO comments

* Remove TODO comments from the top portlist in caliptra_top
  * Resolves #284
* Fix uvmf_caliptra_top compilation with the iTRNG option by including the UVMF_CALIPTRA_TOP define (for SVA usage).
* Add generated UVM compilation file lists (.vf)
  * The compilation file lists should provide clarity regarding #265

Related work items: #587032, #587095

* Spec update with synthesis warnings and jtag tck requirement

* Added some more description

* Apply suggestion from review

* Remove accidentally placed description

* Update expected mailbox rdptr value

Internal-Tag: [#51338]
Signed-off-by: Maciej Kurc <mkurc@antmicro.com>

* Remove I3C interface placeholder comment (#300)

* Remove support for JTAG read IDCODE instruction from VeeR TAP

Internal-Tag: [#51306]
Signed-off-by: Maciej Kurc <mkurc@antmicro.com>

* Remove expected IDCODE from OpenOCD config

Internal-Tag: [#51306]
Signed-off-by: Maciej Kurc <mkurc@antmicro.com>

* initial markdown conversion of hardware spec

* [README] Update VCS steps (#308)

* Update VCS steps with instructions to copy test vector generator

* Add Makefile VCS sim step to copy test vector gen scripts

* Add steps for running UVM unit tests

* VCS instructions for running unit tests

* Fix VCS invocation in Makefile so that DPI functions get compiled. (#306)

Internal-Tag: [#51415]

Signed-off-by: Maciej Kurc <mkurc@antmicro.com>

* Merged PR 133861: Filesystem merge from caliptra-rtl GitHub to MSFT internal

Manual filesystem merge from GH repo
(Bringing back to GH to include some file modifications - lic headers)

Related work items: #587660

* Commit minor tweaks to sync infrastructure with MSFT internal repo (hand-copied)

* Merged PR 134395: KV UVM fixes

Addresses regression failures due to test setup issue and last dword clear logic

Related work items: #588795

* Merged PR 134100: Update synthesis script with FC commands

Migrate our synthesis setup to fusion compiler
NOTE: MSFT internal synthesis flow is used as a pipe-cleaner to check
for synthesizability, lint, timing. This may be different from tools
used by other Caliptra developers for more rigorous physical analysis.

Related work items: #589061

* Merged PR 134598: UVM regression fixes for soc_ifc deadlock and AHB stall

Fix three different regression errors caused by testbench bugs:
* An edge case can cause AHB interface to stall up to 34 clock cycles when running SHA accelerator operations, previous TB code flags an error above 33 clock cycles
* A multi-threaded sequence issue in the soc_ifc mailbox testcase with random register access injection can result in deadlock
* A false-positive test pass might be reported for Caliptra-initiated mailbox tests - erroneous/unexpected MBOX_ERROR transitions are handled normally, even for non-error-injection scenarios.

Related work items: #589323, #589324, #589546

* Merged PR 134981: Update kv scan sequence

Update lower level scan mode sequence to make debugUnlock input to KV a pulse instead of a level
Add some helpful prints to predictor

Related work items: #591177

* Merged PR 136182: Fix ICCM ECC error not reported

* Fix the GH issue #295, which describes a scenario where ECC errors on reads from ICCM may not trigger the error signal and cause cptra_error_fatal to assert.
* Fix a minor UVM issue in uvmf_soc_ifc that causes intermittent errors during nightly regression.

Related work items: #597603, #597604, #597607

---------

Signed-off-by: Maciej Kurc <mkurc@antmicro.com>
Co-authored-by: Robert Szczepanski <rszczepanski@antmicro.com>
Co-authored-by: ludwig247 <tobas.ludwig@lubis-eda.com>
Co-authored-by: tobias ludwig <tobias.ludwig@lubis-eda.com>
Co-authored-by: Lou Ferraro <112569056+FerralCoder@users.noreply.github.com>
Co-authored-by: Kiran Upadhyayula <kupadhyayula@microsoft.com>
Co-authored-by: Mojtaba Bisheh Niasar <mojtabab@microsoft.com>
Co-authored-by: Michael Norris <michnorris@microsoft.com>
Co-authored-by: Piotr Kwidzinski <110058193+pkwidzin-amd@users.noreply.github.com>
Co-authored-by: bharatpillilli <bharat.pillilli@gmail.com>
Co-authored-by: steph-morton <143441730+steph-morton@users.noreply.github.com>
Co-authored-by: Andres Lagar-Cavilla <1432143+andreslagarcavilla@users.noreply.github.com>
Co-authored-by: Michael Norris <108370498+Nitsirks@users.noreply.github.com>
Co-authored-by: Andres Lagar-Cavilla <andreslc@google.com>
Co-authored-by: Kiran Upadhyayula <kupadhyayula@fe72.svceng.com>
Co-authored-by: Maciej Kurc <mkurc@antmicro.com>
Co-authored-by: mcockrell-google <mcockrell@google.com>
Co-authored-by: Karol Gugala <kgugala@antmicro.com>
mojtaba-bisheh pushed a commit that referenced this issue Jun 10, 2024
* Fix the GH issue #295, which describes a scenario where ECC errors on reads from ICCM may not trigger the error signal and cause cptra_error_fatal to assert.
* Fix a minor UVM issue in uvmf_soc_ifc that causes intermittent errors during nightly regression.

Related work items: #597603, #597604, #597607
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants