PSRAM Cache Issue stills exist (IDFGH-31) #2892

Open
neoniousTR opened this issue Dec 27, 2018 · 21 comments

9 participants
@neoniousTR

commented Dec 27, 2018

We found that the cache issue with PSRAM still exists, even in the newest development environment. It can produce random crashes even when the code is 100% valid.

This very small example program reproduces the problem easily, at least when compiled with the newest ESP-IDF and toolchain under Mac OS X (we did not try other environments):
https://github.com/neonious/memcrash-esp32/

(As a side note: We noticed this problem when we implemented the dlmalloc memory allocator in a fork of ESP-IDF. We worked around this problem (hopefully you can fix it correctly), and now have an ESP-IDF with far faster allocations. Take a look at the blog post here: https://www.neonious-basics.com/index.php/2018/12/27/faster-optimized-esp-idf-fork-psram-issues/ ).

@Alvin1Zhang Alvin1Zhang changed the title PSRAM Cache Issue stills exist [TW#28180] PSRAM Cache Issue stills exist Dec 28, 2018

@Alvin1Zhang

Collaborator

commented Dec 29, 2018

@neoniousTR Hi, neoniousTR, thanks for reporting this. We will look into it and post an update once we have any feedback. There is also a topic about this issue on our forum at http://bbs.esp32.com/viewtopic.php?f=13&t=8628&sid=1acc8bd897e72cf450ad9eb71491d732. Thanks.

@neoniousTR

Author

commented Jan 26, 2019

We updated the example project at https://github.com/neonious/memcrash-esp32
It is now leaner, more to the point, and most importantly, compiles out of the box.

We think this problem is urgent to fix, as random crashes can occur to anyone using the PSRAM of the ESP32.

There seem to be only two workarounds:

  • Use only the first 2 MB of the 4 MB of PSRAM (big penalty)
  • End every function that stores to PSRAM with a memw instruction (slow). Adding nops does not help.
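A minimal sketch of the second workaround, for illustration only. The helper names `psram_store_barrier` and `psram_store_u32` are hypothetical and not part of ESP-IDF, and the destination pointer is simply assumed to point into PSRAM. On Xtensa the barrier emits `memw`; on any other target it falls back to a plain compiler barrier so the snippet still compiles off-device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an ESP-IDF API. */
static inline void psram_store_barrier(void)
{
#if defined(__XTENSA__)
    /* memw: all earlier loads/stores complete before any later ones start. */
    __asm__ __volatile__("memw" ::: "memory");
#else
    /* Host fallback (assumption): compiler barrier only, no hardware effect. */
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Store to a location assumed to live in PSRAM, then end with memw. */
void psram_store_u32(volatile uint32_t *dst, uint32_t value)
{
    assert(dst != NULL);
    *dst = value;
    psram_store_barrier();
}
```

Every function that writes to PSRAM would need the same barrier before it returns, which is where the slowdown comes from.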

Please take a look at the project, and hopefully you have a better idea.

@Spritetm

Member

commented Jan 29, 2019

Fyi, we're working on this. For what it's worth, it seems to be caused by an interrupt (in your example, the FreeRTOS tick interrupt) firing while some cache activity is going on. We have our digital team running simulations to see what exactly is going on in the hardware; we hope to create a better workaround than the memw solution from that.

@neoniousTR

Author

commented Jan 29, 2019

Good that you can reproduce this.
Interrupts are a good explanation for why this happens only randomly.
Hoping for the best.

@markwj


commented Jan 30, 2019

We seem to be seeing this as a std::string memory corruption (all zeros, on a 4 byte boundary).

In our case, disabling the top 2 MB of SPIRAM didn't seem to work. But pre-allocating 2 MB (which we then never use) seemed to work around the problem. Our code runs primarily on core #1.
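A rough sketch of what such a pre-allocation could look like, under several assumptions: the function name `psram_reserve` is hypothetical, the block is requested with ESP-IDF's `heap_caps_malloc` and `MALLOC_CAP_SPIRAM`, and which half of PSRAM the block ends up in depends on the allocator state at the time of the call. Outside an ESP-IDF build the snippet falls back to `malloc` so it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
/* Request the block from external RAM only. */
#define RESERVE_ALLOC(n) heap_caps_malloc((n), MALLOC_CAP_SPIRAM)
#else
/* Host fallback (assumption) so the sketch compiles off-target. */
#define RESERVE_ALLOC(n) malloc(n)
#endif

#define PSRAM_RESERVE_BYTES ((size_t)2 * 1024 * 1024)

/* Held for the lifetime of the program and never read or written. */
static void *s_psram_reserve;

/* Hypothetical helper: returns 1 if the block is reserved, 0 on failure.
 * Intended to run early in startup, before other PSRAM allocations. */
int psram_reserve(size_t bytes)
{
    assert(bytes > 0);
    if (s_psram_reserve == NULL) {
        s_psram_reserve = RESERVE_ALLOC(bytes);
    }
    return s_psram_reserve != NULL;
}
```

Calling `psram_reserve(PSRAM_RESERVE_BYTES)` once at the start of `app_main` would be the intended usage. A single contiguous 2 MB request can fail if the heap is already fragmented, so the return value is worth checking.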

This is impacting us quite badly. Lots of random corruptions and crashes with devices in the field.

@neoniousTR

Author

commented Jan 30, 2019

Maybe whether the top or bottom of the RAM works depends on the core used.

@dexterbg


commented Jan 30, 2019

Confirmed: running our test project from #3006 with that 2 MB allocation and also starting the test task on core 0 shows the corruptions again. It seems core 0 can only work reliably with the lower 2 MB and core 1 only with the higher 2 MB.

@neoniousTR

Author

commented Feb 1, 2019

@Spritetm or @Alvin1Zhang
As this issue does not happen in single core mode, do you know whether the original PSRAM cache issue, which is fixed with the flag that adds many nops and memws, also occurs only in dual core mode?

If so, we will try to switch low.js to single core mode. This might even be faster in the end, because the JavaScript itself is single core anyhow and carries most of the load.

Also, how is progress going? I'd think the best chance of fixing this is to modify the interrupt handlers or the cache fetchers and savers (are they part of the ROM?).

@xbary


commented Feb 11, 2019

Hello, I wanted to add an error I observed in my application using PSRAM: a random error when retrieving the amount of free PSRAM. My application checks the amount of free PSRAM in the main loop. Occasionally the reply was wrong, for example 16 bytes free, but at the next check it correctly reported ~4 MB.

In my opinion, there must have been an erroneous random reading from PSRAM.

@neoniousTR

Author

commented Feb 19, 2019

Our current status:

Load to cache/Write from cache does not seem to be interrupt-based. It might be 100% hardware-based?

Adding memw to the interrupt handlers does not change anything.

Currently we believe Dual Core + PSRAM is a broken combination.

So we will completely switch to Unicore now.

Please answer:
Do you know if the original PSRAM cache issue also exists in unicore mode? It would be great if we could get rid of the nops and memws once and for all...

@xbary


commented Feb 21, 2019

I ran an experiment: I rewrote the String class from the Arduino project to use PSRAM. I changed the name and changed the realloc in the changebuffer function. It turned out that my application then irregularly showed 0x00 in one cell.
Here is the changed String class: https://github.com/xbary/xb_StringPSRAM
I would suggest replacing the class with StringSRAM as part of your tests; it may help you reproduce the error repeatably.

neoniousTR added a commit to neonious/lowjs that referenced this issue Feb 22, 2019

@neoniousTR

Author

commented Feb 22, 2019

I have to confirm that the original PSRAM workaround is still required in Unicore mode.

So Workaround + Unicore is the only combination which works reliably with PSRAM. If I am wrong, I hope somebody will post. Otherwise we have to take this as a fact ...

@me21


commented Feb 23, 2019

Can a dual core chip be switched to unicore mode?
Does this error manifest itself in the Arduino framework? As far as I know, the Arduino task is pinned to core 0, so it can effectively be viewed as unicore. Am I right?

@neoniousTR

Author

commented Feb 23, 2019

@Spritetm

Member

commented Feb 26, 2019

FWIW, we have a tentative solution for this; the existing workaround does actually seem to work but doesn't take calls/returns into account properly. We'll ship a toolchain with improved workaround code soon, but we want to have this fairly well tested so we don't have any other edge cases sneaking past us. I'll see if I can post a preliminary patch as soon as I have something halfway stable.

@markwj


commented Mar 4, 2019

Do you have any idea of the schedule for this, or a way to get us a pre-release toolchain?

This is impacting us quite badly. The 2MB pre-allocation solves the problem for our code, but just shifts the problem to wifi running on the first core (which now experiences random errors and throughput problems).

@xbary


commented Mar 4, 2019

I confirm the error still occurs at random moments, and sometimes the device even hangs completely.

@projectgus projectgus changed the title [TW#28180] PSRAM Cache Issue stills exist PSRAM Cache Issue stills exist (IDFGH-31) Mar 12, 2019

@markwj


commented Apr 25, 2019

Do you have any idea of the schedule for this, or a way to get us a pre-release toolchain?

@dexterbg


commented May 25, 2019

@Spritetm We do appreciate your efforts in making sure your patch is perfect. But meanwhile our system has to bear a huge performance hit by the workaround, while the stability is still impacted by the bug. We're more than willing to help you in beta testing your patch by using it on our project. Please do a pre-release or share some update on the status. Thanks!

@Patrik-Berglund


commented May 26, 2019

I also think an update is in order. We are waiting to see whether you are able to fix this bug or whether it makes the PSRAM feature unusable.

We need more RAM than is available internally in the ESP32, so this is a deal breaker for our product.

@negativekelvin

Contributor

commented May 26, 2019

Just wondering why, if the original workaround should work in this case, forcing nops does not resolve it.

400d4b3c:	1047a5        	call8	400e4fb8 <crash_set_both>
400d4b3f:	f03d      	nop.n
400d4b41:	f03d      	nop.n
400d4b43:	f03d      	nop.n
400d4b45:	f03d      	nop.n
400d4b47:	0228      	l32i.n	a2, a2, 0

Also I noticed the workaround will add the nops even when there is already a memw barrier.

400d4e86:	03a9      	s32i.n	a10, a3, 0
400d4e88:	0020c0        	memw
400d4e8b:	01a9      	s32i.n	a10, a1, 0
400d4e8d:	f03d      	nop.n
400d4e8f:	f03d      	nop.n
400d4e91:	f03d      	nop.n
400d4e93:	f03d      	nop.n
400d4e95:	0020c0        	memw
400d4e98:	0138      	l32i.n	a3, a1, 0