Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve available heap #3740

Open
devyte opened this issue Oct 21, 2017 · 53 comments
Open

Improve available heap #3740

devyte opened this issue Oct 21, 2017 · 53 comments

Comments

@devyte
Copy link
Collaborator

devyte commented Oct 21, 2017

Hardware

Hardware: all
Core Version: git / 2.4-rc2

Problem Description

Summary

This issue is meant to track a RAM optimization effort. The goal is to increase available RAM as much as possible.

Details

The following are initial ideas to investigate.

  • Move all const char strings to flash, and use the the access macros to access them. Implement alternate methods/functions with the correct argument signatures where applicable. This covers both the core as well as the Arduino lib code.
  • Move const declarations to flash. Example: CHARS_PER_LINE in libb64/cencode.c.
  • Revise all singleton object declarations. Make sure that all are wrapped by #defines that allow compile-time configuration to not use the global singletons. Example: Use of the Serial, but not Serial1, frees up 256+ bytes of heap.
  • Revise all uses of constant size char [] arrays. Arrays like that should not be declared on the stack, but rather created dynamically, and only on demand. Example: 2K buffer in ESP8266Webserver.
    -Revise all uses of String. When concatenating, the internal managed array will grow, sometimes far beyond what is needed. It is sometimes better to reserve, if there is an estimate of how much space will be needed.
  • Revise object implementations. Make sure objects are passed by reference or pointer, and never by copy.
@Pablo2048
Copy link

Pablo2048 commented Oct 21, 2017

Actually I suggest to modify debug messages also to store text into PROGMEM. I've write macro for this reason:
#ifdef DEBUG_ESP_PORT #define DEBUG_MSG(_1, ...) DEBUG_ESP_PORT.printf_P(PSTR(_1), ##__VA_ARGS__) #else #define DEBUG_MSG(...) #endif

@d-a-v
Copy link
Collaborator

d-a-v commented Oct 21, 2017

There is this macro USE_OPTIMIZE_PRINTF in osapi.h which, if defined, makes all os_printf's fmt stored in flash.

@Pablo2048
Copy link

Yes, but IMHO this is not used in debug messages inside libraries - for example look into ESP8266WiFiGeneric.h macro DEBUG_WIFI_GENERIC. Every debug macro inside library uses DEBUG_ESP_PORT.printf() so it uses Serial, or Serial1 object, not os_printf...

@earlephilhower
Copy link
Collaborator

earlephilhower commented Oct 21, 2017

You probably already know this, but the following command lines are very useful for finding stuff in RODATA that's eating up RAM:

For finding constant strings: objdump  -s -j .rodata /tmp/arduino_build_<code>/*.elf

For finding symbols(aka variables): readelf -a /tmp/arduino_build_<code>/*.elf| grep 3f

Some wasteful things that stick out for me:

  • ctype: 256 byte constant lookup table that appears in most apps using any string ops
  • dns_table: 1.1KB to hold DNS entries for only 4 hostnames. This needs LWIP recompiled with a smaller #define'd hostname length.

@d-a-v
Copy link
Collaborator

d-a-v commented Oct 21, 2017

@earlephilhower about dns_table, what would be reasonable? 4 hosts, 64 chars each ?
(check here and there, knowing that lwip/256chars is RFC-compliant)

@earlephilhower
Copy link
Collaborator

@d-a-v well, that's a good question. I do know the RFC says 256 octets, but practically on a tiny ESP8266 you're likely only initiating connections to 1 or 2 servers with 32-64 byte hostnames at any time.

With no changes to code, I like your 64-byte x 4 suggestion. If small LWIP patches can be entertained, then I'd either submit something that only stored hashes of the hostname (and thus only requiring 64-bytes for any length hostname) or did a dynamic allocation like @devyte is considering.

AFAIK there are already memory allocations in LWIP, so this second option may not break any contracts (but I've only spent a little bit going through the code so could be wrong here).

And if we're going heroic on it, valid DNS hostnames only allow something like 40 valid characters so you could even do a packed 6-bit encoding to save 25% of any space without any compromise on lengths.

@devyte
Copy link
Collaborator Author

devyte commented Oct 22, 2017

@earlephilhower thank you for that readelf/objdump suggestions, I've actually done that, but only once before and a looong time ago. It didn't even cross my mind to apply it here.
Any further suggestions along those lines are welcome.

@devyte
Copy link
Collaborator Author

devyte commented Oct 22, 2017

I seem to remember that at the nodemcu firmware repo they pulled some tricks in a 2-nd pass linker step, or post-link, or something along those lines. In essence, they post-processed the built binary.
Does anyone know the details, and whether a similar approach would make sense here?

@devyte
Copy link
Collaborator Author

devyte commented Oct 22, 2017

@d-a-v what do you think of compiling lwip2 with C++? I migrated a big C project (>1M lines of code) to C++ a few years back, and it wasn't too painful.
There's a lot of potential for tweaks inside lwip, but doing them in C-world makes for fragile code (I speak from experience).

@d-a-v
Copy link
Collaborator

d-a-v commented Oct 23, 2017

@earlephilhower updating dns names from 256 to 64 chars can be done at no cost (and I guess no harm), this is what lwipopts is intended for.
But playing hero and changing string encoding to pack names is I think, going to kill lwip a little more than it already is because everything is subject to bugs and lwip is actually maintained elsewhere.
My thinking is that the little esp82jewel already has enough bugs to cope with. I started to try and debug lwip once, like many others before me, but here, it is not the place to do that. That's why I tried, with #3362, to have a working-maintained-unmodified lwip. This implementation downloads and compiles original lwip with not any patch. Thus lwip bugs/improvements should be tracked there and not here.

So @devyte, we can try and compile lwip with C++, but I think we should submit all patches to lwip-maintainers, not trying to maintain them here.

I too have a proposal for fun: #3362 is an abstraction of lwip's netif (the link-layer) for any tcp stack on top of it. If you know about a smaller/lighter ip/tcp implementation (better ram+flash footprint), we can try and use it - WiFi* classes would however have to use this new stack.

@earlephilhower
Copy link
Collaborator

@d-a-v I saw you working on this when I was doing the HTTPS Server, but it didn't click that you were using an unmodified LWIP build. That sounds great!

I'd like to take a deeper look (should I ever get some time) at the LWIP repos and see if I can submit a PR that does at least the dynamic allocation of hostnames. That itself should be an almost trivial change and reduce memory dramatically in most cases (DNS for www.yahoo.com would take 16-bytes for the string vs. 256 w/a fully RFC-compliant configuration or even 64 with the reduced name length one). On LWIP init allocate 1-byte "\0" string pointers for the hostname. Then on any assignment w/the hostname on the LHS just do a "free(hostname[x]), hostname[x] = strdup(searchname)"...

@devyte - One additional tweak to the command lines I use can give you a sorted list by symbol size:
readelf -a /tmp/arduino_build_*/*.elf| grep 3ff | sort -n -k 3

@d-a-v
Copy link
Collaborator

d-a-v commented Oct 24, 2017

@earlephilhower it seems a good idea. Asking them if they would adopt such a patch would be the first harmless thing to do ?
In the meantime what would all recommend for the static table ?
3 entries 48 chars each ?
4 entries 64 chars each ?
(so I update defaults at least in lwip2)

@d-a-v
Copy link
Collaborator

d-a-v commented Oct 24, 2017

@devyte in your migration process, did you just make your C project C++ compliant without further changing data structures ?

@devyte
Copy link
Collaborator Author

devyte commented Oct 24, 2017

@d-a-v The (simplified) outline of what I did was as follows:

  1. renamed all source files from .c to .cpp to get the C++ compiler to kick in
  2. fix all of the C++ reserved keywords that happened to be used in C
  3. remove all of the extern "C" wrappers
  4. fix all of the symbol name clashes (globals, function names)
  5. other remaining details, which were rather few

I didn't change any data structures., except due to reserved keywords.

@earlephilhower
Copy link
Collaborator

earlephilhower commented Oct 25, 2017

I think that'd kind of defeat @d-a-v 's hope of just pulling LWIP and compiling straight from the sources, no?

Are you just trying to get a deeper linting? Maybe a "-Wall -Wpedantic" to the boards.txt GCC command line would get you where you want to be without code changes and file renames? I'm not sure what running g++ instead of gcc is going to give other than stricter warnings...

@dioguerra
Copy link

I detected that the biggest HEAP consumption happened when moving from 2.3.0 to 2.4.0-RC1

With the same project i had 34020B Free RAM to just 25680B and 28344B (2.4.0-rc2)

And with ArduinoJson and SecureConnection my heap goes bye bye

@devyte
Copy link
Collaborator Author

devyte commented Oct 25, 2017

@earlephilhower the C++ compiler is stricter, and there are differences in how code is optimized. In our project, speed went up by 5-7%, and out of about 200+ bugs, 110 were found straight up due to the C++ compiler complaining about something weird in the code.
Of course, that was just one project, and it's entirely possible that the same bugs could have been found in other ways, but the experience was quite sobering overall.
For the record, I do NOT expect to maintain patches here just for compiling with C++. Like @d-a-v said, any code changes for that should get submitted upstream to the lwip repo.
What we gain out of the box is scopes and easier C++ hooks, in case we need them.
BTW, I am not pushing for this as a must-have, but I do think that some experimentation is worth it.

@penfold42
Copy link

@mrd2 have you tried the readelf to see where the heap went ?

@dioguerra
Copy link

@penfold42 I shall check

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 16, 2017

I would like to mention here #3978.
@earlephilhower suggested a very cheap update for a huge win, the goo'old -DNDEBUG.
Thanks

@earlephilhower
Copy link
Collaborator

896 is definitely an improvement over 536, but it goes from about a 1/2 duty cycle of play/stutter to a 2/3 duty cycle which unfortunately still isn't usable. Memory usage seems to have gone up ~0.5K (not scientific as the measurement fluctuates during runtime).

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 1, 2018

So the conclusion is that v2@1460 works but takes a little more ram than v1@1460. I had to increase back dns cache name length to (256->48->)128 since 48 was too small for some IOT clouds. We are back to the OT's subject which is to improve available heap.
I'll check once again lwipopt.h if there are other things we can modify. Feel free to report back if you find something.

@earlephilhower
Copy link
Collaborator

Additional dynamic heap savings could be had by rewriting the sprintf_P() function to actually use th PROGMEM format string directly from ROM, and not doing a malloc(), strcpy_P, then free(). Just ran into that in my own app where the index.html string is ~1.5K and the heap was too fragmented to find enough space. I ended up writing my own mysprintf_P(), although the workaround could also have been to sprintf_P() line-by-line.

This would also allow for %S == PSTR string formats, since you're probably stuck rewriting the whole vsprintf() function anyway.

@d-a-v
Copy link
Collaborator

d-a-v commented Jan 2, 2018

I was thinking that os_printf_plus() indifferently allows char pointers to be in ram or flash. I did not check if there was a vsnprintf_plus() too.

I have a class made to behave like a char* but which is able to automatically handle data whether it is in ram, flash (or eeprom on avr or wherever data could be). By using it there is no need to make any distinction on data location. Memory footprint is at most a pointer size (32 bits on esp, 16 bits on avr), it cannot be used as an array (char by char loops are needed), it is not fast, but it helps a lot making simple code without needing to copy/transfer data or_P duplicate_P all_P functions_P.
I was wondering if it could be useful for ssl libraries (lots of large flash->ram copies there) or, since we are now at rewriting vsnprintf() - which I did for this class - if it could help ESP saving the precious heap.
link to ustr

If we do rewrite vsnprintf() please don't forget %i which I miss a lot.

@devyte
Copy link
Collaborator Author

devyte commented Jan 3, 2018

String seems to currently always allocate up to 16 bytes more than requested, even when using reserve():
The impact is that an application that uses many String objects will waste a lot of heap.
The calls to reserve() and changeBuffer() pass a size value as argument, which never takes into account the null term of a string (i.e.: strlen()). The only reason the code works is because of the extra allocated space.
The correct behavior should be:
-reserve() resizes to the requested size+1 to make the buffer tight (that's half of the point of the method)
-other internal calls to reserve() or changeBuffer() should pass size+some extra value to handle the buffer growing

@earlephilhower
Copy link
Collaborator

earlephilhower commented Jan 3, 2018

@devyte About the String.resize(), isn't it just trying be allocating to the internal umm blocksize, and adding space for \0 in doing so?

IIRC, though, UMM is running on 8-byte chunks, not 16, so a better line would be:
size_t newSize = (maxStrLen + 1 /* space for terminating \0 */ + 7) & (~0x7);
see the umm readme.mdfile:

Each memory block holds 8 bytes, and there are up to 32767 blocks available, for about 256K of heap space. If that's not enough, you can always add more data bytes to the body of the memory block at the expense of free block size overhead.

There's no savings in allocating less than an 8-byte chunk since the remaining bytes are lost, anyway,

@d-a-v There are so many changes in 2.4 I can't be sure, but I think that they're going w/newlib's vs_sprintf code and not os_vsprintf. I'd rather not (poorly) roll my own sprintf() if we can take a nice piece of proven code and apply a minor tweak.

@devyte
Copy link
Collaborator Author

devyte commented Jan 3, 2018

@earlephilhower the null term is implicit in how the calculation works. I suppose that's ok, given that if:

  • I reserve 1-15 bytes, the allocation is done for 16 bytes
  • I reserve 16-31 bytes, the allocation is done for 32 bytes
    So the null is silently added. It's the implicitness that rubbed me wrong. But alright.

About the 16 bytes at a time. I don't see anything about a block size in umm_malloc_cfg.h. A quick glance at README does show that a block has 8 bytes.

So, I guess the test here would be to reduce 16 to 8 in changeBuffer(), like you said.

@earlephilhower
Copy link
Collaborator

I've just sent 4 small pull requests to @igrr 's newlib repo which take care of the memory issues I noted above. Most apps should see around 500 bytes add'l static heap, and if the printf() patch is merged they'll need no add'l dynamic memory for the *printf_P()s (actually, there's no need for *printf_P() anyway).

@devyte 's note of String is not involved, since that's an Arduino core thing.

@phoddie
Copy link
Contributor

phoddie commented Jan 30, 2018

Perhaps this is naive, but it looks possible to save 192 bytes by allowing dead stripping interrupt_handlers when attachInterrupt is unused by deferring ISR installation. Here's a rough attempt.

@steminabox
Copy link

I think this is the biggest issue in the 8266 library today.. I found this thread when a relatively small program of mine on the 8266 started acting strangely - yep, memory problems.. The problem? The ardunio library (2.4.1) is taking up too many resources! A compile of an 'empty' program is taking up
.data 1276
.text 29147
.rodata 1944
.bss 29168

so going from https://tech.scargill.net/esp8266-ram/
.data + .rodata+.bss = 32388
which leaves about 50K free memory - which is what I get when I check.
The problems is another limit - the .text section at 29147 is close to its 32k limit - I think is what killed my program (it started going flaky once .text was bigger than 32k...)

Am I an idiot or is this library now so big it doesn't leave scope for anyone to do anything with it that isn't trivial?

@steminabox
Copy link

.. and doesn't that .text somehow end up in ram too?

@devyte
Copy link
Collaborator Author

devyte commented Mar 31, 2018

@phoddie all ideas are open for discussion, especially if somebody already has preliminary results.
Can you make a pr with that rough attempt, so that the changes can easily be discussed?

@devyte
Copy link
Collaborator Author

devyte commented Mar 31, 2018

@steminabox please don't confuse heap RAM with IRAM.
Yes, the core libs have grown, because a lot of functionality has been added. Most recently, floating point support for printing to string. Some things have been movednto IRAM, some things to flash. However, the ESPs resources are very limited, so it's still very tight.
Although the ESP has 80KB heap total, a lot of that is taken up by Espressif's sdk for the wifi, and then by lwip for the tcp/ip stack. When I picked up this core, an empty sketch showed just over 40KB of free heap. Now we're at 50KB, and there are still ideas to try for squeezing out more. With one idea currently being discussed, I think this firmware will be the one that offers the most free heap with an empty application.
About IRAM, the ESP has 32KB available, and there is much code that currently can't be executed directly from flash. There are ideas to provide flexibility for where some code is put (flash/IRAM), and other ideas for relaxing some restrictions (ISRs). There is also the possibility of improving things with better optimizations by a newer gcc.
If this is an issue for you, and/or you find it interesting, I suggest you research, pay attention to the repo discussions, inspect the repo code and binaries that result from the build, and then chime in if you have an idea for improvement, or if there is a task that you think you could handle. New ideas and contribution efforts are always welcome.

@steminabox
Copy link

yep, I've been reading more about it - and IRAM is obviously the problem. Once .text hits 31204 the program is stuffed...
So free heap is almost irrelevant. With .text at with a completely empty program (and no optional libraries attached) the ardunio library has only left 2057 bytes for the user!! ie the library has used 93% of the IRAM! I can't get much bigger programs to run than I did on a UNO!

Why is the library doing that? ie how much of that IRAM really needs to be there? eg why on earth (using your above example) has floating point been put in as a default?

So instead of the end user having to go through a lot of work - and still not fitting into the remaining 7% of IRAM, it would be better to fix the library by taking out all the optional stuff by default, and allow users to shift it back in if needed... Otherwise this library is getting close to useless for anything needing more than a page or two of code... It would be much better to use 10k more heap (when needed) and 10K less IRAM!
So yes, I can look at all the code and sort it out, I just have to decide if it is worth the effort to try and work around this limitation with this library (ie change bits or flog code to make a minimalist version) or do I swap to the esp32 (The user iram there seems to be about 130K)...

But that's just me - the bigger philosophical question is why is the 8266 ardunio library growing in this direction, ie becoming bloatware?

@igrr
Copy link
Member

igrr commented Mar 31, 2018

Although off-topic in this issue, one more idea to have more IRAM is to only use 16KB for the cache, instead of the 32KB used now. That would free up 16KB for the application.

Regarding IRAM usage by the core itself. Since user applications rarely need IRAM (interrupt handlers are the only case), IRAM usage was never really part of the optimization effort (until very recently). Instead of trying to free up IRAM I think it might be worth making a few changes to allow interrupt handlers to go into Flash.

@steminabox
Copy link

I got rid of my two small user interrupt routines (while trying to find where the space was going), so they aren't the problem. So something else must be using it..
How would I change the cache (like you mention above)?

@steminabox
Copy link

OK, I found the thread over here - #4551 - which is more discuss the major IRAM problem. It seems it's even worse if you take the changes since 2.4.1 (ie only 440 bytes available according to one post).
This is pretty much a complete show stopper for the library, unless you have a trivial program that doesn't do much...

@devyte
Copy link
Collaborator Author

devyte commented Mar 31, 2018

@steminabox about floating point support, you misunderstood me: it was not put into IRAM, it was just added to the code base. I meant it as an example as to why the core is always increasing in size: features are being added.
If you search the code for ICACHE_RAM_ATTR, you'll find all the functions in the core that are specified to be in IRAM, and if you think about each case you'll understand why they have to be there. Those are not all, though, there are parts of the SDK that have to be in IRAM as well, among others.
But we're getting off topic. This thread is meant to cover increasing free heap, which is the single biggest obstacle right now for secure sockets. The IRAM issue is a different one, and there is a discussion already ongoing elsewhere.
So if you have an idea about improving free heap, bring it up here.

@steminabox
Copy link

yep, I'll continue about IRAm in the other thread, though having infinite heap and no ability to load your program isn't that useful :-)

So my last comment, actually on topic :-) - with the heap the best way to keep spare space is to never use memory unless you have to, defer it until the last moment, and keep for as short as time as possible.. There is normally a trade off with doing this and cpu time (and mem fragmentation)- but the 8266 seems to have lots of cpu for the amount of memory it has.. I don't know if fragmentaion would be a problem, then again, most 8266s could reboot every now and again..

@devyte
Copy link
Collaborator Author

devyte commented Jul 13, 2018

A lot has been done for this, but there is still more to do...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment