BOUNTY: Find a workaround/fix for update issues - $1000 #54

Closed
digistump opened this issue Mar 2, 2016 · 86 comments

@ghost
Collaborator

commented Mar 2, 2016

Skills Required: System Admin, Node, C++, Linux Sockets, ????
Difficulty: Unknown

Challenges/Thoughts:

The technical limitations: The firmware shipped on the Oak exists solely to let you configure your Wi-Fi and get the first update. Because the Oak has no USB interface, that firmware cannot be changed except through this update, so any changes to make this work have to happen on the server side.

The technical details: The Oaks performed very well at getting the update in our early testing, before we sent the firmware to the factory. This was tested in a variety of ways, but the main development setup had the update file hosted on an Apache 2.2 server on Amazon EC2, accessed from our machines over a 3 Mbps DSL connection. We also tested locally on our b/g/n Wi-Fi network. Our routers were run stock for testing, with b/g/n enabled and the channel set to auto.

Between the point where we approved the firmware to be burned to the units and when people started to report issues, the following changes occurred (or may have occurred): the hardware went from prototype to production, with likely ever so slightly different parts that should not have affected performance, and the update file was moved to an Apache 2.4 server on Ubuntu 14.04 on Digital Ocean (and then to various server setups - see GitHub for more on this). After this, updates were still working very well for us. Since we didn't think it was an area we needed to worry about, I can't say for sure whether one ever failed, but they never failed often enough to catch my attention as a possible issue before I started shipping the factory-produced units. The final point is why I believe connection speed may be involved: the update still works for me more often than not over DSL, but not always over the local network.

Issue specifics: Specifically, the Oak seems to either disconnect from the server prematurely or fail to get the next packet/chunk of data from the server. This is usually seen on the Oak as a Socket Timeout, or the Oak restarts because the watchdog timer kicks in after it sits in a loop doing nothing for a while. If the Oak makes it to the end of the update, it works. This may have to do with WiFi interference (we don't expect you to magically make it work even if other things are on the same channel as the Oak) - we are just trying to get it to work with a minimal set of rules for the user ("make sure your router is not on channel 1" is acceptable; "make sure your router is set to B only and 1mbps" is probably not). For some rather extreme router settings that seem to work, and that support our idea that this is speed related, please see this post by a fellow beta tester on the forums: http://digistump.com/board/index.php/topic,2046.0.html

Things we've tried: We've tried various server setups including apache and basic node.js ssl servers. We've tried using Node.JS to make a custom https server that slowly pushes chunks of the firmware to simulate a slower connection. This seemed 100% reliable at one point in our testing, and then it wasn't any more, no idea why - but it is worth noting that the linux sockets buffer this data anyway, so it was unlikely this was actually helping, but we really don't know for sure yet.

How to test:

  1. Generate a self signed certificate for your IP or domain you'll be using to test with, or use your existing SSL certificate if it is compatible. (a sha256RSA certificate is best for testing to avoid issues due to the certificate being incompatible)
  2. Get the fingerprint of that certificate and copy it.
  3. Restore your Oak to factory condition with github.com/digistump/OakRestore
  4. Open an Arduino Serial monitor on the port of your Oak/USB adapter connected to the Oak, select new line only as the line ending in the drop-down, set the baud rate to 115200, and then send the following (hitting enter after each line):
set
40 (where 40 is the length of the next line including the JSON markup)
{"first-update-domain":"yourdomainorip"}
set
90
{"first-update-fingerprint":"00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"}

where yourdomainorip is the domain or ip you are testing with, and the 00s are replaced by the SHA1 thumbprint of your SSL certificate. Your Oak will now expect this certificate, connect to this domain or IP, and expect the firmware file at /firmware/firmware_v1.bin (grab the latest firmware here to place on your server for testing: https://oakota.digistump.com/firmware/firmware_v1.bin)
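For steps 1-2, one way to produce the thumbprint string in the spaced, uppercase form shown above is a few lines of Python. This is only a sketch: it assumes the thumbprint the Oak expects is the SHA-1 digest of the DER-encoded certificate (the usual definition of a certificate thumbprint), and the function names are made up for illustration.

```python
import hashlib
import ssl


def thumbprint_from_der(der_bytes: bytes) -> str:
    """Return the SHA-1 digest of DER certificate bytes as 'AA BB CC ...'."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    # Split the 40 hex characters into 20 space-separated byte pairs.
    return " ".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def thumbprint_from_pem(pem_text: str) -> str:
    """Convert a PEM certificate to DER first, then fingerprint it."""
    return thumbprint_from_der(ssl.PEM_cert_to_DER_cert(pem_text))
```

Usage would look like `thumbprint_from_pem(open("server.crt").read())`, where `server.crt` is a hypothetical file name for the certificate from step 1.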

NOTE: You should have these strings ready to send, as there is a 30 second timeout between sending the set and sending the JSON string.

The Oak should respond with {"r":0} after each three lines (set, length, content) are sent - if you get {"r":-1} back then something is wrong with the input.
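The length line can be computed rather than counted by hand. A minimal sketch, assuming the Oak counts the characters of the JSON line exactly as sent (which matches the 40 and 90 in the examples above); `build_set_lines` is a name invented for illustration.

```python
import json


def build_set_lines(key: str, value: str) -> list:
    """Return the three lines to send: 'set', the payload length, the payload."""
    # Compact separators avoid the spaces json.dumps would otherwise insert,
    # so the payload matches the format shown in the instructions.
    payload = json.dumps({key: value}, separators=(",", ":"))
    return ["set", str(len(payload)), payload]


if __name__ == "__main__":
    for line in build_set_lines("first-update-domain", "yourdomainorip"):
        print(line)
```

Having the lines generated up front also makes it easier to paste them within the 30 second window.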

To confirm you have changed the two parameters send over serial these two lines:

info
0

You should get back a JSON response that includes those two settings.

  5. Now use the config app (github.com/digistump/OakSoftAP/) to set your wifi info. When it says it is restarting to get the update, disconnect it and follow the OakRestore instructions, but substitute this at the command line (grab the zip file in the comment below this one to get the bin files):
python esptool.py --baud 115200 --port YOUR_COM_PORT write_flash -fs 32m 0x1000 blank.bin 0x2000 oakupdate_debug.bin 0x0081000 oakupdate_debug.bin 0x101000 blank.bin 0x102000 blank.bin 0x202000 blank.bin 

This will cause your Oak to endlessly loop trying to download the update, displaying a log to serial of its progress, and then rebooting and doing it again.
6. Implement your solution in any way you can. (more below)
7. Test with your solution being served both on fast and slow broadband connections (local network, phone hotspot, high speed connection, etc). You can repeat 3-5 to set it to connect to a different server to test locally/remotely - these test bin files don't check for a certificate domain match, so you can reuse the same certificate if desired.
8. If you feel you have a good solution repeat step 5 but use oakupdate_debug_silent.bin instead - this does not show status counters during the update loop and therefore is more true to the speed of that loop on the factory Oak.
9. Submit your fix.

Acceptable solutions: The sky is pretty much the limit here, other than the firmware cannot be changed. The solution must run on a standard linux server (hey if you can get it to work with a windows server that'd be fine too), but can change the OS/server/etc in any way, this will run on a standalone cloud server. It does not have to be particularly performant (we can run many servers if we need to), though it must be able to serve the firmware to more than one device at a time. You can mess with linux sockets, you can write it in any language, you can use any existing software or packages - really we're open to anything. I have a fear it could be as easy as setting up Apache 2.2 again without any changes to the default linux socket setup (we've messed with that too many times now probably, without fully understanding - see comment at top of node scripts) - I don't think that's true as I'm sure I've tried that, but really even if it was that simple we would reward you the bounty.

Solution testing: Any submitted solution will be tested first by ourselves, and then by a selected group of users who have experienced issues updating and are able and willing to carefully test. If your solution works for most of these cases we will award the bounty to you. Partial bounties for partial improvements may be granted as well. Solutions that require the user to run something locally on their network may be considered, but preference will be given to server-only solutions (not sure if that would offer any advantage, but thought I'd throw it out there).

Bounty

$1000 cash or $2000 credit or 200 Oaks. Won by @jldeon.

Cash or credit is your choice. Cash to be paid via Paypal. Credit has no expiration but can only be applied to a single order and does not cover shipping (because that is how our shopping cart works, not because we want to be limiting). Oaks reward includes shipping. You can also pick a split between any of the options.

You may credit yourself in the files as well, leaving intact existing licenses and credits.

Legal Stuff: We will choose a winner at our sole discretion. The winner will be the first pull request/comment that submits fully working code meeting the above requirements and following good coding practices, based on the timestamp of the pull request. Bounty will be awarded (or in the case of Oaks, sent) within 48 hours of confirming winner. Cash awards will be made in USD. This is not an offer for hire. All work submitted becomes the property of Digistump LLC to be used at our discretion in compliance with any associated licenses. Void where prohibited by law.

@ghost

Collaborator Author

commented Mar 2, 2016

Relevant files as mentioned above:

Generic Node File Server with HTTPS:
server.zip

Attempt at releasing chunks of file more slowly:
server_with_delay.zip

Firmwares to load after setting IP/Domain and Fingerprint, and source code of those firmware for reference:
oakupdate.zip

@digistump digistump changed the title BOUNTY: Find a workaround/fix for update issues BOUNTY: Find a workaround/fix for update issues - $1000 Mar 2, 2016

@haakonnessjoen

commented Mar 2, 2016

I think maybe file_chunk_write should wait for the write callback to respond before starting the timer. That way you won't fill up a send buffer if the connection is slow, which I would guess could cause a large burst of data to be sent on a slow connection. And this is nitpicking, but getFileSizeInBytes is kind of redundant, since you just read the entire file, so file.length holds your answer to the file size.
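The same idea (only start the pause once the previous chunk has been handed off) can be sketched in Python with asyncio. This is an illustration, not the Node code from server_with_delay.zip: `send_paced`, the chunk size, and the delay are assumptions, and `drain()` only waits until the transport's write buffer falls below its high-water mark, so the kernel socket buffer can still hold data.

```python
import asyncio


async def send_paced(writer, data: bytes, chunk_size: int = 1024,
                     delay: float = 0.05) -> int:
    """Write data in chunks, pausing only after each chunk has been drained."""
    sent = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        writer.write(chunk)
        # Backpressure: do not queue the next chunk (or start the delay)
        # until the writer reports that its buffer has room again.
        await writer.drain()
        sent += len(chunk)
        await asyncio.sleep(delay)
    return sent
```

With a real server, `writer` would be the `asyncio.StreamWriter` passed to the connection callback of `asyncio.start_server`.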

@epatel

commented Mar 2, 2016

Sounds like the TCP stack in the firmware is broken, i.e. it can't request missing packets. I would hook up Wireshark and look carefully at what is going on. If that can be verified (no re-sends are requested), maybe a solution that creates a local network with a local server (e.g. in VirtualBox or on a mobile device) could minimize packet drops.

One way to capture with wireshark could be to share "internet" by Wifi on a Mac and run wireshark on the same machine, this to make sure to get as much Wifi traffic as possible.

@jldeon

commented Mar 2, 2016

Is there some way to get the source and replicate the build environment for the factory image? I realize that the final solution must work with the image already flashed, but for debug it would be helpful to see the code on the Oak and make changes.

@ghost

Collaborator Author

commented Mar 2, 2016

The source is the same as attached in the second comment zip file
(oakupdate.zip) - just comment out #DEBUG_SETUP to get exactly what is on a
factory Oak

You can setup the environment to compile it by downloading the following
and unzipping into Documents/Arduino/Hardware/

https://www.dropbox.com/s/dgb4qf1cooz3oba/oak_fallback.zip?dl=0 (Note still
uploading right now, if you go to this link it will tell you when it is
done uploading)

Then select Oak Fallback as the board, Single as the rom config, and hit
upload (assuming you've already followed the steps above to set your wifi
connection and domain/ssl thumbprint)

@jldeon

commented Mar 2, 2016

OK, working on getting my environment set up.

Should I expect any response to the "set" commands from the factory firmware?

@ghost

Collaborator Author

commented Mar 2, 2016

Yes - just added, thanks for asking: The Oak should respond with {"r":0} after each three lines (set, length, content) are sent - if you get {"r":-1} back then something is wrong with the input.

Also baud should be 115200

@ghost

Collaborator Author

commented Mar 2, 2016

@jldeon - just added some more notes on how to confirm these changes and about timeouts for sending it all, see just under the "set" codeblocks in the top post

@jldeon

commented Mar 3, 2016

I've got everything set up and running now, and I'm deep in debug land. I think I've got a solid lead on part of what's going wrong, but not everything yet. Will try to keep you posted, assuming someone else doesn't figure it out first :)

@DarkLotus

Contributor

commented Mar 3, 2016

I just started getting things set up to have a play with this, but I'm in the horrible (joke) situation where my Oaks update fine a good 80%+ of the time: 9/10 in a row this morning. (This is against the Digistump server.)

I've even tried enabling WPA2 and forcing wifi channel 1, as was advised against. I've also tried hammering my connection's up/down stream while updating. My only other AP is an Alcatel Pixi 4.5 in AP mode, but the Oak just refuses to connect to that, full stop.

I'm happy to provide remote testing of anyone's WiP solution from a location that is seemingly blessed.

@jldeon

commented Mar 3, 2016

I've uncovered one major problem, and I believe the solution is fairly simple. It seems like having one failure (ie, one bad connection to the update server) causes cascading failures for the Oak.

My fix should alleviate this:

HOST LOOKUP OK
NO CLIENT
COULD NOT CONNECT TO UPDATE SERVER
UPDATE FAILED

endless loop.

It seems like the network stack is doing something silly, which is causing the server to become confused. Basically, TCP connections are established based on the client's IP and port, and the server's IP and port. When most web clients connect to a server, they pick a random port as their client port. The library on the Oak picks the same one every time (4097).

If the transmission of the firmware file goes south, the socket is left open on the server. Most web servers have long timeouts on these sockets, because they expect you to make multiple requests (ie, HTML file, and then a dozen images or JS files or whatever).

Now you reboot your Oak and try to connect again. The problem is, the Oak is sending a SYN using the same source IP and port (more than likely, if you're on a home router using NAT). The server looks at that packet, goes "I have a connection already" and drops it.

Meanwhile, the server is still waiting for acknowledgement on the last chunk of the firmware file it sent.
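To exercise a server against this behaviour without an Oak on hand, a test client can pin its source port the same way. A rough sketch only: the function name is hypothetical, 4097 is the port observed above, and the address in the usage note is a placeholder.

```python
import socket


def connect_from_fixed_port(server_addr, source_port, timeout=5.0):
    """Open a TCP connection whose client side always uses source_port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow the same local port to be bound again on the next run.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.settimeout(timeout)
    # Binding before connect() is what fixes the source port; normally the
    # OS would pick a different ephemeral port for every connection.
    sock.bind(("", source_port))
    sock.connect(server_addr)
    return sock
```

For example, `connect_from_fixed_port(("192.0.2.10", 443), 4097)`, then kill the client mid-transfer and connect again to see whether the server still holds the old connection for that IP/port pair.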

We can't fix the silliness on the Oak side, but we can more aggressively close the sockets on the server side. For Apache, try setting:

TimeOut 3
KeepAlive Off

In the VirtualHost section (or similar). This doesn't 100% fix the problem, but it does make it a lot less likely. KeepAlive Off is pretty sensible for the server, since we're not making bulk requests to it. The TimeOut parameter gives the client 3 seconds to acknowledge a packet if the send buffer is full (so by that point, you're already behind in terms of transmission).

On Node.JS, the timeout parameter of the HTTP server object appears to serve a similar purpose.

Ideally, we'd set something like TCP_USER_TIMEOUT on the socket, but that would require either recompiling an existing server, or rolling our own. I'm not sure that the security and performance impact is worth the trouble, though.
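If someone does roll their own server, the option can be set per connection from Python on Linux. A sketch under assumptions: it relies on the Linux-only `TCP_USER_TIMEOUT` socket option (value in milliseconds), falls back to the Linux option number 18 when the Python build does not export the constant, and the 3000 ms in the usage note simply mirrors the `TimeOut 3` suggestion above rather than a tested value.

```python
import socket

# socket.TCP_USER_TIMEOUT is exported on Linux builds of Python 3.6+;
# 18 is the option number from the Linux headers, used here as a fallback.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def limit_unacked_time(sock, millis):
    """Make the kernel drop the connection if sent data stays unacknowledged
    for longer than millis. Returns the value the kernel reports back."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, millis)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
```

In a custom server this would be called on each accepted connection, e.g. `limit_unacked_time(conn, 3000)` right after `accept()`.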

Another option is setting /proc/sys/net/ipv4/tcp_retries2 to some lower value. This would affect everything OS-wide, though, so it's kind of the nuclear option. I believe the default at this layer is the value 15, which corresponds to more than 10 minutes.

Before these changes, if I interrupt the download from my local server (pull power while the 123123123 is scrolling), I'll get "NO CLIENT" errors over and over and over again. With these changes, I can pretty reliably flash my Oak on my local network.

@jldeon

commented Mar 3, 2016

Another way we can get stuck in this "NO CLIENT" state is if the web server tries to close the socket, but can't get the Oak to reply to its attempts to do so. /proc/sys/net/ipv4/tcp_orphan_retries controls the number of times that a socket in the FIN-WAIT-1 state (ie, attempting to close) will try to tell the other side that it's closing the socket before just giving up. The default is 8, and I'd suggest that a much lower number could be used here.

I'm testing with 2 on my machine, and getting much better results.
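Before changing either knob it helps to record the current values so they can be restored later. A small sketch that only reads the files; the function name is illustrative, and parameters that do not exist on a given system are skipped.

```python
from pathlib import Path

PARAMS = ("tcp_retries2", "tcp_orphan_retries")


def read_tcp_params(base="/proc/sys/net/ipv4", names=PARAMS):
    """Return {name: current integer value} for each parameter file found."""
    values = {}
    for name in names:
        path = Path(base) / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_tcp_params().items():
        print(name, "=", value)
```

Writing a new value works the same way in reverse (as root), but that part is left out here on purpose.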

@ghost

Collaborator Author

commented Mar 3, 2016

@jldeon - some great discoveries, thanks! - just wanted to note that we have been setting /proc/sys/net/ipv4/tcp_retries2 to 4 which seems to be a good balance between closing when it shouldn't and allowing frequent retries - this is noted at the top of the server.js file, but this was a shot in the dark - you certainly figured out why this is necessary. In general I don't mind if any of the settings we need to change are OS level - this update server will run isolated on its own cloud/virtual server

@jldeon

commented Mar 3, 2016

@digistump Ah, should have looked at the node.js code :) I figured you guys weren't going to do anything else on this box, which is why I suggested some of those OS-level changes. I'm not sure how the kernel does math with that tcp_retries2 value, so I don't know what 4 means in the context of how long the socket will persist. I'd suggest trying some of the other settings, as those helped immensely.
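For a rough feel of what a tcp_retries2 value means in seconds, the Linux ip-sysctl documentation describes the default of 15 as a hypothetical timeout of 924.6 seconds. That figure can be reproduced with a simple model: the retransmission timeout starts at the kernel minimum of 200 ms, doubles after each retry, and is capped at 120 s. The sketch below assumes exactly that model; real connections start from an RTO based on the measured round-trip time, so actual times will be longer.

```python
def hypothetical_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Seconds until the connection is given up on, assuming the RTO starts
    at rto_min, doubles after every retransmission and is capped at rto_max."""
    total = 0.0
    rto = rto_min
    # One wait for the original transmission plus one per retry.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    for value in (15, 8, 4):
        print(value, round(hypothetical_timeout(value), 1))
```

Under this model a value of 4 works out to about 6.2 seconds, against about 924.6 seconds for the default of 15.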

If you want to see if a lot of people are stuck in the FIN-WAIT-1 state and would benefit from the tcp_orphan_retries change, try running the ss command on the server. The State column should show FIN-WAIT-1 if there are a lot of pending socket closures.
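The same check can be scripted. A sketch, assuming `ss -tan` prints a header row followed by one connection per line with the state in the first column; the function names are made up here.

```python
import subprocess
from collections import Counter


def count_tcp_states(ss_output):
    """Count connections per TCP state from the text output of `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def current_tcp_states():
    """Run ss on this machine and summarise the states it reports."""
    result = subprocess.run(["ss", "-tan"], capture_output=True,
                            text=True, check=True)
    return count_tcp_states(result.stdout)
```

A large `FIN-WAIT-1` count from `current_tcp_states()` on the update server would point at the orphaned-socket problem described above.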

For the record, I'm doing no throttling whatsoever on bandwidth and not having any issues flashing the Oak over and over again. The update server and the oak are on the same high-speed LAN.

@ghost

Collaborator Author

commented Mar 3, 2016

@jldeon before you implemented those changes were you getting SOCKET READ
TIMEOUT? that is the main issue that seems to crop up for many.

What server are you using to serve it? Apache (version, etc?) or Node or
something else?

@DarkLotus

Contributor

commented Mar 3, 2016

@jldeon Do you have any issues receiving updates from the official server though? My local server works fine as well, but I have no issues with the official one so I can't really debug.

@jldeon

commented Mar 3, 2016

@digistump I get the occasional SOCKET READ TIMEOUT (10% of the time or so?). Looking at the tcpdump when it occurs, the server is retransmitting the packet but it's not getting to the Oak (at least from the tcpdump on the server, I see no ACK). I don't know that there's much that can be done about this, though, since we're dealing with wi-fi.

I'm going to keep digging on that, though, now that I've done what I can with this issue.

@jldeon

commented Mar 3, 2016

@DarkLotus Yes, I tried last night and this morning to update with the official server on 2 Oaks, and it failed on every attempt.

@DarkLotus

Contributor

commented Mar 3, 2016

Ah at least you can reproduce :) if you need a vps or anything to test from a remote server let me know.

@jldeon

commented Mar 3, 2016

@DarkLotus Thanks for the offer! I think I've probably got it covered. I've got a couple of VPSes, credit to AWS, credit to Azure... probably some other junk if I dug around a bit :)

@jldeon

commented Mar 3, 2016

@digistump Totally flaked on your questions. The box I'm using currently is Ubuntu 14.04.4 LTS, 32-bit, testing with Apache 2.4.7 (latest available in the Ubuntu repos) currently.

@ghost

Collaborator Author

commented Mar 3, 2016

@jldeon - when it failed on every attempt last night against the live
server, was it due to NO CLIENT or SOCKET READ TIMEOUT?

@jldeon

commented Mar 3, 2016

@digistump I was running the factory build, so there was no debug output. I posted what I had to the forums: http://digistump.com/board/index.php/topic,2034.msg9360.html#msg9360

@jldeon

commented Mar 3, 2016

I can reliably reproduce the "SOCKET READ TIMEOUT" error by pointing my Oak at my VPS, and I've been digging into it for the last hour or so.

It doesn't look like an actual packet timeout, it looks more like some sort of weird conflict or maybe a race condition? I see a lot of retransmitted packets on both sides.

I've got to sleep now, but I'll try and craft an experiment to test this tomorrow if I've got time.

@DarkLotus

Contributor

commented Mar 3, 2016

If anyone is around, I've spun up a CentOS 6 server with Apache 2.2, just in case reverting back to Apache 2.2 is the fix.

set
43
{"first-update-domain":"oak.jameskidd.net"}
set
90
{"first-update-fingerprint":"07 99 62 B4 0C 4D 4D 59 22 1F 62 A2 83 04 0B E5 00 94 BE 18"}

At least on my end I can confirm this works 100% of the time, 5/5 thus far.

I will set up an Ubuntu 14.04 / Apache 2.4 server next on the same host, and see if I can get some timeouts reliably happening.

@DarkLotus

Contributor

commented Mar 3, 2016

set
44
{"first-update-domain":"oak2.jameskidd.net"}
set
90
{"first-update-fingerprint":"07 99 62 B4 0C 4D 4D 59 22 1F 62 A2 83 04 0B E5 00 94 BE 18"}

This one is Ubuntu 14.04 with Apache 2.4, all stock as well. Still, I can't reproduce socket timeouts reliably; I'm starting to wonder if it's router chipset related or something.

@DeuxVis

commented Mar 3, 2016

When I see that there are more problems on high-speed networks, it kind of rings a bell for me: MTU!
Packet fragmentation might not be handled very well by simplified or old IP stacks. I remember it being a problem in the early days of DSL internet access.

I don't have time to test it myself currently, but lowering the server network interface MTU might help there?

On a different subject, I would be happy to test whether any server you guys put out there is an improvement: I am "lucky" enough to be unable to upgrade any of my 3 Oaks here at home (cable internet), and I have a 3.3V-capable USB/TTL adapter available.
Just don't expect lightning-speed replies.

@DarkLotus

Contributor

commented Mar 3, 2016

@DeuxVis oak2.jameskidd.net is running with MTU set to 576 if you want to give it a shot.
and oak.jameskidd.net is running apache2.2 on CentOS 6 to try and match Erik's original server

@DarkLotus

Contributor

commented Mar 3, 2016

I have 100% failure rate with the MTU lowered to 576, Socket read timeouts. So Packet size could definitely be a factor in this. Bed time for me, will look at it again tomorrow.
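One rough way to see what lowering the MTU changes is to count how many TCP segments the firmware needs. This is back-of-the-envelope only: it assumes IPv4 and TCP headers without options (40 bytes in total), ignores TLS record overhead, and uses the 778096-byte FILE LENGTH that shows up in a debug log elsewhere in this thread.

```python
import math


def segments_needed(payload_bytes, mtu, header_bytes=40):
    """Minimum number of TCP segments for payload_bytes at a given MTU."""
    mss = mtu - header_bytes  # maximum segment size per packet
    return math.ceil(payload_bytes / mss)


if __name__ == "__main__":
    for mtu in (1500, 576):
        print(mtu, segments_needed(778096, mtu))
```

At an MTU of 576 the transfer takes roughly 2.7 times as many packets as at 1500 (1452 versus 533 under these assumptions), so every packet the Oak has to receive and acknowledge carries less of the file.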

@fri-sch

commented Mar 5, 2016

@jldeon You're right, the COULD NOT CONNECT TO UPDATE SERVER errors follow after SOCKET READ TIMEOUT, so I won't count them in the following summary of my overnight test loop:

update runs: 574
update ok: 426
update failed because of SOCKET READ TIMEOUTS: 148
success rate: ~74%

@fri-sch

commented Mar 5, 2016

I don't know if this means anything, but during my test run I had one single failure that threw an exception:

OakBoot v1 - H,BU,0
START UPDATE ROM
WIFI
WIFI CONNECT
GO TO UPDATE
START UPDATE
52.37.37.115
HOST LOOKUP OK
PARSING HTTP HEADER
HTTP/1.1 200 OK
FILE LENGTH: 778096
START WRITING UPDATE
Exception (28):
epc1=0x40103187 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000024 depc=0x00000000
ctx: cont 
sp: 3fff1300 end: 3fff16a0 offset: 01a0
>>>stack>>>
3fff14a0:  3ffeb338 7fffffff 3ffeb338 00000001  
3fff14b0:  40104cc5 ffffffe7 36363636 36363636  
3fff14c0:  00040000 00717fee 50522200 3ffeb344  
3fff14d0:  4000050c 3fffc278 40104b48 3fffc200  
3fff14e0:  00000022 5d18fbcd 9f060eaf 3fff2da0  
3fff14f0:  40203933 00000030 00000013 ffffffff  
3fff1500:  40203930 0000002f 00000000 00000001  
3fff1510:  fbf8ffff 04000002 3feffe00 00000100  
3fff1520:  0000001a 00000018 04000102 3fff57a7  
3fff1530:  00000203 00000002 00001000 00000030  
3fff1540:  3fff629d 3fff15b0 3fff2da0 4021a5c8  
3fff1550:  fc4897ef 1b4a30a2 5fcbe39c aa7006a7  
3fff1560:  c99ffe47 00000000 3fff0680 00000203  
3fff1570:  00000000 4000444e 40204e00 3fff0680  
3fff1580:  00000000 400041bc 60000200 3fff57a8  
3fff1590:  00000100 40004b14 00001000 00203000  
3fff15a0:  00000100 3fff47a8 3fffc718 00204000  
3fff15b0:  00001000 0000000f 401015f8 00001000  
3fff15c0:  3fff47a8 00000002 00001000 0000000f  
3fff15d0:  3fffc718 3fff47a8 00000000 401015fd  
3fff15e0:  00203000 40203930 3fff57a7 00000203  
3fff15f0:  3ffe9340 00000000 00002710 00001a51  
3fff1600:  00000000 3fff29d0 3fff29a8 3fff2d20  
3fff1610:  0000000f 00000000 3fff2e98 0000002f  
3fff1620:  00000001 00000001 40203c10 3fff040c  
3fff1630:  000bbf70 00000000 00000001 00000000  
3fff1640:  3fff47a8 00202000 00000202 ffffffff  
3fff1650:  3ffe9284 3fff0678 3fff16f0 3fff0678  
3fff1660:  00003d30 3fff040c 3fff16f0 402039f0  
3fff1670:  3fffdc20 3fff16f0 3fff0564 40203bb6  
3fff1680:  3fffdc20 00000000 3fff0670 40204e3e  
3fff1690:  00000000 00000000 3fff0680 40100114  
<<<stack<<<
 ets Jan  8 2013,rst cause:4, boot mode:(3,0)
wdt reset
load 0x40100000, len 3632, room 16 
tail 0
chksum 0xc0
load 0x3ffe8000, len 352, room 8 
tail 8
chksum 0x82
csum 0x82
@jldeon

commented Mar 5, 2016

EDIT: There is now an official build/release of this tool from Digistump, visit this page instead of using the instructions in this comment.


Okay. This is "it." Still very beta, testers needed. I've built a standalone and/or server hosted solution that you can play with.

Code is in this repo. It's a fork of OakSoftAP so that I can get the config.html file.

Setup

Windows:

The "easy way" is to use this build: oakupsrv-win-983bbf7.zip. It should be self-contained.

If you run into issues with the pyinstaller version, you can run it from source. Grab the source from the repo above, and install Python 2.7 from python.org. Install pyopenssl, twisted, and service_identity from PIP (ie, python -m pip install pyopenssl twisted service_identity)

Ubuntu (14.04 LTS tested):

sudo apt-get install python-twisted-web

Other platforms probably work, assuming you can install Python 2.7 and get the requisite packages from PIP or your package manager. These are just the two platforms I've tested on.

Running the Update Server

The update server does two things:

  1. It serves a custom version of config.html, which allows you to select a different server for your Oak to get firmware from
  2. It is capable of serving the firmware itself, in an Oak-optimized manner.

If you plan on serving the firmware yourself, I strongly suggest using a Linux machine for your update server. Windows works, but I can't tune the TCP parameters enough to get it fully reliable, and you're highly likely to get NO CLIENT errors.

For Windows users using the prebuilt exe, open a command prompt, navigate to the app directory, and run oakupsrv.exe. Windows users running from Python source, use python oakupsrv.py.

*nix users, do sudo python oakupsrv.py. (We have to run as root on *nix, because we're binding to port 443.)

On Linux, you can also optionally run sudo ./tcp_params.sh, which will tune the Linux kernel for optimal Oak update performance. The old values will be printed by this script, so save them somewhere to restore later.

Give the program a minute to run, you should get output like:

New key generated for: 192.168.x.y
Key thumbprint: D6 45 63 F8 1C 7D E5 C1 A0 49 90 BD F6 1C 32 63 9C 78 9A D4
Fetching https://oakota.digistump.com/firmware/firmware_v1.bin
2016-03-05 16:09:36 Startup complete, running main loop...

Configuring the Oak

On another machine that has WiFi (not the computer from the previous step - anything with a browser is fine - phone, tablet, laptop, etc), navigate to:

http://update.server.ip:8080/config.html

Where update.server.ip is the IP of the update server from the previous step.

Follow the configuration setup as per normal, until you arrive at the WiFi network setup step. Below the list of WiFi networks, you'll see options to enter an update server IP and certificate thumbprint.

At this point, you can pick between the Digistump server, the AWS testing instance I have, or using the custom update server you set up in the previous step. If you select the custom server, the values will be populated based on the IP and thumbprint of the server, (hopefully) automatically. If this fails, you can fill them in based on the messages that printed on the console when you started running the server.

Don't forget to click "Save" after entering the proper values!

Then you can click the "next" button on the wifi config as per normal.

If everything goes according to plan, you should see your Oak reboot and attempt to update. If you're pointed at your own custom update server, the server console should show lines like:

2016-03-05 15:23:15 New connection from: 192.168.a.b
2016-03-05 15:23:26 Starting firmware transfer to: 192.168.a.b
2016-03-05 15:24:43 Finishing firmware transfer to: 192.168.a.b (1 transfers done)
2016-03-05 15:24:44 Firmware request finished for 192.168.a.b (Reason: None)
2016-03-05 15:25:03 Connection lost to: 192.168.a.b

Other Notes

Windows update servers are more likely to cause issues. I can hardly tune the retransmission parameters at all. Plus, Windows' timeout on sockets is something like 4 minutes, and the only way to change that is via the registry. Expect a lot of NO CLIENT errors.

If you want to generate your own standalone binary, make sure pyinstaller is installed (e.g., on Windows, python -m pip install pyinstaller) and then run pyinstaller --hiddenimport _cffi_backend oakupsrv.py. Don't forget to add config.html to the package before deploying.

If something goes wrong with the server's configuration, you can wipe all the created files by deleting the data directory.

  • This program auto-downloads the latest firmware_v1.bin from the real update server on first boot. It is stored in data/firmware/
  • This program also generates a self-signed SSL certificate on the fly to match the IP of the machine on which it is run. This is stored in data/cert. If you've got multiple IPs on your box, it's probably going to pick the first one it comes across. Let me know if this runs into problems on your machine. I've not yet messed around with command line arguments, but I could add one to force a particular interface.
  • The certificate fingerprint is saved in data/static so that it is accessible to the config.html setup program.
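On the multiple-IP question: one common alternative to taking the first address found is to ask the OS which local address it would route from. A minimal sketch of that trick, which is not what oakupsrv.py currently does; `guess_outbound_ip` and the probe address (a reserved documentation address) are assumptions for illustration.

```python
import ipaddress
import socket


def guess_outbound_ip(probe_host="192.0.2.1", probe_port=80):
    """Return the local IPv4 address the OS would use to reach probe_host.

    Connecting a UDP socket sends no packets; it only asks the kernel to
    pick a route, so the probe address never has to answer.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (e.g. offline machine): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(ipaddress.ip_address(guess_outbound_ip()))
```

Passing your router's LAN address as `probe_host` should select the interface the Oak will actually talk to.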
@epatel

commented Mar 5, 2016

@jldeon I am very impressed 👍 I think it looks really good considering the short time. I hope it also helps to improve the update statistics. May I ask where you work? Do you need a job?

@jldeon

commented Mar 6, 2016

@epatel Thanks! For me, using a local Linux server is about as good as my AWS instance. I have good (200mbit) internet, though. Hosting your own update server will probably help more for folks with less reliable internet.

I am, in fact, gainfully employed :) I'm a senior firmware engineer in the CTO group of a videoconferencing company, I do a lot of prototyping and research-y stuff. What kind of jobs have you got?

I was sick for a couple of days this week and wanted to work on my Oak project, so I had some time to hack away at this. Once I got started, the problem grabbed me and I've spent most of my free time on it...

@epatel99

commented Mar 6, 2016

@jldeon Ah figures, you being senior and having a good gig already. I am pretty senior too, Lead Dev/Architect here at Mag+. Wish I had had time to get dirty with this challenge. I like challenges, especially when one needs to think outside the box. But I very much enjoyed the show, and the collaboration everyone pitched in with. http://www.fastcompany.com/3031498/hit-the-ground-running/problem-solving-lessons-from-nasa

@jldeon

commented Mar 6, 2016

@digistump My solution(s) work around a few bugs in the update code, which makes the initial update a bit more reliable. However, going forward, it would be best if those bugs in the update code could be fixed. That way, at least subsequent updates would be far more likely to succeed. Is that feasible? Is that part of the .bin that is downloaded from the update server?

I'm still not 100% clear on all the software architecture in place, so I'm not certain how exactly to go about building and testing this sort of change. If it's possible, any documentation is appreciated.

@AtomicCat

commented Mar 7, 2016

I'm trying to debug the "Unable to connect or save settings to your Oak" problem where you can't even get the Oak to connect to your WiFi network. Using Charles and Postman, I've narrowed it down to how the configure-ap JSON parser is handling my SSID name.

I've grabbed the source for OakSystem, oak_fallback (from above), and oak_update (from above). I'm using Arduino IDE 1.6.5 and installed the 2.0.0-rc1 folder from oak_fallback into ~/Library/Arduino15/Hardware (Mac OS).

My Arduino settings are: Oak by Digistump (Pin 1 Safe Mode - Default), Serial (Expert Use Only), 80 MHz, and the port is set to my USB -> Serial adaptor.

I can now get OakSystem.ino to compile, but I'm unclear on how to get it installed onto my Oak. I've tried both uploading from the IDE and exporting the compiled binary and using esptool, but in both cases, when the upload is complete and I re-boot the Oak the LED flashes 3 times, pauses, and repeats.

The esptool command I tried was:
esptool.py --baud 115200 --port /dev/tty.usbserial-A50482AV write_flash -fs 32m 0x1000 blank.bin 0x2000 OakSystem.cpp.oak1.bin 0x0081000 OakSystem.cpp.oak1.bin 0x101000 blank.bin 0x102000 blank.bin 0x202000 blank.bin (total crapshoot, cobbled together from comments above)

What is the correct way to get a new OakSystem installed for debugging?

@jldeon

commented Mar 7, 2016

@AtomicCat I might be able to help somewhat, but let's take this over to the Digistump Oak forum so that other folks can find the info - it's only tangentially related to the problem this issue is trying to address. If you make a post over there, I'll give you what I know :)

@jldeon

commented Mar 7, 2016

@fri-sch Doing some digging into the occasional stack traces I get like that.

Exception (28):
epc1=0x40103187 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000024 depc=0x00000000

I used objdump on my shiny new oakupdate binary to get the full disassembly:

path\to\xtensa-lx106-elf-gcc\bin\xtensa-lx106-elf-objdump.exe -d oakupdate.cpp.elf

(I had to pull oakupdate.cpp.elf from the Arduino IDE build directory under %TEMP%)

Exception 28 is LoadProhibited, i.e. an attempt to read from an invalid address. The code that was executing is in lmacProcessAckTimeout, which is only referenced in libpp.a, a binary part of the ESP8266 SDK.

I assume this means that it's in Espressif's network driver, and not something that's going to be easy for us to fix.
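For anyone collecting these dumps, pulling the cause and registers out of the text makes them easier to compare. A minimal sketch; `parse_exception` is a hypothetical helper, and the cause table only lists a handful of Xtensa exception causes rather than the full set.

```python
import re

# A few Xtensa exception causes commonly seen on the ESP8266 (not exhaustive).
EXCCAUSE_NAMES = {
    0: "IllegalInstruction",
    3: "LoadStoreError",
    9: "LoadStoreAlignment",
    28: "LoadProhibited",
    29: "StoreProhibited",
}


def parse_exception(text):
    """Extract the cause number and register values from a crash dump."""
    cause = int(re.search(r"Exception \((\d+)\)", text).group(1))
    regs = {name: int(value, 16)
            for name, value in re.findall(r"(\w+)=0x([0-9a-fA-F]+)", text)}
    return cause, regs


dump = """Exception (28):
epc1=0x40103187 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000024 depc=0x00000000"""

cause, regs = parse_exception(dump)
print(EXCCAUSE_NAMES.get(cause, "unknown"), hex(regs["epc1"]), hex(regs["excvaddr"]))
# prints: LoadProhibited 0x40103187 0x24
```

epc1 is the address to look up in the objdump disassembly. The small excvaddr (0x24) would be consistent with reading a field through a null pointer, though that is only a guess without the SDK source.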

@AtomicCat

commented Mar 7, 2016

@jldeon I've started a thread for building OakSystem at How to build OakSystem.

@ghost

Collaborator Author

commented Mar 7, 2016

@jldeon - Sorry for my absence in this thread this weekend, it's been a busy one around here.

The bounty doesn't require fixing the issues in the firmware itself because that firmware is a one time use updater - after this first update all future updates occur through the Particle cloud using totally different firmware that seems to be pretty darn reliable for everyone. In addition, the next run of Oaks at the factory will be pre-loaded with whatever the latest Particle firmware is, so this flawed firmware will never be used anywhere again and its whole lifespan is the first use of each Oak.

Also, I imagine this has been assumed but - the Bounty is awarded to jldeon, assuming he sticks around here to fix any issues/improve things further if possible. Awesome work, many thanks!

I'll dive more into all of this tomorrow.

@jldeon - is your AWS server running the same source as the package is using?

@pfeerick

Contributor

commented Mar 7, 2016

@jldeon Nice work!! Thank you! Thank you! All hail jldeon... the Oak Update God!! xD I can say on my end, with Windows 10 Pro x64 as the server, and my wifi configured normally (WPA2 TKIP+PSK, N), that updating via the custom local server is the first time I have had a new Oak update first time out of the box. I did try your AWS server by OakRestoring the updated Oak, and picking your server via the config file dropdown (and I remembered to press save!!), but it failed with the usual socket timeout error. So for me the local server is a perfect fix - that's my 2 cents (and tries for Oak 5!!) anyway ;)

@jldeon

commented Mar 7, 2016

@digistump No worries! :)

I did dig into the firmware a bit to get as close as I could to the source of all the issues. I'll summarize in another post, so that someone can perhaps learn from this, or so that what I learned is on record if the issues come up again.

Woo! Bounty! Awesome. Yes, I'll stick around and help, clearly. Code's all up in the repo I linked above, feel free to report issues. Looking forward to actually starting work on my Oak-based project at some point :P

My AWS server is running an older version of my code. Most of what I changed between that version and the published code is stuff like the auto-generation of SSL keys, the firmware auto-download, Windows changes, etc. Usability changes, basically. The core of it is unchanged between AWS and the repo.

Let me know if there's anything logistics-wise that would be useful. I don't know if my UI changes to the setup app are what you want, so I didn't submit a pull request, but I can. I can keep the AWS server running or you guys can set up your own based on my code, either way.

Maybe next time you guys are prepping to ship something neat, you could get in touch, maybe hook me up with one to help test early? ;)

@pfeerick woo! Glad it worked for you, and happy to help :) The local server solution is best if you've got kind of spotty internet compared to your wireless LAN. I still do get socket timeout errors occasionally from both my local server and the AWS instance; it hovers around 10% for both.

@jldeon

commented Mar 7, 2016

Error Analysis

NO CLIENT error

As far as I can tell, this is due to client port reuse in the WiFiClient in the ESP8266 libraries. It's probably fine if you're making multiple connections before you reset, but our loop of connect->reset->connect causes the same port to be used every time.

The easy fix from the Arduino/Processing code level is:

  // Generate a random local port number, to avoid NO CLIENT errors.
  // register used is apparently a RNG, see:
  // http://esp8266-re.foogod.com/wiki/Random_Number_Generator
  uint16_t local_port = ESP8266_DREG(0x20E44);
  // Avoid using low port numbers, 4k is probably overkill, but it doesn't matter.
  if(local_port < 4096) local_port += 4096;
  ota_client.setLocalPortStart((uint16_t)local_port);

This is probably just a good "best practice" for whenever you're making a request using the ESP8266's WiFiClient library. Saves headaches.
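The same port-randomizing idea can be sketched in Python for anyone poking at an update server from a desktop test client. This is only an analogue of the firmware snippet above, not code from the Oak or from oakupsrv.py; `pick_local_port` and `connect_from_random_port` are hypothetical names.

```python
import secrets
import socket


def pick_local_port(raw16):
    """Map a 16-bit random value to a port, avoiding ports below 4096."""
    port = raw16 & 0xFFFF
    if port < 4096:
        port += 4096
    return port


def connect_from_random_port(host, port, attempts=5, timeout=10.0):
    """Open a TCP connection from a freshly randomized source port."""
    last_error = None
    for _ in range(attempts):
        local_port = pick_local_port(secrets.randbits(16))
        try:
            return socket.create_connection(
                (host, port), timeout=timeout,
                source_address=("", local_port))
        except OSError as exc:
            # Port already in use or connection failed; try another port.
            last_error = exc
    raise last_error or OSError("no connection attempts were made")
```

Because each connection starts from a different source port, a server or NAT router still holding state for the previous connection has nothing to confuse it with.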

That (supposed) RNG register is a pretty useful find. Might want to keep it in your back pocket, as the standard Arduino lib RNG is pseudo-random and always generates the same sequence after a reset. The other option is reading an unconnected analog pin, which generates random noise.

I worked around this one by being more aggressive about closing sockets.

SOCKET TIMEOUT

This appears to be some sort of issue in the TCP stack. When packet loss occurs, the ESP8266 doesn't recover well. It seems to cause a lot of extra retransmitted ACK packets, which confuse most standards-compliant TCP servers. They exhaust their retransmissions and sort of give up on trying to figure out what's up.

I can't put my finger on precisely where the bug is, but given the other bugs (the stack dump that I talked about in a previous post and that fri-sch posted about) this might be down in the Espressif WiFi driver. It's also possible it's in the software TCP stack, but that's lwIP and should be pretty stable. I tried modifying the firmware with a ridiculously long timeout (100 seconds) and upping the retransmit count on the server side (to something like 20) and that made the problem a bit better but didn't fix it.

If I had to solve this one on the firmware side, I'd probably have the download retried, and attempt to support the "Range" header - https://tools.ietf.org/html/rfc7233 on both the client and server. Typically you can at least get a good 200k or so, so 3-4 attempts to grab the whole thing should be sufficient.
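To make the retry-with-Range idea concrete, here is a minimal sketch of both halves: parsing a single byte range on the server side and a resume loop on the client side. It is written in Python purely for illustration (the real client would be firmware), handles only the `bytes=START-` and `bytes=START-END` forms, and `parse_range`, `download_with_resume`, and the `fetch` callback are all hypothetical.

```python
def parse_range(header, total):
    """Parse 'bytes=START-' or 'bytes=START-END' into an inclusive (start, end)."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled here")
    first, _, last = spec.strip().partition("-")
    start = int(first)
    end = int(last) if last else total - 1
    end = min(end, total - 1)
    if start > end:
        raise ValueError("range not satisfiable")
    return start, end


def download_with_resume(fetch, total, max_attempts=4):
    """Call fetch(range_header) repeatedly until `total` bytes are collected.

    `fetch` stands in for one HTTP request; it may return fewer bytes than
    asked for, which models a transfer that died part-way through.
    """
    data = b""
    for _ in range(max_attempts):
        if len(data) >= total:
            break
        data += fetch("bytes=%d-" % len(data))
    if len(data) < total:
        raise IOError("gave up after %d attempts" % max_attempts)
    return data
```

Each retry asks only for the bytes still missing, so a download that reliably gets a couple of hundred kilobytes per attempt finishes in a few requests.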

Worked around this one in my server code by transmitting more slowly and modifying the server's socket and TCP parameters to try to give the Oak the best chance of surviving the download.
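The "transmit more slowly" part can be sketched as a paced send loop. This is a rough illustration of the idea rather than the code in oakupsrv.py; `send_throttled` is a hypothetical name, and the chunk size and delay are placeholder values, not the ones the server uses.

```python
import socket
import time


def send_throttled(sock, payload, chunk_size=512, delay=0.01):
    """Send payload in small chunks, pausing between them."""
    sent = 0
    while sent < len(payload):
        chunk = payload[sent:sent + chunk_size]
        sock.sendall(chunk)
        sent += len(chunk)
        if sent < len(payload):
            # Give the receiver time to drain its buffer before the next chunk.
            time.sleep(delay)
    return sent
```

Smaller chunks and longer delays lower the throughput, which trades download time for fewer in-flight packets to lose.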

On the client side, anything you can do to reduce or eliminate causes of packet loss that would start retransmission helps. You can help the local network side by changing your wifi to B only, getting a stronger signal, moving to a different, less crowded channel, turning off other wifi devices, etc. On the server side, hosting your own LAN-based server may help avoid internet-related packet loss.

wdt reset during firmware download

Appears to be a timeout in spi_flash_erase_sector. This is run with interrupts disabled. I didn't dig into spi_flash_erase_sector to see if it was something fixable, a hardware issue, etc.

Exception (28) and stack dump

I mentioned this one in an earlier post, but it seems like it could be a bug in the Espressif driver based on the stack trace.

@pfeerick

Contributor

commented Mar 8, 2016

@jldeon nice error analysis... hope it helps in resolving the issues that people have encountered.

Yeah, I don't know what the go is with my connection - I suspect it is more MTU / packet related, as my internet connection is pretty stable, and the Broadcom chip in the modem seems to be rock solid for the wifi. I have two other ESP modules running 24/7 posting temperature stats to thingspeak, and they rarely miss a beat, considering they don't have any retry code or anything - they just power up every 10 minutes, push out an update, and go to sleep until the next reboot cycle. I was able to update an Oak using the official server via a portable hotspot on the first attempt (or was it the second?), so make what you will of that.

Regardless of all of that - I think you have given us the 2nd and 3rd options, making it very unlikely that anyone will have to resort to the 4th - 1) update from the main server 2) update from an alternate server 3) update from local server, and 4) manual update via serial.

@alfem

commented Mar 9, 2016

I am not sure if this is related to the discussed issue, but I have seen the SoftAP SSID corrupted in my tests today.

At home I have got only a couple of wifi networks, and my Oaks connect and get their update perfectly.

But I took one of them with me this morning (to impress my coworkers). When I tried the update process, my Linux laptop detected an ACORN-0bda41 SSID, started a connection, launched DHCP, and then the SSID disappeared! A new scan showed a weird SSID, ending with a strange character.

I repeated the process half a dozen times, with the same result. And my NetworkManager has got two different networks stored in /etc/NetworkManager/system-connections: "ACORN-0bda41 automatic" and "ACORN-0bda41? automatic"

Checking this file with a binary editor, the weird character in the SSID is a 0x01 byte:


00000000  5B 63 6F 6E 6E 65 63 74 69 6F 6E 5D 0A 69 64 3D 41 43 4F 52 4E 2D 30 62 64 61 34 31 01 20 61 75 74 6F 6D C3 [connection].id=ACORN-0bda41. autom.
00000024  A1 74 69 63 61 0A 75 75 69 64 3D 62 38 64 35 64 35 30 37 2D 39 31 63 33 2D 34 39 62 38 2D 62 62 64 33 2D 64 .tica.uuid=b8d5d507-91c3-49b8-bbd3-d
00000048  37 64 35 63 37 63 61 30 33 34 39 0A 74 79 70 65 3D 38 30 32 2D 31 31 2D 77 69 72 65 6C 65 73 73 0A 0A 5B 38 7d5c7ca0349.type=802-11-wireless..[8
0000006C  30 32 2D 31 31 2D 77 69 72 65 6C 65 73 73 5D 0A 73 73 69 64 3D 36 35 3B 36 37 3B 37 39 3B 38 32 3B 37 38 3B 02-11-wireless].ssid=65;67;79;82;78;
00000090  34 35 3B 34 38 3B 39 38 3B 31 30 30 3B 39 37 3B 35 32 3B 34 39 3B 31 3B 0A 6D 6F 64 65 3D 69 6E 66 72 61 73 45;48;98;100;97;52;49;1;.mode=infras
000000B4  74 72 75 63 74 75 72 65 0A 6D 61 63 2D 61 64 64 72 65 73 73 3D 32 30 3A 31 30 3A 37 41 3A 33 39 3A 37 38 3A tructure.mac-address=20:10:7A:39:78:
000000D8  36 35 0A 0A 5B 69 70 76 36 5D 0A 6D 65 74 68 6F 64 3D 61 75 74 6F 0A 0A 5B 69 70 76 34 5D 0A 6D 65 74 68 6F 65..[ipv6].method=auto..[ipv4].metho
000000FC  64 3D 61 75 74 6F 0A                                                                                        d=auto.
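The same conclusion can be read straight from the `ssid=` line in the dump, which stores the SSID as a semicolon-separated list of decimal byte values. A minimal sketch that decodes it and flags non-printable bytes; `decode_nm_ssid` is a hypothetical helper name.

```python
def decode_nm_ssid(value):
    """Turn the 'ssid=65;67;...;' byte list from the connection file into bytes."""
    return bytes(int(part) for part in value.split(";") if part)


ssid = decode_nm_ssid("65;67;79;82;78;45;48;98;100;97;52;49;1;")
print(ssid)
# prints: b'ACORN-0bda41\x01'

# Report any byte outside the printable ASCII range, with its position.
odd = [(i, hex(b)) for i, b in enumerate(ssid) if not 0x20 <= b < 0x7F]
print(odd)
# prints: [(12, '0x1')]
```

So the stored SSID is the expected 12 characters followed by one stray 0x01 byte.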

My office is in a shared building, with lots of little startups and loads of wireless networks (and many different security policies). Possibly that has something to do with this strange behaviour. As soon as I arrived home, I tried the update again, and it worked on the first attempt.

@TomKeddie

commented Mar 10, 2016

Thanks, this worked for me where nothing worked before (including serial recovery).

I used the digistump server.

Much appreciated.

@DeuxVis

commented Mar 10, 2016

@alfem I have experienced the same thing multiple times at home: an extra non-printable character at the end of the ACORN SSID. There are a lot of WiFi networks here too, as I live in a large residential building.

I have not been able to make my USB/serial adapter talk to the oak yet, so cannot provide more details currently.

@DeuxVis

commented Mar 12, 2016

Sorry for the delay. Here are some logs of unsuccessful attempts against your server, jldeon. I will do more tests later with an improved wifi environment.

jldeon_server.logs.txt

@DeuxVis

commented Mar 14, 2016

Attempts against the digistump server, with the new SoftAP, still fail for me at home: some socket read timeouts and some failures to connect to the server.

Note that the first (fast) update got much further than anything I have experienced on the official server until now.

Official_server_new_SoftAP.logs.txt

I'm going to try with the Oak close to my router, but this won't allow me to get the debug output.

Later: nope, no luck with proximity to the router. Anything else I can try to help debug this?

@jldeon

commented Mar 14, 2016

@DeuxVis looks like you're getting SOCKET TIMEOUT errors, followed by NO CLIENT errors. I'd expect the NO CLIENT errors right after a SOCKET TIMEOUT as I've said in the past, due to the Oak's port reuse.

The SOCKET TIMEOUT errors usually indicate at least a temporary packet loss situation, either on the wireless link or between you and the server. The slower transfer rate and tweaked TCP parameters of my custom server give the Oak the best chance to recover, but there's still a strong possibility of failure if your wifi is prone to dropping packets.

Have you tried running your own local instance of the update server? That would eliminate any internet-based packet loss.

If that still doesn't work, you can try tweaking your wifi settings - for me, switching to B only on wireless made a big difference.

@DeuxVis

commented Mar 15, 2016

Thanks for your reply @jldeon

To clarify, I am not looking for help to get my Oaks updated; I can do that by serial or by using another wifi router/internet connection.
As I am still experiencing the problem, I was hoping to help diagnose it. Is that still needed, or is the current solution considered good enough?

I will try a local instance of update server next, should help narrow down where the problem comes from.

@pfeerick

Contributor

commented Mar 19, 2016

Hey @jldeon, can you update the version of the firmware your custom server is pushing please? It appears to still be hosting the 0.9.5 / 5 core, instead of the more recent 1.0.0 / 6 core. I can update from your server just fine while the official server is giving me the 'SOCKET TIMEOUT' errors, but I then end up having to update (again!) using the local server to get 1.0.0 / 6. Interestingly, on the official server, the './+' symbols appear in groups of four, whereas yours is in groups of one.

@jldeon

commented Mar 19, 2016

@DeuxVis I don't know that there are many more diagnostics that would be helpful at this point. I think I know why the errors are occurring, and beyond fixing the factory firmware (which is impossible and/or useless) I don't think I can do a better job of working around them.

@pfeerick Firmware file is updated & the server on my AWS instance has been restarted.

While we're on the subject...

@digistump Did you guys set up an instance of this server somewhere? Just wondering when I can shut the AWS instance down & point it at your version. :)

@ghost

Collaborator Author

commented Mar 19, 2016

@jldeon - yes at oakotafallback.digistump.com - and the config app now points to it automatically if the first server update fails (and then refers people to this tutorial if that fails too: http://digistump.com/wiki/oak/tutorials/local_update)

For our fallback server I used your Python code, tweaked for running on a remote server, and ran your Linux TCP settings file. Did you do anything else to set that up?

Closing this issue as well, as we've now released all of this officially - many thanks @jldeon, please email me with how you'd like your bounty (support@digistump.com)

@digistump digistump closed this Mar 19, 2016

@jldeon

commented Mar 19, 2016

@digistump Gotcha. I've edited my fork and the comments on this page so they point there instead of to my AWS instance. I'll take the AWS instance down here in a bit.

I didn't do anything else to my AWS instance except for what's in the Python script and the TCP parameters shell script.

You're quite welcome! Glad I could help. I'll send you guys an email here this weekend, got to run off to 'work' in a few.

kh90909 added a commit to kh90909/OakCore that referenced this issue Jun 17, 2016

Randomize local port used for Particle connection
It's a good idea to randomize the port used, otherwise a rebooted Oak
will try and reconnect on the same port, which may result in a long
delay if the connection was not closed before the reboot, as mentioned
by @jldeon here:

digistump#54 (comment)

I believe I have observed this issue in practice on a few occasions,
where it would take exactly five seconds for the first data to come
through the connection, which seems too constant to be a random delay.
As the read timeout in blocking_receive() is 2 s, this is obviously
problematic.

As most users will be behind routers running NAT, it will be the router
settings, rather than the Particle server settings, that affect how
reconnections from the same source port are treated.