Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I2C/Wire library: Make Wire library non-blocking #42

Open
wmarkow opened this issue Sep 17, 2018 · 86 comments
Open

I2C/Wire library: Make Wire library non-blocking #42

wmarkow opened this issue Sep 17, 2018 · 86 comments

Comments

@wmarkow
Copy link

@wmarkow wmarkow commented Sep 17, 2018

This is a placeholder issue to cover an arduino/Arduino/issues/1476 improvement.

It looks like arduino/ArduinoCore-avr repository is the correct one to cover that case, doesn't it?

@wmarkow wmarkow changed the title Make Wire library non-blocking I2C/Wire library: Make Wire library non-blocking Sep 17, 2018
wmarkow added a commit to wmarkow/ArduinoCore-avr that referenced this issue Sep 17, 2018
@wmarkow
Copy link
Author

@wmarkow wmarkow commented Sep 17, 2018

I have just integrated my proposal from arduino/Arduino/issues/1476.

Loading

@matthijskooijman
Copy link
Collaborator

@matthijskooijman matthijskooijman commented Sep 6, 2019

Possible implementations:

Looking at these, I suspect that the version by @wmarkow and my own might be the best starting points, but neither completely solve the problem AFAICS.

One particular challenge I've found (but I'm not sure if it is really properly documented yet), is that the AVR TWI hardware is particularly sensitive to noise that makes it look like there is a second master on the bus. Since I²C is a multi-master bus, when the hardware detects an external start condition, or an arbitration error, it will assume there is another master active and hold off any bus activity until it sees a stop condition. However, when there is no other master but just noise on the bus, this will probably never happen.

When the hardware ends up in such a state, there is no way to actually detect this state (i.e. no status bit or anything), other than detecting a timeout (i.e. detecting that the hardware hasn't finished in reasonable time, so probably never started in the first place). There is a arbitration lost interrupt that can detect the start of this state in some cases (but not the end). The only way I've found to recover from this situation is to disable and re-enable the TWI hardware.

So adding timeouts is probably the only way to fix this. However, just having hardcoded timeouts as most of these implementations (including my own) have is problematic, because:

  • Slaves might be using clock stretching, which can erronously trigger a timeout if it is too short.
  • In a true multi-master situation, a transaction might need to wait for another master to finish their transaction, which will likely trigger such a hardcoded timeout. Since transactions might be long, and repeated start can be used to chain multiple together, there is no real upper limit on how long the timeout should be to keep facilitating a multi-master situation.

I suspect the only really correct way to handle this is to let the sketch specify custom timeout values (or perhaps specify the max clock stretching time, whether multi-master should be supported and if so, the max transaction time of other masters). However, this requires API changes that made implementing this a lot more tricky, which is why I didn't submit my fixes for inclusion (and instead also ended up with hardcoded timeouts tailored to my specific case without multi-master and with limited clock stretching).

Loading

@VladimirAkopyan
Copy link

@VladimirAkopyan VladimirAkopyan commented Sep 9, 2019

Custom timeout values are perfectly fine, and I believe API change is justified. Current situation leads to permanent lock-up of the microcontroller, and is completely unacceptable.
An inexperienced developer may know nothing about this issue and it will manifest itself many months after hardware has been designed, built and installed . Sometimes people do use Arduino for serious project because that's all they can do to solve a problem.

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Sep 15, 2019

These lockups are murdering me right now!
I've just tried https://github.com/IanSC/Arduino-Wire.h and I very much do not recommend it, seems to slow bus comms to a crawl without solving the issue. I'll go on down the list...

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Sep 15, 2019

I've just tried https://github.com/3devo/ArduinoLibraryWire on my I2C lockups and it doesn't solve them either :-(
Maybe I'm using it wrong though? I didn't see any changes to Wire.h so I didn't do anything different in my interface to the library.

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Sep 15, 2019

https://github.com/wmarkow/Arduino/tree/issue_%231476 seems to prevent my firmware from locking up! 🎉

  • I do a Wire.setClock(400000); before I get started, but after the unlock procedure, that's forgotten and the bus runs at 100kHz
  • I'm in a loop fetching values from an ADC with great speed. I'd love to be able to set the timeout in microseconds instead of milliseconds. I can be pretty sure I'm in a lock state in my application when 100us goes by without traffic on the bus and I want to do everything I can to recover from a lock up quickly so I can miss as few ADC conversions as possible!

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Sep 15, 2019

I've been looking at these lockups in my scope very closely for a few days now.
I have an idea what the root cause of them might be in my application (one master = Mega2560 rev3, one slave = TI's ADS122C04).

See 3devo/ArduinoLibraryWire#1 for my details with scope traces if you're interested. I have some ideas for changing the Wire library to prevent them in the first place, but I haven't been able to figure out how to actually implement those fix ideas in the code yet.

Loading

@wmarkow
Copy link
Author

@wmarkow wmarkow commented Sep 16, 2019

Hello @greyltc,

https://github.com/wmarkow/Arduino/tree/issue_%231476 seems to prevent my firmware from locking up! tada

I just wanted to propose you to check out my code. Good that it works for you. However not everything seems to be covered there. It is nice that you give a few more cases to take a look into:

* I do a `Wire.setClock(400000);` before I get started, but after the unlock procedure, that's forgotten and the bus runs at 100kHz

Indeed, when a timeout condition is met, then I restart the TWI hardware (twi_disable() and twi_init()). In this review is suggested that it should not restart TWI but return a result code instead (or set some flag indicating a timeout failure). The user can check the flag later in the main loop, and reinit TWI in his way (like setting the clock to 400000). In my case that would help but I'm not sure if it works for you, when you need to recover from timeout failure very fast, so you can make your ADC conversion.
Imagine the case: you do not know exactly in which part of your code the timeout will happen. The timeout is set to 10ms (for example). Lets assume there is like 20 another TWI operations somewhere in the code between the timeout and your ADC conversion code. I have the felling that all of those 20 TWI operations may end up with the timeout (but I'm not 100% sure), so it will take at least 20 x 10ms = 200ms before your ADC code will be executed.
In my solution I can wait those 200ms and reinit TWI later in loop. For you - the Wire library may go into a "timeout failure" state and may/should not execute any TWI operations (all API methods may/should return immediatelly with a correct result code) until you reinit it somewhere in main loop. That's only a proposed solution.

* I'm in a loop fetching values from an ADC with great speed. I'd love to be able to set the timeout in microseconds instead of milliseconds. I can be pretty sure I'm in a lock state in my application when 100us goes by without traffic on the bus and I want to do everything I can to recover from a lock up quickly so I can miss as few ADC conversions as possible!

Yes, my code sets the timeout in millis but it seems to be no problems to rework it into microseconds.

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Sep 16, 2019

@wmarkow, I took your https://github.com/wmarkow/Arduino/tree/issue_%231476 branch and changed the timeout argument to microseconds and made any changes the user might have made to slave address or bitrate (the only two register values exposed by the Wire library) re-applied after the reset and put it in PR #107

Loading

@Hyperdimensionals
Copy link

@Hyperdimensionals Hyperdimensionals commented Feb 2, 2020

I'm using a DS3231 external clock and noticed it 'freezes' when I disconnect power to the component. I want the program to continue looping even if the clock is disconnected. I narrowed it down to a Wire.endTransmission() line in the module code, and then found this page.

Forgive my possible ignorant novice question, but which implementation listed by @matthijskooijman works best for this issue? Or is this a situation where there's some line of code I can add that will check whether the clock device is connected/powered before reaching the Wire.endTransmission() line? A lot of this discussion is over my head so I wasn't sure if this has a simpler solution than the issues discussed above?

Loading

@Mahdi-Hosseinali
Copy link

@Mahdi-Hosseinali Mahdi-Hosseinali commented Feb 7, 2020

I'm using a DS3231 external clock and noticed it 'freezes' when I disconnect power to the component. I want the program to continue looping even if the clock is disconnected. I narrowed it down to a Wire.endTransmission() line in the module code, and then found this page.

Forgive my possible ignorant novice question, but which implementation listed by @matthijskooijman works best for this issue? Or is this a situation where there's some line of code I can add that will check whether the clock device is connected/powered before reaching the Wire.endTransmission() line? A lot of this discussion is over my head so I wasn't sure if this has a simpler solution than the issues discussed above?

Use wire.endTransmission(true), it would end the transmission even if there's no response from the other side. Although this still does not act like a TIME_OUT.

Loading

@Hyperdimensionals
Copy link

@Hyperdimensionals Hyperdimensionals commented Feb 8, 2020

wire.endTransmission(true) worked for the SDA and SCL lines getting disconnected, but unfortunately when the power line is disconnected it still blocks. I'm using a battery powered clock module so the battery could die and I don't want that to become a failure point.

Loading

@Mahdi-Hosseinali
Copy link

@Mahdi-Hosseinali Mahdi-Hosseinali commented Feb 10, 2020

It probably depends on your circuitry, mine works when the other device is off. The reference manual says it sends a stop signal and releases the line, but it apparently doesn't. Or it could be that something else is wrong and we are not adept enough to understand it (electronics can be very complicated)

Loading

@ermtl
Copy link

@ermtl ermtl commented Mar 27, 2020

It's been 9 FUCKING YEARS since that bug was first discussed (2011), countless people pulled their hair trying to understand why their Arduino was freezing, why would it work normally, then suddenly stop, a dozen of times, it's been raised in here, dozens of times people had been told to get lost, use something else, etc ...
#7328
#2418
#1842
#5249
#42
Google shows pages after pages of people getting started and who get discouraged by this elusive bug they don't understand. This runs totally contrary of what arduino is supposed to stand for.

If the fix was difficult, if it compromised compatibility or added other problem, it would be understandable, but it's not the case, all that shit is caused by those 2 damn lines of code that obviously can create an infinite loop if for some reason the read operation does not complete :

// wait for read operation to complete
while(TWI_MRX == twi_state){
continue;
}

Yes, I know, this state is not supposed to happen according to the I2C protocol, but guess what ? electrical glitches didn't get the memo.

Countless people, after losing hours or days kind of solved the issue either by making a modified version for themselves, or switching to another library, so there are several implementations of a timeout that are simple and would easily solve the issue.

But arduino developers stubbornly refuse to fix it !

You think I'm rude ? After 9 years of giving the finger on this issue to the whole community for absolutely no reason, I couldn't care less.

How long before you close it again ?
Just be honest, and put a "fuck you all, WONTFIX" label on it ...

Loading

@VladimirAkopyan
Copy link

@VladimirAkopyan VladimirAkopyan commented Mar 27, 2020

Basically, @ermtl is right, this is ridiculous, WTF?

Loading

@cmaglie
Copy link
Member

@cmaglie cmaglie commented Mar 27, 2020

You think I'm rude?

Nah I don't think it... you are, period.

Also posting the same message 5 times is not nice, especially for the 99.9% of people subscribed to this repo that are not interested in this discussion. Anyway, now you did it, so let's just skip over this.

How long before you close it again ?

We haven't closed anything... the real problem is that there isn't a one-fit-all solution. If you read the previous discussions all the proposed fix were always barely tested and tailored to the OP specific use case.

I personally tested two PR in the past that, in their promises, will add timeout "without breaking anything" but you know what? at the times when I tested in my setup (an I2C display with a bunch of env sensors) they just broke my previously perfectly working sketch or made it unreasonably slow.

Don't underestimate the complexity of this issue, it's not just "2 damn lines of code".

Said that, I ask you, did you read the previous discussions?
Which one, of the previously proposed solution, you vouch?
Have you already tested them in your setup?

Loading

@VladimirAkopyan
Copy link

@VladimirAkopyan VladimirAkopyan commented Mar 27, 2020

@cmaglie sure the @ermtl is rude, but his assessment is broadly accurate. I am willing to bet my house that someone is going to use Arduino to build a ventilator to deal with CORVID-19, and it will hang and kill someone.

I tried to follow the discussion, but it's spread over many threads. Let's consider a strategy here:

  1. What is the central place for this discussions, is it this thread? Is it this PR? arduino/Arduino#2489
  2. What are the acceptance criteria, can we define a list of libraries and sensors that we are determined not to break? Are breaking changes acceptable?
  3. What is the strategy if breaks are a unavoidable? For libraries that will break, are we going to do a PR or get in touch with the author?
  4. Given that I2C protocol is often abused and misused, including by sensor manufacturers, how do we cater to those cases?
  5. Perhaps the alternative is to create a separate method / API and gradually depricate the old one? Would that be acceptable?

I think the frustration is not just from lack of a solution - that's understandable if the problem is complex. I think it stems from lack of clarity and leadership, there does not appear to be a plan on how to get it fixed. If someone wanted to contribute, is it not clear what they should do, at least to me.

Loading

@bxparks
Copy link
Contributor

@bxparks bxparks commented Mar 27, 2020

@VladimirAkopyan: You would win that bet. I was amazed and mortified when one of the authors of this (https://github.com/ermtl/Open-Source-Ventilator) asked me a question about one of my libraries. Fortunately, I just noticed this in their README.md:
"0.15 Minor version change. This version allows the replacement of the buggy Wire.h arduino library (can hang the controller) with a correct version known as jm_Wire"

Loading

@ermtl
Copy link

@ermtl ermtl commented Mar 28, 2020

bxparks : Well, it's a small world out there and I'm the main author of this Open Source Ventilator controller as part of an international team trying to design and assemble complete open source, easy to replicate emergency ventilators.

A few month ago I had the problem on a commercial product, and I went all the way, from adding capacitors to better filter supply electrical noise, to lowering the value of the pullup resistors to 1Ko, adding little serie resistors just in case, to re designing the wiring, to thinking that I might have had a bad lot of ICs, ordering new ones, waiting for them to arrive, seeing they were no better, adding a watchdog (but it was just an ugly fix), searching for a new similar sensor, ordering the new sensor, trying it, understanding that it could not possibly be 2 different kind of sensors that are both bad the same way, finally suspecting something could be wrong with the library and finally finding this utterly stupid bug !
I then switched to this library that simply includes timeouts:
http://dsscircuits.com/articles/arduino-i2c-master-library
It was nice, but forced me to rewrite the sensor library.
In the meantime, what I billed as a 1 week job had taken me more than 2 month (that's 1 month 3 weeks I could not even imagine billing the customer anything for) and the customer grew impatient. When I finally solved the issue, it was a sight of relief, but in the process, my customer's trust got broken, and he hired someone else to rebuild it, and it's not with an Arduino !

This is just what happened to me, and the bug was caught fast as the device was a noisy environment and it would lockup every few hours / minutes, but some are designing products that are not in an electrically noisy environment, and they might never see the bug, start selling their stuff and people place them close to motors, or old blinking fluo tubes, and suddenly, the products start to fail !

People use Arduinos as the glue between sensors and actuators, and a large proportion of them are I2C circuits. Having such a vicious bug lurking, ready to lockup the entire board undermines the whole platform and confines it to "hobby" status. Even then, when newcomers assemble their first gizmos, seeing it die on them for apparently no reason is really frustrating and might drive them away.

In the Open-Source-Ventilator project, many people must cooperate, we're in a hurry, people are trying several sensors, so telling them they can't use any of the most usual libraries without rewriting them first is totally unrealistic.

After spending a day on the problem (again), I found the jm_wire library (available in the library manager) that's compatible with Wire and the author simply made minimal changes to implement a fucking timeout, and, call that rocket science if you will, but yes, the timeout length can be changed, no less ! Crazy ... just what everyone asked for 9 years.

However, it does not entirely solve the issue as I2C sensor libraries will reference the wire library within their code with
#include <Wire.h>

When also including jm_Wire, all the library's function will be defined twice, and you can bet the compiler will yell and fail.

So the less ugly way I found is to look at the source files of each and every I2C related library searching for #include <Wire.h> and manually replacing it with:

#if __has_include("jm_Wire.h")
#include <jm_Wire.h>
#else
#include <Wire.h>
#endif

It does the trick ... until you update the library !

Since I'm working on a device that needs utmost reliability, not just a gizmo that blinks, it's ok for me to warn people about the stupid problem, and tell them how to overcome it. Here's my explanation:
https://github.com/ermtl/Open-Source-Ventilator/blob/master/OpenSourceVentilator/README.md

There is just a thing I would change in the library, and it's that you have to set jm_Wire.twi_readFrom_wait and jm_Wire.twi_writeTo_wait to true manually. If that was by default, there would be zero impact on the API, the only change would be that in case of electrical error / glitch, the automagic random self hanging of the controller would be gone. And if people want multi master (given the flaws in the current library, I bet not many successfully ever did that, it can barely work with a single sensor) they can add delays by setting twi_readFrom_timeout and twi_writeTo_timeout !


If, after 9 FUCKING YEARS, Arduino developers still refuse to understand the problem and it's impact, I don't know what will wake them up. People tried asking, alerting, complaining, begging, and every time, the pull requests have been ignored, the bug have been closed, replaced with a new one so that newcomers have no idea this is such an old problem, and that countless others before them also complained, here and elsewhere about it.

So yes, I'm rude and if cmaglie and others don't like it, they know where they can shove it up !

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 10, 2021

@Aleev2007
There are many ways to design an i2c library and different API that has the potential to be better than the existing Wire library if starting from scratch.
Unfortunately, the Wire library does not have that luxury since the Wire API is well established and exists across many platforms, including many 3rd party platforms.
Backward compatibility with the existing API to support existing applications that use the Wire library is very important.

IMO, the only way to move to an i2c library API that is semantically different than the existing API, would be to create an entirely new library where there is complete freedom to design it from scratch and no restrictions of backward compatibility.
Lots of things could be done differently, including handling multiple buses, and potentially asynchronous i/o that occurs in the background.

A new ic2 library with a new API is very different from making an incremental change to the current Wire library to try to make the existing Wire library a bit better, which is what this issue and the corresponding loop timeout code is about.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 10, 2021

@bperrybap

Unfortunately, the Wire library does not have that luxury since the Wire API is well established and exists across many platforms, including many 3rd party platforms.
Backward compatibility with the existing API to support existing applications that use the Wire library is very important.

That is why I urge you to stop before it's too late.
All you have to do is put on the fuse. And in the event of a short circuit, it should work. And it's all!
Choose a sufficient timeout that will satisfy 99% of users and hard code it.
Let it be 1 second.
Then the cycle is interrupted and the error flag is set.
And close this topic for good.
Otherwise, you will declare after 10 versions that you need backward compatibility to this useless new functionality. Which in 99% of cases only slows down the system.
The remaining 1% of those who need to change the size of the timeout, moreover in engineering units, can use alternative versions or simply fix the constant in the twi.c. file.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 10, 2021

Anyone trying to use the I2C protocol to poll a bunch of sensors scattered throughout the room. And people's lives depend on the performance of the device. I say - Stop it!
For such things, the CAN bus was invented long ago.
I2C works well only within a board, which is closed in an iron box.

May the force be with you. ;))

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Jan 10, 2021

@bperrybap Thanks for doing some performance comparisons!
Could you try this patch to see if it improves the performance of your 1.8.13(no timeout) test case?

Loading

@ermtl
Copy link

@ermtl ermtl commented Jan 10, 2021

@greyltc : In addition to the conditional call to micros() in your patch, another easy way to speed things up would be to only initialize startMicros once so that the timeout delay is used for the whole twi_readFrom function and not reinitialized for each of the 3 places where a timeout could occur. This way, 2 calls to micros() are removed enhancing both speed and code size. It would also guarantee that the function exits after the timeout delay and not something between 1x and 3x that delay.

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Jan 10, 2021

@ermtl great point. Your suggestion is now incorporated into the above patch for both twi_readFrom and twi_writeTo via greyltc@588135c

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 10, 2021

@greyltc

I am Russian, and therefore it is difficult for me to convey an important message to you.
Everything you are doing now is making a big mistake!
Everything worked fine except for the I2C speed of 0 kbit / s.
Stop writing code. The user does not need it.
You just need to write one line in three places.
It will prohibit the speed 0 kbit / s.
Just do it.
Good luck.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 10, 2021

@greyltc
If you think I'm an idiot. so be it.
I'll give you a quick fix.
For better library performance.

in two places you use a construction like this:

  // activate internal pullups for twi.
  digitalWrite (SDA, 1);
  digitalWrite (SCL, 1);

  // deactivate internal pullups for twi.
  digitalWrite (SDA, 0);
  digitalWrite (SCL, 0);

It doesn't make sense as these pins cannot be PWM.
Directly set the state of the conclusions.

// declare
volatile uint8_t *out;
// activate internal pullups for twi.
out = portOutputRegister (digitalPinToPort (SDA)) 
*out |= (digitalPinToBitMask (SDA) | digitalPinToBitMask (SCL));

// declare
volatile uint8_t *out;
// deactivate internal pullups for twi.
out = portOutputRegister (digitalPinToPort (SDA)) 
*out &= (~digitalPinToBitMask (SDA) & ~digitalPinToBitMask (SCL));

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 10, 2021

@greyltc
yes, that patch improves things, but only a slight amount (~ 3%) and then there is a similar penalty when timeouts are enabled.
using 32 bit values hurt on an 8 bit AVR.
IMO, that patch is not a good way to try to restore performance as it is not an overall improvement since it hurts the i/o timing even more than what it was before when timeouts are enabled.

What I mentioned before is that the code be implemented in a way to quickly select between multiple loops to be able to provide similar timing when timeouts were disabled.
That would be something like this, where twi_timeout_enabled is a 8 bit flag that gets set when timeouts are enabled. The twi_timeout_enable flag is NOT the actual timeout value.

  // wait until twi is ready, become master receiver
  if (twi_timeout_enabled)
  {
    startMicros = micros();
    while(TWI_READY != twi_state)
    {
      if(((micros() - startMicros) > twi_timeout_us))
      {
        twi_handleTimeout(twi_do_reset_on_timeout);
        return 0;
      }
    }
  }
  else
  {
    while(TWI_READY != twi_state) {}
  }

That said, I would not jump on this as the solution as I think the entire timeout stuff needs more thought and analysis as I think the overall goal should be to come up with a way to avoid the loop lockups without hurting i/o throughput too much - which is the overall comment from @Aleev2007

I haven't looked that closely at the code and the specific i2c state for each state wait loop but I'm guessing that it may be possible that not all the timeouts should be or need to be the same.
I'm also not sure that it would be a good idea to change to the timeout being for the entire transaction as @ermtl suggested vs doing a timeout on the specific state transition as that introduces potential timeouts when there is no real issue on the bus if the transaction is running slow due to multiple masters, or a slave doing clock stretching.

There is definitely an art to doing performance optimization.
I've been involved with embedded code projects for 40+ years.
Projects from fault tolerant / must always run systems, to ones in products that were deployed by the 10's of millions.
The most important thing to always keep in mind when optimizing is to not optimize the code but the system.
What I mean by that is that when doing optimization you need to optimize the entire system, not a tiny piece of code.
And to understand how to optimize the system, the system must be fully understood and then profiled.
Without some amount of profiling, there is no way to really know where the overhead really is so trying to do anything is just trial and error guess work - which I am really opposed to.
The type of optimization is does not matter. i.e. you can optimizing for speed, code size, or ram usage.
In some cases it may require doing things differently vs trying to speed up what is currently being done.

I think the best comment so far was your comment

This I2C library could probably benefit from a rewrite from scratch

This library is definitely a bit like a Frankenstein.
Lots of odd parts bolted together and some don't fit together very well.
Things like some of the double buffering between Wire.cpp and twi.c to name one area.

However, I do think that some improvements could be made

For the loop timeout code, if I were doing it, I'd definitely be striving to see if there was a way to avoid 32 bit values for an AVR implementation.
i.e. is 1 us timeout resolution really necessary?
Could 1ms resolution be good enough? (micros() has more overhead than millis())
Could 255ms be enough for a state change loop timeout?
As @Aleev2007 suggested, do users really need to configure the timeout?

For timeouts using ms there are games that could be played using millis() and truncating the return value to just the lower 8 bits to allow 8 bit math and compares if 255ms were long enough.
And for longer than 255ms a loop around that could be done.

IMO,
Whatever the implementation choices are, before these type of code changes are ever committed and made available to users, the implementer should be fully profiling it to know any/all new issues, including code and ram usage changes and i/o performance should be measured with a test sketch, logic analyzer and/or scope to know the full implications of the changes to i/o performance.

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 10, 2021

Guys,
This seems to have entered into a technical discussion about implementation.
Would this better handled on the mailing list rather than through a git issue?


From a larger picture, all these fast and furious incremental code updates and suggestion for updates are, IMO, hacks. i.e. coding on the fly while sitting at a keyboard, with no overall design and not enough thought behind them.
I am not in favor of using this type of methodology, particularly in a piece of code that is bundled with the IDE and used by lots of users and 3rd party libraries.


@Aleev2007

I2C speed of 0 kbit / s ?

I have no idea what you describing.

In terms of inlining the i/o register updates vs using digitalWrite()
Why the desire for the change?
Particularly when what you are proposing can cause register corruption.

In this case, using digitalWrite() is better than what you have proposed.
You need to understand that digitalWrite() ensures atomic register updates. The code you provided does not.
Without atomic register updates, this code could corrupt a register if the same register were used by and updated by another piece of code (say a library) that is modifying the same register in an ISR.
Even that other code calls digitalWrite() in the ISR, your proposed code could still corrupt the register since it does not update the register atomically.
The digitalWrite() calls in the twi code are only being used in non time critical portions of the code (initialization & error recovery) that are not in the real time path for i/o so it will have absolutely no effect on normal i/o performance so there is no need to optimize that for speed.
Because of all this, it is better to stick with using the portable pin i/o routines that ensure that registers will not be corrupted.

Loading

@ermtl
Copy link

@ermtl ermtl commented Jan 11, 2021

@greyltc :
Further speedup could be obtained by adding a byte loop counter. At first, and while the loop counter has not reached 0, the while statement should execute nearly as fast as the old version without timeout. The value for the byte loop counter (here 48 @ 16MHZ, should be tweaked) must be high enough to catch most of the fast peripherals without slowdown as the counter does not reach 0 and the time consuming calculation is never done. If the delay is longer and the counter elapses, it falls back to the more precise but time consuming current timeout routine. This would create a lower limit of a few tens of microseconds to the minimum timeout value, this should not be a problem since even a short timeout should be much longer than this. If the programmed timeout is shorter than the loop counter time, the routine will exit after the first test once the counter reaches 0.

uint8_t cnt=3*clockCyclesPerMicrosecond();  // Speedup loop counter. Should be long enough for most devices to respond, but not overflow the 8 bits value on fast processors
while(TWI_READY != twi_state){
  if (cnt) {cnt--;} else {
    if((twi_timeout_us > 0ul) && ((micros() - startMicros) > twi_timeout_us)) {
      twi_handleTimeout(twi_do_reset_on_timeout);
      return 0;
    }
  }
}

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 11, 2021

@bperrybap

Even that other code calls digitalWrite() in the ISR, your proposed code could still corrupt the register since it does not update the register atomically.

You're right. It will be correct.

#include < util / atomic.h >

// declare
volatile uint8_t *out;
// activate internal pullups for twi.
ATOMIC_BLOCK (ATOMIC_FORCEON){
  out = portOutputRegister (digitalPinToPort (SDA)); 
  *out |= (digitalPinToBitMask (SDA) | digitalPinToBitMask (SCL));
}

// declare
volatile uint8_t *out;
// deactivate internal pullups for twi.
ATOMIC_BLOCK (ATOMIC_FORCEON){
  out = portOutputRegister (digitalPinToPort (SDA));
  *out &= (~digitalPinToBitMask (SDA) & ~digitalPinToBitMask (SCL));
}

I2C speed of 0 kbit / s ?

I have no idea what you describing.

I2C-bus specification and user manual
4.2.1 I2C/SMBus compliancy

SMBus and I2C protocols are basically the same: A SMBus master is able to control I2C
devices and vice versa at the protocol level. The SMBus clock is defined from 10 kHz to
100 kHz while I2C can be 0 Hz to 100 kHz, 0 Hz to 400 kHz, 0 Hz to 1 MHz and
0 Hz to 3.4 MHz,
depending on the mode. This means that an I2C-bus running at less
than 10 kHz is not SMBus compliant since the SMBus devices may time-out.
NXP Semiconductors UM10204 User manual Rev. 6 — 4 April 2014 page 33 of 64
link

I suggest solving the problem in a similar way.
Specify in the specification for the library the minimum baud rate of 1 Hz.
And hard-code it without the ability to change by the user by means of the library.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 11, 2021

@greyltc :
Further speedup could be obtained by adding a byte loop counter. At first, and while the loop counter has not reached 0, the while statement should execute nearly as fast as the old version without timeout. The value for the byte loop counter (here 48 @ 16MHZ, should be tweaked) must be high enough to catch most of the fast peripherals without slowdown as the counter does not reach 0 and the time consuming calculation is never done. If the delay is longer and the counter elapses, it falls back to the more precise but time consuming current timeout routine. This would create a lower limit of a few tens of microseconds to the minimum timeout value, this should not be a problem since even a short timeout should be much longer than this. If the programmed timeout is shorter than the loop counter time, the routine will exit after the first test once the counter reaches 0.

uint8_t cnt=3*clockCyclesPerMicrosecond();  // Speedup loop counter. Should be long enough for most devices to respond, but not overflow the 8 bits value on fast processors
while(TWI_READY != twi_state){
  if (cnt) {cnt--;} else {
    if((twi_timeout_us > 0ul) && ((micros() - startMicros) > twi_timeout_us)) {
      twi_handleTimeout(twi_do_reset_on_timeout);
      return 0;
    }
  }
}

Obviously, this (TWI_READY != twi_state) check is done to make sure the second device is ready to receive / transmit.
If the device is not ready, then it holds the line, and the loop body starts executing.
If the loop body takes much longer than the time it took to prepare the second device, it will slow down the transfer rate.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 11, 2021

It seems that not everyone understands what.
To reach 400kHz speed. The driver must have time to process the incoming or prepare the outgoing byte in 40 clock cycles (16 MHz / 400 kHz = 40). This also includes the time to check the timeout.
In the latest driver implementation, with blackjack and whores, this is far from the case. :))

Loading

@greyltc
Copy link
Contributor

@greyltc greyltc commented Jan 11, 2021

I think further convoluting this issue with discussion of the throughput performance of the I2C timeout implementation we've used is not such a great idea. I understand it's only still open because there's still some documentation left to do (which of course is the boring part, so that lingers).

If anyone has a specific I2C performance problem that they'd like to raise, it seems best to file a new issue to cover that. Feel free to tag me there, especially if it's related to the newish timeout stuff and I'll do my best to chime in.

If anyone has a specific idea for making an improvement to the library, you could submit that as a PR and we can discuss that specific change there. Though in making your PR, consider that API changes need to present some pretty compelling evidence to have a chance at being merged, especially backwards incompatible ones (and rightly so). Maybe less formal ideas for improvement could be hashed out on the forums or something before they're ready for a PR?

I definitely think some of the specific performance improvement ideas that we've thrown around here show promise and I'll try to find some time to do some apples to apples oscilloscope informed benchmarking on them myself and submit them as a PR if I think the numbers I see warrant that. I'd tag a few of you on such a PR so we could review that together.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 11, 2021

@bperrybap
You can test the speed of this
https://github.com/Aleev2007/twi
no micros(), no uint32_t, fully compatible with the current version.

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 11, 2021

@Aleev2007
The approach does appear to offer faster state transition detection but it still would need some updates to be fully compatible.
While the code may be API compatible, it is not functionally compatible.
The timeout period will vary depending on the CPU clock speed and likely depending on how the optimizer optimizes the counter variables for the counter decrements and goto operations given the 8 bit counters may or may not be in a register.
The CPU clock speed affecting the timeout could be resolved with some compile time conditionals and macros to modify the delay & counter values depending the value of F_CPU.
It could be a simple as just modifying the timeout value down in twi_setTimeoutInMicros() to scale the timeout value used based on F_CPU to set the counter values to offer the requested timeout consistently.
I would think it would be simpler and easier to maintain if it were all handled in twi_setTimeoutInMicros()
That way begin() in Wire.cpp would call twi_setTimeoutInMicros() to set the default timeout and then there is no need to pre-calculate 8 bit counter values.
However it is done, I think it is important that the API and the underlying code ensure that the timeout for a given timeout value passed to twi_setTimeoutInMicros() not be affected by the CPU clock speed.
i.e. if 25000 is passed to twi_setTimeoutInMicros() the timeout is the same regardless of CPU clock speed.

The way to work around the potential optimization issues, would be to declare the 8 bit counters volatile, so that consistent code is always generated.

Also there is still a 32 bit comparison in the runtime loop of when timeouts are enabled.
32 bit comparisons are expensive on the AVR.
i.e. this line is doing a 32 bit comparison:
if (twi_timeout_us == 0) goto check;

To speed this up an additional 8 bit flag could be used to indicate that timeouts are enabled (set to true if twi_timeout_us is non zero.
or since twi_timeout_us is not used for anything other than this test after twi_setTimeoutInMicros()
just get rid of it and have simple 8 bit flag to indicate that timeouts are set. It would be set if twi_setTimeoutInMicros() is passed a non zero timeout.
That flag could be used in the loop and tested to see if the 3 counters need to be loaded/initialized.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 11, 2021

@bperrybap
You are absolutely right. I know that you need to add the twi_setTimeoutInMicros () function.
And at the expense of the line
if (twi_timeout_us == 0) goto check;
it is probably not needed at all, even if there is a flag.
I don't see any scenarios where you need to turn off the timeout.
You can just make it even bigger.
And without this line it will work faster.
two "if" operators to disable three "=" operators
I need to look at the generated code to determine which is better.
I'll try to do it tomorrow.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 12, 2021

@bperrybap
I think you can try to measure the speed.
https://github.com/Aleev2007/twi

timeout switch now uses only 10 MSU ticks ;-)

if (!twi_timeout_off_flag){                                                   		  |ticks
 19c:	80 91 02 01 	lds	r24, 0x0102	; 0x800102 <twi_timeout_off_flag>         | 2
 1a0:	81 11       	cpse	r24, r1                                             	  | 1  timeout off | 2 timeout on
 1a2:	09 c0       	rjmp	.+18     	; 0x1b6 <twi_stop+0x2c>                   | 2  goto check; | 0 timeout on     [off +5] 
    counter_1 = set_1;                                                        	          |
 1a4:	80 91 b7 01 	lds	r24, 0x01B7	; 0x8001b7 <set_1>                        | 2                     
 1a8:	89 83       	std	Y+1, r24	; 0x01                                    | 2 
    counter_2 = set_2;                                                                    |
 1aa:	80 91 01 01 	lds	r24, 0x0101	; 0x800101 <set_2>                        | 2
 1ae:	8a 83       	std	Y+2, r24	; 0x02                                    | 2
    counter_3 = set_3;                                                                    |
 1b0:	80 91 00 01 	lds	r24, 0x0100	; 0x800100 <__data_start>                 | 2
 1b4:	8b 83       	std	Y+3, r24	; 0x03                                    | 2     			      [on +16] 
  }                                                                                       | 
check:                                                                                    |                                        
  if (!(TWCR & _BV(TWSTO))) goto go;                                                      | 
 1b6:	80 91 bc 00 	lds	r24, 0x00BC	; 0x8000bc <__DATA_REGION_ORIGIN__+0x5c>  | 2
 1ba:	84 ff       	sbrs	r24, 4                                                    | 1  goto go;    | 2  wait
 1bc:	1c c0       	rjmp	.+56     	; 0x1f6 <twi_stop+0x6c>                   | 2  goto go;    
  if (twi_timeout_off_flag) goto check;                                                   |
 1be:	80 91 02 01 	lds	r24, 0x0102	; 0x800102 <twi_timeout_off_flag>         | 2
 1c2:	81 11       	cpse	r24, r1                                                   | 1  timeout off | 2 timeout on
 1c4:	f8 cf       	rjmp	.-16     	; 0x1b6 <twi_stop+0x2c>                   | 2  goto check; | 0 timeout on     [off +10]  
  if (!(--counter_1)) goto check;                                                         |
 1c6:	89 81       	ldd	r24, Y+1	; 0x01                                    | 2
 1c8:	81 50       	subi	r24, 0x01	; 1                                       | 1
 1ca:	89 83       	std	Y+1, r24	; 0x01                                    | 2
 1cc:	88 23       	and	r24, r24                                                  | 1
 1ce:	99 f3       	breq	.-26     	; 0x1b6 <twi_stop+0x2c>                   | 2  goto check; | 1 counter = 0    [on +28] [1 loop 16 ticks]
  if (!(--counter_2)) goto check;                                                         | 
 1d0:	8a 81       	ldd	r24, Y+2	; 0x02                                    | 2
 1d2:	81 50       	subi	r24, 0x01	; 1                                       | 1 
 1d4:	8a 83       	std	Y+2, r24	; 0x02                                    | 2
 1d6:	88 23       	and	r24, r24                                                  | 1
 1d8:	71 f3       	breq	.-36     	; 0x1b6 <twi_stop+0x2c>                   | 2  goto check; | 1 counter = 0      
  if (!(--counter_3)) goto check;                                                         |
 1da:	8b 81       	ldd	r24, Y+3	; 0x03                                    | 2
 1dc:	81 50       	subi	r24, 0x01	; 1                                       | 1
 1de:	8b 83       	std	Y+3, r24	; 0x03                                    | 2
 1e0:	88 23       	and	r24, r24                                                  | 1
 1e2:	49 f3       	breq	.-46     	; 0x1b6 <twi_stop+0x2c>                   | 2  goto check; | 1 timeout                 

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 12, 2021

Coming back to the orignal topic of eliminating the potential of a loop lockup in the Wire library,

Having looked at the 1.8.13 "timeout" code now in a bit more detail,
IMO, I think the current code was put together a bit too hurried and didn't get enough design input/review/thought or testing by technical users that understand things at this level as well as API portability issues before it was released in the 1.8.13 IDE.
As an example, one big issue was that as the code was released in 1.8.13, it extended the WIRE API in a non detectable way, with timeouts disabled, which means libraries/sketches that wanted timeouts had to hard code things for this specific library and would then break when used with the older versions of the Wire library, including on the very same platform.

I think it would be useful for everyone to step back and think about the loop lockup issue and look at the lockup/timeout problem with a fresh set of eyes.

IMO, the current implementation is a bit too heavy so it impacts i/o performance a bit too much.
The other thing I think that is worth REALLY thinking about is how to actually handle the loop timeouts.
IMO, 1 us timeout resolution is not needed and is total overkill.
Can 1 us even accurately ever truly be supported, particularly on slower processors?
I think for a configurable timeout, 1ms would be more than good enough.
This "timeout" is a failsafe to avoid a lockup loop in the code.
It is something that normally should not ever happen.
The length of the timeout in no way affects the i/o performance, so it isn't something that needs to be tuned.
However, how the loop timeout code is implemented does affect i/o performance.
In fact how much configurability of the timeout is really needed?
There is no spec for how long to wait for these events,
but there are certain minimums that are required for some of these, like when multiple masters are used, that are currently not enforced.
How will a typical Arduino user ever know what to pick? They won't.
The vast majority will just run with whatever the default timeout is.
If at some point a timeout is enabled by default, the vast majority of users will never even mess with setting timeout values.

Would a simple hard coded non configurable timeout, on the order of say 1/2 or 1 second, be good enough?
IMO, it likely would be good enough.

Since the WIRE library and WIRE API is somewhat a universal API implemented across nearly all platforms, running on processors at various speeds with various resources, the timeout functionality and any new WIRE API functions should carefully consider their impact on other platforms or slower processors.
For example, what about a slower processor or a software Wire implementation?
Those will likely never be able to support accurate 1us timeout resolution.
Also, other platforms may not have the loop lockup situation like the AVR WIRE library did.

In other words would it be better in the grand scheme of things to have a simpler implementation that could be adopted more widely? vs trying to make something that has such granular configurability?

I bring this up because if it becomes desirable to make some significant changes to how loop timeouts are handled to simplify things or offer better performance, now would be the time to do it, before the existing 1.8.13 "timeout" code and API starts to get documented and gets too many dependencies.


Anybody that is working on this type of code and making these types of changes should be capable of testing their stuff themselves first before they make it available to others.
Like said earlier I absolute HATE trial & error fly by the seat of the pants keyboard hacking and, IMO, that is what I see going on.

@Aleev2007 You have made significant changes to the low level twi.c code in places that are not related to this loop timeout. Also, nearly all the comments have been stripped out.
The code is now broken.

There is no need to be making these types of changes, particularly when 1) they are in places that will not affect the runtime i/o performance, and 2) breaks things.
The latest code I saw at 2020-01-12 13:00 CST commit ( 80cf430 ) on your github repository was not stable enough to run performance numbers.
After calling Wire.setWireTimeout() to set the default timeout, every read and write times out.
If you call Wire.setWireTimeout(0) after that to disable the timeouts, the code locks up on the next attempt to do a write

The twi.c code also has several warnings in it and I believe that one is indicating a calculation error due to a math/expression precedence order.

Loading

@ermtl
Copy link

@ermtl ermtl commented Jan 12, 2021

@bperrybap Even if you have no use for a timeout faster than 500ms or 1 second, it's still useful for others.
Most sensors respond almost immediately, and a few hundred microsecond delay means it will timeout. Knowing about it ASAP can be important (to stop a motor, shut off a pneumatic valve, etc ...) and one second later could be too late. Also, some applications require compatibility with the SMBus specification (25msec). However, as you mention, a 1us granularity is not really useful. If a bigger granularity is used, the stored timeout value could be 16 bits, that's much faster to test and uses 2 RAM bytes instead of 4. With the fast test, the need for a separate boolean value also goes away, freeing another RAM byte.
As an example, if the granularity is 64uS, a 16 bit value would allow a timeout between 64us and 4.19 seconds + unlimited (no timeout) if the value is 0.
The changes proposed by @greyltc and myself practically remove the slowdown for fast responding devices without requiring extensive rewrite of the existing code.
Adding this timeout feature have been so laborious and took so long, doing it all over again and changing the API again instead of fixing the speed issue does not seem being the best way forward.

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 13, 2021

For me personally, none of this really matters as I don't have anything that requires and demands maximum i/o throughput or the ability to set very short state change timeouts.

I do think of utmost critical importance is that there needs to be some kind of macro that user code can test at compile time so the user code call tell if this new loop abort/timeout capability exists, especially if it is not enabled by default which means the user code do something to enable the timeouts.
The WIRE library code in 1.8.13 failed to do this.

Whenever you guys have something that works, I'll be happy to try it.

In terms of performance, as of right now, from what I've seen, the approach that @Aleev2007 has gone down or something similar to it (i.e. not using micros()), can reduce state change latency and/or state change loop timeouts, especially when loop timeouts are enabled, more than an approach that uses micros()
It also demonstrates than it is still possible to be compatible with the current 1.8.13 setWireTimeout() API without using micros() which is expensive due to all the 32 bit values and runtime math involved inside the micros() function.

Loading

@Aleev2007
Copy link

@Aleev2007 Aleev2007 commented Jan 13, 2021

@bperrybap

I probably got tired yesterday, from many attempts, to force the compiler to make code in which I can calculate the delay in the loop.
And that's why it broke the logic of twi .c work. Sorry. :)

I decided to go over from the beginning and have now made minimal changes to the old version 1.8.12 of the library.
I don’t have the test equipment right now, but everything is going without errors.
https://github.com/Aleev2007/twi
If it's not difficult for you, I would like to know how much slower it has become in comparison with the unmodified version 1.8.12.
If it's acceptable. So I have ideas on how to make it compatible with 1.8.13 and fast.

And I haven't looked at ISR yet, I also have ideas about it.
Last time I was mistaken in saying that the TWI module spends 40 cycles to transfer one byte. I forgot that the connection is serial and needs to be multiplied by 9. So it is quite possible to make everything fly even at 400 Hz.

Loading

@bperrybap
Copy link

@bperrybap bperrybap commented Jan 13, 2021

@Aleev2007
The code is MUCH slower. (obviously broken)
1.8.12 is 77us writes, 118us reads
this code is 189,279us writes and 126,041us reads
After doing reads, writes change to 63,028us
Obviously the code is not working as intended.


There is too much keyboard hacking going on for this project by multiple parties.
I believe in a traditional methodical design & development process.
I'm done testing code that isn't handled that way.

Loading

matthijskooijman added a commit to matthijskooijman/ArduinoCore-avr that referenced this issue Apr 23, 2021
In commit deea929 (Introduce non compulsory Wire timeout), some
timeout-related functions were added. To allow writing portable
sketches, it is important for those sketch to know whether they are
using a Wire library version that is new enough to offer these functions
without having to rely on version number checking (since other Arduino
cores have their own versioning schemes).

This adds a WIRE_HAS_TIMEOUT macro, similar to the existing WIRE_HAS_END
macro, to facilitate that.

This relates to arduino#42.
@Wuntenn
Copy link

@Wuntenn Wuntenn commented May 14, 2021

I had the same error, I'm a newb too and haven't used cpp for a while.

I'm the kinda newb that hits every bug and had a similar error when I used interrupts a few weeks ago. I found out somewhere that if you place a lot of code in the vector code blocks that it can cause the code to hang.

I don't remember where I came across the advice but the recommended approach was to set a flag only in the vector and to use the loop to poll for the flag and carry out the necessary changes.

Reading through the datasheet for the UNO, it says that the TWI is an interrupt based system. I used this approach and it overcame my issue. Thought it might be useful for someone here! 🤷‍♂️

Loading

@acicuc
Copy link

@acicuc acicuc commented Oct 23, 2021

How about int16 for the timeout, default +25ms (and enabled, use millis()), use < 0 for advanced use i.e. micros() resolution?
The implementation has skewed a bit away from being 'Arduino', more niche. Keep it simple and consistent for the beginners.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet