2.9.64 and 2.9.65 doesn't work #1610

Closed
WM0I opened this issue Nov 14, 2018 · 144 comments

Comments

@WM0I

WM0I commented Nov 14, 2018

Both updates do not load; I had to go back to 2.9.63.

WM0I

@WM0I WM0I closed this as completed Nov 14, 2018
@db4ple
Collaborator

db4ple commented Nov 14, 2018

Hi,
could you explain what you meant by "did not load"? Was the boot process not working? Or was it a problem on your side?

Thanks!
Danilo

@WM0I
Author

WM0I commented Nov 15, 2018

2.9.65 loaded, but when I ran the new firmware all I got on the screen was the upper right portion of a normal screen. Nothing worked. So I tried to reload 2.9.64, which had worked before I loaded 2.9.65. Now, when loading the 64 version, I got an "Error 2" message at the bottom of the screen and it wouldn't run. I am now at firmware version 2.9.63 and it works.

WM0I

@df8oe
Owner

df8oe commented Nov 15, 2018

I saw "Error 2" some months ago, when the bootloader was not able to access the USB stick. In my case it was a damaged stick. But it was so long ago that I cannot remember exactly what the text message was and where it was placed... Could you please test using another USB stick?

@WM0I
Author

WM0I commented Nov 15, 2018

Replaced the USB stick and loaded the 2.9.65 firmware. I got the same results, but with no Error 2 message. When I tried to run the new firmware I got just the upper right-hand quarter of a normal screen. The 2.9.64 firmware loads OK now.

WM0I

@db4ple
Collaborator

db4ple commented Nov 15, 2018

Hi,
thanks for trying again. Well, it seems there is an issue in 2.9.65, which is not entirely unlikely.
Does the issue already show on the boot screen, or is the boot screen working and only the main screen partial?

@WM0I
Author

WM0I commented Nov 15, 2018

The boot screen seems to be working but the main screen is not. The best way I can describe what I see: if you cut the main screen up into 4 sections (2 x 2), only the upper right one is shown. Nothing works and I have to disconnect power to turn it off.
WM0I

@db4ple
Collaborator

db4ple commented Nov 15, 2018

OK, yes, the description makes enough sense to me. You have an STM32F4 with 512k, if I remember correctly. Is that right? Is it a parallel or serial display? See the System Info menu, in case you don't know this.
Thanks.

@WM0I
Author

WM0I commented Nov 15, 2018

What I am seeing in system info is HY28B SP1 and 512k. So I guess I need a newer UI board which is no longer available.
WM0I

@db4ple
Collaborator

db4ple commented Nov 15, 2018

Hi,
it is not HY28B SP1 but HY28B SPI -> aka serial display connection. You don't need a new UI board, we need to fix the software. I have an idea what the problem could be. And I also have the hardware to test this idea. For now, simply stay at 2.9.64 until we release a fix.

@df8oe
Owner

df8oe commented Nov 15, 2018

I do have a mcHF with SPI-display and it works like a charm with 2.9.65 firmware. It uses STM32F429 MCU...

@WM0I
Author

WM0I commented Nov 15, 2018

My CPU is 413h.

WM0I

@df8oe
Owner

df8oe commented Nov 15, 2018

Hi Rich,

your MCU is F40x@512KB - the smallest one which is available / will work on the mcHF. The number alone is not sufficient as a description - the amount of flash must be added. No one here has access to this MCU.
But it is nearly impossible that the amount of flash is the reason.
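For readers wondering why "413h" alone does not identify the chip: a minimal sketch, assuming the usual STM32 layout of the DBGMCU_IDCODE register (device ID in bits 11:0, revision ID in bits 31:16) and a flash size in KB read from a separate read-only register. The function names and the sample register value are made up for illustration; this is not the UHSDR System Info code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Device ID: lower 12 bits of the ID code register value. */
uint16_t idcode_dev_id(uint32_t idcode) { return (uint16_t)(idcode & 0x0FFFu); }

/* Revision ID: upper 16 bits of the ID code register value. */
uint16_t idcode_rev_id(uint32_t idcode) { return (uint16_t)(idcode >> 16); }

/* Map a device ID to a family name. Only the two IDs relevant to this
   thread are listed; everything else is reported as unknown. */
const char *dev_id_family(uint16_t dev_id)
{
    switch (dev_id) {
    case 0x413: return "F40x/F41x";
    case 0x419: return "F42x/F43x";
    default:    return "unknown";
    }
}

/* Build a description such as "F40x/F41x@512KB (DEV_ID 413h, REV_ID 1007h)".
   The same device ID is shared by parts with different flash sizes, which
   is why the flash size has to be reported as well. */
int describe_mcu(char *buf, size_t len, uint32_t idcode, uint16_t flash_kb)
{
    assert(buf != NULL && len > 0u);
    uint16_t dev = idcode_dev_id(idcode);
    uint16_t rev = idcode_rev_id(idcode);
    return snprintf(buf, len, "%s@%uKB (DEV_ID %03Xh, REV_ID %04Xh)",
                    dev_id_family(dev), (unsigned)flash_kb,
                    (unsigned)dev, (unsigned)rev);
}
```

On real hardware the two input values would be read from the MCU; here they are plain parameters so the decoding can be tried on a PC.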

@db4ple
Collaborator

db4ple commented Nov 15, 2018

Hi,
@df8oe, Rich already gave us the 512k, but like you, I don't think it is the flash size. I have a very different hunch. See my next pull request.

@df8oe df8oe closed this as completed in dc4edbb Nov 15, 2018
df8oe added a commit that referenced this issue Nov 15, 2018
Attempt to fix #1610: Delay reaction to external interrupts until full init
@df8oe df8oe reopened this Nov 15, 2018
@df8oe
Owner

df8oe commented Nov 15, 2018

Sorry - accidentally closed.

@df8oe
Owner

df8oe commented Nov 15, 2018

I am excited to see whether 2.9.66 behaves differently on the affected radio...

@ok1if

ok1if commented Nov 15, 2018

2.9.64 works normally, but 2.9.65 and 2.9.66 do not (512 kB)

@db4ple
Collaborator

db4ple commented Nov 15, 2018

@ok1if: What is happening? Same as @WM0I described, "broken" main screen?
Do you also have an SPI display or is it parallel?

@UR7FM

UR7FM commented Nov 15, 2018

Same problem with 2.9.65 and 2.9.66
error2

@db4ple
Collaborator

db4ple commented Nov 15, 2018

Hi @UR7FM
thanks a lot for the picture. Interesting.

@db4ple
Collaborator

db4ple commented Nov 15, 2018

@df8oe, should we try to disable your version check code? Just to see if this makes a difference?
I cannot reproduce the issue although my machine is very close to the one of @WM0I (STM32F407, 192k RAM, SPI display, major difference is 1024k vs. 512k flash).

@df8oe
Owner

df8oe commented Nov 15, 2018

I have disabled it. If this makes a difference we must investigate what is going on. Release 67 will be published in a few seconds

@df8oe
Owner

df8oe commented Nov 15, 2018

This was my idea too: the problem is sleeping in an unidentified part of the code. Unfortunately, I am nearly sure we will not find it until we have access to a machine which shows the issue. I have never seen such a machine.

@df8oe
Owner

df8oe commented Nov 15, 2018

Question to all: Does anyone have a MCU > 512KB flash which shows the issue?

@df8oe
Owner

df8oe commented Nov 18, 2018

And again: all of my ideas are trapped in "impossible because it does not impact all machines with an F40x processor". All have the same amount of RAM, SRAM, the same registers - everything identical - only the amount of flash differs, and that cannot be the reason for crashing (I do not see what mechanism could cause this). Possibly we have found a hardware issue in some revisions of F4 MCUs @512KB. Of course it can be approached by peeking and poking in the existing code (bisect...), but having direct access to such a device will make it much, much easier. And if the reason is not trivial (and I think it is not trivial), the time to fix it will be long enough even if we can access a device. Maybe it is not fixable because we can determine the part of the code but not explain what is happening.

@db4ple
Collaborator

db4ple commented Nov 18, 2018

Yes, that it is not happening on all devices is what puzzles me most.
And I would also like to say that I will stop working on this issue until someone sends me an affected unit (which I will have to take apart to add the debug port connections, to be clear about this).
If no one does this, the issue will have to remain unfixed unless someone else spends more time identifying the cause.
You can get in touch with me here or via the German forum (db4ple, https://www.amateurfunk-sulingen.de/forum/index.php#1); send me a personal message there.

@df8oe
Owner

df8oe commented Nov 18, 2018

You have done great work, Danilo. It is very wise and I am glad that this is your decision. Could you please make a PR for the feature "showing CPU revision in System Menu"? @ALL who are impacted by this issue: many thanks for your help. We have taken big steps forward, but now we need more than testing... Until we have done that you can test new binaries, of course. They may - or may not - work for you. Last time we had the issue, it was enough to move an independently working function to another place - ridiculous and without any explanation - but it solved the issue for the 512KB machines. By chance...

@ok8oi

ok8oi commented Nov 18, 2018

Unfortunately, even the 2.9.69 has the same behaviour in my Chinese clone RS-918 - which, to my knowledge, is identical to the mchf. Only the upper right part of the screen is visible as I mentioned in a previous post of mine. Access to an impacted device seems to be the only solution to the problem. Thank you so much for your efforts guys!

@df8oe
Owner

df8oe commented Nov 18, 2018

We need a radio which shows the issue. No other way. None of our radios does have any issue, and it seems that nobody in our German discussion group does have the issue, too.

EDIT:
Only very old mcHFs and nearly all clones are impacted. The mcHF changed to a 1MB MCU in 2015 (so the RS918 is not identical to the current mcHF!). Cloners want to save every cent and fitted a 512KB MCU :)

@WM0I
Author

WM0I commented Nov 18, 2018

df8oe, again, if I sent my rig/unit to you I would not have anything to use on HF. How do you propose we handle the transfer to you? How long would you have to have it? Is there a way to trade my rig/unit for the next level up? How much? I want to help, but I don't want to be without a radio either.

Best regards
Richard WM0I

@db4ple
Collaborator

db4ple commented Nov 18, 2018

Hi Guys,
it seems I have found someone who will send me an affected device. So all we can do now is relax and wait for what is to come.
I hope to have some news for you soon.

@ok1if

ok1if commented Nov 19, 2018

Super MSG, thank you

@db4ple
Collaborator

db4ple commented Nov 22, 2018

Hi Guys,
you will not believe what I found out, but it is true:

  • The issue is not at all related to the flash size, that is just coincidence
  • It is not related to any specific hardware difference
  • The core of the issue is not related to the F4 processor

And we will have a solution soon!
Stay tuned!

@m-chichikalov
Contributor

This sounds like a trailer for a new movie :)

@sp9bsl
Collaborator

sp9bsl commented Nov 22, 2018

Hi Danilo,
interesting...

@db4ple
Collaborator

db4ple commented Nov 22, 2018

It is a game against the clock. Literally!

db4ple added a commit to db4ple/UHSDR that referenced this issue Nov 22, 2018
The bootloader's interrupt handlers manipulated the data segment memory since
this is initialized before the new vector table is set and the new interrupts
get activated.

Solution: The bootloader starts the firmware through a shared variable
(which we used already, but not for normal start) and after reset it
goes straight to firmware without enabling any interrupt. Safest way to
do it.
@db4ple
Collaborator

db4ple commented Nov 22, 2018

The end is near, do not fear!

db4ple added a commit to db4ple/UHSDR that referenced this issue Nov 22, 2018
The bootloader's interrupt handlers manipulated the data segment memory since
this is initialized before the new vector table is set and the new interrupts
get activated.

Solution: The bootloader starts the firmware through a shared variable
(which we used already, but not for normal start) and after reset it
goes straight to firmware without enabling any interrupt. Safest way to
do it.
@db4ple
Collaborator

db4ple commented Nov 22, 2018

While we are all waiting for Travis to do its job, here is the thing:
It was the bootloader! The bootloader? Yes, the bootloader.
How on earth can the bootloader create such a mess in the firmware, and only in some?
When the firmware starts, it first initializes its data area by copying data from read-only memory to RAM. Once this is done, the last step is to set the interrupt vector table, and shortly afterwards control is passed to the actual firmware main function, which then initializes the hardware, turns on interrupts, etc.
Now, the bootloader does exactly the same before it starts the firmware. It also enables interrupts. These interrupts do something; for instance, the SysTick interrupt ticks a clock (i.e. counts up a timer in a memory cell). So when does this stop? In our old implementation, when the new main code switched to the new vector table. The bootloader's interrupts, unfortunately, were never turned off.
And it can happen that a timer tick occurs between the bootloader handing over control and the switch to the new firmware's interrupt vector table. See it? The bootloader's SysTick increments a memory cell, unfortunately sometimes after that memory was already initialized with prepared data from the firmware.
In this particular case the data being changed was a pointer to a float32_t (which was lucky, since a pointer to a 4-byte data type is never an odd number, so the changed pointer caused a fault).
Only by chance did we have that situation and could find the problem. It might have caused other weird things before.
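The race described above can be replayed on a PC. This is a minimal sketch, not the UHSDR code: the RAM array, the offset of the tick counter and all function names are invented for illustration, and the "interrupt" is simply a function call placed in the critical window.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a small RAM region used first by the bootloader,
   then by the firmware. Size and offset are made up for this sketch. */
#define RAM_SIZE         16u
#define BOOT_TICK_OFFSET 4u

static uint8_t ram[RAM_SIZE];

/* Stale bootloader SysTick handler: counts up a 32-bit cell in RAM. */
static void bootloader_systick(void)
{
    uint32_t ticks;
    memcpy(&ticks, &ram[BOOT_TICK_OFFSET], sizeof ticks);
    ticks++;
    memcpy(&ram[BOOT_TICK_OFFSET], &ticks, sizeof ticks);
}

/* Firmware startup: copy prepared data into RAM. In this model the
   firmware happens to keep a pointer to a float in the very cell the
   bootloader used as its tick counter. */
static void firmware_init_data(uint32_t float_ptr_value)
{
    assert(BOOT_TICK_OFFSET + sizeof float_ptr_value <= RAM_SIZE);
    memset(ram, 0, sizeof ram);
    memcpy(&ram[BOOT_TICK_OFFSET], &float_ptr_value, sizeof float_ptr_value);
}

/* Returns the pointer value the firmware sees once it runs.
   tick_in_window != 0 means a bootloader tick fired after the data was
   initialized but before the vector table was switched. */
uint32_t simulate_boot(uint32_t float_ptr_value, int tick_in_window)
{
    uint32_t seen;
    firmware_init_data(float_ptr_value);
    if (tick_in_window) {
        bootloader_systick();   /* old vector table is still active */
    }
    /* the firmware would now switch the vector table and call main() */
    memcpy(&seen, &ram[BOOT_TICK_OFFSET], sizeof seen);
    return seen;
}

/* A pointer to a 4-byte type is expected to be a multiple of 4. */
int is_word_aligned(uint32_t address) { return (address & 3u) == 0u; }
```

With a tick in the window, `simulate_boot(0x20001000u, 1)` returns `0x20001001`, which is no longer word aligned; without the tick the pointer comes back unchanged. Whether the tick lands in the window depends only on timing, which fits the observation that only some units and some firmware builds were affected.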

The solution is straightforward: we disable all interrupts before starting the firmware. We do this in a safe way: we simply reset the processor and then go straight from reset to the firmware. To do so, we check for a magic number in a special memory location. If it is there, we know we should start the firmware right away; otherwise we run through the normal bootloader code.
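The magic-number handover can be sketched like this. It is a host-side model under stated assumptions: the magic value, the variable that stands in for a RAM cell surviving the reset, and the function names are all hypothetical, and the actual reset (on an STM32 typically a call to the CMSIS function `NVIC_SystemReset()`) is only indicated by a comment.

```c
#include <assert.h>
#include <stdint.h>

#define START_FIRMWARE_MAGIC 0xB007F1A5u   /* hypothetical magic value */

/* Stands in for a RAM cell that keeps its content across a CPU reset. */
static uint32_t boot_flag;

/* Models whether any interrupt source (SysTick, USB, ...) has been
   switched on since the last reset. */
static int interrupt_sources_enabled;

typedef enum { RUN_BOOTLOADER, JUMP_TO_FIRMWARE } boot_action_t;

/* First code executed after a reset. */
boot_action_t reset_entry(void)
{
    interrupt_sources_enabled = 0;   /* fresh reset: nothing enabled yet */

    if (boot_flag == START_FIRMWARE_MAGIC) {
        boot_flag = 0u;              /* consume the request */
        /* fast path: no handler exists that could touch firmware RAM */
        assert(!interrupt_sources_enabled);
        return JUMP_TO_FIRMWARE;
    }

    interrupt_sources_enabled = 1;   /* normal bootloader work needs them */
    return RUN_BOOTLOADER;
}

/* Called by the bootloader when it is done and wants the firmware to run. */
void request_firmware_start(void)
{
    boot_flag = START_FIRMWARE_MAGIC;
    /* on real hardware: trigger a system reset here */
}
```

The point of going through a reset, rather than switching off each interrupt by hand, is that the bootloader does not need to know which interrupts it ever enabled; the fast path simply never enables any.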

@sp9bsl
Collaborator

sp9bsl commented Nov 22, 2018

It sounds like déjà vu to me... I had a similar problem a few months ago in an F103 project... Completely calming down the core before exiting the bootloader and changing the VTOR is a must. Anyway, good job, congratulations!
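For comparison, the "calm down the core, then change VTOR" approach mentioned here could look roughly as follows. This is a host-side sketch and not what the UHSDR fix does (that goes through a reset instead): the struct only imitates the relevant Cortex-M registers, and the field names, array size and function name are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVIC_WORDS 8u   /* assumed number of 32-bit enable/pending words */

/* Host-side stand-ins for the registers involved (not CMSIS definitions). */
typedef struct {
    uint32_t systick_ctrl;              /* models SysTick->CTRL */
    uint32_t nvic_enabled[NVIC_WORDS];  /* models the NVIC enable bits */
    uint32_t nvic_pending[NVIC_WORDS];  /* models the NVIC pending bits */
    uint32_t vtor;                      /* models SCB->VTOR */
} core_model_t;

/* Stop every interrupt source the bootloader may have started, drop
   anything still pending, and only then point the core at the
   firmware's vector table. */
void quiesce_and_switch_vectors(core_model_t *core, uint32_t firmware_vectors)
{
    assert(core != NULL);
    /* On Cortex-M4 the low 7 bits of VTOR are reserved, so the table
       address must be at least 128-byte aligned. */
    assert((firmware_vectors & 0x7Fu) == 0u);

    core->systick_ctrl = 0u;                 /* stop the SysTick timer */
    for (size_t i = 0; i < NVIC_WORDS; i++) {
        core->nvic_enabled[i] = 0u;          /* disable all interrupts */
        core->nvic_pending[i] = 0u;          /* forget pending ones    */
    }
    core->vtor = firmware_vectors;           /* safe to switch now */
}
```

The order matters: if the vector table were switched first, a still-running bootloader timer could fire into a firmware handler that has not been set up yet.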

@m-chichikalov
Contributor

Good job!

@DF5LI

DF5LI commented Nov 23, 2018

What an amazing story! The only 512kB Chinese clone I ever had on my bench went back to its owner long ago, so I can't help you. But anyway, you solved the problem, Danilo! What would we do without you?

df8oe added a commit that referenced this issue Nov 23, 2018
 Fixed issue #1610: Firmware not working.
@ok1if

ok1if commented Nov 23, 2018

new_bootloader

Congratulations - good job

Milan

@WM0I
Author

WM0I commented Nov 23, 2018

FANTASTIC, works great. I am going to do some tests to see if other small issues have cleared up with this new bootloader.

Great work.

Best regards
Richard WM0I

@df8oe
Owner

df8oe commented Nov 23, 2018

Hi Rich,

can you power on with a normal press of the power button, or must you press the button for 2 seconds?

@la2fda

la2fda commented Nov 23, 2018

Great !
I am now running BL 5.0.0 and FW 2.9.73. The display is normal again.
I have to press the power button for 2 sec to power on. That's fine for me!
Another thing: with both of my mcHFs I have struggled with power off (sometimes they don't power off after saving settings, and I have to pull the power cable). I have not seen this yet after the FW upgrade.
I have to test this for some days to verify.....

And another thing: my second mcHF, the m0nka one, has 1024 kB flash. Yesterday I saw the same problem on that one too!!

You are the best !
Thanks for a fantastic job you are doing !

@WM0I
Author

WM0I commented Nov 23, 2018

It looks normal except for the double white screen flash. I do have to hold it down until the red transmit LED comes on. But that seems to be normal.

But overall it works.
Richard WM0I

@db4ple
Collaborator

db4ple commented Nov 23, 2018

I think we can close this issue now.
To all readers: please update your bootloader to 5.0.1. Bootloader 5.0.0 has some issues, it is good for testing but not the final solution.

Please note: We will not support anyone with older bootloaders and strange issues! Confirm that the issue exists with 5.0.1 or newer and only then do a report on it.

73
Danilo

@WM0I WM0I closed this as completed Nov 24, 2018