
Continued-conversation for esp32-s3-box-3 #173

Open
wants to merge 85 commits into main

Conversation


@jaymunro jaymunro commented Mar 3, 2024

Adds the following features to the s3-box-3 firmware (sketched in YAML below the list):

  • Continued conversation (not having to say the wake word for every command). Default: Off
  • A user-adjustable timeout to return to idle and wait for the wake word. Default: 8 s
  • Live time and/or date in various formats on the Box3 display, via a user-selectable dropdown. Default: None
  • Display of prompts, the user query as understood (STT), and the spoken response (TTS) on the Box3 display. Default: Off
  • User customisation of the on-screen prompts to any phrase or language.
  • Sensors in HA to show the user query and Assist response for use in automations and troubleshooting.
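
A minimal sketch of what those options look like in ESPHome YAML. The entity names and exact defaults below are illustrative; the real definitions are in the PR's YAML:

```yaml
# Illustrative sketch only; ids and names are assumptions, not the PR's code.
substitutions:
  timeout_to_idle: 8s            # user-adjustable timeout back to idle

switch:
  - platform: template
    name: Continued conversation
    id: continued_conversation
    optimistic: true
    restore_mode: RESTORE_DEFAULT_OFF   # Default: Off

select:
  - platform: template
    name: Time/date display format
    id: time_date_format
    optimistic: true
    options: ["None", "Time", "Date", "Time and date"]
    initial_option: "None"              # Default: None

text_sensor:
  - platform: template
    name: Assist query    # the user query as understood by STT
    id: assist_query
  - platform: template
    name: Assist reply    # the spoken TTS response
    id: assist_reply
```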

A slightly older video of the system (recorded prior to the recent update that added the conversation display), showing the continued-conversation ability: https://drive.google.com/file/d/1DjV5XPmsqwHq7iph_kFEb6XFILpzw4Pt/view?usp=sharing



jaymunro commented Mar 3, 2024

I have done a fair amount of testing but will be looking for more people to try this draft before marking it ready for review.
I also hope to make a short video to demo the features and how well they work. Continued conversation works so well that it is almost like a natural conversation.


vtolstov commented Mar 4, 2024

Is it possible to do something like this on the Atom Echo?


jaymunro commented Mar 5, 2024

> Is it possible to do something like this on the Atom Echo?

It could be possible, but it would need to be done by someone who has the hardware. Give it a go.

@jaymunro

Added variable-width text outlines (sized to the text) to the other prompts added in this PR, to match the conversation boxes added by @jlpouffier
Added a user-configurable outline color for the new text via a substitution, 'text_outline_color' (see the sketch below)

Removed 42 lines of debug logging. It was all commented out, but it is removed here for clarity of the PR changes.
Tweaked the vertical spacing of the date.
Commented out wake_word: !lambda return wake_word; as it is not yet supported for continuous mode
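
For reference, a minimal sketch of how a substitution like text_outline_color can drive an outline sized to the text. The font id, coordinates, and prompt string are illustrative, not the PR's actual code:

```yaml
substitutions:
  text_outline_color: "626262"   # user-overridable outline color (hex)

color:
  - id: outline_color
    hex: ${text_outline_color}

display:
  - platform: ili9xxx
    # ... Box3 display setup elided ...
    lambda: |-
      // Measure the prompt, then draw an outline that follows the text width.
      int x1, y1, w, h;
      it.get_text_bounds(160, 120, "How can I help?", id(my_font), TextAlign::CENTER,
                         &x1, &y1, &w, &h);
      it.rectangle(x1 - 4, y1 - 4, w + 8, h + 8, id(outline_color));
      it.print(160, 120, id(my_font), TextAlign::CENTER, "How can I help?");
```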
@jaymunro jaymunro marked this pull request as ready for review March 30, 2024 08:00
Limited the display and publishing of "..." while listening to non-continued-conversation mode only, as it does not make sense in continued conversation mode.
Changed the defaults for displaying text prompts and the time (not the date) to ON, to match the new default of displaying the conversation.
Renamed the "Display text" switch to "Display text prompts" for clarity to the user.
@jaymunro jaymunro requested a review from DrShivang March 30, 2024 08:52

DrShivang commented Mar 31, 2024

@jaymunro, it's working great. Just one small issue I'm facing: wake word detection set to "In Home Assistant" isn't working as intended, while "On device" does.
Sharing the logs for the same.

[D][esp-idf:000]: I (109894) AUDIO_PIPELINE: Pipeline started

[W][component:232]: Component voice_assistant took a long time for an operation (236 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp_adf.microphone:273]: Microphone started
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: In Home Assistant
[D][select:015]: 'Wake word engine location': Sending state In Home Assistant (index 0)
[W][component:232]: Component script took a long time for an operation (237 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][voice_assistant:523]: Event Type: 11
[D][voice_assistant:677]: Starting STT by VAD
[D][voice_assistant:523]: Event Type: 12
[D][voice_assistant:681]: STT by VAD end
[D][voice_assistant:416]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[D][voice_assistant:422]: Desired state set to AWAITING_RESPONSE
[D][esp_adf.microphone:234]: Stopping microphone
[D][voice_assistant:416]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[D][esp-idf:000]: W (123945) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT

[D][esp-idf:000]: E (123947) AUDIO_ELEMENT: [filter] Element already stopped

[D][esp-idf:000]: W (123979) AUDIO_PIPELINE: There are no listener registered

[D][esp-idf:000]: I (123981) AUDIO_PIPELINE: audio_pipeline_unlinked

[D][esp-idf:000]: W (123981) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE

[D][esp-idf:000]: I (123985) I2S: DMA queue destroyed

[D][esp-idf:000]: W (123985) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE

[D][esp-idf:000]: W (123987) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE

[W][component:232]: Component voice_assistant took a long time for an operation (239 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp_adf.microphone:285]: Microphone stopped
[D][voice_assistant:416]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[W][component:232]: Component script took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: On device
[D][select:015]: 'Wake word engine location': Sending state On device (index 1)
[D][voice_assistant:523]: Event Type: 4
[D][voice_assistant:551]: Speech recognised as: " . . ."
[D][text_sensor:064]: 'Assist query': Sending state ' . . .'
[W][component:232]: Component voice_assistant took a long time for an operation (240 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][voice_assistant:523]: Event Type: 5
[D][voice_assistant:556]: Intent started
[D][voice_assistant:523]: Event Type: 6
[D][voice_assistant:523]: Event Type: 7
[D][voice_assistant:579]: Response: "Sorry, I couldn't understand that"
[D][text_sensor:064]: 'Assist reply': Sending state 'Sorry, I couldn't understand that'
[D][voice_assistant:523]: Event Type: 8
[D][voice_assistant:599]: Response URL: "http://192.168.0.110/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_4d30e09a66_tts.piper.wav"
[D][voice_assistant:416]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[D][voice_assistant:422]: Desired state set to STREAMING_RESPONSE
[D][voice_assistant:523]: Event Type: 2
[D][voice_assistant:613]: Assist Pipeline ended
[D][esp-idf:000]: I (139872) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8

[D][esp-idf:000]: I (139876) I2S: I2S0, MCLK output by GPIO2

[D][esp-idf:000]: I (139880) AUDIO_PIPELINE: link el->rb, el:0x3d05d254, tag:raw, rb:0x3d05d3c4

[D][esp-idf:000]: I (139882) AUDIO_ELEMENT: [raw-0x3d05d254] Element task created
[D][esp-idf:000]: I (139885) AUDIO_ELEMENT: [i2s-0x3d05cfb0] Element task created

[D][esp-idf:000]: I (139888) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1

[D][esp-idf:000]: I (139890) I2S_STREAM: AUDIO_STREAM_WRITER

[D][esp-idf:000]: I (139891) AUDIO_PIPELINE: Pipeline started

[D][script:077]: Script 'stt_timeout_to_idle' restarting (mode: restart)
[W][component:232]: Component voice_assistant took a long time for an operation (268 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: On device
[D][select:015]: 'Wake word engine location': Sending state On device (index 1)
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STOP_MICROPHONE
[D][voice_assistant:422]: Desired state set to IDLE
[D][voice_assistant:416]: State changed from STOP_MICROPHONE to IDLE
[W][component:232]: Component script took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[W][component:232]: Component time took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[W][component:232]: Component time took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[I][ota:117]: Boot seems successful, resetting boot loop counter.
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[W][component:232]: Component time took a long time for an operation (236 ms).
[W][component:233]: Components should block for at most 30 ms.


jaymunro commented Apr 3, 2024

> Home Assistant Wake Word Detection isn't working as intended

Thanks @DrShivang. In what way is it not working as intended? Is it freezing, not responding to the wake word, not taking the query, not giving a response, or something else?

If you are talking about the "..." in the response/query, that is something @jlpouffier put in. I think he may have intended it as a filler while listening, but I found it wasn't working nicely with the HA sensor history, so I disabled it when continued conversation is turned on. But that is not related to the On device / In HA switch, so maybe you're talking about something else?
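
Roughly, that gating looks like the following. This is a sketch, not the PR's exact code, and it assumes the continued_conversation switch and assist_query sensor from the earlier sketch:

```yaml
script:
  - id: show_listening_placeholder
    then:
      # Only publish the "..." filler when continued conversation is off.
      - if:
          condition:
            switch.is_off: continued_conversation
          then:
            - text_sensor.template.publish:
                id: assist_query
                state: "..."
```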


DrShivang commented Apr 4, 2024 via email

@a-d-r-i-a-n-d

Hey @jaymunro, this looks great, thanks for your work. I've got an esp32-s3-box and I can help with testing if you point me in the right direction.


jaymunro commented Apr 5, 2024

> esp32-s3-box

At the moment it is set up for the Box3, but the functionality should be easily transferable to the Box once the merge is complete. If the merge is not going to happen, I'm not sure it's worth spending the time on it. Personally I think it's fantastic, and I frankly should have entered it into that competition, but I didn't think of it. I've no idea why there is no activity on the merge, other than busy maintainers and a lack of time to visit this thread. @jesserockz ?

@jaymunro

> Wake word detection isn't working when it's changed to "In Home Assistant" from "On Device". If anyone can confirm this issue.

I have been able to reproduce this by moving from 'On device' to 'In Home Assistant'. If the device wakes up with 'In HA' already selected, it works (e.g. after turning 'Mute' on and off again).

I'll try to track down why and add a fix.

Added a wait after the microwakeword stop to prevent the HA wake word from starting too soon
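
The shape of that fix, as a sketch; the actual wait duration is whatever the commit uses, so the 500 ms here is an assumption:

```yaml
# Sketch: when handing wake word detection over to Home Assistant, give
# micro_wake_word time to release the microphone before the HA pipeline
# starts streaming. The 500ms value is illustrative.
- micro_wake_word.stop:
- delay: 500ms
- voice_assistant.start_continuous:
```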
@jaymunro

I think that update fixes the "Home Assistant Wake Word Detection isn't working as intended" issue. Thanks so much, @DrShivang, for finding that.

@DrShivang

@jaymunro, thanks for the update. I'll test and report back.

Also see this great achievement by @X-Ryl669 at esphome/issues#5296. All the sensors are functioning well and tested.

I'll see if we can integrate the two together.

@X-Ryl669

I didn't know about this PR. I'm French (though I speak English well, I think), and for me the default "Ok Nabu" almost never triggers on HA. Maybe 1 out of 10 tries, which makes the system kind of useless for its purpose. By comparison, the wake word "Hi ESP" from Espressif's default firmware works 95% of the time.

What's strange is that Assist on HA's web page works almost correctly for STT recognition and TTS generation. So when the wake word is actually detected, the subsequent command works most of the time.

I've started collecting samples for training my own wake word using microwakeword, and for this I need to modify the ADF component (I've worked on this part to use the latest version; see the main esphome pull request and related issue #5296).

I was wondering about a few improvements to the current code; please comment on whether you agree (or not):

Short tasks should be replaced by permanent tasks

Currently, the esp-adf component creates two subcomponents (speaker and microphone).
Each of these subcomponents starts a task with an audio pipeline and stops that task after processing a batch of data. This causes a lot of task creation and deletion, and a lot of allocations (since each task creates numerous buffers via malloc).
It means that after some time the system will reboot, because the allocator fails to allocate memory due to fragmentation.
Also, you'll always hear a small "pop" or "click" when the speaker task is recreated: since the last samples in the audio buffer don't always fade to zero, you get a discontinuity.

I think the tasks should be started once and kept alive for the entire runtime of the system, with state tracking implemented (so that the microphone isn't streaming while the speaker is outputting sound). I think that's possible without too many changes.

Voice assistant should have a feature to record audio

To train a wake word, it is absolutely required to use the same environment that will be used for actual inference (same audio pipeline, same device). Using TTS to generate samples for the wake word doesn't work well, since the TTS quality will likely be higher than the actual captured audio.

The current screen isn't very useful

I think the LVGL PR in #6363 should be merged in. This would allow a real interface displaying a kind of chat window (conversation history), and would also allow some output for the "what can I say" intent. Having to store huge PNGs in the firmware for the interface is clunky; with LVGL you can store SVGs, or simply a TTF font for the current icons.

@BigBobbas

You may be interested in https://github.com/gnumpi/esphome_audio.
I believe the creator of micro_wake_word has also been talking with its developer about making improvements.
This component provides an ADF pipeline, so a media player can now also be used on the esp-idf framework.
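
Pulling it in follows the usual external_components pattern. The component names to import are an assumption here, so check the repository's README:

```yaml
external_components:
  - source: github://gnumpi/esphome_audio
    components: [ adf_pipeline, i2s_audio ]   # names assumed; see the repo
```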

@DrShivang

Agreed, these would be great enhancements, especially training samples generated by a record-voice feature.


docics commented May 2, 2024

Can I use this on other ESP32 boards? I currently have 3 ESP32-WROOM boards that I use, and I would like to implement continuous conversation too.


jaymunro commented May 4, 2024

> Can I use this on other ESP32 boards? I currently have 3 ESP32-WROOM boards that I use, and I would like to implement continuous conversation too.

It should work without too many changes, but it needs to be an ESP32-S3 to use this code.
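
If you want to try it on a generic S3 board, the esp32 block is the main thing to swap; the board name below is just an example:

```yaml
esp32:
  board: esp32-s3-devkitc-1   # any ESP32-S3 board; the PR targets the S3-BOX-3
  framework:
    type: esp-idf
```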

@william-aqn

Great addition! I'm really looking forward to it appearing in the main branch.
