Feat/dff/voice skill #362

Merged
merged 85 commits into from
Dec 6, 2023
Conversation

moon-strider (Contributor) commented Mar 21, 2023

First working version of the voice skill.
What currently works:

  • captioning of voice messages in Telegram

What potentially works, but is not in the first commit:

  • captioning of audio files (all the prerequisites for this, such as mp3 -> wav conversion, are already in place)
  • processing of Telegram video notes ("circles"), specifically their audio content (same as the previous item)

What may not work correctly:

  • since captioning of audio files (as opposed to voice messages) is not ready yet, sending several voice messages in one message will either crash everything or return only one caption (or Telegram may simply not allow attaching more than one voice message to a single message)
  • sometimes (especially when the message is very short, 1 s for example) ffmpeg crashes; this still needs to be investigated.
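
The mp3 -> wav step mentioned above might look roughly like the sketch below. It is not the code from this PR: the function names, the 16 kHz mono target format and the timeout are assumptions chosen for illustration. It relies only on standard ffmpeg options (-y, -i, -ac, -ar) and surfaces ffmpeg's stderr on failure, which may help when looking into the crashes on very short messages.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> list:
    """Build an ffmpeg command that converts `src` to mono WAV at `sample_rate` Hz."""
    # -y overwrites the output, -ac sets the channel count, -ar sets the sample rate.
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


def convert_to_wav(src: str, timeout: float = 30.0) -> str:
    """Convert an audio file to WAV next to the source (hypothetical helper).

    Raises RuntimeError with the tail of ffmpeg's stderr if the conversion fails.
    Assumes `src` is not already a .wav file and that ffmpeg is on PATH.
    """
    dst = str(Path(src).with_suffix(".wav"))
    result = subprocess.run(
        build_ffmpeg_cmd(src, dst), capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        # The end of stderr is usually where ffmpeg reports the actual error.
        raise RuntimeError(f"ffmpeg failed for {src}: {result.stderr[-500:]}")
    return dst
```

Logging the raised message for failing inputs would show whether the short recordings are rejected by ffmpeg itself or by a later step.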

How I ran it and what I sent:

  0. insert your own Telegram bot token, or use mine, in which case the bot is available at @dprabota_bot
  1. bring EVERYTHING up with the command docker-compose -f docker-compose.yml -f assistant_dists/dream_multimodal/docker-compose.override.yml -f assistant_dists/dream_multimodal/dev.yml -f assistant_dists/dream_multimodal/proxy.yml build --no-cache && docker-compose -f docker-compose.yml -f assistant_dists/dream_multimodal/docker-compose.override.yml -f assistant_dists/dream_multimodal/dev.yml -f assistant_dists/dream_multimodal/proxy.yml up
  2. send /begin to the bot
  3. record a voice message for the bot
  4. receive a reply of the form Is there a [caption] in this audio?

image_360

@@ -1,6 +1,6 @@
 services:
   agent:
-    command: sh -c 'bin/wait && python -m deeppavlov_agent.run agent.pipeline_config=assistant_dists/dream_multimodal/pipeline_conf.json'
+    command: sh -c 'bin/wait && python -m deeppavlov_agent.run agent.channel=telegram agent.telegram_token=$TG_TOKEN agent.pipeline_config=assistant_dists/dream_multimodal/pipeline_conf.json'
Collaborator

No, we actually have a separate file with the Telegram command:
telegram.yml

Collaborator

in every dist

Collaborator

revert the whole file -- this is another dist

Contributor Author

Done

    command: sh -c 'bin/wait && python -m deeppavlov_agent.run agent.channel=telegram agent.telegram_token=$TG_TOKEN agent.pipeline_config=assistant_dists/dream_voice/pipeline_conf.json'
    environment:
      WAIT_HOSTS: "dff-program-y-skill:8008, sentseg:8011, convers-evaluation-selector:8009,
        dff-intent-responder-skill:8012, intent-catcher:8014, badlisted-words:8018,
Collaborator

badlisted -- you wanted to remove it

Contributor Author

image

maybe forgot to git add, check again please


  voice-service:
    ports:
      - "8333:8333"
Collaborator

no ports mapping here

Contributor Author

removed



try:
    # test_server.run_test(handler)
Collaborator

turn on tests

Contributor Author

done

def respond():
    import common.test_utils as t_utils

    t_utils.save_to_test(request.json, "tests/lets_talk_in.json", indent=4)  # TEST
Collaborator

This line and the next one, which save the test data, should be commented out. They are used only to create test files.

Contributor Author

commented

return [{"sound_path": [dialog["human_utterances"][-1]["attributes"].get("sound_path")],
         "sound_duration": [dialog["human_utterances"][-1]["attributes"].get("sound_duration")],
         "sound_type": [dialog["human_utterances"][-1]["attributes"].get("sound_type")],
         "captions": [dialog["human_utterances"][-1]["attributes"].get("captions")]}]
Collaborator

what captions do you mean here?

Contributor Author

the audiocaptions that the voice service returns: the captions like "wind blowing with the sirens in the background"

Collaborator

voice_formatter_service is an input formatter, so why do you return something that is not yet in the dialog state? (as you said, voice_service returns these captions)

Contributor Author

🤯 removed
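
To illustrate the point of this thread, here is a hedged sketch of an input formatter that forwards only what is already in the dialog state. The function name voice_formatter_sketch and the sample path are hypothetical; the field names are taken from the snippet under review. The captions are absent because they are produced by the voice service and so cannot be part of its input.

```python
def voice_formatter_sketch(dialog: dict) -> list:
    """Collect the voice service inputs from the last human utterance (illustrative only)."""
    attributes = dialog["human_utterances"][-1].get("attributes", {})
    return [
        {
            "sound_path": [attributes.get("sound_path")],
            "sound_duration": [attributes.get("sound_duration")],
            "sound_type": [attributes.get("sound_type")],
        }
    ]


# Made-up dialog state for demonstration purposes.
dialog = {
    "human_utterances": [
        {"attributes": {"sound_path": "/tmp/a.oga", "sound_duration": 3, "sound_type": "voice"}}
    ]
}
print(voice_formatter_sketch(dialog))
```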


path = request.json.get("sound_path")
duration = request.json.get("sound_duration")
type = request.json.get("sound_type")
Collaborator

again, THESE ARE BATCHES!
DO NOT CONSIDER IT AS A LIST OF ONE ELEMENT.

Contributor Author

It was the first ever service I worked on, I didn't know better. Fixed now

@moon-strider moon-strider self-assigned this Nov 21, 2023
fire>=0.5.0
kaldiio>=2.17.2
matplotlib>=3.5.3
PyYAML>=6.0
Collaborator

No, I meant exactly the opposite.
Better NOT to use >=
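
As a small illustration of this request, the hypothetical helper below flags requirement lines that are not pinned with ==. It is a deliberately naive string check, not a full parser for requirement specifiers, and it is not part of this PR.

```python
def unpinned(requirement_lines):
    """Return requirement lines that do not pin an exact version with `==`."""
    flagged = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line and "==" not in line:
            flagged.append(line)
    return flagged


# The four lines from the snippet under review are all flagged.
print(unpinned(["fire>=0.5.0", "kaldiio>=2.17.2", "matplotlib>=3.5.3", "PyYAML>=6.0"]))
```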


paths = request.json.get("sound_path")
durations = request.json.get("sound_duration")
types = request.json.get("sound_type")
Collaborator

I know I'm nitpicking, but since these are batches, this should be plural (and don't forget to fix it in the formatters too)

Contributor Author

fixed

logger.info("Scanning finished successfully, files found, starting inference...")
captions = infer(AUDIO_DIR, MODEL_PATH)
logger.info("Inference finished successfully")
responses = [{"sound_type": atype, "sound_duration": duration, "sound_path": path, "captions": captions}]
Collaborator

not = but +=, since this is a step in a loop

Contributor Author

fixed
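
The two batch-related comments above (treat the request fields as batches, and accumulate with += inside the loop) can be sketched together as follows. This is an illustration, not the merged code: caption_one is a hypothetical stand-in for the real inference call, the plural keys sound_durations and sound_types are assumed by analogy with sound_paths, and the response keys simply mirror the snippet under review.

```python
def caption_one(path):
    """Hypothetical stand-in for the captioning model; returns a caption string."""
    return f"caption for {path}"


def respond_batch(payload: dict) -> list:
    """Build one response per batch element instead of assuming a batch of size one."""
    paths = payload.get("sound_paths", [])
    durations = payload.get("sound_durations", [])
    types = payload.get("sound_types", [])

    responses = []
    for path, duration, atype in zip(paths, durations, types):
        caption = caption_one(path) if path is not None else "Error"
        # `+=` extends the list on every iteration; `=` would keep only the last element.
        responses += [
            {"sound_type": atype, "sound_duration": duration, "sound_path": path, "caption": caption}
        ]
    return responses
```

With a payload of two elements the function returns two responses, in the same order as the inputs, which is what a batching agent would presumably expect.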

logger.info(f'VOICE NOT YET DETECTED: {user_uttr["attributes"].get("sound_path")}')
if user_uttr["attributes"].get("sound_path") is not None:
    logger.info(f'VOICE DETECTED: {user_uttr["attributes"].get("sound_path")}')
    if "dff_voice_skill" not in skills_for_uttr:
Collaborator

No need to check, you can just add it. Duplicates are removed at the end anyway (list(set(

Contributor Author

Removed
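
A minimal sketch of why the membership check is unnecessary, assuming (as the reviewer says) that the selected skills are deduplicated with list(set(...)) at the end. The initial list contents are illustrative.

```python
skills_for_uttr = ["dff_voice_skill", "dummy_skill"]

# Append unconditionally, even though the skill is already present.
skills_for_uttr.append("dff_voice_skill")

# The final deduplication removes the repeated entry.
# Note that set() does not preserve the original order.
skills_for_uttr = list(set(skills_for_uttr))
print(sorted(skills_for_uttr))
```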


def caption(ctx: Context, actor: Actor, excluded_skills=None, *args, **kwargs) -> str:
    cap = "ERROR"
    if not ctx.validation:
Collaborator

when you use int_ctx.get_last_human_utterance(ctx, actor) and methods get (as below) you will not face validation problems

int_ctx.get_last_human_utterance(ctx, actor)
    .get("annotations", {})
    .get("voice_service", {})
    .get("captions", "No cap")
Collaborator

With this default you may return "Is there No cap in that audio?", which is strange. Add a default response for the case when an audio is attached but could not be captioned, e.g. a request like "I could not read your audio, attach another one"
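
One way to follow this suggestion is sketched below, under the assumption that the last human utterance is available as a plain dict (in the skill it would come from int_ctx.get_last_human_utterance(ctx, actor)). The function name and the exact fallback wording are illustrative, not the merged implementation.

```python
FALLBACK = "I could not caption the audio in your message, please try again with another file."


def caption_response(last_human_utterance: dict) -> str:
    """Return a caption question, or a fallback message when no caption is available."""
    caption = (
        last_human_utterance.get("annotations", {})
        .get("voice_service", {})
        .get("captions")
    )
    if not caption:
        # Missing annotation, missing key and empty caption all end up here.
        return FALLBACK
    return f"Is there {caption} in that audio?"


print(caption_response({}))
print(caption_response({"annotations": {"voice_service": {"captions": "wind blowing"}}}))
```

Using None as the implicit default avoids leaking a placeholder string such as "No cap" into the reply.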

@dilyararimovna (Collaborator)

btw do not forget about codestyle at the end

voice = int_ctx.get_last_human_utterance(ctx, actor).get("annotations", {}).get("voice_service", {})
logger.debug(f"CONDITION.PY VOICE: {voice}")
not_default = voice.get("captions", "Error") != "Error"
if voice is not {} and not_default:
Collaborator

Don't check for a non-empty dict like this :) is not {} is not a great idea. Here it is enough to check that voice.get("captions", "Error") != "Error", without introducing extra variables

Collaborator

And why captionS at all, if it is definitely a single string and not a list?

Contributor Author

fixed
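
For context on the is not {} remark: is compares object identity, and every {} literal creates a new dict, so such a check does not test for emptiness. The short sketch below demonstrates this and the simpler check the reviewer proposes; has_caption is a hypothetical helper name.

```python
voice = {}

# Identity check: `fresh` is a different object, so this is True even though both are empty.
fresh = {}
print(voice is not fresh)  # True

# Truthiness is the usual way to test for an empty dict.
print(bool(voice))  # False


def has_caption(voice: dict) -> bool:
    """Single check suggested in the review; an empty dict falls back to "Error"."""
    return voice.get("captions", "Error") != "Error"


print(has_caption({}))  # False
print(has_caption({"captions": "wind blowing"}))  # True
```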



logger = logging.getLogger(__name__)
# ....
Collaborator

this file can be deleted

Contributor Author

deleted

)

rsp = "I couldn't caption the audio in your message, please try again with another file" \
if cap == "Error" else f"Is there {cap} in that audio?"
Collaborator

I get it, but could you write this not as a one-liner and just make it a normal, more readable check?

Contributor Author

fixed

@dilyararimovna (Collaborator)

And the codestyle, of course. The instructions on how to fix it are in the docs

ENV SERVICE_NAME ${SERVICE_NAME}

ARG FLASK_APP
ENV FLASK_APP ${FLASK_APP}
Collaborator

Remove these two lines. You already provide FLASK_APP as an environment variable in docker-compose, so here you could actually overwrite that value

Contributor Author

removed

@@ -0,0 +1,3 @@
GPU RAM = 1Gb
cpu time = 0.15 sec
gpu time = 0.05 sec
Collaborator

Is this relevant info, or was it copied?
Could you please write a description of the service here: how it works, what its input and output are.
We are working on READMEs now, so it will be necessary anyway

Contributor Author

Done

st_time = time.time()

paths = request.json.get("sound_paths")
paths = request.json.get("video_paths") if paths == [None] else paths
Collaborator

paths is a batch, and it will not necessarily have length 1. Under heavy load the agent can form a batch of several elements.
paths = request.json.get("video_paths") if all([el is None for el in paths]) else paths

Contributor Author

OK, I'll do it that way, but here is why I assumed length one. The logic in the agent is that there is either sound or video. A single message cannot contain both video and voice (at least such behavior is not supported), so if only a video was uploaded, the sound is guaranteed to be [None], and vice versa
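
The difference between the two checks in this thread can be seen in a small sketch. pick_paths is a hypothetical helper and the file names are made up; the key names follow the snippet under review.

```python
def pick_paths(payload: dict) -> list:
    """Use video paths only when every element of the sound batch is None."""
    paths = payload.get("sound_paths") or []
    if all(el is None for el in paths):
        return payload.get("video_paths") or []
    return paths


batch = {"sound_paths": [None, None], "video_paths": ["a.mp4", "b.mp4"]}

# Comparing with [None] only matches a batch of exactly one element.
print(batch["sound_paths"] == [None])  # False
# The all(...) check handles a batch of any length.
print(pick_paths(batch))  # ['a.mp4', 'b.mp4']
```

This sketch still treats the batch as a whole; if a batch could mix voice and video messages, the choice would presumably need to be made per element.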

RUN python -m pip install -U pip
RUN pip install gdown

RUN git clone https://github.com/moon-strider/audio-captioning-dcase /src/aux_files
Collaborator

This will cause the same problem as installing something from someone's personal folder (as with image-captioning).
On top of that, not even the version/commit is pinned

Contributor Author

Updated the link and pinned the commit

@dilyararimovna dilyararimovna merged commit d75a628 into dev Dec 6, 2023
32 checks passed
4 participants