
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation

This repository provides processed samples and information for the MossFormer2 monaural speech separation model. MossFormer2 extends our earlier MossFormer model and can be retrained and evaluated on the open-source ModelScope platform. Users can either visit the ModelScope website or follow the steps below to download and install the full PyTorch version of the MossFormer2 program. MossFormer2 was proposed in the paper MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation.

Model Description

Our previously proposed MossFormer achieved promising performance in monaural speech separation. However, it predominantly relies on a self-attention-based MossFormer module, which emphasizes longer-range, coarser-scale dependencies and is less effective at modelling finer-scale recurrent patterns. MossFormer2 is a hybrid model that captures both long-range, coarse-scale dependencies and fine-scale recurrent patterns by integrating a recurrent module into the MossFormer framework. The MossFormer module concurrently executes full-computation self-attention on non-overlapping local segments and a linearized, resource-efficient self-attention over the entire sequence. The recurrent module complements the MossFormer module's capabilities in modelling the finer-scale recurrent patterns that are crucial for speech separation.

[Figure: overall architecture of the MossFormer2 hybrid model (github_fig1)]
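As a rough illustration of this joint attention idea, here is a minimal PyTorch sketch (hypothetical names and shapes, not the released implementation; the real module also includes gating and convolutional pre-processing):

import torch

def joint_attention(x, seg_len=256):
    # x: (batch, seq_len, dim); seq_len is assumed divisible by seg_len
    b, n, d = x.shape
    s = x.view(b, n // seg_len, seg_len, d)  # non-overlapping local segments
    # Full (quadratic) self-attention restricted to each local segment
    local = torch.softmax(s @ s.transpose(-1, -2) / d ** 0.5, dim=-1) @ s
    # Linearized global attention: a kernel feature map (ReLU here) replaces
    # softmax, and the matmuls are reordered to avoid forming the n x n matrix
    q, k = torch.relu(x), torch.relu(x)
    global_out = q @ (k.transpose(-1, -2) @ x) / n  # O(n * d^2) cost
    return local.view(b, n, d) + global_out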

Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network because it captures recurrent patterns without using recurrent connections. Our recurrent module mainly comprises a dilated FSMN block enhanced with gated convolutional units (GCUs) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies only on linear projections and convolutions, allowing seamless, parallel processing of the entire sequence.

[Figure: the FSMN-based recurrent module (github_fig2)]
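A minimal sketch of the FSMN idea (hypothetical names; the actual MossFormer2 module further adds GCUs, dense connections, and the bottleneck/output layers) shows how recurrence can be emulated with a depthwise convolution over time:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNMemorySketch(nn.Module):
    def __init__(self, dim, lorder=10, rorder=10, dilation=1):
        super().__init__()
        # A depthwise 1-D convolution realizes the FSMN memory: each output
        # frame is a learned weighted sum of surrounding context frames, so
        # recurrent patterns are captured without recurrent connections.
        self.pad = (lorder * dilation, rorder * dilation)
        self.mem = nn.Conv1d(dim, dim, kernel_size=lorder + rorder + 1,
                             dilation=dilation, groups=dim, bias=False)

    def forward(self, x):  # x: (batch, time, dim)
        h = F.pad(x.transpose(1, 2), self.pad)  # pad the time axis
        return x + self.mem(h).transpose(1, 2)  # residual memory connection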

The integrated MossFormer2 hybrid model demonstrates remarkable improvements over MossFormer and surpasses other state-of-the-art methods on the WSJ0-2mix/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks.

[Tables: benchmark comparison results of MossFormer2 against prior methods (github_tab1, github_tab2)]

Installation

After installing ModelScope, you can use speech_mossformer2_separation_temporal_8k for inference. To facilitate usage, the pipeline adds WAV-file processing logic before and after model inference, so it can directly read a WAV file and save the output to the specified WAV files. The pipeline takes a single-channel WAV file sampled at 8000 Hz, containing the mixed speech of two speakers, and outputs two separated single-channel audio files.
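If ModelScope is not installed yet, it can typically be installed via pip; the audio extra shown here follows the ModelScope documentation, and the exact package set may change between releases:

pip install "modelscope[audio]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html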

Environment Preparation

This model supports Linux, Windows, and macOS platforms.

The pipeline of this model uses the third-party library SoundFile to process WAV files. On Linux, users need to manually install libsndfile, the underlying dependency of SoundFile. On Windows and macOS, it is installed automatically without user intervention. For details, please refer to the official SoundFile website. On Ubuntu, for example, run the following commands:

sudo apt-get update
sudo apt-get install libsndfile1

Code Example

import numpy
import soundfile as sf
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# The input can be a URL or a local path to an 8 kHz single-channel WAV file
input_wav = 'https://modelscope.cn/api/v1/models/damo/speech_mossformer2_separation_temporal_8k/repo?Revision=master&FilePath=examples/mix_speech1.wav'

# Build the separation pipeline from the pretrained ModelScope model
separation = pipeline(
    Tasks.speech_separation,
    model='damo/speech_mossformer2_separation_temporal_8k')
result = separation(input_wav)

# result['output_pcm_list'] holds one 16-bit PCM byte stream per speaker
for i, signal in enumerate(result['output_pcm_list']):
    save_file = f'output_spk{i}.wav'
    sf.write(save_file, numpy.frombuffer(signal, dtype=numpy.int16), 8000)
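The pipeline expects 8 kHz single-channel input. As a minimal sketch (assuming the optional librosa package is available; the file names are hypothetical), an arbitrary recording could be converted before separation:

import librosa
import soundfile as sf

# Load any WAV file, downmixing to mono and resampling to the model's 8 kHz rate
audio, sr = librosa.load('my_recording.wav', sr=8000, mono=True)
sf.write('my_recording_8k.wav', audio, 8000)
# 'my_recording_8k.wav' can now be passed to the separation pipeline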

For more details, please refer to the related paper below:

@misc{zhao2023mossformer2,
      title={MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation}, 
      author={Shengkui Zhao and Yukun Ma and Chongjia Ni and Chong Zhang and Hao Wang and Trung Hieu Nguyen and Kun Zhou and Jiaqi Yip and Dianwen Ng and Bin Ma},
      year={2023},
      eprint={2312.11825},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

About

This is the audio sample repository for the speech separation model "MossFormer2".
