
Analyzing Confounding Effect of Accents in E-2-E ASR models

This repository contains code for our paper How Accents Confound: Probing for Accent Information in End-to-End Speech Recognition Systems (ACL 2020), which uses several probing and analysis techniques to understand the confounding effect of accents in an end-to-end Automatic Speech Recognition (ASR) model, DeepSpeech2.

Requirements

  • Docker: Version 19.03.1, build 74b1e89
  • nvidia-docker
  • apex==0.1
  • numpy==1.16.3
  • torch==1.1.0
  • tqdm==4.31.1
  • librosa==0.7.0
  • scipy==1.3.1

Instructions

  1. Clone deepspeech.pytorch and check out commit e73ccf6. This was the stable commit used in all our experiments.
  2. Build the docker image from the Dockerfile provided in this directory and run it with a bash entrypoint, using the commands below. This Dockerfile should match the one present in your deepspeech.pytorch folder; the instructions in that folder's README.md have been modified accordingly.
sudo docker build -t deepspeech2.docker .
sudo docker run -ti --gpus all -v `pwd`/data:/workspace/data --entrypoint=/bin/bash --net=host --ipc=host deepspeech2.docker
  3. Install all the requirements using pip install -r requirements.txt.
  4. Clone this repository inside the docker container in the directory /workspace/ and install the remaining requirements.
  5. Download the Mozilla Common Voice and TIMIT datasets used in the experiments, and optionally the Librispeech dataset, which is used only for training.
  6. Prepare manifests: deepspeech.pytorch requires the data to be listed in .csv files called manifests, with two columns: the path to a .wav file and the path to a .txt file. The .wav file is the speech clip and the .txt file contains the transcript in upper case. For Librispeech, use data/librispeech.py in deepspeech.pytorch. For the other datasets, use the provided DeepSpeech/data/make_{MCV,timit}_manifest.py scripts. The TIMIT script works on the original folder structure, whereas for MCV you need to provide a .txt file with entries of the format: file.mp3 : reference text.
  7. The additional and/or modified files can be found in DeepSpeech/, along with our trained model and Language Model (LM) in DeepSpeech/models.
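The manifest format described in step 6 is simple enough to generate with a short script. A minimal sketch (the helper names and the assumption that each .wav has a same-stem .txt transcript next to it are illustrative, not from the repo's manifest scripts):

```python
import csv
import os

def collect_pairs(root):
    """Pair each .wav under root with its same-stem .txt transcript file."""
    pairs = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".wav"):
                wav = os.path.join(dirpath, name)
                txt = wav[:-4] + ".txt"
                if os.path.exists(txt):
                    pairs.append((wav, txt))
    return pairs

def write_manifest(pairs, manifest_path):
    """Write a deepspeech.pytorch-style manifest: one 'wav_path,txt_path' row per clip."""
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        for wav_path, txt_path in pairs:
            writer.writerow([wav_path, txt_path])
```

The real scripts in DeepSpeech/data/ additionally handle dataset-specific details (TIMIT's folder layout, MCV's mp3-to-wav conversion), so prefer those for the actual experiments.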

Reproducing Experiment Results

  • Section 2.1, Table 1: These results were obtained by testing the model with the following command and the appropriate accent manifest:
cd deepspeech.pytorch/
python test.py --model-path ../Deepspeech/models/deepspeech_final.pth --test-manifest {accent manifest}.csv --cuda --decoder beam --alpha 2 --beta 0.4 --beam-width 128 --lm-path ../Deepspeech/models/4-gram.arpa
  • Section 3.1, Attribution Analysis: Code for all experiments in this section can be found in AttrbutionAnalysis.ipynb. Its main inputs are the gradient attributions computed with Deepspeech/test_attr.py and the frame-level alignments, which can be derived from the time-level alignments produced by gentle, along with accent labels and reference transcripts. The paper shows attribution maps for the sentence 'The burning fire had been extinguished.'; the audio files for the various accents can be found in the folder audioFiles.
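As a rough illustration of the kind of input-gradient attribution test_attr.py computes, here is a generic sketch: score each input frame by the gradient norm of the model's greedy-path log-probabilities with respect to the input. The toy model interface (a callable mapping a (1, freq, time) spectrogram to (time', classes) log-probs) is an assumption for illustration, not the repo's actual code.

```python
import torch

def input_gradient_attribution(model, spectrogram):
    """Attribution score per input frame: L2 norm (over frequency bins) of the
    gradient of the summed greedy-path log-probabilities w.r.t. the input.
    A generic sketch, not the repo's test_attr.py implementation."""
    x = spectrogram.clone().requires_grad_(True)
    log_probs = model(x)                       # (time', classes)
    # Back-propagate the summed per-frame max log-probabilities to the input.
    log_probs.max(dim=-1).values.sum().backward()
    # Collapse the frequency axis: one attribution score per input frame.
    return x.grad.norm(dim=1).squeeze(0)
```

Plotting these scores against a frame-level alignment gives an attribution map like those in the paper.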

  • Section 3.2, Information Mixing Analysis: Data points for the figures showing phone focus and neighbour analysis can be found in Contribution.ipynb. Deepspeech/test_contr.py computes the gradient contributions given by equation (1).
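A common way to measure such contributions, sketched below, is the gradient norm of one output frame's hidden state with respect to each input frame, normalized into a distribution. This is only an illustration of the general technique; the paper's equation (1) in test_contr.py may differ in its exact form and normalization.

```python
import torch

def frame_contributions(model, spectrogram, t):
    """Contribution of each input frame to output frame t, measured as the
    gradient norm of h_t w.r.t. the input, normalized to sum to 1.
    `model` maps a (1, freq, time) spectrogram to (time', hidden) states;
    this interface is assumed for illustration."""
    x = spectrogram.clone().requires_grad_(True)
    h = model(x)                    # (time', hidden)
    h[t].sum().backward()           # scalar proxy for output frame t
    contrib = x.grad.norm(dim=1).squeeze(0)   # one score per input frame
    return contrib / contrib.sum()            # normalize to a distribution
```

With a recurrent model, mass spread far from frame t indicates mixing of information across neighbouring frames.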

  • Section 4, Mutual Information Experiments: Data for all experiments involving mutual information can be generated with MI.ipynb, which uses averaged phone representations. These are obtained from the frame-level alignments by averaging all consecutive frames corresponding to a particular phone.
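The averaging step described above can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def average_phone_representations(frames, alignment):
    """Average runs of consecutive frames that share a phone label.

    frames:    (T, D) array of frame-level representations.
    alignment: length-T sequence of phone labels, one per frame.
    Returns a list of (phone, mean_vector) pairs, one per consecutive run.
    """
    reps, start = [], 0
    for t in range(1, len(alignment) + 1):
        # Close the current run at the end, or when the phone label changes.
        if t == len(alignment) or alignment[t] != alignment[start]:
            reps.append((alignment[start], frames[start:t].mean(axis=0)))
            start = t
    return reps
```

Each resulting vector is a single averaged representation for one phone occurrence, which MI.ipynb then uses in the mutual information estimates.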

  • Section 5, Classifier-driven Analysis: All the code files relevant to the accent probes/classifiers and phone probes/classifiers can be found in the folders AccentProbe/ and PhoneProbes/ respectively. The accent probes are trained on entire representations, and the phone probes on frame-level (and averaged) representations.
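The core idea of a probe is a small classifier trained on frozen representations. A minimal sketch in PyTorch (a single linear layer with illustrative hyperparameters; the probes in AccentProbe/ and PhoneProbes/ may use different architectures and training setups):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """A linear probe: a logistic-regression-style classifier on top of
    frozen representations. Sizes here are illustrative."""
    def __init__(self, rep_dim, n_classes):
        super().__init__()
        self.fc = nn.Linear(rep_dim, n_classes)

    def forward(self, reps):
        return self.fc(reps)

def train_probe(probe, reps, labels, epochs=100, lr=1e-2):
    """Full-batch training on precomputed representations."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(reps), labels)
        loss.backward()
        opt.step()
    return probe
```

The probe's classification accuracy then serves as a measure of how much accent (or phone) information the frozen representations encode.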

Citation

If you use this code in your work, please consider citing our paper:

@inproceedings{prasad-jyothi-2020-accents,
    title = "How Accents Confound: Probing for Accent Information in End-to-End Speech Recognition Systems",
    author = "Prasad, Archiki  and
      Jyothi, Preethi",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.345",
    pages = "3739--3753"}
    

Acknowledgements

This project uses code from deepspeech.pytorch.