<a href="https://colab.research.google.com/github/bijmuj/StreamTranslation/blob/main/CommonVoiceJapaneseConversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Common Voice Japanese Conversion

I'm using the [Common Voice](https://commonvoice.mozilla.org/en/datasets) dataset by Mozilla. It is a large, open-source, multilingual dataset, available under a [Creative Commons License](https://www.mozilla.org/en-US/foundation/licensing/website-content/). This project uses the Japanese subset of the dataset, which consists of 397 voices and 26 hours of validated utterances. The sound clips are in .mp3 format by default, so I converted them to .wav for use with [Coqui.ai's Speech to Text Model](https://github.com/coqui-ai/stt). I also reshaped the validated.tsv metadata file into a .csv file that the model can read natively.

## Installs

We need to get pydub to do the conversions from .mp3 to .wav.

In [None]:
pip install -q pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
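
pydub does not decode MP3 by itself; it calls an external ffmpeg (or avconv) binary, which `pip` does not install. Colab runtimes usually have ffmpeg preinstalled, but that is an assumption worth checking. A minimal sketch, where `has_ffmpeg` is a hypothetical helper name:

```python
import shutil


def has_ffmpeg():
    """Return True if an ffmpeg or avconv executable is on PATH."""
    # pydub needs one of these binaries to decode compressed audio such as MP3.
    return any(shutil.which(name) is not None for name in ("ffmpeg", "avconv"))


print("decoder available:", has_ffmpeg())
```

If this prints `False`, the MP3 decoding step later in the notebook is likely to fail until a decoder is installed.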


## Getting the Dataset

I'm not downloading it directly from the Common Voice site because it was easier to download the archive locally, upload it to Google Drive, and work from there. Make sure to connect Drive before running.
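
Connecting Drive can also be done from code instead of the Colab sidebar. A minimal sketch, assuming the default mount point `/content/drive` (which matches the relative `drive/MyDrive/...` paths used below); `mount_drive` is a hypothetical helper, and it simply returns `False` when run outside Colab:

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running on Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


mounted = mount_drive()
print("Drive mounted:", mounted)
```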

In [None]:
! mkdir Dataset
! tar -xkf drive/MyDrive/cv-corpus-7.0-2021-07-21-ja.tar.gz -C Dataset

## Removing Unnecessary Attributes

For this project, attributes like age, gender, and accent are not useful. All we need are the file names (the `path` column) and the transcripts (the `sentence` column). I prepended the directory path to the file names to make opening them easier later on.

In [None]:
import os  # used later for creating the output directory and reading file sizes

from pydub import AudioSegment
import pandas as pd
from tqdm import tqdm

In [None]:
tsv_base = pd.read_csv('/content/Dataset/cv-corpus-7.0-2021-07-21/ja/validated.tsv', sep='\t')
tsv_base.head(5)

| | client_id | path | sentence | up_votes | down_votes | age | gender | accent | locale | segment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 033ede7ca4c60dc27cef421b4d33799d38924ed36fa8dd... | common_voice_ja_21409740.mp3 | 祖母は、おおむね機嫌よく、サイコロをころがしている。 | 2 | 0 | NaN | NaN | NaN | ja | NaN |
| 1 | 087edae49ce1e0f600682ceccc7fc28e81e64ae890e647... | common_voice_ja_22072759.mp3 | 財布をなくしたので、交番へ行きます。 | 2 | 0 | teens | female | NaN | ja | NaN |
| 2 | 09e6ae463786aae9071baa9044ac8b7466aa7c48dcdaf4... | common_voice_ja_23677003.mp3 | 背の高さは一七〇センチほどで、目が大きく、やや太っている。 | 2 | 0 | NaN | NaN | NaN | ja | NaN |
| 3 | 15b7d87a73d28b37664fdf7fea1ff232f89e80ce954c9b... | common_voice_ja_19499629.mp3 | 新しい靴をはいて出かけます。 | 2 | 0 | NaN | NaN | NaN | ja | NaN |
| 4 | 1c6e8463b08279962ad37c0946d0b1df78a82a4c907f4b... | common_voice_ja_22717324.mp3 | 松井さんはサッカーより野球のほうが上手です。 | 2 | 0 | thirties | male | NaN | ja | NaN |


In [None]:
mp3_path = "/content/Dataset/cv-corpus-7.0-2021-07-21/ja/clips/"
tsv_base['path'] = mp3_path + tsv_base['path'].astype(str)
features = ['path', 'sentence']
tsv_small = tsv_base[features].copy()
tsv_small.rename(columns={'sentence':'transcript'}, inplace=True)
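
Before converting, it can help to confirm that every prefixed path actually points at a clip on disk. A minimal sketch, where `missing_clips` is a hypothetical helper and the demo files are made up purely for illustration:

```python
import os
import tempfile


def missing_clips(paths):
    """Return the entries of `paths` that do not exist as files on disk."""
    return [p for p in paths if not os.path.isfile(p)]


# Self-contained demo: one file that exists and one path that does not.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "clip_a.mp3")
    open(real, "wb").close()
    fake = os.path.join(tmp, "clip_b.mp3")
    demo_missing = missing_clips([real, fake])
    print("missing:", [os.path.basename(p) for p in demo_missing])
```

In this notebook the call would be `missing_clips(tsv_small['path'])`, which should come back empty if the archive extracted cleanly.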

## Conversion

Every clip is converted to .wav. Only the paths, audio file sizes, and transcriptions are kept at the end.

In [None]:
wav_path = './Dataset/wav/'
os.makedirs(wav_path, exist_ok=True)  # does not fail if the cell is re-run

In [None]:
def convert_to_wav(paths):
    """Convert each .mp3 in `paths` to .wav; return the new file names and sizes in bytes."""
    file_names = []
    file_sizes = []
    for path in tqdm(paths, total=len(paths)):
        # Swap directory and extension: .../clips/foo.mp3 -> ./Dataset/wav/foo.wav
        stem = os.path.splitext(os.path.basename(path))[0]
        file_name = wav_path + stem + '.wav'
        file_names.append(file_name)

        sound = AudioSegment.from_mp3(path)
        sound.export(file_name, format='wav')
        file_sizes.append(os.path.getsize(file_name))
    return file_names, file_sizes

In [None]:
tsv_small['wav_filename'], tsv_small['wav_filesize'] = convert_to_wav(tsv_small['path'])
features = ['wav_filename', 'wav_filesize', 'transcript']
tsv_small = tsv_small[features]
tsv_small.to_csv('./Dataset/validated_samples.csv', index=False)
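
The export above keeps each clip's original sample rate and channel count. Speech models are usually trained on one fixed audio format; 16 kHz mono 16-bit PCM is a common choice, though whether it applies to the model used here is an assumption to verify against its documentation. A minimal sketch for inspecting a converted file with the standard-library `wave` module, where `wav_format` is a hypothetical helper and the demo file is generated silence:

```python
import os
import tempfile
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels, bits_per_sample) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


# Self-contained demo: write one second of 16 kHz mono 16-bit silence, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16 bits
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    demo_format = wav_format(demo_path)
    print(demo_format)
```

If the converted clips report a different format than the model expects, pydub's `AudioSegment.set_frame_rate` and `AudioSegment.set_channels` can be applied to `sound` before `export`.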

## Finishing up

The last step is to tarball and compress the audio clips, then upload the tarball and the CSV file to Drive.

In [None]:
! tar -czf wav_files.tar.gz ./Dataset/wav/
! cp wav_files.tar.gz ./drive/MyDrive/Common\ Voice\ Japanese
! cp ./Dataset/validated_samples.csv ./drive/MyDrive/Common\ Voice\ Japanese
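
A quick way to sanity-check the archive is to count the .wav members inside it and compare that with the number of rows in the CSV. A minimal sketch using the standard-library `tarfile` module; `count_wavs` is a hypothetical helper and the demo archive is built from empty placeholder files:

```python
import os
import tarfile
import tempfile


def count_wavs(tar_path):
    """Count regular .wav members in a tar archive (compression is auto-detected)."""
    with tarfile.open(tar_path, "r:*") as tar:
        return sum(1 for m in tar if m.isfile() and m.name.endswith(".wav"))


# Self-contained demo: archive two empty .wav files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    names = ("a.wav", "b.wav", "notes.txt")
    for name in names:
        open(os.path.join(tmp, name), "wb").close()
    demo_tar = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name in names:
            tar.add(os.path.join(tmp, name), arcname=name)
    demo_count = count_wavs(demo_tar)
    print("wav members:", demo_count)
```

For this notebook, `count_wavs('wav_files.tar.gz')` would be expected to equal `len(tsv_small)`.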