# Preparing the Common Voice Dataset
Notebook for converting the Common Voice dataset into wav and JSON files for training.

Based on [A-Hackers-AI-Voice-Assistant](https://github.com/LearnedVector/A-Hackers-AI-Voice-Assistant)

In [1]:
from tqdm.notebook import tqdm
import json
import csv
from pydub import AudioSegment
import random

### Arguments
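* `file_path` - path to one of the .tsv files found in cv-corpus (the `clips` folder is expected in the same directory)
* `save_path` - directory where the `wav` folder and the JSON files are written
* `split_percent` - percentage of clips to put into `test.json` instead of `train.json`
* `convert` - whether to convert the mp3 clips to wav
* `verbose` - whether to print progress messages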

In [2]:
file_path = 'E:/cv-corpus-11.0-2022-09-21/en/train.tsv'
save_path = 'F:/cv-corpus-11.0-2022-09-21/en'
convert = True
verbose = True
split_percent = 10
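
As a quick illustration of what `split_percent` means in terms of clip counts, the sketch below computes how many entries would land in each file. The helper name `split_counts` is hypothetical and not part of the notebook; it assumes `split_percent` is a percentage between 0 and 100.

```python
def split_counts(total, split_percent):
    """Return (train_count, test_count) for a percentage-based split.

    Hypothetical helper for illustration only; assumes
    0 <= split_percent <= 100.
    """
    if not 0 <= split_percent <= 100:
        raise ValueError("split_percent must be between 0 and 100")
    # Entries before this index go to train, the rest go to test
    train_count = int(total * (100 - split_percent) / 100)
    return train_count, total - train_count


print(split_counts(1000, 10))  # (900, 100)
```

With `split_percent = 10`, roughly nine out of ten clips end up in `train.json`.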

In [3]:
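# Collected {"key": wav path, "text": sentence} entries, filled in below
data = []
# Folder that contains the .tsv file and the clips folder
directory = file_path.rpartition('/')[0]

# Count the lines in the .tsv file (this includes the header line)
with open(file_path, encoding='utf8') as f:
    length = sum(1 for line in f)

if verbose:
    print('Number of audio samples:', length)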

Number of audio samples: 948737


### Convert mp3 to wav
Files from the `clips` folder are converted to wav format and saved in the `wav` folder under `save_path`.

In [4]:
import os

if convert:
    # Make sure the output folder exists before exporting
    os.makedirs(save_path + '/wav', exist_ok=True)

with open(file_path, newline='', encoding='utf8') as tsvfile:
    reader = csv.DictReader(tsvfile, delimiter='\t')
    if convert and verbose:
        print("Converting audio samples from mp3 to wav")
    # length also counts the header line, so there are length - 1 data rows
    for row in tqdm(reader, total=length - 1):
        file_name = row['path']
        new_file_name = file_name.rpartition('.')[0] + ".wav"
        data.append({
            # Point at the converted file under save_path, where it is exported below
            "key": save_path + '/wav/' + new_file_name,
            "text": row['sentence']
        })
        if convert:
            src = directory + '/clips/' + file_name
            dst = save_path + '/wav/' + new_file_name
            sound = AudioSegment.from_mp3(src)
            sound.export(dst, format='wav')

Converting audio samples from mp3 to wav


  0%|          | 0/948737 [00:00<?, ?it/s]
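
Once the conversion has run, it can be useful to spot-check a converted file. The sketch below is not part of the original notebook. It uses the standard-library `wave` module and assumes the exported files are uncompressed PCM wav, which `wave` can read. The helper name `describe_wav` and the file name in the usage comment are placeholders.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM wav file."""
    with wave.open(path, 'rb') as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return w.getnchannels(), rate, frames / rate


# Example usage (the file name is a placeholder, not a real clip):
# print(describe_wav(save_path + '/wav/some_clip.wav'))
```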

### Write JSON files
We create two files, `train.json` and `test.json`, that contain the wav/text pairs, one JSON object per line.

In [5]:
random.shuffle(data)
print("Creating JSONs")

# Entries before split_index go to train.json; the remaining
# split_percent percent of the entries go to test.json
split_index = int(len(data) * (100 - split_percent) / 100)

with open(save_path + '/train.json', 'w') as f:
    for r in data[:split_index]:
        f.write(json.dumps(r) + '\n')

with open(save_path + '/test.json', 'w') as f:
    for r in data[split_index:]:
        f.write(json.dumps(r) + '\n')

print("Done!")

Creating JSONs
Done!
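
To check the result, the files can be read back line by line, since each line holds one JSON object with a `key` and a `text` field. The sketch below is an illustration rather than part of the original notebook, and the helper name `read_manifest` is hypothetical.

```python
import json


def read_manifest(path):
    """Load a file with one JSON object per line into a list of dicts."""
    with open(path, encoding='utf8') as f:
        # Skip blank lines so a trailing newline does not cause an error
        return [json.loads(line) for line in f if line.strip()]


# Example usage, assuming the cells above have been run:
# test_entries = read_manifest(save_path + '/test.json')
# print(len(test_entries), test_entries[0]['key'])
```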
