# Download and Log data to W&B

For this tutorial, we use a small part of the Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset. You can read more about the dataset [here](https://arxiv.org/abs/2104.01497). We use speaker 9017 as the target speaker, and only a 5-minute subset of the audio is used for this fine-tuning example. We additionally resample audio to 22050 Hz (22.05 kHz).
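As a quick sanity check on the sample rate mentioned above, the header of a WAV file can be inspected with Python's standard `wave` module. This is a minimal sketch that assumes the files are PCM-encoded WAV; the helper names `wav_sample_rate` and `has_expected_rate` are illustrative and are not part of the tutorial or of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 22050  # Hz, the rate stated in this tutorial


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def has_expected_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True if the file header reports the expected sample rate."""
    return wav_sample_rate(path) == expected
```

Once the archive has been extracted, calling `has_expected_rate` on any file under the `audio/` folder should return `True` if the resampling described above was applied.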

In [1]:
import wandb
import json
import pandas as pd

In [2]:
SPEAKER_ID = "9017"
WANDB_PROJECT = "tts-workshop"
WANDB_ENTITY = "capecape" # replace with your wandb username or team

In [3]:
!wget https://multilangaudiosamples.s3.us-east-2.amazonaws.com/"{SPEAKER_ID}_5_mins.tar.gz"  # Contains 10MB of data
!tar -xzf "{SPEAKER_ID}_5_mins.tar.gz"

--2022-12-07 17:31:31--  https://multilangaudiosamples.s3.us-east-2.amazonaws.com/9017_5_mins.tar.gz
Resolving multilangaudiosamples.s3.us-east-2.amazonaws.com (multilangaudiosamples.s3.us-east-2.amazonaws.com)... 3.5.129.143
Connecting to multilangaudiosamples.s3.us-east-2.amazonaws.com (multilangaudiosamples.s3.us-east-2.amazonaws.com)|3.5.129.143|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10802737 (10M) [application/x-gzip]
Saving to: ‘9017_5_mins.tar.gz.1’


2022-12-07 17:31:31 (41.8 MB/s) - ‘9017_5_mins.tar.gz.1’ saved [10802737/10802737]



Looking at `manifest.json`, we see a standard NeMo JSON manifest with one record per line, each containing the audio file path, the text, and the duration. Note that the paths in our `manifest.json` are relative.

In [4]:
df = pd.read_json(f"{SPEAKER_ID}_5_mins/manifest.json", lines=True)

In [5]:
df.head()

| | audio_filepath | text | duration | text_no_preprocessing | text_normalized |
|---|---|---|---|---|---|
| 0 | audio/dartagnan03part1_027_dumas_0047.wav | yes monsieur | 1.04 | Yes, monsieur. | Yes, monsieur. |
| 1 | audio/dartagnan01_42_dumas_0220.wav | asked he in an undertone | 1.66 | asked he, in an undertone. | asked he, in an undertone. |
| 2 | audio/dartagnan01_38_dumas_0123.wav | grimaud entered | 1.2 | Grimaud entered. | Grimaud entered. |
| 3 | audio/dartagnan01_53_dumas_0059.wav | in the morning when they entered milady's cham... | 3.7 | In the morning, when they entered Milady's cha... | In the morning, when they entered Milady's cha... |
| 4 | audio/dartagnan03part3_66_dumas_0203.wav | yes monseigneur | 1.42 | “Yes, monseigneur. | Yes, monseigneur. |
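To make the manifest layout concrete, here is a minimal sketch that parses JSON Lines records without pandas and sums their durations. The two records are copied from the preview above but simplified to three fields; the helper names `parse_manifest` and `total_duration_seconds` are illustrative, and we assume `duration` is expressed in seconds.

```python
import json

# Two records from the preview above, one JSON object per line (simplified to three fields).
sample_manifest = """\
{"audio_filepath": "audio/dartagnan03part1_027_dumas_0047.wav", "text": "yes monsieur", "duration": 1.04}
{"audio_filepath": "audio/dartagnan01_42_dumas_0220.wav", "text": "asked he in an undertone", "duration": 1.66}
"""


def parse_manifest(text):
    """Parse JSON Lines manifest text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def total_duration_seconds(records):
    """Sum the `duration` field (assumed to be in seconds)."""
    return sum(record["duration"] for record in records)


records = parse_manifest(sample_manifest)
print(len(records), round(total_duration_seconds(records), 2))
```

Running the same sum over the full manifest is a simple way to confirm that the subset is roughly 5 minutes (about 300 seconds) long.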


Let's log this raw data to W&B as an artifact.

In [6]:
wandb.init(project=WANDB_PROJECT, entity=WANDB_ENTITY, job_type="log_dataset", config={"speaker_id":SPEAKER_ID})

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mcapecape[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [7]:
at = wandb.Artifact("9017_5_mins", type="dataset", description=f"Speaker {SPEAKER_ID} raw audio, 5 minutes long")

In [8]:
at.add_dir(f"{SPEAKER_ID}_5_mins")

[34m[1mwandb[0m: Adding directory to artifact (./9017_5_mins)... Done. 0.0s


In [9]:
wandb.log_artifact(at)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0b158a2040>

In [10]:
wandb.finish()

## Train/Val split

We hold out the last 2 samples of the manifest as a validation set and use all remaining samples for training.

Because the paths in the manifest are relative, we also create a symbolic link to the audio folder so that `audio/` resolves to the correct directory.

In [11]:
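# Hold out the last 2 manifest lines for validation; the rest go to training
!tail -n 2 ./{SPEAKER_ID}_5_mins/manifest.json > ./{SPEAKER_ID}_manifest_valid_local.json
!head -n -2 ./{SPEAKER_ID}_5_mins/manifest.json > ./{SPEAKER_ID}_manifest_train_local.json
# Link audio/ into the working directory so the relative manifest paths resolve
!ln -s ./{SPEAKER_ID}_5_mins/audio audio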
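The shell commands above rely on `head -n -2`, a negative line count that may not be supported by non-GNU versions of `head`. As a portable alternative, the same split can be sketched in plain Python. The function name `split_manifest` and its parameters are illustrative and not part of the original notebook.

```python
from pathlib import Path


def split_manifest(manifest_path, train_path, valid_path, n_valid=2):
    """Write the last `n_valid` lines to valid_path and all earlier lines to train_path.

    Returns a (n_train, n_valid) tuple with the number of lines written to each file.
    """
    lines = Path(manifest_path).read_text(encoding="utf-8").splitlines(keepends=True)
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be positive and smaller than the manifest")
    Path(train_path).write_text("".join(lines[:-n_valid]), encoding="utf-8")
    Path(valid_path).write_text("".join(lines[-n_valid:]), encoding="utf-8")
    return len(lines) - n_valid, n_valid
```

With the file names used in this tutorial, the call would look like `split_manifest("9017_5_mins/manifest.json", "9017_manifest_train_local.json", "9017_manifest_valid_local.json")`.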

Let's log the split files to W&B.

In [12]:
run = wandb.init(project=WANDB_PROJECT, entity=WANDB_ENTITY,  job_type="dataset_split", config={"speaker_id":SPEAKER_ID})

In [13]:
run.use_artifact(f'{WANDB_ENTITY}/{WANDB_PROJECT}/9017_5_mins:v0', type='dataset')

<Artifact QXJ0aWZhY3Q6Mjk0NzgyOTAz>

In [14]:
at = wandb.Artifact("9017_5_split", type="dataset_split", description=f"Train/valid split for Speaker {SPEAKER_ID} raw audio, 5 minutes long")

In [15]:
at.add_file(f"./{SPEAKER_ID}_manifest_train_local.json")
at.add_file(f"./{SPEAKER_ID}_manifest_valid_local.json")

<ManifestEntry digest: J1QARzpwdQAMf2fk8j9CYQ==>

In [16]:
wandb.log_artifact(at)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f0b1483e6d0>

## 👀 Visualizing the dataset (or playing the audio 🤣)

Let's create a W&B Table to inspect these files.

In [17]:
train_df = pd.read_json(f"{SPEAKER_ID}_manifest_train_local.json", lines=True)
train_df

| | audio_filepath | text | duration | text_no_preprocessing | text_normalized |
|---|---|---|---|---|---|
| 0 | audio/dartagnan03part1_027_dumas_0047.wav | yes monsieur | 1.04 | Yes, monsieur. | Yes, monsieur. |
| 1 | audio/dartagnan01_42_dumas_0220.wav | asked he in an undertone | 1.66 | asked he, in an undertone. | asked he, in an undertone. |
| 2 | audio/dartagnan01_38_dumas_0123.wav | grimaud entered | 1.20 | Grimaud entered. | Grimaud entered. |
| 3 | audio/dartagnan01_53_dumas_0059.wav | in the morning when they entered milady's cham... | 3.70 | In the morning, when they entered Milady's cha... | In the morning, when they entered Milady's cha... |
| 4 | audio/dartagnan03part3_66_dumas_0203.wav | yes monseigneur | 1.42 | “Yes, monseigneur. | Yes, monseigneur. |
| ... | ... | ... | ... | ... | ... |
| 71 | audio/dartagnan03part3_09_dumas_0218.wav | and so you are determined to sign the sale of ... | 8.76 | “And so you are determined to sign the sale of... | And so you are determined to sign the sale of ... |
| 72 | audio/dartagnan01_62_dumas_0190.wav | what | 0.58 | “What?” | "What?" |
| 73 | audio/dartagnan01_33_dumas_0018.wav | well what is to be done | 1.90 | “Well, what is to be done?” | "Well, what is to be done?" |
| 74 | audio/dartagnan03part3_62_dumas_0243.wav | said grimaud addressing athos and pointing to ... | 7.88 | said Grimaud, addressing Athos and pointing to... | said Grimaud, addressing Athos and pointing to... |


Next, we create a `wandb.Table` from the `DataFrame`:
- We first need to convert the audio file paths to `wandb.Audio` objects.
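Since this conversion depends on every relative `audio_filepath` resolving to a real file, it can help to check the manifest first. The following is a minimal sketch using only the standard library; the function name `missing_audio_files` and the `base_dir` parameter are illustrative and not part of the original notebook.

```python
import json
from pathlib import Path


def missing_audio_files(manifest_path, base_dir="."):
    """Return the manifest audio paths that do not resolve to a file under base_dir."""
    missing = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            if not line.strip():
                continue  # tolerate blank lines in the manifest
            audio_path = json.loads(line)["audio_filepath"]
            # is_file() follows symbolic links, so a linked audio/ folder is accepted
            if not (Path(base_dir) / audio_path).is_file():
                missing.append(audio_path)
    return missing
```

An empty list from `missing_audio_files("9017_manifest_train_local.json")` suggests that the `audio/` symbolic link created earlier is in place and the paths resolve from the current working directory.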

In [18]:
train_df.audio_filepath = train_df.audio_filepath.apply(wandb.Audio)

In [19]:
train_table = wandb.Table(dataframe=train_df)

In [20]:
wandb.log({"train_data": train_table})

We can do the same with the validation data:

In [21]:
valid_df = pd.read_json(f"{SPEAKER_ID}_manifest_valid_local.json", lines=True)
valid_df.audio_filepath = valid_df.audio_filepath.apply(wandb.Audio)
valid_table = wandb.Table(dataframe=valid_df)

In [22]:
wandb.log({"valid_data": valid_table})

In [23]:
wandb.finish()

wandb: Network error (ConnectTimeout), entering retry loop.
