# Kaldi-style data directory creation

A Kaldi-style data directory is required to use the dataset for ESPnet2 fine-tuning.

<u>Directory structure:</u>

`data/
  train/
    - text     # Transcription of each utterance
    - wav.scp  # Path to the wave file of each utterance
    - utt2spk  # Maps each utterance-id to its speaker-id
    - spk2utt  # Maps each speaker-id to its utterance-ids
    - segments # [Optional] Start and end time of each utterance
  dev/
    ...
  test/
    ...`
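The two speaker mapping files listed above mirror each other, so one can be derived from the other. The following is a minimal sketch, not part of the original notebook, assuming utterance ids of the form `<speaker-id>_<clip-number>`. The helper names `build_speaker_maps` and `write_speaker_maps` and the example ids are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def build_speaker_maps(utt_ids):
    """Return (utt2spk, spk2utt) for ids shaped like '<speaker-id>_<clip>'."""
    utt2spk = {}
    spk2utt = defaultdict(list)
    for utt_id in sorted(utt_ids):
        # Assumption: the speaker id is everything before the last underscore.
        spk_id = utt_id.rsplit("_", 1)[0]
        utt2spk[utt_id] = spk_id
        spk2utt[spk_id].append(utt_id)
    return utt2spk, dict(spk2utt)


def write_speaker_maps(out_dir, utt2spk, spk2utt):
    """Write utt2spk ('<utt> <spk>') and spk2utt ('<spk> <utt1> <utt2> ...')."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "utt2spk").open("w") as f:
        for utt_id in sorted(utt2spk):
            f.write(f"{utt_id} {utt2spk[utt_id]}\n")
    with (out_dir / "spk2utt").open("w") as f:
        for spk_id in sorted(spk2utt):
            f.write(f"{spk_id} {' '.join(spk2utt[spk_id])}\n")


# Hypothetical utterance ids, used only to exercise the functions.
example_ids = ["spkB_00001", "spkA_00002", "spkA_00001"]
utt2spk, spk2utt = build_speaker_maps(example_ids)
example_dir = Path(tempfile.mkdtemp()) / "data" / "train"
write_speaker_maps(example_dir, utt2spk, spk2utt)
print((example_dir / "spk2utt").read_text())
```

Both files are written in sorted order because Kaldi tooling generally expects sorted entries. Prefixing each utterance id with its speaker id keeps the two orderings consistent.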

In [3]:
# Imports
from pathlib import Path
from tqdm import tqdm
from more_itertools import ilen
from collections import defaultdict
import math

### Text transcription

`uttidA <transcription>
uttidB <transcription>
...`
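As a rough illustration of the format above, the sketch below turns one raw transcript line into a `text` entry. It assumes the raw line starts with a literal `Text:` prefix. The function name `to_text_line` and the sample input are hypothetical, not taken from the dataset.

```python
def to_text_line(utt_id, raw_line):
    """Format one Kaldi `text` entry from a raw 'Text: ...' line."""
    # Assumption: the transcription follows a literal "Text:" prefix.
    prefix = "Text:"
    if not raw_line.startswith(prefix):
        raise ValueError(f"unexpected transcript line: {raw_line!r}")
    # split() with no arguments drops leading/trailing whitespace and
    # collapses repeated spaces, so the entry stays on a single line.
    words = raw_line[len(prefix):].split()
    return f"{utt_id} {' '.join(words)}"


# Hypothetical input, used only for illustration.
print(to_text_line("spkA_00001", "Text:  HELLO WORLD \n"))
```

Normalizing whitespace here avoids stray leading spaces or trailing newlines in the transcription field.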

In [4]:
data_dir = Path("/mnt/U/Datasets/lrs3pretrain/raw/pretrain/")

In [6]:
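# Each entry directly under data_dir is treated as one speaker directory.
speakers = sorted(data_dir.glob("*"))
# Preview the first two speakers only.
for speaker in tqdm(speakers[:2]):
    for text in sorted(speaker.glob("*.txt")):
        # Utterance id: <speaker-id>_<clip-number>, so ids group by speaker.
        utt_id = speaker.name + "_" + text.stem
        with text.open() as f:
            # The first line of each file has the form "Text: <transcription>".
            line = f.readline()
            transcription = line.split("Text: ")[1]
        print(utt_id)
        print(transcription)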

00j9bKdiOjk_00001
 IT HAD A STRONG STURDY STEM ROOTED FIRMLY INTO THE RICH SOIL IT RECEIVED LOTS OF WATER FROM THE GARDENER AND IT WAS PROVIDED AMPLE PROTECTION FROM THE WIND THANKS TO ALL OF THE FLOWERS THAT SURROUNDED IT 

00j9bKdiOjk_00002
 WAS CLEAR THAT THIS ONCE BLOSSOMING SUNFLOWER WAS NOW SUFFERING ONE DAY A YOUNG BOY COMES UPON THE SUNFLOWER WHILE VISITING THE GARDEN AND HE NOTICES HOW WEAK IT LOOKS HE TAKES IT AND BRINGS IT HOME PLACING IT IN A POT OF NUTRIENT RICH SOIL AND PROVIDING IT WITH WATER HE TENDS TO THE SUNFLOWER EVERY DAY GIVING IT A NEW OPPORTUNITY FOR GROWTH AND SLOWLY IT REGAINS ITS OLD STRENGTH AND VIBRANCE JUST LIKE REACHING OUT TO THE SUNFLOWER BY PROVIDING SOMEONE WHO IS NEGLECTED ISOLATED OR FORGOTTEN WITH LOVE AND KINDNESS YOU CAN HAVE A TREMENDOUS IMPACT ON THEM 



00j9bKdiOjk_00003
 THEY ARE THIS GROUP OF PEOPLE THEY ARE INCREDIBLE YET SO OFTEN THEY ARE FORGOTTEN LACKING THE LOVE AND APPRECIATION THEY DESERVE THE ELDERLY THIS IS A GROUP OF PEOPLE THAT IS KIND OF PUSHED AWAY FROM EVERYDAY CONSCIOUSNESS AND THEY SOMETIMES LOSE ALL TOUCH WITH THE OUTSIDE WORLD IN MOST CULTURES ELDERS ARE REGARDED WITH THE UTMOST 

01GWGmg5jn8_00011
 HOW DID I SUDDENLY FIND MYSELF IN CAREER WHERE I WAS SPENDING 14 16 HOURS DAY IN THE OFFICE NOT LEAVING MUCH TIME FOR ANYTHING ELSE THE THINGS THAT I WANTED TO DO I WANTED TO BE SPENDING TIME WITH MY WIFE I WANTED TO SPEND TIME WITH MY FAMILY MY FRIENDS THEY JUST WEREN'T 

01GWGmg5jn8_00022
 ARE YOU STILL HAPPY WITH THAT LIFE THAT YOU HAVE SET UP FOR YOURSELVES I THINK IT IS IMPORTANT TO KEEP ASKING THOSE THINGS IN A CYCLE AND PARDON THE SKEWED GRAPH HERE BUT WHAT WE'RE TRYING TO SAY IS THAT WILL KEEP FEEDING INTO WHAT MATTERS TO YOU IF YOU KEEP ASKING YOURSELF THOSE QUESTIONS YOU WILL ENSURE THAT YOU SET YOURSELF ON TH

01GWGmg5jn8_00030
 A LOT OF TIME IN JOB OR IN A PROFESSION OR IN A LIFE SITUATION YOU KNOW IF YOU STUDYING DAY IN DAY OUT THAT YOU REALLY DO NOT ENJOY OR GET A LOT OUT OF YOU'RE NOT 

01GWGmg5jn8_00031
 DO IT BECAUSE YOU'RE NOT THAT PASSIONATE ABOUT IT IF WE GO BACK TO THAT CHART THE THREE CIRCLES IF YOU'VE WORKED OUT WHAT MATTERS TO YOU 

01GWGmg5jn8_00032
 WHAT CAN HELP YOU GET THERE YOU'LL NEVER FIND YOURSELF IN THIS SCENARIO YOU'LL NEVER BE SITTING AT A DESK BORED BECAUSE YOU'LL ALWAYS BE MOTIVATED TO DO SOMETHING IN MY CASE AS SAID 

01GWGmg5jn8_00034
 WAY WE STRUCTURE OUR DAYS IS MORE BASED ON WHAT WE NEED TO DO OUTSIDE THE OFFICE AND THEREFORE WHAT TO WE DO INSIDE THE OFFICE TO HELP US ACHIEVE THAT WHICH IS A DIFFERENT WAY OF THINKING BUT IF YOU KNOW WHAT MOTIVATES AND WHAT MATTERS TO YOU THEN YOU CAN STRUCTURE YOUR LIFE TO HELP YOU MEET THOSE GOALS IT'S IMPORTANT FOR YOU GUYS WHEN YOU'RE THINKING ABOUT NOT NECESSARILY WORK DOWN THE TRACK BUT IN YOUR STUDIES WHAT ARE YOU PASSION


