1. Download my_mm_female.zip from https://www.openslr.org/80/
2. Unzip my_mm_female.zip and rename the extracted folder to raw-data
3. Create the folder openslr-80 in the current directory
4. Copy the raw-data folder into the openslr-80 folder
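
The manual steps above can also be scripted. The following is a minimal sketch, assuming my_mm_female.zip has already been downloaded by hand; `prepare_raw_data` is a hypothetical helper name, not part of the original notebook, and the handling of a single wrapping folder inside the archive is an assumption about how the zip is laid out.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def prepare_raw_data(zip_path, target_root="openslr-80"):
    """Extract zip_path and place its contents at <target_root>/raw-data.

    Hypothetical helper: it mirrors the unzip / rename / copy steps.
    """
    target_root = Path(target_root)
    raw_dir = target_root / "raw-data"
    if raw_dir.exists():
        raise FileExistsError(f"{raw_dir} already exists")
    target_root.mkdir(parents=True, exist_ok=True)

    # Extract inside target_root so the final move stays on one filesystem.
    with tempfile.TemporaryDirectory(dir=target_root) as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        entries = list(Path(tmp).iterdir())
        # Assumption: if the archive wraps everything in one folder, that
        # folder becomes raw-data; otherwise the loose entries are moved in.
        if len(entries) == 1 and entries[0].is_dir():
            shutil.move(str(entries[0]), str(raw_dir))
        else:
            raw_dir.mkdir()
            for entry in entries:
                shutil.move(str(entry), str(raw_dir / entry.name))
    return raw_dir
```

Calling `prepare_raw_data("my_mm_female.zip")` from the notebook's directory should leave the data under openslr-80/raw-data, which is the layout the cells below expect.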

In [1]:
import os
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
import shutil

In [2]:
seed = 42
openslr_dir = Path("./openslr-80")

# 1. Check file extensions

In [3]:
file_names = []
file_extensions = []
for root, _dirs, files in os.walk(openslr_dir / "raw-data"):
    for file in files:
        file_name = Path(root) / file
        file_names.append(file_name)
        file_extensions.append(file_name.suffix)

In [4]:
set(file_extensions)

{'', '.tsv', '.wav'}
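
The set above shows which extensions occur but not how often, or which files produce the empty suffix. A small sketch that adds both is given here; `count_extensions` and `files_without_extension` are hypothetical helper names introduced for illustration.

```python
from collections import Counter
from pathlib import Path


def count_extensions(folder):
    """Return a Counter mapping each file suffix ('' if none) to its frequency."""
    return Counter(p.suffix for p in Path(folder).rglob("*") if p.is_file())


def files_without_extension(folder):
    """List the files under folder whose names have no suffix."""
    return sorted(
        p for p in Path(folder).rglob("*") if p.is_file() and p.suffix == ""
    )
```

For example, `count_extensions(openslr_dir / "raw-data")` gives the per-extension counts, and `files_without_extension(openslr_dir / "raw-data")` names the files behind the `''` entry. Note that `pathlib` reports an empty suffix for hidden files such as `.DS_Store` as well as for files with no dot at all.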

# 2. Split Train and Test

## 2.1 Load the dataset

In [5]:
df = pd.read_csv(openslr_dir / "raw-data/line_index.tsv", delimiter="\t", header=None, names=['file_name', 'transcription'])

In [6]:
df.shape

(2530, 2)

In [7]:
df.head()

,file_name,transcription
0,bur_7865_1250917969,ပြီးတော့ တရုတ် နဲ့လည်း ချစ်ကြည်ရင်းနှီးတဲ့ ဆက်...
1,bur_8698_6883351313,အဲ့ဒီ ဝေဖန်မှုတွေ နဲ့ ပတ်သက် လို့ ဘယ်လို တုံ့ပ...
2,bur_3260_8853590661,မမီ မွေးဖွားနေတဲ့ အချိန် မှာတော့ ဘုရား ဂုဏ်တော...
3,bur_2446_1980151079,ခင်ဦးသာ နှင့် နန်းလွင်နှင်းပွင့် ပူးပေါင်း ရေး...
4,bur_5189_8958061789,ကြည့်ရသည်မှာ မြန်မာ နိုင်ငံ တွင် သူ့ အတွက် ခို...
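
Before splitting, it may be worth confirming that every row of the index has a matching audio file, since the copy steps later would otherwise fail partway through. A minimal sketch follows; `find_missing_audio` is a hypothetical helper, and the `.wav` suffix is taken from the extension check above.

```python
from pathlib import Path


def find_missing_audio(file_names, audio_dir, suffix=".wav"):
    """Return the names that have no <name><suffix> file in audio_dir."""
    audio_dir = Path(audio_dir)
    return [
        name
        for name in file_names
        if not (audio_dir / f"{name}{suffix}").is_file()
    ]
```

With the notebook's variables this would be called as `find_missing_audio(df['file_name'], openslr_dir / "raw-data")`; an empty list means every transcription has its recording.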


## 2.2 Create the Split

In [8]:
train_df, test_df = train_test_split(df, test_size=0.1, random_state=seed)

In [9]:
print(train_df.shape[0], test_df.shape[0])

2277 253
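
The two counts add up to the 2530 rows loaded earlier. A further sanity check, sketched below with a hypothetical `check_partition` helper, is that the two splits share no file and together cover the whole index.

```python
def check_partition(all_names, train_names, test_names):
    """Report names that appear in both splits or in neither."""
    all_set = set(all_names)
    train_set = set(train_names)
    test_set = set(test_names)
    return {
        "overlap": sorted(train_set & test_set),
        "missing": sorted(all_set - (train_set | test_set)),
    }
```

Called as `check_partition(df['file_name'], train_df['file_name'], test_df['file_name'])`, both lists should come back empty for a clean split.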


# 3. Create Train Dataset and Test Dataset

## 3.1 Create Train and Test Directories 

In [10]:
data_dir = openslr_dir / "data/myanmar-speech-dataset-openslr-80"
train_dir = data_dir / "train"
test_dir = data_dir / "test"

train_dir.mkdir(parents=True, exist_ok=True)
test_dir.mkdir(parents=True, exist_ok=True)

## 3.2 Copy the training data to Train Directory

In [11]:
for index, row in train_df.iterrows():
    file_name = row['file_name'] + ".wav"
    source_file = openslr_dir / "raw-data" / file_name
    destination_file = train_dir / file_name
    
    shutil.copy2(source_file, destination_file)   

## 3.3 Copy the test data to Test Directory

In [12]:
for index, row in test_df.iterrows():
    file_name = row['file_name'] + ".wav"
    source_file = openslr_dir / "raw-data" / file_name
    destination_file = test_dir / file_name
    
    shutil.copy2(source_file, destination_file)   
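
The train and test loops differ only in their destination, so they could share one function. This is a sketch of that refactoring, not the notebook's own code; `copy_split` is a hypothetical name.

```python
import shutil
from pathlib import Path


def copy_split(file_names, source_dir, destination_dir, suffix=".wav"):
    """Copy <name><suffix> for each name from source_dir to destination_dir.

    Returns the number of files copied.
    """
    source_dir = Path(source_dir)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for name in file_names:
        file_name = f"{name}{suffix}"
        # copy2 also preserves file metadata such as modification times.
        shutil.copy2(source_dir / file_name, destination_dir / file_name)
        copied += 1
    return copied
```

Usage would look like `copy_split(train_df['file_name'], openslr_dir / "raw-data", train_dir)`, and likewise with `test_df` and `test_dir`.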

## 3.4 Save the metadata CSV

In [13]:
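# Prefix each name with its split folder. assign() returns new frames instead of
# modifying the train_test_split outputs in place.
train_df = train_df.assign(file_name="train/" + train_df['file_name'] + ".wav")
test_df = test_df.assign(file_name="test/" + test_df['file_name'] + ".wav")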

In [14]:
metadata_df = pd.concat([train_df, test_df])
metadata_df.reset_index(drop=True, inplace=True)

In [15]:
metadata_df.to_csv(data_dir / "metadata.csv", index=False)
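
The `file_name` column now holds paths relative to the folder containing metadata.csv, which, as far as I understand, is how the `audiofolder` loader used below resolves them. A quick check that each listed path exists can be sketched as follows; `missing_metadata_files` is a hypothetical helper.

```python
import csv
from pathlib import Path


def missing_metadata_files(data_dir, metadata_name="metadata.csv"):
    """Return file_name entries in the metadata CSV with no file under data_dir."""
    data_dir = Path(data_dir)
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            row["file_name"]
            for row in reader
            if not (data_dir / row["file_name"]).is_file()
        ]
```

`missing_metadata_files(data_dir)` should return an empty list once both copy steps have completed.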

# 4. Upload dataset to hub

In [16]:
from datasets import load_dataset

## 4.1 Load the dataset

In [17]:
dataset = load_dataset("audiofolder", data_dir="openslr-80/data/myanmar-speech-dataset-openslr-80")

In [18]:
dataset['train'][0]

{'audio': {'path': '/Users/chuu/Documents/myanmar-language-dataset-collection/Crowdsourced Burmese Speech Dataset/openslr-80/data/myanmar-speech-dataset-openslr-80/train/bur_0366_0045318711.wav',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000},
 'transcription': 'ဆိုတော့ တယ်လီဖုန်း အော်ပရေတာ ဖြစ်လာရင် ကော ဝန်ဆောင်မှုပိုင်း က ကောင်းနိုင်ပါ့မလား'}
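
The sample carries the raw waveform and its sampling rate, so the clip length follows directly from the two. A minimal sketch, assuming the dict-shaped `audio` entry shown in the output above (other versions of `datasets` may return a different object); `clip_duration_seconds` is a hypothetical helper.

```python
def clip_duration_seconds(sample):
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]
```

For instance, `clip_duration_seconds(dataset['train'][0])` gives the duration of the first training clip; an array of 48000 values at a sampling rate of 48000 corresponds to one second.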

## 4.2 Upload to hub

In [19]:
dataset.push_to_hub("chuuhtetnaing/myanmar-speech-dataset-openslr-80")

CommitInfo(commit_url='https://huggingface.co/datasets/chuuhtetnaing/myanmar-speech-dataset-openslr-80/commit/13edfa104c067f3fe68939b91130fccc929b6ff9', commit_message='Upload dataset', commit_description='', oid='13edfa104c067f3fe68939b91130fccc929b6ff9', pr_url=None, pr_revision=None, pr_num=None)