# Preprocessing of audio files
This notebook contains the preprocessing steps for the audio files and their labels.

## Importing required libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For audio processing
import librosa
import librosa.display
import IPython.display as ipd

# import custom functions
from modules.remove_zero_size_files import remove_zero_size_file
from modules.number_to_string import nepali_number_to_devnagari

In [2]:
import warnings
warnings.filterwarnings('ignore')

## Removing zero size files from labels
First, the zero-size audio files are removed from the labels, and the sentence for each remaining label is added by looking it up through `sentenceId`. The pre-created function `remove_zero_size_file()` can be used for this.

In [2]:
df = remove_zero_size_file(show_process=True, save_to_csv=True)

Current File Path is:- /home/rajmhrj/Documents/Major-Preprocessing/pythonFiles
Path of the audioFiles is found!
The list of the audio files found!
The list of the audio file size found!
Note:- Each index in file size represents same index in audio files list.
Path of data.json:- /home/rajmhrj/Documents/Major-Preprocessing/pythonFiles/labels/data.json
Path of data.json:- /home/rajmhrj/Documents/Major-Preprocessing/pythonFiles/labels/sentenceLabels.json
Json file loaded!
List of file with no zero bytes files found!
List of file with only zero byte files found!
The list of indexs in the files list to be removed found!
The zero bytes files are now removed from the dataframe.
New Dataframe creation completed!
File is now exported to a csv file.
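
The source of `remove_zero_size_file()` is not shown in this notebook, so the following is only a minimal sketch of the filtering step its log describes. It assumes the labels are held in a DataFrame with a `fileName` column and that each recording is stored as `<fileName>.wav`. The helper name `drop_zero_size_rows`, its parameters, and the choice to also drop rows whose file is missing are assumptions made for illustration.

```python
import os

import pandas as pd


def drop_zero_size_rows(labels, audio_dir, extension=".wav"):
    """Keep only the rows whose audio file exists and is larger than zero bytes.

    Hypothetical helper: dropping rows with a missing file is an assumption,
    not something the notebook's own function is known to do.
    """
    def has_audio(file_name):
        path = os.path.join(audio_dir, file_name + extension)
        return os.path.isfile(path) and os.path.getsize(path) > 0

    keep = labels["fileName"].map(has_audio)
    return labels[keep].reset_index(drop=True)
```

Calling `drop_zero_size_rows(labels, "audioFiles")` would return a copy of the labels without the empty recordings, leaving the original DataFrame untouched.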


In [3]:
df.head()

Unnamed: 0,_id,userId,fileName,gender,sentences
0,60e2d811552fd6002e30b8fd,L5WMnUqwFRlZUFg48A4DRu9dYwP9srB5s2cqsA/rDZg=,dd42d217-11b5-4107-b8c5-8c60939db63c,male,"रोमान्चक बनेको खेलमा बुलबुलेले आर्मीलाई २०-२५,..."
1,60e2d964552fd6002e30b901,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,744f492e-8e7f-4a1e-a13e-d65a0f0716c4,male,आईसीसी महिला टी-२० विश्वकपको एसिया छनोट एक महि...
2,61ab150e7526df002f75e921,3upRZGf2oFJMajP1LMVx5vNAMKY+PdM+rIdTvmQHUus=,cfea7257-9d98-481b-a4b2-aa8da3132cca,female,आईसीसी महिला टी-२० विश्वकपको एसिया छनोट एक महि...
3,60e2d958552fd6002e30b8fe,zcRQLjrvRyhg0PDjjhxlGJ1PoM7deRWnvlx08Ja1Wl4=,95ccf7f4-b198-4623-bcf8-45deb7f914e7,male,मैदानको वरिपरी ट्र्याक निर्माणको लागि ५ नम्बर ...
4,60e2d98b552fd6002e30b903,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,1ad1149b-be62-4099-8356-0e7c5f892374,male,तर त्यसपछि म रियल मड्रिडका लागि धेरै राम्रो गर...


In [4]:
# Total number of users that have participated in data collection
print(df['userId'].nunique())

145


## Converting Nepali numerals into Devanagari words
Sentences that contain numerals are difficult to process, so the numerals need to be converted into Devanagari words (for example, `२०` becomes `विस`). The pre-created function `nepali_number_to_devnagari()` can be used for this.

In [5]:
numberList = nepali_number_to_devnagari(returnNumberList=True, returnValue=False)

In [6]:
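devnagariSentence = []
for sentence in df['sentences']:
    number = ''  # digits collected for the number currently being read
    currentSentence = ''
    for char in sentence:
        if char in numberList:
            number += char
            continue
        if len(number) > 0:
            # The number has ended: write it out in words before this character
            currentSentence += nepali_number_to_devnagari(number, returnNumberList=False, returnValue=True)
            number = ''
        currentSentence += char
    if len(number) > 0:
        # Flush a number that ends the sentence so it is not carried into the next one
        currentSentence += nepali_number_to_devnagari(number, returnNumberList=False, returnValue=True)
    devnagariSentence.append(currentSentence)
df['devnagariSentence'] = devnagariSentence
df.drop('sentences', axis=1, inplace=True)
df.to_csv('./labels/combined.csv')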

In [7]:
df.head()

Unnamed: 0,_id,userId,fileName,gender,devnagariSentence
0,60e2d811552fd6002e30b8fd,L5WMnUqwFRlZUFg48A4DRu9dYwP9srB5s2cqsA/rDZg=,dd42d217-11b5-4107-b8c5-8c60939db63c,male,रोमान्चक बनेको खेलमा बुलबुलेले आर्मीलाई विस-पच...
1,60e2d964552fd6002e30b901,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,744f492e-8e7f-4a1e-a13e-d65a0f0716c4,male,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
2,61ab150e7526df002f75e921,3upRZGf2oFJMajP1LMVx5vNAMKY+PdM+rIdTvmQHUus=,cfea7257-9d98-481b-a4b2-aa8da3132cca,female,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
3,60e2d958552fd6002e30b8fe,zcRQLjrvRyhg0PDjjhxlGJ1PoM7deRWnvlx08Ja1Wl4=,95ccf7f4-b198-4623-bcf8-45deb7f914e7,male,मैदानको वरिपरी ट्र्याक निर्माणको लागि पाँच नम्...
4,60e2d98b552fd6002e30b903,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,1ad1149b-be62-4099-8356-0e7c5f892374,male,तर त्यसपछि म रियल मड्रिडका लागि धेरै राम्रो गर...
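
The character loop used for this conversion can also be written with a regular expression that finds each run of Devanagari digits and hands it to a converter. The sketch below is one possible way to do that, not the notebook's own implementation. `replace_digit_runs` and `spell_digits` are hypothetical names, and `spell_digits` is only a stand-in that reads a number digit by digit; it does not reproduce the output of `nepali_number_to_devnagari()`, which writes `२०` as `विस`.

```python
import re

# Devanagari digits occupy one contiguous Unicode range, U+0966 to U+096F.
DIGIT_RUN = re.compile("[०-९]+")

# Stand-in data for illustration only: the word for each single digit.
DIGIT_WORDS = {
    "०": "शून्य", "१": "एक", "२": "दुई", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छ", "७": "सात", "८": "आठ", "९": "नौ",
}


def spell_digits(number):
    """Hypothetical converter: read the number one digit at a time."""
    return " ".join(DIGIT_WORDS[digit] for digit in number)


def replace_digit_runs(sentence, convert):
    """Replace every run of Devanagari digits with convert(run)."""
    return DIGIT_RUN.sub(lambda match: convert(match.group(0)), sentence)
```

Because the pattern matches a whole run at once, a number at the very end of a sentence is converted like any other. In the notebook, the converter passed in would presumably be a small wrapper around `nepali_number_to_devnagari(number, returnNumberList=False, returnValue=True)`.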


## Appending newer data

In [3]:
df2 = pd.read_csv('./labels/final.csv')

In [5]:
df2.shape

(710, 5)

In [21]:
df.iloc[df2.shape[0]+20:].head()

Unnamed: 0,_id,userId,fileName,gender,devnagariSentence
611,61c356a2b692600033b25b6d,MOVPpUJDH7otjjt98PfZhP/IutL34ToVt1LoaV8OAvI=,11d816d5-96aa-4d07-9cc6-1c3660820d25,male,बंगलादेशका फरवार्ड साद उद्दिनको क्रस गरेको बलम...
612,61c3adcdb692600033b25b6e,bngHlX9oKlh6ZUjRjewM2TWP7j4pr9sRq1T0eMZVBDQ=,f8e6b54f-575a-4871-93f6-c189acdbb69a,male,यहि चैत्र उनन्तिस–एकत्तिस गतेसम्म सातदोबाटोस्थ...
613,61c3adeeb692600033b25b6f,bngHlX9oKlh6ZUjRjewM2TWP7j4pr9sRq1T0eMZVBDQ=,933bb79b-4b8c-4e73-8a20-dc1486f77527,male,जर्मन फुटबल एशोसिएसनका अनुसार टेर स्टेगेनको शल...
614,61c3ae00b692600033b25b70,bngHlX9oKlh6ZUjRjewM2TWP7j4pr9sRq1T0eMZVBDQ=,afd3133f-2571-4f30-9a17-797ae18c2180,male,सीताले टी-विस मात्र खेल्दा हतार हुने भएकोले ला...
615,61c3ae12b692600033b25b71,bngHlX9oKlh6ZUjRjewM2TWP7j4pr9sRq1T0eMZVBDQ=,811dad86-4e62-4c61-9d48-326ba77adfe9,male,सुरुआती अग्रता लिएको मच्छिन्द्रको लागि पुजन उप...


In [12]:
df2.iloc[df2.shape[0]-1:]

Unnamed: 0,_id,userId,fileName,gender,devnagariSentence
590,61c34d3bb692600033b25b6a,sbERMGnhuklA9yLoI1reL+mdg6NDufPqO2aqk+lLDwE=,49c0c60d-4b16-4784-a3d1-1bcaa687df55,male,त्यसैल अनुसार बोर्डले यूएईलाई प्रस्ताव गरेको थ...


In [22]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df2 = pd.concat([df2, df.iloc[df2.shape[0]+20:]])

In [23]:
df2.shape

(712, 5)

In [24]:
df2.to_csv('./labels/final.csv', index=False)
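
The cells above pick the new rows by position (`df.iloc[df2.shape[0]+20:]`), which depends on the row order and on a hand-chosen offset. As an alternative, the sketch below selects the new rows by key instead. It assumes that `fileName` identifies a recording uniquely, which the UUID-like values suggest but the notebook does not state. `append_new_rows` is a hypothetical helper name.

```python
import pandas as pd


def append_new_rows(existing, incoming, key="fileName"):
    """Append the rows of `incoming` whose key is not already in `existing`."""
    new_rows = incoming[~incoming[key].isin(existing[key])]
    return pd.concat([existing, new_rows], ignore_index=True)
```

With this helper, `df2 = append_new_rows(df2, df)` could be run repeatedly without adding the same recording twice.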

Removing the files that are not spoken correctly. The cell below adds up the duration of the recordings listed in `df2`.

In [6]:
%%time
total_time = 0
count = 0
for i in df2["fileName"]:
    filename = os.path.join("audioFiles", i + ".wav")
    audio, sr = librosa.load(filename)
    # Pass the sample rate explicitly so the duration matches the loaded audio
    total_time = total_time + librosa.get_duration(y=audio, sr=sr)
    count = count + 1

CPU times: user 5min 31s, sys: 21 s, total: 5min 52s
Wall time: 6min 38s


In [9]:
# total_time is in seconds; convert it to hours
total_hours = (total_time / 60) / 60
print("{:.2f} Hrs".format(total_hours))

1.64 Hrs
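
The duration loop takes several minutes because `librosa.load` decodes and resamples every recording. If the files are uncompressed PCM WAV files, which is an assumption here, the duration can be read from the file header with the standard-library `wave` module instead. The sketch below shows that approach; the function names are hypothetical, and the `audioFiles` directory is taken from the cell above.

```python
import os
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file using only its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_duration_hours(file_names, audio_dir="audioFiles"):
    """Sum the durations of <audio_dir>/<name>.wav and return the total in hours."""
    total_seconds = sum(
        wav_duration_seconds(os.path.join(audio_dir, name + ".wav"))
        for name in file_names
    )
    return total_seconds / 3600
```

A call such as `total_duration_hours(df2["fileName"])` should report roughly the same total as the librosa loop, since resampling does not change a recording's length. `wave.open` raises `wave.Error` for files it cannot parse, so compressed or malformed files would still need librosa.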
