# Preprocessing of audio files
This notebook contains the preprocessing steps for the audio files and their labels.

## Importing required libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For audio processing
import librosa
import librosa.display
import IPython.display as ipd

# import custom functions
from modules.remove_zero_size_files import remove_zero_size_file
from modules.number_to_string import nepali_number_to_devnagari

## Removing zero size files from labels
First, entries for zero-size audio files are removed from the labels, and each remaining entry gets its sentence attached via its `sentenceId`. The pre-written function `remove_zero_size_file()` handles this.

In [3]:
df = remove_zero_size_file(show_process=True, save_to_csv=True)

Current File Path is:- G:\Projects\major-project-processing\pythonFiles
Path of the audioFiles is found!
The list of the audio files found!
The list of the audio file size found!
Note:- Each index in file size represents same index in audio files list.
Path of data.json:- G:\Projects\major-project-processing\pythonFiles\labels/data.json
Path of data.json:- G:\Projects\major-project-processing\pythonFiles\labels/sentenceLabels.json
Json file loaded!
List of file with no zero bytes files found!
List of file with only zero byte files found!
The list of indexs in the files list to be removed found!
The zero bytes files are now removed from the dataframe.
New Dataframe creation completed!
File is now exported to a csv file.
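
The internals of `remove_zero_size_file()` are not shown in this notebook. As a rough sketch of the idea only, the filtering step might look like the following. The name `drop_zero_size_rows`, its parameters, and the demo file names are hypothetical and are not part of the project's `modules` package.

```python
import os
import tempfile

import pandas as pd


def drop_zero_size_rows(df, audio_dir, name_col="fileName", ext=".wav"):
    """Return a copy of df without rows whose audio file is missing or empty.

    Hypothetical helper: an illustration, not the project's implementation.
    """
    def has_audio(name):
        path = os.path.join(audio_dir, name + ext)
        return os.path.isfile(path) and os.path.getsize(path) > 0

    keep = df[name_col].map(has_audio)
    return df[keep].reset_index(drop=True)


# Small demo with throwaway files: one with content, one empty.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.wav"), "wb") as f:
        f.write(b"\x00\x01")
    open(os.path.join(tmp, "empty.wav"), "wb").close()

    labels = pd.DataFrame({"fileName": ["good", "empty"]})
    cleaned = drop_zero_size_rows(labels, tmp)

print(cleaned["fileName"].tolist())
```

Checking the size on disk avoids decoding any audio, so a pass like this stays cheap even for a large collection.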


In [4]:
df.head()

Unnamed: 0,_id,userId,fileName,gender,sentences
0,60e2d811552fd6002e30b8fd,L5WMnUqwFRlZUFg48A4DRu9dYwP9srB5s2cqsA/rDZg=,dd42d217-11b5-4107-b8c5-8c60939db63c,male,"रोमान्चक बनेको खेलमा बुलबुलेले आर्मीलाई २०-२५,..."
1,60e2d964552fd6002e30b901,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,744f492e-8e7f-4a1e-a13e-d65a0f0716c4,male,आईसीसी महिला टी-२० विश्वकपको एसिया छनोट एक महि...
2,61ab150e7526df002f75e921,3upRZGf2oFJMajP1LMVx5vNAMKY+PdM+rIdTvmQHUus=,cfea7257-9d98-481b-a4b2-aa8da3132cca,female,आईसीसी महिला टी-२० विश्वकपको एसिया छनोट एक महि...
3,60e2d958552fd6002e30b8fe,zcRQLjrvRyhg0PDjjhxlGJ1PoM7deRWnvlx08Ja1Wl4=,95ccf7f4-b198-4623-bcf8-45deb7f914e7,male,कार्यक्रममा राष्ट्रिय क्रिकेट टोलीका सदस्यहरुम...
4,60e2d98b552fd6002e30b903,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,1ad1149b-be62-4099-8356-0e7c5f892374,male,तर त्यसपछि म रियल मड्रिडका लागि धेरै राम्रो गर...


In [5]:
# Total number of users who have participated in data collection
print(df['userId'].nunique())

129


## Converting Nepali numerals into Devanagari word form
Sentences that contain numerals are harder to process, so the numbers need to be converted to their Devanagari word form. The pre-written `nepali_number_to_devnagari()` function can be used for this.

In [6]:
numberList = nepali_number_to_devnagari(returnNumberList=True, returnValue=False)

In [7]:
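devnagariSentence = []
for sentence in df['sentences']:
    currentSentence = ''
    number = ''  # digits of the number currently being read; reset per sentence
    for char in sentence:
        if char in numberList:
            # Keep collecting digits until the number ends
            number += char
            continue
        if number:
            # A number just ended: write it out in word form
            currentSentence += nepali_number_to_devnagari(number, returnNumberList=False, returnValue=True)
            number = ''
        currentSentence += char
    if number:
        # Flush a number that ends the sentence so it is not carried into the next one
        currentSentence += nepali_number_to_devnagari(number, returnNumberList=False, returnValue=True)
    devnagariSentence.append(currentSentence)
df['devnagariSentence'] = devnagariSentence
df.drop('sentences', axis=1, inplace=True)
# df.to_csv('./labels/final.csv')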
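
The character loop above can also be written as a single regular-expression substitution. The sketch below assumes that `numberList` holds the Devanagari digits ० to ९, which the notebook does not confirm. `replace_numbers` and `mark` are hypothetical names, and `mark` is only a placeholder for `nepali_number_to_devnagari()`.

```python
import re

# Assumption: the numerals to replace are the Devanagari digits ० to ९.
DEVANAGARI_NUMBER = re.compile(r"[०-९]+")


def replace_numbers(sentence, to_words):
    """Replace every run of Devanagari digits with to_words(run)."""
    return DEVANAGARI_NUMBER.sub(lambda m: to_words(m.group()), sentence)


# Placeholder converter for demonstration only. In the notebook this role
# is played by nepali_number_to_devnagari(number, returnNumberList=False,
# returnValue=True).
def mark(digits):
    return "<" + digits + ">"


print(replace_numbers("टी-२० विश्वकप २०२१", mark))
```

Because the pattern matches whole runs of digits, a number at the very end of a sentence is converted like any other.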

In [8]:
df.head()

Unnamed: 0,_id,userId,fileName,gender,devnagariSentence
0,60e2d811552fd6002e30b8fd,L5WMnUqwFRlZUFg48A4DRu9dYwP9srB5s2cqsA/rDZg=,dd42d217-11b5-4107-b8c5-8c60939db63c,male,रोमान्चक बनेको खेलमा बुलबुलेले आर्मीलाई विस-पच...
1,60e2d964552fd6002e30b901,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,744f492e-8e7f-4a1e-a13e-d65a0f0716c4,male,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
2,61ab150e7526df002f75e921,3upRZGf2oFJMajP1LMVx5vNAMKY+PdM+rIdTvmQHUus=,cfea7257-9d98-481b-a4b2-aa8da3132cca,female,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
3,60e2d958552fd6002e30b8fe,zcRQLjrvRyhg0PDjjhxlGJ1PoM7deRWnvlx08Ja1Wl4=,95ccf7f4-b198-4623-bcf8-45deb7f914e7,male,कार्यक्रममा राष्ट्रिय क्रिकेट टोलीका सदस्यहरुम...
4,60e2d98b552fd6002e30b903,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,1ad1149b-be62-4099-8356-0e7c5f892374,male,तर त्यसपछि म रियल मड्रिडका लागि धेरै राम्रो गर...


## Checking audio files
Each audio file is played back to check its speech and how accurately it matches its label.

In [2]:
df = pd.read_csv('./labels/final.csv')

In [3]:
df.shape

(563, 5)

In [4]:
df.head()

Unnamed: 0,_id,userId,fileName,gender,devnagariSentence
0,60e2d811552fd6002e30b8fd,L5WMnUqwFRlZUFg48A4DRu9dYwP9srB5s2cqsA/rDZg=,dd42d217-11b5-4107-b8c5-8c60939db63c,male,रोमान्चक बनेको खेलमा बुलबुलेले आर्मीलाई विस-पच...
1,60e2d964552fd6002e30b901,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,744f492e-8e7f-4a1e-a13e-d65a0f0716c4,male,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
2,61ab150e7526df002f75e921,3upRZGf2oFJMajP1LMVx5vNAMKY+PdM+rIdTvmQHUus=,cfea7257-9d98-481b-a4b2-aa8da3132cca,female,आईसीसी महिला टी-विस विश्वकपको एसिया छनोट एक मह...
3,60e2d958552fd6002e30b8fe,zcRQLjrvRyhg0PDjjhxlGJ1PoM7deRWnvlx08Ja1Wl4=,95ccf7f4-b198-4623-bcf8-45deb7f914e7,male,कार्यक्रममा राष्ट्रिय क्रिकेट टोलीका सदस्यहरुम...
4,60e2d98b552fd6002e30b903,4UO9IETvAMYKoPU5GhL4DRjqb5rNgF1FpAKkyXQ9v/c=,1ad1149b-be62-4099-8356-0e7c5f892374,male,तर त्यसपछि म रियल मड्रिडका लागि धेरै राम्रो गर...


In [9]:
number = 130
row = df.iloc[number]
filename = os.path.join("audioFiles", row["fileName"] + ".wav")
print(row["gender"])
print(row["fileName"] + ".wav")
print(row["devnagariSentence"])
ipd.Audio(filename, autoplay=True)

male
dd87c472-27c7-4558-9d47-1e22bc0e2c16.wav
पानी पर्न छाडेपछि तेह्र ओभरमा खेल घटाएर पुनः शुरु गरिएको हो


In [18]:
# df.drop(index=number, axis=0, inplace=True)

In [9]:
df.to_csv('./labels/final.csv', index=False)
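
Instead of dropping one row at a time, the rejected row numbers could be collected while listening and removed in one step. This is a minimal sketch with made-up data; `drop_rejected` and the example indices are hypothetical.

```python
import pandas as pd


def drop_rejected(df, rejected):
    """Return df without the rows at the given index labels, renumbered from 0."""
    return df.drop(index=list(rejected)).reset_index(drop=True)


labels = pd.DataFrame({"fileName": ["a", "b", "c", "d"]})
kept = drop_rejected(labels, rejected=[1, 3])
print(kept["fileName"].tolist())
```

Resetting the index keeps positions from `df.iloc` and index labels in agreement after rows have been removed.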

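Recordings that are not spoken correctly are removed from the labels with the `df.drop` cell above. The total duration of the remaining recordings is then measured.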

In [11]:
%%time
total_time = 0
count = 0
for i in df["fileName"]:
    filename = os.path.join("audioFiles", i + ".wav")
    # Decode the file, then add its length in seconds to the running total
    audio, sr = librosa.load(filename)
    total_time = total_time + librosa.get_duration(y=audio, sr=sr)
    count = count + 1

Wall time: 7min 56s
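
Decoding every file with `librosa.load` took about eight minutes in the run above. If only the duration is needed, it may be quicker to read it from the file header. The sketch below uses the standard-library `wave` module and assumes the recordings are uncompressed PCM WAV files, which the notebook does not state. `wav_duration_seconds` is a hypothetical helper.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration from the WAV header: number of frames / frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo: write one second of 16-bit mono silence at 16 kHz and measure it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    duration = wav_duration_seconds(demo_path)

print(duration)
```

The `wave` module raises an error for WAV files that are not PCM encoded, so in that case decoding with `librosa` remains the fallback.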


In [12]:
time = (total_time / 60) / 60
print("{} Hrs".format(time))

1.2920378054925663 Hrs
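
A fractional hour count is hard to read at a glance; the value above is about 1 hour 17 minutes. A small formatter such as the following could print the total as hours, minutes, and seconds. `format_duration` is a hypothetical helper that takes the duration in seconds, as accumulated in `total_time`.

```python
def format_duration(seconds):
    """Format a duration in seconds as HH:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


print(format_duration(3661))
```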


