# Singaporean Speech Classification with Neural Networks - Capstone Project

## Introduction to this Notebook

This notebook documents my thought process for data collection and the scoping of the project. It goes through how I efficiently processed a large dataset of audio files and created my train-test set from scratch, using Linux commands (on Git Bash) and the `os` library.

If you wish to proceed directly to the EDA, preprocessing and modelling, go straight to the other notebooks.

## Approach to Data Collection

The IMDA National Speech Corpus has thousands of hours of recordings, with a total size of more than 1TB. Due to several limitations, I have to reduce the number of audio files that I will be working with. Furthermore, with only 3 weeks to complete this project, I cannot use too many audio files, as it would take too long to develop and train the models.

**Limitations**

- My computer does not have sufficient storage space or RAM to work with so many audio files.
- Dropbox has restrictions which do not allow me to download everything in one go. I have to download one speaker at a time.
- I only have 3-4 weeks to complete this entire project.

As such, in order to build a multi-class classification model, I need to identify the labels, the number of audio files available per label, and their respective file names.

**Dataset I Will Be Using**

I have decided to use audio from PART 1, Channel 0 and Session 0. It contains recordings of phonetically balanced scripts read by about 1000 local English speakers, recorded in quiet rooms using a headset/microphone. Session 0 refers to the first of the two recording sessions.

In each audio file, the speaker reads out one sentence. I only need one word from each sentence, so that word has to be extracted manually with audio-editing software.


**Steps for Scoping**

1. Filter the transcript to find the sentences with the fewest words - *the lower the word count, the smaller the file size, saving space on my computer*

2. Identify the most commonly used words - *the more frequently a word is used, the more data I potentially will have for my training set, improving the quality of my model*

3. Pick 5 words as my labels for the multi-class classification model

**Steps for Data Collection**

1. Identify all speakers who have used the words of interest (labels), and their respective audio filepaths.

2. From the entire folder of audio files, extract the audio files I need

3. Extract the specific word I need from each audio file. 

4. Create the train-test set

## 1. Examine Transcript

Import and clean transcript to determine the labels for my classification model.

In [13]:
# data science essentials
import numpy as np
import pandas as pd

# for interacting with file explorer
import os
from os import listdir

# for audio related data
import librosa

# for exploring the transcript
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords


### 1.1 IMPORT DATA AND CLEANING

The transcripts folder contains transcripts from both sessions.


**Get File Names of Transcripts for Session 0 for Iteration**

IMDA held two recording sessions, named `session 0` and `session 1`. As I am only interested in one session, I will be looking at the corpus of session 0 only.
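
The file naming convention behind this filter can be sketched as a small helper. This is only an illustration: the function name `split_transcript_name` is hypothetical, and the layout (one leading zero, a 4-digit speaker code, then a 1-digit session code) is an assumption inferred from the file names used in this notebook, not taken from IMDA's documentation.

```python
import os

def split_transcript_name(filename):
    """Split a transcript file name such as '000010.TXT' into (speaker, session).

    Assumed layout: one leading zero, a 4-digit speaker code,
    then a 1-digit session code.
    """
    stem, _ext = os.path.splitext(filename)
    return stem[1:5], stem[5]

# keep only the session 0 transcripts from a small sample of file names
sample_files = ['000010.TXT', '000011.TXT', '000020.TXT', '000021.TXT']
session_0_sample = [f for f in sample_files if split_transcript_name(f)[1] == '0']
print(session_0_sample)
```

Naming the two parts of the file name makes the intent of a check such as `f[5] == '0'` easier to read when the code is revisited later.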

In [15]:
file_list = os.listdir('./assets/transcript/')
file_list

# get the file names of the transcripts, but this includes both sessions. We just want those from session 0.

['000010.TXT',
 '000011.TXT',
 '000020.TXT',
 '000021.TXT',
 '000030.TXT',
 '000031.TXT',
 '000040.TXT',
 '000041.TXT',
 '000050.TXT',
 '000060.TXT',
 '000061.TXT',
 '000070.TXT',
 '000071.TXT',
 '000080.TXT',
 '000081.TXT',
 '000090.TXT',
 '000091.TXT',
 '000100.TXT',
 '000110.TXT',
 '000111.TXT',
 '000120.TXT',
 '000121.TXT',
 '000130.TXT',
 '000131.TXT',
 '000140.TXT',
 '000141.TXT',
 '000160.TXT',
 '000161.TXT',
 '000170.TXT',
 '000171.TXT',
 '000180.TXT',
 '000181.TXT',
 '000190.TXT',
 '000191.TXT',
 '000200.TXT',
 '000201.TXT',
 '000210.TXT',
 '000211.TXT',
 '000220.TXT',
 '000221.TXT',
 '000230.TXT',
 '000231.TXT',
 '000240.TXT',
 '000241.TXT',
 '000250.TXT',
 '000251.TXT',
 '000260.TXT',
 '000261.TXT',
 '000270.TXT',
 '000271.TXT',
 '000280.TXT',
 '000281.TXT',
 '000290.TXT',
 '000291.TXT',
 '000300.TXT',
 '000301.TXT',
 '000310.TXT',
 '000311.TXT',
 '000320.TXT',
 '000321.TXT',
 '000330.TXT',
 '000331.TXT',
 '000340.TXT',
 '000341.TXT',
 '000350.TXT',
 '000351.TXT',
 '000360.T

In [16]:
# create a list of file paths
session_0 = []
for f in file_list:
    # transcripts from session 0 are marked with '0' at index 5 (the sixth character)
    if f[5] == '0':
        session_0.append("./assets/transcript/" + f)
session_0

['./assets/transcript/000010.TXT',
 './assets/transcript/000020.TXT',
 './assets/transcript/000030.TXT',
 './assets/transcript/000040.TXT',
 './assets/transcript/000050.TXT',
 './assets/transcript/000060.TXT',
 './assets/transcript/000070.TXT',
 './assets/transcript/000080.TXT',
 './assets/transcript/000090.TXT',
 './assets/transcript/000100.TXT',
 './assets/transcript/000110.TXT',
 './assets/transcript/000120.TXT',
 './assets/transcript/000130.TXT',
 './assets/transcript/000140.TXT',
 './assets/transcript/000160.TXT',
 './assets/transcript/000170.TXT',
 './assets/transcript/000180.TXT',
 './assets/transcript/000190.TXT',
 './assets/transcript/000200.TXT',
 './assets/transcript/000210.TXT',
 './assets/transcript/000220.TXT',
 './assets/transcript/000230.TXT',
 './assets/transcript/000240.TXT',
 './assets/transcript/000250.TXT',
 './assets/transcript/000260.TXT',
 './assets/transcript/000270.TXT',
 './assets/transcript/000280.TXT',
 './assets/transcript/000290.TXT',
 './assets/transcrip

In [17]:
def to_df(file_list):
    """
    Create a dataframe ready for vectorising and further processing.
    
    Each text file contains the transcript of every sentence twice, on two consecutive rows.
    Counting from zero, the even-numbered rows contain the original sentence, while the
    odd-numbered rows repeat it in lower case and without punctuation.
    
    The odd-numbered rows do not have an ID, since each shares the ID of the row before it.
    
    For our case, the odd-numbered rows are far more useful, saving us time as they are already
    partly preprocessed.
    
    This function returns the odd-numbered rows only, giving us sentences in lower case and without
    punctuation. It also assigns each row its respective ID for future use.
    """
    # create an empty dataframe which will house all the dataframes
    master_df = pd.DataFrame(columns = ['id', 'text', 'speaker', 'session', 'line'])
    count = 0
    for f in file_list: 
        # read the tab-separated text file (sep is passed by keyword, as newer pandas versions require)
        df = pd.read_csv(f, sep='\t', header = None, dtype={0: object})

        # rename the columns as the files do not come with headers
        df.rename(columns = {0: 'id', 1 : 'text'}, inplace = True)

        # copy the id of even number rows into odd number rows
        for n in range(0, len(df['id']), 2):
            df.at[n+1, 'id'] = df['id'][n]

        # drop the even number rows and reset index
        df.drop(index = list(range(0, len(df),2)), inplace=True)
        df.reset_index(inplace=True, drop=True)

        # creating three more columns which will be useful when we combine all of the speakers' dataset
        speaker_list = []
        session_list = []
        line_list = []

        for i in df['id']:    
            speaker_list.append(i[-8:-4])
            session_list.append(i[-4])
            line_list.append(i[-3:])

        df['speaker'] = speaker_list    
        df['session'] = session_list
        df['line'] = line_list

        # finally, strip the markers "<SPK/>", "<NON/>" and "**". These represent cracking noise in the audio
        # which may not be useful for our case here.

        df['text'] = df['text'].map(lambda x: x.replace("<SPK/> ", "").replace(" <SPK/>", "").replace("<NON/> ","").replace(" <NON/>","").replace("** ", "").replace(" **", "").lower())
        
        # merge it to the main dataframe
        master_df = master_df.merge(df, how='outer')

        # track the progress
        count +=1
        if count%100 == 0:
            print(f"Finished {count} out of {len(file_list)}")
    
    print(f"Finished {count} out of {len(file_list)}")
    return master_df

In [18]:
transcript_df = to_df(session_0)

Finished 100 out of 1024
Finished 200 out of 1024
Finished 300 out of 1024
Finished 400 out of 1024
Finished 500 out of 1024
Finished 600 out of 1024
Finished 700 out of 1024
Finished 800 out of 1024
Finished 900 out of 1024
Finished 1000 out of 1024
Finished 1024 out of 1024
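
The row pairing that `to_df` relies on can be sketched on a tiny in-memory sample. This is a minimal sketch, not the notebook's actual data: the two-sentence sample, including the capitalised and punctuated versions of the sentences, is hypothetical and only follows the layout described in the docstring. It uses `ffill` to carry each ID down to the row below it, which is one alternative to looping over the rows.

```python
import io
import pandas as pd

# hypothetical transcript: an ID row with the original sentence,
# followed by a row without ID holding the normalised sentence
sample = (
    "000010001\tThere were barrels of wine in the huge cellar.\n"
    "\tthere were barrels of wine in the huge cellar\n"
    "000010006\tHe gulped down his beer.\n"
    "\the gulped down his beer\n"
)

raw = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                  names=["id", "text"], dtype={"id": object})

# copy each ID down to the normalised row that follows it
raw["id"] = raw["id"].ffill()

# keep the normalised rows only (positions 1, 3, 5, ...)
clean = raw.iloc[1::2].copy().reset_index(drop=True)

# split the 9-character ID into its parts, as to_df does
clean["speaker"] = clean["id"].str[-8:-4]
clean["session"] = clean["id"].str[-4]
clean["line"] = clean["id"].str[-3:]
print(clean)
```

Reading the ID column as `object` keeps the leading zeros, which is what makes the fixed-position slicing of the ID work.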


In [19]:
# check dataframe

transcript_df.head()

|   | id | text | speaker | session | line |
|---|---|---|---|---|---|
| 0 | 10001 | there were barrels of wine in the huge cellar | 1 | 0 | 1 |
| 1 | 10002 | she won a car because she was the twelfth pers... | 1 | 0 | 2 |
| 2 | 10003 | as they walked back they were shocked to see a... | 1 | 0 | 3 |
| 3 | 10005 | heavy rains caused a flood in the village | 1 | 0 | 5 |
| 4 | 10006 | he gulped down his beer | 1 | 0 | 6 |


In [20]:
# check for null values
transcript_df.isnull().sum()

id         0
text       0
speaker    0
session    0
line       0
dtype: int64

There are no null values.

**Convert to CSV**

In [21]:
# save as csv
transcript_df.to_csv("./datasets/transcripts_session_0.csv", index=False)

### 1.2 Transcript EDA

After having the corpus for `session 0`, I want to find out the most common words. This is to identify how much audio data I can potentially have.

**Identify Most Frequently Used Words**

Approach:

1. Tokenise each document and do a Count Vectorisation
2. Sort values from largest to smallest
3. Decide which labels to be used for classification
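
The counting idea behind this approach can be sketched with the standard library alone. This is a simplified stand-in for `CountVectorizer`, not the code used in this notebook: the three sentences and the tiny stop-word set are hypothetical, and tokenising with `split()` is cruder than what scikit-learn does.

```python
from collections import Counter

# hypothetical stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "in", "a", "of"}

def count_words(sentences, stop_words=STOP_WORDS):
    """Count how often each non-stop-word appears across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(w for w in sentence.split() if w not in stop_words)
    return counts

sample = [
    "the flowers bloom in april",
    "father bought the flowers",
    "the worker waters the flowers",
]
counts = count_words(sample)

# sort from most to least frequent and keep the top 3
print(counts.most_common(3))
```

Words near the top of such a list are the ones with the most recordings available, which is why they make good label candidates.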

In [22]:
# open csv file
transcript_df = pd.read_csv("./datasets/transcripts_session_0.csv")

In [23]:
# ensure import is correctly done
transcript_df.isnull().sum()

id         0
text       0
speaker    0
session    0
line       0
dtype: int64

In [24]:
transcript_df.head()

|   | id | text | speaker | session | line |
|---|---|---|---|---|---|
| 0 | 10001 | there were barrels of wine in the huge cellar | 1 | 0 | 1 |
| 1 | 10002 | she won a car because she was the twelfth pers... | 1 | 0 | 2 |
| 2 | 10003 | as they walked back they were shocked to see a... | 1 | 0 | 3 |
| 3 | 10005 | heavy rains caused a flood in the village | 1 | 0 | 5 |
| 4 | 10006 | he gulped down his beer | 1 | 0 | 6 |


In [25]:
len(transcript_df)

380742

There are 380742 audio files in total from this recording session. This could be too much for my computer to download and process.

In [27]:
len(transcript_df['speaker'].unique())

1024

There are a total of 1024 speakers in this session. On average, each speaker has 371-372 audio files. For the scope of this project, I may not need all the audio files.

In [29]:
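# vectorise the corpus, eliminating stopwords and return top 100 features
# (the NLTK stopword list is named "english" in lower case, which matters on case-sensitive file systems)
cvec = CountVectorizer(stop_words = stopwords.words("english"), max_features = 100)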
corpus = cvec.fit_transform(transcript_df['text'])

In [30]:
# create a vectorised dataframe
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
corpus_df = pd.DataFrame(corpus.todense(), columns= cvec.get_feature_names_out())

In [31]:
# top 30 words of the entire corpus
corpus_df.sum().sort_values(ascending = False)[0:30]

also         22222
one          13092
singapore    12720
time          8378
would         8357
could         8048
year          7474
said          7222
people        6999
first         6960
get           6607
new           6501
make          6421
family        6385
take          6300
two           6291
way           6268
think         6145
car           6094
many          6055
even          6050
years         5926
good          5516
still         5424
may           5379
like          5372
always        5249
another       5015
last          5006
day           4973
dtype: int64

This list contains the most frequently used words, along with the number of times each word appeared in the corpus.

It may be helpful for other data scientists who wish to use this list to expand their models or perform other forms of analysis.

However, it is still difficult to select my labels from this list, because the corresponding audio would still be too much for my computer to process. To narrow it down further, I have decided to limit the selection to audio files with lower word counts.

**Identify Most Frequently Used Words in Sentences with Fewer than 7 Words**

In [33]:
# create new column to see the number of word count
transcript_df['wordcount'] = transcript_df['text'].map(lambda x: len(x.split(' ')))
transcript_df.head()

|   | id | text | speaker | session | line | wordcount |
|---|---|---|---|---|---|---|
| 0 | 10001 | there were barrels of wine in the huge cellar | 1 | 0 | 1 | 9 |
| 1 | 10002 | she won a car because she was the twelfth pers... | 1 | 0 | 2 | 15 |
| 2 | 10003 | as they walked back they were shocked to see a... | 1 | 0 | 3 | 18 |
| 3 | 10005 | heavy rains caused a flood in the village | 1 | 0 | 5 | 8 |
| 4 | 10006 | he gulped down his beer | 1 | 0 | 6 | 5 |
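
One detail of the word count is worth a small sketch. `split(' ')` treats every single space as a separator, so a sentence with consecutive or trailing spaces would be counted as having extra words, whereas `split()` with no argument ignores them. The `messy` sentence below is hypothetical; I have not checked whether the transcripts actually contain such spacing.

```python
def word_count(sentence):
    """Count words, ignoring repeated, leading or trailing spaces."""
    return len(sentence.split())

clean = "he gulped down his beer"
messy = "he gulped  down his beer "   # hypothetical: double space and trailing space

# split(' ') versus split() on a clean sentence: both give the same count
print(len(clean.split(' ')), word_count(clean))

# on the messy sentence, split(' ') also counts the empty strings
print(len(messy.split(' ')), word_count(messy))
```

If the two counts ever differ on real data, a sentence could be wrongly excluded by the word-count filter.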


**I want to find the sentences with lower word counts, which will make it easier to process the audio files**

In [34]:
# identify documents with 6 or fewer words
transcript_7_df = transcript_df[transcript_df['wordcount'] < 7]

In [35]:
# transform it with Count Vectoriser to determine most used words
cvec_7 = CountVectorizer(stop_words = stopwords.words("english"), max_features = 100)
corpus_7 = cvec_7.fit_transform(transcript_7_df['text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_7_df = pd.DataFrame(corpus_7.todense(), columns= cvec_7.get_feature_names_out())

In [36]:
# the top 60 words which appeared, these are potential words for my classifier (aka labels)
corpus_7_df.sum().sort_values(ascending = False)[0:60]

singapore     1146
like          1104
could         1089
great         1014
real          1007
around         989
help           968
something      967
need           964
father         957
april          955
flowers        954
star           951
sakura         950
bloom          947
worker         941
music          933
live           929
mozart         929
compare        927
black          924
pieces         923
seems          921
watermelon     919
deep           915
fired          915
pot            913
kettle         913
calls          911
listening      906
apples         895
cheer          894
sleeper        891
bothering      891
tell           887
rap            887
cuddling       881
oranges        880
composed       879
ripened        878
zip            854
water          829
cracks         792
drips          792
lip            765
picker         704
nit            690
also           653
one            330
still          329
good           293
really         260
know        

Based on the list above, I have decided on the following words as my labels.

In [41]:
labels = [ "apples", "flowers", "worker", "water", "father"]

The selection is based on 2 main reasons.

1. They all have two syllables - *if they had different numbers of syllables, it could be too easy to classify them*

2. They have similar starting and ending consonants - *for example, 'flowers' and 'father' sound quite similar*

The only exception is "apples", which was picked because an apple is a common object whose name can have varying pronunciations. For example, some speakers pronounce it without the "l" sound, making it sound like "ap-pers".

In [37]:
# reset index as this is what I will be working with
transcript_7_df.reset_index(inplace=True)

**Get Speakers Who Used Words from Label List**

In [43]:
# return a dictionary with the list of speakers for each label

dct = {}
# for each label
for l in labels:
    word_id = []
    # for each sentence
    for n, t in enumerate(transcript_7_df['text']):
        # if sentence contains the label
        if l in t.split():
            # append the speaker number
            word_id.append(transcript_7_df['speaker'][n])
    # store the unique speaker numbers for this label in the main dictionary
    dct[l] = set(word_id)
dct

{'apples': {3,
  4,
  5,
  6,
  7,
  8,
  10,
  12,
  13,
  14,
  16,
  17,
  18,
  20,
  21,
  22,
  23,
  24,
  25,
  28,
  30,
  31,
  33,
  34,
  35,
  36,
  37,
  39,
  41,
  44,
  46,
  48,
  50,
  52,
  54,
  57,
  59,
  61,
  63,
  65,
  68,
  73,
  76,
  78,
  80,
  81,
  82,
  83,
  84,
  86,
  88,
  92,
  95,
  96,
  97,
  98,
  99,
  100,
  104,
  107,
  108,
  110,
  112,
  113,
  114,
  115,
  116,
  117,
  118,
  123,
  129,
  130,
  133,
  134,
  135,
  136,
  137,
  138,
  139,
  140,
  141,
  143,
  144,
  145,
  146,
  147,
  148,
  149,
  151,
  152,
  153,
  156,
  158,
  159,
  160,
  161,
  163,
  164,
  165,
  166,
  167,
  168,
  169,
  170,
  171,
  172,
  173,
  174,
  175,
  176,
  177,
  179,
  180,
  181,
  182,
  183,
  184,
  185,
  186,
  187,
  188,
  189,
  191,
  193,
  194,
  195,
  196,
  197,
  198,
  199,
  200,
  201,
  202,
  203,
  204,
  205,
  206,
  207,
  208,
  209,
  210,
  211,
  212,
  213,
  214,
  215,
  216,
  217,
  218,
  219,
  2

In [44]:
# find common speakers within them
speakers_list = dct['apples'].intersection(dct['flowers'], dct['worker'], dct['water'], dct['father'])

In [49]:
print(f"There are a total of {len(speakers_list)} speakers who have used all the labels.")
print(f"Potential Dataset is {len(speakers_list) * 5} audio files.")

There are a total of 665 speakers who have used all the labels.
Potential Dataset is 3325 audio files.
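
The logic of the last two cells, collecting one set of speakers per label and then intersecting the sets, can be sketched on toy data. The function name `speakers_with_all_labels` and the sample rows below are hypothetical and only illustrate the technique.

```python
def speakers_with_all_labels(rows, labels):
    """Return the speakers who said every label at least once.

    rows   : iterable of (speaker, sentence) pairs
    labels : non-empty list of label words
    """
    by_label = {label: set() for label in labels}
    for speaker, sentence in rows:
        words = set(sentence.split())
        for label in labels:
            if label in words:
                by_label[label].add(speaker)
    # a speaker qualifies only if present in every label's set
    return set.intersection(*by_label.values())

toy_rows = [
    (1, "father is a worker"),
    (1, "the flowers need water"),
    (2, "father bought flowers"),
    (2, "the worker left"),
]
toy_labels = ["father", "worker", "flowers", "water"]
print(speakers_with_all_labels(toy_rows, toy_labels))
```

Speaker 2 never says "water" in the toy data, so only speaker 1 remains after the intersection.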


**Select Speakers**

Due to time constraints, I have chosen to select the audio files in two separate batches.

In [54]:
# pick my speakers for batch 1
np.random.seed(19)
selected_speakers_1 = np.random.choice(list(speakers_list), size= 110, replace= False)
selected_speakers_1.sort()
selected_speakers_1

array([ 140,  141,  158,  163,  185,  191,  195,  197,  198,  203,  205,
        212,  214,  228,  232,  249,  254,  256,  260,  261,  262,  275,
        280,  291,  294,  306,  324,  328,  331,  332,  344,  355,  361,
        363,  381,  383,  394,  398,  400,  413,  427,  431,  446,  474,
        481,  491,  520,  530,  531,  539,  563,  566,  579,  586,  605,
        617,  640,  661,  663,  665,  692,  697,  714,  717,  734,  737,
        741,  747,  750,  794,  796,  805,  806,  830,  853,  854,  855,
        862,  869,  874,  877,  878,  888,  893,  898,  901,  906,  907,
        925,  926,  927,  935,  940,  947,  955,  956,  957,  965,  966,
        969,  970,  974,  993,  998, 1021, 1029, 1068, 1079, 1081, 1446],
      dtype=int64)

In [55]:
# pick my speakers for batch 2
for s in selected_speakers_1:
    speakers_list.remove(s)

selected_speakers_2 = np.random.choice(list(speakers_list), size= 140, replace= False)
selected_speakers_2.sort()
selected_speakers_2

array([   3,    6,    7,    8,   59,  116,  144,  147,  148,  166,  167,
        180,  181,  187,  188,  194,  207,  219,  226,  243,  267,  270,
        276,  279,  292,  293,  297,  303,  305,  312,  315,  319,  323,
        340,  352,  365,  370,  388,  392,  395,  405,  408,  411,  414,
        417,  420,  425,  438,  439,  455,  463,  466,  468,  473,  486,
        496,  502,  503,  508,  509,  514,  534,  565,  575,  577,  581,
        583,  585,  590,  597,  622,  628,  632,  651,  655,  675,  678,
        690,  708,  709,  719,  720,  733,  736,  740,  759,  781,  789,
        801,  814,  816,  818,  826,  831,  835,  846,  848,  856,  857,
        870,  873,  875,  882,  883,  884,  885,  892,  894,  912,  917,
        920,  931,  941,  942,  943,  945,  946,  977,  985,  986,  987,
        991,  997, 1000, 1001, 1008, 1018, 1027, 1030, 1031, 1032, 1034,
       1040, 1044, 1048, 1051, 1052, 1056, 1058, 1064], dtype=int64)

I potentially have 250 speakers with at least 1250 audio files, 250 for each class. I will use about 100 audio files as unseen data and about 1000 audio files as the training set.
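
As an alternative to removing the first batch from the pool before drawing the second, both batches can be cut from a single shuffled copy of the pool, which guarantees that they do not overlap. This is a sketch under stated assumptions, not the method used above: the helper name `draw_disjoint_batches` and the toy pool are hypothetical, and because it uses NumPy's `default_rng` it will not reproduce the speakers selected with `np.random.seed(19)`.

```python
import numpy as np

def draw_disjoint_batches(pool, sizes, seed=19):
    """Shuffle the pool once and cut it into consecutive, non-overlapping batches."""
    if sum(sizes) > len(pool):
        raise ValueError("batch sizes add up to more than the pool size")
    rng = np.random.default_rng(seed)
    # sort first so the result does not depend on set ordering
    shuffled = rng.permutation(sorted(pool))
    batches, start = [], 0
    for size in sizes:
        batches.append(np.sort(shuffled[start:start + size]))
        start += size
    return batches

toy_pool = set(range(1, 21))   # hypothetical pool of 20 speaker numbers
batch_1, batch_2 = draw_disjoint_batches(toy_pool, sizes=[5, 7])
print(batch_1, batch_2)
```

Drawing from one shuffle also leaves the original pool untouched, so the cell can be re-run without changing the result.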

**Find the Specific Audio Files that I Need to Keep**

Each speaker's folder contains recordings from `session 0` and `session 1`, with 350-400 audio files in each session. Out of these roughly 800 audio files, I need at most 5. To save space on my computer, I will identify exactly which audio files I need and delete the rest.

Approach:
1. Find the audio ID number which corresponds to the labels
2. Create a list of their file paths
3. Use OS commands and GitBash to mass extract zipfiles and delete all unwanted files

In [62]:
def find_id(speaker_list, df=transcript_7_df):
    """
    Return a list of tuples (id, l), where id is the audio file id and l is the label.
    This allows me to find the specific audio files I need, so that I can delete the irrelevant ones later on.
    """
    idlist = []
    for speaker in speaker_list:
        print("")
        print(f'For Speaker {speaker},')
        # transcripts of this speaker only; the original index is kept so the id can be looked up
        speaker_text = df[df['speaker'] == speaker]['text']
        # number of matches found for this speaker
        found = 0
        for l in labels:
            for index, t in speaker_text.items():
                # for each label, if label is in transcript
                if l in t.split(' '):
                    print(f"Extract {l} from {df['id'][index]}")
                    # append the id and the label name
                    idlist.append((df['id'][index], l))
                    found += 1
        # check this speaker only, not the running total across all speakers
        if found >= 5:
            print("Has at least 5 labels")
    return idlist

**Retrieve the Audio ID for Batch 1**

In [60]:
speaker_list_1 = find_id(selected_speakers_1)


For Speaker 140,
Extract apples from 1400066
Extract flowers from 1400062
Extract worker from 1400049
Extract water from 1400051
Extract father from 1400049
Has at least 5 labels

For Speaker 141,
Extract apples from 1410066
Extract flowers from 1410062
Extract worker from 1410049
Extract water from 1410051
Extract father from 1410049
Has at least 5 labels

For Speaker 158,
Extract apples from 1580066
Extract flowers from 1580062
Extract worker from 1580049
Extract water from 1580051
Extract father from 1580049
Has at least 5 labels

For Speaker 163,
Extract apples from 1630066
Extract flowers from 1630062
Extract worker from 1630049
Extract water from 1630051
Extract father from 1630049
Has at least 5 labels

For Speaker 185,
Extract apples from 1850066
Extract flowers from 1850062
Extract worker from 1850049
Extract water from 1850051
Extract father from 1850049
Has at least 5 labels

For Speaker 191,
Extract apples from 1910066
Extract flowers from 1910062
Extract worker from 19100

Extract apples from 8770066
Extract flowers from 8770062
Extract worker from 8770049
Extract water from 8770051
Extract father from 8770049
Has at least 5 labels

For Speaker 878,
Extract apples from 8780066
Extract flowers from 8780062
Extract worker from 8780049
Extract water from 8780051
Extract father from 8780049
Has at least 5 labels

For Speaker 888,
Extract apples from 8880066
Extract flowers from 8880062
Extract worker from 8880049
Extract water from 8880051
Extract father from 8880049
Has at least 5 labels

For Speaker 893,
Extract apples from 8930066
Extract flowers from 8930062
Extract worker from 8930049
Extract water from 8930051
Extract father from 8930049
Has at least 5 labels

For Speaker 898,
Extract apples from 8980066
Extract flowers from 8980062
Extract worker from 8980049
Extract water from 8980051
Extract father from 8980049
Has at least 5 labels

For Speaker 901,
Extract apples from 9010066
Extract flowers from 9010062
Extract worker from 9010049
Extract water f

In [61]:
speaker_list_1

[(1400066, 'apples'),
 (1400062, 'flowers'),
 (1400049, 'worker'),
 (1400051, 'water'),
 (1400049, 'father'),
 (1410066, 'apples'),
 (1410062, 'flowers'),
 (1410049, 'worker'),
 (1410051, 'water'),
 (1410049, 'father'),
 (1580066, 'apples'),
 (1580062, 'flowers'),
 (1580049, 'worker'),
 (1580051, 'water'),
 (1580049, 'father'),
 (1630066, 'apples'),
 (1630062, 'flowers'),
 (1630049, 'worker'),
 (1630051, 'water'),
 (1630049, 'father'),
 (1850066, 'apples'),
 (1850062, 'flowers'),
 (1850049, 'worker'),
 (1850051, 'water'),
 (1850049, 'father'),
 (1910066, 'apples'),
 (1910062, 'flowers'),
 (1910049, 'worker'),
 (1910051, 'water'),
 (1910049, 'father'),
 (1950066, 'apples'),
 (1950062, 'flowers'),
 (1950049, 'worker'),
 (1950051, 'water'),
 (1950049, 'father'),
 (1970066, 'apples'),
 (1970062, 'flowers'),
 (1970049, 'worker'),
 (1970051, 'water'),
 (1970049, 'father'),
 (1980066, 'apples'),
 (1980062, 'flowers'),
 (1980049, 'worker'),
 (1980051, 'water'),
 (1980049, 'father'),
 (2030066,

**Retrieve the Audio ID for Batch 2**

In [63]:
speaker_list_2 = find_id(selected_speakers_2)


For Speaker 3,
Extract apples from 30034
Extract flowers from 30008
Extract worker from 30009
Extract water from 30095
Extract father from 30009
Has at least 5 labels

For Speaker 6,
Extract apples from 60066
Extract flowers from 60062
Extract worker from 60049
Extract water from 60051
Extract father from 60049
Has at least 5 labels

For Speaker 7,
Extract apples from 70066
Extract flowers from 70062
Extract worker from 70049
Extract water from 70051
Extract father from 70049
Has at least 5 labels

For Speaker 8,
Extract apples from 80066
Extract flowers from 80062
Extract worker from 80049
Extract water from 80197
Extract father from 80049
Has at least 5 labels

For Speaker 59,
Extract apples from 590066
Extract flowers from 590062
Extract worker from 590049
Extract water from 590051
Extract father from 590049
Has at least 5 labels

For Speaker 116,
Extract apples from 1160066
Extract flowers from 1160062
Extract worker from 1160049
Extract water from 1160051
Extract father from 1160

Extract water from 7200051
Extract father from 7200049
Has at least 5 labels

For Speaker 733,
Extract apples from 7330066
Extract flowers from 7330062
Extract worker from 7330049
Extract water from 7330051
Extract father from 7330049
Has at least 5 labels

For Speaker 736,
Extract apples from 7360066
Extract flowers from 7360062
Extract worker from 7360049
Extract water from 7360051
Extract father from 7360049
Has at least 5 labels

For Speaker 740,
Extract apples from 7400066
Extract flowers from 7400062
Extract worker from 7400049
Extract water from 7400051
Extract father from 7400049
Has at least 5 labels

For Speaker 759,
Extract apples from 7590066
Extract flowers from 7590062
Extract worker from 7590049
Extract water from 7590051
Extract father from 7590049
Has at least 5 labels

For Speaker 781,
Extract apples from 7810066
Extract flowers from 7810062
Extract worker from 7810049
Extract water from 7810051
Extract father from 7810049
Has at least 5 labels

For Speaker 789,
Extra

In [64]:
speaker_list_2

[(30034, 'apples'),
 (30008, 'flowers'),
 (30009, 'worker'),
 (30095, 'water'),
 (30009, 'father'),
 (60066, 'apples'),
 (60062, 'flowers'),
 (60049, 'worker'),
 (60051, 'water'),
 (60049, 'father'),
 (70066, 'apples'),
 (70062, 'flowers'),
 (70049, 'worker'),
 (70051, 'water'),
 (70049, 'father'),
 (80066, 'apples'),
 (80062, 'flowers'),
 (80049, 'worker'),
 (80197, 'water'),
 (80049, 'father'),
 (590066, 'apples'),
 (590062, 'flowers'),
 (590049, 'worker'),
 (590051, 'water'),
 (590049, 'father'),
 (1160066, 'apples'),
 (1160062, 'flowers'),
 (1160049, 'worker'),
 (1160051, 'water'),
 (1160049, 'father'),
 (1440066, 'apples'),
 (1440062, 'flowers'),
 (1440049, 'worker'),
 (1440051, 'water'),
 (1440049, 'father'),
 (1470066, 'apples'),
 (1470062, 'flowers'),
 (1470049, 'worker'),
 (1470051, 'water'),
 (1470049, 'father'),
 (1480066, 'apples'),
 (1480062, 'flowers'),
 (1480049, 'worker'),
 (1480051, 'water'),
 (1480049, 'father'),
 (1660066, 'apples'),
 (1660062, 'flowers'),
 (1660049,

**Save Relevant Audio Files**

Steps:
1. Download all the zip files from Dropbox, containing recordings of more than 1000 speakers.

2. Delete irrelevant zip files.

3. Use GitBash to unzip everything and do a massive file extraction (more than 50,000 recordings).

4. Delete all irrelevant audio files using the `os` library.

**Step 1: Download Everything**

<img src="./assets/images/zipfiles.jpg" />

There are more than a thousand zip files.

**Step 2: Delete Irrelevant Zip Files**

I used these lines of code to delete the zip files which I do not need.

```python
# Delete the zip files that I don't need from my directory
# FORMAT of zip files = "SPEAKER0001.zip"
# Build the list of zip files which I need, then remove the rest

# where I keep all the audio files
path = 'C:\\Users\\Andy Chan\\Desktop\\audio_original'

# get the name of the zip file of each speaker, padding the speaker number to 4 digits
zip_names = [f'SPEAKER{str(f).zfill(4)}.zip' for f in selected_speakers_2]

# for each file in "path"
for f in os.listdir(path):
    # if file is not in zip_names
    if f not in zip_names:
        full_file_path = os.path.join(path, f)
        # remove the file; the files listed in zip_names are kept
        os.remove(full_file_path)
```

The result is this.

<img src="./assets/images/remaining_zip.jpg" />

This leaves 140 zip files from my second batch, which is much easier to work with than hundreds of GB.

**Step 3: Mass Extraction with GitBash**

<img src="./assets/images/unzip.jpg" />

For information on the Git/Linux commands, refer to the Excel worksheet in this same repo.

This gives me all the audio files of all 140 speakers, which still amounts to about 54,000 audio files.
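
For readers who do not have Git Bash, the same mass extraction can be sketched with Python's standard `zipfile` module. This is an alternative under stated assumptions, not the command shown in the screenshot: the helper name `extract_all` is hypothetical, and it assumes that dumping every archive's contents into one output folder is the desired layout.

```python
import os
import zipfile

def extract_all(zip_dir, out_dir):
    """Extract every .zip archive found in zip_dir into out_dir.

    Returns the list of member names that were extracted.
    """
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(zip_dir)):
        # skip anything that is not a zip archive
        if not name.lower().endswith(".zip"):
            continue
        with zipfile.ZipFile(os.path.join(zip_dir, name)) as archive:
            archive.extractall(out_dir)
            extracted.extend(archive.namelist())
    return extracted

# example call (the paths are placeholders):
# extract_all("path/to/zip_folder", "path/to/output_folder")
```

Returning the member names makes it easy to confirm how many files were extracted before moving on to the deletion step.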

**Step 4: Delete Irrelevant Audio Files**

I ran the following code, which deletes the audio files I do not need and keeps the ones I need.

```python

# Delete the audio files that I don't need from my directory
# FORMAT of audio files = "000030034.WAV"
# Build the list of audio files which I need, then remove the rest

# where I keep all the audio files
path = 'C:\\Users\\Andy Chan\\Desktop\\audio_original\\capstone_audios'

# create a list of the audio ids which I am interested in
file_list_2 = [i.tolist() for i, _ in speaker_list_2]

# each file name has 9 digits: a leading zero, the 4-digit speaker code, the 1-digit session code
# and the 3-digit line number. Ids with fewer digits are padded with leading zeros.
audio_paths = [f'{str(f).zfill(9)}.WAV' for f in file_list_2]

for f in os.listdir(path):
    if f not in audio_paths:
        full_file_path = os.path.join(path, f)
        os.remove(full_file_path)
```
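
Because `os.remove` is irreversible, it can help to list what would be deleted before deleting anything. The following is a minimal dry-run sketch, not part of the original workflow: the helper names `audio_name` and `files_to_delete` are hypothetical, and the 9-digit padding mirrors the file naming used in the code above.

```python
import os

def audio_name(audio_id):
    """Turn an audio id such as 30034 into its 9-digit file name."""
    return f"{str(audio_id).zfill(9)}.WAV"

def files_to_delete(folder, keep_names):
    """Return the files in folder that are NOT in keep_names, without deleting them."""
    keep = set(keep_names)
    return sorted(f for f in os.listdir(folder) if f not in keep)

# example call (the path is a placeholder):
# keep = [audio_name(i) for i in (30034, 1160066)]
# for f in files_to_delete("path/to/capstone_audios", keep):
#     print("would delete", f)
```

Once the printed list looks right, the same list can be passed to `os.remove` one file at a time.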

**Extracting Specific Words From Each Audio File**

After the above, I used [Audacity](https://www.audacityteam.org/) to extract the specific audio I needed. Audacity is an easy-to-use audio editing programme. I selected the portion of the waveform that I needed and normalised it, leaving every other setting at its default.

With all the audio files on hand, I can create the train-test set.

**Create Training Set DataFrame For Audio EDA**

In [65]:
# create an empty dataframe

train_df = pd.DataFrame(columns=['id','filepath','duration','class_label'])

In [66]:
path = './assets/audio_train/'

#for each audio file in the path
for n, f in enumerate(os.listdir(path)):
    # get filepath
    filepath = os.path.join(path, f)
    # get duration of each audio file, using Librosa
    audio, sr = librosa.load(filepath)
    # pass the signal and sample rate by keyword, as newer librosa versions require
    duration = librosa.get_duration(y=audio, sr=sr)
    # get label
    label = f.replace(".wav", "").split("_")[1]
    # get id
    i = f.replace(".wav", "").split("_")[0]
    
    # add each row into the dataframe
    train_df.loc[n] = [i, filepath, duration, label]
    
train_df.head()

|   | id | filepath | duration | class_label |
|---|---|---|---|---|
| 0 | 10210049 | ./assets/audio_train/10210049_father.wav | 0.457959 | father |
| 1 | 10210049 | ./assets/audio_train/10210049_worker.wav | 0.519501 | worker |
| 2 | 10210051 | ./assets/audio_train/10210051_water.wav | 0.395828 | water |
| 3 | 10210062 | ./assets/audio_train/10210062_flowers.wav | 0.399229 | flowers |
| 4 | 10210066 | ./assets/audio_train/10210066_apples.wav | 0.524354 | apples |
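
The file name parsing in the loop above can be isolated into a small helper. This is only a sketch: the function name `parse_clip_name` is hypothetical, and the second sample file name (with an upper-case extension) is made up to show that `os.path.splitext` handles the extension regardless of case, unlike `replace(".wav", "")`.

```python
import os

def parse_clip_name(filename):
    """Split a clip name such as '10210049_father.wav' into (id, label)."""
    stem, _ext = os.path.splitext(filename)
    clip_id, label = stem.split("_", 1)
    return clip_id, label

# collect the rows in a plain list first
rows = []
for f in ["10210049_father.wav", "10210051_water.WAV"]:
    clip_id, label = parse_clip_name(f)
    rows.append({"id": clip_id, "class_label": label})
print(rows)
```

Collecting rows in a list and building the dataframe once at the end is usually faster than assigning to `df.loc[n]` row by row, although for about a thousand files the difference is small.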


In [67]:
# check null values to see if it was successful

train_df.isnull().sum()

id             0
filepath       0
duration       0
class_label    0
dtype: int64

In [68]:
train_df['class_label'].value_counts()

father     211
water      210
worker     210
flowers    209
apples     209
Name: class_label, dtype: int64

**Convert to CSV**

In [69]:
train_df.to_csv("datasets/train.csv", index= False)

### Prepare Test Set as Unseen Data

In [70]:
# create an empty dataframe

test_df = pd.DataFrame(columns=['id','filepath','duration','class_label'])

In [76]:
path = './assets/audio_test/'

#for each audio file in the path
for n, f in enumerate(os.listdir(path)):
    # get filepath
    filepath = os.path.join(path, f)
    # get duration of each audio file, using Librosa
    audio, sr = librosa.load(filepath)
    # pass the signal and sample rate by keyword, as newer librosa versions require
    duration = librosa.get_duration(y=audio, sr=sr)
    # get label
    label = f.replace(".wav", "").split("_")[1]
    # get id
    i = f.replace(".wav", "").split("_")[0]
    
    # add each row into the dataframe
    test_df.loc[n] = [i, filepath, duration, label]
    
test_df.head()

|   | id | filepath | duration | class_label |
|---|---|---|---|---|
| 0 | 8750049 | ./assets/audio_test/8750049_father.wav | 0.353016 | father |
| 1 | 8750049 | ./assets/audio_test/8750049_worker.wav | 0.531655 | worker |
| 2 | 8750051 | ./assets/audio_test/8750051_water.wav | 0.374467 | water |
| 3 | 8750062 | ./assets/audio_test/8750062_flowers.wav | 0.533379 | flowers |
| 4 | 8750066 | ./assets/audio_test/8750066_apples.wav | 0.503719 | apples |


In [77]:
test_df.shape

(101, 4)

In [78]:
# check for null values
test_df.isnull().sum()

id             0
filepath       0
duration       0
class_label    0
dtype: int64

In [79]:
test_df['class_label'].value_counts()

flowers    21
apples     20
worker     20
water      20
father     20
Name: class_label, dtype: int64

**Convert to CSV**

In [80]:
# save as csv
test_df.to_csv("datasets/test.csv", index=False)

This notebook has shown how I determined the scope of my capstone project and how I collected the data within my limitations. I used various libraries to parse through huge datasets efficiently, allowing me to start my project promptly.