In [1]:
import pandas as pd
import os

### Generate a List of File Folders:

Each folder name is also the title of the Wikipedia article that is read aloud in that folder's recording.

All of the data used in this notebook was pulled from https://nats.gitlab.io/swc/

In [2]:
english_folder_list = os.listdir('./data/English/english')
german_folder_list = os.listdir('./data/German/german')

### Pull audio file names from list:

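Using a for loop, I generate a list with one ```.ogg``` file per folder. A few entries have multiple audio files, so the list comprehension collects every match and ```.pop()``` keeps the last one.

For the English audio files, ```audio2.ogg``` comes last in these listings, so I can pick up the matching ```audio1.ogg``` files in a separate pass later.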

In [3]:
english_files = []
for i in english_folder_list:
    try: 
        file = [x for x in os.listdir(f'./data/English/english/{i}/') if x.endswith('ogg')].pop()
    except (IndexError, OSError):  # no matching file, or entry is not a readable folder
        file = 'None'
    english_files.append(file)
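
The collection loops in this notebook differ only in the root directory and the filename suffix, so they could be folded into one helper. The sketch below is a hypothetical refactor, not code from the original notebook: `collect_audio_files` is an invented name, it sorts both listings because `os.listdir` returns entries in arbitrary order, and it catches only the two errors the loop is expected to hit. The demo runs against a throwaway temporary directory with made-up folder names rather than the real corpus.

```python
import os
import tempfile


def collect_audio_files(root, suffix="ogg"):
    """Return (folders, files): one filename per folder, or 'None' if no match.

    Hypothetical helper; mirrors the notebook's loop but sorts the listing so
    the 'last' file does not depend on the order os.listdir happens to return.
    """
    folders = sorted(os.listdir(root))
    files = []
    for folder in folders:
        try:
            matches = sorted(
                name for name in os.listdir(os.path.join(root, folder))
                if name.endswith(suffix)
            )
            files.append(matches.pop())  # last match in sorted order
        except (IndexError, OSError):
            # IndexError: no matching file; OSError: entry is not a readable folder
            files.append("None")
    return folders, files


# Demo on a temporary directory tree (made-up folder names).
with tempfile.TemporaryDirectory() as root:
    layout = {"A": ["audio.ogg"], "B": ["audio1.ogg", "audio2.ogg"], "C": []}
    for folder, names in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in names:
            open(os.path.join(root, folder, name), "w").close()

    demo_folders, demo_files = collect_audio_files(root)
    _, demo_part1 = collect_audio_files(root, suffix="1.ogg")

print(demo_folders, demo_files, demo_part1)
```

Calling the helper once with the default suffix and once with `suffix="1.ogg"` corresponds to the two passes made over the English folders.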

#### Verify lengths

To make sure I didn't have any extraction errors, I will compare the length of the generated list to the input list.

In [4]:
len(english_folder_list)

1340

In [5]:
len(english_files)

1340

### Convert to DataFrame:

I put the titles and filenames into a DataFrame to store them for later use.

In [6]:
english_file_df = pd.DataFrame()
english_file_df['Title'] = english_folder_list
english_file_df['Filename'] = english_files
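
The same table can also be built in a single call by passing a dict of columns to the `pd.DataFrame` constructor. A minimal sketch, with illustrative values standing in for the real folder and file lists:

```python
import pandas as pd

# Illustrative values only; in the notebook these come from os.listdir.
titles = ["Radha", "Thalassery", "Lev_Landau"]
filenames = ["audio.ogg", "audio.ogg", "None"]

# Dict keys become column names; both lists must have the same length.
demo_df = pd.DataFrame({"Title": titles, "Filename": filenames})
print(demo_df.shape)
```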

#### Check for Nones:

There are a few ```'None'``` entries, one for each folder where no ```.ogg``` file was found.

I'll use a mask to drop these rows from the DataFrame, since this will save me from writing ```try``` and ```except``` blocks in my later data preparation.

In [7]:
english_file_df['Filename'].value_counts()


audio.ogg     1270
audio2.ogg      64
None             6
Name: Filename, dtype: int64

In [8]:
english_file_df = english_file_df[english_file_df['Filename'] !='None']
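
The filter above builds a boolean mask (`True` wherever the filename is not the placeholder string `'None'`) and keeps only the rows where the mask is `True`. A small sketch on made-up rows, showing the mask and the filtered result separately:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Title": ["A", "B", "C"],  # made-up titles
    "Filename": ["audio.ogg", "None", "audio2.ogg"],
})

mask = demo_df["Filename"] != "None"  # boolean Series, one value per row
kept = demo_df[mask]                  # rows where mask is True

print(mask.tolist())
print(kept["Title"].tolist())
```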

#### Repeat the process above for Audio Files ending in 1.ogg

The steps below are the same as those above, except that they pull only the files ending in ```1.ogg```, which were missing from the original DataFrame.

In [9]:
english_files = []
for i in english_folder_list:
    try: 
        file = [x for x in os.listdir(f'./data/English/english/{i}/') if x.endswith('1.ogg')].pop()
    except (IndexError, OSError):  # no matching file, or entry is not a readable folder
        file = 'None'
    english_files.append(file)

# Generate Second DataFrame for Audio File ending in 1.ogg
    
english_file_df2 = pd.DataFrame()
english_file_df2['Title'] = english_folder_list
english_file_df2['Filename'] = english_files

# Drop 'None' Rows

english_file_df2 = english_file_df2[english_file_df2['Filename'] != 'None']

### Concat:

Bring the DataFrames together using concatenation. Check the length and reset the index.

In [10]:
english_file_df = pd.concat([english_file_df, english_file_df2])

In [11]:
len(english_file_df)

1398

In [12]:
english_file_df.reset_index(inplace=True)
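
Note that `reset_index(inplace=True)` keeps the old index as a new column named `index`. One alternative, shown here as a sketch on toy frames rather than a change to the notebook's code, is to pass `ignore_index=True` to `pd.concat` so the result is renumbered without keeping the old labels.

```python
import pandas as pd

first = pd.DataFrame({"Title": ["A", "B"], "Filename": ["audio.ogg", "audio2.ogg"]})
second = pd.DataFrame({"Title": ["B"], "Filename": ["audio1.ogg"]}, index=[1])

# Default concat keeps the original labels, so label 1 appears twice.
stacked = pd.concat([first, second])

# reset_index() moves the old labels into an 'index' column ...
with_old_index = stacked.reset_index()

# ... whereas ignore_index=True simply renumbers the rows from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist(), with_old_index.columns.tolist(), renumbered.index.tolist())
```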

### Repeat for German:

The beginning is the same, but the value counts come out a little differently, so we'll do some extra exploring to see what's going on.

In [13]:
# Collect Files:

german_files = []
for i in german_folder_list:
    try: 
        file = [x for x in os.listdir(f'./data/German/german/{i}/') if x.endswith('ogg')].pop()
    except (IndexError, OSError):  # no matching file, or entry is not a readable folder
        file = 'None'
    german_files.append(file)

# Create DataFrame matching titles to audio files

german_file_df = pd.DataFrame()
german_file_df['Title'] = german_folder_list
german_file_df['Filename'] = german_files

# Check the value_counts()

german_file_df['Filename'].value_counts()

audio.ogg     942
audio2.ogg     69
None            3
audio1.ogg      1
Name: Filename, dtype: int64

#### Weird...

A single folder returned ```audio1.ogg```, while every other multi-file folder returned ```audio2.ogg```. So we'll have to repeat the steps above for both file endings, ```1.ogg``` and ```2.ogg```, to see what's happening.

But first let's remove the ```'None'``` rows from the first DataFrame.

In [14]:
german_file_df = german_file_df[german_file_df['Filename'] != 'None']

#### Repeat with ```1.ogg```

In [15]:
# Collect Files:

german_files = []
for i in german_folder_list:
    try: 
        file = [x for x in os.listdir(f'./data/German/german/{i}/') if x.endswith('1.ogg')].pop()
    except (IndexError, OSError):  # no matching file, or entry is not a readable folder
        file = 'None'
    german_files.append(file)

# Create DataFrame:    
    
german_file_df2 = pd.DataFrame()
german_file_df2['Title'] = german_folder_list
german_file_df2['Filename'] = german_files

# Drop 'None' rows:

german_file_df2 = german_file_df2[german_file_df2['Filename'] != 'None']

# Check Values:

german_file_df2['Filename'].value_counts()

audio1.ogg    70
Name: Filename, dtype: int64

#### Wait....

We collected 1 ```audio1.ogg``` file earlier. But we have 70 here. 

Let's look at ```audio2.ogg``` really quick.

In [16]:
# Collect Files:

german_files = []
for i in german_folder_list:
    try: 
        file = [x for x in os.listdir(f'./data/German/german/{i}/') if x.endswith('2.ogg')].pop()
    except (IndexError, OSError):  # no matching file, or entry is not a readable folder
        file = 'None'
    german_files.append(file)

# Create DataFrame:    
    
german_file_df3 = pd.DataFrame()
german_file_df3['Title'] = german_folder_list
german_file_df3['Filename'] = german_files

# Drop 'None' rows:

german_file_df3 = german_file_df3[german_file_df3['Filename'] != 'None']

# Check Values:

german_file_df3['Filename'].value_counts()

audio2.ogg    69
Name: Filename, dtype: int64

#### Interesting:

So we have 69 ```audio2.ogg``` files. From here on I will refer to ```audio1.ogg``` and ```audio2.ogg``` as '1' and '2', and to plain ```audio.ogg``` as 'blank'.

In '1' we collected 70, which means there are 70 folders that contain a '1' file.

However, in '2' we only collected 69. This (to me at least) indicates that one file was accidentally named '1' instead of 'blank'. Since we already have all of '2' in the DataFrame, there is no need to concatenate it.

Let's bring the 'blank' and '1' files together.

In [17]:
german_file_df = pd.concat([german_file_df, german_file_df2])

#### Testing:

If my theory is true, the lone '1' file from the first pass now appears twice, so dropping duplicate rows will result in the loss of exactly one row.

In [18]:
len(german_file_df)

1082

In [19]:
len(german_file_df.drop_duplicates())

1081
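
To see which row `drop_duplicates` removes, `DataFrame.duplicated(keep=False)` flags every member of a duplicate group. The sketch below uses made-up titles to reproduce the situation described above: one folder whose only recording is named `audio1.ogg` shows up in both the first pass and the `1.ogg` pass.

```python
import pandas as pd

# First pass (any .ogg, last one wins) and second pass (only *1.ogg); toy data.
first_pass = pd.DataFrame({
    "Title": ["Solo", "Split", "Odd"],
    "Filename": ["audio.ogg", "audio2.ogg", "audio1.ogg"],
})
second_pass = pd.DataFrame({
    "Title": ["Split", "Odd"],
    "Filename": ["audio1.ogg", "audio1.ogg"],
})

combined = pd.concat([first_pass, second_pass])

# keep=False marks all copies of a duplicated row, not just the later ones.
dupes = combined[combined.duplicated(keep=False)]
deduped = combined.drop_duplicates()

print(dupes["Title"].unique().tolist(), len(combined), len(deduped))
```

Running the same `duplicated(keep=False)` filter on the real German DataFrame would show the title of the one affected folder.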

#### Seems it was correct:

So now we have DataFrames covering all of the English and German audio files. Let's go ahead and make that drop permanent and add a column to represent the language (1 for English, 0 for German), since that is what I'll be testing for.

In [20]:
german_file_df.drop_duplicates(inplace=True)

In [21]:
english_file_df['Language'] = 1
german_file_df['Language'] = 0

file_df = pd.concat([english_file_df, german_file_df])

### Finalizing:

Now that we have all of the data together, let's go ahead and (1) drop the extra ```index``` column that was created when the English index was reset, (2) fully reset the index, and (3) check that it all looks good.


In [22]:
try:
    file_df.drop('index',axis=1, inplace=True)
except KeyError:  # there is no 'index' column to drop
    pass

file_df.reset_index(inplace=True)
file_df.drop('index', axis=1, inplace=True)

file_df

| | Filename | Language | Title |
|---|---|---|---|
| 0 | audio.ogg | 1 | Longest_word_in_English |
| 1 | audio2.ogg | 1 | Equal_Protection_Clause |
| 2 | audio.ogg | 1 | Radha |
| 3 | audio.ogg | 1 | Thalassery |
| 4 | audio.ogg | 1 | Lev_Landau |
| ... | ... | ... | ... |
| 2474 | audio1.ogg | 0 | Waschb%c3%a4r |
| 2475 | audio1.ogg | 0 | Augsburg |
| 2476 | audio1.ogg | 0 | Tijuana_No! |
| 2477 | audio1.ogg | 0 | Microsoft_Windows_NT_4.0 |


### Looks Good!:

Everything looks good! Let's go ahead and save the file so I can reference it in the next notebook!

In [23]:
file_df.to_csv('./data/file_dictionary.csv', index=False)
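
In a later notebook, each row can be turned back into a file path by joining the language's root directory, the `Title`, and the `Filename`. The sketch below assumes the directory layout used at the top of this notebook and the 1 = English, 0 = German coding; `LANGUAGE_ROOTS` and `audio_path` are hypothetical names, and the CSV text is a two-row stand-in for `file_dictionary.csv`.

```python
import io

import pandas as pd

# Root folders as used earlier in this notebook; keys follow the Language coding.
LANGUAGE_ROOTS = {1: "./data/English/english", 0: "./data/German/german"}


def audio_path(row):
    """Build the path to one recording from a row of the file table."""
    root = LANGUAGE_ROOTS[row["Language"]]
    return f"{root}/{row['Title']}/{row['Filename']}"


# Stand-in for pd.read_csv('./data/file_dictionary.csv')
csv_text = "Filename,Language,Title\naudio.ogg,1,Radha\naudio1.ogg,0,Augsburg\n"
demo_df = pd.read_csv(io.StringIO(csv_text))
demo_df["Path"] = demo_df.apply(audio_path, axis=1)

print(demo_df["Path"].tolist())
```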