# Re-evaluating the labels

Although Google AudioSet offers a large number of files, its labels are noisy. To get a rough estimate of the labeling accuracy of our dataset, I randomly sampled 100 files (50 urban/city and 50 rural/nature) and checked whether each one was correctly labeled.

In [1]:
import pandas as pd
import numpy as np

# Draw a reproducible random sample of 50 clips per class (fixed random_state).
rural_pd = pd.read_csv('../data/interim/GoogleAudioSet_unbalanced_list/nature_no_music.csv').sample(n=50, random_state=23, ignore_index=True)
urban_pd = pd.read_csv('../data/interim/GoogleAudioSet_unbalanced_list/city_no_music.csv').sample(n=50, random_state=23, ignore_index=True)

In [2]:
# Build a YouTube link for each clip that opens at the clip's start time.
rural_pd['urls'] = 'https://www.youtube.com/watch?v=' + rural_pd['ID'] + '&t=' + rural_pd['start'].astype(int).astype(str)
rural_pd['urls']

0     https://www.youtube.com/watch?v=-PZtqerYjQA&t=110
1     https://www.youtube.com/watch?v=8erlTkwa8s8&t=190
2     https://www.youtube.com/watch?v=P2lmCMWND1U&t=370
3      https://www.youtube.com/watch?v=5TsbCSErWpw&t=50
4      https://www.youtube.com/watch?v=9aCjBwysKzA&t=19
5      https://www.youtube.com/watch?v=CIiY6wC6RDY&t=40
6      https://www.youtube.com/watch?v=MOohOlY932s&t=10
7      https://www.youtube.com/watch?v=Hqz8-Q498_E&t=10
8     https://www.youtube.com/watch?v=1pXlvIpP_d8&t=140
9      https://www.youtube.com/watch?v=OmwCMN_DsVU&t=30
10     https://www.youtube.com/watch?v=-LB8zfFRTY8&t=30
11     https://www.youtube.com/watch?v=GIt3J2-EIIo&t=30
12    https://www.youtube.com/watch?v=9QboqDLrogk&t=180
13     https://www.youtube.com/watch?v=AIooWHCWQUA&t=20
14     https://www.youtube.com/watch?v=O65AFVa65Lo&t=30
15     https://www.youtube.com/watch?v=Iv2ZRUjXHmA&t=22
16     https://www.youtube.com/watch?v=2zrK-q8ZA-s&t=30
17     https://www.youtube.com/watch?v=1BQ8qH4D4

In [3]:
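# 1 = label correct, 0 = label incorrect.
# `n` is a NaN placeholder (unused in this list) that np.nanmean would skip.
n = np.nan
rural_pd['label_correct'] = [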
    1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 
    1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 
    1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 
    1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1
]
np.nanmean(rural_pd['label_correct'])

0.84

In [4]:
# Build a YouTube link for each clip that opens at the clip's start time.
urban_pd['urls'] = 'https://www.youtube.com/watch?v=' + urban_pd['ID'] + '&t=' + urban_pd['start'].astype(int).astype(str)
urban_pd['urls']

0       https://www.youtube.com/watch?v=LX5Zs-58Hic&t=4
1      https://www.youtube.com/watch?v=-dOYOA8FGjo&t=80
2      https://www.youtube.com/watch?v=2hEy39Y7soc&t=30
3     https://www.youtube.com/watch?v=LC5N_WM3e-E&t=250
4      https://www.youtube.com/watch?v=06XMUn9DTKc&t=30
5      https://www.youtube.com/watch?v=IXZVTjIcPfM&t=30
6     https://www.youtube.com/watch?v=Ms-NQjDWTb0&t=110
7     https://www.youtube.com/watch?v=F1uZNiCe-iU&t=530
8      https://www.youtube.com/watch?v=HhSLoGTlK9k&t=30
9     https://www.youtube.com/watch?v=2K3X_NCcm1s&t=170
10      https://www.youtube.com/watch?v=8YeiopqoOs0&t=0
11     https://www.youtube.com/watch?v=AWxQ51rs_yk&t=30
12     https://www.youtube.com/watch?v=Lx9UFYGcAJA&t=30
13     https://www.youtube.com/watch?v=BUMr35cAuaA&t=30
14     https://www.youtube.com/watch?v=9tBXN4ocVUY&t=40
15      https://www.youtube.com/watch?v=38JhniwZwKQ&t=7
16    https://www.youtube.com/watch?v=KvIlnX8MnJM&t=450
17     https://www.youtube.com/watch?v=O6vyB5SAc

In [5]:
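# 1 = label correct, 0 = label incorrect.
# `n` is a NaN placeholder (unused in this list) that np.nanmean would skip.
n = np.nan
urban_pd['label_correct'] = [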
    1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 
    1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 
    1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
    0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 
    1, 1, 1, 1, 0, 0, 1, 0, 1, 0
]
    
np.nanmean(urban_pd['label_correct'])

0.64

**The labeling accuracy of the sampled files was 84% for the rural class and 64% for the urban class. Based on this, I assume that the maximum classification accuracy achievable on the current dataset is roughly 80%.**
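
With only 50 files per class, each of the accuracy estimates above carries considerable sampling uncertainty. The following is a minimal sketch, not part of the original analysis, that computes a 95% Wilson score interval for each class along with the pooled accuracy over all 100 files. The counts 42/50 and 32/50 follow from the 0.84 and 0.64 means above; the function name `wilson_interval` and the choice of the Wilson method are assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts implied by the means above: 0.84 * 50 = 42 and 0.64 * 50 = 32.
samples = {"rural": (42, 50), "urban": (32, 50)}

for name, (correct, total) in samples.items():
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.2f} (95% CI {low:.2f} to {high:.2f})")

# Pooled accuracy across both classes.
pooled_correct = sum(c for c, _ in samples.values())
pooled_total = sum(t for _, t in samples.values())
pooled_accuracy = pooled_correct / pooled_total
print(f"pooled: {pooled_accuracy:.2f}")
```

The intervals are wide at this sample size, so the per-class figures are best read as rough estimates rather than precise accuracies.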