# A Classification Application for the UrbanSound8K Dataset

[UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html) is a publicly available audio dataset:

> This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the [urban sound taxonomy](https://urbansounddataset.weebly.com/taxonomy.html).

The audio files in this dataset also come from [freesound](https://datasets.freesound.org/), but its classes differ from those of the [FSDKaggle2018](https://www.kaggle.com/c/freesound-audio-tagging) dataset.

According to ["Recognition of Acoustic Events Using Masked Conditional Neural Networks"](https://arxiv.org/pdf/1802.02617.pdf) [2], the state-of-the-art accuracy on this dataset is about 73-74% so far.

- References

    - [1] J. Salamon, C. Jacoby and J. P. Bello, ["A Dataset and Taxonomy for Urban Sound Research"](http://www.justinsalamon.com/uploads/4/3/9/4/4394963/salamon_urbansound_acmmm14.pdf), 22nd ACM International Conference on Multimedia, Orlando USA, Nov. 2014.
    - [2] F. Medhat, D. Chesmore and J. Robinson, ["Recognition of Acoustic Events Using Masked Conditional Neural Networks"](https://arxiv.org/pdf/1802.02617.pdf), 2018.

In [1]:
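import sys
sys.path.append('../..')  # make lib_train (two levels up) importable
from lib_train import *   # the star import is expected to provide conf, pd, np and Path
%matplotlib inline

DATAROOT = Path('/mnt/dataset/UrbanSound8K')  # set this to the folder of your copy
df = pd.read_csv(DATAROOT/'metadata/UrbanSound8K.csv')
folds = sorted(df.fold.unique())  # fold numbers in ascending order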

{'sampling_rate': 44100, 'duration': 1, 'hop_length': 347, 'fmin': 20, 'fmax': 22050, 'n_mels': 128, 'n_fft': 2560, 'model': 'mobilenetv2', 'labels': ['dog_bark', 'children_playing', 'car_horn', 'air_conditioner', 'street_music', 'gun_shot', 'siren', 'engine_idling', 'jackhammer', 'drilling'], 'folder': PosixPath('.'), 'n_fold': 1, 'valid_limit': None, 'random_state': 42, 'test_size': 0.01, 'samples_per_file': 5, 'batch_size': 32, 'learning_rate': 0.0001, 'metric_save_ckpt': 'val_acc', 'epochs': 100, 'verbose': 2, 'best_weight_file': 'best_model_weight.h5', 'rt_process_count': 1, 'rt_oversamples': 10, 'pred_ensembles': 10, 'runtime_model_file': 'model/mobilenetv2_fsd2018_41cls.pb', 'label2int': {'dog_bark': 0, 'children_playing': 1, 'car_horn': 2, 'air_conditioner': 3, 'street_music': 4, 'gun_shot': 5, 'siren': 6, 'engine_idling': 7, 'jackhammer': 8, 'drilling': 9}, 'num_classes': 10, 'samples': 44100, 'rt_chunk_samples': 4410, 'mels_onestep_samples': 4410, 'mels_convert_samples': 4851

Using TensorFlow backend.
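
Some of the values in the configuration printed above follow from simple arithmetic. The sketch below reproduces the sample count and the spectrogram size. It assumes (the notebook does not confirm this) that the spectrogram is computed with a librosa-style centered STFT, where the number of frames is `1 + samples // hop_length`. The function name `derived_shape` is hypothetical.

```python
def derived_shape(sampling_rate, duration, hop_length, n_mels):
    """Return (samples, n_frames, n_mels) for one clip.

    Assumption: frames are counted as in a centered STFT,
    i.e. 1 + samples // hop_length.
    """
    samples = sampling_rate * duration
    n_frames = 1 + samples // hop_length
    return samples, n_frames, n_mels


# Values taken from the configuration output above
samples, n_frames, n_mels = derived_shape(44100, 1, 347, 128)
print(samples, n_frames, n_mels)  # 44100 128 128
```

Under that assumption, a hop length of 347 turns a one-second clip into a square 128 x 128 input, which would explain the otherwise odd-looking `hop_length` value.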


## Inside this dataset

In [3]:
df[:10]

| | slice_file_name | fsID | start | end | salience | fold | classID | class |
|---|---|---|---|---|---|---|---|---|
| 0 | 100032-3-0-0.wav | 100032 | 0.0 | 0.317551 | 1 | 5 | 3 | dog_bark |
| 1 | 100263-2-0-117.wav | 100263 | 58.5 | 62.5 | 1 | 5 | 2 | children_playing |
| 2 | 100263-2-0-121.wav | 100263 | 60.5 | 64.5 | 1 | 5 | 2 | children_playing |
| 3 | 100263-2-0-126.wav | 100263 | 63.0 | 67.0 | 1 | 5 | 2 | children_playing |
| 4 | 100263-2-0-137.wav | 100263 | 68.5 | 72.5 | 1 | 5 | 2 | children_playing |
| 5 | 100263-2-0-143.wav | 100263 | 71.5 | 75.5 | 1 | 5 | 2 | children_playing |
| 6 | 100263-2-0-161.wav | 100263 | 80.5 | 84.5 | 1 | 5 | 2 | children_playing |
| 7 | 100263-2-0-3.wav | 100263 | 1.5 | 5.5 | 1 | 5 | 2 | children_playing |
| 8 | 100263-2-0-36.wav | 100263 | 18.0 | 22.0 | 1 | 5 | 2 | children_playing |
| 9 | 100648-1-0-0.wav | 100648 | 4.823402 | 5.471927 | 2 | 10 | 1 | car_horn |


The summary below shows that most clips are 4 seconds long: the 25th, 50th, and 75th percentiles of the duration are all 4.0 seconds.

In [4]:
(df.end - df.start).describe()

count    8732.000000
mean        3.607904
std         0.973570
min         0.054517
25%         4.000000
50%         4.000000
75%         4.000000
max         4.000000
dtype: float64
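
To put a number on "most", one can compute the share of clips that reach the full 4 seconds. The following is a minimal sketch in plain Python. The helper `full_length_share` is hypothetical, and the sample values are only the ten metadata rows shown earlier, not the whole dataset.

```python
def full_length_share(starts, ends, full=4.0, tol=1e-6):
    """Fraction of clips whose duration reaches `full` seconds (within `tol`)."""
    durations = [e - s for s, e in zip(starts, ends)]
    n_full = sum(1 for d in durations if d >= full - tol)
    return n_full / len(durations)


# start/end columns of the first ten metadata rows shown earlier
starts = [0.0, 58.5, 60.5, 63.0, 68.5, 71.5, 80.5, 1.5, 18.0, 4.823402]
ends = [0.317551, 62.5, 64.5, 67.0, 72.5, 75.5, 84.5, 5.5, 22.0, 5.471927]

share = full_length_share(starts, ends)
print(share)  # 0.8
```

With the real metadata, the same function could be called as `full_length_share(df.start, df.end)`.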

## Duplication with FSDKaggle2018

As shown below, some Freesound IDs in this dataset also appear in the FSDKaggle2018 dataset.

___Because of this overlap, we do NOT use the FSDKaggle2018 pretrained model to evaluate performance on this dataset; otherwise the results would be too good to be true.___


In [5]:
fsd = pd.read_csv('~/.kaggle/competitions/freesound-audio-tagging/train_post_competition.csv')

In [6]:
usids = np.array(df.fsID.unique(), dtype=int)  # np.int was removed in NumPy 1.24; use the built-in int
fsids = np.array(fsd.freesound_id.unique(), dtype=int)
dup_ids = [uid for uid in usids if uid in fsids]
print('Number of duplicated Freesound ID:', len(dup_ids))
print('Duplicated samples are distributed over folds:', df[df.fsID.isin(dup_ids)].fold.unique())

Number of duplicated Freesound ID: 130
Duplicated samples are distributed over folds: [ 5  2 10  1  4  3  8  6  9  7]
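
The membership test above scans one array for every ID in the other. A set intersection finds the same IDs more directly. The sketch below uses plain Python sets. The helper name `shared_ids` is hypothetical, and the second ID list is made up for illustration (only the first list comes from the metadata excerpt earlier in this notebook).

```python
def shared_ids(ids_a, ids_b):
    """Return the sorted IDs that occur in both collections."""
    return sorted(set(int(i) for i in ids_a) & set(int(i) for i in ids_b))


urbansound_ids = [100032, 100263, 100648]  # fsID values from the metadata excerpt
other_ids = [100263, 555555, 100648]       # illustrative values, NOT real FSDKaggle2018 IDs

dups = shared_ids(urbansound_ids, other_ids)
print(dups)  # [100263, 100648]
```

With the notebook's data, `shared_ids(df.fsID, fsd.freesound_id)` should contain the same IDs as `dup_ids`.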


## Convert to numpy array files

In [7]:
# Convert the data of each fold and save it as NumPy array files
for fold in folds:
    cur_df = df[df.fold == fold]
    Xfiles = [str(DATAROOT/'audio'/('fold%d' % r.fold)/r.slice_file_name) for _, r in cur_df.iterrows()]
    y = [conf.label2int[r['class']] for _, r in cur_df.iterrows()]
    XX = mels_build_multiplexed_X(conf, Xfiles)
    X, y = mels_demux_XX_y(XX, y)
    np.save('X_fold%d.npy' % fold, X)
    np.save('y_fold%d.npy' % fold, y)
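
The configuration sets `samples_per_file` to 5, so each audio file may contribute several training samples, and the labels then have to be repeated to match. The sketch below illustrates that idea only. It is an assumption about what a "demux" step such as `mels_demux_XX_y` does, not the actual implementation in `lib_train`, and `demux` is a hypothetical name.

```python
def demux(XX, y):
    """Flatten per-file sample groups and repeat each file's label per sample.

    XX: list with one entry per file, each entry a list of samples.
    y:  list with one label per file.
    """
    X_flat, y_flat = [], []
    for samples, label in zip(XX, y):
        for sample in samples:
            X_flat.append(sample)
            y_flat.append(label)
    return X_flat, y_flat


# Two files: the first yields two samples, the second yields three.
XX = [['a0', 'a1'], ['b0', 'b1', 'b2']]
y = [3, 2]

X_flat, y_flat = demux(XX, y)
print(X_flat)  # ['a0', 'a1', 'b0', 'b1', 'b2']
print(y_flat)  # [3, 3, 2, 2, 2]
```

The point of the sketch is that the flattened inputs and labels stay aligned index by index, which is what a per-fold `X_fold*.npy` / `y_fold*.npy` pair needs.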