In [39]:
import pandas as pd

## Strongly labelled dataset
**Train set**

The strongly labelled AudioSet release comes as a tab-separated file containing YouTube IDs (the `segment_id` column below), start/end times of each labelled segment, and a class label ID.

- A YouTube ID maps to a video URL as: https://www.youtube.com/watch?v={YOUTUBE_ID}
- The class label ID must be matched against the ontology file to obtain the human-readable class name and description

For the visually indicated sounds workstream we are not necessarily interested in the labels, but the dataset is still helpful because it provides a backbone list of video-audio links to download.
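
The `segment_id` values shown below (e.g. `b0RFKhbpFJA_30000`) look like a YouTube ID followed by an underscore and a number. A minimal sketch of turning one into a URL, assuming the suffix is the clip's start offset in milliseconds (an assumption about the file format, not something this notebook verifies); `segment_to_url` is a hypothetical helper name:

```python
def segment_to_url(segment_id):
    """Split '<YOUTUBE_ID>_<offset_ms>' into a watch URL and an offset in seconds.

    Assumes the numeric suffix is a start offset in milliseconds.
    """
    # Split on the last underscore only: YouTube IDs may themselves contain '_'
    youtube_id, offset_ms = segment_id.rsplit('_', 1)
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    return url, int(offset_ms) / 1000.0


url, offset_s = segment_to_url('b0RFKhbpFJA_30000')
print(url, offset_s)
```

Splitting from the right keeps IDs that contain underscores intact.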

In [44]:
df = pd.read_csv('audioset_data/audioset_train_strong.tsv', sep='\t')
df

Unnamed: 0,segment_id,start_time_seconds,end_time_seconds,label
0,b0RFKhbpFJA_30000,0.000,10.000,/m/03m9d0z
1,b0RFKhbpFJA_30000,4.753,5.720,/m/05zppz
2,b0RFKhbpFJA_30000,0.000,10.000,/m/07pjwq1
3,b0RFKhbpFJA_30000,6.899,7.010,/m/07qjznt
4,b0RFKhbpFJA_30000,8.534,9.156,/t/dd00092
...,...,...,...,...
934816,cq-vfngNXMc_70000,7.836,8.015,/m/07qjznt
934817,cq-vfngNXMc_70000,8.226,8.511,/t/dd00099
934818,cq-vfngNXMc_70000,8.503,8.868,/m/05zppz
934819,cq-vfngNXMc_70000,9.217,9.624,/t/dd00099


In [49]:
df['diff'] = df.end_time_seconds - df.start_time_seconds
df['diff'].describe()

count    934821.000000
mean          2.059805
std           3.221891
min           0.000000
25%           0.220000
50%           0.552000
75%           1.753000
max          10.000000
Name: diff, dtype: float64

In [61]:
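# Iterate over clips; each segment_id groups all labelled events of one clip
for video_url, video_df in df.groupby('segment_id', sort=False):
    # Time span covered by the labelled events of this clip
    start_time = video_df.start_time_seconds.min()
    end_time = video_df.end_time_seconds.max()
    print(start_time)
    print(end_time)
    ## DOWNLOAD VIDEO

    # Access fields by name rather than by integer position
    for segment in video_df.itertuples(index=False):
        segment_id = segment.segment_id
        start_time = segment.start_time_seconds
        end_time = segment.end_time_seconds
        label = segment.label

    # Stop after the first clip while prototyping
    break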



0.0
10.0
10.0
5.72
10.0
7.01
9.156




Each video has labels for multiple segments. The total number of segments is 139538, while the total number of videos is:

In [5]:
df.segment_id.nunique()

103463
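
The relationship between label rows and clips can be summarised per `segment_id` with a single `groupby`. The sketch below uses a toy frame built from a few rows of the table above; the output column names (`n_labels`, `first_start`, `last_end`) are illustrative choices, not part of the dataset:

```python
import pandas as pd

# Toy frame with a few rows copied from the strong-label table above
toy = pd.DataFrame({
    'segment_id': ['b0RFKhbpFJA_30000'] * 3 + ['cq-vfngNXMc_70000'] * 2,
    'start_time_seconds': [0.000, 4.753, 6.899, 7.836, 8.226],
    'end_time_seconds': [10.000, 5.720, 7.010, 8.015, 8.511],
})

# One row per clip: number of label rows and the time span they cover
per_clip = toy.groupby('segment_id').agg(
    n_labels=('start_time_seconds', 'size'),
    first_start=('start_time_seconds', 'min'),
    last_end=('end_time_seconds', 'max'),
)
print(per_clip)
```

On the full `df`, the same aggregation would give the per-clip span without filtering the frame once per `segment_id`.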

Each segment is labelled with a class that is identified by a label ID (e.g. /m/07pggtn). The ID-to-name correspondence can be found in the ontology file.

In [6]:
onto = pd.read_json('audioset_data/ontology.json')

print(f"Number of classes: {len(onto)}")

onto.head()

Number of classes: 632


Unnamed: 0,id,name,description,citation_uri,positive_examples,child_ids,restrictions
0,/m/0dgw9r,Human sounds,Sounds produced by the human body through the ...,,[],"[/m/09l8g, /m/01w250, /m/09hlz4, /m/0bpl036, /...",[abstract]
1,/m/09l8g,Human voice,The human voice consists of sound made by a hu...,http://en.wikipedia.org/wiki/Human_voice,[],"[/m/09x0r, /m/07p6fty, /m/03qc9zr, /m/02rtxlg,...",[abstract]
2,/m/09x0r,Speech,Speech is the vocalized form of human communic...,http://en.wikipedia.org/wiki/Speech,"[youtu.be/8uI9H5jGRV8?start=30&end=40, youtu.b...","[/m/05zppz, /m/02zsn, /m/0ytgt, /m/01h8n0, /m/...",[]
3,/m/05zppz,"Male speech, man speaking",Speech uttered by an adult male human.,,"[youtu.be/6niRPYpLOpQ?start=30&end=40, youtu.b...",[],[]
4,/m/02zsn,"Female speech, woman speaking",Speech uttered by an adult female human.,,"[youtu.be/4l05nCOnIRg?start=30&end=40, youtu.b...",[],[]


In [7]:
df = pd.merge(df, onto[['id','name','description']], how='left', left_on='label', right_on='id')

df = df.drop(columns=['id'])

df

Unnamed: 0,segment_id,start_time_seconds,end_time_seconds,label,name,description
0,b0RFKhbpFJA_30000,0.000,10.000,/m/03m9d0z,Wind,Sounds caused by the large-scale flow of gases...
1,b0RFKhbpFJA_30000,4.753,5.720,/m/05zppz,"Male speech, man speaking",Speech uttered by an adult male human.
2,b0RFKhbpFJA_30000,0.000,10.000,/m/07pjwq1,Buzz,"The sound of rapid vibration, commonly the win..."
3,b0RFKhbpFJA_30000,6.899,7.010,/m/07qjznt,Tick,A metallic tapping sound.
4,b0RFKhbpFJA_30000,8.534,9.156,/t/dd00092,Wind noise (microphone),The noise produced when a strong air current p...
...,...,...,...,...,...,...
934816,cq-vfngNXMc_70000,7.836,8.015,/m/07qjznt,Tick,A metallic tapping sound.
934817,cq-vfngNXMc_70000,8.226,8.511,/t/dd00099,Generic impact sounds,Sounds of impacts or collisions preferentially...
934818,cq-vfngNXMc_70000,8.503,8.868,/m/05zppz,"Male speech, man speaking",Speech uttered by an adult male human.
934819,cq-vfngNXMc_70000,9.217,9.624,/t/dd00099,Generic impact sounds,Sounds of impacts or collisions preferentially...


The 10 most frequent labels, followed by the 10 least frequent, are:

In [8]:
df.groupby('name').count()[['segment_id']].sort_values('segment_id', ascending=False).head(10)

name,segment_id
Generic impact sounds,131725
"Male speech, man speaking",101509
Tick,57020
"Female speech, woman speaking",36065
Breathing,31606
Music,30930
Mechanisms,27411
Background noise,21162
Wind noise (microphone),18834
"Bird vocalization, bird call, bird song",16732


In [9]:
df.groupby('name').count()[['segment_id']].sort_values('segment_id', ascending=False).tail(10)

name,segment_id
Wildfire,2
Wobble,2
Zing,1
Puff,1
"Outside, urban or manmade",1
Deformable shell,1
"Inside, small room",1
Human locomotion,1
Sonic boom,1
Grind,1


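One issue with the dataset: the same segment can carry several labels, often with overlapping time spans. For example, in `b0RFKhbpFJA_30000` both Wind and Buzz span 0–10 s, while "Male speech, man speaking" covers 4.753–5.720 s within the same clip.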
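
If one row per clip is needed, the label names can be collected into a list per `segment_id`. This is only a sketch on a toy frame built from the first rows of the merged table above; whether to keep, merge, or drop overlapping labels is a separate decision:

```python
import pandas as pd

# First five rows of the merged table, reduced to the columns needed here
toy = pd.DataFrame({
    'segment_id': ['b0RFKhbpFJA_30000'] * 5,
    'name': [
        'Wind',
        'Male speech, man speaking',
        'Buzz',
        'Tick',
        'Wind noise (microphone)',
    ],
})

# Collect every label name attached to a clip into one list (row order is kept)
labels_per_clip = toy.groupby('segment_id')['name'].agg(list)
print(labels_per_clip['b0RFKhbpFJA_30000'])
```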

## Original dataset

This is the original AudioSet release (from 2017). It is not strongly labelled, but can still serve our purpose. In this dataset, 1 row = 1 video: each row gives a single segment of the video and all of its labels in `positive_labels`.

In [33]:
df1 = pd.read_csv('audioset_data/eval_segments.csv')

In [34]:
df1

Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,--4gqARaEJE,0.0,10.0,"""/m/068hy",/m/07q6cd_,/m/0bt9lr,"/m/0jbk""",,,,,
1,--BfvyPmVMo,20.0,30.0,"""/m/03l9g""",,,,,,,,
2,--U7joUcTCo,0.0,10.0,"""/m/01b_21""",,,,,,,,
3,--i-y1v8Hy8,0.0,9.0,"""/m/04rlf",/m/09x0r,/t/dd00004,"/t/dd00005""",,,,,
4,-0BIyqJj9ZU,30.0,40.0,"""/m/07rgt08",/m/07sq110,"/t/dd00001""",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
20366,zyF8TGSRvns,150.0,160.0,"""/m/0dwsp",/m/0dwtp,/m/0f8s22,"/m/0j45pbj""",,,,,
20367,zz35Va7tYmA,30.0,40.0,"""/m/012f08",/m/07q2z82,/m/07qmpdm,"/m/0k4j""",,,,,
20368,zzD_oVgzKMc,30.0,40.0,"""/m/07pn_8q""",,,,,,,,
20369,zzNdwF40ID8,70.0,80.0,"""/m/04rlf","/m/0790c""",,,,,,,
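
In the table above, the quoted `positive_labels` list is spread across the `Unnamed: N` columns, which suggests the quotes were not honoured during parsing. One possible cause is a space after each comma in the file. The sketch below assumes that layout (the sample text is reconstructed from the rows above, not read from the real file) and uses `skipinitialspace=True` so that the quoted list stays in a single field:

```python
import io
import pandas as pd

# Two rows reconstructed from the table above; the ", " delimiter is an assumption
sample = io.StringIO(
    '# YTID, start_seconds, end_seconds, positive_labels\n'
    '--4gqARaEJE, 0.0, 10.0, "/m/068hy,/m/07q6cd_,/m/0bt9lr,/m/0jbk"\n'
    '--BfvyPmVMo, 20.0, 30.0, "/m/03l9g"\n'
)

# skipinitialspace drops the blank after each comma, so the opening quote
# is seen at the start of the field and the label list is kept together
parsed = pd.read_csv(sample, skipinitialspace=True)
parsed = parsed.rename(columns={'# YTID': 'ytid'})

# Turn the comma-joined label string into a Python list per row
parsed['positive_labels'] = parsed['positive_labels'].str.split(',')
print(parsed)
```

If the real file also starts with comment lines before the header, a `skiprows` argument would be needed as well.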


In [37]:
df1 = pd.read_csv('audioset_data/balanced_train_segments.csv')

In [38]:
df1

Unnamed: 0,# YTID,start_seconds,end_seconds,positive_labels,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,--PJHxphWEs,30.0,40.0,"""/m/09x0r","/t/dd00088""",,,,,,,
1,--ZhevVpy1s,50.0,60.0,"""/m/012xff""",,,,,,,,
2,--aE2O5G5WE,0.0,10.0,"""/m/03fwl",/m/04rlf,"/m/09x0r""",,,,,,
3,--aO5cdqSAg,30.0,40.0,"""/t/dd00003","/t/dd00005""",,,,,,,
4,--aaILOrkII,200.0,210.0,"""/m/032s66","/m/073cg4""",,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
22155,zyqg4pYEioQ,20.0,30.0,"""/m/09x0r","/m/0llzx""",,,,,,,
22156,zz0ddNfz0h0,30.0,40.0,"""/m/012f08",/m/03cl9h,/m/07yv9,"/m/0k4j""",,,,,
22157,zz8TGV83nkE,80.0,90.0,"""/m/012f08",/m/02mk9,/m/04_sv,"/m/07yv9""",,,,,
22158,zzlK8KDqlr0,370.0,380.0,"""/m/01m2v",/m/07qc9xj,/m/09x0r,"/t/dd00125""",,,,,
