
1. [Introduction](#1)
2. [Exploratory Data Analysis](#2)
    * [Importing Libraries](#21)
    * [Load Bird Songs Dataset](#22)
    * [Bird Species Analysis](#23)
    * [Recordings by Geographical Location](#231)
    * [Samples by Country](#24)
    * [Date of Recordings](#25)
    * [Birds Seen](#26)
    * [Pitch](#27)
    * [Sampling Rate](#28)
    * [Volume](#29)
    * [Channels](#210)
    * [Recordist](#211)
    * [Ratings](#212)
    * [Bird Seen by Country](#213)
3. [Audio Data Analysis](#3)
    * [Playing Audio](#31)
    * [Visualizing Audio in 2D](#32)
    * [Spectrogram Analysis](#33)
4. [Audio Features](#4)
    * [Spectral Centroid](#41)
    * [Spectral Bandwidth](#42)
    * [Spectral Rolloff](#43)
    * [Zero-Crossing Rate](#44)
    * [Mel-Frequency Cepstral Coefficients (MFCCs)](#45)
    * [Chroma feature](#46)
5. [Compare Sound Features](#5)


# 1. Introduction<a id="1"></a> <br>

Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. 

With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population.

There are already many projects underway to extensively monitor birds by continuously recording natural soundscapes over long periods. However, as many living and nonliving things make noise, the analysis of these datasets is often done manually by domain experts. These analyses are painstakingly slow, and results are often incomplete. Data science may be able to assist, so researchers have turned to large crowdsourced databases of focal recordings of birds to train AI models. Unfortunately, there is a domain mismatch between the training data (short recordings of individual birds) and the soundscape recordings (long recordings, often with multiple species calling at the same time) used in monitoring applications. This is one of the reasons why the performance of the currently used AI models has been subpar.

## Objective  
To identify a wide variety of bird vocalizations in soundscape recordings. Due to the complexity of the recordings, they contain weak labels. There might be anthropogenic sounds (e.g., airplane overflights) or other bird and non-bird (e.g., chipmunk) calls in the background, with a particular labeled bird species in the foreground. Bring new ideas to build effective detectors and classifiers for analyzing complex soundscape recordings.

So let us use the dataset from the <b>Cornell Lab of Ornithology’s Center for Conservation Bioacoustics (CCB)</b> to carry out a complete exploratory data analysis, draw insights from the data, and use those findings to design an AI model that can achieve the above objective.


# 2. Exploratory Data Analysis<a id='2'></a>
## <font size='4' color='blue'>Importing Libraries</font><a id='21'></a>

In [None]:
!pip install librosa
!pip install pandas-bokeh
!pip install chart_studio
!pip install pydub

In [None]:
# Standard library
import datetime as dt
from datetime import datetime
import os
import re
import struct
import warnings
import wave

# Numerics and data handling
import numpy as np
import pandas as pd
import pandas_profiling
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Audio
import librosa
import librosa.display
from pydub import AudioSegment
from scipy.io import wavfile as wav

# Plotting and maps
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import chart_studio.plotly as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import cufflinks
import pandas_bokeh
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import DataTable, TableColumn
import folium
from folium.plugins import HeatMap, HeatMapWithTime

# Notebook helpers
import IPython
import IPython as ipy
import IPython.display as ipd
import IPython.display as ipyd
from IPython.display import IFrame
from colorama import Fore, Back, Style
from tqdm import tqdm_notebook

# List every file available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Global display settings
warnings.filterwarnings('ignore')
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
plt.style.use('fivethirtyeight')
pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')

## <font size='4' color='blue'>Load Bird Songs Dataset</font><a id='22'></a>

In [None]:
train = pd.read_csv("../input/birdsong-recognition/train.csv")
media_path = '/kaggle/input/birdsong-recognition/train_audio/'

In [None]:
train.info()

In [None]:
def bird_sound_plotter(full_path, data):
    """Print metadata for one recording, plot its waveform and return an audio player."""
    rate, wave_sample = wav.read(full_path)
    # Bytes 34-35 of a canonical PCM WAV header hold the bits-per-sample field
    # (little-endian unsigned short); the file is closed as soon as it is read.
    with open(full_path, "rb") as wave_file:
        riff_fmt = wave_file.read(36)
    bit_depth = struct.unpack("<H", riff_fmt[-2:])[0]
    print(Fore.CYAN + data['title'].upper())
    print('_' * len(data['title']))
    print('')
    print('Scientific Name:', data['sci_name'])
    print("Recorded in: {}".format(data['country']))
    print('Recordist: ', data['author'])
    print('Number of Channels: ', wave_sample.shape[1] if len(wave_sample.shape) > 1 else 1)
    print('Number of Samples: ', len(wave_sample))
    print('Rating: ', data['rating'])
    print('Sampling rate: ', rate, 'Hz')
    print('Bit depth: ', bit_depth)
    print('Duration: ', wave_sample.shape[0] / rate, 'seconds')
    plt.figure(figsize=(12, 4))
    plt.plot(wave_sample)
    return ipd.Audio(full_path)

def plot_bird_sound_wave(sp):
    """Pick a random recording of species `sp`, convert it from MP3 to WAV and plot it."""
    train_sound_data = train[train['species'] == sp]
    idx = np.random.choice(train_sound_data.index, 1)[0]
    bird_sound_data = train_sound_data.loc[idx, :]
    src = os.path.join(media_path, bird_sound_data['ebird_code'], bird_sound_data['filename'])
    bird_sound_mp3 = AudioSegment.from_mp3(src)
    # The WAV copy is written to the current working directory.
    filename = bird_sound_data['filename'].split('.')[0] + '.wav'
    bird_sound_mp3.export(filename, format='wav')
    return bird_sound_plotter(filename, bird_sound_data)

## <font size='4' color='blue'>Bird Species Analysis</font><a id='23'></a>

Let us find out how many bird species the dataset contains and what they are.

In [None]:
ebird = train['ebird_code'].value_counts().index.to_list()
ebird_code_path = 'https://ebird.org/species/'
species = [ebird_code_path+i for i in ebird]

In [None]:
IFrame(species[16], width=1200, height=600)

In [None]:
IFrame(species[100], width=1200, height=600)

In [None]:
print("There are {} unique species of birds in train dataset".format(train.species.nunique()))

In [None]:
train['species'].value_counts().sort_values(ascending = False).iplot(kind='bar',color='#85500BF')

## <font size='4' color='blue'>Recordings by Geographical Location</font><a id='231'></a>

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
bird_song_geographical_location = user_secrets.get_secret("bird_song")
bird_song_map = user_secrets.get_secret("location")
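# Coordinates are stored as strings; 'Not specified' entries become NaN.
# float16 would round coordinates to a fraction of a degree, so keep pandas' default float64.
train.latitude = pd.to_numeric(train.latitude, errors='coerce')
train.longitude = pd.to_numeric(train.longitude, errors='coerce')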

In [None]:
px.set_mapbox_access_token(bird_song_map)
fig = px.scatter_mapbox(train,
                lat='latitude',
                lon='longitude',
                size='duration',
                color='rating',
                hover_name='species',
                hover_data=['country','elevation','duration'],
                color_continuous_scale=px.colors.sequential.Rainbow,
                mapbox_style='stamen-terrain',
                zoom=0.5)
fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="mercator")
fig.update_layout(height=1000,width=1200,margin={"r":100,"t":200,"l":0,"b":0})
fig.update_layout(title='<b>Recording Locations</b>',template='seaborn',
                  hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"))
fig.show()

## <font size='4' color='blue'>Samples by Country</font><a id='24'></a>

In [None]:
# Number of recordings per country, one row per country
country_counts = train.country.value_counts().sort_index(ascending=False)
df1 = country_counts.rename("Count").to_frame()

In [None]:
p_hbar = df1.plot_bokeh(
    kind="barh",
    x=country_counts.index,
    xlabel="Count",
    ylabel="Country",
    title="Number of Recordings per Country",
    alpha=0.4,
    figsize=(800,300),
    legend = "top_right",
    show_figure=False)
pandas_bokeh.plot_grid([[p_hbar]],plot_width=1100,plot_height=1000)

## <font size='4' color='blue'>Date of Recordings</font><a id='25'></a>

In [None]:
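# The pandas plotting backend was switched to pandas_bokeh above, so request
# matplotlib explicitly here and pass a readable figure size in inches.
train['date'].value_counts().sort_index().plot(backend='matplotlib', figsize=(16, 5), color='blue', alpha=.8)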

## <font size='4' color='blue'>Birds Seen</font><a id='26'></a>

In [None]:
train['bird_seen'].value_counts()

labels = train['bird_seen'].value_counts().index
values = train['bird_seen'].value_counts().values
colors=['#9793bf','#bf3fbf']

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial',marker=dict(colors=colors))])
fig.update_layout(title='Bird Seen')
fig.show()

## <font size='4' color='blue'>Pitch</font><a id='27'></a>

In [None]:
train['pitch'].value_counts()
labels = train['pitch'].value_counts().index
values = train['pitch'].value_counts().values
colors=['#0093bf','#af3fbf']

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial',marker=dict(colors=colors))])
fig.update_layout(title='Pitch',annotations=[dict(text='Pitch', x=0.51, y=0.5, font_size=20, showarrow=False)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

## <font size='4' color='blue'>Sampling Rate</font><a id='28'></a>

Sampling rate (audio) or sampling frequency defines the number of samples per second.
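
As a minimal illustration (the numbers below are made up for the example and are not taken from the dataset), the sampling rate links the number of samples in a clip to its duration, and it caps the highest representable frequency at half the rate (the Nyquist frequency):

```python
import numpy as np

sampling_rate = 44_100   # samples per second (Hz); illustrative value
duration_s = 2.0         # length of a hypothetical clip in seconds

# A clip is just an array holding sampling_rate * duration samples.
t = np.arange(int(sampling_rate * duration_s)) / sampling_rate
tone = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone

n_samples = len(tone)
recovered_duration = n_samples / sampling_rate   # seconds
nyquist_hz = sampling_rate / 2                   # highest representable frequency

print(n_samples, recovered_duration, nyquist_hz)
```

The same relation is what `bird_sound_plotter` relies on when it prints the duration as the number of samples divided by the rate.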

In [None]:
train['sampling_rate'].value_counts().sort_values(ascending = False).iplot(kind='bar',color='#09055BF')

## <font size='4' color='blue'>Volume</font><a id='29'></a>

In [None]:
train['volume'].value_counts()
labels = train['volume'].value_counts().index
values = train['volume'].value_counts().values
colors=['#9993bf','#df3abf']

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial',marker=dict(colors=colors))])
fig.update_layout(title='Volume',annotations=[dict(text='Volume', x=0.51, y=0.5, font_size=18, showarrow=False)])
fig.update_traces(hole=.4, hoverinfo="label+percent+name")
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show()

## <font size='4' color='blue'>Channels</font><a id='210'></a>
A channel is the path along which a signal or data is transported. One channel is usually referred to as mono, while more channels indicate stereo, surround sound and the like.
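
To make the mono/stereo distinction concrete, here is a small sketch with a hypothetical four-sample stereo clip. It assumes the `(n_samples, n_channels)` layout that `scipy.io.wavfile.read` returns for multi-channel files; the helper name `to_mono` is introduced only for this illustration.

```python
import numpy as np

# Hypothetical stereo clip with one column per channel.
left = np.array([0.0, 0.5, 1.0, 0.5])
right = np.array([0.0, -0.5, 0.0, 0.5])
stereo = np.stack([left, right], axis=1)   # shape (4, 2)

def to_mono(samples: np.ndarray) -> np.ndarray:
    """Average the channels of a (n_samples, n_channels) array; pass mono through unchanged."""
    if samples.ndim == 1:
        return samples
    return samples.mean(axis=1)

n_channels = 1 if stereo.ndim == 1 else stereo.shape[1]
mono = to_mono(stereo)
print(n_channels, mono)
```

Averaging is only one possible downmix. Note that `librosa.load` converts to mono by default, so the later librosa-based cells work on single-channel signals regardless of how a file was recorded.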

In [None]:
channels = train.groupby('channels',as_index=False)['title'].count().sort_values('channels')
fig = go.Figure()
fig.add_trace(go.Bar(x=channels['channels'],y=channels['title'],marker_line_color='red',marker_line_width=2.5,text=channels['title'],textposition='auto'))
fig.update_layout(template='seaborn',height=600,title='Channels',paper_bgcolor='rgb(255,255,255)',plot_bgcolor='rgb(255,255,255)',
                 xaxis=dict(title='Channels',nticks=20,mirror=True,linewidth=1,linecolor='green'),
                 yaxis=dict(title='Counts',mirror=False,linewidth=1,linecolor='black',gridcolor='darkgrey'))
fig.show()

## <font size='4' color='blue'>Recordist</font><a id='211'></a>
Let us find out how many people contributed recordings.

In [None]:
train['recordist'].nunique()

#### Now let us view the top 25 recordists and their contributions

In [None]:
train['recordist'].value_counts()[:25].sort_values().iplot(kind='barh',color='#89000BF')

## <font size='4' color='blue'>Ratings</font><a id='212'></a>
Let us look at the distribution of ratings.

In [None]:
ratings = train.groupby('rating',as_index=False)['title'].count().sort_values('rating')
fig = go.Figure()
fig.add_trace(go.Bar(x=ratings['rating'],y=ratings['title'],marker_line_color='red',marker_line_width=2.5,text=ratings['title'],textposition='auto'))
fig.update_layout(template='seaborn',height=600,title='Ratings',paper_bgcolor='rgb(255,255,255)',plot_bgcolor='rgb(255,255,255)',
                 xaxis=dict(title='Ratings',nticks=20,mirror=True,linewidth=1,linecolor='green'),
                 yaxis=dict(title='Counts',mirror=False,linewidth=1,linecolor='black',gridcolor='darkgrey'))
fig.show()

## <font size='4' color='blue'>Bird Seen by Country</font><a id='213'></a>

In [None]:
bird_seen_countries = train.groupby(['country','bird_seen'],as_index=False).agg({'title':'count','rating':'mean'})\
    .sort_values('title',ascending=False).reset_index(drop=True)
# .iloc is end-exclusive, so this keeps exactly the 50 largest (country, bird_seen) groups
bird_seen_countries = bird_seen_countries.iloc[:50]
seen_color = {'yes':'rgb(130, 17, 193)','no':'rgb(229, 58, 156)'}
fig = go.Figure()
for seen in ['yes','no']:
    fig.add_trace(go.Bar(name=seen,y=bird_seen_countries[bird_seen_countries['bird_seen']==seen]['country'],
                         x=bird_seen_countries[bird_seen_countries['bird_seen']==seen]['title'],orientation='h',
                         marker_line_color='black',marker_line_width=1.5,
                         text=np.round(bird_seen_countries[bird_seen_countries['bird_seen']==seen]['rating'],2),textposition='inside',
                         marker_color=seen_color[seen]))
fig.update_layout(height=1500,template='seaborn',paper_bgcolor='rgb(255,255,255)',plot_bgcolor='rgb(255,255,255)',barmode='stack',
                  hovermode='y unified',width=1200,
                 xaxis=dict(title='No of Recordings',type='log',mirror='allticks',linewidth=2,linecolor='black',
                            showgrid=True,gridcolor='darkgray'),
                 yaxis=dict(title='Country',mirror=True,linewidth=2,linecolor='black',tickfont=dict(size=12)),
                 legend=dict(title='<b>Bird seen in Country</b>',x=0.71,y=0.95,bgcolor='rgba(255, 255, 255, 0)',
                             bordercolor='rgba(255, 255, 255, 0)'),
                 title='<b>Number of Recordings & Average Ratings per Country [Top 50]</b>')
fig.show()

# 3. Audio Data Analysis<a id='3'></a>

## <font size='4' color='blue'>Playing Audio</font><a id='31'></a>

There are 264 bird species in the dataset, and multiple recordings are present for each species.
We will play a randomly chosen recording for a few species and plot its waveform.

### Snow Bunting
![](https://res-2.cloudinary.com/ebirdr/image/upload/s--GEPz7XJt--/f_auto,q_auto,t_full/2463-snow-bunting.jpg)


In [None]:
plot_bird_sound_wave('Snow Bunting')

### Caspian Tern
![](https://i.pinimg.com/originals/09/b5/0b/09b50b4dce31e02d1f93df92c0079984.jpg)


In [None]:
plot_bird_sound_wave('Caspian Tern')

### Barn Swallow
![](https://www.allaboutbirds.org/guide/assets/photo/68123021-480px.jpg)

In [None]:
plot_bird_sound_wave('Barn Swallow')

## <font size='4' color='blue'>Visualizing Audio in 2D</font><a id='32'></a>

In [None]:
train = pd.read_csv("../input/birdsong-recognition/train.csv")

In [None]:
path="../input/birdsong-recognition/train_audio/"
birds=train.ebird_code.unique()[:20]
file=train[train.ebird_code==birds[0]]['filename'].values[0]

In [None]:
plt.figure(figsize=(10,20 ))
for i in range(0,20):
    file=train[train.ebird_code==birds[i]]['filename'].values[0]
    audio_path=os.path.join(path,birds[i],file)
    plt.subplot(20,1,i+1)
    x , sr = librosa.load(audio_path)
    # waveshow replaces waveplot, which was removed in librosa 0.10
    librosa.display.waveshow(x, sr=sr)
    plt.gca().set_title(birds[i])
    plt.gca().get_xaxis().set_visible(False)

## <font size='4' color='blue'>Spectrogram Analysis</font><a id='33'></a>

![](https://www.researchgate.net/profile/Phillip_Lobel/publication/267827408/figure/fig2/AS:295457826852866@1447454043380/Spectrograms-and-Oscillograms-This-is-an-oscillogram-and-spectrogram-of-the-boatwhistle.png)

**What is a spectrogram?**
A spectrogram is a visual way of representing the signal strength, or “loudness”, of a signal over time at various frequencies present in a particular waveform.  Not only can one see whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can also see how energy levels vary over time.  In other sciences spectrograms are commonly used to display frequencies of sound waves produced by humans, machinery, animals, whales, jets, etc., as recorded by microphones.  In the seismic world, spectrograms are increasingly being used to look at frequency content of continuous signals recorded by individual or groups of seismometers to help distinguish and characterize different types of earthquakes or other vibrations in the earth. 

**How do you read a spectrogram?**

Spectrograms are basically two-dimensional graphs, with a third dimension represented by colors. Time runs from left (oldest) to right (youngest) along the horizontal axis. Each of our volcano and earthquake sub-groups of spectrograms shows 10 minutes of data with the tic marks along the horizontal axis corresponding to 1-minute intervals.  The vertical axis represents frequency, which can also be thought of as pitch or tone, with the lowest frequencies at the bottom and the highest frequencies at the top.  The amplitude (or energy or “loudness”) of a particular frequency at a particular time is represented by the third dimension, color, with dark blues corresponding to low amplitudes and brighter colors up through red corresponding to progressively stronger (or louder) amplitudes.
![](https://s3.amazonaws.com/pnsn-cms-uploads/attachments/000/000/583/original/6dd1240572ba9085af145892a1b4c1eacce3a651)
Above the spectrogram is the raw seismogram, drawn using the same horizontal time axis as the spectrogram (including the same tick marks), with the vertical axis representing wave amplitude. This plot is analogous to webicorder-style plots (or seismograms) that can be accessed via other parts of our website.  Collectively, the spectrogram-seismogram combination is a very powerful visualization tool, as it allows you to see raw waveforms for individual events and also the strength or “loudness” at various frequencies. The frequency content of an event can be very important in determining what produced the signal.
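
To see how the reading rules above play out, here is a minimal numpy-only sketch on a synthetic signal (not a recording from the dataset). The frame length, hop size and the helper name `spectrogram_db` are assumptions chosen for the illustration. The signal plays a 500 Hz tone and then a 2000 Hz tone, so the brightest row of the spectrogram should jump from a low to a high frequency halfway along the time axis.

```python
import numpy as np

sr = 8000                  # sampling rate in Hz (illustrative)
t = np.arange(sr) / sr     # one second of audio
# 500 Hz in the first half, 2000 Hz in the second half.
signal = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

def spectrogram_db(y, n_fft=256, hop=128):
    """Magnitude spectrogram in dB: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window for i in range(n_frames)], axis=1)
    magnitude = np.abs(np.fft.rfft(frames, axis=0))
    return 20 * np.log10(magnitude + 1e-10)

S_db = spectrogram_db(signal)
freqs = np.fft.rfftfreq(256, d=1 / sr)   # centre frequency of each row in Hz

# The strongest row in an early and in a late column reveals which tone is playing.
early_peak_hz = freqs[S_db[:, 5].argmax()]
late_peak_hz = freqs[S_db[:, -5].argmax()]
print(S_db.shape, early_peak_hz, late_peak_hz)
```

Each column is one short-time Fourier transform of a windowed slice of the signal, which is the quantity a spectrogram plot colours by amplitude.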

In [None]:
plt.figure(figsize=(30,20))
for i in range(0,5):
    file=train[train.ebird_code==birds[i]]['filename'].values[0]
    audio_path=os.path.join(path,birds[i],file)
    plt.subplot(5,1,i+1)
    x , sr = librosa.load(audio_path)
    x = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(x))
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    plt.gca().set_title(birds[i])
    plt.gca().get_xaxis().set_visible(False)
    plt.colorbar()

# 4. Audio Features<a id='4'></a>
## <font size='4' color='blue'>Spectral Centroid</font><a id='41'></a>
The spectral centroid is a measure used in digital signal processing to characterise a spectrum. It indicates where the center of mass of the spectrum is located. Perceptually, it has a robust connection with the impression of brightness of a sound.
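
The "center of mass" idea can be sketched directly with numpy: weight each frequency bin by its magnitude and take the weighted mean. The function below is a minimal illustration on synthetic tones, not the notebook's own feature extractor; the frame length and sampling rate are assumed values.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one audio frame."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    return float(np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-12))

sr = 8000
t = np.arange(1024) / sr
low_tone = np.sin(2 * np.pi * 500 * t)     # "dull" sound
high_tone = np.sin(2 * np.pi * 3000 * t)   # "bright" sound

c_low = spectral_centroid(low_tone, sr)
c_high = spectral_centroid(high_tone, sr)
print(round(c_low), round(c_high))
```

The brighter tone yields the higher centroid. For real recordings, `librosa.feature.spectral_centroid(y=y, sr=sr)` computes a per-frame version of this quantity.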

## <font size='4' color='blue'>Spectral Bandwidth</font><a id='42'></a>
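The spectral bandwidth describes how widely the energy of a spectrum is spread around its centroid. For audio it is usually computed as the magnitude-weighted p-th order deviation of the frequencies from the spectral centroid (p = 2 gives a weighted standard deviation). The idea is analogous to the definition used in optics, where the bandwidth is the width of a band at one-half of its peak maximum (full width at half maximum, FWHM).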
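
For audio signals, a common concrete definition (the one documented for librosa's `spectral_bandwidth`) is the p-th order spread of the normalised magnitude spectrum around its centroid. The sketch below implements that definition with numpy under assumed settings (Hann window, p = 2) and compares a pure tone with white noise; it is an illustration, not the notebook's pipeline.

```python
import numpy as np

def spectral_bandwidth(frame, sr, p=2):
    """p-th order spread (Hz) of the normalised magnitude spectrum around its centroid."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    weights = magnitude / (magnitude.sum() + 1e-12)
    centroid = np.sum(freqs * weights)
    return float(np.sum(weights * np.abs(freqs - centroid) ** p) ** (1 / p))

sr = 8000
t = np.arange(1024) / sr
rng = np.random.default_rng(0)
pure_tone = np.sin(2 * np.pi * 1000 * t)   # energy concentrated at one frequency
white_noise = rng.standard_normal(1024)    # energy spread over all frequencies

bw_tone = spectral_bandwidth(pure_tone, sr)
bw_noise = spectral_bandwidth(white_noise, sr)
print(bw_tone < bw_noise)
```

A narrow-band whistle therefore scores low, while a noisy or broadband call scores high.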

## <font size='4' color='blue'>Spectral Rolloff</font><a id='43'></a>
The spectral rolloff point is a measure of the right-skewedness of the power spectrum.
It is the fraction of bins in the power spectrum below which 85% of the power is concentrated; equivalently, it can be reported as the frequency below which 85% of the power lies.
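
A minimal numpy sketch of the 85% rule follows. It reports the rolloff as a frequency in Hz rather than as a fraction of bins, which is an assumption made here for readability; the test signals are synthetic.

```python
import numpy as np

def spectral_rolloff(frame, sr, roll_percent=0.85):
    """Lowest bin frequency (Hz) at or below which `roll_percent` of the power lies."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    cumulative = np.cumsum(power)
    threshold = roll_percent * cumulative[-1]
    # First bin whose cumulative power reaches the threshold.
    return float(freqs[np.searchsorted(cumulative, threshold)])

sr = 8000
t = np.arange(1024) / sr
# Strong 500 Hz component plus a much weaker 3000 Hz component.
mostly_low = np.sin(2 * np.pi * 500 * t) + 0.2 * np.sin(2 * np.pi * 3000 * t)
# Equal power at both frequencies.
balanced = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 3000 * t)

r_low = spectral_rolloff(mostly_low, sr)
r_balanced = spectral_rolloff(balanced, sr)
print(r_low, r_balanced)
```

When most of the power sits in the low component the rolloff stays near 500 Hz; once the high component carries half the power, the 85% point moves up to about 3000 Hz. `librosa.feature.spectral_rolloff` offers a per-frame version with the same 0.85 default for `roll_percent`.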

## <font size='4' color='blue'>Zero-Crossing Rate</font><a id='44'></a>
By looking at different speech and audio waveforms, we can see that, depending on the content, they vary a lot in their smoothness. For example, voiced speech sounds are smoother than unvoiced ones. Smoothness is thus an informative characteristic of the signal.

A very simple way of measuring the smoothness of a signal is to count the zero-crossings within a segment of that signal. A voiced signal oscillates slowly (for example, a 100 Hz tone crosses zero about 200 times per second, twice per cycle), whereas an unvoiced fricative can have 3000 zero-crossings per second.

To calculate the zero-crossing rate of a signal, you need to compare the sign of each pair of consecutive samples. In other words, a signal of length N needs O(N) operations. Such calculations are also extremely simple to implement, which makes the zero-crossing rate an attractive measure for low-complexity applications. However, the zero-crossing rate also has several drawbacks:

* The number of zero-crossings in a segment is an integer. A continuous-valued measure would allow more detailed analysis.
* The measure is applicable only to longer segments of the signal, since short segments might have no zero-crossings or just a few.
* To make the measure consistent, we must assume that the signal is zero-mean. You should therefore subtract the mean of each segment before calculating the zero-crossing rate.

An alternative to the zero-crossing rate is the autocorrelation at lag 1. It can also be estimated from short segments, it is continuous-valued, and its arithmetic complexity is also O(N).
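
Both measures discussed in this section fit in a few lines of numpy. The sketch below uses synthetic signals (a 100 Hz tone as a stand-in for a smooth, voiced-like sound and white noise as a stand-in for a fricative-like sound); the function names are introduced here for illustration. Counting upward and downward crossings together, the 100 Hz tone produces roughly 200 sign changes per second, 100 in each direction.

```python
import numpy as np

def zero_crossing_rate(segment):
    """Fraction of consecutive sample pairs whose signs differ, after mean removal."""
    centered = segment - np.mean(segment)
    signs = np.signbit(centered)
    return float(np.mean(signs[1:] != signs[:-1]))

def lag1_autocorrelation(segment):
    """Normalised autocorrelation at lag 1: near 1 for smooth signals, near 0 for noise."""
    centered = segment - np.mean(segment)
    return float(np.dot(centered[1:], centered[:-1]) / (np.dot(centered, centered) + 1e-12))

sr = 8000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
smooth = np.sin(2 * np.pi * 100 * t + 0.1)   # 100 Hz tone
noisy = rng.standard_normal(sr)              # white noise

crossings_per_second = zero_crossing_rate(smooth) * sr
print(round(crossings_per_second), zero_crossing_rate(noisy))
print(lag1_autocorrelation(smooth), lag1_autocorrelation(noisy))
```

Both functions make a single pass over the segment, which is the O(N) cost mentioned above. For framed, per-window values on real audio, `librosa.feature.zero_crossing_rate` is the usual choice.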



## <font size='4' color='blue'>Mel-Frequency Cepstral Coefficients(MFCCs)</font><a id='45'></a>

**Mel Frequency Cepstral Coefficient (MFCC) tutorial**
The first step in any automatic speech recognition system is to extract features i.e. identify the components of the audio signal that are good for identifying the linguistic content and discarding all the other stuff which carries information like background noise, emotion etc.

The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope. This section gives a short overview of MFCCs.

Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s and have been widely used ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were the main feature types for automatic speech recognition (ASR), especially with HMM classifiers. The rest of this section lists the main steps for computing MFCCs.

**Steps at a Glance**
At a high level, MFCCs are computed as follows:

* Frame the signal into short frames.
* For each frame calculate the periodogram estimate of the power spectrum.
* Apply the mel filterbank to the power spectra, sum the energy in each filter.
* Take the logarithm of all filterbank energies.
* Take the DCT of the log filterbank energies.
* Keep DCT coefficients 2-13, discard the rest.

A few more things are commonly done: sometimes the frame energy is appended to each feature vector, Delta and Delta-Delta features are usually also appended, and liftering is often applied to the final features.
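
The steps listed above can be sketched with numpy alone. Everything numeric here is an assumption chosen for the illustration and does not come from this notebook: a 16 kHz sampling rate, 400-sample frames with a 160-sample hop, a 512-point FFT, 26 triangular mel filters, the common `2595 * log10(1 + f / 700)` mel formula, and an unnormalised DCT-II. The input is a synthetic sweep, not a bird recording.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):        # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):       # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank

def mfcc(y, sr, frame_len=400, hop=160, n_fft=512, n_filters=26, n_keep=12):
    # 1. Frame the signal into short overlapping frames (Hamming-windowed).
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Periodogram estimate of the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 3. Apply the mel filterbank and sum the energy in each filter.
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # 4. Logarithm of the filterbank energies (small offset avoids log(0)).
    log_energies = np.log(energies + 1e-10)
    # 5. DCT-II of the log filterbank energies.
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_filters), 2 * n + 1) / (2 * n_filters))
    cepstra = log_energies @ dct_basis.T
    # 6. Keep coefficients 2-13 (indices 1..12) and discard the rest.
    return cepstra[:, 1 : 1 + n_keep]

sr = 16000
t = np.arange(sr) / sr
sweep = np.sin(2 * np.pi * (500 + 1500 * t) * t)   # synthetic test signal
features = mfcc(sweep, sr)
print(features.shape)   # one row of 12 coefficients per frame
```

In practice `librosa.feature.mfcc(y=y, sr=sr, n_mfcc=...)` would normally be used instead; its defaults (frame sizes, mel formula, DCT normalisation) differ from this sketch, so the numbers will not match exactly.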

## <font size='4' color='blue'>Chroma feature</font><a id='46'></a>

# 5. Compare Sound Features<a id='5'></a>

# Birdcall recordings Music - EDA 
![](https://storage.googleapis.com/pr-newsroom-wp/1/2020/03/Header.png)
This section continues from a previous kernel on Spotify music data extraction (Part 1):
https://www.kaggle.com/pavansanagapati/spotify-music-api-data-extraction-part1

We will now use the data extracted from the birdcall recordings to perform the following two steps:

#### 1. Explore the Audio Features and analyze
#### 2. Build a Machine Learning Model 

## 1. Explore the Audio Features and analyze

In [None]:
#Import Libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
from sklearn import svm
%matplotlib inline
import pandas_profiling 

#### Let us first analyse, at a high level, the data in the Spotify music dataframe that we built by accessing the Spotify data as shown in part 1 of this kernel: https://www.kaggle.com/pavansanagapati/spotify-music-api-data-extraction-part1

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


Let us now add a few more dataframes from datasets available on Kaggle for a deeper analysis.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


#### **Important Note**: Only the columns related to audio features are considered, as follows:

**Acousticness**: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

**Danceability** : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

**Energy** : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

**Instrumentalness**: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

**Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

**Loudness**: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

**Speechiness**: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

**Valence**: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**Tempo**: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

In [None]:
# Create data frame with features
def features(df, who):
    """Return the audio-feature columns of `df`; `who == 0` also keeps 'popularity'."""
    audio_columns = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                     'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
    if who == 1:
        return df.loc[:, audio_columns]
    if who == 0:
        return df.loc[:, audio_columns + ['popularity']]
    # Fail loudly instead of returning the string 'Error'
    raise ValueError("who must be 0 or 1")

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


Now let us create a dictionary in which the keys are the artists from both dataframes and the values are the total number of songs for each singer or group.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


### Visualise the data:
We will plot a Bar chart and a Radar Chart showing the means of the features.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


The standard deviations of the individual audio features do not give us much information (as we can see in the plots below), so we can sum them up and calculate the mean of the standard deviations of the lists.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed



### Correlation Between Variables

We will correlate the feature **valence** which describes the musical positiveness with **danceability** and **energy**.


#### Valence and Energy
The correlation between valence and energy shows that there is a cluster of songs with high energy and a low level of valence. This means that many of my energetic songs sound more negative, with feelings of sadness, anger and depression (NF takes a special place here, haha). When we look at the gray dots, on the other hand, we can see that as the level of valence (positive feelings) increases, the energy of the songs also increases. Although her data is split, we can identify this pattern, which indicates a kind of 'linear' correlation between the variables.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


#### Valence and Danceability

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


## 2. The Machine Learning Approach
I will be using different algorithms as I improve this kernel notebook to improve the model accuracy. So please keep watching this space on a frequent basis.

#### Removing Features
The first step is to preprocess our data set so that every column of the dataframe holds numerical values. So let's start by dropping all features that are not relevant to our model, such as `id`, `album`, `name`, `uri`, `popularity` and `track_number`, and by separating the target from the rest of the artist dataframe. We can easily do that by building the function `feature_elimination`, which receives a list of the features we want to drop as a parameter.

Notice that after this removal we still have a categorical feature (`artist`), so we'll have to deal with that in the second step. It is also important to mention that we have two roughly balanced classes, which indicate whose list each song belongs to.
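
A function like `feature_elimination` might look like the sketch below. Its exact signature is an assumption (here it takes the dataframe and the list of columns to drop), and the `songs` frame holds invented rows that only reuse column names mentioned above.

```python
import pandas as pd

def feature_elimination(df, features_to_drop):
    """Return a copy of `df` without the listed columns (missing names are ignored)."""
    return df.drop(columns=features_to_drop, errors="ignore")

# Toy frame with hypothetical values, using column names mentioned in the text.
songs = pd.DataFrame({
    "id": ["x1", "x2"],
    "name": ["song one", "song two"],
    "artist": ["artist_a", "artist_b"],
    "energy": [0.7, 0.3],
    "target": [1, 0],
})

cleaned = feature_elimination(
    songs, ["id", "album", "name", "uri", "popularity", "track_number"]
)
target = cleaned.pop("target")  # separate the label from the remaining features
print(list(cleaned.columns))
```

Passing `errors="ignore"` is a design choice for the sketch: it lets the same drop list be reused on frames that lack some of the columns.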

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
target.head()

In [None]:
# Let us check whether the classes are balanced.
target.value_counts()

So it is a well-balanced dataset.

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


#### Label Encoder
The second task is to transform all categorical data (artist names) into numeric data. Why do we have to do that? Well, the ML algorithm only accepts numerical data, which is why we use the class `LabelEncoder` to encode each artist name as a specific number. The encoding process is shown below.
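
As a small illustration of what `LabelEncoder` does, the sketch below encodes a few placeholder artist names (the names are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

artists = ["artist_b", "artist_a", "artist_c", "artist_a"]  # hypothetical names

encoder = LabelEncoder()
encoded = encoder.fit_transform(artists)  # classes are sorted, then numbered from 0

print(list(encoder.classes_))
print(list(encoded))

# The mapping is reversible, which is handy when reading predictions later.
decoded = encoder.inverse_transform(encoded)
```

Note that the integer codes carry no real ordering; they are just identifiers, so models that treat them as magnitudes may need a one-hot encoding instead.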

In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


In [None]:
# Adapted for Birdcall Dataset
# Previous dataset removed


# Urban Sound Classification


### Introduction
When we get started with data science, we start with simple projects like the Loan Prediction problem or Big Mart Sales Prediction. These problems have structured data arranged neatly in a tabular format, i.e. we are spoon-fed the hardest part of the data science pipeline. Datasets in real life are much more complex and often come in unstructured formats such as audio or images; we have to collect the data from various sources and arrange it in a format that is ready for processing.


I have chosen unstructured data for this bird call classification problem because it represents a huge under-exploited opportunity. It is closer to how we communicate and interact as humans, and it contains a lot of useful and powerful information. For example, when a person speaks, you get not only what they say but also their emotions from the voice. The person's body language can show you many more features as well, because actions speak louder than words! So in short, unstructured data is complex, but processing it can reap easy rewards.


#### So what does audio data really mean?

Let's understand this with some theory before we actually jump into the real problem and its solution.

Directly or indirectly, you are always in contact with audio. Your brain is continuously processing and understanding audio data and giving you information about the environment. A simple example can be your conversations with people which you do daily. This speech is discerned by the other person to carry on the discussions. Even when you think you are in a quiet environment, you tend to catch much more subtle sounds, like the rustling of leaves or the splatter of rain. This is the extent of your connection with audio.

To capture the audio around us, there are devices that record it in a computer-readable format. Examples of these formats are:

- wav (Waveform Audio File) format
- mp3 (MPEG-1 Audio Layer 3) format
- WMA (Windows Media Audio) format

Audio typically looks like a wave, where the amplitude of the audio changes with respect to time. This can be represented pictorially as follows.

![](sound.png)


Real-time applications of audio processing include, but are not limited to:

- Indexing music collections according to their audio features.
- Recommending music for radio channels
- Similarity search for audio files (aka Shazam)
- Speech processing and synthesis – generating artificial voice for conversational agents 

#### Data Handling in audio domain

Audio data has a couple of preprocessing steps which have to be followed, namely:

- First, load the data into a machine-understandable format. 
    For this, we simply take values at fixed time steps. For example, in a 2-second audio file we might extract a value every half second. This is called ***sampling of audio data***, and the rate at which it is sampled is called the ***sampling rate***.
    The disadvantage of this approach is that sampling needs a large number of data points to represent the whole signal, and the sampling rate has to be high enough to capture it faithfully. To offset this, we can look at a second approach.

- The second approach is to convert the audio into a different domain of data representation, namely the ***frequency domain***, which requires less computational space.
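
The sampling idea from the first step can be sketched with NumPy. The helper `sample_sine` and the 440 Hz tone are assumptions made for this illustration; 22050 Hz is used because it is the rate `librosa.load` resamples to by default.

```python
import numpy as np

def sample_sine(freq_hz, duration_s, sampling_rate_hz):
    """Sample a pure sine tone at evenly spaced time steps."""
    n_samples = int(duration_s * sampling_rate_hz)
    t = np.arange(n_samples) / sampling_rate_hz
    return t, np.sin(2 * np.pi * freq_hz * t)

# The 2-second example from the text: one value every half second is a 2 Hz rate.
t_coarse, x_coarse = sample_sine(freq_hz=440, duration_s=2, sampling_rate_hz=2)

# The same tone at librosa's default rate needs far more data points.
t_fine, x_fine = sample_sine(freq_hz=440, duration_s=2, sampling_rate_hz=22050)

print(len(x_coarse), len(x_fine))
print(np.round(x_coarse, 6))
```

At 2 Hz every sample of the 440 Hz tone lands on a zero crossing, so the tone is lost entirely, while the faithful version costs 44100 values for just two seconds. This is the trade-off that motivates the frequency-domain representation.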

Now let us look at this in more detail.

![](time_freq.png)

Here, we separate one audio signal into 3 different pure signals, which can now be represented as three unique values in the frequency domain.
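
The same decomposition can be tried numerically. The sketch below mixes three pure tones and recovers their frequencies with a Fourier transform. The tone frequencies and the 1000 Hz sampling rate are arbitrary choices for the demo, not values from the figure.

```python
import numpy as np

sampling_rate = 1000                            # samples per second (assumed for the demo)
t = np.arange(sampling_rate) / sampling_rate    # exactly one second of time steps

# Three pure tones with hypothetical frequencies, mixed into a single signal.
tone_freqs = [50, 120, 300]
signal = sum(np.sin(2 * np.pi * f * t) for f in tone_freqs)

# Real-input FFT: magnitude per frequency bin, and the frequency of each bin.
magnitudes = np.abs(np.fft.rfft(signal))
bin_freqs = np.fft.rfftfreq(len(signal), d=1 / sampling_rate)

# The three largest peaks recover the three component frequencies.
top_bins = np.argsort(magnitudes)[-3:]
recovered = sorted(float(f) for f in bin_freqs[top_bins])
print(recovered)
```

The time-domain signal needs 1000 numbers here, whereas its frequency-domain description boils down to three frequencies and their amplitudes.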

There are a few more ways in which audio data can be represented, for example using MFCs (Mel-frequency cepstrums; we will cover these in a later article). These are nothing but different ways to represent the data.

The next step is to extract features from these audio representations, so that our algorithm can work on these features and perform the task it is designed for. Here’s a visual representation of the categories of audio features that can be extracted.

![](audio-features.png)


After these features are extracted, they are sent to the machine learning model for further analysis.

Now, enough theory. Let's jump into solving the Urban Sound Classification problem.

### Objective

The automatic classification of environmental sound is a growing research field with multiple applications to large-scale, content-based multimedia indexing and retrieval. In particular, the sonic analysis of urban environments is the subject of increased interest, partly enabled by multimedia sensor networks, as well as by large quantities of online multimedia content depicting urban scenes.

However, while there is a large body of research in related areas such as speech, music and bioacoustics, work on the analysis of urban acoustic environments is relatively scarce. Furthermore, when it exists, it mostly focuses on the classification of auditory scene type (e.g. street, park), as opposed to the identification of sound sources in those scenes (e.g. car horn, engine idling, bird tweet).



There are primarily two major challenges in urban sound research, namely:

- Lack of labeled audio data. Previous work has focused on audio from carefully produced movies or television tracks, from specific environments such as elevators or office spaces, and on commercial or proprietary datasets. The large effort involved in manually annotating real-world data means datasets based on field recordings tend to be relatively small (e.g. the event detection dataset of the IEEE AASP Challenge consists of 24 recordings per each of 17 classes).

- Lack of a common vocabulary when working with urban sounds. This means the classification of sounds into semantic groups may vary from study to study, making it hard to compare results.

So the objective of this notebook is to address the two challenges mentioned above.


### Data

The dataset is called UrbanSound and contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes, namely:

- Air Conditioner
- Car Horn
- Children Playing
- Dog bark
- Drilling
- Engine Idling
- Gun Shot
- Jackhammer
- Siren
- Street Music

The attributes of data are as follows:

ID – Unique ID of sound excerpt

Class – type of sound

The evaluation metric for this problem is "Accuracy Score"

#### Source

- Source of the dataset : https://drive.google.com/drive/folders/0By0bAi7hOBAFUHVXd1JCN3MwTEU
- Source of research document : https://serv.cusp.nyu.edu/projects/urbansounddataset/salamon_urbansound_acmmm14.pdf


Now let us take a quick look at a sample sound excerpt from the dataset.

In [None]:
import IPython.display as ipd
ipd.Audio('../input/ultrasound-dataset/train/Train/2022.wav')

To load the audio files into the Jupyter notebook as a NumPy array, I have used the `librosa` library in Python, which can be installed with the pip command as follows:

 ***pip install librosa***

In [None]:
!pip install librosa

In [None]:
import os
import pandas as pd
import librosa
import librosa.display
import glob
import random
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
import numpy as np
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from sklearn import metrics 

Now let us load a sample audio file using librosa

In [None]:
data,sampling_rate = librosa.load('../input/ultrasound-dataset/train/Train/2010.wav')
plt.figure(figsize=(12,4))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(data, sr=sampling_rate)

Now let us visually inspect data and see if we can find patterns in the data

In [None]:
train = pd.read_csv('../input/ultrasound-dataset/train/train.csv')
i = random.choice(train.index)  # pick a random row of the training table

audio_name = train.ID[i]
path = os.path.join('../input/ultrasound-dataset/train/', 'Train', str(audio_name) + '.wav')

print('Class: ', train.Class[i])
x, sr = librosa.load(path)

plt.figure(figsize=(12, 4))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(x, sr=sr)

In this run the randomly selected clip belongs to the air conditioner class, and we can see its pattern. Let us run the same code again to randomly select another clip and observe its pattern.

In [None]:
i = random.choice(train.index)  # pick another random row
audio_name = train.ID[i]
path = os.path.join('../input/ultrasound-dataset/train/', 'Train', str(audio_name) + '.wav')
print('Class: ', train.Class[i])
x, sr = librosa.load(path)
plt.figure(figsize=(12, 4))
# waveshow replaces waveplot, which was removed in librosa 0.10
librosa.display.waveshow(x, sr=sr)

Let us see the class distributions for this problem

In [None]:
print(train.Class.value_counts(normalize=True)) #distribution of data

It appears that jackhammer has a higher count than any other class.

Now let us see how we can leverage the concepts we learned above to solve the problem. We will follow these steps to solve the problem.

- Step 1: Load audio files & Extract features
- Step 2: Convert the data to pass it in our deep learning model
- Step 3: Run a deep learning model and get results

#### Step 1: Load audio files & Extract features

Let us create a function to load audio files and extract features

In [None]:
def parser(row):
    file_name = os.path.join(os.path.abspath('../input/ultrasound-dataset/train/'), 'Train', str(row.ID) + '.wav')
    try:
        # kaiser_fast is a faster, lower-quality resampling mode
        X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        # extract 40 MFCCs per frame and average them over time
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    except Exception as e:
        print('Error encountered while parsing the file:', file_name, e)
        # return the same shape as the success path so apply() still builds two columns
        return pd.Series([None, None], index=['feature', 'label'])

    return pd.Series([mfccs, row.Class], index=['feature', 'label'])

In [None]:
temp = train.apply(parser,axis =1)
temp.columns = ['feature', 'label']
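
To see what the averaging step in `parser` does to the shape of the data, the sketch below uses a random array as a stand-in for the output of `librosa.feature.mfcc`, which has shape `(n_mfcc, n_frames)`. The frame count of 173 is an arbitrary assumption.

```python
import numpy as np

n_mfcc, n_frames = 40, 173                        # frame count is arbitrary here
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(n_mfcc, n_frames))   # stand-in for librosa.feature.mfcc output

# Same reduction as in parser: transpose to (frames, coefficients), average over frames.
clip_vector = np.mean(fake_mfcc.T, axis=0)

print(fake_mfcc.shape, "->", clip_vector.shape)
```

Whatever the clip length, every file ends up as one fixed-size vector of 40 numbers, which is what lets the clips be stacked into a single input matrix. The price is that all information about how the sound evolves over time is averaged away.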

#### Step 2: Convert the data to pass it in our deep learning model


In [None]:
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())

label_encoder = LabelEncoder()
print(temp.label.dtype)

In [None]:
y = np_utils.to_categorical(label_encoder.fit_transform(y))   
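
The cell above first turns class names into integers and then into one-hot rows. The sketch below reproduces the same idea with plain NumPy on a few toy labels, as an illustration of the resulting layout rather than a replacement for the Keras call.

```python
import numpy as np

labels = np.array(["siren", "dog_bark", "siren", "drilling"])  # toy labels

# Integer-encode: np.unique sorts the class names and returns each label's index.
classes, codes = np.unique(labels, return_inverse=True)

# One-hot encode: pick row `code` of the identity matrix for every label.
one_hot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the column of its class, which matches the shape a softmax output layer with one unit per class expects as a target.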
