<a href="https://www.kaggle.com/code/abhijitdarekar001/eda-birdlef-2024?scriptVersionId=170302639" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 📂 Dataset Description

📁 train_audio - The training data consists of short recordings of individual bird calls. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format.
<br>📁 test_soundscapes - The test_soundscapes directory will be populated with approximately 1,100 audio recordings to be used for scoring. They are 4 minutes long and in ogg audio format.
<br>📁 unlabeled_soundscapes - Unlabeled audio data from the same recording locations as the test soundscapes.
<br>📃 eBird_Taxonomy_v2021.csv - Taxonomy metadata required for training.
<br>📃 train_metadata.csv - Metadata required for model training.
<br>📃 sample_submission.csv - The format of a submission file.
- `row_id` : A slug of `[soundscape_id]_[end_time]` for the prediction.
- `[bird_id]` : There are 182 bird ID columns. You will need to predict the probability of the presence of each bird for each row.
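
As a rough illustration of the `row_id` format described above, the sketch below builds the slugs for a single soundscape. The soundscape ID `soundscape_123` and the 5-second prediction window are assumptions made for illustration; neither is stated in the description above. Only the 4-minute (240 s) duration comes from the dataset description.

```python
def make_row_ids(soundscape_id, duration_s=240, window_s=5):
    """Build row_id slugs of the form [soundscape_id]_[end_time].

    duration_s=240 matches the 4-minute test recordings.
    window_s=5 is an assumed prediction window, not taken from this notebook.
    """
    end_times = range(window_s, duration_s + 1, window_s)
    return [f"{soundscape_id}_{end}" for end in end_times]


# Hypothetical soundscape ID, used only to show the slug layout.
row_ids = make_row_ids("soundscape_123")
print(row_ids[:3], len(row_ids))
```

Each of these rows would then carry one probability per `[bird_id]` column.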


# 📚 Loading Libraries 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
import plotly.express as px

from IPython.display import display, Audio , display_html

# 📊 Loading Data


In [2]:
eBird  = pd.read_csv('/kaggle/input/birdclef-2024/eBird_Taxonomy_v2021.csv')
trainign_data = pd.read_csv("/kaggle/input/birdclef-2024/train_metadata.csv")

In [3]:
display(trainign_data.info())
display(trainign_data.head(5))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24459 entries, 0 to 24458
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   primary_label     24459 non-null  object 
 1   secondary_labels  24459 non-null  object 
 2   type              24459 non-null  object 
 3   latitude          24081 non-null  float64
 4   longitude         24081 non-null  float64
 5   scientific_name   24459 non-null  object 
 6   common_name       24459 non-null  object 
 7   author            24459 non-null  object 
 8   license           24459 non-null  object 
 9   rating            24459 non-null  float64
 10  url               24459 non-null  object 
 11  filename          24459 non-null  object 
dtypes: float64(3), object(9)
memory usage: 2.2+ MB


None

| | primary_label | secondary_labels | type | latitude | longitude | scientific_name | common_name | author | license | rating | url | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | asbfly | [] | ['call'] | 39.2297 | 118.1987 | Muscicapa dauurica | Asian Brown Flycatcher | Matt Slaymaker | Creative Commons Attribution-NonCommercial-Sha... | 5.0 | https://www.xeno-canto.org/134896 | asbfly/XC134896.ogg |
| 1 | asbfly | [] | ['song'] | 51.403 | 104.6401 | Muscicapa dauurica | Asian Brown Flycatcher | Magnus Hellström | Creative Commons Attribution-NonCommercial-Sha... | 2.5 | https://www.xeno-canto.org/164848 | asbfly/XC164848.ogg |
| 2 | asbfly | [] | ['song'] | 36.3319 | 127.3555 | Muscicapa dauurica | Asian Brown Flycatcher | Stuart Fisher | Creative Commons Attribution-NonCommercial-Sha... | 2.5 | https://www.xeno-canto.org/175797 | asbfly/XC175797.ogg |
| 3 | asbfly | [] | ['call'] | 21.1697 | 70.6005 | Muscicapa dauurica | Asian Brown Flycatcher | vir joshi | Creative Commons Attribution-NonCommercial-Sha... | 4.0 | https://www.xeno-canto.org/207738 | asbfly/XC207738.ogg |
| 4 | asbfly | [] | ['call'] | 15.5442 | 73.7733 | Muscicapa dauurica | Asian Brown Flycatcher | Albert Lastukhin & Sergei Karpeev | Creative Commons Attribution-NonCommercial-Sha... | 4.0 | https://www.xeno-canto.org/209218 | asbfly/XC209218.ogg |


Each row gives the scientific name, the common name, and an abbreviated code (`primary_label`) for the bird. It also records the author (the person who made the recording) and the recording location (`latitude`/`longitude`).

Each recording is tagged in the `type` column with the kind of sound it contains, for example `call`, `song`, `male`, `adult` or `flight call`.
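
Because the metadata is read from CSV, each `type` value arrives as a string such as `"['call']"` and has to be parsed before the tags can be counted. A minimal sketch, run on a small hand-made sample instead of the real `train_metadata.csv` (the helper name `count_sound_types` is hypothetical):

```python
import ast
from collections import Counter

# Hand-made sample mimicking the stringified lists in the `type` column.
sample_types = ["['call']", "['song']", "['song']", "['call', 'flight call']"]


def count_sound_types(raw_values):
    """Parse stringified lists and count how often each tag appears."""
    counts = Counter()
    for raw in raw_values:
        tags = ast.literal_eval(raw)  # "['call']" -> ['call']
        counts.update(tag.strip().lower() for tag in tags)
    return counts


type_counts = count_sound_types(sample_types)
print(type_counts.most_common())
```

On the real data the same helper could be called as `count_sound_types(trainign_data['type'])`, assuming every value is a well-formed list literal.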

#### Checking for null values in the dataset

In [4]:
trainign_data.isna().sum()

primary_label         0
secondary_labels      0
type                  0
latitude            378
longitude           378
scientific_name       0
common_name           0
author                0
license               0
rating                0
url                   0
filename              0
dtype: int64

The features `latitude` and `longitude` each have `378` null values.<br>
<b>The feature `secondary_labels` also has missing values, stored as `[]`; we need to convert them to `na`.</b>

<b>Every recording has a call `type` and a `filename`.</b><br>
    
We will deal with them later.
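
One possible way to handle the `[]` placeholders, sketched on a hand-made sample. It assumes the column holds the literal string `"[]"` (as it would after `pd.read_csv`); the label code `ashpri1` is a hypothetical value used only for illustration.

```python
import pandas as pd

# Hand-made sample; the real column is read from CSV as strings.
sample = pd.DataFrame({
    "primary_label": ["asbfly", "asbfly", "asbfly"],
    "secondary_labels": ["[]", "['ashpri1']", "[]"],  # 'ashpri1' is hypothetical
})

# Replace the literal string "[]" with a missing value (NaN).
labels = sample["secondary_labels"]
sample["secondary_labels"] = labels.mask(labels == "[]")

n_missing = int(sample["secondary_labels"].isna().sum())
print(n_missing)
```

After this step `isna().sum()` reports the empty lists as missing, in the same way it already does for `latitude` and `longitude`.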
    

### Exploring eBird Data

In [5]:
display_html("Information \n",raw=True)
display(eBird.info())
display_html("First Few Lines",raw=True)
display(eBird.head())
display_html("Displaying Null Values",raw =True)
display(eBird.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16753 entries, 0 to 16752
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   TAXON_ORDER       16753 non-null  int64 
 1   CATEGORY          16753 non-null  object
 2   SPECIES_CODE      16753 non-null  object
 3   PRIMARY_COM_NAME  16753 non-null  object
 4   SCI_NAME          16753 non-null  object
 5   ORDER1            16751 non-null  object
 6   FAMILY            16740 non-null  object
 7   SPECIES_GROUP     216 non-null    object
 8   REPORT_AS         3876 non-null   object
dtypes: int64(1), object(8)
memory usage: 1.2+ MB


None

| | TAXON_ORDER | CATEGORY | SPECIES_CODE | PRIMARY_COM_NAME | SCI_NAME | ORDER1 | FAMILY | SPECIES_GROUP | REPORT_AS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | species | ostric2 | Common Ostrich | Struthio camelus | Struthioniformes | Struthionidae (Ostriches) | Ostriches | |
| 1 | 6 | species | ostric3 | Somali Ostrich | Struthio molybdophanes | Struthioniformes | Struthionidae (Ostriches) | | |
| 2 | 7 | slash | y00934 | Common/Somali Ostrich | Struthio camelus/molybdophanes | Struthioniformes | Struthionidae (Ostriches) | | |
| 3 | 8 | species | grerhe1 | Greater Rhea | Rhea americana | Rheiformes | Rheidae (Rheas) | Rheas | |
| 4 | 14 | species | lesrhe2 | Lesser Rhea | Rhea pennata | Rheiformes | Rheidae (Rheas) | | |


TAXON_ORDER             0
CATEGORY                0
SPECIES_CODE            0
PRIMARY_COM_NAME        0
SCI_NAME                0
ORDER1                  2
FAMILY                 13
SPECIES_GROUP       16537
REPORT_AS           12877
dtype: int64

There are a total of `16753` entries in the data. <br>
The features `SPECIES_GROUP` and `REPORT_AS` are mostly null (`16537` and `12877` missing values respectively). The features `FAMILY` and `ORDER1` have only a few nulls (`13` and `2`).

We will drop the features `SPECIES_GROUP` and `REPORT_AS`.

In [6]:
eBird.drop(['REPORT_AS','SPECIES_GROUP'],axis=1,inplace=True)
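
The same decision can also be made from the null share of each column instead of naming the columns by hand. This is a sketch on a toy frame, not the notebook's method; the 50% cutoff and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the eBird data.
toy = pd.DataFrame({
    "SPECIES_CODE": ["toy0001", "toy0002", "toy0003", "toy0004"],
    "FAMILY": ["Struthionidae", "Struthionidae", "Rheidae", None],
    "SPECIES_GROUP": ["Ostriches", None, None, None],
    "REPORT_AS": [None, None, None, None],
})

null_share = toy.isna().mean()  # fraction of nulls per column
threshold = 0.5                 # assumed cutoff, tune as needed
to_drop = null_share[null_share > threshold].index.tolist()

cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

A column such as `FAMILY`, with only a few nulls, stays in the frame under this rule.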

# 📈 Visualization

In [7]:
px.bar(trainign_data.common_name.value_counts().reset_index(),y ='common_name',x='count',title='Count of Different Birds in Dataset.')

In [9]:
px.bar(eBird.FAMILY.value_counts().reset_index(),y='FAMILY',x='count',title = "Count of Bird's Family")

To plot the features `longitude` and `latitude`, we can drop the rows where they are `nan`.

In [110]:
# Keep only the rows that have both coordinates before plotting.
fig = px.scatter_geo(trainign_data.dropna(subset=['latitude', 'longitude']),
                    lat='latitude',
                    lon='longitude',
                    title="Audio Recordings Gathered Locations",
                    color="common_name", projection='hammer')

fig.show()





# Thank you