# Scraping data from the web

In this exercise we will work with the [xeno-canto](https://xeno-canto.org/) archive of bird recordings. 

### Requirements

- Download and install [Insomnia](https://insomnia.rest/download). It's a tool that helps to quickly test an [API](https://en.wikipedia.org/wiki/API#Web_APIs) on the web.

A basic search URL looks like:

````
https://www.xeno-canto.org/api/2/recordings?
 
````

As described in the [API documentation](https://xeno-canto.org/explore/api), you can pass several parameters to filter your search. These parameters are added to the end of the URL, as shown in the screenshot from Insomnia:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_query.jpg "Title")

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow
 
````

The response to this query is in JavaScript Object Notation (JSON) format. It reports the number of recordings ```numRecordings``` (16488 in total) and the number of species ```numSpecies``` (125). Since there are many results, the API serves only the first page, i.e. the first 500 results of this query. To access results 501-1000, add a second parameter ```page=2```. The URL would look like:

````
https://www.xeno-canto.org/api/2/recordings?query=sparrow&page=2
 
````

The entry recordings is a list that contains all recordings related to the search. Within Insomnia you can now look at the structure of the data to get a better understanding of its architecture. You can also filter the response with the filter function on the bottom bar:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_filter.jpg "Title")

Only show countries:

````
$.recordings[*].cnt
````

Or show just the remarks for the recordings:

````
$.recordings[*].rmk
````
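The same selection can be reproduced in plain Python once the JSON has been parsed into a dictionary. The sketch below uses a tiny hand-written `sample` response (an assumption for illustration, not real API output) and list comprehensions that mirror the two filters above.

```python
# Hypothetical, hand-written stand-in for a parsed API response.
sample = {
    "recordings": [
        {"cnt": "Switzerland", "rmk": "call from a nest"},
        {"cnt": "Brazil", "rmk": ""},
    ]
}

# Equivalent of $.recordings[*].cnt: the "cnt" field of every recording.
countries = [rec["cnt"] for rec in sample["recordings"]]

# Equivalent of $.recordings[*].rmk: the "rmk" field of every recording.
remarks = [rec["rmk"] for rec in sample["recordings"]]

print(countries)
print(remarks)
```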

Further query parameters (such as time, geolocation, search terms) are possible. For example, look for sparrows in Switzerland that carry the quality rating A:

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow cnt:Switzerland q:A
 
````
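Insomnia and browsers encode the spaces in such a query for you. As a minimal sketch of what happens behind the scenes, Python's standard `urllib.parse.urlencode` can build the query string. The variable names `base_url`, `params` and `full_url` are illustrative only.

```python
from urllib.parse import urlencode

base_url = "https://www.xeno-canto.org/api/2/recordings"

# Search tags are separated by spaces inside the single "query" parameter;
# "page" is a separate parameter.
params = {"query": "sparrow cnt:Switzerland q:A", "page": 2}

# urlencode escapes characters that are not safe in a URL
# (a space becomes "+", a colon becomes "%3A").
full_url = base_url + "?" + urlencode(params)
print(full_url)
```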

Using the filter again we can obtain the links to the spectrogram files of the recordings:

````
$.recordings[*].sono.full
````

You can copy the link and open it in a browser.

In the next section we will automate our query requests and automatically fetch the files.


# Scraping data with python 

### Scrape from xeno-canto

We will use a Python library called [requests](https://requests.readthedocs.io/en/latest/) with the slogan **HTTP for Humans** to open a URL and to "imitate" a browser. First, we make sure our API URL is correct and gives a response:

In [None]:
#params= "cnt:'brazil'"
params = "cnt:switzerland loc:basel"
url="https://www.xeno-canto.org/api/2/recordings?query="+params
print(url)


https://www.xeno-canto.org/api/2/recordings?query=cnt:switzerland loc:basel


In [None]:
import requests

Next step: __requests__ has a function ```get()``` that takes the URL to call and, optionally, extra information such as request headers. The API answers in [JSON](https://en.wikipedia.org/wiki/JSON) format. The cell below sends a Content-Type header along with the request; strictly speaking, Content-Type describes the body of a request, while the Accept header is the one that states which response format the client wants. The parsed response is stored in a variable called ```resp```.

In [None]:
r = requests.get(url, headers={"Content-Type":"json"})
resp = r.json()
print(resp)

{'numRecordings': '150', 'numSpecies': '50', 'page': 1, 'numPages': 1, 'recordings': [{'id': '383063', 'gen': 'Tachymarptis', 'sp': 'melba', 'ssp': '', 'en': 'Alpine Swift', 'rec': 'Peter Ertl', 'cnt': 'Switzerland', 'loc': 'Basel, Basel-Stadt, Basel-Stadt', 'lat': '47.5734', 'lng': '7.5767', 'alt': '260', 'type': 'call', 'url': '//xeno-canto.org/383063', 'file': 'https://xeno-canto.org/383063/download', 'file-name': 'XC383063-alpensegler.mp3', 'sono': {'small': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-small.png', 'med': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-med.png', 'large': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-large.png', 'full': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-full.png'}, 'lic': '//creativecommons.org/licenses/by-nc-sa/4.0/', 'q': 'A', 'length': '0:13', 'time': '09:30', 'date': '2017-08-14', 'uploaded': '2017-08-14', 'also': [''], 'rmk': 'call from a nest', 'bird-seen': 'unknown', 'playback-used
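The response also reports `page` and `numPages`, so a query that spans more than one page can be collected in a loop. The following is a sketch only: `fetch_page` and `fetch_all` are hypothetical helpers, and the `page` parameter is the one introduced in the first part of this exercise.

```python
import requests

API_URL = "https://www.xeno-canto.org/api/2/recordings"


def fetch_page(query, page):
    """Hypothetical helper: request one page of results and return the parsed JSON."""
    r = requests.get(API_URL, params={"query": query, "page": page})
    r.raise_for_status()  # stop early on HTTP errors such as 404 or 500
    return r.json()


def fetch_all(query, fetch=fetch_page):
    """Collect the recordings of every page into one list."""
    first = fetch(query, 1)
    recordings = list(first["recordings"])
    # numPages tells us how many further pages to request
    for page in range(2, int(first["numPages"]) + 1):
        recordings.extend(fetch(query, page)["recordings"])
    return recordings
```

Calling `fetch_all("cnt:switzerland loc:basel")` would then return a single list of recordings. The `fetch` argument exists so the loop can be tried out with canned data instead of live requests.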

### Download remote files 
For reference, read the introduction to [working with files in Google Colab](https://neptune.ai/blog/google-colab-dealing-with-files)

Specific information on working with external data in Colab: [Local Files, Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb) 

In [None]:
fileurl = resp["recordings"][0]["sono"]["small"]
download = requests.get("https:" + fileurl)
# the with-statement closes the file automatically after writing
with open("sample_data/sample_image.png", "wb") as file:
  file.write(download.content)
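The `sono` links in the response start with `//` and carry no scheme, which is why the cell above prepends `https:`. A small helper can make that step explicit. `absolute_url` is a hypothetical name introduced for this sketch.

```python
def absolute_url(link, scheme="https"):
    """Turn a protocol-relative link ('//host/path') into a full URL."""
    if link.startswith("//"):
        return scheme + ":" + link
    return link  # already absolute, leave untouched


print(absolute_url("//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-small.png"))
print(absolute_url("https://xeno-canto.org/383063/download"))
```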

## Pandas

Pandas is another Python library heavily used in data analysis. Its developers describe it as a fast, powerful, flexible and easy to use open source data analysis and manipulation tool. 

Pandas can import many different file formats (CSV, JSON, etc.), clean up data and re-export it to other formats. 

As usual we will first import the library with the import statement:

In [None]:
import pandas as pd

We will use pandas to manipulate the data we retrieve by scraping the xeno-canto archive. 
But first, we will briefly look into how pandas works.

### Pandas Series and DataFrames

The pandas library comes with two main data structures: Series and DataFrames.
A Series is a one-dimensional structure of index-value pairs, similar to the key-value pairs of a dictionary. 


In [None]:
pd.Series?
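To make the index-value idea concrete, here is a minimal sketch. The bird names come from this exercise, while the numbers in `counts` are made up purely for illustration.

```python
import pandas as pd

# A Series built from a plain list gets a default integer index 0, 1, 2, ...
s = pd.Series(["Tawny Owl", "Grey Wagtail"])
print(s.index.tolist())  # the index labels
print(s.tolist())        # the values

# A custom index turns the Series into a label -> value mapping.
counts = pd.Series([3, 5], index=["Tawny Owl", "Grey Wagtail"])  # made-up numbers
print(counts["Tawny Owl"])
```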

A DataFrame is a two-dimensional data structure. You can think of a DataFrame as an Excel spreadsheet, with rows and columns. It can be made by combining two or more Series. 

In [None]:
pd.DataFrame?

We can make a pandas Series from a list of the most common birds in the Basel area, already sorted by popularity: 

In [153]:
common_birds = ['Common Redstart', 'Eurasian Blackcap', 
                'Black Redstart', 'Common Whitethroat', 'Lesser Whitethroat', 
                'Great Spotted Woodpecker', 'European Stonechat', 'Grey Wagtail', 
                'Yellowhammer', 'Tawny Owl', 'Red-backed Shrike', 
                'Northern Raven', 'Eurasian Tree Sparrow', "Western Bonelli's Warbler", 
                'Short-toed Treecreeper', 'Common Nightingale']
s1 = pd.Series(common_birds) 
s1

0               Common Redstart
1             Eurasian Blackcap
2                Black Redstart
3            Common Whitethroat
4            Lesser Whitethroat
5      Great Spotted Woodpecker
6            European Stonechat
7                  Grey Wagtail
8                  Yellowhammer
9                     Tawny Owl
10            Red-backed Shrike
11               Northern Raven
12        Eurasian Tree Sparrow
13    Western Bonelli's Warbler
14       Short-toed Treecreeper
15           Common Nightingale
dtype: object

We can imagine that an ornithologist friend gives us the [scientific name](https://birdsoftheworld.org/bow/key-to-scientific-names/sciname-parts) for each of these birds:

In [154]:
bird_species = ['Phoenicurus phoenicurus', 'Sylvia atricapilla',
       'Phoenicurus ochruros', 'Sylvia communis', 'Sylvia curruca',
       'Dendrocopos major', 'Saxicola rubicola', 'Motacilla cinerea',
       'Emberiza citrinella', 'Strix aluco', 'Lanius collurio',
       'Corvus corax', 'Passer montanus', 'Phylloscopus bonelli',
       'Certhia brachydactyla', 'Luscinia megarhynchos',]
print(bird_species)	

['Phoenicurus phoenicurus', 'Sylvia atricapilla', 'Phoenicurus ochruros', 'Sylvia communis', 'Sylvia curruca', 'Dendrocopos major', 'Saxicola rubicola', 'Motacilla cinerea', 'Emberiza citrinella', 'Strix aluco', 'Lanius collurio', 'Corvus corax', 'Passer montanus', 'Phylloscopus bonelli', 'Certhia brachydactyla', 'Luscinia megarhynchos']


We could combine these two lists in a Python dictionary using the built-in function zip(), which pairs up the elements of the two lists:

In [155]:
birds_dict2 = dict(zip(common_birds, bird_species))
print("Common and latin names for birds in Basel : " +  str(birds_dict2))

Common and latin names for birds in Basel : {'Common Redstart': 'Phoenicurus phoenicurus', 'Eurasian Blackcap': 'Sylvia atricapilla', 'Black Redstart': 'Phoenicurus ochruros', 'Common Whitethroat': 'Sylvia communis', 'Lesser Whitethroat': 'Sylvia curruca', 'Great Spotted Woodpecker': 'Dendrocopos major', 'European Stonechat': 'Saxicola rubicola', 'Grey Wagtail': 'Motacilla cinerea', 'Yellowhammer': 'Emberiza citrinella', 'Tawny Owl': 'Strix aluco', 'Red-backed Shrike': 'Lanius collurio', 'Northern Raven': 'Corvus corax', 'Eurasian Tree Sparrow': 'Passer montanus', "Western Bonelli's Warbler": 'Phylloscopus bonelli', 'Short-toed Treecreeper': 'Certhia brachydactyla', 'Common Nightingale': 'Luscinia megarhynchos'}
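What `zip()` does is easier to see on a shorter list. The sketch below reuses three of the birds from above; the variable names are illustrative.

```python
names = ["Tawny Owl", "Grey Wagtail", "Yellowhammer"]
latin = ["Strix aluco", "Motacilla cinerea", "Emberiza citrinella"]

# zip() yields one (name, latin) tuple per position in the two lists.
pairs = list(zip(names, latin))
print(pairs[0])

# dict() turns those pairs into a lookup table.
lookup = dict(pairs)
print(lookup["Grey Wagtail"])

# zip() stops at the shorter input, so a length mismatch silently drops items.
short = dict(zip(names, latin[:2]))
print(len(short))
```

The last lines are a reminder to check that both lists have the same length before combining them.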


Or, we can create a pandas Series and use it like a dictionary:

In [156]:
s2 = pd.Series(bird_species, index=common_birds)
s2

Common Redstart              Phoenicurus phoenicurus
Eurasian Blackcap                 Sylvia atricapilla
Black Redstart                  Phoenicurus ochruros
Common Whitethroat                   Sylvia communis
Lesser Whitethroat                    Sylvia curruca
Great Spotted Woodpecker           Dendrocopos major
European Stonechat                 Saxicola rubicola
Grey Wagtail                       Motacilla cinerea
Yellowhammer                     Emberiza citrinella
Tawny Owl                                Strix aluco
Red-backed Shrike                    Lanius collurio
Northern Raven                          Corvus corax
Eurasian Tree Sparrow                Passer montanus
Western Bonelli's Warbler       Phylloscopus bonelli
Short-toed Treecreeper         Certhia brachydactyla
Common Nightingale             Luscinia megarhynchos
dtype: object

In [None]:
# what is the latin name for Common Nightingale?
s2.loc['Common Nightingale']

'Luscinia megarhynchos'

In [None]:
# what is the 3rd most common bird in the Basel area (according to our first list)?
# .iloc selects by position, counting from 0
s2.iloc[[2]]

Black Redstart    Phoenicurus ochruros
dtype: object

This is fine, but it becomes difficult to manage once we want to include additional data on each bird, such as its colour, a description, a picture or the sound it makes. 

### Querying and manipulating a DataFrame

Let's create a DataFrame from our data on birds names, and add more information on each bird:

In [None]:
s1 = pd.Series(common_birds) 
s2 = pd.Series(bird_species)
birds_df = pd.concat([s1, s2], axis=1) 
birds_df

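Unnamed: 0,0,1
0,Common Redstart,Phoenicurus phoenicurus
1,Eurasian Blackcap,Sylvia atricapilla
2,Black Redstart,Phoenicurus ochruros
3,Common Whitethroat,Sylvia communis
4,Lesser Whitethroat,Sylvia curruca
5,Great Spotted Woodpecker,Dendrocopos major
6,European Stonechat,Saxicola rubicola
7,Grey Wagtail,Motacilla cinerea
8,Yellowhammer,Emberiza citrinella
9,Tawny Owl,Strix aluco
10,Red-backed Shrike,Lanius collurio
11,Northern Raven,Corvus corax
12,Eurasian Tree Sparrow,Passer montanus
13,Western Bonelli's Warbler,Phylloscopus bonelli
14,Short-toed Treecreeper,Certhia brachydactyla
15,Common Nightingale,Luscinia megarhynchos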

We can give more meaningful names to the columns, such as 'common name' and 'bird species'. 

In [None]:
birds_df=birds_df.rename(columns={0: "common_name", 1: "bird_species"})
birds_df

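Unnamed: 0,common_name,bird_species
0,Common Redstart,Phoenicurus phoenicurus
1,Eurasian Blackcap,Sylvia atricapilla
2,Black Redstart,Phoenicurus ochruros
3,Common Whitethroat,Sylvia communis
4,Lesser Whitethroat,Sylvia curruca
5,Great Spotted Woodpecker,Dendrocopos major
6,European Stonechat,Saxicola rubicola
7,Grey Wagtail,Motacilla cinerea
8,Yellowhammer,Emberiza citrinella
9,Tawny Owl,Strix aluco
10,Red-backed Shrike,Lanius collurio
11,Northern Raven,Corvus corax
12,Eurasian Tree Sparrow,Passer montanus
13,Western Bonelli's Warbler,Phylloscopus bonelli
14,Short-toed Treecreeper,Certhia brachydactyla
15,Common Nightingale,Luscinia megarhynchos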


Let's add a column for further remarks. It contains empty values for now, which we will fill in later:

In [None]:
birds_df['rmk'] = pd.Series(dtype='str')
birds_df

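Unnamed: 0,common_name,bird_species,rmk
0,Common Redstart,Phoenicurus phoenicurus,
1,Eurasian Blackcap,Sylvia atricapilla,
2,Black Redstart,Phoenicurus ochruros,
3,Common Whitethroat,Sylvia communis,
4,Lesser Whitethroat,Sylvia curruca,
5,Great Spotted Woodpecker,Dendrocopos major,
6,European Stonechat,Saxicola rubicola,
7,Grey Wagtail,Motacilla cinerea,
8,Yellowhammer,Emberiza citrinella,
9,Tawny Owl,Strix aluco,
10,Red-backed Shrike,Lanius collurio,
11,Northern Raven,Corvus corax,
12,Eurasian Tree Sparrow,Passer montanus,
13,Western Bonelli's Warbler,Phylloscopus bonelli,
14,Short-toed Treecreeper,Certhia brachydactyla,
15,Common Nightingale,Luscinia megarhynchos,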


Write some content to the remarks column of the Common Whitethroat row:

In [None]:
# in column .common_name select value in list['Common Whitethroat'] and write the remark in "rmk" column of that row
birds_df.loc[birds_df.common_name.isin(['Common Whitethroat']), "rmk"] = 'very nice bird'
birds_df

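Unnamed: 0,common_name,bird_species,rmk
0,Common Redstart,Phoenicurus phoenicurus,
1,Eurasian Blackcap,Sylvia atricapilla,
2,Black Redstart,Phoenicurus ochruros,
3,Common Whitethroat,Sylvia communis,very nice bird
4,Lesser Whitethroat,Sylvia curruca,
5,Great Spotted Woodpecker,Dendrocopos major,
6,European Stonechat,Saxicola rubicola,
7,Grey Wagtail,Motacilla cinerea,
8,Yellowhammer,Emberiza citrinella,
9,Tawny Owl,Strix aluco,
10,Red-backed Shrike,Lanius collurio,
11,Northern Raven,Corvus corax,
12,Eurasian Tree Sparrow,Passer montanus,
13,Western Bonelli's Warbler,Phylloscopus bonelli,
14,Short-toed Treecreeper,Certhia brachydactyla,
15,Common Nightingale,Luscinia megarhynchos,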
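The assignment above works in two steps: a condition produces one True/False value per row, and `.loc` writes only to the rows marked True. A minimal sketch on a made-up three-row table (`df_demo` and the remark text are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "common_name": ["Tawny Owl", "Grey Wagtail", "Yellowhammer"],
    "rmk": [None, None, None],
})

# Step 1: the condition yields one True/False value per row.
mask = df_demo["common_name"].isin(["Grey Wagtail"])
print(mask.tolist())

# Step 2: .loc[rows, column] assigns only where the mask is True.
df_demo.loc[mask, "rmk"] = "seen near the river"
print(df_demo)
```

Passing several names to `isin()` would update all matching rows in one go.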


With these methods you can slowly build up a data set of your bird observations. But this will take a long time. We already have a lot of data on birds that we scraped from the xeno-canto archive. 

### Exploring the xeno-canto archive with pandas

We will now "load" all the retrieved data into a pandas DataFrame. 

In [148]:
df = pd.DataFrame(resp["recordings"])
df.head()

Unnamed: 0,id,gen,sp,ssp,en,rec,cnt,loc,lat,lng,...,lic,q,length,time,date,uploaded,also,rmk,bird-seen,playback-used
0,383063,Tachymarptis,melba,,Alpine Swift,Peter Ertl,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5734,7.5767,...,//creativecommons.org/licenses/by-nc-sa/4.0/,A,0:13,09:30,2017-08-14,2017-08-14,[],call from a nest,unknown,unknown
1,530514,Gallinula,chloropus,,Common Moorhen,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5401,7.5965,...,//creativecommons.org/licenses/by-nc-sa/4.0/,C,0:07,05:00,2020-02-27,2020-02-27,[],,no,no
2,719785,Milvus,migrans,,Black Kite,Nicolas Martinez,Switzerland,"Brislach, Laufen District, Basel-Landschaft",47.4312,7.5357,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,1:16,08:15,2022-04-27,2022-04-27,[],,yes,no
3,530385,Tyto,alba,,Western Barn Owl,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5399,7.5964,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,0:07,03:50,2020-02-25,2020-02-26,[],,no,no
4,534666,Bubo,bubo,,Eurasian Eagle-Owl,Jaro Schacht,Switzerland,Basel-Land,,,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,0:20,20:30,2020-03-14,2020-03-15,[],,yes,unknown


The following is a detailed description of the fields of this object:

* **id:** the catalogue number of the recording on xeno-canto
* **gen:** the generic name of the species
* **sp:** the specific name (epithet) of the species
* **ssp:** the subspecies name (subspecific epithet)
* **en:** the English name of the species
* **rec:** the name of the recordist
* **cnt:** the country where the recording was made
* **loc:** the name of the locality
* **lat:** the latitude of the recording in decimal coordinates
* **lng:** the longitude of the recording in decimal coordinates
* **type:** the sound type of the recording (e.g. 'call', 'song', etc). This is generally a comma-separated list of sound types.
* **url:** the URL specifying the details of this recording
* **file:** the URL to the audio file
* **file-name:** the original file name of the audio file
* **sono:** an object with the urls to the four versions of sonograms
* **lic:** the URL describing the license of this recording
* **q:** the current quality rating for the recording
* **length:** the length of the recording in minutes
* **time:** the time of day that the recording was made
* **date:** the date that the recording was made
* **uploaded:** the date that the recording was uploaded to xeno-canto
* **also:** an array with the identified background species in the recording
* **rmk:** additional remarks by the recordist
* **bird-seen:** was the recorded bird visually identified? (yes/no)
* **playback-used:** was playback used to lure the bird? (yes/no)
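Note that the API delivers nearly all of these values as strings (for example `'lat': '47.5734'` and `'length': '0:13'` in the response above), so numeric work needs a conversion first. A minimal sketch, with `length_to_seconds` as a hypothetical helper that assumes the `m:ss` or `h:mm:ss` layout:

```python
def length_to_seconds(length):
    """Convert a duration string such as '0:13' or '1:02:05' to seconds."""
    seconds = 0
    # each further part is a smaller unit, so scale what we have by 60
    for part in length.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(length_to_seconds("0:13"))
print(length_to_seconds("2:39"))

# coordinates arrive as text as well and convert with float()
print(float("47.5734"))
```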


Let's find the most frequently recorded bird, according to its common English name:

In [149]:
df['Count_en'] = df.groupby(['en'])['en'].transform('count')
df.sort_values('Count_en', ascending=False)

Unnamed: 0,id,gen,sp,ssp,en,rec,cnt,loc,lat,lng,...,q,length,time,date,uploaded,also,rmk,bird-seen,playback-used,Count_en
75,652106,Phoenicurus,phoenicurus,phoenicurus,Common Redstart,Nicolas Martinez,Switzerland,"Riehen, Basel-Stadt, Basel-Stadt",47.5785,7.6684,...,A,2:39,07:00,2021-05-26,2021-05-26,"[Cyanistes caeruleus, Troglodytes troglodytes,...",,yes,no,53
91,552256,Phoenicurus,phoenicurus,Mixed Singer,Common Redstart,Nicolas Martinez,Switzerland,"Seltisberg, Liestal, Basel-Landschaft",47.4612,7.7151,...,B,0:36,00:00,2020-04-29,2020-04-29,[],Recorder: Bahar Sezer;\r\nMixed singer (phenot...,yes,no,53
85,570026,Phoenicurus,phoenicurus,phoenicurus,Common Redstart,Nicolas Martinez,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5649,7.6235,...,B,1:21,06:30,2020-06-11,2020-06-19,[Delichon urbicum],,unknown,unknown,53
86,555181,Phoenicurus,phoenicurus,Mixed Singer,Common Redstart,Nicolas Martinez,Switzerland,"Beton Christen AG (near Pratteln), Arlesheim,...",47.5212,7.6689,...,B,2:54,08:30,2020-05-08,2020-05-08,"[Apus apus, Sylvia atricapilla]",Mixed Singer with many imitations including Ch...,yes,yes,53
87,553453,Phoenicurus,phoenicurus,Mixed Singer,Common Redstart,Nicolas Martinez,Switzerland,"Seltisberg, Liestal, Basel-Landschaft",47.4612,7.7151,...,B,0:50,10:00,2020-05-01,2020-05-03,[],recorder: Bahar Sezer\r\nsame bird as XC552256...,yes,no,53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49,537549,Regulus,ignicapilla,,Common Firecrest,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5401,7.5965,...,B,0:02,03:00,2020-03-21,2020-03-23,[],,no,no,1
127,719781,Anthus,campestris,,Tawny Pipit,Nicolas Martinez,Switzerland,"Brislach, Laufen District, Basel-Landschaft",47.4328,7.5484,...,A,0:23,07:45,2022-04-27,2022-04-27,[],,yes,no,1
128,296795,Pyrrhula,pyrrhula,pyrrhula,Eurasian Bullfinch,Lüthi Thomas,Switzerland,"Eptingen, Waldenburg, Basel-Landschaft",47.3635,7.8072,...,A,0:48,09:57,2015-12-24,2015-12-24,[Coccothraustes coccothraustes],,no,no,1
129,719783,Linaria,cannabina,,Common Linnet,Nicolas Martinez,Switzerland,"Brislach, Laufen District, Basel-Landschaft",47.4312,7.5357,...,A,2:12,08:15,2022-04-27,2022-04-27,"[Milvus migrans, Certhia brachydactyla, Turdus...",,yes,no,1
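`transform('count')` writes the size of each group back onto every row of that group, which is why the same count repeats in the table above. When only one entry per bird is wanted, `value_counts()` is a shorter route. The sketch uses a made-up mini table named `demo`.

```python
import pandas as pd

demo = pd.DataFrame({"en": ["Common Redstart", "Tawny Owl", "Common Redstart"]})

# transform('count') keeps the original shape: one count per row.
demo["Count_en"] = demo.groupby("en")["en"].transform("count")
print(demo)

# value_counts() collapses to one entry per bird, sorted by frequency.
per_bird = demo["en"].value_counts()
print(per_bird)
```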


In [147]:
# list all bird names in order of frequency
df.sort_values('Count_en', ascending=False)['en'].unique()

array(['Common Redstart', 'Identity unknown', 'Eurasian Blackcap',
       'Black Redstart', 'Common Whitethroat', 'Lesser Whitethroat',
       'Great Spotted Woodpecker', 'European Stonechat', 'Grey Wagtail',
       'Yellowhammer', 'Tawny Owl', 'Red-backed Shrike', 'Northern Raven',
       'Eurasian Tree Sparrow', "Western Bonelli's Warbler",
       'Short-toed Treecreeper', 'Common Nightingale',
       'Common Chiffchaff', 'Common Blackbird',
       'European Green Woodpecker', 'Redwing', 'Spotted Flycatcher',
       'European Robin', 'Marsh Tit', 'Lesser Spotted Woodpecker',
       'Middle Spotted Woodpecker', 'Soundscape', 'Common Moorhen',
       'Eurasian Eagle-Owl', 'Western Barn Owl', 'Black Kite',
       'Eurasian Wryneck', 'Eurasian Nuthatch', 'Cirl Bunting',
       'House Sparrow', 'Woodlark', 'Rook', 'Eurasian Jay',
       'Eurasian Golden Oriole', 'Willow Warbler',
       'White-throated Dipper', 'Great Reed Warbler',
       'European Crested Tit', 'Eurasian Reed Warbler', 

In [157]:
# Combine Genus+Epithet and 
# list all bird species in order of 'en' frequency
df['bird_species'] = df['gen']+' '+df['sp']
# yes, some things are just incredibly easy with pandas :)
df.sort_values('Count_en', ascending=False)['bird_species'].unique()

array(['Phoenicurus phoenicurus', 'Mystery mystery', 'Sylvia atricapilla',
       'Phoenicurus ochruros', 'Sylvia communis', 'Sylvia curruca',
       'Dendrocopos major', 'Saxicola rubicola', 'Motacilla cinerea',
       'Emberiza citrinella', 'Strix aluco', 'Lanius collurio',
       'Corvus corax', 'Passer montanus', 'Phylloscopus bonelli',
       'Certhia brachydactyla', 'Luscinia megarhynchos',
       'Phylloscopus collybita', 'Turdus merula', 'Picus viridis',
       'Turdus iliacus', 'Muscicapa striata', 'Erithacus rubecula',
       'Poecile palustris', 'Dryobates minor', 'Dendrocoptes medius',
       'Sonus naturalis', 'Gallinula chloropus', 'Bubo bubo', 'Tyto alba',
       'Milvus migrans', 'Jynx torquilla', 'Sitta europaea',
       'Emberiza cirlus', 'Passer domesticus', 'Lullula arborea',
       'Corvus frugilegus', 'Garrulus glandarius', 'Oriolus oriolus',
       'Phylloscopus trochilus', 'Cinclus cinclus',
       'Acrocephalus arundinaceus', 'Lophophanes cristatus',
       'Ac

In [162]:
# are most frequent birds spotted by most active contributors?

# show most frequent birds and who spotted them
df.sort_values('Count_en', ascending=False)[['en','rec','Count_en']]

Unnamed: 0,en,rec,Count_en
75,Common Redstart,Nicolas Martinez,53
91,Common Redstart,Nicolas Martinez,53
85,Common Redstart,Nicolas Martinez,53
86,Common Redstart,Nicolas Martinez,53
87,Common Redstart,Nicolas Martinez,53
...,...,...,...
49,Common Firecrest,Samuel Büttler,1
127,Tawny Pipit,Nicolas Martinez,1
128,Eurasian Bullfinch,Lüthi Thomas,1
129,Common Linnet,Nicolas Martinez,1


In [161]:
# show most frequent contributors and their birds
df['Count_rec'] = df.groupby(['rec'])['rec'].transform('count')
df.sort_values('Count_rec', ascending=False)[['en','rec','Count_rec']]

Unnamed: 0,en,rec,Count_rec
75,Common Redstart,Nicolas Martinez,87
63,Black Redstart,Nicolas Martinez,87
73,Common Redstart,Nicolas Martinez,87
72,Common Redstart,Nicolas Martinez,87
71,Common Redstart,Nicolas Martinez,87
...,...,...,...
12,Great Spotted Woodpecker,Simon Birrer,2
9,Lesser Spotted Woodpecker,Simon Birrer,2
80,Common Redstart,Thomas Lüthi,2
144,Identity unknown,Ronny Pfüller,1


In [114]:
# show all birds spotted on 2022-08-18:
df.loc[df.date.isin(['2022-08-18'])]

Unnamed: 0,id,gen,sp,ssp,en,rec,cnt,loc,lat,lng,...,q,length,time,date,uploaded,also,rmk,bird-seen,playback-used,Frequency
149,743847,Mystery,mystery,,Identity unknown,Samuel Büttler,Switzerland,"Falkensteinerstrasse (near Basel), Basel-Stad...",47.5396,7.6013,...,E,0:06,04:30,2022-08-18,2022-08-19,[],Nocmig,no,no,16
146,743795,Mystery,mystery,,Identity unknown,Samuel Büttler,Switzerland,"Falkensteinerstrasse (near Basel), Basel-Stad...",47.5396,7.6013,...,D,0:04,02:30,2022-08-18,2022-08-18,[],Could this be a Temminck's Stint?,no,no,16


In [None]:
# which birds were spotted on most active spotting days?

# count dates

# return all birds spotted on three most active dates
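One possible approach to this exercise is sketched below on a made-up mini table (`obs`) rather than the real data. It counts the dates with `value_counts()` and then filters with `isin()`, the same method used in the cell for a single date.

```python
import pandas as pd

obs = pd.DataFrame({
    "en": ["Tawny Owl", "Rook", "Redwing", "Rook"],
    "date": ["2020-05-01", "2020-05-01", "2020-05-02", "2020-05-01"],
})

# count dates: number of recordings per day, most active day first
per_day = obs["date"].value_counts()

# keep the most active day; use head(3) for the three most active dates
top_dates = per_day.head(1).index

# return all birds recorded on those dates, each name only once
birds_on_top_dates = obs.loc[obs["date"].isin(top_dates), "en"].unique()
print(birds_on_top_dates)
```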

### Download audio recording

To download audio files, we need to find links to them, which in the case of this dataset are not explicitly given. 

According to the documentation, there are four relevant columns in the DataFrame: 'url', 'file', 'file-name', 'sono'. We will look at those entries for one bird, for example in row 4:

In [166]:
df.loc[4][['en','url', 'file', 'file-name', 'sono']]

en                                          Eurasian Eagle-Owl
url                                    //xeno-canto.org/534666
file                    https://xeno-canto.org/534666/download
file-name                                XC534666-Uhu J.01.MP3
sono         {'small': '//xeno-canto.org/sounds/uploaded/JG...
Name: 4, dtype: object

On the [xeno-canto website](https://xeno-canto.org/534666), we see that the actual URL to the audio file is a combination of the first part of the 'sono' URL (up to and including the uploader's folder) and the file name. The 'sono' entry holds four different links in a dictionary object:

In [167]:
df.loc[4]['sono']

{'small': '//xeno-canto.org/sounds/uploaded/JGJWSKLOEW/ffts/XC534666-small.png',
 'med': '//xeno-canto.org/sounds/uploaded/JGJWSKLOEW/ffts/XC534666-med.png',
 'large': '//xeno-canto.org/sounds/uploaded/JGJWSKLOEW/ffts/XC534666-large.png',
 'full': '//xeno-canto.org/sounds/uploaded/JGJWSKLOEW/ffts/XC534666-full.png'}

We can use this in combination with entries in 'file-name' column to create downloadable URLs for audio files which we will store on the computer, with local file paths stored in the DataFrame for later use.

In [176]:
df['sono'].apply(lambda x: x.get('small')[:43] if x else None)

0      //xeno-canto.org/sounds/uploaded/ZGGFYJIGKA
1      //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
2      //xeno-canto.org/sounds/uploaded/CVLRXNQXVL
3      //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
4      //xeno-canto.org/sounds/uploaded/JGJWSKLOEW
                          ...                     
145    //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
146    //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
147    //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
148    //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
149    //xeno-canto.org/sounds/uploaded/XOIKQWVJSA
Name: sono, Length: 150, dtype: object

In [177]:
# add a column with correct URLs to audio_download: 'https:' + sono-small[:43] +'/' + file-name
df['audio_download'] = 'https:'+df['sono'].apply(lambda x: x.get('small')[:43] if x else None)+'/'+df['file-name']

In [179]:
# look at the newly created URL for the bird in row 4
df.loc[4]['audio_download']

'https://xeno-canto.org/sounds/uploaded/JGJWSKLOEW/XC534666-Uhu J.01.MP3'
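The fixed slice `[:43]` works because every uploader folder name seen here has the same length. A sketch that does not depend on that length splits the sonogram URL at `/ffts/` instead, and percent-encodes the file name so that spaces are safe in the URL. `audio_url` is a hypothetical helper, and the `/ffts/` layout is an assumption read off the sample output above.

```python
from urllib.parse import quote


def audio_url(sono_small, file_name):
    """Build the audio URL from a sonogram link and the original file name."""
    # Everything before '/ffts/' is the uploader's folder.
    base = sono_small.split("/ffts/")[0]
    # quote() escapes the space in names like 'XC534666-Uhu J.01.MP3'.
    return "https:" + base + "/" + quote(file_name)


print(audio_url(
    "//xeno-canto.org/sounds/uploaded/JGJWSKLOEW/ffts/XC534666-small.png",
    "XC534666-Uhu J.01.MP3",
))
```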

We could have also done this in a loop, but this is actually more complicated to execute properly with pandas (looping through indexes and preserving state):

In [None]:
'''urls = []
file_paths = []
for index in df.index:
  base=df.sono[index]['small'][:43]
  #print(df.sono[0]['small'])
  filename=df['file-name'][index].replace(" ", "%20")
  #print(df['file-name'][0])
  url='http:'+base+'/'+filename
  urls.append(url)'''

Finally, we should download (some of) these audio files. 
In the Colab environment, we can store files in the virtual (and ephemeral) /sample_data folder, write them to our local hard drive, or upload them to our personal Google Drive. In this exercise we will work with the /sample_data folder. 

We will create a folder for each user ID and store all recordings made by that user in it. In order to create folders, we need ```os```, a module from Python's standard library for interacting with the operating system.

In [180]:
import os

In [188]:
df['audio_download'][2][50:]

'XC719785-milmig_20220427_breitenbach_ZOOM0309_Tr1.mp3'

In [189]:
# this will download only the first 5 files
file_paths = []
for i in range(0, 5):
  url = df['audio_download'][i]
  download = requests.get(url)
  # skip this file if the server did not return it successfully
  if not download.ok:
    continue
  folder = url[39:49]  # the 10-character user ID inside the URL
  filename = url[50:].replace(" ", "_")
  # create the folder if it is not there yet
  os.makedirs('sample_data/' + folder, exist_ok=True)
  path = 'sample_data/' + folder + '/' + filename
  # the with-statement closes the file automatically after writing
  with open(path, 'wb') as file:
    file.write(download.content)
  file_paths.append(path)

In [190]:
# when we have all the paths in the dataframe, we can add that list as a new column
file_paths

['sample_data/ZGGFYJIGKA/XC383063-alpensegler.mp3',
 'sample_data/XOIKQWVJSA/XC530514-Teichhuhn1_27_2_20_(online-audio-converter.com).mp3',
 'sample_data/CVLRXNQXVL/XC719785-milmig_20220427_breitenbach_ZOOM0309_Tr1.mp3',
 'sample_data/XOIKQWVJSA/XC530385-Schleiereule250220_(online-audio-converter.com).mp3',
 'sample_data/JGJWSKLOEW/XC534666-Uhu_J.01.MP3']
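Because only the first five rows were downloaded, the list is shorter than the DataFrame and cannot be assigned as a column directly. Wrapping it in a Series whose index matches the downloaded rows lets pandas align the values and leave the remaining rows empty. A sketch on a made-up table (`table` and the two paths are illustrative):

```python
import pandas as pd

table = pd.DataFrame({"en": ["Alpine Swift", "Common Moorhen", "Black Kite"]})
paths = ["sample_data/a.mp3", "sample_data/b.mp3"]  # only two files downloaded

# Give each path the index label of the row it belongs to (rows 0 and 1 here).
table["file_path"] = pd.Series(paths, index=table.index[:len(paths)])
print(table)
```

Rows without a downloaded file end up with a missing value in `file_path`, which makes it easy to see what is still to be fetched.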

In [None]:
# repeat for downloading images (same code, just simpler)