<a href="https://colab.research.google.com/github/fleshgordo/cocreate22/blob/main/Copy_of_002_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping data from the web

In this exercise we will work with the [xeno-canto](https://xeno-canto.org/) archive of bird recordings. 

### Requirements

- Download and install [Insomnia](https://insomnia.rest/download). It's a tool that helps to quickly test an [API](https://en.wikipedia.org/wiki/API#Web_APIs) on the web.

A basic search URL looks like:

````
https://www.xeno-canto.org/api/2/recordings?
 
````

As described in the [API documentation](https://xeno-canto.org/explore/api), you can pass several parameters to filter your search. These parameters are added to the end of the URL, as shown in the screenshot from Insomnia:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_query.jpg "Title")

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow
 
````

The response to this query is in JavaScript Object Notation (JSON) format. It reports the total number of recordings ```numRecordings``` (16488) and the number of species ```numSpecies``` (125). Since there are many results, the API serves only the first page, i.e. the first 500 results of this query. To access results 501-1000, add a second parameter ```page=2```. The URL would look like:

````
https://www.xeno-canto.org/api/2/recordings?query=sparrow&page=2
 
````
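Because each page holds at most 500 recordings, the number of pages follows from ```numRecordings```. The following is a minimal sketch of that arithmetic and of the resulting page URLs. The names `page_count` and `page_urls` are illustrative, and the page size of 500 is taken from the text above, not read from the API:

```python
import math

BASE_URL = "https://www.xeno-canto.org/api/2/recordings"
PAGE_SIZE = 500  # assumption: page size as described in the text above


def page_count(num_recordings, page_size=PAGE_SIZE):
    """Number of pages needed to list num_recordings results."""
    return math.ceil(num_recordings / page_size)


def page_urls(query, num_recordings):
    """Build one URL per result page for the given query."""
    return [
        f"{BASE_URL}?query={query}&page={page}"
        for page in range(1, page_count(num_recordings) + 1)
    ]


urls = page_urls("sparrow", 16488)
print(len(urls))  # number of pages for 16488 recordings
print(urls[1])    # the URL for page 2
```

The list starts at page 1, so `urls[1]` is the second page.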

The ```recordings``` entry is a list that contains all recordings matching the search. Within Insomnia you can now look at the structure of the data to get a better understanding of how it is organised. You can also filter the response with the filter function in the bottom bar:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_filter.jpg "Title")

Only show countries:

````
$.recordings[*].cnt
````

Or show just the remarks for the recordings:

````
$.recordings[*].rmk
````
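For comparison, the same kind of selection can be written in plain Python once the JSON has been parsed into a dictionary. This is a sketch on a tiny stand-in response; the helper name `pluck` and the second recording are made up for illustration:

```python
# A tiny stand-in for the parsed JSON response (second entry is invented)
response = {
    "recordings": [
        {"cnt": "Switzerland", "rmk": "call from a nest"},
        {"cnt": "France", "rmk": ""},
    ]
}


def pluck(response, field):
    """Return one field of every recording, like $.recordings[*].<field>."""
    return [recording[field] for recording in response["recordings"]]


countries = pluck(response, "cnt")
remarks = pluck(response, "rmk")
print(countries)
print(remarks)
```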

Further query parameters (such as time, geolocation, search terms) are possible. For example, to find sparrow recordings from Switzerland that are rated quality A:

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow cnt:Switzerland q:A
 
````
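The query above contains spaces, which tools like Insomnia or a browser encode for you before sending. A hedged sketch of doing that encoding explicitly with Python's standard library follows; `build_search_url` is a hypothetical helper, not part of the xeno-canto API:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.xeno-canto.org/api/2/recordings"


def build_search_url(*terms):
    """Join search terms with spaces and encode them into a query URL."""
    query = " ".join(terms)
    # urlencode replaces spaces with "+" and percent-encodes reserved characters
    return BASE_URL + "?" + urlencode({"query": query})


url = build_search_url("sparrow", "cnt:Switzerland", "q:A")
print(url)
```

Keeping the search terms as separate arguments makes it easy to add or drop a filter without editing the URL by hand.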

Using the filter again we can obtain the links to the spectrogram files of the recordings:

````
$.recordings[*].sono.full
````

You can copy the link and open it in a browser.

In the next section we will automate our query requests and automatically fetch the files.


## Scraping data with Python

First we install and import some additional Python packages which we will use for this scraping exercise.

In [None]:
# if the module 'mplleaflet' is missing, install it using pip
# then re-run the cell that imports it
!pip install mplleaflet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mplleaflet
  Downloading mplleaflet-0.0.5.tar.gz (37 kB)
Building wheels for collected packages: mplleaflet
  Building wheel for mplleaflet (setup.py) ... done
  Created wheel for mplleaflet: filename=mplleaflet-0.0.5-py3-none-any.whl size=28582 sha256=12f788f07d18f269fd2ae21cddb1ed64389ab1db308f6049958fe077a3820ad9
  Stored in directory: /root/.cache/pip/wheels/6b/f5/21/cdd12e476182b4b0b98326cdb9efa02ddbd5d87ca5de051c84
Successfully built mplleaflet
Installing collected packages: mplleaflet
Successfully installed mplleaflet-0.0.5


## Scrape from xeno-canto

We will use a Python library called [requests](https://requests.readthedocs.io/en/latest/), whose slogan is **HTTP for Humans**, to open a URL and "imitate" a browser. First, we build our API URL and check that it looks correct:

In [None]:
#params= "cnt:'brazil'"
params = "cnt:switzerland loc:basel"
url="https://www.xeno-canto.org/api/2/recordings?query="+params
print(url)


https://www.xeno-canto.org/api/2/recordings?query=cnt:switzerland loc:basel


In [None]:
import requests

Next step: __requests__ has a function ```get()``` that takes the URL to call and, optionally, extra information for the request headers. Here we pass a ```Content-Type``` header along with the request. The API answers in [JSON](https://en.wikipedia.org/wiki/JSON) format, which ```r.json()``` parses into a Python dictionary. The parsed response is stored in a variable called ```resp```.

In [None]:
r = requests.get(url, headers={"Content-Type":"json"})
resp = r.json()
print(resp)

{'numRecordings': '150', 'numSpecies': '50', 'page': 1, 'numPages': 1, 'recordings': [{'id': '383063', 'gen': 'Tachymarptis', 'sp': 'melba', 'ssp': '', 'en': 'Alpine Swift', 'rec': 'Peter Ertl', 'cnt': 'Switzerland', 'loc': 'Basel, Basel-Stadt, Basel-Stadt', 'lat': '47.5734', 'lng': '7.5767', 'alt': '260', 'type': 'call', 'url': '//xeno-canto.org/383063', 'file': 'https://xeno-canto.org/383063/download', 'file-name': 'XC383063-alpensegler.mp3', 'sono': {'small': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-small.png', 'med': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-med.png', 'large': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-large.png', 'full': '//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-full.png'}, 'lic': '//creativecommons.org/licenses/by-nc-sa/4.0/', 'q': 'A', 'length': '0:13', 'time': '09:30', 'date': '2017-08-14', 'uploaded': '2017-08-14', 'also': [''], 'rmk': 'call from a nest', 'bird-seen': 'unknown', 'playback-used
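Note that ```numRecordings``` and ```numSpecies``` arrive as strings in the output above, while ```page``` and ```numPages``` are numbers. A small sketch that converts the counts to integers before using them; `summarize` is an illustrative helper, and the stand-in dictionary keeps only one shortened recording:

```python
def summarize(resp):
    """Convert the count fields of a response to integers."""
    return {
        "recordings": int(resp["numRecordings"]),
        "species": int(resp["numSpecies"]),
        "page": int(resp["page"]),
        "pages": int(resp["numPages"]),
        "on_this_page": len(resp["recordings"]),
    }


# Stand-in with the same count fields as the output above; the recordings
# list is cut down to a single placeholder entry for illustration.
sample = {
    "numRecordings": "150",
    "numSpecies": "50",
    "page": 1,
    "numPages": 1,
    "recordings": [{"id": "383063"}],
}
summary = summarize(sample)
print(summary)
```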

### Download files 
For reference, read the introduction to [working with files in Google Colab](https://neptune.ai/blog/google-colab-dealing-with-files)

Specific information on working with external data in Colab: [Local Files, Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb) 

In [None]:
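fileurl = resp["recordings"][0]["sono"]["small"]
# sonogram URLs start with "//", so we prepend the scheme
download = requests.get("https:" + fileurl)
# the with-block closes the file automatically, even if writing fails
with open("sample_data/sample_image.png", "wb") as file:
    file.write(download.content)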
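The `"https:"` prefix in the cell above is needed because the sonogram links in the response start with `//`. To fetch more than one file, it can help to first compute the full URL and a local file name for every recording. A minimal sketch, where `absolute_url`, `sonogram_targets` and the `XC<id>-<size>.png` naming scheme are assumptions made for illustration:

```python
def absolute_url(url, scheme="https"):
    """Prefix protocol-relative URLs ("//host/path") with a scheme."""
    if url.startswith("//"):
        return f"{scheme}:{url}"
    return url


def sonogram_targets(recordings, size="small", folder="sample_data"):
    """Pair each recording's sonogram URL with a local file path."""
    targets = []
    for recording in recordings:
        url = absolute_url(recording["sono"][size])
        path = f"{folder}/XC{recording['id']}-{size}.png"
        targets.append((url, path))
    return targets


# One recording copied from the response above, reduced to the fields used here
sample = [{
    "id": "383063",
    "sono": {"small": "//xeno-canto.org/sounds/uploaded/ZGGFYJIGKA/ffts/XC383063-small.png"},
}]
targets = sonogram_targets(sample)
for url, path in targets:
    print(url, "->", path)
```

Each pair can then be passed to `requests.get()` and written to disk in the same way as the single file above.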


## Pandas

Pandas is another Python library, heavily used in data analysis. Its developers describe it as a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

Pandas can import many different file formats (CSV, JSON, etc.), clean up data and re-export it to other formats.

As usual we will first import the library with the import statement:

In [None]:
import pandas as pd

We will use pandas to manipulate the data we retrieve by scraping the xeno-canto archive.
But first, we will briefly look into how pandas works.

## Pandas Series and DataFrames

The pandas library comes with two data structures: Series and DataFrames.
A Series is a one-dimensional structure of key-value pairs: each value is stored under an index label.


In [None]:
pd.Series?

A DataFrame is a two-dimensional data structure. You can think of a DataFrame as an Excel spreadsheet (i.e. it is 2-dimensional and it has rows and columns). It can be made by combining two or more Series.

In [None]:
pd.DataFrame?

We can make a pandas Series from a list of the 13 most common birds in the Basel area:

In [None]:
common_birds = ['Common Redstart', 'Eurasian Blackcap', 'Black Redstart', 'Common Whitethroat', 'Great Spotted Woodpecker', 'Lesser Whitethroat', 'European Stonechat',
 'Northern Raven', 'Short-toed Treecreeper', 'Tawny Owl', 'Yellowhammer', 'Red-backed Shrike', 'Common Nightingale']
s1 = pd.Series(common_birds) 
s1

0              Common Redstart
1            Eurasian Blackcap
2               Black Redstart
3           Common Whitethroat
4     Great Spotted Woodpecker
5           Lesser Whitethroat
6           European Stonechat
7               Northern Raven
8       Short-toed Treecreeper
9                    Tawny Owl
10                Yellowhammer
11           Red-backed Shrike
12          Common Nightingale
dtype: object

We can imagine that an ornithologist friend gives us the [scientific name](https://birdsoftheworld.org/bow/key-to-scientific-names/sciname-parts) for each of these birds:

In [None]:
bird_species = ['Phoenicurus phoenicurus', 'atricapilla', 'ochruros', 'communis', 'major', 'curruca', 'rubicola', 'citrinella', 'collurio', 'montanus', 'aluco', 'corax', 'collybita']

We could combine these two lists in a Python dictionary, either with the following 'naive' code **(method 1)** or with the built-in function zip(), which does just that **(method 2)**:

In [None]:
# method 1
birds_dict1 = {}
# pair the lists by position; bird_species stays unchanged so later cells can reuse it
for i in range(len(common_birds)):
    birds_dict1[common_birds[i]] = bird_species[i]
print("Common and latin names for birds in Basel : " +  str(birds_dict1))

Common and latin names for birds in Basel : {'Common Redstart': 'Phoenicurus phoenicurus', 'Eurasian Blackcap': 'atricapilla', 'Black Redstart': 'ochruros', 'Common Whitethroat': 'communis', 'Great Spotted Woodpecker': 'major', 'Lesser Whitethroat': 'curruca', 'European Stonechat': 'rubicola', 'Northern Raven': 'citrinella', 'Short-toed Treecreeper': 'collurio', 'Tawny Owl': 'montanus', 'Yellowhammer': 'aluco', 'Red-backed Shrike': 'corax', 'Common Nightingale': 'collybita'}


In [None]:
# method 2
birds_dict2 = dict(zip(common_birds, bird_species))
print("Common and latin names for birds in Basel : " +  str(birds_dict2))

Common and latin names for birds in Basel : {'Common Redstart': 'Phoenicurus phoenicurus', 'Eurasian Blackcap': 'atricapilla', 'Black Redstart': 'ochruros', 'Common Whitethroat': 'communis', 'Great Spotted Woodpecker': 'major', 'Lesser Whitethroat': 'curruca', 'European Stonechat': 'rubicola', 'Northern Raven': 'citrinella', 'Short-toed Treecreeper': 'collurio', 'Tawny Owl': 'montanus', 'Yellowhammer': 'aluco', 'Red-backed Shrike': 'corax', 'Common Nightingale': 'collybita'}
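One thing to keep in mind with zip(): it stops at the end of the shorter list without raising an error, so a missing entry goes unnoticed. A short sketch with deliberately unequal lists (the shortened lists are made up from the values above):

```python
names = ["Common Redstart", "Eurasian Blackcap", "Black Redstart"]
species = ["Phoenicurus phoenicurus", "atricapilla"]  # one entry short on purpose

pairs = dict(zip(names, species))
print(pairs)  # "Black Redstart" has no partner and is silently dropped

# A simple guard to run before zipping
same_length = len(names) == len(species)
print(same_length)
```

Checking the lengths first is a cheap way to catch a list that lost an entry.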


We can create a pandas Series and use it like a dictionary:

In [None]:
s2 = pd.Series(bird_species, index=common_birds)
s2

Common Redstart             Phoenicurus phoenicurus
Eurasian Blackcap                       atricapilla
Black Redstart                             ochruros
Common Whitethroat                         communis
Great Spotted Woodpecker                      major
Lesser Whitethroat                          curruca
European Stonechat                         rubicola
Northern Raven                           citrinella
Short-toed Treecreeper                     collurio
Tawny Owl                                  montanus
Yellowhammer                                  aluco
Red-backed Shrike                             corax
Common Nightingale                        collybita
dtype: object

In [None]:
# what is the latin name for Common Nightingale?
s2.loc['Common Nightingale']

'collybita'

In [None]:
# what is the 3rd most common bird in Basel area?
s2.iloc[[2]]

Black Redstart    ochruros
dtype: object
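The two lookups above use different addressing schemes: one goes by index label, the other by position. The following sketch puts them side by side on a shortened Series (three entries taken from the data above):

```python
import pandas as pd

s = pd.Series(
    ["Phoenicurus phoenicurus", "atricapilla", "ochruros"],
    index=["Common Redstart", "Eurasian Blackcap", "Black Redstart"],
)

by_label = s.loc["Black Redstart"]  # look up by index label
by_position = s.iloc[2]             # look up by integer position (0-based)
first_two = s.iloc[:2]              # a slice returns a shorter Series
has_owl = "Tawny Owl" in s          # membership tests the index, like dict keys

print(by_label, by_position, list(first_two.index), has_owl)
```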

This is fine, but it becomes difficult to manage once we want to include additional data on each bird, such as its colour, a description, a picture or the sound it makes.

### Querying and manipulating a DataFrame

Let's create a DataFrame from our data on birds names, and add more information on each bird:

In [None]:
s1 = pd.Series(common_birds) 
s2 = pd.Series(bird_species)
birds_df = pd.concat([s1, s2], axis=1) 
birds_df

Unnamed: 0,0,1
0,Common Redstart,Phoenicurus phoenicurus
1,Eurasian Blackcap,atricapilla
2,Black Redstart,ochruros
3,Common Whitethroat,communis
4,Great Spotted Woodpecker,major
5,Lesser Whitethroat,curruca
6,European Stonechat,rubicola
7,Northern Raven,citrinella
8,Short-toed Treecreeper,collurio
9,Tawny Owl,montanus


We can give more meaningful names to the columns, such as `common_name` and `bird_species`.

In [None]:
birds_df=birds_df.rename(columns={0: "common_name", 1: "bird_species"})
birds_df

Unnamed: 0,common_name,bird_species
0,Common Redstart,Phoenicurus phoenicurus
1,Eurasian Blackcap,atricapilla
2,Black Redstart,ochruros
3,Common Whitethroat,communis
4,Great Spotted Woodpecker,major
5,Lesser Whitethroat,curruca
6,European Stonechat,rubicola
7,Northern Raven,citrinella
8,Short-toed Treecreeper,collurio
9,Tawny Owl,montanus


Let's add a column for further remarks. It contains empty values for now, which we will fill in later.

In [None]:
birds_df['rmk'] = pd.Series(dtype='str')
birds_df

Unnamed: 0,common_name,bird_species,rmk
0,Common Redstart,Phoenicurus phoenicurus,
1,Eurasian Blackcap,atricapilla,
2,Black Redstart,ochruros,
3,Common Whitethroat,communis,
4,Great Spotted Woodpecker,major,
5,Lesser Whitethroat,curruca,
6,European Stonechat,rubicola,
7,Northern Raven,citrinella,
8,Short-toed Treecreeper,collurio,
9,Tawny Owl,montanus,


Write a remark into the Common Whitethroat row:

In [None]:
# in column .common_name select value in list['Common Whitethroat'] and write the remark in "rmk" column of that row
birds_df.loc[birds_df.common_name.isin(['Common Whitethroat']), "rmk"] = 'very nice bird'
birds_df

Unnamed: 0,common_name,bird_species,rmk
0,Common Redstart,Phoenicurus phoenicurus,
1,Eurasian Blackcap,atricapilla,
2,Black Redstart,ochruros,
3,Common Whitethroat,communis,very nice bird
4,Great Spotted Woodpecker,major,
5,Lesser Whitethroat,curruca,
6,European Stonechat,rubicola,
7,Northern Raven,citrinella,
8,Short-toed Treecreeper,collurio,
9,Tawny Owl,montanus,
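The selection works through a boolean mask: a Series holding `True` or `False` for each row. For a single value, a plain `==` comparison gives the same kind of mask as `isin()`. A minimal sketch on a three-row stand-in table:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "common_name": ["Common Redstart", "Common Whitethroat", "Tawny Owl"],
    "rmk": ["", "", ""],
})

mask = sample_df["common_name"] == "Common Whitethroat"  # True/False per row
sample_df.loc[mask, "rmk"] = "very nice bird"            # write only where mask is True

print(mask.tolist())
print(sample_df)
```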


With these methods you can slowly build up a data set for your bird observations, but this will take a long time. We already have a lot of data on birds that we scraped from the xeno-canto archive.

We will now load all the retrieved data into a pandas DataFrame:

In [None]:
df = pd.DataFrame(resp["recordings"])
df.head()

Unnamed: 0,id,gen,sp,ssp,en,rec,cnt,loc,lat,lng,...,lic,q,length,time,date,uploaded,also,rmk,bird-seen,playback-used
0,383063,Tachymarptis,melba,,Alpine Swift,Peter Ertl,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5734,7.5767,...,//creativecommons.org/licenses/by-nc-sa/4.0/,A,0:13,09:30,2017-08-14,2017-08-14,[],call from a nest,unknown,unknown
1,530514,Gallinula,chloropus,,Common Moorhen,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5401,7.5965,...,//creativecommons.org/licenses/by-nc-sa/4.0/,C,0:07,05:00,2020-02-27,2020-02-27,[],,no,no
2,719785,Milvus,migrans,,Black Kite,Nicolas Martinez,Switzerland,"Brislach, Laufen District, Basel-Landschaft",47.4312,7.5357,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,1:16,08:15,2022-04-27,2022-04-27,[],,yes,no
3,530385,Tyto,alba,,Western Barn Owl,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5399,7.5964,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,0:07,03:50,2020-02-25,2020-02-26,[],,no,no
4,534666,Bubo,bubo,,Eurasian Eagle-Owl,Jaro Schacht,Switzerland,Basel-Land,,,...,//creativecommons.org/licenses/by-nc-sa/4.0/,B,0:20,20:30,2020-03-14,2020-03-15,[],,yes,unknown
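The coordinates in this table are text, and some recordings have none. Before plotting them on a map they need to be numbers. This sketch converts them on a two-row stand-in; representing the missing coordinates as `None` is an assumption about how they arrive:

```python
import pandas as pd

records = [
    {"en": "Alpine Swift", "lat": "47.5734", "lng": "7.5767"},
    {"en": "Eurasian Eagle-Owl", "lat": None, "lng": None},  # assumed missing
]
sample_df = pd.DataFrame(records)

# Convert the text columns to floats; unparsable or missing values become NaN
for column in ["lat", "lng"]:
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce")

# Keep only the rows that have both coordinates
with_coords = sample_df.dropna(subset=["lat", "lng"])
print(sample_df.dtypes)
print(with_coords["en"].tolist())
```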


The following is a detailed description of the fields of this object:

* **id:** the catalogue number of the recording on xeno-canto
* **gen:** the generic name of the species
* **sp:** the specific name (epithet) of the species
* **ssp:** the subspecies name (subspecific epithet)
* **en:** the English name of the species
* **rec:** the name of the recordist
* **cnt:** the country where the recording was made
* **loc:** the name of the locality
* **lat:** the latitude of the recording in decimal coordinates
* **lng:** the longitude of the recording in decimal coordinates
* **type:** the sound type of the recording (e.g. 'call', 'song', etc). This is generally a comma-separated list of sound types.
* **url:** the URL specifying the details of this recording
* **file:** the URL to the audio file
* **file-name:** the original file name of the audio file
* **sono:** an object with the urls to the four versions of sonograms
* **lic:** the URL describing the license of this recording
* **q:** the current quality rating for the recording
* **length:** the length of the recording, formatted as minutes:seconds (e.g. `0:13`)
* **time:** the time of day that the recording was made
* **date:** the date that the recording was made
* **uploaded:** the date that the recording was uploaded to xeno-canto
* **also:** an array with the identified background species in the recording
* **rmk:** additional remarks by the recordist
* **bird-seen:** was the recorded bird visually identified? (yes/no)
* **playback-used:** was playback used to lure the bird? (yes/no)
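Since **length** is text such as `0:13`, it has to be converted before you can sort or sum durations. A small sketch; `length_to_seconds` is an illustrative helper, and the handling of an `h:mm:ss` form is an assumption not confirmed by the data shown here:

```python
def length_to_seconds(length):
    """Convert a 'm:ss' (or, by assumption, 'h:mm:ss') string to seconds."""
    seconds = 0
    for part in length.split(":"):
        # each further part is 60 times finer than the one before it
        seconds = seconds * 60 + int(part)
    return seconds


short_call = length_to_seconds("0:13")
long_song = length_to_seconds("2:39")
print(short_call, long_song)
```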


Let's find the most frequently recorded birds, according to their common English name:

In [None]:
df['Frequency'] = df.groupby('en')['en'].transform('count')
df.sort_values('Frequency', inplace=True, ascending=False)
df

Unnamed: 0,id,gen,sp,ssp,en,rec,cnt,loc,lat,lng,...,q,length,time,date,uploaded,also,rmk,bird-seen,playback-used,Frequency
75,652106,Phoenicurus,phoenicurus,phoenicurus,Common Redstart,Nicolas Martinez,Switzerland,"Riehen, Basel-Stadt, Basel-Stadt",47.5785,7.6684,...,A,2:39,07:00,2021-05-26,2021-05-26,"[Cyanistes caeruleus, Troglodytes troglodytes,...",,yes,no,53
113,305979,Phoenicurus,phoenicurus,Hybrid,Common Redstart,Nicolas Martinez,Switzerland,"Wenslingen, Sissach, Basel-Landschaft",47.4403,7.9082,...,C,0:04,09:00,2015-04-16,2016-03-07,[],Hybrid Common Redstart - Black Redstart P. pho...,yes,yes,53
70,723980,Phoenicurus,phoenicurus,,Common Redstart,Peter Ertl,Switzerland,"Reinach, Arlesheim, Basel-Landschaft",47.5153,7.5862,...,A,2:19,09:30,2022-05-14,2022-05-14,[],,yes,no,53
69,725355,Phoenicurus,phoenicurus,phoenicurus,Common Redstart,Nicolas Martinez,Switzerland,"Riehen, Basel-Stadt, Basel-Stadt",47.5835,7.6648,...,A,1:21,07:50,2022-05-18,2022-05-19,"[Troglodytes troglodytes, Sylvia atricapilla, ...",,yes,no,53
65,728226,Phoenicurus,phoenicurus,phoenicurus,Common Redstart,Nicolas Martinez,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5654,7.6321,...,A,0:24,08:10,2022-05-30,2022-05-31,[],"both adults calling, presumably near a nest or...",yes,no,53
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55,724904,Muscicapa,striata,,Spotted Flycatcher,Nicolas Martinez,Switzerland,"Riehen, Basel-Stadt, Basel City",47.5955,7.6448,...,B,0:29,09:10,2022-05-17,2022-05-17,"[Erithacus rubecula, Sylvia atricapilla, Phyll...",,yes,no,1
54,530387,Turdus,iliacus,,Redwing,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5401,7.5965,...,B,0:14,04:00,2020-02-25,2020-02-26,[Turdus philomelos],,no,no,1
14,724900,Picus,viridis,viridis,European Green Woodpecker,Nicolas Martinez,Switzerland,"Riehen, Basel-Stadt, Basel-Stadt",47.5966,7.6467,...,A,0:05,08:50,2022-05-17,2022-05-17,[],,yes,no,1
53,530386,Turdus,merula,,Common Blackbird,Samuel Büttler,Switzerland,"Basel, Basel-Stadt, Basel-Stadt",47.5401,7.5965,...,C,0:08,03:30,2020-02-25,2020-02-26,[],,no,no,1
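The `transform('count')` call writes the group size back onto every row. If only one count per bird is needed, `value_counts()` is a shorter route. Both are shown on a made-up four-row sample:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "en": ["Common Redstart", "Redwing", "Common Redstart", "Common Redstart"],
})

# One value per row: how often that row's name occurs in the whole column
per_row = sample_df.groupby("en")["en"].transform("count")

# One value per name, sorted from most to least frequent
counts = sample_df["en"].value_counts()

print(per_row.tolist())
print(counts)
```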


In [None]:
df_ordered = df.groupby(['gen'])['en'].count().reset_index(name='Count').sort_values(['Count'], ascending=False)
#df_ordered
df_ordered['gen'].tolist()


['Phoenicurus',
 'Sylvia',
 'Mystery',
 'Phylloscopus',
 'Dendrocopos',
 'Saxicola',
 'Corvus',
 'Emberiza',
 'Passer',
 'Motacilla',
 'Luscinia',
 'Lanius',
 'Acrocephalus',
 'Strix',
 'Certhia',
 'Turdus',
 'Regulus',
 'Sonus',
 'Pyrrhula',
 'Poecile',
 'Picus',
 'Tachymarptis',
 'Sitta',
 'Milvus',
 'Oriolus',
 'Muscicapa',
 'Anthus',
 'Lullula',
 'Lophophanes',
 'Linaria',
 'Jynx',
 'Garrulus',
 'Gallinula',
 'Erithacus',
 'Dryobates',
 'Dendrocoptes',
 'Cinclus',
 'Bubo',
 'Tyto']