<a href="https://colab.research.google.com/github/fleshgordo/scrapinghub/blob/main/002_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping data from the web

In this exercise we will work with the [xeno-canto](https://xeno-canto.org/) archive of bird recordings.

### Requirements

- Download and install [Insomnium](https://github.com/ArchGPT/insomnium). It's a tool that helps to quickly test an [API](https://en.wikipedia.org/wiki/API#Web_APIs) on the web.

A basic search URL looks like:

````
https://www.xeno-canto.org/api/2/recordings?

````

As described in the [API documentation](https://xeno-canto.org/explore/api), you can pass several parameters to filter your search. These parameters are appended to the end of the URL, as shown in this screenshot from Insomnia:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_query.jpg "Title")

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow

````

The response to this query is formatted as JavaScript Object Notation (JSON). It reports the number of recordings, ```numRecordings``` (16488 in total), and the number of species, ```numSpecies``` (125). Because there are many results, the API serves only the first page, i.e. the first 500 results of this query. To access results 501-1000, add a second parameter, ```page=2```. The URL would look like:

````
https://www.xeno-canto.org/api/2/recordings?query=sparrow&page=2

````
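
As a rough sketch of how paging could be handled in code, the snippet below derives the number of pages from the totals quoted above and builds one URL per page. The helper name `page_url` is made up for this illustration, and the page size of 500 is taken from the description above rather than queried from the API.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.xeno-canto.org/api/2/recordings"
PAGE_SIZE = 500  # results per page, as described above


def page_url(query, page):
    """Build the search URL for one result page (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"query": query, "page": page})


num_recordings = 16488  # value of numRecordings in the example response
num_pages = math.ceil(num_recordings / PAGE_SIZE)

# One URL per page, starting with page 1
urls = [page_url("sparrow", page) for page in range(1, num_pages + 1)]
print(num_pages)
print(urls[1])
```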

The ```recordings``` entry is a list that contains all recordings matching the search. In Insomnia you can now inspect the structure of the data to get a better understanding of how it is organised. You can also filter the response with the filter function in the bottom bar:

![alt text](https://github.com/fleshgordo/cocreate22/raw/main/img/insomnia_filter.jpg "Title")

Only show countries:

````
$.recordings[*].cnt
````

Or show just the remarks for the recordings:

````
$.recordings[*].rmk
````
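
The same selections can be made in Python once the response has been parsed into a dictionary. A minimal sketch, using a hand-written stand-in for the response (the two recordings and their values are invented for illustration; only the keys `recordings`, `cnt` and `rmk` come from the filters above):

```python
# Stand-in for a parsed API response; the values are made up.
sample_response = {
    "recordings": [
        {"cnt": "Switzerland", "rmk": "Recorded near a river."},
        {"cnt": "Brazil", "rmk": ""},
    ]
}

# Equivalent of $.recordings[*].cnt
countries = [rec["cnt"] for rec in sample_response["recordings"]]

# Equivalent of $.recordings[*].rmk
remarks = [rec["rmk"] for rec in sample_response["recordings"]]

print(countries)
print(remarks)
```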

Further query parameters (such as time, geolocation, search terms) are possible. For example, look for sparrows in Switzerland whose recordings are tagged with quality A:

The URL looks like this:
````
https://www.xeno-canto.org/api/2/recordings?query=sparrow cnt:Switzerland q:A

````
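
Note that this query contains spaces, which are not valid inside a URL. Tools such as a browser usually encode them for you; when building the URL by hand in Python, the standard library can do it. A small sketch, assuming we want to keep the `:` of the search tags readable:

```python
from urllib.parse import quote

query = "sparrow cnt:Switzerland q:A"

# Percent-encode the spaces; safe=":" leaves the colons untouched.
encoded = quote(query, safe=":")
url = "https://www.xeno-canto.org/api/2/recordings?query=" + encoded
print(url)
```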

Using the filter again we can obtain the links to the spectrogram files of the recordings:

````
$.recordings[*].sono.full
````

You can copy the link and open it in a browser.

In the next section we will automate our query requests and automatically fetch the files.


# Scraping data with python

### Scrape from xeno-canto

We will use a Python library called [requests](https://requests.readthedocs.io/en/latest/), which carries the slogan **HTTP for Humans**, to open a URL and "imitate" a browser. First, we make sure our API URL is correct and gives a response:

In [None]:
#params= "cnt:'brazil'"
params = "cnt:switzerland loc:basel"
url="https://www.xeno-canto.org/api/2/recordings?query="+params
print(url)


In [None]:
import requests

Next step: __requests__ has a function ```get()``` that takes the URL to call and, optionally, a dictionary of request headers. We expect the response in [JSON](https://en.wikipedia.org/wiki/JSON) format, so we declare this in the ```Accept``` header. The parsed response is stored in a variable called ```resp```.

In [None]:
r = requests.get(url, headers={"Accept": "application/json"})
resp = r.json() # parse the JSON body into a Python dictionary
print(resp)
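
Printing the whole response can produce a lot of output, so a compact summary is often easier to read. The sketch below works on a hand-written dictionary so that it runs without network access; the helper name `summarize` and the sample values are invented for illustration, while the keys `numRecordings`, `numSpecies` and `recordings` are the ones discussed earlier.

```python
def summarize(resp):
    """Return a one-line summary of a parsed search response."""
    return "{} recordings, {} species, {} on this page".format(
        resp["numRecordings"],
        resp["numSpecies"],
        len(resp["recordings"]),
    )


# Made-up stand-in for the dictionary returned by r.json()
sample = {"numRecordings": "3", "numSpecies": "1", "recordings": [{}, {}, {}]}
print(summarize(sample))
```

With a real response, `print(summarize(resp))` could be called right after `resp = r.json()`.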

### Download remote files
For reference, read the introduction to [working with files in Google Colab](https://neptune.ai/blog/google-colab-dealing-with-files).

Specific information on working with external data in Colab: [Local Files, Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb)

In [None]:
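fileurl = resp["recordings"][0]["sono"]["small"] # link to the small spectrogram of the first recording
download = requests.get("https:" + fileurl) # the link comes without a scheme, so prepend "https:"
with open("sample_data/sample_image.png", "wb") as file: # the file is closed automatically
  file.write(download.content) # write the image bytes to disk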
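
The `"https:" + fileurl` step assumes that the link starts with `//`, i.e. that it comes without a scheme. A slightly more defensive sketch is shown below; `absolute_url` is a hypothetical helper and the example links are made up.

```python
def absolute_url(link, scheme="https"):
    """Prepend a scheme to protocol-relative links such as //host/path."""
    if link.startswith("//"):
        return scheme + ":" + link
    return link  # already absolute, leave unchanged


print(absolute_url("//example.org/sono/small.png"))
print(absolute_url("https://example.org/sono/small.png"))
```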

### NASA images over time

With the NASA planetary API you can get satellite images for a specific timeframe. Look at the [documentation](https://api.nasa.gov/) for further parameters. Be sure to provide an API key!

In [None]:
lat = "47.1227347"
lon = "8.1855324"
date = "2020-05-01"
api_key = ""
dim = "0.1"

url = f"https://api.nasa.gov/planetary/earth/imagery?lon={lon}&lat={lat}&date={date}&dim={dim}&api_key={api_key}"
print(url)

The string above uses a syntax called [literal f-string interpolation](https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/): prefix the string with `f` and wrap variable names in {} to have their values inserted into the string. This makes more complicated string concatenations easier to read. Look at the resulting URL printed above; the placeholders should be self-explanatory.
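
To make the comparison concrete, here is the same URL fragment built once with `+` and once with an f-string. The variable values are the ones from the cell above; only the first two parameters are used to keep the lines short.

```python
lat = "47.1227347"
lon = "8.1855324"

# classic concatenation
with_plus = "imagery?lon=" + lon + "&lat=" + lat

# f-string: the variables are placed directly inside {}
with_fstring = f"imagery?lon={lon}&lat={lat}"

print(with_fstring)
print(with_plus == with_fstring)
```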

We now create a list of dates for which we are going to create URLs as well:

In [None]:
dates = ["2020-05-01","2019-05-01","2018-05-01","2017-05-01","2016-05-01"]

A loop over each entry in this list looks like:

In [None]:
for date in dates:
  print(date)

Try to combine the loop with the URL example of the satellite imagery. Your results should look like this:



```
https://api.nasa.gov/planetary/earth/imagery?lon=8.1855324&lat=47.1227347&date=2020-05-01&dim=0.1&api_key=APIKEY
https://api.nasa.gov/planetary/earth/imagery?lon=8.1855324&lat=47.1227347&date=2019-05-01&dim=0.1&api_key=APIKEY
https://api.nasa.gov/planetary/earth/imagery?lon=8.1855324&lat=47.1227347&date=2018-05-01&dim=0.1&api_key=APIKEY
https://api.nasa.gov/planetary/earth/imagery?lon=8.1855324&lat=47.1227347&date=2017-05-01&dim=0.1&api_key=APIKEY
etc...
```



In [None]:
# try it yourself here

Now we want to actually save the downloaded images, ideally with the date included in each filename:

In [None]:
#@title Show result
for date in dates:
  # compose the request URL for this date
  url = f"https://api.nasa.gov/planetary/earth/imagery?lon={lon}&lat={lat}&date={date}&dim={dim}&api_key={api_key}"
  print(f"downloading from: {url}") # show the URL being fetched (optional)
  download = requests.get(url) # this will download the image file
  with open(f"sample_data/{date}-sursee.jpg", "wb") as file: # create a file handle and give path
    file.write(download.content) # write image content to the file; it is closed automatically
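
The loop above writes whatever the server returns into a `.jpg` file, which may also be an error message if the request is rejected. One way to guard against this is to check the `Content-Type` header of the response before saving. The following is a sketch only: `looks_like_image` is a hypothetical helper, and it is tried out here with plain dictionaries instead of real responses.

```python
def looks_like_image(headers):
    """Return True if the Content-Type header announces an image."""
    content_type = headers.get("Content-Type", "")
    return content_type.lower().startswith("image/")


print(looks_like_image({"Content-Type": "image/png"}))
print(looks_like_image({"Content-Type": "application/json"}))
```

Inside the loop this could be used as `if looks_like_image(download.headers):` before the file is opened.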

### Get satellite images from specific locations

We have prepared a `locations.csv` file with a couple of location points (lat,long) separated by commas. Make sure you store the locations.csv file in your sample_data directory. You can download it with `wget` directly into the Colab workspace:

In [None]:
!wget https://raw.githubusercontent.com/fleshgordo/scrapinghub/main/data/locations.csv -O sample_data/locations.csv

In [None]:
with open("sample_data/locations.csv") as f:
  locations = f.readlines()[1:] # skip first line with [1:]

date = "2016-01-01"
for location in locations: # loop over every location in the file
  place = location.strip().split(",") # remove newlines and split the csv file into a list with split()
  lat = place[2]
  lon = place[1]
  #print(f"lat: {place[2]} lon: {place[1]} ")
  url = f"https://api.nasa.gov/planetary/earth/imagery?lon={lon}&lat={lat}&date={date}&dim={dim}&api_key={api_key}"
  print(url)
  download = requests.get(url) # this will download the image file
  file = open(f"sample_data/locations-{place[0]}.jpg", "wb") # create a filehandler and give path
  file.write(download.content) # write image content to filename
  file.close() # close filehandler

Try replacing the locations.csv file with coordinates that are interesting to you. Can you combine the dates and the locations into one big batch download?

In [None]:
# experiment here
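
One possible approach to the batch download is to pair every location with every date using `itertools.product`. The sketch below only builds the URLs and filenames and leaves out the actual download. The CSV rows are invented, the column order (name, lon, lat) is an assumption derived from the indices used in the loop above, and `APIKEY` is a placeholder.

```python
import csv
import io
from itertools import product

# Invented stand-in for sample_data/locations.csv (header + two rows).
csv_text = (
    "name,lon,lat\n"
    "sursee,8.1855324,47.1227347\n"
    "helsinki,24.9798789,60.1492271\n"
)
dates = ["2020-05-01", "2019-05-01"]
dim = "0.1"
api_key = "APIKEY"  # placeholder, replace with your own key

rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header line

# Every combination of location and date becomes one download job.
jobs = []
for (name, lon, lat), date in product(rows, dates):
    url = (
        "https://api.nasa.gov/planetary/earth/imagery"
        f"?lon={lon}&lat={lat}&date={date}&dim={dim}&api_key={api_key}"
    )
    filename = f"sample_data/{date}-{name}.jpg"
    jobs.append((url, filename))

for url, filename in jobs:
    print(filename, url)
```

Each entry in `jobs` could then be passed to `requests.get()` and written to disk as in the earlier cells.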

### Fetching satellite images using Google API

To fetch satellite views directly as images, a Google API key is needed (the number of free requests is limited). The API key needs to be enabled for the map services. A typical request looks like:

```
https://maps.googleapis.com/maps/api/staticmap?key={key}&center={center}&zoom={zoom}&maptype={maptype}&size={size}
```

The values inside the {} need to be replaced.

In [None]:
# general params
key = "" # API key THIS IS NEEDED!!!!
zoom = 17
maptype = "satellite"
size = "600x600"
lat = "60.1492271"
lon = "24.9798789"
center = f"{lat},{lon}"

url = f"https://maps.googleapis.com/maps/api/staticmap?key={key}&center={center}&zoom={zoom}&maptype={maptype}&size={size}"
print(url)

To download the file we can again use the `requests.get()` function:

In [None]:
response = requests.get(url) # fetch the image
image_content = response.content # save the response in a variable
image_file = open("test.jpg", "wb") # open a file-handler
image_file.write(image_content) # write the saved image into the file
image_file.close() # close the file-handler

Combining this with `locations.csv` gives the following code. In case you don't have the locations file in your sample_data directory, download it first by executing the following cell:

In [None]:
!wget https://raw.githubusercontent.com/fleshgordo/scrapinghub/main/data/locations.csv -O sample_data/locations.csv

In [None]:
with open("sample_data/locations.csv") as f:
  locations = f.readlines()[1:]
  for location in locations:
    place = location.strip().split(",") # remove newlines and split the csv file into a list with split()
    lat = place[2]
    lon = place[1]
    center = f"{lat},{lon}"
    zoom = 17
    maptype = "satellite"
    size = "600x600"
    url = f"https://maps.googleapis.com/maps/api/staticmap?key={key}&center={center}&zoom={zoom}&maptype={maptype}&size={size}"
    print(url)

### Fun with text-to-speech

gtts is a Python library to interface with Google Translate's text-to-speech API. It writes spoken mp3 data to a file from any Python string you want. As usual, we first need to install the library:

In [None]:
!pip install gtts

In [None]:
from gtts import gTTS
import requests
import json

This example queries the [OpenWeather API](https://openweathermap.org/current) and parses the JSON response text into a Python object. It then builds a sentence from some of the received information and converts it to speech.

In [None]:
# API endpoint and parameters
url = "https://api.openweathermap.org/data/2.5/weather"
city = "Emmenbrücke"
params = {
    "q": city,
    "appid": "", # put your api key!!!!!
    "units": "metric"
}

# Send GET request to API endpoint
response = requests.get(url, params=params)
print(response.text)
# Save response to file
#with open("sample_data/weather.json", "w") as file:
#    file.write(response.text)


In [None]:
answer = json.loads(response.text)
print(answer['main'])

# create phrase that is going to be sent to text-to-speech
phrase = f"In {city} it feels like {answer['main']['feels_like']} degrees"

tts = gTTS(text=phrase, lang="en")
tts.save("weather.mp3") # saves the output file to weather.mp3
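
The phrase can be extended with more fields from the response. The sketch below builds a slightly longer sentence from a hand-written stand-in for `answer`; the helper `build_phrase` is hypothetical, the sample values are invented, and the `temp` key is assumed to be present next to `feels_like` in the `main` block.

```python
def build_phrase(city, answer):
    """Compose the sentence that is passed to text-to-speech."""
    main = answer["main"]
    temp = round(main["temp"])  # assumed key, see note above
    feels_like = round(main["feels_like"])
    return f"In {city} it is {temp} degrees and it feels like {feels_like} degrees"


# Made-up stand-in for the parsed weather response
sample_answer = {"main": {"temp": 18.6, "feels_like": 17.2}}
print(build_phrase("Emmenbrücke", sample_answer))
```

The returned string can then be passed to `gTTS(text=..., lang="en")` as in the cell above.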