# Web scraping example 1: Downloading air quality data from Berlin

Here we use the [Berlin air quality measurement network](https://luftdaten.berlin.de/lqi) as an example. If you download data manually from this website, you can only save data for one station and one day at a time. Downloading multiple days of data for each site by hand is time-consuming, leaves you with a large number of CSV files in your folder, and always carries a chance of making mistakes.

## First let's manually download data at a single station on a single day
Here is the link to a station called "Wedding background": https://luftdaten.berlin.de/station/mc010

If you open it, you will see the most recent air quality data and the "Save data as CSV" button. Left-clicking this button downloads the data manually.

## Now let's use Python to download the same data

If you right-click the "Save data as CSV" button and select "Inspect", the page's HTML code will appear at the bottom of your browser. You will see a highlighted URL containing the station code, the start date and the end date. Right-click this URL and select "Copy link address" to get a complete URL like this one:

```
https://luftdaten.berlin.de/station/mc010.csv?group=pollution&period=1h&timespan=custom&start%5Bdate%5D=26.12.2021&start%5Bhour%5D=22&end%5Bdate%5D=27.12.2021&end%5Bhour%5D=22
```

I did this on 27/12/2021, so the dates in the above URL defaulted to the most recent ones (26-27 December 2021).
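
To see which parts of the copied URL carry the station code and the time period, it can help to take the URL apart. The snippet below is a minimal sketch using the standard library's `urllib.parse`; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# the URL copied from the "Save data as CSV" button
copied_url = (
    "https://luftdaten.berlin.de/station/mc010.csv?group=pollution&period=1h"
    "&timespan=custom&start%5Bdate%5D=26.12.2021&start%5Bhour%5D=22"
    "&end%5Bdate%5D=27.12.2021&end%5Bhour%5D=22"
)

parts = urlsplit(copied_url)

# the path ends with the station code, prefixed by "mc"
print(parts.path)

# parse_qs decodes "%5B" and "%5D" back to "[" and "]"
query = parse_qs(parts.query)
for name, values in query.items():
    print(name, "=", values[0])
```

The decoded parameter names (`start[date]`, `start[hour]`, `end[date]`, `end[hour]`) are the inputs that change from one download to the next.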

In [1]:
# load packages
import pandas as pd
from urllib.request import urlopen

# copy and paste the url from above
target_url = "https://luftdaten.berlin.de/station/mc010.csv?group=pollution&period=1h&timespan=custom&start%5Bdate%5D=26.12.2021&start%5Bhour%5D=22&end%5Bdate%5D=27.12.2021&end%5Bhour%5D=22"

# open the target url
url_response = urlopen(target_url)

# read data from the response (the separator may not always be ";" for other websites)
data_downloaded = pd.read_csv(url_response,sep=";")

# check the results
display(data_downloaded)

Unnamed: 0,Station,010 Wedding,010 Wedding.1,010 Wedding.2,010 Wedding.3,010 Wedding.4,010 Wedding.5
0,Messkomponente,Feinstaub (PM10),"Feinstaub (PM2,5)",Stickstoffdioxid,Stickstoffmonoxid,Stickoxide,Ozon
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,26.12.2021 22:00,27,25,24,1,26,34
4,26.12.2021 23:00,32,29,25,2,28,31
5,27.12.2021 00:00,32,30,26,1,28,30
6,27.12.2021 01:00,31,29,21,1,23,35
7,27.12.2021 02:00,34,31,22,1,24,34
8,27.12.2021 03:00,38,35,24,1,26,32
9,27.12.2021 04:00,45,41,24,1,26,31
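
The first three rows of the table above are not measurements: they hold the pollutant name, the unit and the averaging period, so pandas reads every column as text. The following is a minimal sketch of one way to tidy such a table. The helper name `tidy_station_table` and the column labels it produces are hypothetical, and it assumes the layout shown above (a first column called "Station" and three descriptive rows). The sample reuses two columns and two hours from the output above.

```python
import pandas as pd

def tidy_station_table(raw):
    """Turn a table shaped like the output above into numeric hourly data.

    Assumes the first column is called "Station" and that the first three
    rows hold the pollutant name, the unit and the averaging period.
    """
    pollutants = raw.iloc[0, 1:].tolist()   # row "Messkomponente"
    units = raw.iloc[1, 1:].tolist()        # row "Einheit"
    values = raw.iloc[3:].copy()            # hourly measurements start at row 3
    # parse timestamps such as "26.12.2021 22:00"
    values.index = pd.to_datetime(values["Station"], format="%d.%m.%Y %H:%M")
    values.index.name = "time"
    values = values.drop(columns="Station")
    # label each column with its pollutant and unit
    values.columns = [f"{p} [{u}]" for p, u in zip(pollutants, units)]
    # convert text to numbers; anything that cannot be parsed becomes NaN
    return values.apply(pd.to_numeric, errors="coerce")

# small sample in the same layout as the downloaded table
sample = pd.DataFrame({
    "Station": ["Messkomponente", "Einheit", "Messzeit (Angaben in MESZ bzw. MEZ)",
                "26.12.2021 22:00", "26.12.2021 23:00"],
    "010 Wedding": ["Feinstaub (PM10)", "µg/m³", "Stundenwerte", "27", "32"],
    "010 Wedding.1": ["Ozon", "µg/m³", "Stundenwerte", "34", "31"],
})

tidy = tidy_station_table(sample)
print(tidy)
```

With a time index and numeric columns, the data can be plotted or resampled directly.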


## Learning from the above example, we can edit inputs and download data at a specified station during a custom time period

In [2]:
# provide a station code
station_code = "171"

# provide the starting date and hour
start_y = 2021
start_m = 11
start_d = 21
start_h = 15

# provide the ending date and hour
end_y = 2021
end_m = 12
end_d = 24
end_h = 23

# construct the target url with inputs from above
# in order to display the whole url on GitHub, I used multiple lines here
# you can remove "\" at end of each line and just use a single line
target_url = "https://luftdaten.berlin.de/station/mc"+str(station_code)+ \
             ".csv?group=pollution&period=1h&timespan=custom&start%5Bdate%5D="+ \
             str(start_d)+"."+str(start_m)+"."+str(start_y)+"&start%5Bhour%5D="+str(start_h)+ \
             "&end%5Bdate%5D="+str(end_d)+"."+str(end_m)+"."+str(end_y)+"&end%5Bhour%5D="+str(end_h)

# open the target url
url_response = urlopen(target_url)

# read data from the response (the separator may not always be ";" for other websites)
data_downloaded = pd.read_csv(url_response,sep=";")

# check the results
display(data_downloaded)

Unnamed: 0,Station,171 Mitte,171 Mitte.1,171 Mitte.2,171 Mitte.3,171 Mitte.4
0,Messkomponente,Feinstaub (PM10),"Feinstaub (PM2,5)",Stickstoffdioxid,Stickstoffmonoxid,Stickoxide
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,21.11.2021 15:00,12,10,14,1,16
4,21.11.2021 16:00,11,8,14,1,15
...,...,...,...,...,...,...
799,24.12.2021 19:00,12,10,19,2,21
800,24.12.2021 20:00,7,6,9,2,11
801,24.12.2021 21:00,5,3,5,1,7
802,24.12.2021 22:00,4,3,6,1,8
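
The string concatenation above can also be written with `urllib.parse.urlencode`, which escapes the square brackets in the parameter names for you. This is a sketch of an alternative, not the approach used in this notebook. The helper name `build_station_url` is hypothetical, and the dates are zero-padded as in the URL copied from the website.

```python
from datetime import datetime
from urllib.parse import urlencode

def build_station_url(station_code, start, end):
    """Build the CSV download URL for one station (hypothetical helper).

    start and end are strings such as "2021-12-26 22:00".
    """
    start_dt = datetime.strptime(start, "%Y-%m-%d %H:%M")
    end_dt = datetime.strptime(end, "%Y-%m-%d %H:%M")
    query = {
        "group": "pollution",
        "period": "1h",
        "timespan": "custom",
        "start[date]": start_dt.strftime("%d.%m.%Y"),
        "start[hour]": start_dt.hour,
        "end[date]": end_dt.strftime("%d.%m.%Y"),
        "end[hour]": end_dt.hour,
    }
    # urlencode escapes "[" and "]" as "%5B" and "%5D"
    return f"https://luftdaten.berlin.de/station/mc{station_code}.csv?{urlencode(query)}"

# rebuild the URL that was copied from the "Save data as CSV" button
print(build_station_url("010", "2021-12-26 22:00", "2021-12-27 22:00"))
```

Keeping the parameters in a dictionary makes it easier to see and change each input than scanning one long string.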


## Wrap all these steps in a function

In [3]:
def download_Berlin_data(station_code,start,end):
    '''Take a station code from the Berlin air quality measurement network and the start and end of the study period,
       and return the data at that station during that period as a pandas data frame.
    '''
    from urllib.request import urlopen
    import pandas as pd
    # prepare the input information for use in the url
    station_code = str(station_code)
    start_datetime = pd.to_datetime(start)
    end_datetime = pd.to_datetime(end)
    # construct the target url with inputs from above
    # to display the whole url on GitHub I used multiple lines, you can just use a single line
    target_url = "https://luftdaten.berlin.de/station/mc"+str(station_code)+ \
                 ".csv?group=pollution&period=1h&timespan=custom&start%5Bdate%5D="+ \
                 str(start_datetime.day)+"."+str(start_datetime.month)+"."+str(start_datetime.year)+ \
                 "&start%5Bhour%5D="+str(start_datetime.hour)+ "&end%5Bdate%5D="+str(end_datetime.day)+"."+ \
                 str(end_datetime.month)+"."+str(end_datetime.year)+"&end%5Bhour%5D="+str(end_datetime.hour)
    # open the target url
    url_response = urlopen(target_url)
    # read data from the response (the separator may not always be ";" for other websites)
    data_downloaded = pd.read_csv(url_response,sep=";")
    return data_downloaded

In [4]:
# test our function
download_Berlin_data("010","2021-10-10 02:00","2021-10-11 15:00")

Unnamed: 0,Station,010 Wedding,010 Wedding.1,010 Wedding.2,010 Wedding.3,010 Wedding.4,010 Wedding.5
0,Messkomponente,Feinstaub (PM10),"Feinstaub (PM2,5)",Stickstoffdioxid,Stickstoffmonoxid,Stickoxide,Ozon
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,10.10.2021 02:00,27,21,29,1,31,32
4,10.10.2021 03:00,24,18,17,1,19,45
5,10.10.2021 04:00,24,19,15,1,17,46
6,10.10.2021 05:00,25,20,15,1,16,45
7,10.10.2021 06:00,26,20,17,1,19,40
8,10.10.2021 07:00,27,21,17,1,19,38
9,10.10.2021 08:00,28,22,19,1,22,33
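
A request can fail, for example when the network is down or the URL is wrong. The sketch below shows one possible way to catch such failures instead of stopping with a traceback. The helper name `read_csv_or_none` and its `opener` argument are hypothetical, and how the website responds to an unknown station code is not verified here.

```python
from urllib.error import URLError
from urllib.request import urlopen

import pandas as pd

def read_csv_or_none(target_url, opener=urlopen):
    """Return the table behind target_url, or None if the request fails."""
    try:
        response = opener(target_url)
    except URLError as error:
        # HTTPError (for example a 404) is a subclass of URLError, so it is caught too
        print(f"Could not download {target_url}: {error}")
        return None
    return pd.read_csv(response, sep=";")
```

The `opener` argument defaults to `urlopen`; it is only there so the function can be tried out with a stand-in that does not touch the network.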


## Download data from multiple sites and days all at once

In [5]:
# get data at these three sites during the same time period
station_codes = ["010","027","117"]

Berlin_data = []

for station in station_codes:
    Berlin_data.append(download_Berlin_data(station,"2021-9-10 00:00","2021-10-10 23:00"))
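
The loop above stores the results in a list, so you have to remember which position belongs to which station. A possible variation, sketched below, keeps the tables in a dictionary keyed by station code and waits briefly between requests. The helper name `download_many`, the `downloader` argument and the one-second default pause are assumptions made for illustration. A stand-in downloader is used so the sketch runs without network access.

```python
import time

def download_many(station_codes, start, end, downloader, pause_seconds=1.0):
    """Download one table per station and return them in a dict keyed by station code."""
    results = {}
    for position, station in enumerate(station_codes):
        if position > 0:
            time.sleep(pause_seconds)  # short pause between requests
        results[station] = downloader(station, start, end)
    return results

# stand-in for download_Berlin_data so this sketch runs offline
def fake_downloader(station, start, end):
    return f"table for station {station} from {start} to {end}"

demo = download_many(["010", "027"], "2021-9-10 00:00", "2021-10-10 23:00",
                     fake_downloader, pause_seconds=0)
print(demo["027"])
```

Passing `download_Berlin_data` as the `downloader` would give a dictionary in which, for example, `results["027"]` holds the Marienfelde table.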

In [6]:
# check results at the "010" site
display(Berlin_data[0])

Unnamed: 0,Station,010 Wedding,010 Wedding.1,010 Wedding.2,010 Wedding.3,010 Wedding.4,010 Wedding.5
0,Messkomponente,Feinstaub (PM10),"Feinstaub (PM2,5)",Stickstoffdioxid,Stickstoffmonoxid,Stickoxide,Ozon
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,10.09.2021 00:00,28,16,35,1,36,60
4,10.09.2021 01:00,28,16,33,1,34,59
...,...,...,...,...,...,...,...
742,10.10.2021 19:00,27,16,34,2,37,54
743,10.10.2021 20:00,31,20,47,1,48,37
744,10.10.2021 21:00,29,20,56,1,58,24
745,10.10.2021 22:00,29,20,60,1,61,19


In [7]:
# check results at the "027" site
display(Berlin_data[1])

Unnamed: 0,Station,027 Marienfelde,027 Marienfelde.1,027 Marienfelde.2,027 Marienfelde.3
0,Messkomponente,Stickstoffdioxid,Stickstoffmonoxid,Stickoxide,Ozon
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,10.09.2021 00:00,12,0,12,82
4,10.09.2021 01:00,11,1,12,80
...,...,...,...,...,...
742,10.10.2021 19:00,10,0,11,60
743,10.10.2021 20:00,10,0,10,56
744,10.10.2021 21:00,13,0,14,49
745,10.10.2021 22:00,15,0,16,38


In [8]:
# check results at the "117" site
display(Berlin_data[2])

Unnamed: 0,Station,117 Schildhornstraße,117 Schildhornstraße.1,117 Schildhornstraße.2,117 Schildhornstraße.3,117 Schildhornstraße.4,117 Schildhornstraße.5
0,Messkomponente,Feinstaub (PM10),"Feinstaub (PM2,5)",Stickstoffdioxid,Stickstoffmonoxid,Stickoxide,Kohlenmonoxid
1,Einheit,µg/m³,µg/m³,µg/m³,µg/m³,µg/m³,mg/m³
2,Messzeit (Angaben in MESZ bzw. MEZ),Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte,Stundenwerte
3,10.09.2021 00:00,30,16,48,8,60,0.4
4,10.09.2021 01:00,31,17,34,3,39,0.3
...,...,...,...,...,...,...,...
742,10.10.2021 19:00,37,20,74,26,114,0.7
743,10.10.2021 20:00,39,23,79,40,141,0.8
744,10.10.2021 21:00,41,26,77,45,145,0.8
745,10.10.2021 22:00,40,27,70,36,125,0.8


## This code works for the [Berlin air quality measurement network](https://luftdaten.berlin.de/lqi), but it may not work for other websites

Use it to get an idea of how web scraping works, and adapt it when needed. [urllib](https://docs.python.org/3/library/urllib.html) is not the only Python tool for web scraping. In more complicated cases, you may want to try [Selenium](https://selenium-python.readthedocs.io/) or [Scrapy](https://scrapy.org/).