## 1. About the DWD Open Data Portal 

The data of the Climate Data Center (CDC) of the DWD (Deutscher Wetterdienst, German Weather Service) is provided on an **FTP server**. <br> **FTP** stands for _File Transfer Protocol_.

Open the FTP link ftp://opendata.dwd.de/climate_environment/CDC/ in your browser (copy and paste it) and find out how it is structured hierarchically.

You can also open the link with **HTTPS** (Hypertext Transfer Protocol Secure): https://opendata.dwd.de/climate_environment/CDC/

We want to download the metadata of the daily precipitation data to get information about the stations.

In [1]:
import requests
from bs4 import BeautifulSoup
import os
import re  # regular expressions
import tqdm
import pandas as pd
import geopandas as gpd

# URL of the DWD website
url_base = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/"
url_temporal_resolution = "daily/"
url_parameter = "more_precip/"
url_subdir = "historical/"
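# Concatenate the URL parts directly (each one already ends with "/");
# os.path.join is meant for filesystem paths and would insert backslashes on Windows
url_full = url_base + url_temporal_resolution + url_parameter + url_subdir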

# Directory to save the downloaded files
download_dir = "../data/original/dwd/" +  url_temporal_resolution + url_parameter + url_subdir

# Create the directory if it doesn't exist
os.makedirs(download_dir, exist_ok=True)

print("download dir: ", download_dir)

# Directory and name to save the CSV file to create the layer for sub-task 1.2
outfname = r"../data/Selected_Stations_Nordrhein_Westfalen.csv"

download dir:  ../data/original/dwd/daily/more_precip/historical/


In [2]:
url_full

'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/more_precip/historical/'
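
URLs are safer to assemble with URL-aware helpers than with filesystem path functions, whose separator depends on the operating system. A minimal sketch using `urllib.parse.urljoin` from the standard library; the helper name `build_cdc_url` is hypothetical and not part of the notebook:

```python
from urllib.parse import urljoin

URL_BASE = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/"


def build_cdc_url(*parts, base=URL_BASE):
    """Append relative path parts (each ending in "/") to the base URL."""
    url = base
    for part in parts:
        # urljoin resolves `part` relative to `url`; because `url` ends with "/",
        # the part is appended instead of replacing the last path segment.
        url = urljoin(url, part)
    return url


print(build_cdc_url("daily/", "more_precip/", "historical/"))
```

This only works as intended if every part ends with a slash, which is the case for the variables defined in the first cell.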

In [3]:
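def grab_file(file_url, download_dir):
    """Download file_url into download_dir, keeping the original file name."""
    # Get only the file name from the full URL
    file_name = file_url.split("/")[-1]
    file_path = os.path.join(download_dir, file_name)
    # Download the file and stop with an error if the request failed
    response = requests.get(file_url)
    response.raise_for_status()
    with open(file_path, "wb") as file:
        file.write(response.content)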
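
Splitting on "/" works for the plain file URLs used here. If a URL ever carried a query string or a fragment, a parser-based variant would be more robust. A minimal sketch with the standard library; `file_name_from_url` is a hypothetical helper, not part of the notebook:

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(file_url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(file_url).path
    return posixpath.basename(path)


url = (
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/"
    "climate/daily/more_precip/historical/RR_Tageswerte_Beschreibung_Stationen.txt"
)
print(file_name_from_url(url))
```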
        
    

In [4]:
# Send an HTTP request to the URL
response = requests.get(url_full)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")
    # Look for the metadata file
    links = soup.find_all(href=re.compile("Beschreibung"))
    # Take the url of the file
    file_name = links[0].get("href")
    # Download the file (url_full ends with "/", so the relative file name can be appended directly)
    grab_file(url_full + file_name, download_dir)
    print(f"Downloaded: {download_dir+file_name}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Downloaded: ../data/original/dwd/daily/more_precip/historical/RR_Tageswerte_Beschreibung_Stationen.txt
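
The link search can also be tried offline, without `requests` or `BeautifulSoup`. A minimal sketch using only the standard library; the HTML snippet is a hypothetical, shortened directory listing (not the real server response), and `HrefCollector` is a hypothetical helper class:

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical directory listing used only for illustration
sample_html = """
<html><body><pre>
<a href="../">../</a>
<a href="RR_Tageswerte_Beschreibung_Stationen.txt">RR_Tageswerte_Beschreibung_Stationen.txt</a>
<a href="example_archive.zip">example_archive.zip</a>
</pre></body></html>
"""

collector = HrefCollector()
collector.feed(sample_html)

# Keep only the links whose target contains "Beschreibung" (German for "description")
pattern = re.compile("Beschreibung")
matches = [href for href in collector.hrefs if pattern.search(href)]
print(matches)
```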


In [5]:
# Path of the downloaded station description file
file_path = os.path.join(download_dir, file_name)
# Read the header (first line) of the file and close the file afterwards
with open(file_path, encoding="latin") as f:
    header = f.readline().split()
header

['Stations_id',
 'von_datum',
 'bis_datum',
 'Stationshoehe',
 'geoBreite',
 'geoLaenge',
 'Stationsname',
 'Bundesland',
 'Abgabe']

In [6]:
# Translation dictionary: German column names -> English column names
translate = \
{'Stations_id':'station_id',
 'von_datum':'date_from',
 'bis_datum':'date_to',
 'Stationshoehe':'altitude',
 'geoBreite': 'latitude',
 'geoLaenge': 'longitude',
 'Stationsname':'name',
 'Bundesland':'state'}

In [7]:
#pd.read_csv?

In [8]:
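# Note: translate has only 8 keys, but the file has 9 columns. pandas therefore
# uses the first column (the station ID) as the index and shifts all named
# columns by one, as the output below shows. The next cell reads the file with
# the full header instead.
df_stations_2 = pd.read_fwf(file_path,
                          skiprows=[0,1],
                          names=translate,
                          encoding="latin", 
                          parse_dates=["von_datum","bis_datum"],
                          dtype={"Stations_id":str}
                          #index_col="Stations_id"
                         )
df_stations_2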



Unnamed: 0,Stations_id,von_datum,bis_datum,Stationshoehe,geoBreite,geoLaenge,Stationsname,Bundesland
1,19120101,1986-06-30,478,47.8413,8.8493,Aach,Baden-Württemberg,Frei
2,19510101,2006-12-31,138,50.8066,6.0996,Aachen (Kläranlage),Nordrhein-Westfalen,Frei
3,18910101,2011-03-31,202,50.7827,6.0941,Aachen,Nordrhein-Westfalen,Frei
4,19510101,1979-10-31,243,50.7683,6.1207,Aachen-Brand,Nordrhein-Westfalen,Frei
6,19821101,2024-09-21,455,48.8361,10.0598,Aalen-Unterrombach,Baden-Württemberg,Frei
...,...,...,...,...,...,...,...,...
19898,20240101,2024-09-21,39,52.4970,13.2820,Berlin-Halensee,Berlin,Frei
19911,19631101,1986-09-30,215,50.9625,10.0663,Gerstungen/Thüringen,Thüringen,Frei
19917,20240325,2024-09-21,153,49.9321,8.0767,Schwabenheim a.d. Selz,Rheinland-Pfalz,
19993,18970507,1935-12-31,4,53.6870,9.6757,Uetersen,Schleswig-Holstein,Frei


In [9]:
# read the stations dataframe
df_stations = pd.read_fwf(file_path,
                          skiprows=2,
                          names=header,
                          encoding="latin", 
                          parse_dates=["von_datum","bis_datum"],
                          dtype={"Stations_id":int} #easier to handle int for later comparison
                          #index_col="Stations_id"
                         )
df_stations

Unnamed: 0,Stations_id,von_datum,bis_datum,Stationshoehe,geoBreite,geoLaenge,Stationsname,Bundesland,Abgabe
0,1,1912-01-01,1986-06-30,478,47.8413,8.8493,Aach,Baden-Württemberg,Frei
1,2,1951-01-01,2006-12-31,138,50.8066,6.0996,Aachen (Kläranlage),Nordrhein-Westfalen,Frei
2,3,1891-01-01,2011-03-31,202,50.7827,6.0941,Aachen,Nordrhein-Westfalen,Frei
3,4,1951-01-01,1979-10-31,243,50.7683,6.1207,Aachen-Brand,Nordrhein-Westfalen,Frei
4,6,1982-11-01,2024-09-21,455,48.8361,10.0598,Aalen-Unterrombach,Baden-Württemberg,Frei
...,...,...,...,...,...,...,...,...,...
6487,19898,2024-01-01,2024-09-21,39,52.4970,13.2820,Berlin-Halensee,Berlin,Frei
6488,19911,1963-11-01,1986-09-30,215,50.9625,10.0663,Gerstungen/Thüringen,Thüringen,Frei
6489,19917,2024-03-25,2024-09-21,153,49.9321,8.0767,Schwabenheim a.d. Selz,Rheinland-Pfalz,
6490,19993,1897-05-07,1935-12-31,4,53.6870,9.6757,Uetersen,Schleswig-Holstein,Frei


Check all the different values in the "state" column. You can use the function <code>.unique()</code> for this.

In [10]:
df_stations.rename(columns=translate,inplace=True)

In [11]:
df_stations.loc[:,"state"].unique()

array(['Baden-Württemberg', 'Nordrhein-Westfalen', 'Hessen', 'Bayern',
       'Niedersachsen', 'Sachsen-Anhalt', 'Rheinland-Pfalz', 'Sachsen',
       'Mecklenburg-Vorpommern', 'Schleswig-Holstein', 'Brandenburg',
       'Thüringen', 'Saarland', 'Berlin', 'Bremen', 'Hamburg', 'Tirol'],
      dtype=object)

Select only the stations in NRW that are still active (date_to in 2023 or later) and that started recording before 1950.

In [12]:
#df_stations.query?

In [13]:
# Keep only stations in Nordrhein-Westfalen that are still active and started recording before 1950
df_stations_short = df_stations.query("state == 'Nordrhein-Westfalen' and date_to >= '2023-01-01' and date_from < '1950-01-01'")

In [14]:
df_stations_short

Unnamed: 0,station_id,date_from,date_to,altitude,latitude,longitude,name,state,Abgabe
75,79,1931-01-01,2024-09-20,160,50.6718,7.0155,Alfter-Volmershoven,Nordrhein-Westfalen,Frei
104,110,1931-01-01,2024-09-21,65,52.0487,7.4877,Altenberge,Nordrhein-Westfalen,Frei
174,187,1941-01-01,2024-09-21,265,51.4188,7.9126,Arnsberg-Holzen,Nordrhein-Westfalen,Frei
292,325,1941-01-01,2024-09-21,127,51.7948,8.0718,Beckum-Vellern,Nordrhein-Westfalen,Frei
349,389,1931-01-01,2024-09-21,436,51.0148,8.4318,"Berleburg, Bad-Arfeld",Nordrhein-Westfalen,Frei
...,...,...,...,...,...,...,...,...,...
5104,5579,1941-01-01,2024-09-21,273,51.5892,9.0296,Willebadessen-Borlinghausen,Nordrhein-Westfalen,Frei
5143,5619,1931-01-01,2024-09-21,360,51.1637,7.4234,Wipperfürth-Gardeweg,Nordrhein-Westfalen,Frei
5218,5699,1941-01-01,2024-09-21,312,51.5427,8.7784,Wünnenberg-Eilern,Nordrhein-Westfalen,Frei
5236,5717,1937-01-01,2024-09-21,134,51.2256,7.1052,Wuppertal-Buchenhofen,Nordrhein-Westfalen,Frei
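
The selection rule can be spelled out in plain Python to check individual rows by hand. A minimal sketch, assuming the thresholds mean 1 January 2023 and 1 January 1950; `is_selected` is a hypothetical helper, and the rows are copied from the tables shown in this notebook:

```python
from datetime import date

# A few rows copied from the station tables shown in this notebook
stations = [
    {"station_id": 2, "name": "Aachen (Kläranlage)", "state": "Nordrhein-Westfalen",
     "date_from": date(1951, 1, 1), "date_to": date(2006, 12, 31)},
    {"station_id": 3, "name": "Aachen", "state": "Nordrhein-Westfalen",
     "date_from": date(1891, 1, 1), "date_to": date(2011, 3, 31)},
    {"station_id": 6, "name": "Aalen-Unterrombach", "state": "Baden-Württemberg",
     "date_from": date(1982, 11, 1), "date_to": date(2024, 9, 21)},
    {"station_id": 79, "name": "Alfter-Volmershoven", "state": "Nordrhein-Westfalen",
     "date_from": date(1931, 1, 1), "date_to": date(2024, 9, 20)},
]


def is_selected(row, state="Nordrhein-Westfalen",
                active_since=date(2023, 1, 1), started_before=date(1950, 1, 1)):
    """True if the station is in `state`, still active, and old enough."""
    return (row["state"] == state
            and row["date_to"] >= active_since
            and row["date_from"] < started_before)


selected = [row["station_id"] for row in stations if is_selected(row)]
print(selected)
```

Of these four rows, only Alfter-Volmershoven meets all three conditions: the two Aachen stations stopped recording before 2023, and Aalen-Unterrombach is in a different state.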


In [15]:
df_stations_short.to_csv(outfname, sep=";")
print("CSV file saved in " + outfname)

CSV file saved in ../data/Selected_Stations_Nordrhein_Westfalen.csv
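
A file written with `sep=";"` has to be read back with the same delimiter. A minimal sketch using the standard `csv` module on an in-memory sample; the sample text is a hypothetical excerpt that assumes the layout `to_csv` produces here (the first, unnamed column is the DataFrame index):

```python
import csv
import io

# Hypothetical excerpt of the exported file, used only for illustration
sample = (
    ";station_id;date_from;date_to;altitude;latitude;longitude;name;state;Abgabe\n"
    "75;79;1931-01-01;2024-09-20;160;50.6718;7.0155;Alfter-Volmershoven;Nordrhein-Westfalen;Frei\n"
    "349;389;1931-01-01;2024-09-21;436;51.0148;8.4318;Berleburg, Bad-Arfeld;Nordrhein-Westfalen;Frei\n"
)

# delimiter=";" must match the separator used when writing the file
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)

for row in rows:
    print(row["station_id"], row["name"], float(row["latitude"]), float(row["longitude"]))
```

With a semicolon as delimiter, a station name that contains a comma, such as "Berleburg, Bad-Arfeld", stays in a single field.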


## Exporting the Data as a GeoPackage for our Time Series


We are interested in the locations of the stations.

In [16]:
gdf = gpd.GeoDataFrame(
    df_stations_short,
    geometry=gpd.points_from_xy(df_stations_short.longitude, df_stations_short.latitude),
    crs="EPSG:4326",
)
# EPSG:4326 (WGS 84) is used because we only have longitude/latitude, which are angles rather than projected distances from a selected origin
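
Because EPSG:4326 coordinates are angles in degrees, distances cannot be read off them directly; they need either a projected CRS or a spherical formula. A minimal sketch using the haversine formula with a mean Earth radius of 6371 km (an approximation); `haversine_km` is a hypothetical helper, not part of the notebook:

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Coordinates (latitude, longitude) of two stations from the station table
aachen = (50.7827, 6.0941)
aachen_brand = (50.7683, 6.1207)

distance = haversine_km(*aachen, *aachen_brand)
print(round(distance, 2), "km")
```

Note the argument order: the helper takes latitude first, whereas `points_from_xy` expects x (longitude) first and y (latitude) second.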

In [17]:
gpkg_path = r"../data/Nordrhein-Westfalen_Selected_Stations_from_Jupyter.gpkg"
gdf.to_file(gpkg_path, driver="GPKG", layer="selected_stations_points")
print("Geopackage file saved in " + gpkg_path)

Geopackage file saved in ../data/Nordrhein-Westfalen_Selected_Stations_from_Jupyter.gpkg
