# Get Station Description for DWD Climatological Measuring Stations

## 1. About the DWD Open Data Portal 

The data of the Climate Data Center (CDC) of the DWD (Deutscher Wetterdienst, German Weather Service) is provided on an **FTP server**. <br> **FTP** stands for _File Transfer Protocol_.

Open the FTP link ftp://opendata.dwd.de/climate_environment/CDC/ in your browser (copy-paste) and find out how it is structured hierarchically.

You can also open the link with **HTTPS** (Hypertext Transfer Protocol Secure): https://opendata.dwd.de/climate_environment/CDC/

**Download and read** the document https://opendata.dwd.de/climate_environment/CDC/Readme_intro_CDC_ftp.pdf

**Q1:** In which temporal resolutions are the time series provided?

**Q2:** What is the difference between _historical_ and _recent_ data, in particular with respect to quality control?

**Q3:** Are all meteorological parameters provided at the same temporal resolution?


## 2. Download the Station Meta Data 

We are interested in observations with the following properties:

1. The **observations** are taken in Germany.
1. It is **temperature** data.
1. The temporal resolution is **annual** (yearly).
1. Use **historical** data, not recent.


Download the corresponding station meta data file (description) from the FTP server. The file you have to download is named `KL_Jahreswerte_Beschreibung_Stationen.txt`. The elements of the file name denote:

* KL, Klima:     Ensemble of Climate Data, 
* Jahreswerte:   Annual Values, 
* Beschreibung:  Description, 
* Stationen:     Stations

**Q1:** Under which path (directory, folder) on the FTP server do you find the file?

**Q2:** The Python FTP client we use is provided through the library _ftplib_: <br>
https://pythonprogramming.net/ftp-transfers-python-ftplib/ <br>
How do you use it?

**Q3:** Look at the code below. In which folder is the data stored locally? What are relative and absolute paths?

## FTP Connection

This connection is used to download the metadata text file `KL_Jahreswerte_Beschreibung_Stationen.txt`, which lists the meteorological stations providing annual climate (KL) data. The data in the text file is fixed-width formatted, i.e. the values are arranged in columns at fixed character positions. Several of the stations have already been abandoned. Whether a station is still active can be concluded from the column `bis_datum`. Of course you could have downloaded this single file directly with your browser or an FTP client, but later it will become clear how download processes can be automated with `ftplib`.
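To illustrate how a fixed-width column such as `bis_datum` can be evaluated, here is a minimal sketch using only the standard library. The two sample rows, the column positions, the `YYYYMMDD` date format, and the cutoff date are assumptions made for this example; the real file has more columns and its own layout.

```python
from datetime import date, datetime

# Hypothetical fixed-width rows: station id, von_datum, bis_datum (YYYYMMDD assumed).
rows = [
    "00001 19310101 19861231",
    "00044 19710301 20231231",
]

def is_active(row, cutoff):
    """Return True if the station's last observation date is on or after cutoff."""
    # Fixed width: every field occupies the same character range in each line.
    bis_datum = datetime.strptime(row[15:23], "%Y%m%d").date()
    return bis_datum >= cutoff

cutoff = date(2023, 1, 1)  # arbitrary threshold chosen for this example
active_ids = [int(row[0:5]) for row in rows if is_active(row, cutoff)]
print(active_ids)
```

Later in this notebook the same kind of file is read with `pd.read_fwf`, which works out the column boundaries itself.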

### Connection Parameters

FTP servers often allow anonymous access, so that no personal login is required.

In [None]:
server = "opendata.dwd.de"
user   = "anonymous"
passwd = ""

### FTP Directory Definition and Station Description Filename Pattern

In [None]:
# The topic of interest: historical annual climate (KL) data.
# No leading slash: the parent directories it is appended to already end with one.
topic_dir = "annual/kl/historical/"

# This is the search pattern common to ALL station description file names. 
station_desc_pattern = "_Beschreibung_Stationen.txt"

# Below this directory tree node all climate data are stored.
climate_data_dir = "/climate_environment/CDC/observations_germany/climate/"
ftp_dir =  climate_data_dir + topic_dir
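As an alternative to plain string concatenation, FTP paths can be assembled with `posixpath.join`, which always uses `/` as separator regardless of the local operating system. This is only a sketch. Note that `topic_dir` is written without a leading slash here, because `join` treats a component that starts with `/` as a new absolute path and discards everything before it.

```python
import posixpath

climate_data_dir = "/climate_environment/CDC/observations_germany/climate/"
topic_dir = "annual/kl/historical/"  # no leading slash, see the note above

# posixpath.join inserts "/" only where it is missing.
ftp_dir = posixpath.join(climate_data_dir, topic_dir)
print(ftp_dir)
```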

### Local Directories

In [None]:
# To keep the folders tidy the subdirectory tree of the FTP is replicated.
local_ts_dir = "data/DWD/" + topic_dir # TS stands for "time series". Better add a trailing "/" to make life easier ... 
local_station_dir = local_ts_dir # station info directory.

In [None]:
# Directory trees are created. Ignore errors if they already exist.
import os
os.makedirs(local_ts_dir,exist_ok = True) # it does not complain if the dir already exists.
os.makedirs(local_station_dir,exist_ok = True) # it does not complain if the dir already exists.
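The local directory above is given as a relative path, so where the data ends up depends on the current working directory of the notebook. The following sketch shows how to inspect this with `os.path`. The directory name is spelled out again here so that the example runs on its own.

```python
import os

# A relative path: it has no root and is resolved against the working directory.
local_dir = os.path.join("data", "DWD", "annual", "kl", "historical")
print(os.path.isabs(local_dir))

# abspath prepends the current working directory and normalizes the result.
abs_dir = os.path.abspath(local_dir)
print(os.path.isabs(abs_dir))
print(abs_dir)
```

Printing `os.getcwd()` in the notebook tells you which working directory is used as the starting point.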

### FTP Connect

In [None]:
# Open the FTP session. Log in. If the connection idles for too long it will time out.
import ftplib
ftp = ftplib.FTP(server)
res = ftp.login(user=user, passwd = passwd)
print(res)

In [None]:
# Just check whether the connection is still open (i.e. it has not reached a timeout yet).
ret = ftp.cwd(".")
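The check above raises an exception when the session has timed out. A small helper can turn this into a yes/no answer. This is a sketch, not part of the original notebook: the function name `connection_is_alive` is made up, and it sends the FTP command `NOOP` (which does nothing on the server) instead of changing the directory.

```python
import ftplib

def connection_is_alive(ftp):
    """Return True if the server still answers a NOOP command."""
    try:
        ftp.voidcmd("NOOP")  # harmless command; raises if the session is gone
        return True
    except ftplib.all_errors:
        return False
```

If the function returns `False`, open a new session with `ftplib.FTP(server)` and log in again.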

In [None]:
# How to log out.
#ftp.quit()

### FTP Grab File Function

In [None]:
def grabFile(ftpfullname, localfullname):
    try:
        ftp.cwd(".")  # A dummy action to check the connection and to provoke an exception if necessary.
        # The with statement closes the local file even if the transfer fails.
        with open(localfullname, 'wb') as localfile:
            ftp.retrbinary('RETR ' + ftpfullname, localfile.write, 1024)
    
    except ftplib.error_perm:
        print("FTP ERROR. Operation not permitted. File not found?")

    except ftplib.error_temp:
        print("FTP ERROR. Timeout.")

    except ConnectionAbortedError:
        print("FTP ERROR. Connection aborted.")
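One weakness of a download function like the one above is that a failed transfer can leave an empty or truncated local file behind. The following variant is a sketch of one way to avoid that, under a few assumptions: the name `grab_file`, the `.part` suffix, and the boolean return value are choices made for this example, and the FTP session is passed in as an argument instead of being taken from a global variable.

```python
import ftplib
import os

def grab_file(ftp, remote_path, local_path):
    """Download remote_path to local_path; return True on success."""
    tmp_path = local_path + ".part"  # hypothetical suffix for the unfinished download
    try:
        with open(tmp_path, "wb") as fh:
            ftp.retrbinary("RETR " + remote_path, fh.write)
    except ftplib.all_errors as err:
        print("FTP error:", err)
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a truncated file behind
        return False
    os.replace(tmp_path, local_path)  # rename only after a complete transfer
    return True
```

A call would look like `grab_file(ftp, ftp_dir + station_fname, local_station_dir + station_fname)`.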



### Generate Pandas Dataframe from FTP Directory Listing

In [None]:
import pandas as pd
import os
import ftplib

# Generate a pandas dataframe from an FTP directory listing.
def gen_df_from_ftp_dir_listing(ftp, ftpdir):
    lines = []
    flist = []
    try:
        # Issue the command LIST in the FTP connection.
        ftp.retrlines("LIST " + ftpdir, lines.append)
    except ftplib.all_errors:
        # Catch FTP and network errors only, not every possible exception.
        print("Error: ftp.retrlines() failed. ftp timeout? Reconnect!")
        return
        
    if len(lines) == 0:
        print("Error: ftp dir is empty")
        return
    
    for line in lines:
        # Each listing line is cut at fixed character positions: type flag, size, file name.
        ftype, fsize, fname = line[0:1], int(line[31:42]), line[56:]

        fext = os.path.splitext(fname)[-1]

        # The third element of a zip file name is parsed as the station id.
        if fext == ".zip":
            station_id = int(fname.split("_")[2])
        else:
            station_id = -1

        flist.append([station_id, fname, fext, fsize, ftype])

    df_ftpdir = pd.DataFrame(flist, columns=["station_id", "name", "ext", "size", "type"])
    return df_ftpdir

In [None]:
# Generate a pandas dataframe from the FTP directory listing 
df_ftpdir = gen_df_from_ftp_dir_listing(ftp, ftp_dir)
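The listing function cuts each line at fixed character positions (`line[31:42]`, `line[56:]`), which only works as long as the server keeps exactly that column layout. A sketch of a more tolerant approach is to split on whitespace instead. The format of `LIST` output is not standardized, so the nine-field Unix-style layout assumed below, the helper name `parse_list_line`, and the sample line are assumptions for illustration only.

```python
def parse_list_line(line):
    """Split one Unix-style LIST line into (type, size, name)."""
    # Assumed layout: permissions, links, owner, group, size, month, day, time/year, name
    fields = line.split(maxsplit=8)
    ftype = fields[0][0]   # "-" for a regular file, "d" for a directory
    fsize = int(fields[4])
    fname = fields[8]      # maxsplit keeps file names containing spaces in one piece
    return ftype, fsize, fname

# Made-up sample line, not copied from the server.
sample = ("-rw-r--r--    1 ftp      ftp         12345 Jan 01  2020 "
          "jahreswerte_KL_00001_19310101_19860630_hist.zip")
print(parse_list_line(sample))
```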

Read the following output carefully. <br>
Q: What does `station_id = -1` mean? <br>
Q: What does the field `ext` mean? <br>
Q: What is the name of the file describing the stations, i.e. the file that lists the stations with their names, coordinates, and other attributes?

In [None]:
df_ftpdir.head(10)

### Dataframe with TS Zip Files

Create a dataframe with the names of the zip files only. These zip archives contain the real measurement data. The measured variable (precipitation, temperature, etc.) is time dependent. A sequence of data over time is called a **time series**.

In [None]:
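# The boolean mask df_ftpdir["ext"] == ".zip" selects the rows of the zip files.
# set_index returns a new dataframe, which avoids modifying a slice of df_ftpdir in place.
df_zips = df_ftpdir[df_ftpdir["ext"] == ".zip"].set_index("station_id")
df_zips.head(10)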

### Download the Station Description File

In [None]:
station_fname = df_ftpdir[df_ftpdir['name'].str.contains(station_desc_pattern)]["name"].values[0]
print(station_fname)

# Alternative
#station_fname2 = df_ftpdir[df_ftpdir["name"].str.match("^.*Beschreibung_Stationen.*txt$")]["name"].values[0]
#print(station_fname2)
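When a search pattern is interpreted as a regular expression, which is the default for pandas `str.contains`, the `.` in `_Beschreibung_Stationen.txt` matches any character, not just a literal dot. The sketch below shows the difference with the standard library only. The file names in the list are made up for illustration.

```python
import re

station_desc_pattern = "_Beschreibung_Stationen.txt"

# Made-up file names for illustration.
names = [
    "KL_Jahreswerte_Beschreibung_Stationen.txt",
    "KL_Jahreswerte_Beschreibung_Stationen_txt.bak",
    "jahreswerte_KL_00001_19310101_19860630_hist.zip",
]

# Pattern used as a regular expression: "." matches any single character.
as_regex = [n for n in names if re.search(station_desc_pattern, n)]

# Pattern escaped, so that every character is taken literally.
as_literal = [n for n in names if re.search(re.escape(station_desc_pattern), n)]

# Plain string test without any regular expression.
by_suffix = [n for n in names if n.endswith(station_desc_pattern)]

print(len(as_regex), len(as_literal), len(by_suffix))
```

In pandas, a literal comparison can be requested with `str.contains(pattern, regex=False)` or `str.endswith(pattern)`.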

In [None]:
print("grab file: " + station_fname + "\nfrom ftp dir: " + ftp_dir)
grabFile(ftp_dir + station_fname, local_station_dir + station_fname)

In [None]:
# Extract the column names. They are in German (de).
# We have to use codecs because of difficulties with character encoding (German Umlaute).
import codecs

def station_desc_txt_to_csv(txtfile, csvfile):
    # Read only the header line, which holds the column names.
    with codecs.open(txtfile, "r", "utf-8") as file:
        r = file.readline()
    colnames_de = r.split()
    
    # German-English dictionary
    translate = \
    {'Stations_id':'station_id',
     'von_datum':'date_from',
     'bis_datum':'date_to',
     'Stationshoehe':'altitude',
     'geoBreite': 'latitude',
     'geoLaenge': 'longitude',
     'Stationsname':'name',
     'Bundesland':'state'}
    
    colnames_en = [translate[h] for h in colnames_de]
    
    # Skip the first two rows and set the column names.
    df = pd.read_fwf(txtfile,skiprows=2,names=colnames_en, parse_dates=["date_from","date_to"],index_col = 0)
    
    # write CSV file with field separator semicolon
    df.to_csv(csvfile, sep = ";")
    return(df)

In [None]:
basename = os.path.splitext(station_fname)[0]
df_stations = station_desc_txt_to_csv(local_station_dir + station_fname, local_station_dir + basename + ".csv")
df_stations.head()
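The list comprehension `translate[h] for h in colnames_de` raises a `KeyError` as soon as the header contains a column name that is missing from the dictionary. A minimal sketch of a more forgiving variant keeps unknown names unchanged. The helper name `translate_header` and the extra column `Neue_Spalte` are invented for this example.

```python
# German-English dictionary, same entries as in the function above.
translate = {
    "Stations_id": "station_id",
    "von_datum": "date_from",
    "bis_datum": "date_to",
    "Stationshoehe": "altitude",
    "geoBreite": "latitude",
    "geoLaenge": "longitude",
    "Stationsname": "name",
    "Bundesland": "state",
}

def translate_header(header_line, translate):
    """Translate known column names; keep unknown ones as they are."""
    return [translate.get(name, name) for name in header_line.split()]

# Hypothetical header line with one column the dictionary does not know.
header = "Stations_id von_datum bis_datum Neue_Spalte"
print(translate_header(header, translate))
```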

### Select Stations Located in NRW from Station Description Dataframe

In [None]:
station_ids_selected = df_stations[df_stations['state'].str.contains("Nordrhein")].index
station_ids_selected

# HOMEWORK: Create a Map in QGIS!

The **pdf file in the notebook folder** shows the final result you have to produce yourself.

Follow the tutorial http://www.qgistutorials.com/en/docs/3/making_a_map.html

In class we created a vector data layer (point shape file) with the coordinates of the DWD CDC climate stations from a CSV file we generated from the meta data file downloaded from the open data DWD FTP archive (hourly values, precipitation).

Create a map of the DWD climate stations located in NRW. Use a shapefile of the NRW administrative boundaries.

Use the EPSG:25832 coordinate reference system (projection). We will learn later what it is.


The borders of the federal state NRW can be found in a zip archive containing the official administrative boundaries of NRW.

Download the archive **dvg1_EPSG25832_Shape.zip** from here: https://www.opengeodata.nrw.de/produkte/geobasis/vkg/dvg/dvg1/ and add the right shape file (.shp) as a vector layer to your QGIS project.

