# Get Station Description for DWD Climatological Measuring Stations

## 1. About the DWD Open Data Portal 

The data of the Climate Data Center (CDC) of the DWD (Deutscher Wetterdienst, German Weather Service) is provided on an **FTP server**. <br> **FTP** stands for _File Transfer Protocol_.

Open the FTP link ftp://opendata.dwd.de/climate_environment/CDC/ in your browser (copy-paste) and find out how it is structured hierarchically.

You can also open the link with **HTTPS** (Hypertext Transfer Protocol Secure): https://opendata.dwd.de/climate_environment/CDC/
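The FTP and HTTPS links differ only in their scheme; host and path are the same. A minimal sketch using the standard library's `urllib.parse` illustrates this. The helper name `to_https` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit

FTP_URL = "ftp://opendata.dwd.de/climate_environment/CDC/"

def to_https(url):
    """Return the same URL with the scheme replaced by https (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))

https_url = to_https(FTP_URL)
print(https_url)
```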

**Download and read** the document https://opendata.dwd.de/climate_environment/CDC/Readme_intro_CDC_ftp.pdf

**Q1:** In which temporal resolutions are the time series provided?

**Q2:** What is the difference between _historical_ and _recent_ data, including with respect to quality control?

**Q3:** Are all meteorological parameters provided at the same temporal resolution?


## 2. Download the Station Meta Data 

We are interested in observations with the following properties:

1. The observations are taken in Germany.
1. It is climate data.
1. The temporal resolution is annual.
1. The data are historical, not recent.


Download the corresponding station meta data file (description) from the FTP server. The file you have to download is named `KL_Jahreswerte_Beschreibung_Stationen.txt`. The elements of the file name denote:

* KL: Klima, Climate, 
* Jahreswerte: Annual Values, 
* Beschreibung: Description, 
* Stationen: Stations
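The naming scheme can be decoded mechanically. The sketch below splits the file name at the underscores and looks up each element in a small glossary built from the list above; the variable names are chosen for illustration only.

```python
fname = "KL_Jahreswerte_Beschreibung_Stationen.txt"

# German -> English glossary taken from the list above.
glossary = {
    "KL": "Climate",
    "Jahreswerte": "Annual Values",
    "Beschreibung": "Description",
    "Stationen": "Stations",
}

stem = fname.rsplit(".", 1)[0]   # drop the ".txt" extension
elements = stem.split("_")       # one entry per name element
translated = [glossary[e] for e in elements]
print(translated)
```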

**Q1:** Under which path (directory, folder) on the FTP server do you find the file?

**Q2:** The Python FTP client we use is provided through the library _ftplib_: <br>
https://pythonprogramming.net/ftp-transfers-python-ftplib/ <br>
How do you use it?

**Q3:** Look at the code below. In which folder is the data stored locally? What are relative and absolute paths?
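As a starting point for Q3, the following sketch contrasts a relative path with its absolute counterpart using `os.path`. The directory `data/DWD` is only an example; nothing is created on disk.

```python
import os

rel_path = os.path.join("data", "DWD")   # relative: interpreted from the current working directory
abs_path = os.path.abspath(rel_path)     # absolute: starts at the file system root

print("current working directory:", os.getcwd())
print("relative:", rel_path, "| is absolute?", os.path.isabs(rel_path))
print("absolute:", abs_path, "| is absolute?", os.path.isabs(abs_path))
```

The absolute path changes when the notebook is started from a different working directory, while the relative path string stays the same.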

## FTP Connection

### Connection Parameters

In [20]:
server = "opendata.dwd.de"
user   = "anonymous"
passwd = ""

### FTP Directory Definition and Station Description Filename Pattern

In [21]:
# The topic of interest.
topic_dir = "/hourly/precipitation/recent/"

# This is the search pattern common to ALL station description file names 
station_desc_pattern = "_Beschreibung_Stationen.txt"

# Below this directory tree node all climate data are stored.
climate_data_dir = "/climate_environment/CDC/observations_germany/climate/"
ftp_dir =  climate_data_dir + topic_dir
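Because `climate_data_dir` ends with a slash and `topic_dir` starts with one, plain concatenation produces a double slash (`climate//hourly`). A possible way to avoid this is sketched below; `join_ftp_path` is a hypothetical helper, not part of the notebook.

```python
def join_ftp_path(*parts):
    """Join path fragments with single slashes (hypothetical helper)."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/" + "/".join(cleaned) + "/"

climate_data_dir = "/climate_environment/CDC/observations_germany/climate/"
topic_dir = "/hourly/precipitation/recent/"

ftp_dir_clean = join_ftp_path(climate_data_dir, topic_dir)
print(ftp_dir_clean)
```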

### Local Directories

In [22]:
local_ts_dir = "data/DWD/" + topic_dir # TS stands for "time series". Better add a trailing "/" to make life easier ... 
local_station_dir = local_ts_dir # station info 

In [23]:
import os
os.makedirs(local_ts_dir,exist_ok = True) # it does not complain if the dir already exists.
os.makedirs(local_station_dir,exist_ok = True) # it does not complain if the dir already exists.

### FTP Connect

In [24]:
import ftplib
ftp = ftplib.FTP(server)
res = ftp.login(user=user, passwd = passwd)
print(res)

230 Login successful.


In [25]:
ret = ftp.cwd(".")

In [26]:
# ftp.quit()
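Instead of calling `ftp.quit()` by hand, `ftplib.FTP` can also be used in a `with` statement, which closes the session automatically. The sketch below assumes anonymous login as above; the function name `login_and_pwd` and its `ftp_class` parameter are hypothetical and exist only so the function can be tried without a network connection.

```python
import ftplib

def login_and_pwd(server, user="anonymous", passwd="", ftp_class=ftplib.FTP):
    """Log in, ask for the current directory, and always close the session."""
    # ftp_class is a hypothetical hook so the function can be tried offline with a stand-in.
    with ftp_class(server) as ftp:        # leaving the block sends QUIT and closes the socket
        response = ftp.login(user=user, passwd=passwd)
        return response, ftp.pwd()

# Usage (needs network access):
# print(login_and_pwd("opendata.dwd.de"))
```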

### FTP Grab File Function

In [27]:
def grabFile(ftpfullname,localfullname):
    try:
        ret = ftp.cwd(".") # A dummy action to check the connection and to provoke an exception if necessary.
        # The with statement closes the local file even if the transfer fails.
        with open(localfullname, 'wb') as localfile:
            ftp.retrbinary('RETR ' + ftpfullname, localfile.write, 1024)
    
    except ftplib.error_perm:
        print("FTP ERROR. Operation not permitted. File not found?")

    except ftplib.error_temp:
        print("FTP ERROR. Timeout.")

    except ConnectionAbortedError:
        print("FTP ERROR. Connection aborted.")
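A possible variant of the download function is sketched below. It takes the FTP connection as an argument instead of relying on a global variable, reports success as a boolean, and removes the empty or partial local file when the transfer fails. The name `grab_file` and the return convention are assumptions for illustration.

```python
import ftplib
import os

def grab_file(ftp, ftp_fullname, local_fullname):
    """Download one file; return True on success, False otherwise (hypothetical variant)."""
    try:
        with open(local_fullname, "wb") as local_file:
            ftp.retrbinary("RETR " + ftp_fullname, local_file.write, 1024)
        return True
    except ftplib.all_errors as err:
        print("FTP ERROR:", err)
        # Remove the empty or partial local file left behind by the failed transfer.
        if os.path.exists(local_fullname):
            os.remove(local_fullname)
        return False
```

`ftplib.all_errors` bundles the FTP-specific exceptions together with `OSError` and `EOFError`, so permission errors, timeouts and dropped connections end up in the same branch.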



### Generate Pandas Dataframe from FTP Directory Listing

In [28]:
import pandas as pd
import os

def gen_df_from_ftp_dir_listing(ftp, ftpdir):
    lines = []
    flist = []
    try:    
        res = ftp.retrlines("LIST "+ftpdir, lines.append)
    except ftplib.all_errors:  # FTP protocol errors, timeouts and dropped connections
        print("Error: ftp.retrlines() failed. ftp timeout? Reconnect!")
        return
        
    if len(lines) == 0:
        print("Error: ftp dir is empty")
        return
    
    for line in lines:
        # Fixed-width slices of one LIST line: file type, size in bytes, file name.
        ftype, fsize, fname = line[0:1], int(line[31:42]), line[56:]
        
        fext = os.path.splitext(fname)[-1]
        
        if fext == ".zip":
            station_id = int(fname.split("_")[2])
        else:
            station_id = -1 
        
        flist.append([station_id, fname, fext, fsize, ftype])
        
        

    df_ftpdir = pd.DataFrame(flist,columns=["station_id", "name", "ext", "size", "type"])
    return(df_ftpdir)
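The fixed character positions used above depend on the exact column widths of the server's `LIST` output. A less position-dependent alternative, assuming a Unix `ls -l` style listing, is to split each line on whitespace. The helper `parse_list_line` and the sample line are made up for illustration; the real server output may differ.

```python
def parse_list_line(line):
    """Split one Unix-style LIST line into (type, size, name) (hypothetical helper)."""
    fields = line.split(maxsplit=8)   # 8 metadata fields, then the file name (may contain spaces)
    ftype = fields[0][0]              # "-" for a regular file, "d" for a directory
    fsize = int(fields[4])            # size in bytes
    fname = fields[8]
    return ftype, fsize, fname

# Made-up sample line in "ls -l" layout.
sample = "-rw-r--r--    1 1234     ftp         42821 Jan 21 09:10 stundenwerte_RR_00020_akt.zip"
print(parse_list_line(sample))
```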

In [29]:
df_ftpdir = gen_df_from_ftp_dir_listing(ftp, ftp_dir)

In [30]:
df_ftpdir.head(10)

| | station_id | name | ext | size | type |
|---|---|---|---|---|---|
| 0 | -1 | BESCHREIBUNG_obsgermany_climate_hourly_precipi... | .pdf | 68888 | - |
| 1 | -1 | DESCRIPTION_obsgermany_climate_hourly_precipit... | .pdf | 68313 | - |
| 2 | -1 | RR_Stundenwerte_Beschreibung_Stationen.txt | .txt | 209079 | - |
| 3 | 20 | stundenwerte_RR_00020_akt.zip | .zip | 42821 | - |
| 4 | 44 | stundenwerte_RR_00044_akt.zip | .zip | 43744 | - |
| 5 | 53 | stundenwerte_RR_00053_akt.zip | .zip | 42366 | - |
| 6 | 71 | stundenwerte_RR_00071_akt.zip | .zip | 40805 | - |
| 7 | 73 | stundenwerte_RR_00073_akt.zip | .zip | 42769 | - |
| 8 | 78 | stundenwerte_RR_00078_akt.zip | .zip | 43786 | - |
| 9 | 87 | stundenwerte_RR_00087_akt.zip | .zip | 41098 | - |


### Dataframe with TS Zip Files

In [171]:
# Keep only the zip archives and index them by station ID.
df_zips = df_ftpdir[df_ftpdir["ext"]==".zip"].set_index("station_id")
df_zips.head(10)

| station_id | name | ext | size | type |
|---|---|---|---|---|
| 20 | stundenwerte_RR_00020_akt.zip | .zip | 42592 | - |
| 44 | stundenwerte_RR_00044_akt.zip | .zip | 43541 | - |
| 53 | stundenwerte_RR_00053_akt.zip | .zip | 42169 | - |
| 71 | stundenwerte_RR_00071_akt.zip | .zip | 42374 | - |
| 73 | stundenwerte_RR_00073_akt.zip | .zip | 43207 | - |
| 78 | stundenwerte_RR_00078_akt.zip | .zip | 43570 | - |
| 87 | stundenwerte_RR_00087_akt.zip | .zip | 40990 | - |
| 91 | stundenwerte_RR_00091_akt.zip | .zip | 41719 | - |
| 103 | stundenwerte_RR_00103_akt.zip | .zip | 41790 | - |
| 118 | stundenwerte_RR_00118_akt.zip | .zip | 42604 | - |


### Download the Station Description File

In [172]:
station_fname = df_ftpdir[df_ftpdir['name'].str.contains(station_desc_pattern)]["name"].values[0]
print(station_fname)

# Alternative
#station_fname2 = df_ftpdir[df_ftpdir["name"].str.match("^.*Beschreibung_Stationen.*txt$")]["name"].values[0]
#print(station_fname2)

RR_Stundenwerte_Beschreibung_Stationen.txt
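By default `Series.str.contains` interprets the pattern as a regular expression, so the `.` in `.txt` matches any character. This is harmless here, but the sketch below shows two stricter options on a two-entry stand-in for the `name` column: a literal substring match and a suffix match.

```python
import pandas as pd

names = pd.Series([
    "RR_Stundenwerte_Beschreibung_Stationen.txt",
    "stundenwerte_RR_00020_akt.zip",
])
pattern = "_Beschreibung_Stationen.txt"

# regex=False treats the pattern as plain text, so "." only matches a literal dot.
mask_contains = names.str.contains(pattern, regex=False)
# endswith is stricter: the pattern must be the tail of the file name.
mask_endswith = names.str.endswith(pattern)

print(names[mask_contains].tolist())
print(names[mask_endswith].tolist())
```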


In [173]:
print("grab file: " + station_fname + "\nfrom ftp dir: " + ftp_dir)
grabFile(ftp_dir + station_fname, local_station_dir + station_fname)

grab file: RR_Stundenwerte_Beschreibung_Stationen.txt
from ftp dir: /climate_environment/CDC/observations_germany/climate//hourly/precipitation/recent/


In [174]:
# extract column names. They are in German (de)
# We have to use codecs because of difficulties with character encoding (German Umlaute)
import codecs

def station_desc_txt_to_csv(txtfile, csvfile):
    # Read only the header line; the with statement closes the file afterwards.
    with codecs.open(txtfile,"r","utf-8") as file:
        r = file.readline()
    colnames_de = r.split()
    
    translate = \
    {'Stations_id':'station_id',
     'von_datum':'date_from',
     'bis_datum':'date_to',
     'Stationshoehe':'altitude',
     'geoBreite': 'latitude',
     'geoLaenge': 'longitude',
     'Stationsname':'name',
     'Bundesland':'state'}
    
    colnames_en = [translate[h] for h in colnames_de]
    
    # Skip the first two rows and set the column names.
    df = pd.read_fwf(txtfile,skiprows=2,names=colnames_en, parse_dates=["date_from","date_to"],index_col = 0)
    
    # write csv
    df.to_csv(csvfile, sep = ";")
    return(df)

In [175]:
basename = os.path.splitext(station_fname)[0]
df_stations = station_desc_txt_to_csv(local_station_dir + station_fname, local_station_dir + basename + ".csv")
df_stations.head()

| station_id | date_from | date_to | altitude | latitude | longitude | name | state |
|---|---|---|---|---|---|---|---|
| 3 | 1995-09-01 | 2011-04-01 | 202 | 50.7827 | 6.0941 | Aachen | Nordrhein-Westfalen |
| 20 | 2004-08-14 | 2020-01-21 | 432 | 48.922 | 9.9129 | Abtsgmünd-Untergröningen | Baden-Württemberg |
| 44 | 2007-04-01 | 2020-01-21 | 44 | 52.9336 | 8.237 | Großenkneten | Niedersachsen |
| 53 | 2005-10-01 | 2020-01-21 | 60 | 52.585 | 13.5634 | Ahrensfelde | Brandenburg |
| 71 | 2004-10-22 | 2020-01-01 | 759 | 48.2156 | 8.9784 | Albstadt-Badkap | Baden-Württemberg |
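The CSV file written by `station_desc_txt_to_csv` uses `;` as separator, so reading it back requires the same setting. The sketch below checks the round trip on two rows copied from the table above; it reads from an in-memory string instead of the real file, and `df_check` is a name chosen for illustration.

```python
import io
import pandas as pd

# Two rows from the station table, in the layout written by to_csv(sep=";").
csv_text = (
    "station_id;date_from;date_to;altitude;latitude;longitude;name;state\n"
    "3;1995-09-01;2011-04-01;202;50.7827;6.0941;Aachen;Nordrhein-Westfalen\n"
    "44;2007-04-01;2020-01-21;44;52.9336;8.237;Großenkneten;Niedersachsen\n"
)

df_check = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    index_col="station_id",
    parse_dates=["date_from", "date_to"],
)
print(df_check.dtypes)
```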


### Select Stations Located in NRW from Station Description Dataframe

In [176]:
station_ids_selected = df_stations[df_stations['state'].str.contains("Nordrhein")].index
station_ids_selected

Int64Index([    3,   216,   326,   389,   390,   554,   555,   599,   603,
              613,   617,   644,   796,   871,   902,   934,   989,  1024,
             1046,  1078,  1241,  1246,  1300,  1303,  1327,  1590,  1595,
             1766,  2027,  2110,  2254,  2473,  2483,  2497,  2629,  2667,
             2703,  2810,  2947,  2968,  2999,  3028,  3031,  3081,  3098,
             3215,  3321,  3339,  3499,  3540,  3591,  3795,  3913,  4063,
             4127,  4150,  4154,  4313,  4368,  4371,  4400,  4488,  4692,
             4741,  4849,  5064,  5347,  5360,  5468,  5480,  5513,  5619,
             5699,  5717,  5719,  5733,  6197,  6264,  6276,  6313,  6337,
             7106,  7330,  7344,  7374,  7378, 13669, 13670, 13671, 13696,
            13700, 13713, 15000],
           dtype='int64', name='station_id')
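Not every selected station necessarily has a zip file in the FTP directory (station 3, for instance, does not appear among the first zip files listed earlier). A possible next step, sketched on small stand-ins for `df_zips` and `station_ids_selected`, is to intersect the two indexes before downloading. The second file name is made up to follow the naming pattern.

```python
import pandas as pd

# Small stand-ins for df_zips and station_ids_selected from the cells above.
df_zips_demo = pd.DataFrame(
    {"name": ["stundenwerte_RR_00020_akt.zip", "stundenwerte_RR_00216_akt.zip"]},
    index=pd.Index([20, 216], name="station_id"),
)
ids_demo = pd.Index([3, 216, 326], name="station_id")

# Keep only IDs that actually have a zip file in the FTP directory.
ids_available = df_zips_demo.index.intersection(ids_demo)
files_to_grab = df_zips_demo.loc[ids_available, "name"].tolist()
print(files_to_grab)
```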