Destatis is the Federal Statistical Office of Germany. Of course, it publishes a lot of datasets containing a wide range of data, from area sizes to international econonomic indicators in its database called Genesis.
Unfortunately, the downloadable
csv files don't comply with common standards of a tidy, ready-to-use machine-readable dataset:
- The tables have double, triple, quadruple, quintuple ... headers.
- Every file includes copyright information on the end of the file.
- positive numeric valus have a
The problems exists throughout the federal system of different statistical offices. Therefore
destatiscleanr works on data of regionalstatistik.de and other statistics offices, too.
The consequence of these messy files is time-consuming data cleaning. Everytime you want to use data from Destatis you have to do the same (or at least very similar) tasks. This package helps by doing four things:
- it imports the file by taking care of German peculiarities concerning encoding and decimal marks
- it deletes the copyright and metadata part
- it combines multiline headers to a regular column name
- it converts numeric values to
Ideally, you can start your analysis right after calling
The package can be installed with
csv file from the official Destatis/Genesis database and provide its path to the
df <- destatiscleanr("path/to/destatis_file.csv")
A short example to illustrate the advantage of the package is the table for Verbraucherpreise, German for consumer prices aka inflation.
The column name
na_na derives from the fact that the column names are built from the rows four and five in the original "Verbraucherpreise" table - and these are empty, therefore
The goal is to jump start the analysis of Destatis data. This comes with two caveats: the automatic creation of column names and the handling of missing values.
Be aware that the automatic renaming of columns doesn't work perfectly. The column names are probably not as specific as you wish. The package combines multline headers to a unique column name, including a name and unit. So you can definitly start doing your analysis without any hassle immidiately. It may be that you have to adjust at least some column names.
NA value can have many different meanings, like
- means no data available and
... the value will be reported later. This distinctions aren't represented in the cleaned data by
destatiscleanr: Every missing value, no matter the reason, is an
Possible reasons for missing values:
The package wiesbaden offers a way to get Destatis data directly from the database.
Unfortunately, this is a paid service for the main database of Destatis. Destatis offers it API now as a free service (See documentation here). Just like Regionalstatistik.de it can be accessed now as a free registered user.
- more dynamic creation of
- Clever guessing of year/date column
Shiny app to offer it non r users