---
title: "README"
author: "Bas Machielsen"
date: "5/3/2020"
output:
  html_document:
    keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
`hdng` is a package that facilitates the use of the [HDNG Database](https://datasets.iisg.amsterdam/dataverse/HDNG?q=&types=files&sort=dateSort&order=desc&page=1), a database containing quantitative information on various aspects of Dutch municipalities from about 1800 to the 1980s. The package helps users find their way through the database: they can search for variables, browse within categories, and finally run a query to extract the data they want. Below is a short demonstration of the package's functionality. The package consists of three (basic) functions:

- `hdng_names_cats`: allows the user to browse through categories. An empty function call returns the basic categories, and a call with one or more specific categories returns a data.frame with all variables and the times at which they are available. Customizable in various aspects (see the function documentation).
- `closest_matches`: returns the closest matches (according to string distance) to one or more search terms. This allows the user to quickly find variables of interest without having to browse through categories. It also supports 'within-category' search, and the kind of string distance can be customized.
- `hdng_get_data`: extracts the data. It takes variable codes as input and returns a data.frame containing those variables. Many options are customizable.
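
To illustrate the idea behind string-distance search, here is a minimal sketch in base R. It is not the package's implementation: the helper `closest_by_distance` and the example variable names are hypothetical, and it uses only Levenshtein distance (via `utils::adist`), whereas `closest_matches` lets you choose the distance.

```r
# Hypothetical helper (not part of hdng): rank candidate names by
# Levenshtein distance to a search term and return the n closest.
closest_by_distance <- function(query, candidates, n = 3) {
  # adist() returns a matrix of edit distances; row 1 belongs to the query
  d <- utils::adist(tolower(query), tolower(candidates))[1, ]
  # order() leaves ties in their original order
  ranked <- order(d)
  head(data.frame(name = candidates[ranked], distance = d[ranked],
                  stringsAsFactors = FALSE), n)
}

# Made-up variable names, for illustration only
candidates <- c("aantal auto's", "aantal paarden", "aantal inwoners", "belasting")
res <- closest_by_distance("aantal autos", candidates, n = 2)
res
```

The closest candidate here differs from the search term by a single inserted apostrophe, so it is ranked first with a distance of 1.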
## Demonstration
The package can be installed and loaded via:
```{r install, eval = FALSE}
devtools::install_github("basm92/hdng")
```
```{r library, message = FALSE}
library(hdng)
```
Depending on your machine, the installation might take a while, because the package ships a large amount of lazy-loaded data.
## Search for your variables
Generally, you can use `hdng_names_cats()` to look for specific data. An empty call gives you all available categories, whereas a call with one of the categories gives you a data.frame showing the availability of all variables within the time frame you specified:
```{r categories}
hdng_names_cats()
```
```{r categories 2}
library(magrittr)  # provides %>% in case it is not already attached
hdng_names_cats("Beroepen") %>%
  head()
hdng_names_cats("Beroepen", from = 1870, to = 1899) %>%
  head()
```
You can also apply a 'less strict' filter to the data. By default, the query filters the data to variables that are available at least once in the indicated period. You can override this option by specifying `show.only.available = FALSE`.
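```{r cat 3}
# Default: only variables available at least once between 1880 and 1940
hdng_names_cats("Welvaart", from = 1880, to = 1940)
# Override the default filter
hdng_names_cats("Welvaart", from = 1880, to = 1940, show.only.available = FALSE)
```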
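
The effect of `show.only.available` can be pictured with a toy availability table. The sketch below reflects an assumption about the behaviour described above and is not the package's code: the table `availability`, its column names, and the helper `filter_available` are all hypothetical.

```r
# Toy availability table: one row per variable and year (made-up values)
availability <- data.frame(
  variable = c("A", "A", "B", "C"),
  year     = c(1885, 1950, 1860, 1900),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the described behaviour: keep variables
# observed at least once in [from, to], or keep every variable.
filter_available <- function(tbl, from, to, show.only.available = TRUE) {
  in_period <- tbl$year >= from & tbl$year <= to
  available <- unique(tbl$variable[in_period])
  if (show.only.available) available else unique(tbl$variable)
}

strict <- filter_available(availability, from = 1880, to = 1940)
loose  <- filter_available(availability, from = 1880, to = 1940,
                           show.only.available = FALSE)
strict  # "A" "C"
loose   # "A" "B" "C"
```

Variable `B` is only observed in 1860, so it drops out under the default filter and reappears once the filter is switched off.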
You can also search for a specific term using the function `closest_matches`. By default, it returns variable codes. If you want to see the variable names instead, set `variable.names` to `TRUE`.
```{r demo closest matches}
closest_matches("aantal autos", "Welvaart", n = 20)
closest_matches("aantal autos", "Welvaart", variable.names = TRUE)
```
## Execute the query
Finally, pass the variables you selected to `hdng_get_data` to retrieve a data.frame containing the actual data. Please note that variable names are case-sensitive.
```{r}
# Example 1: variables matching "aantal autos" within "Welvaart" (n = 3)
variables <- closest_matches("aantal autos", "Welvaart", n = 3)
hdng_get_data(variables,
              gemeenten = c("Amsterdam", "Rotterdam", "Utrecht", "'s Gravenhage"),
              col.names = TRUE)

# Example 2: the first 10 variables listed for the "Beroepen" category
query <- hdng_names_cats("Beroepen")
variables2 <- query$var[1:10]
hdng_get_data(variables2,
              gemeenten = c("Groningen", "Leeuwarden", "Assen", "Zwolle"),
              col.names = TRUE)
```
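
Since variable names are case-sensitive, it can help to check your inputs before running a query. The helper below is a hypothetical sketch in base R and is not part of `hdng`; the vector of known names is made up for illustration.

```r
# Hypothetical helper: report requested names with no exact (case-sensitive)
# match, and suggest a known name that differs only in letter case.
check_names <- function(requested, known) {
  unmatched <- requested[!requested %in% known]
  # match() on lower-cased strings finds a case-insensitive counterpart, or NA
  suggestion <- known[match(tolower(unmatched), tolower(known))]
  data.frame(requested = unmatched, suggestion = suggestion,
             stringsAsFactors = FALSE)
}

# Made-up names, for illustration only
known <- c("Aantal autos", "Aantal paarden", "Aantal inwoners")
checked <- check_names(c("Aantal autos", "aantal paarden", "Aantal fietsen"), known)
checked
```

Names that match exactly are left out of the result. A name that differs only in case gets a suggestion, and one with no counterpart gets `NA`.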
## Future work
I plan to add a function that takes variable names (or their closest matches) as input and returns variable codes, so that users no longer have to look up the codes themselves before passing them to `hdng_get_data`. Alternatively, I might write an extraction function that takes names rather than codes as input. I also want to support searches by province. Let me know if you have any suggestions. Thank you for reading!