/
data_structure.Rmd
136 lines (89 loc) · 6.97 KB
/
data_structure.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: "Detecting the structure of data tables"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Detecting the structure of data tables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(countries)
```
The package introduces several functions that identify special columns in country data. These functions can be used to automate basic tasks and detect data formats. Here is an overview of a few functionalities.
## Example dataset
In this vignette we will use [a dataset of country policies](https://fbellelli.com/EDB-data.html) as an example for the functions in the package. This dataset contains 41 columns and has a rather complex structure. Every row is identified by a combination of Country name (column 21), Year (Column 9), Policy measure number (column 1), and HS code (column 39) --- which is a code identifying traded commodities. This dataset is an extension based on the [WTO environmental databse (EDB)](https://edb.wto.org/). More information on the dataset can be found [here](https://www.wto.org/english/res_e/reser_e/ersd202203_e.pdf).
```{r}
#download data
temp <- tempfile()
download.file("https://fbellelli.com/data%20files/Extended%20EDB%20by%20measure-country-HS%20chapter.zip", temp, mode="wb")
#unzip and load it in R
library(data.table)
example <- fread(
unzip(temp,"Extended EDB by measure-country-HS chapter/Extended EDB (by measure-country-HS chapter).csv"),
stringsAsFactors=FALSE,
encoding = "Latin-1")
```
```{r include=FALSE}
#Delete downloaded data from disk
unlink("Extended EDB by measure-country-HS chapter", recursive = TRUE)
```
## Finding columns containing country and time information
The functions `find_countrycol()` and `find_timecol()` can be used to return respectively all columns containing country names and date/year information.
In the example below, the function `find_countrycol()` is used to examine our example dataset. The argument `min_share` is used to indicate whether to look for an entire column of Country names or any column that contains Country names. This is a numeric value indicating the minimum share of entries in the column that are country names.
```{r}
#Columns of country names
find_countrycol(example)
#Return any column containing a country name
find_countrycol(example, min_share=0)
```
Similarly, we can look for time columns in the following way:
```{r}
# Date and year columns
find_timecol(example)
```
By default, the function will return the name of the column. However, the argument `return_index` can be used to get the indices of the column in the table.
```{r}
find_countrycol(example, return_index = TRUE)
find_timecol(example, return_index = TRUE)
```
Computations may take long on large tables with many columns. To speed up operations, these functions operate by evaluating the table on a random sample of rows. The sample size can be adjusted with the argument `sample_size`. It also possible to request to use the entire table by passing the value `NA`. Finally, it is also possible to request not to return columns containing `NA` values by specifying: `allow_NA = FALSE`.
These two functions are based on `is_country()` and `is_date()`, which can be used to test if a string is a Country name and a date. An overview of the first function is provided in the vignette *[Dealing with country names](https://fbellelli.github.io/countries/articles/dealing_with_names.html)*, here we have a look at `is_date()´.
## Testing dates
`is_date()` takes a string vector as argument and outputs a boolean vector indicating whether the strings are dates. The argument `formats` can be used to specify the formats to be checked. Standard R notation can be used for date elements (see the table below).
```{r}
test <- c("December 2022", "20/01/1970", "Banana", "12.13.2000")
is_date(test)
is_date(test, formats=c("%d/%m/%Y"))
```
```{r echo=FALSE}
tab <- data.frame(Symbol= c("%d","%a","%A","%m","%b","%B","%y","%Y"),
Definition = c("Day number","Abbreviated day name","Full day name", "Month number","Abbreviated month name","Full month name","Year 2-digits","Year 4-digits"),
Example=c("13","Mon","Monday","2","Feb","February","22","2022"))
knitr::kable(tab)
```
## Finding table keys
`find_keycol()` is a function that can be used to automate the search of key columns in datasets. A table key is a column or set of columns that uniquely identifies the entries in the table. This function was specifically designed for country data, so it will prioritise the search of country and time columns, which are often keys in country data.
As shown below, the function correctly identifies the keys for the complex dataset introduced at the beginning of this article.
```{r}
find_keycol(example, allow_NA = TRUE)
```
The function outputs a vector of column names that uniquely identifies the entries in the table. In addition, it also indicates whether the column contains *country*, *time* or *other* information. Just like the other *find* function in this page, it is possible to request the column index instead of the name by passing `return_index = TRUE`.
Due to the potentially very high number of column combinations in a table, the function only tests the more likely options. The function will first look for country and time columns. Next, left-most columns are prioritised. Also, the function will only look for the most common country data formats (e.g. cross-sectional, time-series, panel data, dyadic, etc.) and only searches for up to two additional key columns of type `other`. If no key is found at the end of the search, the function returns a `NULL` value.
Computation can be slow on large datasets. Therefore, the function evaluates only a random sample of rows from the table. The sample size can be controlled with `sample_size`. Moreover, it is also possible to restrict the search to a subset of columns by providing a vector of column names or indices in the argument `search_only`.
```{r}
#This will only check if a key is found among the first three columns
find_keycol(example, allow_NA = TRUE, search_only = 1:3)
```
Finally, the function also permits the user to specify whether columns containing `NA` values are allowed to be considered as table keys. The default is `allow_NA = FALSE` because keys should typically not contain missing values. However, in our example dataset the column `Tentative.HS.chapters.match` contains blank entries which are read in R as `NA`, therefore the default would return `NULL`. The argument `allow_NA` is present to deal with these edge cases.
```{r}
find_keycol(example, allow_NA = FALSE)
```
If the user wants to test a specific set of columns, the package also provides the function `is_keycol()` to check whether the provided columns uniquely identify the entries in the dataset.
```{r}
is_keycol(example, c("COUNTRIES","Year"), allow_NA = TRUE)
is_keycol(example, c("COUNTRIES","Year", "Nr","Tentative HS chapters match"), allow_NA = TRUE)
```