-
Notifications
You must be signed in to change notification settings - Fork 4
/
tableschema.r.R
178 lines (178 loc) · 9.81 KB
/
tableschema.r.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
#' Table Schema Package
#' @description Table class for working with data and schema
#' @docType package
#'
#' @section Introduction:
#'
#' Table Schema is a simple language- and implementation-agnostic way to declare a schema for tabular data.
#' Table Schema is well suited for use cases around handling and validating tabular data in text formats
#' such as CSV, but its utility extends well beyond this core usage, towards a range of applications
#' where data benefits from a portable schema format.
#'
#' @section Concepts:
#' #
#' @section Tabular data:
#' Tabular data consists of a set of rows. Each row has a set of fields (columns).
#' We usually expect that each row has the same set of fields and thus we can talk about
#' the fields for the table as a whole.
#'
#' In case of tables in spreadsheets or CSV files we often interpret the first row as a header row,
#' giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases,
#' the field names are explicitly designated.
#'
#' @section Physical and logical representation:
#'
#' In order to talk about the representation and processing of tabular data from text-based sources,
#' it is useful to introduce the concepts of the \emph{physical} and the \emph{logical} representation of data.
#'
#' The \emph{physical representation} of data refers to the representation of data as text on disk,
#' for example, in a CSV or JSON file. This representation may have some type information
#' (JSON, where the primitive types that JSON supports can be used) or not
#' (CSV, where all data is represented in string form).
#'
#' The \emph{logical representation} of data refers to the "ideal" representation of the data
#' in terms of primitive types, data structures, and relations, all as defined by the specification.
#' We could say that the specification is about the logical representation of data,
#' as well as about ways in which to handle conversion of a physical representation to a logical one.
#'
#' In this document, we'll explicitly refer to either the \emph{physical} or \emph{logical} representation
#' in places where it prevents ambiguity for those engaging with the specification, especially implementors.
#'
#' For example, \code{constraints} should be tested on the logical representation of data,
#' whereas a property like \code{missingValues} applies to the physical representation of the data.
#'
#'
#' @section Descriptor:
#'
#' A Table Schema is represented by a descriptor. The descriptor \code{MUST} be a JSON \code{object}
#' (JSON is defined in \href{https://www.ietf.org/rfc/rfc4627.txt}{RFC 4627}).
#'
#' It \code{MUST} contain a property \code{fields}. \code{fields} \code{MUST} be an array/list where each entry
#' in the array/list is a field descriptor (as defined below). The order of elements in \code{fields} array/list
#' \code{MUST} be the order of fields in the CSV file. The number of elements in \code{fields} array/list
#' \code{SHOULD} be exactly the same as the number of fields in the CSV file.
#'
#' The descriptor \code{MAY} have the additional properties set out below and \code{MAY} contain
#' any number of other properties (not defined in this specification).
#'
#'
#' @section Field Descriptors:
#' See \code{\link{Field}} Class
#'
#' @section Types and Formats:
#' See \code{\link{Types}} Class
#'
#' @section Constraints:
#' See \code{\link{Constraints}} Class
#'
#' @section Other Properties:
#' In additional to field descriptors, there are the following "table level" properties.
#'
#' @section Missing Values:
#'
#' Many datasets arrive with missing data values, either because a value was not collected or
#' it never existed. Missing values may be indicated simply by the value being empty in other
#' cases a special value may have been used e.g. -, NaN, 0, -9999 etc.
#'
#' \code{missingValues} dictates which string values should be treated as null values.
#' This conversion to null is done before any other attempted type-specific string conversion.
#' The default value \code{list("")} means that empty strings will be converted to null before any other
#' processing takes place. Providing the empty list means that no conversion to null will
#' be done, on any value.
#'
#' \code{missingValues} \code{MUST} be a list where each entry is a string.
#'
#' \strong{Why strings}: \code{missingValues} are strings rather than being the data type of the particular field.
#' This allows for comparison prior to casting and for fields to have missing value which are not
#' of their type, for example a \code{number} field to have missing values indicated by -.
#'
#' Examples:
#' \itemize{
#' \item{missingValues = list("")}
#' \item{missingValues = list("-")}
#' \item{missingValues = list("NaN", "-")}
#' }
#'
#' @section Primary Key:
#' A primary key is a field or set of fields that uniquely identifies each row in the table.
#'
#' The \code{primaryKey} entry in the schema \code{object} is optional. If present it specifies the primary key for this table.
#'
#' The \code{primaryKey}, if present, \code{MUST} be:
#' \itemize{
#' \item{Either: an array of strings with each string corresponding to one of the field \code{name} values in the \code{fields} array
#' (denoting that the primary key is made up of those fields). It is acceptable to have an array with a single value
#' (indicating just one field in the primary key). Strictly, order of values in the array does not matter.
#' However, it is RECOMMENDED that one follow the order the fields in the \code{fields} has as client applications may
#' utitlize the order of the primary key list (e.g. in concatenating values together).
#' }
#' \item{Or: a single string corresponding to one of the field \code{name} values in the \code{fields} array/list
#' (indicating that this field is the primary key). Note that this version corresponds to the array form
#' with a single value (and can be seen as simply a more convenient way of specifying a single field primary key).
#' }
#' }
#'
#'
#' @section Foreign Keys:
#' A foreign key is a reference where values in a field (or fields) on the table ('resource' in data package terminology)
#' described by this Table Schema connect to values a field (or fields) on this or a separate table (resource).
#' They are directly modelled on the concept of foreign keys in SQL.
#'
#' The \code{foreignKeys} property, if present, \code{MUST} be a list Each entry in the array must be a \code{foreignKey}.
#' A \code{foreignKey} \code{MUST} be a object and \code{MUST} have the following properties:
#'
#' \itemize{
#' \item{\code{fields} - \code{fields} is a string or array specifying the field or fields on this resource that form the source part
#' of the foreign key. The structure of the string or array is as per \code{primaryKey} above.}
#'
#' \item{\code{reference} - \code{reference} \code{MUST} be a object. The object
#' \itemize{
#'
#' \item{\code{MUST} have a property \code{resource} which is the name of the resource within the current data package
#' (i.e. the data package within which this Table Schema is located). For self-referencing foreign keys,
#' i.e. references between fields in this Table Schema, the value of \code{resource} \code{MUST} be \code{""} (i.e. the empty string).}
#'
#' \item{\code{MUST} have a property \code{fields} which is a string if the outer \code{fields} is a string,
#' else an array of the same length as the outer \code{fields}, describing the field (or fields) references
#' on the destination resource. The structure of the string or array is as per \code{primaryKey} above.}
#' }
#'
#' }
#' }
#'
#'
#' \strong{Comment}: Foreign Keys create links between one Table Schema and another Table Schema,
#' and implicitly between the data tables described by those Table Schemas. If the foreign key is referring to
#' another Table Schema how is that other Table Schema discovered? The answer is that a Table Schema will
#' usually be embedded inside some larger descriptor for a dataset, in particular as the schema for a resource
#' in the resources array of a href{http://frictionlessdata.io/specs/data-package/}{Data Package}.
#' It is the use of Table Schema in this way that permits a meaningful use of a non-empty \code{resource}
#' property on the foreign key.
#'
#'
#' @section Details:
#'
#' \href{https://CRAN.R-project.org/package=jsonlite}{Jsolite package} is internally used to convert json data to list objects. The input parameters of functions could be json strings,
#' files or lists and the outputs are in list format to easily further process your data in R environment and exported as desired.
#' More details about handling json you can see jsonlite documentation or vignettes \href{https://CRAN.R-project.org/package=jsonlite}{here}.
#'
#' \href{https://CRAN.R-project.org/package=future}{Future package} is also used to load and create Table and Schema classes asynchronously.
#' To retrieve the actual result of the loaded Table or Schema you have to use \code{\link[future]{value}} function to the variable you stored the loaded Table/Schema.
#' More details about future package and sequential and parallel processing you can find \href{https://CRAN.R-project.org/package=future}{here}.
#'
#' Examples section of each function show how to use \code{jsonlite} and \code{future} packages with \code{tableschema.r}.
#'
#' Term array refers to json arrays which if converted in R will be \code{\link[base:list]{list objects}}.
#'
#' @section Language:
#' The key words \code{MUST}, \code{MUST NOT}, \code{REQUIRED}, \code{SHALL}, \code{SHALL NOT},
#' \code{SHOULD}, \code{SHOULD NOT}, \code{RECOMMENDED}, \code{MAY}, and \code{OPTIONAL}
#' in this package documents are to be interpreted as described in \href{https://www.ietf.org/rfc/rfc2119.txt}{RFC 2119}.
#'
#'
#' @name tableschema.r-package
#'
#' @seealso
#' \href{https://specs.frictionlessdata.io//table-schema/}{Table Schema Specifications}
#'
NULL