defmodule DataQuacker do
@moduledoc """
DataQuacker is a library which aims to help with validating, transforming and parsing non-sandboxed data.
The most common example for such data, and the original idea behind this project, is CSV files.
The scope of this library is not, however, in any way limited to CSV files.
This library ships by default with two adapters: `DataQuacker.Adapters.CSV` for CSV files,
and `DataQuacker.Adapters.Identity` for "in-memory data".
Any other data source may be used with the help of a third-party adapter; see: `DataQuacker.Adapter`.
This library is comprised of three main components:
- `DataQuacker`, which provides the `parse/4` function to parse data using a schema
- `DataQuacker.Schema`, which provides a DSL for declaratively defining schemas that describe the mapping between the source data and the desired output
- `DataQuacker.Adapters.CSV` and `DataQuacker.Adapters.Identity`, which extract data from sources into a format required by the `parse/4` function
> Note: If you find anything missing from or unclear in the documentation, please do not hesitate to open an issue on the project's [Github repository](https://github.com/fiodorbaczynski/data_quacker).
## Testing
Tests for parsing external or non-sandboxed data are often difficult to implement well,
since that data may need to change over time.
For example, editing the CSV files used in tests whenever the requirements change
can be tedious.
For this reason, using a different adapter for tests, one which takes Elixir data as input, is recommended.
In integration tests for this library the `DataQuacker.Adapters.Identity` adapter is used.
The easiest way to switch out adapters in tests is to put the desired adapter in the `test.exs` config.
You can find out how to do this under the "Options" section in the documentation for the `parse/4` function.
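For example, a `config/test.exs` along these lines (a sketch; it uses the same keys as the "Options" section below) would switch tests over to the in-memory adapter:

```elixir
# config/test.exs (a sketch, assuming the classic Mix.Config style)
use Mix.Config

# Use the in-memory Identity adapter in tests instead of the CSV adapter
config :data_quacker,
  adapter: DataQuacker.Adapters.Identity,
  adapter_opts: []
```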
## Examples
> Note: Most of the "juice", like transforming, validating, nesting, skipping, etc., is in the `DataQuacker.Schema` module, so the more complex and interesting examples also live there. Please take a look at its documentation for more in-depth examples.
> Note: A fully working implementation of these examples can be found in the tests inside the "examples" directory.
Given the following table of ducks in a pond, in the form of a CSV file:
| Type | Colour | Age |
|:--------:|:--------------:|-----|
| Mallard | green | 3 |
| Domestic | white | 2 |
| Mandarin | multi-coloured | 4 |
we want to have a list of maps with `:type`, `:colour` and `:age` as the keys.
This can be achieved by creating the following schema and parser modules:
Schema
```elixir
defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)
      source(~r/colou?r/i)
    end

    field :age do
      source("age")
    end
  end
end
```
Parser
```elixir
defmodule PondParser do
  def parse(file_path) do
    DataQuacker.parse(
      file_path,
      PondSchema.schema_structure(:pond),
      nil
    )
  end
end
```
```elixir
iex> PondParser.parse("path/to/file.csv")
iex> {:ok, [
iex> {:ok, %{type: "Mandarin", colour: "multi-coloured", age: "4"}},
iex> {:ok, %{type: "Domestic", colour: "white", age: "2"}},
iex> {:ok, %{type: "Mallard", colour: "green", age: "3"}},
iex> ]}
```
Using this schema and parser we get a tuple of `:ok` or `:error`, and a list of rows,
each of which is also a tuple of `:ok` or `:error`, but with a map as the second element.
The topmost `:ok` or `:error` indicates whether *all* rows are valid,
and those for individual rows indicate whether that particular row is valid.
> Note: The rows in the result are in the reverse order compared to the source rows. This is because for large lists reversing may be an expensive operation, which is often redundant, for example if the result is supposed to be inserted in a database.
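If the source order does matter, or you want to split successes from failures, the result can be post-processed with standard `Enum` functions; a minimal sketch on a literal result of the shape shown above:

```elixir
# A result of the shape returned by DataQuacker.parse/4 (rows in reverse order)
{_status, rows} =
  {:ok,
   [
     {:ok, %{type: "Mandarin", colour: "multi-coloured", age: "4"}},
     {:ok, %{type: "Domestic", colour: "white", age: "2"}},
     {:ok, %{type: "Mallard", colour: "green", age: "3"}}
   ]}

# Restore the source order only if you actually need it
in_source_order = Enum.reverse(rows)

# Collect just the successfully parsed maps
oks = for {:ok, row} <- rows, do: row
```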
Now suppose we also want to validate that the type is one in a list of types we know,
and get the age in the form of an integer.
We need to make some changes to our schema:
```elixir
defmodule PondSchema do
  use DataQuacker.Schema

  schema :pond do
    field :type do
      validate(fn type -> type in ["Mallard", "Domestic", "Mandarin"] end)

      source("type")
    end

    field :colour do
      # make the "u" optional
      # in case we get an American data source :)
      source(~r/colou?r/i)
    end

    field :age do
      transform(fn age_str ->
        case Integer.parse(age_str) do
          {age_int, _} -> {:ok, age_int}
          :error -> :error
        end
      end)

      source("age")
    end
  end
end
```
Using the same input file the output is now:
```elixir
iex> PondParser.parse("path/to/file.csv")
iex> {:ok, [
iex> {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
iex> {:ok, %{type: "Domestic", colour: "white", age: 2}},
iex> {:ok, %{type: "Mallard", colour: "green", age: 3}},
iex> ]}
```
(the only difference is the type of `:age`, which is now an integer)
If we add some invalid fields to the file, however, the result will be quite different:
| Type | Colour | Age |
|:--------:|:--------------:|----------|
| Mallard | green | 3 |
| Domestic | white | 2 |
| Mandarin | multi-coloured | 4 |
| Mystery | golden | 100 |
| Black | black | Infinity |
```elixir
iex> PondParser.parse("path/to/file.csv")
iex> {:error, [
iex> :error,
iex> :error,
iex> {:ok, %{type: "Mandarin", colour: "multi-coloured", age: 4}},
iex> {:ok, %{type: "Domestic", colour: "white", age: 2}},
iex> {:ok, %{type: "Mallard", colour: "green", age: 3}},
iex> ]}
```
Since the last two rows of the input are invalid, the first two rows in the output are errors.
> Note: The errors can be made more descriptive by returning tuples `{:error, any()}` from the validators and parsers. You can see this in action in the examples for the `DataQuacker.Schema` module.
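For instance, a validator along these lines (a hypothetical standalone function, shown outside the schema DSL for clarity) returns a tagged reason instead of a bare `:error`:

```elixir
# Hypothetical validator: returns :ok for known duck types,
# and a descriptive {:error, reason} tuple otherwise
validate_type = fn type ->
  if type in ["Mallard", "Domestic", "Mandarin"] do
    :ok
  else
    {:error, "unknown duck type: #{inspect(type)}"}
  end
end

validate_type.("Mallard")
# => :ok

validate_type.("Mystery")
# => {:error, "unknown duck type: \"Mystery\""}
```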
"""
alias DataQuacker.Builder
@doc """
Takes in a source, a schema, support data, and a keyword list of options.
Returns a tuple with `:ok` or `:error` (indicating whether all rows are valid) as the first element,
and a list of `{:ok, map()} | {:error, any()} | :error` as the second element.
In case of `{:ok, map()}` for a given row, the map is the output defined in the schema.
## Source
Any data which will be given to the adapter so that it can retrieve the source data.
In case of the `DataQuacker.Adapters.CSV` adapter this can be a file path or a file URL.
## Schema
A schema formed with the DSL from `DataQuacker.Schema`.
## Support data
Any data which is supposed to be accessible inside various schema elements when parsing a source.
## Options
The options can also be specified in the config, for example:
```elixir
use Mix.Config
# ...
config :data_quacker,
adapter: DataQuacker.Adapters.Identity,
adapter_opts: []
# ...
```
- `:adapter` - the adapter module to be used to retrieve the source data; defaults to `DataQuacker.Adapters.CSV`
- `:adapter_opts` - a keyword list of opts to be passed to the adapter; defaults to `[separator: ?,, local?: true]`; for a list of available adapter options see the documentation for the particular adapter
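Both options can also be passed directly to `parse/4`, overriding the config; a sketch (assuming the `PondSchema` from the examples above, with `source` shaped however the chosen adapter expects):

```elixir
# Sketch: calling parse/4 with explicit options instead of relying on config;
# `source` must match whatever DataQuacker.Adapters.Identity expects as input
DataQuacker.parse(
  source,
  PondSchema.schema_structure(:pond),
  nil,
  adapter: DataQuacker.Adapters.Identity,
  adapter_opts: []
)
```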
"""
  @spec parse(any(), map(), any(), Keyword.t()) ::
          {:ok, list({:ok, map()} | {:error, any()} | :error)}
          | {:error, list({:ok, map()} | {:error, any()} | :error)}
  def parse(source, schema, support_data, opts \\ []) do
    with opts <- apply_default_opts(opts),
         adapter <- get_adapter(opts),
         {:ok, source} <- adapter.parse_source(source, get_adapter_opts(opts)) do
      Builder.call(source, schema, support_data, adapter)
    end
  end

  defp apply_default_opts(opts) do
    default_opts()
    |> Keyword.merge(Application.get_all_env(:data_quacker))
    |> Keyword.merge(opts)
  end

  defp default_opts do
    [
      adapter: DataQuacker.Adapters.CSV,
      adapter_opts: [separator: ?,, local?: true]
    ]
  end

  defp get_adapter(opts) do
    Keyword.get(opts, :adapter)
  end

  defp get_adapter_opts(opts) do
    Keyword.get(opts, :adapter_opts, [])
  end
end