Commit

Merge pull request #50 from cite-architecture/dev
Major redesign of package
neelsmith committed Dec 31, 2021
2 parents b515730 + a62b7ad commit dcddb94
Showing 38 changed files with 1,788 additions and 288 deletions.
4 changes: 1 addition & 3 deletions Project.toml
@@ -1,12 +1,11 @@
name = "CiteEXchange"
uuid = "e2e9ead3-1b6c-4e96-b95f-43e6ab899178"
authors = ["Neel Smith <dnsmith.neel@gmail.com>"]
version = "0.7.0"
version = "0.8.0"

[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
CitableBase = "d6f014bd-995c-41bd-9893-703339864534"
CitableObject = "e2b2f5ea-1cd8-4ce8-9b2b-05dad64c2a57"
DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
@@ -15,7 +14,6 @@ Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[compat]
CSV = "0.9"
CitableBase = "8"
CitableObject = "0.11"
DocStringExtensions = "0.8"
Documenter = "0.27"
HTTP = "0.9"
5 changes: 4 additions & 1 deletion docs/make.jl
Original file line number Diff line number Diff line change
@@ -21,9 +21,12 @@ makedocs(
sitename = "CiteEXchange",
pages = [
"Overview" => "index.md",

"The `blocks` function" => "blocks.md",
"The `data` function" => "data.md",

"API documentation" => "apis.md"


]

)
23 changes: 13 additions & 10 deletions docs/src/apis.md
@@ -1,21 +1,24 @@
# API documentation

```@meta
CurrentModule = CiteEXchange
```

## Exported functions and types
## Exported types and functions

### The `Block` data type

```@docs
Block
blocktypes
cexversion
```

### The `blocks` function

```@docs
blocks
blocktypes
blocksfortype
datafortype
```

## Internals
### The `data` function

```@docs
blocktype
```
data
```
84 changes: 84 additions & 0 deletions docs/src/blocks.md
@@ -0,0 +1,84 @@
```@setup blocks
root = pwd() |> dirname |> dirname
f = joinpath(root, "test", "assets", "burneyex.cex")
```


# The `blocks` function

The `blocks` function can:

- parse a CEX data source into `Block`s, optionally filtering it by block type
- filter a list of `Block`s by block type

It always returns a (possibly empty) Vector of `Block`s.
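
As a rough illustration of the behaviour described above, the following self-contained sketch splits CEX-like text into labelled blocks and filters them by type. It is not the package's implementation: `SketchBlock` and `sketchblocks` are hypothetical names, the sample text is made up, and the sketch assumes that a header line has the form `#!label` and that lines before the first header can be dropped.

```julia
# Hypothetical stand-in for the package's `Block` type: a label plus its lines.
struct SketchBlock
    label::String
    lines::Vector{String}
end

# Split CEX-like text into labelled blocks.
# Assumption: a header line looks like `#!label`.
function sketchblocks(cex::AbstractString)
    found = SketchBlock[]
    for raw in split(cex, '\n')
        line = strip(raw)
        isempty(line) && continue   # skip empty or whitespace-only lines
        if startswith(line, "#!")
            # Start a new block labelled with the text after `#!`.
            push!(found, SketchBlock(String(line[3:end]), String[]))
        elseif !isempty(found)
            # Attach the line to the most recently opened block.
            push!(last(found).lines, String(line))
        end
    end
    found
end

# Keep only blocks of one type; an unmatched type yields an empty Vector.
sketchblocks(cex::AbstractString, blocktype::AbstractString) =
    filter(b -> b.label == blocktype, sketchblocks(cex))

# Made-up sample text, for illustration only.
sample = """
#!ctscatalog
made-up catalog line

#!ctsdata
made-up data line 1
made-up data line 2
"""

allblocks = sketchblocks(sample)
catalogonly = sketchblocks(sample, "ctscatalog")
nomatch = sketchblocks(sample, "nosuchtype")
```

Returning an empty Vector for an unmatched type, rather than `nothing`, lets callers iterate over the result without a special case.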

## Parsing a CEX data source

The following examples parse a CEX source with two blocks: a `ctscatalog` block and a `ctsdata` block. They parse identical data from a URL, from a file (`f` in the example below is `test/assets/burneyex.cex` in this GitHub repository), and from a string value, using `blocks` with a specified "reader".

Parse CEX from a URL:

```@example blocks
using CiteEXchange
url = "https://raw.githubusercontent.com/cite-architecture/CiteEXchange.jl/main/test/assets/burneyex.cex"
urlblocks = blocks(url, CiteEXchange.UrlReader)
```

From a file:

```@example blocks
fileblocks = blocks(f, CiteEXchange.FileReader)
```

From a string:

```@example blocks
cexstring = read(f, String)
stringblocks = blocks(cexstring, CiteEXchange.StringReader)
```

The default is to parse from a string.

```@example blocks
defaultblocks = blocks(cexstring)
```

The results are equivalent.

```@example blocks
urlblocks == fileblocks == stringblocks == defaultblocks
```


## Filter CEX source by type

Specify the String value of a CEX block type as an additional parameter to filter the resulting Vector of `Block`s to include only blocks of that type.


```@example blocks
urlcatalog = blocks(url, CiteEXchange.UrlReader, "ctscatalog")
filecatalog = blocks(f, CiteEXchange.FileReader, "ctscatalog")
stringcatalog = blocks(cexstring, CiteEXchange.StringReader, "ctscatalog")
defaultcatalog = blocks(cexstring, "ctscatalog")
```


```@example blocks
urlcatalog == filecatalog == stringcatalog == defaultcatalog
```


## Filter a list of `Block`s by type

The `blocks` function can also be used to select blocks of a given type from a list of `Block`s.


```@example blocks
filteredcatalog = blocks(fileblocks, "ctscatalog")
```


```@example blocks
filteredcatalog == filecatalog
```
138 changes: 138 additions & 0 deletions docs/src/data.md
@@ -0,0 +1,138 @@
```@setup data
root = pwd() |> dirname |> dirname
f = joinpath(root, "test", "assets", "burneyex.cex")
using CitableBase
using CiteEXchange
struct UnstructuredUrn <: Urn
    id::AbstractString
end
import Base: show
function show(io::IO, u::UnstructuredUrn)
    print(io, u.id)
end
struct MyComparable <: UrnComparisonTrait end
import CitableBase: urncomparisontrait
function urncomparisontrait(::Type{UnstructuredUrn})
    MyComparable()
end
import CitableBase: urnequals
function urnequals(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
    u1 == u2
end
import CitableBase: urncontains
function urncontains(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
    c1 = components(u1.id)
    c2 = components(u2.id)
    c1[3] == c2[3]
end
import CitableBase: urnsimilar
function urnsimilar(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
    urncontains(u1, u2)
end
root = pwd() |> dirname |> dirname
f = joinpath(root, "test", "assets", "laxlibrary1.cex")
```

# The `data` function

The `data` function can:

- select data lines for a specified block type from a CEX source or from a list of `Block`s
- optionally filter data by a URN value

It always returns a (possibly empty) Vector of string values representing CEX data lines.

## Select data lines from CEX sources

In this example, we work with a CEX source that has several different kinds of CEX blocks, including two `ctsdata` blocks with passages from two different texts. We can collect all of the text data lines using the same syntax as for the `blocks` function.


```@example data
url = "https://raw.githubusercontent.com/cite-architecture/CiteEXchange.jl/dev/test/assets/laxlibrary1.cex"
str = read(f, String)
lines1 = data(f, CiteEXchange.FileReader, "ctsdata")
lines2 = data(url, CiteEXchange.UrlReader, "ctsdata")
lines3 = data(str, CiteEXchange.StringReader, "ctsdata")
lines4 = data(str, "ctsdata")
```
```@example data
lines1 == lines2 == lines3 == lines4
```


Note in particular that `citerelationset` blocks have three lines of metadata before the relations data. These three lines appear in the `lines` field of a block, but are not included in the output of `data`.

```@example data
relblocks = blocks(str, "citerelationset")
relblocks[1].lines
```

```@example data
data(str, "citerelationset")
```

## Select data lines from a list of `Block`s

Instead of a CEX source, you can also directly supply a list of blocks (without a "reader" type).

```@example data
blockgroup = blocks(str)
blocklines = data(blockgroup, "ctsdata")
```

```@example data
blocklines == lines3
```

## Filter data lines by URN

The `data` function optionally accepts a third parameter with a URN value to filter on by URN containment. The background setup for this page has defined a subtype of `Urn` called `UnstructuredUrn` that accepts any kind of URN string, and has implemented the `UrnComparisonTrait` for the type, so we can use `UnstructuredUrn` values to filter the data from blocks in our source.

!!! note "Realistic URN types"

    The `UnstructuredUrn` is used solely for the purposes of testing the `CiteEXchange` package. In our experience, we can cover all needs for scholarly citation with either the `CtsUrn` type of [the `CitableText` package](https://cite-architecture.github.io/CitableText.jl/stable/), or the `Cite2Urn` of [the `CitableObject` package](https://github.com/cite-architecture/CitableObject.jl).


When we collected all the `ctsdata` lines, we got five passages from two different texts. Now we'll filter the request to get data from a single text.

```@example data
urn = UnstructuredUrn("urn:cts:citedemo:gburg")
textdata = data(str, "ctsdata", urn)
```

URN filtering can be used with any of the variations of the `data` function, including filtering `Block`s.

```@example data
blks = blocks(f, CiteEXchange.FileReader)
textfromblocks = data(blks, "ctsdata", urn)
```

### Negating a URN filter

To collect all data lines that are *not* contained by a URN filter, set the optional parameter `complement` to `true`.

```@example data
urn = UnstructuredUrn("urn:cts:citedemo:gburg")
textdata = data(str, "ctsdata", urn, complement=true)
```
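
The effect of a `complement` flag can be sketched without the package. The helper below is hypothetical (`sketchfilter` is not part of `CiteEXchange`), the passages are made up, and two simplifying assumptions are made: the URN is the first `|`-delimited column of each data line, and containment is approximated by a plain string-prefix test rather than by real URN comparison.

```julia
# Hypothetical helper: keep lines whose first `|`-delimited column starts
# with `urnprefix`, or the remaining lines when `complement` is true.
# Assumption: a string-prefix test stands in for real URN containment.
function sketchfilter(lines::Vector{String}, urnprefix::AbstractString; complement::Bool = false)
    filter(lines) do ln
        urn = first(split(ln, '|'))
        # A line is kept when its match status differs from `complement`.
        startswith(urn, urnprefix) != complement
    end
end

# Made-up data lines, for illustration only.
passages = [
    "urn:cts:citedemo:gburg.v1:1|made-up passage A",
    "urn:cts:citedemo:other.v1:1|made-up passage B",
]

kept = sketchfilter(passages, "urn:cts:citedemo:gburg")
dropped = sketchfilter(passages, "urn:cts:citedemo:gburg", complement = true)
```

In this sketch the two results partition the input: every line lands in exactly one of `kept` and `dropped`. A prefix test is cruder than URN-aware containment, since it would also match an identifier that merely begins with the same characters.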


### Filtering `citerelationset`s

Note that when filtering `citerelationset`s by URN value, the filter applies to the URN for an entire relation set, *not* to URNs in individual relations.

```@example data
relsetfilter = UnstructuredUrn("urn:cite2:hmt:dse.v1:")
data(str, "citerelationset", relsetfilter)
```

44 changes: 16 additions & 28 deletions docs/src/index.md
@@ -6,57 +6,45 @@ f = joinpath(root, "test", "assets", "burneyex.cex")

# CiteEXchange

*Parse strings and files in CEX format.*
*Parse data in the delimited-text CEX format.*

Cite EXchange format (CEX) is a plain-text format for serializing citable scholarly resources.

## Quick introduction
Cite EXchange format (CEX) is a plain-text format for serializing citable scholarly resources. CEX organizes data in one or more blocks defined by a CEX header line. Using the `CiteEXchange` package, you can work with data from CEX sources as labelled `Block`s with associated lines of metadata and data, can extract data contents by CEX block type, and can filter contents by URN.

The plain-text CEX format organizes data in one or more blocks defined by a CEX header line.
Reading CEX source data with the `blocks` function creates an array of `Block`s, each of which has a label identifying the block type, followed by a series of data lines. (Empty or whitespace-only lines are ignored.) You can use `blocks`:

- with a single argument to parse a string of CEX data
- with a file name and a second parameter `FileReader` to parse CEX data from a file
- with a URL string and a second parameter `UrlReader` to parse CEX retrieved from a URL

This example reads a file with two blocks, `ctscatalog` and `ctsdata` block.
## Quick introduction

!!! note
You can use the `blocks` function to read source data into a Vector of `Block` objects. This example reads a file with two blocks, one labelled `ctscatalog` and one labelled `ctsdata`.

The file `f` in the example below is `test/assets/burneyex.cex` in this github repository.

```@example simple
using CiteEXchange
blocklist = CiteEXchange.blocks(f, FileReader)
blocklist |> length
blocklist = blocks(f, CiteEXchange.FileReader)
```

!!! note

The file `f` in the example below is `test/assets/burneyex.cex` in this github repository.

### Work with contents of an individual block

You can work directly the array of blocks:
Each `Block` has a label and an array of data lines. You can work directly with the array of blocks:

```@example simple
```
blocklist[1].label
```

```@example simple
```
blocklist[1].lines
```


### Work with an array of `Block`s

`CiteEXchange` also has functions that work with arrays of `Block`s. You can see what types of blocks are present.
## In more detail

```@example simple
blocktypes(blocklist)
```
The `CiteEXchange` package provides two main functions for working with CEX data:

You can find all data for a given type of block.
- the `blocks` function parses and filters CEX sources into lists of `Block`s
- the `data` function parses and filters CEX sources, and extracts only the data lines from the resulting `Block`s

```@example simple
datafortype("ctscatalog", blocklist)
```

A typical work pattern might be to read an array of blocks, see what types of block are included, and then use an appropriate module to process blocks depending on their type (e.g., use the `CitableCorpus` module to read a `ctsdata` or `ctscatalog` block).
They are documented on the following pages.