Merge pull request #50 from cite-architecture/dev

Major redesign of package
cite-architecture · Dec 31, 2021 · dcddb94
2 parents b515730 + a62b7ad
commit dcddb94

Showing 38 changed files with 1,788 additions and 288 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,12 +1,11 @@
 name = "CiteEXchange"
 uuid = "e2e9ead3-1b6c-4e96-b95f-43e6ab899178"
 authors = ["Neel Smith <dnsmith.neel@gmail.com>"]
-version = "0.7.0"
+version = "0.8.0"
 
 [deps]
 CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
 CitableBase = "d6f014bd-995c-41bd-9893-703339864534"
-CitableObject = "e2b2f5ea-1cd8-4ce8-9b2b-05dad64c2a57"
 DocStringExtensions = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
@@ -15,7 +14,6 @@ Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 [compat]
 CSV = "0.9"
 CitableBase = "8"
-CitableObject = "0.11"
 DocStringExtensions = "0.8"
 Documenter = "0.27"
 HTTP = "0.9"

diff --git a/docs/make.jl b/docs/make.jl
@@ -21,9 +21,12 @@ makedocs(
     sitename = "CiteEXchange",
     pages = [
         "Overview" => "index.md",
-
+        "The `blocks` function" => "blocks.md",
+        "The `data` function" => "data.md",
 
         "API documentation" => "apis.md"
+
+
     ]
 
 )

diff --git a/docs/src/apis.md b/docs/src/apis.md
@@ -1,21 +1,24 @@
 # API documentation
 
-```@meta
-CurrentModule = CiteEXchange
-```
 
-## Exported functions and types
+## Exported types and functions
+
+### The `Block` data type
+
 ```@docs
 Block
+blocktypes
 cexversion
+```
+
+### The `blocks` function
+
+```@docs
 blocks
-blocktypes
-blocksfortype
-datafortype
 ```
 
-## Internals
+### The `data` function
 
 ```@docs
-blocktype
-```
+data
+```
diff --git a/docs/src/blocks.md b/docs/src/blocks.md
@@ -0,0 +1,84 @@
+```@setup blocks
+root = pwd() |> dirname |> dirname
+f = joinpath(root, "test", "assets", "burneyex.cex")
+```
+
+
+# The `blocks` function
+
+The `blocks` function can:
+
+- parse a CEX data source into `Block`s, optionally filtering it by block type
+- filter a list of `Block`s by block type
+
+It always returns a (possibly empty) Vector of `Block`s.
+
+## Parsing a CEX data source
+
+The following examples parse a CEX source with two blocks: a `ctscatalog` block and a `ctsdata` block.  Using `blocks` with a specified "reader", they parse identical data from a URL, from a file (`f` in the examples below is `test/assets/burneyex.cex` in this GitHub repository), and from a string value.
+
+Parse CEX from a URL:
+
+```@example blocks
+using CiteEXchange
+url = "https://raw.githubusercontent.com/cite-architecture/CiteEXchange.jl/main/test/assets/burneyex.cex"
+urlblocks = blocks(url, CiteEXchange.UrlReader)
+```
+
+From a file:
+
+```@example blocks
+fileblocks = blocks(f, CiteEXchange.FileReader)
+```
+
+From a string:
+
+```@example blocks
+cexstring = read(f, String)
+stringblocks = blocks(cexstring, CiteEXchange.StringReader)
+```
+
+The default is to parse from a string.
+
+```@example blocks
+defaultblocks = blocks(cexstring)
+```
+
+The results are equivalent.
+
+```@example blocks
+urlblocks == fileblocks == stringblocks == defaultblocks
+```
+
+
+## Filter CEX source by type
+
+Pass the String value of a CEX block type as an additional parameter to keep only the `Block`s of that type in the resulting Vector.
+
+
+```@example blocks
+urlcatalog = blocks(url, CiteEXchange.UrlReader, "ctscatalog")
+filecatalog = blocks(f, CiteEXchange.FileReader, "ctscatalog")
+stringcatalog = blocks(cexstring, CiteEXchange.StringReader, "ctscatalog")
+defaultcatalog = blocks(cexstring, "ctscatalog")
+```
+
+
+```@example blocks
+urlcatalog == filecatalog == stringcatalog == defaultcatalog
+```
+
+
+## Filter a list of `Block`s by type
+
+The `blocks` function can also be used to select blocks of a given type from a list of `Block`s.
+
+
+```@example blocks
+filteredcatalog = blocks(fileblocks, "ctscatalog")
+```
+
+
+```@example blocks
+filteredcatalog == filecatalog
+```
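The "reader" argument documented in `docs/src/blocks.md` selects how the source string is interpreted (as CEX text, a file name, or a URL). The sketch below shows one way such a design can be expressed with Julia's dispatch on types. It is a minimal illustration, not the package's implementation: `SketchReader`, `SketchStringReader`, `SketchFileReader`, and `cextext` are hypothetical names, the sample text is made up, and a URL reader is left out so the example needs no network access.

```julia
# Hypothetical names throughout: this is NOT CiteEXchange's implementation.
abstract type SketchReader end
struct SketchStringReader <: SketchReader end
struct SketchFileReader <: SketchReader end

# Each method turns a source into a single String of CEX text.
# The reader type is passed as a value, as in `blocks(f, CiteEXchange.FileReader)`.
cextext(src::AbstractString, ::Type{SketchStringReader}) = String(src)
cextext(src::AbstractString, ::Type{SketchFileReader}) = read(src, String)

# Default to the string reader, mirroring the documented default.
cextext(src::AbstractString) = cextext(src, SketchStringReader)

# Usage: the same (made-up) text read from a string and from a temporary file.
sample = "#!ctsdata\nurn:cts:demo:text.v1:1|First passage\n"
path, io = mktemp()
write(io, sample)
close(io)

fromstring = cextext(sample)
fromfile = cextext(path, SketchFileReader)
```

Because every reader method returns plain text, a parser written against that text works unchanged for all sources, which is consistent with the documented result that the URL, file, and string variants produce equal output.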
diff --git a/docs/src/data.md b/docs/src/data.md
@@ -0,0 +1,138 @@
+```@setup data
+root = pwd() |> dirname |> dirname
+f = joinpath(root, "test", "assets", "burneyex.cex")
+
+using CitableBase
+using CiteEXchange
+
+struct UnstructuredUrn <: Urn
+    id::AbstractString
+end
+
+import Base: show
+function show(io::IO, u::UnstructuredUrn)
+    print(io, u.id)
+end
+
+struct MyComparable <: UrnComparisonTrait end
+
+import CitableBase: urncomparisontrait
+function urncomparisontrait(::Type{UnstructuredUrn})
+    MyComparable()
+end
+
+import CitableBase: urnequals
+function urnequals(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
+    u1 == u2
+end
+import CitableBase: urncontains
+function urncontains(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
+    c1 = components(u1.id)
+    c2 = components(u2.id)
+    c1[3] == c2[3]
+end
+
+import CitableBase: urnsimilar
+function urnsimilar(::MyComparable, u1::UnstructuredUrn, u2::UnstructuredUrn)
+    urncontains(u1, u2)
+end
+
+root = pwd() |> dirname |> dirname
+f = joinpath(root, "test", "assets", "laxlibrary1.cex")
+
+```
+
+# The `data` function
+
+The `data` function can:
+
+- select data lines for a specified block type from a CEX source or from a list of `Block`s 
+- optionally filter data by a URN value
+
+It always returns a (possibly empty) Vector of string values representing CEX data lines.
+
+## Select data lines from CEX sources
+
+In this example, we work with a CEX source that has several different kinds of CEX blocks, and two `ctsdata` blocks with passages from two different texts.  We can collect all of the text data lines using the same syntax as for the `blocks` function.
+
+
+```@example data
+url = "https://raw.githubusercontent.com/cite-architecture/CiteEXchange.jl/dev/test/assets/laxlibrary1.cex"
+str = read(f, String)
+
+lines1 = data(f, CiteEXchange.FileReader, "ctsdata")
+lines2 = data(url, CiteEXchange.UrlReader, "ctsdata")
+lines3 = data(str, CiteEXchange.StringReader, "ctsdata")
+lines4 = data(str, "ctsdata")
+```
+```@example data
+lines1 == lines2 == lines3 == lines4
+```
+
+
+Note in particular that `citerelationset` blocks have three lines of metadata before the relations data. These three lines appear in the `lines` field of a block, but are not included in the output of `data`.
+
+```@example data
+relblocks = blocks(str, "citerelationset")
+relblocks[1].lines
+```
+
+```@example data
+data(str, "citerelationset")
+```
+
+## Select data lines from a list of `Block`s
+
+Instead of a CEX source, you can also directly supply a list of blocks (without a "reader" type). 
+
+```@example data
+blockgroup = blocks(str)
+blocklines = data(blockgroup, "ctsdata")
+```
+
+```@example data
+blocklines == lines3
+```
+
+## Filter data lines by URN
+
+The `data` function optionally accepts a third parameter, a URN value used to filter data lines by URN containment.  The background setup for this page defines a subtype of `Urn` called `UnstructuredUrn` that accepts any kind of URN string, and implements the `UrnComparisonTrait` for the type, so we can use `UnstructuredUrn` values to filter the data from blocks in our source.
+
+!!! note "Realistic URN types"
+
+    The `UnstructuredUrn` is used solely for the purposes of testing the `CiteEXchange` package.  In our experience, we can cover all needs for scholarly citation with either the `CtsUrn` type of [the `CitableText` package](https://cite-architecture.github.io/CitableText.jl/stable/), or the `Cite2Urn` of [the `CitableObject` package](https://github.com/cite-architecture/CitableObject.jl).
+
+
+When we collected all the `ctsdata` lines, we got five passages from two different texts.  Now we'll filter the request to get data from a single text.
+
+```@example data
+urn = UnstructuredUrn("urn:cts:citedemo:gburg")
+textdata = data(str, "ctsdata", urn)
+``` 
+
+URN filtering can be used with any of the variations of the `data` function, including filtering `Block`s.
+
+```@example data
+blks = blocks(f, CiteEXchange.FileReader)
+textfromblocks = data(blks, "ctsdata", urn)
+``` 
+
+### Negating a URN filter
+
+To collect all data lines that are *not* contained by a URN filter, set the optional parameter `complement` to `true`.
+
+```@example data
+urn = UnstructuredUrn("urn:cts:citedemo:gburg")
+textdata = data(str, "ctsdata", urn, complement=true)
+``` 
+
+
+### Filtering `citerelationset`s
+
+Note that when filtering `citerelationset`s by URN value, the filter applies to the URN for an entire relation set, *not* to URNs in individual relations.
+
+```@example data
+relsetfilter = UnstructuredUrn("urn:cite2:hmt:dse.v1:")
+data(str, "citerelationset", relsetfilter)
+```
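The URN filter and its `complement` option described in `docs/src/data.md` can be pictured as a predicate applied to each data line. The sketch below is a simplified stand-in, not the package's `data` function: `sketchfilter` is a hypothetical helper, the sample lines and their URNs are made up, and it assumes that `|` is the column delimiter, that the URN is the first column, and that a plain prefix test is an acceptable substitute for real URN containment.

```julia
# Hypothetical helper, NOT CiteEXchange's `data` function.
# Keeps lines whose first column starts with `urnprefix`;
# with `complement = true`, keeps all the other lines instead.
# Assumptions: "|" delimits columns and the URN is the first column.
function sketchfilter(lines::Vector{String}, urnprefix::AbstractString;
                      complement::Bool = false)
    filter(lines) do ln
        urn = first(split(ln, '|'))
        # `!=` on Bools acts as exclusive-or, so `complement` flips the test.
        startswith(urn, urnprefix) != complement
    end
end

# Made-up data lines for two different texts.
samplelines = [
    "urn:cts:citedemo:gburg.demo.v1:1|Passage from the first text",
    "urn:cts:citedemo:other.demo.v1:1|Passage from the second text",
]

matching = sketchfilter(samplelines, "urn:cts:citedemo:gburg")
others = sketchfilter(samplelines, "urn:cts:citedemo:gburg"; complement = true)
```

Here `matching` and `others` partition `samplelines`: every line lands in exactly one of the two results, which is the behavior the `complement` parameter is documented to provide.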
+
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -6,57 +6,45 @@ f = joinpath(root, "test", "assets", "burneyex.cex")
 
 # CiteEXchange
 
-*Parse strings and files in CEX format.*
+*Parse data in the delimited-text CEX format.*
 
-Cite EXchange format (CEX) is a plain-text format for serializing citable scholarly resources.
 
-## Quick introduction
+Cite EXchange format (CEX) is a plain-text format for serializing citable scholarly resources. CEX organizes data in one or more blocks defined by a CEX header line.  Using the `CiteEXchange` package, you can work with data from CEX sources as labelled `Block`s with associated lines of metadata and data, can extract data contents by CEX block type, and can filter contents by URN.
 
-The plain-text CEX format organizes data in one or more blocks defined by a  CEX header line.  
-Reading CEX source data with the `blocks` function creates an array of `Block`s, each of which has a label identifying the block type, followed by a series of data lines.  (Empty or whitespace-only lines are ignored.)  You can use `blocks`:
 
-- with a single argument to parse a string of CEX data
-- with a file name and a second parameter `FileReader` to parse CEX data from a file
-- with a URL string and a second parameter `UrlReader` to parse CEX retrieved from a URL
 
-This example reads a file with two blocks,  `ctscatalog` and `ctsdata` block.
+## Quick introduction
 
-!!! note
+You can use the `blocks` function to read source data into a Vector of `Block` objects.  This example reads a file with two blocks, one labelled `ctscatalog` and one labelled `ctsdata`.
 
-    The file `f` in the example below is `test/assets/burneyex.cex` in this github repository.
 
 ```@example simple
 using CiteEXchange
-blocklist = CiteEXchange.blocks(f, FileReader)
-blocklist |> length
+blocklist = blocks(f, CiteEXchange.FileReader)
 ```
 
+!!! note
+
+    The file `f` in the example above is `test/assets/burneyex.cex` in this GitHub repository.
 
-### Work with contents of an individual block 
 
-You can work directly the array of blocks:
+Each `Block` has a label and an array of data lines.  You can work directly with the array of blocks:
 
-```@example simple
+```
 blocklist[1].label
 ```
 
-```@example simple
+```
 blocklist[1].lines
 ```
 
 
-### Work with an array of `Block`s
-
-`CiteEXchange` also has functions that work with arrays of `Block`s.  You can see what types of blocks are present.
+## In more detail
 
-```@example simple
-blocktypes(blocklist)
-```
+The `CiteEXchange` package provides two main functions for working with CEX data:
 
-You can find all data for a given type of block.
+- the `blocks` function parses and filters CEX sources into lists of `Block`s
+- the `data` function parses and filters CEX sources, and extracts only the data lines from the resulting `Block`s
 
-```@example simple
-datafortype("ctscatalog", blocklist)
-```
 
-A typical work pattern might be to read an array of blocks, see what types of block are included, and then use an appropriate module to process blocks depending on their type (e.g., use the `CitableCorpus` module to read a `ctsdata` or `ctscatalog` block).
+They are documented on the following pages.
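
The revised `docs/src/index.md` describes CEX as one or more blocks, each defined by a header line, with every `Block` carrying a label and an array of lines. The following pure-Julia sketch illustrates that model without depending on the package. It is not the package's parser: `SketchBlock` and `sketchblocks` are hypothetical names, the sample data is made up, and it assumes that a header line begins with `#!` followed by the block type and that empty or whitespace-only lines are skipped.

```julia
# Hypothetical stand-in for the package's `Block` type.
struct SketchBlock
    label::String
    lines::Vector{String}
end

# Split CEX text into labelled blocks.
# Assumption: a header line starts with "#!" followed by the block type.
function sketchblocks(cex::AbstractString)
    found = SketchBlock[]
    for ln in split(cex, '\n')
        text = strip(ln)
        isempty(text) && continue            # skip empty or whitespace-only lines
        if startswith(text, "#!")
            # Start a new block labelled with whatever follows "#!".
            push!(found, SketchBlock(String(text[3:end]), String[]))
        elseif !isempty(found)
            # Attach the line to the most recently opened block.
            push!(last(found).lines, String(text))
        end
    end
    return found
end

# Keep only the blocks whose label matches the requested type.
sketchblocks(blks::Vector{SketchBlock}, blocktype::AbstractString) =
    filter(b -> b.label == blocktype, blks)

# Made-up sample with two blocks.
sample = """
#!ctscatalog
urn|citationScheme|groupName
urn:cts:demo:group.work.v1:|line|Demo group

#!ctsdata
urn:cts:demo:group.work.v1:1|First passage
urn:cts:demo:group.work.v1:2|Second passage
"""

allblocks = sketchblocks(sample)
catalog = sketchblocks(allblocks, "ctscatalog")
```

The two methods mirror the two documented uses of `blocks`: one parses a source into a Vector of blocks, and the other filters an existing Vector by block type.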