feat: add citation support #129

Merged
merged 5 commits into from
Dec 15, 2023

bdarcus (Owner) commented Sep 16, 2023

It seems right to add this key piece of functionality next.

I will likely only add author-date initially, since that's all I really use myself. Even so, I will design it along the same lines as 1.0.

Also, I will initially only support a more abstract import format, not actual documents; I'm still waiting on djot support for citations.

I thought I had this working, but it turns out not; process_citations is currently returning empty vectors.

  "citations": [
    [],
    [],
    []
  ]

Digging a bit more, I think I may need to rethink and refactor the rendering code to account for the citations.


Details

I'm not sure how best to do this, but I probably need to look at https://github.com/jgm/citeproc and https://github.com/zotero/citeproc-rs, though I have a hard time understanding the code in many places.

This could be the citation definition, but it doesn't seem right.

https://github.com/zotero/citeproc-rs/blob/2ab195a1e6f84f0ff284813ece61dc62096abbfe/crates/pandoc-types/src/definition.rs#L222

See, though, the design document. It takes a parallel approach: "Pass 1" creates several alternative representations of the intermediate output, and "Pass 2" resolves them.

Haskell citeproc

Here are the Haskell processor's citation types, which make more sense to me.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Types.hs#L310
https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Types.hs#L263

data Citation a =
  Citation { citationId         :: Maybe Text
           , citationNoteNumber :: Maybe Int
           , citationItems      :: [CitationItem a] }

data CitationItem a =
  CitationItem
  { citationItemId             :: ItemId
  , citationItemLabel          :: Maybe Text
  , citationItemLocator        :: Maybe Text
  , citationItemType           :: CitationItemType
  , citationItemPrefix         :: Maybe a
  , citationItemSuffix         :: Maybe a
  , citationItemData           :: Maybe (Reference a)
  }

data CitationItemType =
    AuthorOnly      -- ^ e.g., Smith
  | SuppressAuthor  -- ^ e.g., (2000, p. 30)
  | NormalCite      -- ^ e.g., (Smith 2000, p. 30)
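A rough Rust translation of these types might look like the sketch below. It is only an illustration of how the model could line up: the field names are shortened, the polymorphic `a` is fixed to `String`, and the embedded `citationItemData` reference is left out. All of these are assumptions of the sketch, not what this PR implements.

```rust
/// How a single cited item should be rendered (mirrors `CitationItemType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CitationItemType {
    /// e.g., Smith
    AuthorOnly,
    /// e.g., (2000, p. 30)
    SuppressAuthor,
    /// e.g., (Smith 2000, p. 30)
    #[default]
    NormalCite,
}

/// One cited reference within a citation (mirrors `CitationItem`).
/// Assumption: prefix and suffix are plain strings rather than a generic type.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct CitationItem {
    pub id: String,
    pub label: Option<String>,
    pub locator: Option<String>,
    pub item_type: CitationItemType,
    pub prefix: Option<String>,
    pub suffix: Option<String>,
}

/// A citation: one or more cited items at a single point in the text.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Citation {
    pub id: Option<String>,
    pub note_number: Option<u32>,
    pub items: Vec<CitationItem>,
}
```

Deriving `Default` keeps construction short, since most fields are optional and `NormalCite` is the usual case.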

Here's the high-level processing logic, which is basically what I am planning here.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc.hs#L20C23-L20C23

-- | Process a list of 'Citation's, producing formatted citations
-- and a bibliography according to the rules of a CSL 'Style'.
-- If a 'Lang' is specified, override the style's default locale.
-- To obtain a 'Style' from an XML stylesheet, use
-- 'parseStyle' from "Citeproc.Style".
citeproc :: CiteprocOutput a
         => CiteprocOptions    -- ^ Rendering options
         -> Style a            -- ^ Parsed CSL style
         -> Maybe Lang         -- ^ Overrides default locale for style
         -> [Reference a]      -- ^ List of references (bibliographic data)
         -> [Citation a]       -- ^ List of citations to process
         -> Result a
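The same shape could be sketched in Rust roughly as follows. This is a toy under stated assumptions: `Reference`, `Citation`, `ProcessedOutput`, and `process` are hypothetical names, there is no style or locale argument, and the author-date formatting is hard-coded purely to show the data flow from references and citations to a result.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified bibliographic record.
pub struct Reference {
    pub id: String,
    pub author: String,
    pub year: String,
}

/// Hypothetical citation: just the ids of the references it cites.
pub struct Citation {
    pub items: Vec<String>,
}

/// Rendered output: one string per citation, one per bibliography entry.
#[derive(Debug, Default, PartialEq)]
pub struct ProcessedOutput {
    pub citations: Vec<String>,
    pub bibliography: Vec<String>,
    pub warnings: Vec<String>,
}

/// Render every citation and the bibliography in one call.
pub fn process(references: &[Reference], citations: &[Citation]) -> ProcessedOutput {
    // Index references by id so each cited item is a single lookup.
    let by_id: HashMap<&str, &Reference> =
        references.iter().map(|r| (r.id.as_str(), r)).collect();

    let mut out = ProcessedOutput::default();
    for citation in citations {
        let mut parts = Vec::new();
        for id in &citation.items {
            match by_id.get(id.as_str()) {
                Some(r) => parts.push(format!("{} {}", r.author, r.year)),
                None => out.warnings.push(format!("citation key not found: {id}")),
            }
        }
        // Hard-coded author-date layout; a real processor would consult the style.
        out.citations.push(format!("({})", parts.join("; ")));
    }
    out.bibliography = references
        .iter()
        .map(|r| format!("{}. {}.", r.author, r.year))
        .collect();
    out
}
```

Because the output keeps one entry per input citation, in input order, the caller can match rendered citations back to their source positions by index.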

Question: how are rendered citations inserted into the document?

Disambiguation

... I also need to figure out where and how disambiguation fits in this.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Eval.hs#L408

I'm hoping other aspects of this design will make this part easier, but I haven't yet figured it out.

My initial thoughts:

The main aspects of disambiguation I need to focus on first are (author) names, and years.

Year disambiguation is easy because in practice it's global, so I've already implemented it.
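To illustrate what "global" means here, a minimal sketch of year-suffix assignment follows. It is not the code in this branch: `Reference` and `disambiguate_years` are hypothetical, the grouping key is just (author, year), and suffixes follow input order, whereas a real processor would follow the bibliography sort order.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified reference with a single author string.
#[derive(Debug, Clone)]
pub struct Reference {
    pub id: String,
    pub author: String,
    pub year: i32,
}

/// Map each reference id to its rendered year: "2000" when unique,
/// "2000a", "2000b", ... when several references share author and year.
pub fn disambiguate_years(references: &[Reference]) -> HashMap<String, String> {
    // Group references by (author, year); order within a group is input order.
    let mut groups: HashMap<(&str, i32), Vec<&Reference>> = HashMap::new();
    for r in references {
        groups.entry((r.author.as_str(), r.year)).or_default().push(r);
    }

    let mut rendered = HashMap::new();
    for ((_, year), group) in groups {
        for (i, r) in group.iter().enumerate() {
            let year_str = if group.len() > 1 {
                // Assumption: fewer than 27 collisions, so a single letter suffices.
                let suffix = (b'a' + i as u8) as char;
                format!("{year}{suffix}")
            } else {
                year.to_string()
            };
            rendered.insert(r.id.clone(), year_str);
        }
    }
    rendered
}
```

The result depends only on the full reference list, not on any particular citation, which is why it can be computed once up front.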

Name disambiguation is the tricky piece, since it typically applies to citations and not bibliographies (unless, I guess, a style requires a given-name initial to be expanded?).

I suppose one option would be to follow the citeproc-rs approach: somehow generate alternate name representations on first pass, and disambiguate them separately.

Maybe I could create a hash table for author names, something vaguely like:

pub struct Author {
    pub name: String,
    pub disambiguate_given: Vec<String>,
    pub role: ContributorRole,
    pub substitute: bool,
}

Regardless of the details, the idea would be to look up the right name, with its disambiguation string, in that hash map.
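One way the lookup idea could work is sketched below, as a guess rather than a design. `Name`, `build_name_index`, and `render_citation_name` are hypothetical, and the only disambiguation step shown is adding a given-name initial when a family name is shared.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified personal name.
pub struct Name {
    pub family: String,
    pub given: String,
}

/// Index every family name to the set of distinct given names seen with it.
pub fn build_name_index(names: &[Name]) -> HashMap<String, BTreeSet<String>> {
    let mut index: HashMap<String, BTreeSet<String>> = HashMap::new();
    for n in names {
        index
            .entry(n.family.clone())
            .or_default()
            .insert(n.given.clone());
    }
    index
}

/// Render a name for a citation: family name only, unless the index shows
/// the family name is shared, in which case a given-name initial is added.
pub fn render_citation_name(name: &Name, index: &HashMap<String, BTreeSet<String>>) -> String {
    let ambiguous = index
        .get(&name.family)
        .map_or(false, |given| given.len() > 1);
    if ambiguous {
        match name.given.chars().next() {
            Some(initial) => format!("{initial}. {}", name.family),
            None => name.family.clone(),
        }
    } else {
        name.family.clone()
    }
}
```

An initial alone does not separate, say, John Smith from Jane Smith; a list of progressively longer forms, as `disambiguate_given: Vec<String>` suggests, would be needed to handle that case.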

@bdarcus bdarcus added the enhancement New feature or request label Sep 16, 2023
@bdarcus bdarcus force-pushed the citations branch 3 times, most recently from 30829da to cab0f8e Compare September 16, 2023 21:05
@bdarcus bdarcus force-pushed the citations branch 4 times, most recently from 96cd8d2 to 8a377cd Compare October 13, 2023 21:31
bdarcus (Owner, Author) commented Oct 15, 2023

@jgm - can I ask you a high-level question about citeproc and pandoc integration for citation rendering?

You render citations independently of the document, so how and when do you insert them into it?

jgm commented Oct 16, 2023

@bdarcus - after the input format is parsed to a Pandoc AST, we apply

processCitations  :: PandocMonad m => Pandoc -> m Pandoc

which transforms the Pandoc AST by (1) replacing each citation with the formatted citation and (2) adding a bibliography. The code is in Text.Pandoc.Citeproc.

The transformed AST can then be rendered by any of the pandoc writers. Small complication: for display details, we use special Span and Div elements. These will be ignored by most writers, but for a few writers we've implemented code that responds to them by doing the proper formatting (e.g. docx, latex, html).
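For readers more at home in Rust, the transformation described above can be sketched on a toy document model. Nothing here is pandoc's API: `Inline`, `Document`, and `insert_citations` are hypothetical, and a citation node is assumed to carry only an index into the list of rendered citations.

```rust
/// Toy inline node: plain text, or a citation placeholder holding an
/// index into the processor's list of rendered citations.
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Text(String),
    Cite(usize),
}

/// Toy document: a flat run of inlines plus an optional bibliography.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub inlines: Vec<Inline>,
    pub bibliography: Vec<String>,
}

/// (1) Replace each citation node with its formatted text and
/// (2) attach the bibliography, returning the transformed document.
pub fn insert_citations(
    doc: Document,
    citations: &[String],
    bibliography: &[String],
) -> Document {
    let inlines = doc
        .inlines
        .into_iter()
        .map(|inline| match inline {
            Inline::Cite(i) => match citations.get(i) {
                Some(rendered) => Inline::Text(rendered.clone()),
                // No rendered citation for this index: leave the node untouched.
                None => Inline::Cite(i),
            },
            other => other,
        })
        .collect();
    Document {
        inlines,
        bibliography: bibliography.to_vec(),
    }
}
```

The point of the sketch is the ordering: parse first, run the citation processor over the collected citations, then rewrite the tree, so that any writer sees an ordinary document.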

bdarcus (Owner, Author) commented Oct 16, 2023

Thanks @jgm!

I have a hard time reading Haskell code. Am I correct that the output you use from citeproc is basically the same as the server JSON: an array of citation strings?

jgm commented Oct 17, 2023

My Haskell citeproc library uses polymorphic types.

-- | Process a list of 'Citation's, producing formatted citations
-- and a bibliography according to the rules of a CSL 'Style'.
-- If a 'Lang' is specified, override the style's default locale.
-- To obtain a 'Style' from an XML stylesheet, use
-- 'parseStyle' from "Citeproc.Style".
citeproc :: CiteprocOutput a
         => CiteprocOptions    -- ^ Rendering options
         -> Style a            -- ^ Parsed CSL style
         -> Maybe Lang         -- ^ Overrides default locale for style
         -> [Reference a]      -- ^ List of references (bibliographic data)
         -> [Citation a]       -- ^ List of citations to process
         -> Result a

For pandoc we use a = Inlines, so that the contents are pandoc Inline sequences, not raw strings. The typeclass instance for this is defined in Citeproc.Pandoc:

instance CiteprocOutput Inlines where
...

We also have an instance for HTML, which we use for the standard citeproc test suite.

The advantage of this is that when we're using pandoc, we can define bibliography entries with any of the formatting pandoc provides (e.g. math), and this will be carried through all the way to the result.
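A loose Rust analogue of this polymorphism is a trait over the output type, sketched below. The trait and its methods (`CiteOutput`, `from_text`, `emphasis`, `concat`) are invented for illustration and do not correspond to the actual `CiteprocOutput` class methods.

```rust
/// Hypothetical output abstraction: the renderer builds results only
/// through these operations, so it never commits to raw strings.
pub trait CiteOutput: Sized {
    fn from_text(text: &str) -> Self;
    fn emphasis(self) -> Self;
    fn concat(parts: Vec<Self>) -> Self;
}

/// Plain-text output: formatting is dropped.
#[derive(Debug, PartialEq)]
pub struct PlainText(pub String);

/// HTML output: text is escaped and emphasis becomes <i>.
#[derive(Debug, PartialEq)]
pub struct Html(pub String);

impl CiteOutput for PlainText {
    fn from_text(text: &str) -> Self {
        PlainText(text.to_string())
    }
    fn emphasis(self) -> Self {
        self
    }
    fn concat(parts: Vec<Self>) -> Self {
        PlainText(parts.into_iter().map(|p| p.0).collect())
    }
}

impl CiteOutput for Html {
    fn from_text(text: &str) -> Self {
        // Minimal escaping, enough for the sketch.
        Html(text.replace('&', "&amp;").replace('<', "&lt;"))
    }
    fn emphasis(self) -> Self {
        Html(format!("<i>{}</i>", self.0))
    }
    fn concat(parts: Vec<Self>) -> Self {
        Html(parts.into_iter().map(|p| p.0).collect())
    }
}

/// One rendering routine, generic over the output format.
pub fn render_entry<O: CiteOutput>(author: &str, year: &str, title: &str) -> O {
    O::concat(vec![
        O::from_text(author),
        O::from_text(" ("),
        O::from_text(year),
        O::from_text("). "),
        O::from_text(title).emphasis(),
        O::from_text("."),
    ])
}
```

Calling `render_entry::<Html>("Smith", "2000", "A & B")` and `render_entry::<PlainText>(...)` runs the same logic against two formats, which is the property that lets rich formatting survive to the final output.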

bdarcus (Owner, Author) commented Oct 17, 2023

I only need to implement this to a proof-of-concept state ATM, so my plan is just to return something similar to the citeproc server JSON.

{
  "citations": [ ... ],
  "bibliography": [ ... ]
}

I was just confused about how one would replace the citation input with that output, but I guess it doesn't matter too much now.

> The advantage of this is that when we're using pandoc, we can define bibliography entries with any of the formatting pandoc provides (e.g. math), and this will be carried through all the way to the result.

Right. I'm thinking of using djot for this somehow, if and when it gets citations.

Signed-off-by: Bruce D'Arcus <bdarcus@gmail.com>
This more closely aligns the model with the haskell citeproc
implementation.

Signed-off-by: Bruce D'Arcus <bdarcus@gmail.com>
Add a struct to handle intermediately rendered output.

The intention is something similar to the haskell citeproc server json.

Signed-off-by: Bruce D'Arcus <bdarcus@gmail.com>
@bdarcus bdarcus force-pushed the citations branch 5 times, most recently from 08be9bd to 855b3bb Compare November 23, 2023 20:51
Signed-off-by: Bruce D'Arcus <bdarcus@gmail.com>
@bdarcus bdarcus force-pushed the citations branch 2 times, most recently from 7e227a2 to ea15946 Compare November 25, 2023 16:29
Signed-off-by: Bruce D'Arcus <bdarcus@gmail.com>
@bdarcus bdarcus marked this pull request as ready for review December 15, 2023 14:12
@bdarcus bdarcus merged commit a1d9db1 into main Dec 15, 2023
6 checks passed
@bdarcus bdarcus deleted the citations branch December 15, 2023 14:12