# Crossref data preprocessing

This notebook contains **Julia** code that preprocesses the Crossref public data file, which can be downloaded [here](https://academictorrents.com/details/4426fa56a4f3d376ece9ac37ed088095a30de568). The script processes the Crossref JSON files and creates tab-delimited files that contain only the following information about each DOI:

- DOI
- the date when the DOI was created in Crossref
- article type (journal article, conference, book or chapter, academic thesis, etc.)
- publication date
- publisher name

The script takes around 18 hours to run on a server equipped with two Intel Xeon E5-2698 v4 CPUs (2.20 GHz). After it completes, other preprocessing scripts, e.g. *crossref date count.ipynb*, can be run.

The data is prepared for the paper 'From black open access to open access of colour: accepting the diversity of approaches towards free science'.
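For orientation, each line of an output file carries the five fields listed above, separated by tabs. The following is a minimal sketch of reading such lines back, not part of the original pipeline: the helper name `readtab` and the second sample row are hypothetical, while the first row uses the values of the record previewed later in this notebook.

```julia
# Sample rows in the five-column layout: DOI, type, creation date,
# publication date, publisher. The second row is made up for illustration.
sample = """
10.5026/jgeography.129.195\tjournal-article\t2020-05-11\t2020-04-25\tTokyo Geographical Society
10.0000/example.1\tbook-chapter\t2021-01-02\t2020-00-00\tN/A
"""

# Hypothetical helper: parse tab-delimited lines into named tuples, one per DOI
function readtab(io::IO)
    rows = NamedTuple[]
    for line in eachline(io)
        isempty(line) && continue
        doi, kind, created, published, publisher = split(line, '\t')
        push!(rows, (; doi, kind, created, published, publisher))
    end
    return rows
end

rows = readtab(IOBuffer(sample))

# Count records per article type
counts = Dict{String, Int}()
for r in rows
    counts[r.kind] = get(counts, r.kind, 0) + 1
end
```

To read a real output file, the same helper can be applied as `open(readtab, path)`, where `path` points to one of the *.tab* files.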

In [1]:
# the script will run 160 processes in parallel

using Distributed

addprocs(160 - nprocs(); exeflags = `--project=$(Base.active_project())`) ;

In [2]:
@everywhere begin

    using Distributed
    
    using Glob
    using JSON3
    using ProgressMeter
    using DelimitedFiles

    # specify the folder where the downloaded Crossref dataset is located
    # the dataset must be unpacked from .gz files before processing
    crossref = "/raid5/datasets/crossref/" ;
    
end

Before running the script, create a folder named **tabs** in the same directory where the Crossref dataset is stored. The script will save its output to that folder.

The Crossref dataset is stored in multiple *.json* files. For each *.json* file, the script will create a corresponding *.tab* file.
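The correspondence between input and output names can be sketched as follows. The helper `tabpath` is hypothetical and only illustrates the naming rule (same base name, *.tab* extension, placed in the **tabs** folder); the paths are the ones used in this notebook.

```julia
# Hypothetical helper: derive the .tab output path for a given .json input path
tabpath(json, tabsdir) = joinpath(tabsdir, first(splitext(basename(json))) * ".tab")

out = tabpath(
    "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/19530.json",
    "/raid5/datasets/crossref/tabs",
)

# the output file keeps the numeric base name of the input
basename(out)   # "19530.tab"
```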

In [3]:
# make a list of .json files available, as well as .tab files
# to avoid repeatedly processing those files that are already done

jsons = glob("*.json", crossref * "April 2024 Public Data File from Crossref")
tabs  = glob("*.tab",  crossref * "tabs")

# exclude already processed files from queue, and
# sort the list by size to process largest files first

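# compare by base name, since .json and .tab files live in different folders
processed = Set(first(splitext(basename(t))) for t in tabs)
jsons = [j for j in jsons if filesize(j) > 0 && first(splitext(basename(j))) ∉ processed]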
jsons = sort(jsons, by = j -> -filesize(j))

31601-element Vector{String}:
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/19530.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/19534.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/25432.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/24432.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/19526.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/25431.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/30414.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/24430.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/25393.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/30415.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/22481.json"
 "/raid5/datasets/crossref/April 2024 Public Data File from Crossref/19529.json

Preview the first record of an arbitrary JSON file from the Crossref dataset to get a sense of how the data is structured:

In [4]:
jtest = JSON3.read(open(jsons[12345])).items[1]
JSON3.pretty(jtest)

{
    "URL": "http://dx.doi.org/10.5026/jgeography.129.195",
    "resource": {
        "primary": {
            "URL": "https://www.jstage.jst.go.jp/article/jgeography/129/2/129_129.195/_article/-char/ja/"
        }
    },
    "member": "2942",
    "score": 0,
    "created": {
        "date-parts": [
            [
                2020,
                5,
                11
            ]
        ],
        "date-time": "2020-05-11T22:05:48Z",
        "timestamp": 1589234748000
    },
    "ISSN": [
        "0022-135X",
        "1884-0884"
    ],
    "container-title": [
        "Journal of Geography (Chigaku Zasshi)"
    ],
    "issued": {
        "date-parts": [
            [
                2020,
                4,
                25
            ]
        ]
    },
    "issue": "2",
    "prefix": "10.5026",
    "reference-count": 32,
    "indexed": {
        "date-parts": [
            [
                2022,
                4,
                3
            ]
        ],
        "date-ti

In [5]:
jtest.DOI

"10.5026/jgeography.129.195"

In [6]:
jtest.created["date-time"]

"2020-05-11T22:05:48Z"

**Note**: the publication date is not available for every DOI and is sometimes incomplete (e.g. only the year is stored).

In [7]:
jtest.published["date-parts"][1]

3-element JSON3.Array{Int64, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 2020
    4
   25
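Because `date-parts` may hold only a year, or a year and a month, the dates need padding before they can be stored in a fixed format. The sketch below mirrors the padding logic used later in `tabulate`; the standalone helper `padded_date` is hypothetical and is shown only to make the convention explicit (missing month or day becomes `00`).

```julia
# Hypothetical helper: pad a possibly partial date-parts entry to YYYY-MM-DD
function padded_date(parts)
    ymd = string.(collect(parts))      # e.g. [2020, 4] -> ["2020", "4"]
    while length(ymd) < 3
        push!(ymd, "00")               # a missing month or day becomes "00"
    end
    ymd[1] = lpad(ymd[1], 4, "0")      # four-digit year
    ymd[2] = lpad(ymd[2], 2, "0")      # two-digit month
    ymd[3] = lpad(ymd[3], 2, "0")      # two-digit day
    return join(ymd, "-")
end

padded_date([2020, 4, 25])   # "2020-04-25"
padded_date([2020, 4])       # "2020-04-00"
padded_date([2020])          # "2020-00-00"
```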

In [8]:
jtest.type

"journal-article"

In [9]:
jtest.publisher

"Tokyo Geographical Society"

Define a function that takes a .json file name as input, processes the file, and creates a .tab file containing the selected information (DOI, type, creation date, publication date, and publisher name) for every DOI:

In [10]:
@everywhere begin

    function tabulate(json)

        # a function to collapse runs of whitespace in publisher names
        clear(s) = join(split(s, r"\s+"), " ")

        # get JSON file name, without extension and path
        jid = split(split(json, "/")[end], ".")[end-1]

        # if the file was already processed and .tab created, skip
        isfile(crossref * "tabs/$jid.tab") && return 1

        # create temporary file to store output
        output = open(crossref * "tabs/$jid.tmp", "w")

        # read JSON file into variable
        data = JSON3.read(read(json, String))

        # each JSON file contains multiple records,
        # one record per article; iterate over them in the loop
        for item in data.items

            # process only correct items that have DOI creation date
            # and item type specified and non-empty DOI
            
            fields = keys(item)
            
            if (length(item.DOI) > 0) && ("created" in fields) && ("type" in fields)

                # extract DOI creation time
                created = split(item.created["date-time"], "T")[1]

                # if publication date is available, extract
                # it into variable and format to YYYY-MM-DD
                
                if ("published"  in fields) &&
                   ("date-parts" in keys(item.published))
                    ymd = string.(item.published["date-parts"][1])
                    while length(ymd) < 3 push!(ymd, "00") end
                    ymd[1] = lpad(ymd[1], 4, "0")
                    for i in 2:3
                        ymd[i] = lpad(ymd[i], 2, "0")
                    end
                    published = join(ymd, "-")
                else
                    published = "0000-00-00"
                end

                # extract publisher name
                publisher = "publisher" in fields ? clear(item.publisher) : "N/A"

                # write extracted data to tab-delimited file
                write(output, join([
                            
                    lowercase(item.DOI),
                    item.type,
                    created,
                    published,
                    publisher
                
                ], "\t") * "\n")
                
            end
        end

        close(output)

        # create output .tab file from temporary
        
        mv(crossref * "tabs/$jid.tmp",
           crossref * "tabs/$jid.tab") ;
    
    end
    
end
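Note that `tabulate` writes to a *.tmp* file and renames it to *.tab* only after all records have been written, so an interrupted run should not leave a partially written *.tab* file that would later be mistaken for a finished one. The pattern in isolation can be sketched as below; the helper `write_then_rename` and the sample record are hypothetical, and the example writes to a temporary directory rather than to the dataset folder.

```julia
# Hypothetical helper illustrating the write-then-rename pattern
function write_then_rename(f, path)
    tmp = path * ".tmp"
    open(f, tmp, "w")             # f receives the IO handle; the file is closed afterwards
    mv(tmp, path; force = true)   # the final name appears only once writing is complete
    return path
end

dir    = mktempdir()
target = joinpath(dir, "example.tab")

write_then_rename(target) do io
    write(io, "10.0000/example.1\tjournal-article\n")
end

isfile(target)            # true
isfile(target * ".tmp")   # false
```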

Run parallel processing of the array of .json files. The run takes many hours (around 18 on the server described above), and the rate of completed files increases as it progresses, because the largest files are processed first:

In [None]:
@showprogress output = stdout @distributed for j in jsons tabulate(j) end

Progress:  21%|████████▊                                |  ETA: 11:16:10