# Sci-Hub: number of full-text articles per publisher

This notebook contains **Julia** code that counts, for each publisher and year, the articles available as full text in Sci-Hub.

The script requires pre-processed, tab-delimited Crossref files to be available in the *data/crossref* folder. To generate the .tab files from the publicly available Crossref dataset, see *crossref to tab.ipynb*. The script also takes a list of DOIs available in Sci-Hub as input.

The script produces a separate .txt file for every publisher in Crossref that has at least one article in Sci-Hub. Each file contains the number of that publisher's articles available as full text in Sci-Hub: one total for articles published before 2001, and one count for each year from 2001 onwards. Only DOIs of type *journal-article* and *proceedings-article* are counted. The data is used in the *sci-hub percent per publisher.ipynb* notebook.

The data is prepared for the paper 'From black open access to open access of colour: accepting the diversity of approaches towards free science'.

In [1]:
using MD5,
      Glob,
      Random,
      ProgressMeter,
      DelimitedFiles,
      DataStructures,
      OrderedCollections

dir = @__DIR__ ;

Load the list of DOIs available in Sci-Hub as full text:

In [2]:
scihub_dois = dir * "/../data/sci-hub-doi-2022-02-12.txt"

scihub_dois = readdlm(scihub_dois, '\t', String, '\n', quotes = false, use_mmap = true)
scihub_dois = Dict(lowercase.(replace.(scihub_dois[:,1], !isascii => "")) .=> true) ;
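The normalisation applied to each DOI in the cell above (drop non-ASCII characters, then lowercase) can be checked on a small sample. This is a minimal sketch: the DOI strings and the `normalize_doi` helper are hypothetical, and a `Set` is used here in place of the `Dict` only to keep the example short.

```julia
# Hypothetical sample of raw DOI strings (not taken from the Sci-Hub list);
# the second one contains a zero-width space, a non-ASCII character
raw = ["10.1000/ABC.123", "10.1000/xyz\u200b.456", "10.1000/abc.123"]

# Hypothetical helper with the same normalisation as in the cell above:
# drop non-ASCII characters, then lowercase
normalize_doi(doi) = lowercase(replace(doi, !isascii => ""))

# A Set offers the same constant-time membership test as the keys of a Dict
dois = Set(normalize_doi.(raw))

println(length(dois))               # 2, the first and third DOI collapse into one
println("10.1000/xyz.456" ∈ dois)   # true
```

Normalising before building the lookup means that DOIs differing only in case or in stray non-ASCII characters are treated as the same article.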

Load the list of Crossref .tab files:

In [3]:
tabs = glob("*.tab", dir * "/../data/crossref")

31601-element Vector{String}:
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/0.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/1.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 25 bytes ⋯ "cessing/../data/crossref/10.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 26 bytes ⋯ "essing/../data/crossref/100.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 27 bytes ⋯ "ssing/../data/crossref/1000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10001.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10002.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10003.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯

The following function takes the name of a .tab file as input and returns two outputs:

- a list of the DOIs in this .tab file that are available in Sci-Hub
- a dictionary with the number of those DOIs counted by publisher and year

In [4]:
function years(tab, scihub, y_start = 2000, y_end = 2024)
    data = readdlm(tab, '\t', String, '\n', quotes = false, use_mmap = true)
    
    counts = Dict()
    dois   = []

    # iterate through every record in the Crossref tab file
    
    for (doi, _, _, published, pub) in eachrow(data)

        # only articles available in Sci-Hub will be counted
        
        if (doi ∈ scihub)
            
            year = parse(Int, published[1:4])

            # skip if no publishing date or a wrong one was specified
            (year < 1600)    && continue
            
            # articles published before y_start are counted under y_start,
            # and articles published after y_end are counted under y_end
            
            (year < y_start) && (year = y_start)
            (year > y_end)   && (year = y_end)

            # initialize the counter for the publisher if it wasn't yet
            
            if (pub ∉ keys(counts))
                counts[pub] = Dict()
                for y in y_start:y_end
                    counts[pub][y] = 0
                end
            end
            
            # increment the counter
            counts[pub][year] += 1

            # save DOI to the list that will be returned
            push!(dois, doi)
            
        end
    end

    (dois, counts)
end ;
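The year bucketing used in `years` can be tried on a few in-memory records without reading any file. This is a minimal sketch: the records and the `bucket` helper are hypothetical, and `clamp` is used in place of the two comparisons in the function, which should give the same result for any integer year.

```julia
# Hypothetical helper: move a publication year into the reported range,
# so that earlier years land on y_start and later years on y_end
bucket(year; y_start = 2000, y_end = 2024) = clamp(year, y_start, y_end)

# Hypothetical records: (DOI, publication date, publisher)
records = [
    ("10.1000/a", "1995-03-01", "Publisher A"),
    ("10.1000/b", "2000-07-15", "Publisher A"),
    ("10.1000/c", "2015",       "Publisher A"),
    ("10.1000/d", "2015-01-01", "Publisher B"),
    ("10.1000/e", "0000",       "Publisher B"),  # implausible date, skipped
]

counts = Dict{String, Dict{Int, Int}}()

for (doi, published, pub) in records
    year = parse(Int, published[1:4])
    year < 1600 && continue

    # create a zero-filled counter the first time a publisher is seen
    inner = get!(counts, pub) do
        Dict(y => 0 for y in 2000:2024)
    end
    inner[bucket(year)] += 1
end

println(counts["Publisher A"][2000])   # 2, the 1995 and the 2000 article
println(counts["Publisher A"][2015])   # 1
println(counts["Publisher B"][2015])   # 1
```

Because every publisher starts with a zero for each year in the range, the output keeps a row for years in which nothing was found.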

Process every Crossref tab file and extract the number of DOIs by year and publisher. The files are processed sequentially, not in parallel.

In [5]:
scihub_counts = Dict()

# a handy function that will be used to
# sum up counts from different tab files
plus(a, b) = merge(+, a, b)

shuffle!(tabs)

@showprogress output = stdout (

    # iterate through every Crossref tab file
    for tab in tabs

        # count the number of Sci-Hub DOIs present in that file
        dois, counts = years(tab, keys(scihub_dois))

        # remove the DOIs that were found from the list
        # of Sci-Hub articles to reduce memory usage
        for doi in dois
            delete!(scihub_dois, doi)
        end

        # add counts extracted from tab file to total counts
        scihub_counts = merge(plus, scihub_counts, counts)
        
    end ) ;

Progress: 100%|█████████████████████████████████████████| Time: 0:06:02


Preview the result:

In [6]:
scihub_counts

Dict{Any, Any} with 4085 entries:
  "Asia University"         => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "University Press of Flo… => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "Sociedad Peruana de Obs… => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "JCDR Research and Publi… => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "Goodfellow Publishers"   => Dict{Any, Any}(2024=>0, 2004=>0, 2023=>0, 2010=>…
  "University of Rijeka"    => Dict{Any, Any}(2024=>0, 2004=>0, 2023=>0, 2010=>…
  "Lavoisier"               => Dict{Any, Any}(2024=>0, 2004=>205, 2002=>148, 20…
  "Society for the Study o… => Dict{Any, Any}(2024=>0, 2004=>94, 2002=>119, 200…
  "Faculty of Tourism and … => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "Institute of Archaeolog… => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "Adenine Press, Inc."     => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 2006=>…
  "Japan Society of Sarcoi… => Dict{Any, Any}(2024=>0, 2004=>0, 2002=>0, 20
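The totals previewed above are built by merging the per-file counts one file at a time. The nested merge can be sketched on two small dictionaries; the publisher names and numbers below are hypothetical, and `plus` is the same helper as in the processing cell.

```julia
# Same helper as in the processing cell: add up two Dicts key by key
plus(a, b) = merge(+, a, b)

# Hypothetical per-file counts for two tab files
file1 = Dict("Publisher A" => Dict(2000 => 2, 2001 => 1))
file2 = Dict("Publisher A" => Dict(2000 => 1, 2001 => 4),
             "Publisher B" => Dict(2000 => 7, 2001 => 0))

# Publishers present in both files have their yearly counts added;
# publishers present in only one file are carried over unchanged
total = merge(plus, file1, file2)

println(total["Publisher A"][2000])   # 3
println(total["Publisher A"][2001])   # 5
println(total["Publisher B"][2000])   # 7
```

The outer `merge` resolves publisher collisions with `plus`, and `plus` in turn resolves year collisions with `+`, so the order in which files are processed does not affect the totals.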

Save the results to the output folder **data/publishers sci-hub**. Each publisher is saved to a separate .txt file named after the MD5 hash of the publisher name. Each file contains the number of articles (DOIs) available in Sci-Hub for each publication year:
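The file layout is the publisher name on the first line, followed by one tab-separated year and count per line. It can be sketched by writing a small file and reading it back. The counts, the temporary path and the file name below are hypothetical, and `sort(collect(keys(...)))` stands in for `SortedDict` so that the sketch needs no packages.

```julia
# Hypothetical counts for one publisher
counts = Dict(2001 => 5, 2000 => 12, 2002 => 0)

# Hypothetical location; the notebook names files after an MD5 hash instead
path = joinpath(mktempdir(), "example.txt")

# Write the publisher name, then one "year<TAB>count" line per year,
# in ascending order of year
open(path, "w") do io
    println(io, "Publisher A")
    for y in sort(collect(keys(counts)))
        println(io, y, "\t", counts[y])
    end
end

# Read it back: the first line is the name, the rest are year/count pairs
lines  = readlines(path)
name   = lines[1]
parsed = Dict(parse(Int, y) => parse(Int, n) for (y, n) in split.(lines[2:end], '\t'))

println(name)              # Publisher A
println(parsed == counts)  # true
```

Keeping the publisher name inside the file makes each file self-describing, since a hash-based file name cannot be turned back into the name.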

In [7]:
@showprogress output = stdout (
for (pub, counts) in scihub_counts

    # the file name is the MD5 hash of the publisher name
    name = bytes2hex(md5(pub))
    
    open(dir * "/../data/publishers sci-hub/" * name * ".txt", "w") do io
        # first line: publisher name, then one "year<TAB>count" line per year
        println(io, pub)
        for (y, n) in SortedDict(counts)
            println(io, string(y, "\t", n))
        end
    end

end )

Progress: 100%|█████████████████████████████████████████| Time: 0:00:00
