# Sci-Hub: number of articles available as full-text

This notebook contains code in the **Julia language** that calculates the number of articles available as full text in Sci-Hub for every date since 2011 (when Sci-Hub was established). Only articles whose DOI corresponds to *journal-article* or *proceedings-article* in Crossref are counted (book chapters, for example, are omitted).

The script requires pre-processed, tab-delimited Crossref files to be present in the *data/crossref* folder. To generate the .tab files from the publicly available Crossref dataset, see *crossref to tab.ipynb*. The script also takes as input the *sci-hub doi date.tab* file provided by Sci-Hub. For every DOI that is currently available in Sci-Hub as full text, that file gives the date of the DOI's earliest appearance in the Sci-Hub download logs.

The script produces the *sci-hub date count.tab* file, which lists, for every date (formatted as YYYY-MM-DD), the number of DOIs that first appeared in Sci-Hub on that date. The output is used in the *sci-hub db growth.ipynb* notebook.

The data is prepared for the paper 'From black open access to open access of colour: accepting the diversity of approaches towards free science'.

In [1]:
using Glob,
      Random,
      ProgressMeter,
      DelimitedFiles,
      DataStructures

dir = @__DIR__ ;

Load Sci-Hub's list of DOIs available as full text, together with the date of each DOI's earliest appearance in the download logs:

In [2]:
scihub_doi_date = dir * "/../data/sci-hub doi date.tab" ;

doi_date = readdlm(scihub_doi_date, '\t', String, '\n', quotes = false, use_mmap = true)
doi_date = Dict(doi_date[:,1] .=> doi_date[:,2]) ;
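
To illustrate the shape of this lookup table, here is a minimal sketch on a temporary file. The three DOIs and dates are made up for illustration and the name `lookup` is hypothetical; only the `readdlm` call and the `Dict(... .=> ...)` construction mirror the cell above.

```julia
using DelimitedFiles

# Made-up sample standing in for "sci-hub doi date.tab" (DOI <tab> date)
sample = "10.1000/a\t2011-09-05\n10.1000/b\t2012-01-17\n10.1000/c\t2011-09-05\n"

path, io = mktemp()
write(io, sample)
close(io)

# Read every cell as a String, exactly two columns per row
rows = readdlm(path, '\t', String, '\n', quotes = false)

# Pair column 1 (DOI) with column 2 (date) and build a dictionary
lookup = Dict(rows[:, 1] .=> rows[:, 2])

println(lookup["10.1000/b"])           # date stored for a known DOI
println(haskey(lookup, "10.1000/z"))   # membership test for an unknown DOI
```

Membership tests against the dictionary's keys are hash lookups, which is what keeps the later per-DOI check affordable on a large list.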

Load the list of Crossref .tab files:

In [3]:
tabs = glob("*.tab", dir * "/../data/crossref")

31601-element Vector{String}:
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/0.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 24 bytes ⋯ "ocessing/../data/crossref/1.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 25 bytes ⋯ "cessing/../data/crossref/10.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 26 bytes ⋯ "essing/../data/crossref/100.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 27 bytes ⋯ "ssing/../data/crossref/1000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10000.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10001.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10002.tab"
 "/media/alexandra/8 TB/Projects/" ⋯ 28 bytes ⋯ "sing/../data/crossref/10003.tab"
 ⋮

The following function takes a Crossref .tab file as input and extracts the DOIs that:

- are of a specified type (in this case, journal-article or proceedings-article), and
- can be found in the Sci-Hub full-text database.

The list of extracted DOIs is returned as output.

In [4]:
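function articles(tab, subset, superset)

    dois = String[]
    data = readdlm(tab, '\t', String, '\n', quotes = false, use_mmap = true)

    # only the first two columns (DOI and Crossref type) are used
    for (doi, class, _, _, _) in eachrow(data)
        # keep the DOI if its type is wanted and Sci-Hub has the full text
        if class ∈ subset && doi ∈ superset
            push!(dois, doi)
        end
    end

    dois
end ;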
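
As a quick check of the two-condition filter, the sketch below applies the same logic to a made-up four-row file. The DOIs, the placeholder `x` columns and the names `wanted_types`, `in_scihub` and `kept` are all hypothetical; the five-column layout is assumed from the destructuring in the function above.

```julia
using DelimitedFiles

# Made-up Crossref-style rows: DOI, type, and three placeholder columns
rows = [
    "10.1000/a\tjournal-article\tx\tx\tx",
    "10.1000/b\tbook-chapter\tx\tx\tx",
    "10.1000/c\tproceedings-article\tx\tx\tx",
    "10.1000/d\tjournal-article\tx\tx\tx",
]
path, io = mktemp()
write(io, join(rows, '\n') * "\n")
close(io)

wanted_types = Set(["journal-article", "proceedings-article"])
in_scihub    = Set(["10.1000/a", "10.1000/b", "10.1000/c"])  # toy stand-in for keys(doi_date)

data = readdlm(path, '\t', String, '\n', quotes = false)

# Keep a DOI only if its type is wanted AND it is present in the Sci-Hub set
kept = [doi for (doi, class) in zip(data[:, 1], data[:, 2])
        if class ∈ wanted_types && doi ∈ in_scihub]

println(kept)
```

Here `10.1000/b` is dropped because of its type and `10.1000/d` because it is missing from the toy Sci-Hub set, so only the first and third DOIs remain.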

The function is used to calculate the number of articles that became available in Sci-Hub on each day:

In [5]:
date_count = Dict()

shuffle!(tabs)

@showprogress output = stdout (
    
    # iterate over every Crossref .tab file
    for tab in tabs

        # extract the DOIs that are available in Sci-Hub and
        # are either journal or proceedings articles
        dois = articles(tab, ["journal-article", "proceedings-article"], keys(doi_date))

        # iterate over every DOI extracted
        for doi in dois

            # get date of earliest appearance of DOI in Sci-Hub logs
            date = doi_date[doi]

            # increment the counter for that date:
            date_count[date] = get(date_count, date, 0) + 1

            # free memory of processed DOI to avoid overloading
            delete!(doi_date, doi)
        end
    end )

date_count  # preview the result

Progress: 100%|█████████████████████████████████████████| Time: 0:07:49

Dict{Any, Any} with 3210 entries:
  "2017-09-04" => 7753
  "2012-05-16" => 4747
  "2019-05-04" => 5688
  "2013-08-11" => 4416
  "2019-07-08" => 9697
  "2016-03-05" => 9695
  "2017-12-03" => 4860
  "2017-11-23" => 5262
  "2019-02-06" => 10767
  "2015-09-12" => 13917
  "2017-05-22" => 7898
  "2019-07-04" => 7729
  "2014-02-01" => 4228
  "2020-02-27" => 4937
  "2013-06-23" => 6118
  "2018-08-14" => 13898
  "2015-08-26" => 27127
  "2016-03-26" => 59611
  "2011-10-24" => 527
  "2019-05-21" => 16622
  "2016-08-28" => 2787
  "2015-01-03" => 40373
  "2013-07-26" => 5995
  "2018-06-25" => 23444
  "2012-06-25" => 5068
  ⋮            => ⋮
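
Each value above is the number of DOIs first seen on that date. One way to turn such per-day counts into a running total of articles available up to each date is a cumulative sum over the dates in chronological order. The sketch below uses made-up counts; `daily`, `dates` and `totals` are hypothetical names, and this is not necessarily how the *sci-hub db growth.ipynb* notebook does it.

```julia
# Made-up per-day counts; the real values would come from date_count
daily = Dict("2011-09-05" => 2, "2011-09-07" => 5, "2011-09-06" => 1)

# YYYY-MM-DD strings sort chronologically when sorted as plain text
dates = sort(collect(keys(daily)))

# Running total of DOIs available up to and including each date
totals = cumsum([daily[d] for d in dates])

for (d, t) in zip(dates, totals)
    println(d, '\t', t)
end
```

Dates on which no new DOI appeared are absent from the dictionary, so a plot over a continuous calendar would need to carry the previous total forward for those days.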

Save calculated numbers to output file:

In [6]:
writedlm(dir * "/../data/sci-hub date count.tab", date_count |> SortedDict)
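
To show the layout of the resulting file, the following sketch writes made-up counts to a temporary path and reads them back. It sorts with Base `sort` instead of `SortedDict` so that it runs without DataStructures; that substitution, and the names `counts`, `table` and `back`, are assumptions made for the example.

```julia
using DelimitedFiles

# Made-up counts standing in for date_count
counts = Dict("2011-09-06" => 1, "2011-09-05" => 2, "2011-09-07" => 5)

# Build a two-column table (date, count) ordered by date
dates = sort(collect(keys(counts)))
table = hcat(dates, [counts[d] for d in dates])

path = tempname()
writedlm(path, table)          # tab is the default delimiter

# Read the file back as text and parse the count column
back       = readdlm(path, '\t', String, '\n', quotes = false)
back_dates = back[:, 1]
back_count = parse.(Int, back[:, 2])

println(back_dates)
println(back_count)
```

Each line of the file is a date, a tab and a count, with dates in ascending order.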