Text- and Data Mining (TDM) case study

The CrossRef API participants were retrieved from here and here. Springer and Taylor & Francis appear only in the second link, which is odd.

Information retrieval

The coding followed a three-step approach. Step 1 collected the resources, Step 2 coded these resources, and Step 3 verified the codings with the publishers.

Step 1 pertains to the variables publisher, url, open_access, contact, whois, terms_conditions_url, crossref_tdm, crossref_tdm_clickthrough. These are resource variables that could be coded readily, with little time investment and little room for interpretation.

Step 2 requires more interpretation of the resources collected in Step 1. This pertains to the variables api, api_unrestricted, terms_conditions_scrape, nc_terms_conditions_scrape, terms_conditions_spider, nc_terms_conditions_spider, tdm_policy, lit_dump, lit_dump_unrestricted, lit_dump_policy. These were coded separately from Step 1 to clearly demarcate the objective elements from the more interpretative ones, so that the coding remains accountable and verifiable. To this end, notes will be kept per coded publisher to provide details.
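The split between Step 1 and Step 2 variables can be written down explicitly, which makes it easy to check whether a coded publisher is complete for a given step. The following is a minimal sketch in Python, assuming each coded publisher is stored as a dictionary keyed by variable name. The names `STEP1_VARIABLES`, `STEP2_VARIABLES`, and `missing_variables` are hypothetical and not part of the repository.

```python
# Variable names exactly as listed for Step 1 and Step 2 in this README.
STEP1_VARIABLES = [
    "publisher", "url", "open_access", "contact", "whois",
    "terms_conditions_url", "crossref_tdm", "crossref_tdm_clickthrough",
]
STEP2_VARIABLES = [
    "api", "api_unrestricted", "terms_conditions_scrape",
    "nc_terms_conditions_scrape", "terms_conditions_spider",
    "nc_terms_conditions_spider", "tdm_policy", "lit_dump",
    "lit_dump_unrestricted", "lit_dump_policy",
]


def missing_variables(row, step):
    """Return the variables of the given step (1 or 2) that are absent from a row.

    `row` is assumed to be a dict mapping variable names to coded values.
    """
    expected = {1: STEP1_VARIABLES, 2: STEP2_VARIABLES}[step]
    return [name for name in expected if name not in row]


if __name__ == "__main__":
    # Hypothetical, partially coded publisher.
    row = {"publisher": "Example Press", "url": "https://example.org"}
    print(missing_variables(row, 1))
```

Keeping the two lists separate mirrors the protocol: Step 1 can be checked for completeness before any interpretative Step 2 coding starts.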

Step 3 includes emailing the publishers to verify or disconfirm the codings.

INCLUDE TEMPLATE EMAIL HERE

Coding protocol

1. Open the Terms and Conditions link.
2. Save the page to the Wayback Machine.
3. Open hypothes.is.
4. Annotate the parts containing information on scraping in yellow; use CTRL+F with the keywords "scrap", "download", "automat", "bot", "spider" to identify key sections.
5. Code whether a specific country's copyright law applies in the terms and conditions in copyright_law_country (country acronym; + denotes that a modifier extends it to "other countries").
6. Code whether the user is allowed, based on the T&C, to download articles for research purposes. If scraping is not forbidden, it is seen as allowed.
7. Code terms_conditions_scrape: 0 = explicit statement forbidding scraping, 1 = no explicit statement forbidding scraping, NA = states nothing about scraping.
8. Code nc_terms_conditions_scrape: 0 = forbids scraping for non-commercial use or does not make a distinction, 1 = allows scraping specifically for non-commercial activities, NA = nothing coded for terms_conditions_scrape, so no coding is possible.
9. Code terms_conditions_spider in the same way as terms_conditions_scrape.
10. Code nc_terms_conditions_spider in the same way as nc_terms_conditions_scrape.
11. Use the site's web search to look for a TDM policy with the terms "tdm policy", "mining", "tdm". Inspect only the first page of results for hits pertaining to a TDM policy. If found, copy the URL of the policy into the spreadsheet under the variable tdm_policy.
12. Code api (availability of an API): 0 = no, 1 = yes, NA = no information found (note that if an API is available in crossref_tdm, this automatically converts to a yes).
13. Code api_free: 0 = restrictions, 1 = free (i.e., one does not need to make an account or accept conditions to use the API), NA = no API available. Note that if there is an agreement in CrossRef, this is seen as restricted.
14. Use the site's web search for "corpus", "dump", "subset" (based on how PMC calls their dump), "full-text", "fulltext" to look for a literature dump.
15. Code whether the downloads are freely available in lit_dump_free: 0 = restriction (e.g., login), 1 = free, NA = no dump available.
16. If there is a policy available on the download page, add the link to the policy in lit_dump_policy.
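Two parts of the protocol are mechanical enough to illustrate in code: the keyword search used to locate relevant passages in the terms and conditions, and the rule that an entry in crossref_tdm converts api to a yes. The following is a minimal sketch in Python, not the procedure used for the actual coding (which was done by hand with CTRL+F and hypothes.is). The function names `find_keyword_sentences` and `code_api` are hypothetical, and NA is represented as `None`.

```python
import re

# Keyword stems from the protocol; stems match "scraping", "automated", etc.
KEYWORD_STEMS = ["scrap", "download", "automat", "bot", "spider"]


def find_keyword_sentences(text, stems=KEYWORD_STEMS):
    """Return (stem, sentence) pairs for every sentence that contains a stem.

    Matching is a plain case-insensitive substring search, so false positives
    (e.g. "bot" in "both") are expected and still need human review.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for stem in stems:
            if stem in lowered:
                hits.append((stem, sentence))
    return hits


def code_api(api_found, crossref_tdm):
    """Code `api` as 1 (yes), 0 (no), or None (NA, no information found).

    An entry in crossref_tdm automatically converts the coding to 1.
    """
    if crossref_tdm:
        return 1
    return api_found


if __name__ == "__main__":
    # Hypothetical terms-and-conditions excerpt.
    excerpt = (
        "You may not use robots or spiders. "
        "Downloading single articles is permitted."
    )
    for stem, sentence in find_keyword_sentences(excerpt):
        print(stem, "->", sentence)
```

The search only flags candidate passages; deciding whether a passage forbids scraping or spidering remains an interpretative Step 2 judgment.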

All links included in the spreadsheet will be archived at the Wayback Machine for persistence. A duplicate spreadsheet that provides these links will be included in case they are required in the future.
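Archiving the links can be prepared programmatically. The following is a minimal sketch in Python that only builds URLs and sends no requests. It assumes the Wayback Machine's public "Save Page Now" URL pattern (`https://web.archive.org/save/` followed by the page URL) and its availability endpoint (`https://archive.org/wayback/available`); the function names and the example URL are hypothetical.

```python
from urllib.parse import quote

# Assumed public Wayback Machine URL patterns; verify before relying on them.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"
WAYBACK_AVAILABLE_API = "https://archive.org/wayback/available?url="


def wayback_save_url(url):
    """Build the URL that asks the Wayback Machine to archive `url`."""
    return WAYBACK_SAVE_PREFIX + url


def wayback_lookup_url(url):
    """Build the availability query that reports the closest snapshot of `url`."""
    return WAYBACK_AVAILABLE_API + quote(url, safe="")


def archive_table(urls):
    """Pair each original link with its save and lookup URLs.

    The result is a list of dicts that could be written out as the
    duplicate spreadsheet of archived links.
    """
    return [
        {"original": u, "save": wayback_save_url(u), "lookup": wayback_lookup_url(u)}
        for u in urls
    ]


if __name__ == "__main__":
    for row in archive_table(["https://example.org/terms"]):
        print(row)
```

Opening each `save` URL in a browser, or requesting it with an HTTP client, would trigger the snapshot; the `lookup` URL can afterwards be used to record the archived address.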

