
bibliometric

Scripts used in analyzing bibliometric data of COVID-19 research.

Script get_queries_from_scopus_eid.py helps you collect data for more than 2,000 documents from Scopus

Instructions for usage

If your Scopus query results in more than 2,000 documents, you can select the first 20,000 documents and export the citation
information only (you will receive a link to download a file named scopus.csv by email).

Use the downloaded file as input to the script in order to collect the other information (bibliographic information, abstract & keywords,
funding details and other information).

The script takes the scopus.csv file as input and creates queries in Scopus Advanced search syntax, which can be copied
directly into the search engine to obtain chunks of fewer than 2,000 documents each.

The generated queries are saved to queries.csv (copy-paste the query from each row into Scopus).

By default, 2,000 documents (the maximum number of documents that can be extracted)
are selected per query by their unique identifier (EID).
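
The following is a minimal sketch of the chunking idea, assuming the exported scopus.csv contains an EID column (the actual script may differ in details):

	import pandas as pd

	CHUNK_SIZE = 2000  # maximum number of documents per exported chunk

	# Read the citation-only export and collect the unique document identifiers.
	eids = pd.read_csv("scopus.csv")["EID"].dropna().tolist()

	# Build one Advanced search query per chunk of at most 2,000 EIDs.
	queries = [
	    " OR ".join(f"EID({eid})" for eid in eids[i:i + CHUNK_SIZE])
	    for i in range(0, len(eids), CHUNK_SIZE)
	]

	# One query per row; paste each into the Scopus search engine and export the chunk.
	pd.DataFrame({"query": queries}).to_csv("queries.csv", index=False)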

The partial csv files can be merged into one file using the script merge_from_scopus_csv.py

Note: if your search query results in more than 20,000 documents,
we advise splitting the results into smaller chunks directly in Scopus.

Script merge_from_scopus_csv.py merges document information from several small csv files into a single file (also saved in xlsx format)

Instructions for usage

Rename the smaller csv files from Scopus using the following format:
scopus (0).csv, scopus (1).csv, ...

Use the main function merge_partial_files to merge them.
The parameter n_max corresponds to the number of csv files to be merged.
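
A minimal sketch of what merge_partial_files does, assuming pandas (the output file name is a hypothetical choice):

	import pandas as pd

	def merge_partial_files(n_max, out_name="scopus_merged"):
	    # Read the renamed partial exports scopus (0).csv ... scopus (n_max-1).csv.
	    frames = [pd.read_csv(f"scopus ({i}).csv") for i in range(n_max)]
	    merged = pd.concat(frames, ignore_index=True)
	    # Save the merged table in both csv and xlsx format.
	    merged.to_csv(f"{out_name}.csv", index=False)
	    merged.to_excel(f"{out_name}.xlsx", index=False)
	    return merged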

Script add_sciences_to_scopus.py merges the document information with information about the sources from which the documents originate

Instructions for usage

NOTE! Most of the files used for running the code are available to registered Scopus users
only and, due to copyright issues, are not available on GitHub.

	However, detailed information on how to obtain the data is provided below for registered Scopus users.

In order to run the code, three files have to be provided by the user (a minimal merging sketch follows the file descriptions):

fn_documents:
	File that contains bibliographic properties of documents from Scopus (for registered Scopus users only).
	The file can be downloaded directly from Scopus and should be in Excel format
	(in case you have it in csv format, we suggest converting it with pandas).
	In case the search query results in more than 2,000 documents,
	we recommend using the get_queries_from_scopus_eid.py script.

fn_sources:
	File that contains source titles and metrics (for registered Scopus users only).
	To get the file:
		Visit https://www.scopus.com/sources?utm_medium=SORG&utm_source=TW&dgcid=STMJ_1592825286_PMES_IM
		Click on Download Scopus Source List
		Choose Download source titles and metrics
		Extract the zip file
		Copy the data from the sheet CiteScore 2019 to a new Excel file

fn_classification:
	File is available on GitHub under the name scopus_subject_area.xlsx.
	It summarizes freely available information from:
		https://service.elsevier.com/app/answers/detail/a_id/14882/supporthub/scopus/~/what-are-the-most-frequent-subject-area-categories-and-classifications-used-in/
		https://service.elsevier.com/app/answers/detail/a_id/15181/supporthub/scopus/
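
A minimal sketch of the merging idea, assuming pandas; the file names and the join columns ("Source title", "ASJC code") are assumptions, and the actual script may use different keys:

	import pandas as pd

	# Hypothetical file names; replace them with your own paths.
	documents = pd.read_excel("fn_documents.xlsx")
	sources = pd.read_excel("fn_sources.xlsx")
	classification = pd.read_excel("scopus_subject_area.xlsx")

	# Attach source metrics to each document via the source title, then attach
	# the subject area classification; the join columns here are assumptions.
	merged = documents.merge(sources, on="Source title", how="left")
	merged = merged.merge(classification, on="ASJC code", how="left")

	merged.to_excel("documents_with_sciences.xlsx", index=False)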

Script jaccard_sac.py plots a clustermap of Subject area classifications based on the Jaccard distance

Instructions for usage

Run the code with the provided file binary_sac.xlsx.
The file binary_sac.xlsx is computed from the Scopus document dataset,
where each row represents a document and each column a Subject area classification (SAC).
The value 1 indicates that a document is classified into a particular SAC, and 0 otherwise.
The original information about the documents from Scopus is removed from the xlsx file due to copyright issues.
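
A minimal sketch of the clustermap computation, assuming scipy and seaborn (the file layout and the default linkage settings are assumptions):

	import pandas as pd
	import seaborn as sns
	import matplotlib.pyplot as plt
	from scipy.spatial.distance import pdist, squareform

	# Rows are documents, columns are SACs (0/1 membership).
	binary = pd.read_excel("binary_sac.xlsx", index_col=0)

	# Pairwise Jaccard distance between SACs, i.e. between the columns of the matrix.
	dist = squareform(pdist(binary.T.values, metric="jaccard"))
	dist_df = pd.DataFrame(dist, index=binary.columns, columns=binary.columns)

	# Hierarchically clustered heatmap of the distance matrix.
	sns.clustermap(dist_df)
	plt.show()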

Script plot_venn_diagram.py plots a Venn diagram showing the overlap between different subject areas

Instructions for usage

Run the code with the provided file binary_subject_areas.xlsx.
The file binary_subject_areas.xlsx is computed from the Scopus document dataset,
where each row represents a document and each column a Subject area.
The value 1 indicates that a document is classified into a particular Subject area, and 0 otherwise.
The original information about the documents from Scopus is removed from the xlsx file due to copyright issues.
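
A minimal sketch of the diagram for three subject areas, assuming the matplotlib_venn package (the column names below are placeholders for the actual Subject area columns):

	import pandas as pd
	from matplotlib import pyplot as plt
	from matplotlib_venn import venn3

	binary = pd.read_excel("binary_subject_areas.xlsx", index_col=0)

	# Placeholder column names; use the Subject area columns present in the file.
	areas = ["Health Sciences", "Life Sciences", "Physical Sciences"]

	# Each set holds the row indices of the documents belonging to that area.
	sets = [set(binary.index[binary[a] == 1]) for a in areas]

	venn3(sets, set_labels=areas)
	plt.show()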

Script text_mining.py uses the abstracts of documents and converts them to TF-IDF features. Besides the abstract, the Subject area membership has to be provided. The script converts the dataset with abstracts into a dataset which can be used for classification purposes

Instructions for usage

Use the Abstract information from the Scopus document dataset and add the Subject area membership (0 or 1).
For adding the membership information, see the documentation of add_sciences_to_scopus.py.
The script uses additional stopwords stored in additional_stop_words.txt (on GitHub).
The text processing procedure follows: https://github.com/miguelfzafra/Latest-News-Classifier/blob/master/0.%20Latest%20News%20Classifier/03.%20Feature%20Engineering/03.%20Feature%20Engineering.ipynb
The script returns a dataset which can be further used for classification.
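
A minimal sketch of the feature extraction, assuming scikit-learn (the input file, column names, and vectorizer parameters are assumptions; the referenced notebook adds further cleaning steps):

	import pandas as pd
	from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

	# Hypothetical input file: abstracts plus a 0/1 Subject area membership column.
	df = pd.read_excel("abstracts_with_membership.xlsx")

	# Combine the built-in English stopwords with additional_stop_words.txt.
	with open("additional_stop_words.txt") as f:
	    extra = [w.strip() for w in f if w.strip()]
	stop_words = list(ENGLISH_STOP_WORDS.union(extra))

	# Convert the abstracts into TF-IDF features.
	vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=300)
	X = vectorizer.fit_transform(df["Abstract"])

	# Assemble the dataset for classification: features plus the membership label.
	dataset = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
	dataset["label"] = df["membership"].values  # hypothetical column name
	dataset.to_excel("dataset_for_classification.xlsx", index=False)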

NOTE! The original information about the abstracts is not available on GitHub due to Scopus copyright issues.
However, the output of the code is available in dataset_for_classification.xlsx
