A set of Python scripts implementing controlled snowball sampling to gather a collection of seminal scientific publications on a desired subject. Details of the approach are available in
- Dobrovolskyi, H., Keberle, N., Todoriko, O. (2017). Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection. In: Różewski, P., Lange, C. (eds) Knowledge Engineering and Semantic Web. KESW 2017. Communications in Computer and Information Science, vol 786. Springer, Cham. https://doi.org/10.1007/978-3-319-69548-8_7
- Dobrovolskyi, H., & Keberle, N. (2018). Collecting the Seminal Scientific Abstracts with Topic Modelling, Snowball Sampling and Citation Analysis. In ICTERI 2018.
- Dobrovolskyi, H., & Keberle, N. (2018, May). On convergence of controlled snowball sampling for scientific abstracts collection. In International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications (pp. 18-42). Springer, Cham.
- Kosa, V., Chaves-Fraga, D., Dobrovolskyi, H., Fedorenko, E., & Ermolayev, V. (2019). Optimizing Automated Term Extraction for Terminological Saturation Measurement. ICTERI, 1, 1-16.
- Kosa, V., Chaves-Fraga, D., Dobrovolskyi, H., & Ermolayev, V. (2019, June). Optimized term extraction method based on computing merged partial C-values. In International Conference on Information and Communication Technologies in Education, Research, and Industrial Applications (pp. 24-49). Springer, Cham.
- Dobrovolskyi, H., & Keberle, N. (2020). Obtaining the Minimal Terminologically Saturated Document Set with Controlled Snowball Sampling. In ICTERI (pp. 87-101).
Requirements: To run the package you need
- python3
- python-poetry
- optionally: anonymous proxy to query Google Scholar
You can use one of the proxy services listed at https://www.didsoft.com
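Before going further you can quickly verify the prerequisites; the pip-based installation of python-poetry below is only one of several supported ways to install it (pipx or the official installer also work):
$ python3 --version
$ pip install --user poetry
$ poetry --version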
- get a copy of this repository with
$ git clone https://github.com/gendobr/snowball.git
and step inside the snowball directory. The content of the directory is:
data - empty directory to place your data
docs - place for additional documentation
pyproject.toml - list of required packages
README.md - this file
scripts - python code
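Assuming the default clone directory name, stepping inside is simply:
$ cd snowball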
- copy the directory ./docs/data into ./data/YOUR_DATA_DIRECTORY
$ cp -R ./docs/data ./data/YOUR_DATA_DIRECTORY
- install all required Python packages with
$ poetry install
- download the required NLTK packages
$ poetry run python scripts/init.py
- find 10-20 seed publications at explore.openalex.org.
Each seed publication should
  - be relevant to your search topic
  - have a high (but not extremely high) citation index
  - be 7-10 years old
The reasoning behind these conditions is discussed in Lecy, J. D., & Beatty, K. E. (2012). Representative literature reviews using constrained snowball sampling and citation network analysis. Available at SSRN 1992601.
Paste the publication ids into the file
./data/YOUR_DATA_DIRECTORY/in-seed.csv
one id per row. The publication id is the long number in the publication URL. For instance, in the URL https://academic.microsoft.com/paper/2899429816/citedby/ the publication id is 2899429816.
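For illustration only, here is one way to add a seed id to that file from the shell, using the example id from the URL above; repeat this for each of your 10-20 seeds, one id per line:
# append a single seed publication id to the seed file
$ echo 2899429816 >> ./data/YOUR_DATA_DIRECTORY/in-seed.csv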
- run the following files one after another (don't forget to change
../data/GAN/
to
../data/YOUR_DATA_DIRECTORY/
inside each file; a sketch of one way to do this in a single pass is shown after this step)
000_download.sh - may take several hours to download up to 20000 baseline publications
001_tokenizer.sh - performs the tokenization step using NLTK tools
002_rarewords.sh - detects rare words
003_joint_probabilities.sh - estimates token co-occurrence probabilities
004_stopwords.sh - detects stopwords
005_reduced_joint_probabilities.sh - estimates token co-occurrence probabilities after the rare words and stopwords are excluded
006_SSNMF.sh - creates the topic model in 1-2 hours
007_restricted_snowball.sh - may take several hours to download up to 20000 relevant publications
008_search_path_count.sh - performs the search path count calculation (see Main path analysis for an explanation)
009_extend_items_google_scholar.sh - downloads several hundred publications from Google Scholar, so you must use a proxy to avoid being banned. The proxy address is the proxy parameter in the configuration file ./data/YOUR_DATA_DIRECTORY/config.ini
009_extend_items_google_scholar_resume.sh - sometimes you need to resume the previous command
010_download_pdfs.sh - downloads the PDF files that are available for free
011_export_xlsx.sh - creates the final report according to ./docs/data-requirements.txt
The final list of publications is the file 011_exported.xlsx, i.e. the value of the --outfile parameter in the 011_export_xlsx.sh script.
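For convenience, a minimal sketch of switching the data directory in every pipeline script at once and then launching the steps in order; the script locations and the use of GNU sed are assumptions, so adjust them to your checkout:
# rewrite the example data path in all pipeline scripts (assumes GNU sed and the .sh files in the current directory)
$ sed -i 's|\.\./data/GAN/|../data/YOUR_DATA_DIRECTORY/|g' ./*.sh
# then launch the steps one after another, for example
$ bash 000_download.sh
$ bash 001_tokenizer.sh
# ...continue through 011_export_xlsx.sh, checking the output of each step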
- Optionally, you can extend the pipeline with an ATE (automated term extraction) step
012_ate_pdf2txt.sh - extracts plain text from the PDF files
013_ate_clear_txt.sh - cleans the extracted texts
014_ate_generate_datasets.sh - joins the extracted texts into a sequence of datasets
015_ate_get_terms.sh - extracts terms
016_ate_clear_terms.sh - removes trash terms (the list of trash terms is the file ./data/YOUR_DATA_DIRECTORY/ate_stopwords.csv)
017_ate_saturation.sh - performs the terminological saturation analysis
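As with the main pipeline, these scripts are run one after another once the data directory has been adjusted. For illustration, a hedged sketch; the one-term-per-line format of ate_stopwords.csv and the script locations are assumptions:
# add a junk term to the trash-term list (assumes one term per line)
$ echo "copyright" >> ./data/YOUR_DATA_DIRECTORY/ate_stopwords.csv
# run the ATE steps in order
$ bash 012_ate_pdf2txt.sh
$ bash 013_ate_clear_txt.sh
# ...continue through 017_ate_saturation.sh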