https://extractor.readthedocs.io/en/latest/
This repository contains a Python script for extracting and analyzing information from scientific articles in PDF format. The script performs several tasks to support the analysis of multiple articles located in the /papers directory. To extract the information, the script uses the GROBID service (2008-2022): https://github.com/kermitt2/grobid.
- Uses GROBID to extract text from PDF documents, enabling further analysis of the content.
- Creates a keyword cloud based on the abstracts of the articles, providing a visual representation of the most common words.
- Counts the number of figures in each article, aiding in understanding the visual content of the research presented.
- Attempts to extract links within the PDF documents, particularly references cited in the articles, providing additional resources for research.
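As a rough illustration of the tasks listed above, the sketch below analyzes a TEI XML string of the kind GROBID returns. It is not the code in src/script.py: the function names, the tiny stop-word list, and the assumption that tables appear as figure elements with type="table" are all illustrative choices.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of TEI documents (the XML format GROBID returns).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on", "with", "we"}


def abstract_keywords(root, top_n=10):
    """Return the most frequent non-stop-words of the abstract with their counts."""
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is None:
        return []
    text = " ".join(abstract.itertext()).lower()
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS]
    return Counter(words).most_common(top_n)


def count_figures(root):
    """Count figure elements, skipping those marked as tables (assumed convention)."""
    figures = root.findall(".//tei:figure", TEI_NS)
    return sum(1 for fig in figures if fig.get("type") != "table")


def extract_links(root):
    """Collect distinct http(s) targets from any element carrying a target attribute."""
    links = []
    for element in root.iter():
        target = element.get("target", "")
        if target.startswith(("http://", "https://")) and target not in links:
            links.append(target)
    return links


def analyze_tei(tei_xml):
    """Run the three analyses on one TEI document given as a string."""
    root = ET.fromstring(tei_xml)
    return {
        "keywords": abstract_keywords(root),
        "figures": count_figures(root),
        "links": extract_links(root),
    }
```

The keyword counts returned by analyze_tei could then be passed to any word-cloud renderer, and the per-article figure counts and links can be collected by calling it once per TEI document.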
First, clone the repository:
git clone https://github.com/adrijmz/extractor.git
To install the GROBID image, execute the following command:
docker pull lfoppiano/grobid:0.7.2
To build the extractor image, execute the following commands from the root directory of the repository:
cd /path/to/root/directory/of/extractor
docker build -t extractor .
To install the GROBID image, execute the following command:
docker pull lfoppiano/grobid:0.7.2
This project requires Python >= 3.11.
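To confirm that an interpreter satisfies the Python >= 3.11 requirement before installing the dependencies, a small check along these lines can help. The helper name meets_minimum is hypothetical and is not part of the repository.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 11)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or the running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= minimum
```

Calling meets_minimum() with no arguments checks the interpreter that is currently running; passing a tuple such as (3, 10) checks an arbitrary version.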
Create a virtual environment to isolate the project dependencies:
conda create -n myenv python=3.11
Initialize conda for your shell if necessary:
conda init
Activate the new environment:
conda activate myenv
Install the dependencies:
cd /path/to/root/directory/of/extractor
pip install -r requirements.txt
Create a Docker network so that both containers can communicate:
docker network create extractor_network
To run the GROBID container, execute the following command:
docker run --name server --network extractor_network -p 8070:8070 lfoppiano/grobid:0.7.2
To run the extractor container, open a new terminal window and execute the following command:
docker run --name extractor --network extractor_network extractor
If you used Docker to run extractor and want to see the generated files, execute the following commands.
To check the container ID:
docker ps -a
To copy all files to a desired directory:
docker cp container_id:/app /path/to/your/directory
To run the GROBID container, execute the following command:
docker run --name server -p 8070:8070 lfoppiano/grobid:0.7.2
In src/script.py, change this url value:
url = 'http://server:8070/api/processFulltextDocument'
to this value:
url = 'http://localhost:8070/api/processFulltextDocument'
To run the Python script from the root directory, execute the following command:
python3 src/script.py
To access the GROBID service, go to the following URL: