GitHub - climate-nlp/reportparse: ReportParse is a unified NLP analyzer for corporate sustainability reports

A Unified NLP Analyzer for Corporate Sustainability Reports

ReportParse is a Python-based tool designed to parse corporate (sustainability) reports. It combines document structure analysis with natural language processing (NLP) models to extract sustainability-related information from the reports. We also provide easy-to-use web and command line interfaces. The tool is expected to aid researchers and analysts in evaluating corporate commitment to, and management of, sustainability efforts.

Tutorials

Understanding setup and basic example
Understanding setup and basic example (install without root permission for deepdoctection).
Analyzing sentiment of corporate sustainability reports

Why should I use ReportParse?

💡ReportParse reduces the workload of analyzing corporate sustainability activities for your research. Implementing PDF text extraction and wiring up NLP models for this purpose is painstaking, and ReportParse handles both for you.
💡ReportParse helps you test the robustness of your research. To check that your findings hold, you may want to try similar but different models or processing variants. ReportParse lets you easily switch the PDF analysis method and the NLP model.
💡ReportParse increases the reproducibility of your analysis. It uses open-source tools and methods, which makes it easier for other researchers to replicate your analysis.

You should use ReportParse for

Investigating the number of environmental claims in a corporate report.
Extracting claims related to GHG emission reduction targets in a corporate report.
Investigating ESG topics included in a corporate report.

You should NOT use ReportParse for

Fine-grained document structure analysis.
Tasks that require 100% accuracy. (In practice, you will face a lot of noise produced by the layout analysis and NLP models.)
Automating critical work, because ReportParse output usually contains noise and errors stemming from the analysis.

How does ReportParse work?

We provide a core engine and interfaces. Conceptually, the core engine of ReportParse was inspired by PaperMage, which extracts information from scientific papers. Unlike PaperMage, however, ReportParse does not assume any specific document structure, because companies publish reports in very different structures. We support existing NLP models for the climate and sustainability domain. For interfaces, we provide a web interface (based on Gradio) and a command line interface. Read our IJCAI 2024 demonstration paper "ReportParse: A Unified NLP Tool for Extracting Document Structure and Semantics of Corporate Sustainability Reporting" for technical details.

Understanding document structure, reader, and annotator

ReportParse can extract document structure from a report. The following figure shows the important document structure levels (page, block, and sentence) represented in ReportParse.

By using deepdoctection, the reader can analyze the document structure. Then, a ReportParse annotator assigns labels at each structure level (i.e., page, block, or sentence) using cutting-edge language models. For example, with the environmental_claim annotator, you can extract sentences that are related to environmental claims. You can easily change the reader and annotator, or create your own.
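The three levels can be pictured as a simple nested structure. The sketch below only illustrates the page, block, and sentence hierarchy and how a label can be attached at a chosen level. The class and field names (`Sentence`, `Block`, `Page`, `labels`) and the keyword rule standing in for a model are hypothetical; they are not ReportParse's actual classes or annotators.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Sentence:
    text: str
    labels: Dict[str, str] = field(default_factory=dict)  # annotator name -> label


@dataclass
class Block:
    sentences: List[Sentence] = field(default_factory=list)
    labels: Dict[str, str] = field(default_factory=dict)

    @property
    def text(self) -> str:
        # A block's text is the concatenation of its sentences.
        return " ".join(s.text for s in self.sentences)


@dataclass
class Page:
    number: int
    blocks: List[Block] = field(default_factory=list)
    labels: Dict[str, str] = field(default_factory=dict)


def iter_sentences(pages: List[Page]) -> Iterator[Tuple[int, Sentence]]:
    """Yield (page number, sentence) pairs in reading order."""
    for page in pages:
        for block in page.blocks:
            for sentence in block.sentences:
                yield page.number, sentence


# Toy document: one page, one block, two sentences.
pages = [Page(number=0, blocks=[Block(sentences=[
    Sentence("We aim to cut emissions by 50% by 2030."),
    Sentence("The report was published in June."),
])])]

# A sentence-level annotator attaches one label per sentence.
# A keyword rule stands in for the language model here.
for _, sentence in iter_sentences(pages):
    has_keyword = "emissions" in sentence.text
    sentence.labels["environmental_claim"] = "yes" if has_keyword else "no"

claims = [s.text for _, s in iter_sentences(pages)
          if s.labels["environmental_claim"] == "yes"]
print(claims)
```

A block-level or page-level annotator would fill `Block.labels` or `Page.labels` in the same way.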

See the currently supported readers and annotators.

Citation

@inproceedings{morio-etal-2024-reportparse,
  title     = {{R}eport{P}arse: A Unified NLP Tool for Extracting Document Structure and Semantics of Corporate Sustainability Reporting},
  author    = {Morio, Gaku and In, Soh Young and Yoon, Jungah and Rowlands, Harri and Manning, Christopher D.},
  booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  pages     = {to appear},
  year      = {2024},
  note      = {Demos},
}

Quick setup

Environment

We highly recommend preparing a Python environment dedicated to ReportParse, because the tool depends on specific versions of several external packages. At this time, we officially support the following version.

Python 3.8.16

Clone the project

git clone https://github.com/climate-nlp/reportparse
cd reportparse

Install dependencies

Run the following commands to install required packages.

pip install pip==23.3.1 setuptools==59.5.0 cython==3.0.6 wheel==0.42.0
pip install "deepdoctection[pt]==0.26" --no-deps
pip install -r requirements.txt
pip install git+https://github.com/facebookresearch/detectron2.git@ff53992b1985b63bd3262b5a36167098e3dada02 --no-build-isolation

python -m spacy download en_core_web_sm

# Make sure that the torch and torchvision versions are compatible with your Python version
pip install torch==1.10.1 torchvision==0.11.2
# If you use CUDA, for example:
#pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 -f https://download.pytorch.org/whl/torch_stable.html

IMPORTANT: To use deepdoctection, you need to install the external packages poppler, tesseract, leptonica, and qpdf.

sudo apt-get update
sudo apt install -y libtool poppler-utils python3-opencv tesseract-ocr qpdf

If you want to install the above libraries without root permissions, please refer to auto_install_deepdoctection_deps.sh and the example notebook.

After installing all of the above, you can check if the required packages are installed.

python -m reportparse.show_version
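
The system packages are the easiest part to miss. As a complementary check (not part of ReportParse), the sketch below looks for the command-line tools that those packages normally install. The executable names (`pdftoppm` from poppler-utils, `tesseract`, `qpdf`) are the usual ones on Debian/Ubuntu and are an assumption on other platforms; leptonica is a library without a standard executable, so it is not covered.

```python
import shutil
from typing import Dict, Iterable, Optional

# Executables assumed to be provided by poppler-utils, tesseract-ocr, and qpdf.
REQUIRED_TOOLS = ("pdftoppm", "tesseract", "qpdf")


def find_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```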

Quick start

The following examples analyze the example PDF file at reportparse/asset/example.pdf. Here, we use pymupdf as the reader and environmental_claim and sst2 as the annotators.

Using the Python command line tool

python -m reportparse.main \
  -i ./reportparse/asset/example.pdf \
  -o ./results \
  --input_type "pdf" \
  --overwrite_strategy "no" \
  --reader "pymupdf" \
  --annotators "environmental_claim" "sst2"

Using Python code

# Save this file as example_code.py under the root directory of the reportparse project

import os
from reportparse.reader.base import BaseReader
from reportparse.annotator.base import BaseAnnotator

reader = BaseReader.by_name('pymupdf')()
document = reader.read(input_path='./reportparse/asset/example.pdf')

document = BaseAnnotator.by_name("environmental_claim")().annotate(document=document)
document = BaseAnnotator.by_name("sst2")().annotate(document=document)

os.makedirs('./results', exist_ok=True)

# Save the full data as a JSON file
document.save('./results/example.pdf.json')
# Save the easy-to-use dataset as a CSV file
document.to_dataframe(level='sentence').to_csv('./results/example.pdf.sentence-level-dataset.csv')

python -m example_code

What do the results look like?

The output JSON file will look like example.pdf.json. This file is informative if you want to investigate the full document structure.
The output CSV file will look like example.pdf.sentence-level-dataset.csv. This file is useful for counting the labels in a document.
- If you use a different annotation level such as block or page, refer to example.pdf.block-level-dataset.csv or example.pdf.page-level-dataset.csv.

Example data analysis using the output file of ReportParse

Extracting environmental claims and counting them.

import pandas as pd

# Read the CSV dataset file
df = pd.read_csv('reportparse/asset/example_results/example.pdf.sentence-level-dataset.csv')
# Get environmental claim sentences
df_environment = df[df['environmental_claim'] == 'yes']
# Remove "too short" sentences from the environmental claim subset
df_environment = df_environment[df_environment['sentence_text'].str.split().str.len() > 20]

# Show some example text
print(df_environment['sentence_text'][:5])
# Results -->
# 10    Hitachi identifies, evaluates, and manages cli...
# 18    Therefore, we have established COz emissions p...
# 19    We also set and manage a metric for avoided em...
# 20    We continue to reduce COz emissions generated ...
# 22    In addition, in April 2021, Hitachi, Ltd. intr...
# Name: sentence_text, dtype: object

# Show summary counts
print('The number of total sentences:', len(df))
# Result --> The number of total sentences: 158
print('The number of environmental claim sentences:', len(df_environment))
# Result --> The number of environmental claim sentences: 62
print('Environmental claim ratio [%]:', 100 * len(df_environment) / len(df))
# Result --> Environmental claim ratio [%]: 39.24050632911393
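
Because the quick-start run also applies the sst2 annotator, the same kind of CSV can be used to cross-tabulate two labels. The sketch below uses a tiny in-memory table instead of the real file and only the standard library, so it runs without pandas. The `sst2` column name and the label values (`yes`/`no`, `positive`/`negative`) are assumptions about the CSV layout; check them against your own output before reusing this.

```python
import csv
import io
from collections import Counter

# Toy stand-in for example.pdf.sentence-level-dataset.csv (values are made up).
toy_csv = io.StringIO(
    "sentence_text,environmental_claim,sst2\n"
    "We reduced emissions at all plants.,yes,positive\n"
    "Climate risks may raise our costs.,yes,negative\n"
    "The meeting was held in Tokyo.,no,positive\n"
    "We expanded renewable energy use.,yes,positive\n"
)

rows = list(csv.DictReader(toy_csv))

# Count sentiment labels among sentences flagged as environmental claims.
claim_sentiment = Counter(
    row["sst2"] for row in rows if row["environmental_claim"] == "yes"
)

total_claims = sum(claim_sentiment.values())
for label, count in claim_sentiment.most_common():
    print(f"{label}: {count} of {total_claims} claims")
```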

Web interfaces

We provide two types of Gradio-based interfaces to help you explore the output.

The following is an example of launching a demo server, where you can upload your own PDF file and analyze it.

python -m reportparse.demo \
  --server_name 0.0.0.0 \
  --server_port 60233

The following is an example of launching a visualization server. Here, you can only select files that have already been analyzed (i.e., JSON output files).

python -m reportparse.viewer \
  --pdf_dir ./reportparse/asset \
  --json_dir ./results \
  --server_name 0.0.0.0 \
  --server_port 60233

Other tips

Options of the command line tool

When running python -m reportparse.main, you can use the following options.

Option name	Type	Description
-i, --input	str	The input file or directory path. If you specify a directory, all files under it are found automatically. You can specify either PDF files or JSON files (i.e., output data files of ReportParse). To input JSON files, change the `--input_type` option below.
-o, --output_dir	str	The output directory path.
--input_type	str ("pdf" or "json")	The input file type. The default is "pdf". If you specify "pdf", the input is treated as a PDF file. If you specify "json", the input is treated as a ReportParse output file and the data is loaded from it.
--reader	str ("pymupdf" or "deepdoctection")	The name of the PDF layout / text extraction method. The default is "pymupdf". We currently support "pymupdf" or "deepdoctection". See more detail at Reader types.
--annotators	List of str	The annotation methods to apply. An annotator assigns a label to each element (sentence, block, or page). See more detail at Annotator types. If you do not specify anything here, only the reader is applied (i.e., only document structure analysis is conducted).
--max_pages	int	The maximum number of pages the reader loads. All pages are read by default.
--skip_pages	List of int	The pages to skip. The default is None. Zero-indexed. For example, if you would like to skip the first cover page, you can specify 0.
--skip_load_image	bool (0 or 1)	Whether to skip loading the image of pages. The default is 0 (False).
--overwrite_strategy	str ("no", "all", "annotator-add", or "annotator-clear")	Whether to overwrite the output file if it exists. The default is "no". "no" will not overwrite the output file. "all" will replace the existing output file with the completely new one. "annotator-clear" will use existing "reader" results but does not use old annotator results. "annotator-add" will use existing "reader" results and overwrite the annotator results only for the specified annotators.
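
One way to read the four `--overwrite_strategy` values is as a rule for merging old and new annotator results. The function below is a hypothetical model of that rule, written from the description in the table rather than from ReportParse's source. `existing` and `new_annotations` are illustrative dictionaries mapping annotator names to results.

```python
from typing import Dict, Optional

STRATEGIES = ("no", "all", "annotator-add", "annotator-clear")


def resolve_annotations(
    strategy: str,
    existing: Optional[Dict[str, str]],
    new_annotations: Dict[str, str],
) -> Dict[str, str]:
    """Return the annotator results that end up in the output file.

    `existing` is None when no output file exists yet.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown overwrite strategy: {strategy!r}")
    if existing is None or strategy == "all":
        # Nothing to preserve, or everything is regenerated from scratch.
        return dict(new_annotations)
    if strategy == "no":
        # Keep the existing file untouched.
        return dict(existing)
    if strategy == "annotator-clear":
        # Reuse the reader output, discard all old annotator results.
        return dict(new_annotations)
    # "annotator-add": reuse the reader output and replace only the
    # annotators requested in this run.
    merged = dict(existing)
    merged.update(new_annotations)
    return merged


old = {"sst2": "old result", "environmental_claim": "old result"}
new = {"sst2": "new result"}

for strategy in STRATEGIES:
    print(strategy, "->", resolve_annotations(strategy, old, new))
```

In this model "all" and "annotator-clear" yield the same annotations; the difference described in the table is that "all" also reruns the reader, while "annotator-clear" reuses the existing reader results.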

We also provide annotator-specific optional arguments. To see them, run python -m reportparse.main --help.
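
For batch runs it can be handy to assemble the command in Python. The helper below is a sketch that uses only options listed in the table. The function name `build_command` is illustrative, and passing list-valued options such as `--skip_pages` as space-separated values (as `--annotators` is passed in the quick start) is an assumption.

```python
import sys
from typing import List, Optional, Sequence


def build_command(
    input_path: str,
    output_dir: str,
    reader: str = "pymupdf",
    annotators: Sequence[str] = (),
    skip_pages: Optional[Sequence[int]] = None,
    overwrite_strategy: str = "no",
) -> List[str]:
    """Build the argument list for `python -m reportparse.main`."""
    cmd = [
        sys.executable, "-m", "reportparse.main",
        "-i", input_path,
        "-o", output_dir,
        "--input_type", "pdf",
        "--reader", reader,
        "--overwrite_strategy", overwrite_strategy,
    ]
    if annotators:
        cmd += ["--annotators", *annotators]
    if skip_pages:
        # Page indices are zero-based, so 0 skips the cover page.
        cmd += ["--skip_pages", *(str(p) for p in skip_pages)]
    return cmd


cmd = build_command(
    "./reportparse/asset/example.pdf",
    "./results",
    annotators=["environmental_claim", "sst2"],
    skip_pages=[0],
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with file paths that contain spaces.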

Readers

We currently support the following readers. Note that no reader can be 100% accurate, given the diverse structure of corporate reports. Please use the reader that best suits your purpose. If you would like to add more, please contribute!

Reader name	Description	Pro	Con
`pymupdf`	We use Fitz of PyMuPDF to extract document structure and text from a PDF file. Note that it does not use layout analysis. We only apply sentence tokenization with spaCy, which means that a block and a sentence are exactly the same thing here.	Fast; no OCR errors; well tested	No layout analysis; no text extraction for image-based PDF files; low precision
`deepdoctection`	We use deepdoctection to analyze document structure and extract text by OCR engines. The block type includes "title", "text", and "list".	Layout analysis; text extraction for image-based PDF files	Slow; OCR errors; complicated installation; low recall

Annotators

We currently support the following annotators. If you would like to add more, please contribute!

Annotator name	Reference	Description	Default level
`climate_commitment`	Huggingface, Paper	Classify climate-related text into "climate commitments and actions" or not.	block
`climate_sentiment`	Huggingface, Paper	Classify climate-related text into climate-related "sentiment classes", either opportunity, neutral, or risk.	block
`environmental_claim`	Huggingface, Paper	Classify text into environmental claim or not. The model is trained on the EnvironmentalClaims dataset.	sentence
`est_bert`	Huggingface, Blog	Classify text into 26 ESG-related topics. The full list of labels can be found here.	sentence
`netzero_reduction`	Huggingface, Paper	Classify climate-related text into net-zero target, reduction target, or no-target.	block
`sst2`	Huggingface, Related paper	Classify text into positive or negative.	sentence
`transition_physical_renewable`	Huggingface, Paper	Classify text into transition risk exposure, physical risk exposure, or transition risk exposure related to renewable energy.	block
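
Because annotators label different levels by default, it helps to know which level to export with `to_dataframe(level=...)` for a given selection. The sketch below groups annotator names by the default levels copied from the table. The helper name `group_by_level` is illustrative, and the idea that each label is best read from the dataset of its own level is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Default levels copied from the annotator table.
DEFAULT_LEVEL = {
    "climate_commitment": "block",
    "climate_sentiment": "block",
    "environmental_claim": "sentence",
    "est_bert": "sentence",
    "netzero_reduction": "block",
    "sst2": "sentence",
    "transition_physical_renewable": "block",
}


def group_by_level(annotators: Iterable[str]) -> Dict[str, List[str]]:
    """Group the selected annotator names by their default annotation level."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in annotators:
        groups[DEFAULT_LEVEL[name]].append(name)
    return dict(groups)


selected = ["environmental_claim", "netzero_reduction", "sst2"]
groups = group_by_level(selected)
for level, names in groups.items():
    print(f"{level}-level dataset: {', '.join(names)}")
```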

Using custom huggingface models

Want to use one of your favorite huggingface models? You can use the special custom_huggingface annotator. For example, you can use FinanceInc/auditor_sentiment_finetuned as the annotator as follows.

python -m reportparse.main \
  -i reportparse/asset/example.pdf \
  -o ./results \
  --reader pymupdf \
  --annotators "custom_huggingface" \
  --custom_huggingface_annotator_name "auditor_sentiment" \
  --custom_huggingface_model_name_or_path "FinanceInc/auditor_sentiment_finetuned" \
  --custom_huggingface_level "block"

FAQs

I have faced errors when installing ReportParse.
- We provide installation examples as Google Colab notebooks. If the notebooks run without errors, the problem is likely in your own environment, and we do not handle inquiries in that case.
Can I use it for PDF files other than sustainability reports?
- Technically yes. However, we do not actively support general reports or other PDF files.
I want to extract more fine-grained document structure from my own report.
- Unfortunately, we do not aim to support every kind of document. Reports are usually unstructured documents distributed as PDF files, and it is impossible to support all of them. Instead, we want to implement more general methods that apply to various types of reports.
Is ReportParse reliable enough?
- We plan to add test code to ensure functional correctness. Please keep in mind that the tool may contain bugs. Do not hesitate to report any bugs you find.
Does ReportParse support tables and figures?
- Currently, no.

Future work

LLM support
Table support
Keyword extraction
Span-level annotation
