# Fulltext analysis

This notebook is a work in progress.

### To run this notebook, you need to install the following packages:

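- ```paperetl``` (Python)
- ```GROBID``` (Java, [installation](https://grobid.readthedocs.io/en/latest/Install-Grobid/))
- ```pandas``` (Python)
- ```pyResearchInsights``` (Python)
- ```openai``` (Python)
- ```torch``` (Python)
- ```transformers``` (Python)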

### The notebook is sectioned into the following parts:

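- [Parsing PDFs](#Parsing-PDFs)
- [Handling duplicates](#Handling-duplicates)
- [Formatting and relocating JSONs](#Formatting-and-relocating-JSONs)
- [Analyzing Springer abstracts with pyResearchInsights](#Analyzing-Springer-abstracts-with-pyResearchInsights)
- [Extracting information from fulltexts](#Extracting-information-from-fulltexts)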

In [None]:
# Parsing PDFs

# paperetl needs a running GROBID instance
# GROBID in turn needs a JVM to be installed
# !wget https://github.com/kermitt2/grobid/archive/0.7.2.zip
# !unzip 0.7.2.zip
# %cd grobid-0.7.2  (use %cd, because !cd does not persist between notebook commands)
# !./gradlew run
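
Before parsing, it can help to confirm that GROBID is actually reachable. The following is a minimal sketch, assuming GROBID runs locally on its default port 8070 and exposes its usual `/api/isalive` health endpoint; `GROBID_URL` and `grobid_is_alive` are hypothetical names that are not part of the notebook.

```python
import urllib.error
import urllib.request

# Assumption: GROBID runs locally on its default port with the stock REST API.
GROBID_URL = "http://localhost:8070"


def grobid_is_alive(base_url=GROBID_URL, opener=urllib.request.urlopen, timeout=5):
    """Return True if the GROBID health endpoint answers with 'true'."""
    try:
        with opener(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```

The `opener` parameter exists only so the check can be exercised without a live server; in the notebook a plain `grobid_is_alive()` call would be enough.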

In [1]:
import json
import glob
import pandas as pd
import uuid
import os

## Parsing PDFs

- [Back to top](#Fulltext-analysis)

In [None]:
# Parsing Google Scholar PDFs
# !mkdir ./data/google-scholar/papers/json
# !python -m paperetl.file data/google-scholar/papers json://data/google-scholar/papers/json

# Parsing Springer PDFs
# !mkdir ./data/springer/papers/json
# !python -m paperetl.file data/springer/papers json://data/springer/papers/json

# Parsing arXiv PDFs
# !mkdir ./data/arxiv/papers/json
# !python -m paperetl.file data/arxiv/papers json://data/arxiv/papers/json

In [None]:
gs_papers = glob.glob("./data/google-scholar/papers/json/*.json")
s_papers = glob.glob("./data/springer/papers/json/*.json")
a_papers = glob.glob("./data/arxiv/papers/json/*.json")

print("Parsed {0}, {1}, and {2} papers from Google Scholar (various publishers), Springer, and arXiv, respectively.".format(len(gs_papers), len(s_papers), len(a_papers)))

## Handling duplicates

- [Back to top](#Fulltext-analysis)

In [None]:
all_papers = []
all_papers_map = {}
sources = ["google-scholar", "springer", "arxiv"]

for source in sources:
    source_papers = glob.glob("./data/{0}/papers/json/*.json".format(source))
    for source_paper in source_papers:
        with open(source_paper) as json_file:
            paper = json.load(json_file)
            location_pdf = "-".join(source_paper.split("-")[:-1]).replace("/papers/json/", "/papers/") + ".pdf"
            paper_dict = {"source": source, "location_json": source_paper, "location_pdf": location_pdf, "details": paper}
            all_papers.append(paper_dict)
            # Group papers by exact title; setdefault creates the list on first use
            all_papers_map.setdefault(paper["title"], []).append(paper_dict)
    print("Added papers from source:", source)
print("Total papers collected:", len(all_papers))
print("Total unique titles collected:", len(all_papers_map.keys()))

In [None]:
duplicates = []

for title in all_papers_map.keys():
    if len(all_papers_map[title]) > 1:
        paper_sources = []
        paper_versions = []
        paper_pdfs = []
        paper_jsons = []
        for paper in all_papers_map[title]:
            paper_sources.append(paper["source"])
            paper_versions.append(paper["details"])
            paper_pdfs.append(paper["location_pdf"])
            paper_jsons.append(paper["location_json"])
        duplicates.append({"title": all_papers_map[title][0]["details"]["title"], "sources": paper_sources, "versions": paper_versions, "pdfs": paper_pdfs, "jsons": paper_jsons})
print("Separated {} duplicate papers for further analysis".format(len(duplicates)))
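
The grouping above keys on the exact title string, so two copies whose titles differ only in letter case or spacing would not be matched. Here is a minimal sketch of a more tolerant key, assuming (this is not in the notebook) that lowercasing and collapsing whitespace is safe for these titles. `normalize_title` and `group_by_title` are hypothetical helpers, and the sample records are made up for illustration.

```python
from collections import defaultdict


def normalize_title(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(str(title).lower().split())


def group_by_title(papers):
    """Group paper dicts (shaped like paper_dict above) by normalized title."""
    groups = defaultdict(list)
    for paper in papers:
        groups[normalize_title(paper["details"]["title"])].append(paper)
    return dict(groups)


# Made-up records, only to show the shape of the input
sample = [
    {"source": "springer", "details": {"title": "Deep Learning for  Climate"}},
    {"source": "arxiv", "details": {"title": "deep learning for climate"}},
    {"source": "arxiv", "details": {"title": "Another paper"}},
]

duplicates_sketch = {t: ps for t, ps in group_by_title(sample).items() if len(ps) > 1}
print(list(duplicates_sketch))
```

Whether such normalization is appropriate depends on the corpus: it can also merge distinct papers that happen to share a title, so the manual review step remains useful.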

In [None]:
# duplicates_log = []

# for duplicate in duplicates:
#     duplicates_log.append({"title": duplicate["title"], "sources": duplicate["sources"], "pdfs": duplicate["pdfs"], "jsons": duplicate["jsons"], "comment": ""})
    
# with open("./log/duplicates-log.json", "w") as duplicates_log_file:
#     json.dump(duplicates_log, duplicates_log_file, indent=2)

# No need to run, as this block has run before, and running this will overwrite the log file

In [None]:
# with open("./log/duplicates-log-commented.json", "r") as duplicates_log_file:
#     duplicates_log_commented = json.load(duplicates_log_file)
#     for duplicate in duplicates_log_commented:
#         print(duplicate["title"], "-->", duplicate["comment"])

# No need to run, as this block has run before; it only reads the manually commented log file and prints it

# Additionally deleted the duplicate paper: A_Method_to_Assess_Climate_Change_Induce(1).pdf

In [None]:
gs_papers = glob.glob("./data/google-scholar/papers/json/*.json")
s_papers = glob.glob("./data/springer/papers/json/*.json")
a_papers = glob.glob("./data/arxiv/papers/json/*.json")

print("After handling duplicates, {0}, {1}, and {2} papers from Google Scholar (various publishers), Springer, and arXiv are left, respectively.".format(len(gs_papers), len(s_papers), len(a_papers)))

## Formatting and relocating JSONs

- [Back to top](#Fulltext-analysis)

**Some comments on the quality of the metadata of the papers so far:**

- 'id' values are not unique and must be replaced with unique ones
- 'source' values are incomplete and must be replaced with the unique local file path
- 'published' and 'publication' values are not reliable, but are kept so as not to lose valuable information
- 'affiliation' values are fully contained in the 'affiliations' values
- 'reference' values are not reliable, possibly because of a parsing issue
- 'tags' values were useful for parsing only
- 'authors', 'title', and 'entry' values must be kept as they are
- 'sections' values that share a section title must be merged under that title for each paper
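
The last point can be sketched in isolation. This assumes each entry of 'sections' is a dict with 'name' and 'text' keys, which is how the formatting cell in this section reads them. `merge_sections` is a hypothetical helper name and the sample data is made up.

```python
def merge_sections(sections):
    """Join the text of entries that share a section name, keeping first-seen order."""
    merged = {}
    for section in sections:
        name, text = section["name"], section["text"]
        # Append to the existing text with a space, or start a new entry
        merged[name] = f"{merged[name]} {text}" if name in merged else text
    return merged


# Made-up input: two paragraphs parsed under the same section title
sample_sections = [
    {"name": "Introduction", "text": "First paragraph."},
    {"name": "Introduction", "text": "Second paragraph."},
    {"name": "Methods", "text": "Only paragraph."},
]

print(merge_sections(sample_sections))
```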

In [2]:
!mkdir ./data/papers

In [None]:
sources = ["google-scholar", "springer", "arxiv"]

i = 1

for source in sources:
    source_dir = "./data/{0}/papers/json/".format(source)
    source_papers = glob.glob("{0}*.json".format(source_dir))
    print("Accessing {1} papers in {0}\n".format(source_dir, len(source_papers)))
    
    for source_paper in source_papers:
        print("---> Paper no. {0} <---\n".format(i))
        print("Processing paper from {0}\n".format(source_paper))
        
        with open(source_paper) as source_json:
            paper = json.load(source_json)
            paper_clean = {}
            paper_clean["uuid"] = str(uuid.uuid4())
            paper_clean["source"] = "./data/{0}/papers/{1}".format(source, paper["source"])
            for col in ["entry", "title", "authors", "affiliations", "published", "publication"]:
                paper_clean[col] = paper[col]
            
            paper_sections = {}
            for section in paper["sections"]:
                if section["name"] not in paper_sections.keys():
                    paper_sections[section["name"]] = section["text"]
                else:
                    paper_sections[section["name"]] = " ".join([paper_sections[section["name"]], section["text"]])
            paper_clean["sections"] = paper_sections
            
            paper_fulltext = ""
            for section in paper_sections.keys():
                paper_fulltext = paper_fulltext + " " + str(section)
                paper_fulltext = paper_fulltext + " " + paper_sections[section]
            paper_fulltext = " ".join(paper_fulltext.split()).strip()
            paper_clean["fulltext"] = paper_fulltext
            paper_words = len(paper_fulltext.split())
            paper_clean["words"] = paper_words
            
            target_paper = "./data/papers/{0}.json".format(paper_clean["uuid"])
            with open(target_paper, "w") as target_json:
                json.dump(paper_clean, target_json, indent=2)
                print("Saving paper to {0}\n".format(target_paper))
                i += 1
                
# Log saved to formatting-relocating-log.txt

In [17]:
!mkdir ./data/papers-short
!mkdir ./data/papers-medium
!mkdir ./data/papers-long

In [18]:
papers_path = glob.glob("./data/papers/*.json")

for paper_path in papers_path:
    with open(paper_path) as json_file:
        paper = json.load(json_file)
        if paper["words"] > 15000:
            target_paper = paper_path.replace("/papers/", "/papers-long/")
        elif paper["words"] < 1500:
            target_paper = paper_path.replace("/papers/", "/papers-short/")
        else:
            target_paper = paper_path.replace("/papers/", "/papers-medium/")
        with open(target_paper, "w") as target_json:
            json.dump(paper, target_json, indent=2)            
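
The size split above can be read as a small pure function, which makes the boundary cases explicit. This is a sketch only: the thresholds are the ones used in the cell above, while `size_bucket` and the constant names are hypothetical.

```python
SHORT_MAX_WORDS = 1500   # fewer words than this: "papers-short"
LONG_MIN_WORDS = 15000   # more words than this: "papers-long"


def size_bucket(words):
    """Return the target folder name for a paper with the given word count."""
    if words > LONG_MIN_WORDS:
        return "papers-long"
    if words < SHORT_MAX_WORDS:
        return "papers-short"
    # Both thresholds themselves fall into the medium bucket
    return "papers-medium"


for count in (1499, 1500, 15000, 15001):
    print(count, "->", size_bucket(count))
```

Note that a paper with exactly 1500 or exactly 15000 words lands in the medium folder, because both comparisons are strict.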

In [41]:
papers_path = glob.glob("./data/papers-medium/*.json")
papers_chunks = [papers_path[x:x+34] for x in range(0, len(papers_path), 34)]

for i in range(len(papers_chunks)):
    group = "./data/papers-medium/group{0}".format(i)
    !mkdir {group}
    for paper_path in papers_chunks[i]:
        target_paper = paper_path.replace("/papers-medium/", "/papers-medium/group{0}/".format(i))
        os.rename(paper_path, target_paper)

## Analyzing Springer abstracts with pyResearchInsights

- [Back to top](#Fulltext-analysis)

In [None]:
from pyResearchInsights.Cleaner import cleaner_main

abstracts_log_name = "./LOGS/log/abstracts.txt"
status_logger_name = "test_run"
cleaner_main(abstracts_log_name, status_logger_name)

In [None]:
from pyResearchInsights.Analyzer import analyzer_main

abstracts_log_name = "./LOGS/log/abstracts_CLEANED.txt"
status_logger_name = "test_run"
analyzer_main(abstracts_log_name, status_logger_name)

In [None]:
from pyResearchInsights.NLP_Engine import nlp_engine_main
abstracts_log_name = "./LOGS/log/abstracts_CLEANED.txt"
status_logger_name = "test_run"
nlp_engine_main(abstracts_log_name, status_logger_name)

## Extracting information from fulltexts

- [Back to top](#Fulltext-analysis)

In [42]:
import platform
import os
import openai
import re

import torch
import transformers
from transformers import AutoTokenizer

print('Python: ', platform.python_version())
print('re: ', re.__version__)
print('torch: ', torch.__version__)
print('transformers: ', transformers.__version__)


Python:  3.10.6
re:  2.2.1
torch:  2.0.1+cu117
transformers:  4.29.2


In [64]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [66]:
def get_completion(prompt, model="gpt-3.5-turbo", temperature=0): 
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
    )
    return response.choices[0].message["content"]
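
Long extraction runs make many API calls, and any one of them may fail transiently. The following is a minimal sketch of a retry wrapper with exponential backoff. It is an assumption, not part of the notebook: `with_retries` is a hypothetical helper, and catching every `Exception` is a simplification that a real run would narrow to the client's transient errors.

```python
import time


def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds, waiting base_delay * 2**n between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Out of attempts: let the last error propagate to the caller
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice before succeeding
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# sleep is replaced with a no-op here so the demo finishes immediately
print(with_retries(flaky_call, sleep=lambda seconds: None))
```

In the notebook this could wrap the existing helper as `with_retries(lambda: get_completion(prompt))`.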

In [95]:
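def get_chunks(text, tokens, num_tokens, chunk_size=3000, overlap=100):
    # Split the token list into windows of chunk_size tokens; consecutive windows
    # share `overlap` tokens. `text` is unused and kept so existing calls still work.
    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)
    return chunks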
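
A small worked example makes the windowing in `get_chunks` concrete. With 10 tokens, `chunk_size=5`, and `overlap=2`, the start index advances in steps of 3, so windows begin at 0, 3, 6, and 9, and the last window holds only token 9, which the previous window already covers. The sketch below uses the same stepping but stops once a window reaches the end of the token list. `overlapping_windows` is a hypothetical variant, not the notebook's function.

```python
def overlapping_windows(tokens, chunk_size, overlap):
    """Windows of chunk_size tokens sharing `overlap` tokens, with no redundant tail."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop as soon as this window already covers the final token
        if start + chunk_size >= len(tokens):
            break
    return windows


tokens = list(range(10))
print(overlapping_windows(tokens, chunk_size=5, overlap=2))
# prints [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9]]
```

Skipping the redundant tail saves one model call per paper whenever the final window would contain only already-seen tokens. Whether that matters depends on the chunk sizes in use.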

In [96]:
paper.keys()

dict_keys(['uuid', 'source', 'entry', 'title', 'authors', 'affiliations', 'published', 'publication', 'sections', 'fulltext', 'words'])

In [109]:
!mkdir ./data/papers-medium/group0/processed
papers_path = glob.glob("./data/papers-medium/group0/*.json")
total = len(papers_path)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for j, paper_path in enumerate(papers_path[28:]):
    with open(paper_path) as json_file:
        print(j+1, "/", total)
        print("Started processing file {0}".format(paper_path))
        paper = json.load(json_file)
        paper_tokens = tokenizer.encode(paper["fulltext"])
        paper_num_tokens = len(paper_tokens)
        paper["tokens"] = paper_num_tokens
        paper_chunks = get_chunks(paper["fulltext"], paper_tokens, paper_num_tokens)
        paper_extractions = []
        for i in range(len(paper_chunks)):
            print("Started processing chunk #{0}".format(i))
            paper_extraction = {}
            paper_chunk = tokenizer.decode(paper_chunks[i])
            paper_extraction["chunk"] = paper_chunk
            prompt = f"""If specific AI solutions and their applications to specific climate action tasks are mentioned in the text within three backticks, extract the details in a JSON format with these keys: 'climate-change-problem', 'climate-action-proposed', 'climate-action-type' ('Mitigation' or 'Adaptation', or other applicable types), 'climate-action-sector', 'climate-action-country', 'machine-learning-solution', 'deep-learning-solution', 'other-solution', and 'context'. If not mentioned, extract available information with the same JSON format. Leave 'Unknown' where applicable. Do not repeat yourself. Limit each 'context' to 50 words. Aggregate values with commas where ('climate-action-proposed', 'machine-learning-solution', 'deep-learning-solution') values are the same. Use sentence case for all texts. ```{paper_chunk}```"""
            paper_chunk_extraction = get_completion(prompt)
            paper_extraction["extraction"] = paper_chunk_extraction
            paper_extractions.append(paper_extraction)
        paper["extractions"] = paper_extractions
        paper_clean = {key:paper[key] for key in ['uuid', 'source', 'entry', 'title', 'authors', 'affiliations', 'published', 'publication', 'words', 'tokens', 'extractions']}
        target_paper = paper_path.replace("/group0/", "/group0/processed/")
        with open(target_paper, "w") as target_json:
            print("Saving processed file {0}\n".format(target_paper))
            json.dump(paper_clean, target_json, indent=2)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
mkdir: cannot create directory ‘./data/papers-medium/group0/processed’: File exists


Token indices sequence length is longer than the specified maximum sequence length for this model (11185 > 1024). Running this sequence through the model will result in indexing errors


1 / 34
Started processing file ./data/papers-medium/group0/6b203eaf-3d8c-4db9-b750-c9d49e6a5f9e.json
Started processing chunk #0
Started processing chunk #1
Started processing chunk #2
Started processing chunk #3
Saving processed file ./data/papers-medium/group0/processed/6b203eaf-3d8c-4db9-b750-c9d49e6a5f9e.json

2 / 34
Started processing file ./data/papers-medium/group0/12e4d142-fdb0-499f-9d7e-1bcc27bcb8f3.json
Started processing chunk #0
Started processing chunk #1
Started processing chunk #2
Saving processed file ./data/papers-medium/group0/processed/12e4d142-fdb0-499f-9d7e-1bcc27bcb8f3.json

3 / 34
Started processing file ./data/papers-medium/group0/376b3793-53c5-4d34-91d9-2c712aef4934.json
Started processing chunk #0
Started processing chunk #1
Started processing chunk #2
Started processing chunk #3
Started processing chunk #4
Saving processed file ./data/papers-medium/group0/processed/376b3793-53c5-4d34-91d9-2c712aef4934.json

4 / 34
Started processing file ./data/papers-medium/