# Fulltext analysis

This notebook is a work in progress - to be updated soon...

### To run this notebook, you need to install the following packages:

- ```paperetl``` (Python)
- ```GROBID``` (Java, [installation](https://grobid.readthedocs.io/en/latest/Install-Grobid/))
- ```pandas``` (Python)
- ```pyResearchInsights``` (Python, used only in the last section)

### The notebook is sectioned into the following parts:

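- [Parsing PDFs](#Parsing-PDFs)
- [Handling duplicates](#Handling-duplicates)
- [Formatting and relocating JSONs](#Formatting-and-relocating-JSONs)
- [Analyzing fulltexts](#Analyzing-fulltexts)
- [Finding reviews and summaries](#Finding-reviews-and-summaries)
- [Analyzing Springer abstracts with pyResearchInsights](#Analyzing-Springer-abstracts-with-pyResearchInsights)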

In [None]:
# Setting up GROBID

# paperetl needs a running GROBID instance, and GROBID needs a JVM
# !wget https://github.com/kermitt2/grobid/archive/0.7.2.zip
# !unzip 0.7.2.zip
# Use %cd here: !cd runs in a subshell, so the directory change would not persist
# %cd grobid-0.7.2
# !./gradlew run
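
Before parsing, it can help to confirm that GROBID is actually reachable. The following is a minimal sketch, assuming GROBID runs locally on its default port 8070 and exposes its `/api/isalive` health endpoint; the helper name `grobid_is_alive` is hypothetical and not part of paperetl.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if a GROBID service answers at base_url, else False.

    Assumption: the service exposes GET /api/isalive and replies "true".
    """
    try:
        with urllib.request.urlopen(base_url + "/api/isalive", timeout=timeout) as response:
            return response.status == 200 and response.read().strip() == b"true"
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses
        return False
```

Calling `grobid_is_alive()` before the `paperetl.file` commands gives a quick yes/no answer instead of a failed parsing run.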

In [1]:
import json
import glob
import pandas as pd

## Parsing PDFs

- [Back to top](#Fulltext-analysis)

In [None]:
# Parsing Google Scholar PDFs
# !mkdir ./data/google-scholar/papers/json
# !python -m paperetl.file data/google-scholar/papers json://data/google-scholar/papers/json

# Parsing Springer PDFs
# !mkdir ./data/springer/papers/json
# !python -m paperetl.file data/springer/papers json://data/springer/papers/json

# Parsing arXiv PDFs
# !mkdir ./data/arxiv/papers/json
# !python -m paperetl.file data/arxiv/papers json://data/arxiv/papers/json

In [None]:
gs_papers = glob.glob("./data/google-scholar/papers/json/*.json")
s_papers = glob.glob("./data/springer/papers/json/*.json")
a_papers = glob.glob("./data/arxiv/papers/json/*.json")

print("Parsed {0}, {1}, and {2} papers from Google Scholar (various publishers), Springer, and arXiv, respectively.".format(len(gs_papers), len(s_papers), len(a_papers)))

## Handling duplicates

- [Back to top](#Fulltext-analysis)

In [None]:
all_papers = []
all_papers_map = {}
sources = ["google-scholar", "springer", "arxiv"]

for source in sources:
    source_papers = glob.glob("./data/{0}/papers/json/*.json".format(source))
    for source_paper in source_papers:
        with open(source_paper) as json_file:
            paper = json.load(json_file)
            location_pdf = "-".join(source_paper.split("-")[:-1]).replace("/papers/json/", "/papers/") + ".pdf"
            paper_dict = {"source": source, "location_json": source_paper, "location_pdf": location_pdf, "details": paper}
            all_papers.append(paper_dict)
            # Group papers by exact title; setdefault creates the list on first sight
            all_papers_map.setdefault(paper["title"], []).append(paper_dict)
    print("Added papers from source:", source)
print("Total papers collected:", len(all_papers))
print("Total unique titles collected:", len(all_papers_map.keys()))

In [None]:
duplicates = []

for title in all_papers_map.keys():
    if len(all_papers_map[title]) > 1:
        paper_sources = []
        paper_versions = []
        paper_pdfs = []
        paper_jsons = []
        for paper in all_papers_map[title]:
            paper_sources.append(paper["source"])
            paper_versions.append(paper["details"])
            paper_pdfs.append(paper["location_pdf"])
            paper_jsons.append(paper["location_json"])
        duplicates.append({"title": all_papers_map[title][0]["details"]["title"], "sources": paper_sources, "versions": paper_versions, "pdfs": paper_pdfs, "jsons": paper_jsons})
print("Separated {} duplicate papers for further analysis".format(len(duplicates)))
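
The grouping above only catches duplicates whose titles match exactly. As a hedged sketch of a looser comparison, the hypothetical helpers below normalize case, accents, punctuation, and whitespace before grouping. They assume the same `paper["details"]["title"]` layout used in this notebook, and they are an alternative idea, not the method applied to produce the results here.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace.

    Limitation: non-Latin characters are dropped, so such titles may collapse to "".
    """
    text = unicodedata.normalize("NFKD", title or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def group_by_normalized_title(papers):
    """Map each normalized title to the list of papers sharing it."""
    groups = defaultdict(list)
    for paper in papers:
        groups[normalize_title(paper["details"]["title"])].append(paper)
    return dict(groups)
```

Groups with more than one entry are candidate duplicates and still deserve a manual check, since different papers can share a short title.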

In [None]:
# duplicates_log = []

# for duplicate in duplicates:
#     duplicates_log.append({"title": duplicate["title"], "sources": duplicate["sources"], "pdfs": duplicate["pdfs"], "jsons": duplicate["jsons"], "comment": ""})
    
# with open("./log/duplicates-log.json", "w") as duplicates_log_file:
#     json.dump(duplicates_log, duplicates_log_file, indent=2)

# No need to run, as this block has run before, and running this will overwrite the log file

In [None]:
# with open("./log/duplicates-log-commented.json", "r") as duplicates_log_file:
#     duplicates_log_commented = json.load(duplicates_log_file)
#     for duplicate in duplicates_log_commented:
#         print(duplicate["title"], "-->", duplicate["comment"])

# No need to run again: this block only prints the comments from the manually annotated log

# Additionally deleted the duplicate paper: A_Method_to_Assess_Climate_Change_Induce(1).pdf

In [None]:
gs_papers = glob.glob("./data/google-scholar/papers/json/*.json")
s_papers = glob.glob("./data/springer/papers/json/*.json")
a_papers = glob.glob("./data/arxiv/papers/json/*.json")

print("After handling duplicates, {0}, {1}, and {2} papers are left from Google Scholar (various publishers), Springer, and arXiv, respectively.".format(len(gs_papers), len(s_papers), len(a_papers)))

## Formatting and relocating JSONs

- [Back to top](#Fulltext-analysis)

In [None]:
!mkdir ./data/papers

In [None]:
sources = ["google-scholar", "springer", "arxiv"]

i = 1

for source in sources:
    source_dir = "./data/{0}/papers/json/".format(source)
    source_papers = glob.glob(source_dir + "*.json")
    print("Accessing {1} papers in {0}\n".format(source_dir, len(source_papers)))
    for source_paper in source_papers:
        with open(source_paper) as source_json:
            paper = json.load(source_json)
            print("---> Paper no. {0} <---\n".format(i))
            print("Opening paper from {0}".format(source_paper))
            target_paper = source_paper.replace(source_dir, "./data/papers/")
            with open(target_paper, "w") as target_json:
                json.dump(paper, target_json, indent=2)
                print("Saving paper to {0}\n".format(target_paper))
                i += 1
                
# Log saved to formatting-relocating-log.txt
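
Because all JSON files end up in a single directory, two files with the same name from different sources would silently overwrite each other. The sketch below shows one way to check for that before copying; `find_basename_collisions` is a hypothetical helper, and the example paths are illustrative only.

```python
import os
from collections import defaultdict


def find_basename_collisions(paths):
    """Return {basename: [paths]} for file names that occur more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}
```

Running it on the combined `glob.glob` results of all three sources should return an empty dictionary if the relocation is safe.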

## Analyzing fulltexts

- [Back to top](#Fulltext-analysis)

In [76]:
papers_path = glob.glob("./data/papers/*.json")
papers_dict = []

for paper_path in papers_path:
    with open(paper_path) as json_file:
        paper = json.load(json_file)
        papers_dict.append(paper)

papers_df = pd.DataFrame.from_dict(papers_dict)

In [77]:
papers_df

Unnamed: 0,id,source,published,publication,authors,affiliations,affiliation,title,tags,reference,entry,sections
0,f9fcd1242e40e173a46c682b9fb716b39dbce9f8,a2329.pdf,2020-09-10 00:00:00,,"Callewaert, Chris; Helffer, Katia; Lebaron, Ph...","Center for Microbial Ecology and Technology, G...",Laboratoire de Biodiversité et Biotechnologies...,Skin Microbiome and its Interplay with the Env...,PDF,https://doi.org/0000-0002-8520-9078,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Skin Microbiome an..."
1,975ed21e0012d42cdc8aad57a3e25cba1e50cbb2,CONCEPTS_LOCATIONS_EMOTIONS_SEMANTIC_ANALYSIS_...,,,"Anastopoulou, N; Kavouras, M; Kokla, M; Tomai, E","School of Rural and Surveying Engineering, Nat...","School of Rural and Surveying Engineering, Nat...",CONCEPTS -LOCATIONS -EMOTIONS: SEMANTIC ANALYS...,PDF,https://doi.org/51F530DE80D0ABDECC75D5B148BFFF02,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'CONCEPTS -LOCATION..."
2,bf3638816090229ec3335fdf2945199c0b337ead,Application_of_artificial_intelligence_methods...,,,"Zahraee, S; Khalaji Assadi, M; Saidur, R","Department of Mechanical Engineering, Universi...",Centre of Research Excellence in Renewable Ene...,Application of Artificial Intelligence Methods...,PDF,https://doi.org/F271A76BD802DEBAAA2BF76D0C7409C6,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Application of Art..."
3,8e25e90675bf3556d6febe7a3c2de31d47d32f6c,a2503.pdf,2016-09-10 00:00:00,,"Vatsavai, Raju; Chandola, Varun",North Carolina State University; State Univers...,State University of New York at Buffalo,Guest editorial: big spatial data,PDF,https://doi.org/88F4C5D18ED7C9EF50A86C51AF4037AF,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Guest editorial: b..."
4,db3b4c91f1d103876cabda15a06c54d59b7dea52,a1498.pdf,2022-04-09 00:00:00,,"Pretzsch, Hans; Ahmed, Shamim; Jacobs, Martin;...","Department of Life Science Systems, Chair for ...","Department of Life Science Systems, Chair for ...",Linking crown structure with tree ring pattern...,PDF,https://doi.org/0000-0002-4958-1868,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Linking crown stru..."
...,...,...,...,...,...,...,...,...,...,...,...,...
3329,86c6aebb0a96be0cc6125d6e26afd24cea77825e,a754.pdf,,,"Glasl, Bettina; Bourne, David; Frade, Pedro; T...",Australian Institute of Marine Science; Colleg...,"Australian Centre for Ecogenomics, University ...",Microbial indicators of environmental perturba...,PDF,https://doi.org/0000-0002-6812-868X,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Microbial indicato..."
3330,65fb68c4d82e8ea7f2d5d1be54164df89726ca66,AI4ESP1061_Holm_Jennifer.pdf,,,"Keenan, Trevor",,,Deep learning techniques to disentangle water ...,PDF,https://doi.org/58E8163B06A04662118094937F5AE848,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Deep learning tech..."
3331,0b0a0047b081c287d38da1b2539cca8c426adbad,2012.09508v2.Towards_Optimal_District_Heating_...,,,"Le Coz, Adrien; Nabil, Tahar; Courtot, François",EDF R&D China Center,EDF R&D China Center,Towards Optimal District Heating Temperature C...,PDF,https://doi.org/DF9623284CC718272FF1AB9D87F39418,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Towards Optimal Di..."
3332,3a12a4108fa6de6083b3bba8f057aeebc6ecc7b0,a319.pdf,2022-03-30 00:00:00,,"Hell, Maximilian; Brandmeier, Melanie; Briechl...",DGPF * Maximilian Hell ; 1 Esri Deutschland G...,"3 Faculty of Geoinformatics, HM University of...",Classification of Tree Species and Standing De...,PDF,https://doi.org/0000-0003-3559-5043,2022-12-19 00:00:00,"[{'name': 'TITLE', 'text': 'Classification of ..."


In [9]:
papers_df[papers_df["published"].notnull()]["published"]

0       2020-09-10 00:00:00
3       2016-09-10 00:00:00
4       2022-04-09 00:00:00
5       2021-11-02 00:00:00
6       2019-01-17 00:00:00
               ...         
3323    2022-10-31 00:00:00
3324    2022-08-10 00:00:00
3328    2021-05-03 00:00:00
3333    2022-03-30 00:00:00
3334    2010-11-28 00:00:00
Name: published, Length: 1845, dtype: object
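
The `published` column has dtype `object`, so the dates are plain strings. As a small standard-library sketch, assuming every non-missing value follows the `YYYY-MM-DD HH:MM:SS` pattern seen in the output above, the hypothetical helper below counts papers per publication year.

```python
from collections import Counter
from datetime import datetime


def publication_years(published_values):
    """Count papers per publication year, skipping missing or unparseable dates."""
    years = Counter()
    for value in published_values:
        if not isinstance(value, str):
            continue  # None or NaN from missing metadata
        try:
            years[datetime.strptime(value, "%Y-%m-%d %H:%M:%S").year] += 1
        except ValueError:
            continue  # string that does not match the assumed format
    return years
```

It can be applied directly to the column, for example `publication_years(papers_df["published"])`.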

In [82]:
i = 0

with open("data/text/papers.txt", "w") as a:

    for paper in papers_dict:
        paper_dict = {}
        paper_txt = []

        for section in paper["sections"]:
            if section["name"] not in paper_dict.keys():
                paper_dict[section["name"]] = section["text"]
            else:
                paper_dict[section["name"]] = paper_dict[section["name"]] + " " + section["text"]

        for key in paper_dict.keys():
            paper_txt.append(key)
            paper_txt.append(" ".join(paper_dict[key].split()))

        with open("data/text/paper{0}.txt".format(i), "w") as p:
            for txt in paper_txt:
                if txt is not None:
                    p.write(txt)
                    p.write("\n")
                    a.write(txt)
                    a.write("\n")

        i += 1

## Finding reviews and summaries

- [Back to top](#Fulltext-analysis)

In [None]:
all_papers = []
sources = ["google-scholar", "springer", "arxiv"]

for source in sources:
    source_papers = glob.glob("./data/{0}/papers/json/*.json".format(source))
    for source_paper in source_papers:
        with open(source_paper) as json_file:
            paper = json.load(json_file)
            location_pdf = "-".join(source_paper.split("-")[:-1]).replace("/papers/json/", "/papers/") + ".pdf"
            paper_dict = {"source": source, "location_json": source_paper, "location_pdf": location_pdf, "details": paper}
            all_papers.append(paper_dict)
    print("Added papers from source:", source)
print("Total papers collected:", len(all_papers))

In [None]:
for paper in all_papers:
    title = (paper["details"]["title"] or "").lower()  # case-insensitive match; tolerate missing titles
    if any(keyword in title for keyword in ("review", "summary", "survey")):
#         print(paper["details"]["title"])
        file_url = "file:///home/gereltuya/Downloads/spbu/ai-for-climate-action" + paper["location_pdf"][1:]
        print(file_url, "\n")
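
Plain substring matching on titles also accepts words such as "preview". A possible refinement, shown here only as a sketch with a hypothetical helper name, is a case-insensitive regular expression with word boundaries.

```python
import re

# Whole-word match for the keywords used in this section, including simple plurals
REVIEW_PATTERN = re.compile(r"\b(?:reviews?|summary|summaries|surveys?)\b", re.IGNORECASE)


def looks_like_review(title):
    """Return True if the title contains one of the keywords as a whole word."""
    return bool(title) and REVIEW_PATTERN.search(title) is not None
```

The loop above could then filter with `looks_like_review(paper["details"]["title"])`.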

In [None]:
all_papers[0].keys()

In [None]:
all_papers[0]["details"].keys()

## Analyzing Springer abstracts with pyResearchInsights

- [Back to top](#Fulltext-analysis)

In [None]:
from pyResearchInsights.Cleaner import cleaner_main

abstracts_log_name = "./LOGS/log/abstracts.txt"
status_logger_name = "test_run"
cleaner_main(abstracts_log_name, status_logger_name)

In [None]:
from pyResearchInsights.Analyzer import analyzer_main

abstracts_log_name = "./LOGS/log/abstracts_CLEANED.txt"
status_logger_name = "test_run"
analyzer_main(abstracts_log_name, status_logger_name)

In [None]:
from pyResearchInsights.NLP_Engine import nlp_engine_main

abstracts_log_name = "./LOGS/log/abstracts_CLEANED.txt"
status_logger_name = "test_run"
nlp_engine_main(abstracts_log_name, status_logger_name)