In [9]:
import json
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [3]:
path = '/content/drive/Shared drives/NLP/Project NLP/Final/'

# Open Document Extraction

Here we we extract the files with pmcid which are the ones with the full text containied within the LitCovid database found [here](https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/).

In [4]:
def get_data_pmc(load=True):
    if load:    
        with open(path + 'pmc_data.json', 'r') as f:
            data_pmc = json.load(f)
    else: 
        filename = path + "litcovid2pubtator.json"
        with open(filename) as json_file:
            data = json.load(json_file)

        total = len(data[1])
        data_pmc = []

        for i, doc in enumerate(data[1]):
            if doc['pmcid']:
                data_pmc.append(doc)
            print("{}/{}".format(i, total))
        with open(path + 'pmc_data.json', 'w') as f:
            json.dump(data_pmc, f)
    
    return data_pmc

In [7]:
data_pmc = get_data_pmc()
print("Articles in the public set:", len(data_pmc))

Articles in the public set: 12457


Here we have all the passage types on our articles:

In [23]:
# Get all passage types
passage_types, type_counts = np.unique([p['infons']['type'] for doc in data_pmc for p in doc['passages']], return_counts=True)
print("# of Types:",len(passage_types))
print("   COUNT - TYPE\n----------------")
for i, j in zip(passage_types, type_counts):
    print("{:8} - {}".format(j,i))

# of Types: 27
   COUNT - TYPE
----------------
   17649 - abstract
   10042 - abstract_title_1
    3925 - fig
   13536 - fig_caption
     918 - fig_title_caption
   13564 - footnote
      50 - footnote_title
   12458 - front
  207378 - paragraph
  202816 - ref
   10737 - table
    7460 - table_caption
    1786 - table_foot
      32 - table_foot_title
    6520 - table_footnote
     573 - table_title_caption
   16713 - title
   34185 - title_1
     161 - title_1_caption
   20788 - title_2
      77 - title_2_caption
    3805 - title_3
       6 - title_3_caption
     419 - title_4
       1 - title_4_caption
      14 - title_5
     123 - title_caption


With the passage types above we now know which ones to filter out like tables, figures, etc.

In [113]:
whitelist = ['paragraph', 'title', 'title_1', 'title_1_caption', 'title_2',
       'title_2_caption', 'title_3', 'title_3_caption', 'title_4',
       'title_4_caption', 'title_5', 'title_caption']
def get_text(doc):
    text, title, abstract = "", "", ""
    ids = (doc["pmid"], doc["pmcid"])
    for passage in doc['passages']:
        if passage['infons']['type'] in "abstract_title_1":
            abstract += passage["text"] + "\n"
        elif passage['infons']['type'] == "front":
            title += passage["text"]
        elif passage['infons']['type'] in whitelist:
            text += passage["text"] + "\n"
    return title, abstract, text, ids

In [114]:
# An example of how our extractor works
title, abstract, text, ids = get_text(data_pmc[1554])
print(title)
print(text)

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19
Coronaviruses are a diverse group of single-stranded positive-sense RNA viruses with a wide range of vertebrate hosts. Four common coronavirus genera (alpha, beta, gamma, and delta) circulate among vertebrates and cause mild upper respiratory tract illnesses in humans and gastroenteritis in animals. However, in the past two decades, three highly pathogenic human betacoronaviruses have emerged from zoonotic events. In 2002-2003, severe acute respiratory syndrome-related coronavirus 1 (SARS-CoV-1) infected ~8,000 people worldwide with a case fatality rate of ~10%, followed by Middle East respiratory syndrome-related coronavirus (MERS-CoV), which has infected ~2,500 people with a case fatality rate of ~36% since 2012. At present, the world is suffering from a pandemic of SARS-CoV-2, which causes coronavirus disease 2019 (COVID-19) and has a global mortality rate that remains to be determined. SARS-CoV-2 infection is cha

Here we create a dataset with the text. This is used later to summarize

In [89]:
import pandas as pd

In [115]:
def get_dataframe(filename = None, obj = None):
    '''
    filename: file to load/save the data
    obj: array of objects from where we extract the data. If None we load csv from filename
    '''
    if obj == None:
        df = pd.read_csv(path + filename)
        return df
    else:
        data = {'pmcid': [], 'pmid': [], 'title': [], 'abstract': [], 'text': []}
        for doc in obj:
            title, abstract, text, ids = get_text(doc)
            data['pmid'].append(ids[0])
            data['pmcid'].append(ids[1])
            data['title'].append(title)
            data['abstract'].append(abstract)
            data['text'].append(text)
        df = pd.DataFrame(data)
        df.to_csv(path + filename, index_label=False)
        return df
df = get_dataframe(filename = 'preprocessed.csv', obj=data_pmc)

In [96]:
df.head()

Unnamed: 0,pmcid,pmid,title,abstract,text
0,PMC7118592,32292259,Understanding of Guidance for acupuncture and ...,"At present, the situation of global fight agai...",Novel coronavirus pneumonia was renamed by Wor...
1,PMC7142670,32340751,Manejo de la epidemia por coronavirus SARS-CoV...,RESUMEN\n\nLa epidemia de SARS-CoV-2 represent...,
2,PMC7272901,32396670,Symptom Criteria for COVID-19 Testing of Heath...,Abstract\n\nBackground\n\nLimitations on testi...,
3,PMC7250076,32470603,Early network properties of the COVID-19 pande...,Highlights\n\nClassic epidemiological control ...,The challenges associated with the COVID-19 pa...
4,PMC7146697,32234312,COVID-19 in endoscopy: Time to do more?,Disclosure\n\nReferences\n\n,To the Editor:\n\nWe have read with great inte...
