# COVID-19 Raw Data Processing

This notebook processes one directory of input files from the [available data sets](https://pages.semanticscholar.org/coronavirus-research).

First, we set up the data directories and experiment information.

In [54]:
input_dir = '/Users/weale/data/covid/raw/'
tmp_dir = '/Users/weale/data/covid/tmp/'
output_dir = '/Users/weale/data/covid/out/'

experiment = 'comm_use_subset/pdf_json/'

import os

input_path = input_dir + experiment
tmp_path = tmp_dir + experiment
output_path = output_dir + experiment

# Create the working directories; exist_ok makes this safe to re-run
os.makedirs(tmp_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)

## Create file with extracted information

Take the original JSON objects from the given directory and create a single tab-separated output file that contains (per line):
1. Paper ID
2. Title
3. Abstract (first paragraph only)

It also creates a second file that only contains the number of documents in the directory. This will be used to help with array allocation and creation in future steps.

This gives a compact representation of the content without requiring the full-text documents. It helps with rapid prototyping and should run on a laptop without much strain.
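
The per-paper extraction described above can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: the function name `extract_record` and the `sample` dictionary are hypothetical, and only the keys `paper_id`, `metadata`, `title`, `abstract`, and `text` come from the cells in this notebook. It shows two edge cases worth handling: a paper with no abstract, and a tab inside a field that would otherwise split the TSV line.

```python
def extract_record(parsed_data):
    """Return (paper_id, title, abstract) as TSV-safe strings (hypothetical helper)."""
    def clean(text):
        # Replace tabs and newlines so the record stays on a single TSV line
        return (text or "").replace("\t", " ").replace("\n", " ")

    paper_id = parsed_data.get("paper_id", "")
    title = parsed_data.get("metadata", {}).get("title", "")
    abstract_blocks = parsed_data.get("abstract", [])
    # Only the first abstract paragraph is used; missing abstracts become ""
    abstract = abstract_blocks[0].get("text", "") if abstract_blocks else ""
    return clean(paper_id), clean(title), clean(abstract)


# Hypothetical input: a title with an embedded tab and an empty abstract
sample = {"paper_id": "abc123", "metadata": {"title": "A\ttitle"}, "abstract": []}
record = extract_record(sample)
row = "\t".join(record)
```

Because every field is cleaned before joining, each output line always contains exactly two tab separators, which is what the later `split('\t')` calls rely on.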

In [55]:
import json
import os

fOUT = open(tmp_path + "20200420_title_abstract_text.tsv", "w")
fCOUNT = open(tmp_path + "20200420_title_abstract_count.txt", "w")

In [56]:
file_len = 0

for filename in os.listdir(input_path):
    with open(os.path.join(input_path, filename)) as f:
        # Load data from the file (the with block closes it automatically)
        parsed_data = json.load(f)

    # Get the ID of the paper
    pID = parsed_data['paper_id']

    # Get the paper title (empty string if missing, so no stale value is reused)
    title = parsed_data.get('metadata', {}).get('title') or ''

    # Get the first paragraph of the abstract (empty string if missing)
    abstract = ''
    tmp = parsed_data.get('abstract', [])
    if len(tmp) > 0:
        abstract = tmp[0].get('text') or ''

    # Replace tabs and newlines so each paper stays on one TSV line
    title = title.replace('\t', ' ').replace('\n', ' ')
    abstract = abstract.replace('\t', ' ').replace('\n', ' ')

    # Combine into the output and print to the file
    line = pID + '\t' + title + '\t' + abstract + '\n'
    fOUT.write(line)

    # Increment file length
    file_len = file_len + 1

fOUT.close()

# Print the number of elements to another file
fCOUNT.write(str(file_len))
fCOUNT.close()

## Create Representation Vectors

We will now take the title and abstract information and extract representation vectors using the [spaCy](https://spacy.io/) natural language processing library and a [scispaCy model](https://github.com/allenai/scispacy).

We create an *i* x *j* array for each document, where *i* is the number of tokens in the title/abstract and *j* is the number of elements in the representation vector. For the medium-sized model (`en_core_sci_md`), *j=200*. We process the titles and abstracts separately.

The two resulting arrays are stored as numpy arrays of dimension (*n* x *i* x *j*), where *n* is the number of documents and *i* is now the token count of the longest title (or abstract) in the set. Shorter documents are padded with zero vectors. No further processing is done at this time.
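
To make the (*n* x *i* x *j*) layout concrete, here is a toy sketch of the zero-padding scheme. The helper name `pad_token_vectors` and the 4-element vectors are assumptions for illustration; the real cells use spaCy token vectors with *j=200*.

```python
import numpy as np


def pad_token_vectors(docs, vector_dim):
    """Stack per-document token vectors into one zero-padded 3-D array (hypothetical helper)."""
    max_tokens = max(len(doc) for doc in docs)
    out = np.zeros((len(docs), max_tokens, vector_dim))
    for n, doc in enumerate(docs):
        for i, vec in enumerate(doc):
            out[n, i, :] = vec
    return out


# Two toy "documents": the first has two tokens, the second only one
toy_docs = [
    [np.ones(4), 2 * np.ones(4)],
    [3 * np.ones(4)],
]
padded = pad_token_vectors(toy_docs, vector_dim=4)
print(padded.shape)
```

The second document's unused token slot stays all zeros, so downstream code needs some way to tell padding from real tokens (for example, by keeping the token counts alongside the arrays).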

In [57]:
import scispacy
import spacy

# Using MEDIUM size model from scispacy 
nlp = spacy.load("en_core_sci_md")

In [75]:
dim_title = 0
dim_abstract = 0

with open(tmp_path + "20200420_title_abstract_text.tsv", "r") as fIN:
    for line in fIN:
        elements = line.split('\t')

        # Use spaCy to extract tokens
        len_title = len(nlp(elements[1]))
        len_abstract = len(nlp(elements[2]))

        # Find the length of the longest set of title and abstract tokens
        dim_title = max(dim_title, len_title)
        dim_abstract = max(dim_abstract, len_abstract)

# Validation
print(dim_title)
print(dim_abstract)

275
1366


In [76]:
# Load the number of documents

with open(tmp_path + "20200420_title_abstract_count.txt", "r") as fIN:
    numlines = int(fIN.readline())

# Validation
print(numlines)

9557


In [77]:
#Create the arrays for the representation of the titles and abstracts

import numpy as np

titleArr = np.zeros((numlines, dim_title, 200))
abstractArr = np.zeros((numlines, dim_abstract, 200))
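
Before allocating, it is worth estimating how large these arrays are. The sketch below uses the dimensions printed above (9557 documents, 275 and 1366 tokens, 200-element vectors) and assumes numpy's default `float64` dtype at 8 bytes per value. The helper name `array_bytes` is hypothetical.

```python
def array_bytes(n_docs, max_tokens, vector_dim, bytes_per_value=8):
    """Size in bytes of a dense (n_docs x max_tokens x vector_dim) array."""
    return n_docs * max_tokens * vector_dim * bytes_per_value


title_bytes = array_bytes(9557, 275, 200)
abstract_bytes = array_bytes(9557, 1366, 200)

print(f"titles:    {title_bytes / 1e9:.1f} GB")
print(f"abstracts: {abstract_bytes / 1e9:.1f} GB")

# Assumption: storing float32 instead (4 bytes per value) would halve both figures
abstract_bytes_f32 = array_bytes(9557, 1366, 200, bytes_per_value=4)
```

Under these assumptions the abstract array alone is roughly 20.9 GB and the title array roughly 4.2 GB. Passing `dtype=np.float32` to `np.zeros` is one possible way to reduce this, though that would change the stored precision.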

In [80]:
fIN = open(tmp_path + "20200420_title_abstract_text.tsv", "r")

i=0
lines = fIN.readlines()
for line in lines:
    elements = line.split('\t')

    processed = nlp(elements[1])
    j=0
    for token in processed:
        titleArr[i,j,:] = token.vector
        j+=1
        
    processed = nlp(elements[2])    
    j=0
    for token in processed:
        abstractArr[i,j,:] = token.vector
        j+=1
        
    i+=1
fIN.close()
print(i)

9557


In [82]:
# Write the arrays to numpy binary files for future processing

from numpy import save

save(output_path + "20200420_title_vectors.npy", titleArr)
save(output_path + "20200420_abstract_vectors.npy", abstractArr)
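
For the future processing mentioned above, the saved files can be read back with `np.load`. The following is a minimal sketch on a small stand-in array written to a temporary directory; the file name `demo_vectors.npy` is hypothetical. Using `mmap_mode="r"` is a suggestion rather than something the notebook does: it maps the file instead of reading it all into memory, which may help with arrays of this size.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real (n x i x j) arrays
demo = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_vectors.npy")
    np.save(path, demo)

    # Memory-map the file so only the slices that are accessed get read
    loaded = np.load(path, mmap_mode="r")
    shape = loaded.shape

    # Copy one document's vectors into an ordinary in-memory array
    first_doc = np.array(loaded[0])

    # Release the memory map before the temporary directory is removed
    del loaded
```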