Intro notebook on working with JSON data from the CORD-19 dataset. It reviews some common data access patterns and transformations for getting JSON data into a format where it can be analyzed with pandas and NLP tools.

* Load JSON for one of the records into a variable
* Access various elements in the JSON as a Python dictionary
* Use glob to access all files in the pdf_json directory
* Use a loop to access elements of the JSON (body_text and paper_id)
* Convert to a pandas dataframe
* Access Google Cloud Platform NLP APIs to assign sentiment and categories to each record in the directory


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import json

Open one of the files in the pdf_json directory and load the JSON into a dictionary object

In [None]:
with open('/kaggle/input/CORD-19-research-challenge/document_parses/pdf_json/6f192000b0e87fe2c55632441655de31c61596c4.json') as json_file:
    data = json.load(json_file)

In [None]:
data

Take a look at the top level keys of the dictionary

In [None]:
data.keys()

In [None]:
#Access the text for some of the elements of the document, abstract and body_text

Some of these dictionary elements map to lists (for example, both abstract and body_text are split into a list of text fields)

In [None]:
data['abstract']

In [None]:
data['abstract'][0]['text']

In [None]:
data['body_text']

In [None]:
len(data['body_text'])

Access each field for body text in a loop

In [None]:
for bt in data['body_text']:
    print(bt['text'])
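
The same loop can collect the paragraphs instead of printing them. Below is a minimal sketch that groups the body text by section. It assumes each body_text entry carries a `section` key alongside `text`; that is an assumption about the CORD-19 schema, so the key is read with `.get`. `sample_doc` and `text_by_section` are hypothetical names, with `sample_doc` standing in for `data` so the cell runs without the dataset.

```python
from collections import defaultdict

# Hypothetical stand-in for `data`, shaped like one CORD-19 JSON record.
sample_doc = {
    "paper_id": "abc123",
    "body_text": [
        {"text": "First paragraph.", "section": "Introduction"},
        {"text": "Second paragraph.", "section": "Introduction"},
        {"text": "Third paragraph.", "section": "Methods"},
    ],
}


def text_by_section(doc):
    """Join body_text paragraphs per section, keeping first-seen section order."""
    grouped = defaultdict(list)
    for entry in doc.get("body_text", []):
        # `section` is assumed; fall back to a placeholder if it is absent.
        grouped[entry.get("section", "unknown")].append(entry["text"])
    return {name: " ".join(parts) for name, parts in grouped.items()}


sections = text_by_section(sample_doc)
for name, text in sections.items():
    print(name, "->", text)
```

Calling `text_by_section(data)` on the record loaded earlier should work the same way, provided the assumed keys are present.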

So far we've accessed the data for one document - next, let's write a loop to access all the documents in a directory. 

In [None]:
import glob

In [None]:
json_files = glob.glob('/kaggle/input/CORD-19-research-challenge/document_parses/pdf_json/*.json')

In [None]:
json_files

In [None]:
len(json_files)
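
glob.glob returns paths in arbitrary order, so any later slice of the list may pick different files on different runs or machines. Below is a small sketch of making the order reproducible by sorting. The temporary directory and file names are hypothetical stand-ins for the pdf_json directory.

```python
import glob
import os
import tempfile

# Hypothetical directory standing in for .../document_parses/pdf_json/
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("b.json", "a.json", "notes.txt"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("{}")

    # sorted() gives a reproducible order; the pattern skips non-JSON files.
    sorted_files = sorted(glob.glob(os.path.join(tmp_dir, "*.json")))
    file_names = [os.path.basename(p) for p in sorted_files]

print(file_names)
```

In this notebook the equivalent would be wrapping the existing glob.glob call in sorted().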

I sometimes find it easier to work with tabular data in pandas rather than parsing JSON documents. One way to do this is to parse the documents in a loop, store the results to lists, and build a pandas dataframe from the lists. 

Here, we'll build a pandas dataframe with the paper id and the full body text. 

In [None]:
paper_texts = []
paper_ids = []

# Parse only the first 100 files to keep the example fast
for jf in json_files[:100]:
    with open(jf) as json_file:
        data = json.load(json_file)

    paper_ids.append(data['paper_id'])

    # Join all body_text paragraphs into one string per paper
    paper_text = [b['text'] for b in data['body_text']]
    paper_texts.append(" ".join(paper_text))

In [None]:
print(len(paper_texts))
print(len(paper_ids))

In [None]:
#import pandas as pd

In [None]:
df = pd.DataFrame({"paper_id": paper_ids, "text": paper_texts})

In [None]:
df.head(10)

In [None]:
df.iloc[2]['text']
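
An alternative to keeping two parallel lists is to build one dictionary per paper and hand the list of dictionaries to pandas. Below is a sketch that assumes a file may lack body_text, in which case the text becomes an empty string. `parse_paper` is a hypothetical helper, and the temporary files are made-up records standing in for the dataset.

```python
import json
import os
import tempfile

import pandas as pd


def parse_paper(path):
    """Return one row (paper_id, text) for a CORD-19 style JSON file."""
    with open(path) as json_file:
        doc = json.load(json_file)
    paragraphs = [entry["text"] for entry in doc.get("body_text", [])]
    return {"paper_id": doc.get("paper_id"), "text": " ".join(paragraphs)}


# Hypothetical sample records written to a temporary directory.
samples = [
    {"paper_id": "p1", "body_text": [{"text": "Alpha."}, {"text": "Beta."}]},
    {"paper_id": "p2", "body_text": []},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for doc in samples:
        path = os.path.join(tmp_dir, doc["paper_id"] + ".json")
        with open(path, "w") as f:
            json.dump(doc, f)
        paths.append(path)
    rows = [parse_paper(p) for p in paths]

# One dictionary per paper becomes one row per paper.
papers_df = pd.DataFrame(rows, columns=["paper_id", "text"])
print(papers_df)
```

Keeping each paper's fields together in one dictionary means the id and text cannot drift out of step if a file is skipped.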

For a bit of analysis, we'll use the Google Cloud Natural Language API to assign categories and a sentiment score to a paper, using the third row of the dataframe as an example.

In [None]:
from googleapiclient.discovery import build

In [None]:
import getpass
APIKEY = getpass.getpass()

In [None]:
lservice = build('language', 'v1', developerKey=APIKEY)

In [None]:
categories_response = lservice.documents().classifyText(
    body={
        'document': {
            'type': 'PLAIN_TEXT',
            'content': df.iloc[2]['text']
        }
    }).execute()

In [None]:
categories_response

In [None]:
sentiment_response = lservice.documents().analyzeSentiment(
    body={
        'document': {
            'type': 'PLAIN_TEXT',
            'content': df.iloc[2]['text']
        }
    }).execute()

In [None]:
sentiment_response
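
The cells above annotate a single paper. To cover every row of df, the two calls can be wrapped in a function and applied per paper. This is a sketch only: `annotate_text` is a hypothetical helper, and the response keys it reads (categories, name, confidence, documentSentiment, score, magnitude) are assumptions about the shape of the API responses. Check them against the contents of categories_response and sentiment_response before relying on them.

```python
def annotate_text(service, text):
    """Return the top category and document sentiment for one text.

    `service` is assumed to behave like the `lservice` object built above.
    """
    document = {"type": "PLAIN_TEXT", "content": text}
    categories = service.documents().classifyText(
        body={"document": document}).execute()
    sentiment = service.documents().analyzeSentiment(
        body={"document": document}).execute()

    # The field names below are assumptions about the response layout.
    top = max(categories.get("categories", []),
              key=lambda c: c.get("confidence", 0.0),
              default={})
    doc_sentiment = sentiment.get("documentSentiment", {})
    return {
        "category": top.get("name"),
        "category_confidence": top.get("confidence"),
        "sentiment_score": doc_sentiment.get("score"),
        "sentiment_magnitude": doc_sentiment.get("magnitude"),
    }


# Usage (needs a valid API key and network access, so left commented out):
# annotations = pd.DataFrame([annotate_text(lservice, t) for t in df["text"]])
# df = pd.concat([df, annotations], axis=1)
```

Each paper triggers two network requests, so it may be worth trying this on a small slice of df first.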