In [17]:
import json
import re

### Read data

Load the file directly from the filesystem using the JSON library.

In [18]:
# Use a context manager so the file is closed after loading
with open("__data_export.json") as f:
    data = json.load(f)

In [19]:
print(len(data))

4644


In [21]:
#print(data.keys())

In [23]:
#print(data['NCT01266616'])

In [24]:
print(data['NCT01266616']['keywords'])

['HIV Infections', 'Interleukin-12', 'Electroporation', 'HIV Therapeutic Vaccine', 'Vaccines', 'Profectus HIV MAG pDNA vaccine', 'IL-12']


### Keywords

Extract all unique keywords.
Note that we clean the keywords in an attempt to remove duplicates and avoid unusual characters.
This may affect the outcome, and it creates a mapping issue if we want to get back to the original names!

In [7]:
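keywords = set()
for d in data:
    for k in data[d]['keywords']:
        # Replace every character outside a-z, A-Z, 0-9 and "/" with a space.
        # Note: "/s" in the class is a literal "/" and "s", not whitespace ("\s").
        word = re.sub("[^a-zA-Z0-9/s]", " ", k).strip().lower()
        # Join the words with underscores so each keyword is a single token
        word = word.replace(" ", "_")
        keywords.add(word)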

In [8]:
len(keywords)

17233
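
One way to soften the mapping issue is to record, for each cleaned keyword, the original spellings that collapsed into it. The following is a minimal sketch, not part of the original notebook: `toy_data` is a small hypothetical stand-in for `data`, and `clean_keyword` and `originals` are illustrative names. It uses the same regular expression as the notebook and replaces spaces with underscores.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the real `data` dict (record id -> fields)
toy_data = {
    "REC1": {"keywords": ["HIV Infections", "IL-12"]},
    "REC2": {"keywords": ["hiv infections", "IL 12", "Vaccines"]},
}

def clean_keyword(k):
    """Clean one keyword with the notebook's pattern, then join with underscores."""
    word = re.sub("[^a-zA-Z0-9/s]", " ", k).strip().lower()
    return word.replace(" ", "_")

# cleaned keyword -> set of original spellings that map to it
originals = defaultdict(set)
for rec in toy_data.values():
    for k in rec["keywords"]:
        originals[clean_keyword(k)].add(k)

for cleaned in sorted(originals):
    print(cleaned, sorted(originals[cleaned]))
```

With a lookup like this, any cleaned keyword can be traced back to the raw names it came from, at the cost of keeping one extra dictionary in memory.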

In [26]:
#keywords

### How to create a bag of words matching records to keywords?

1) Extract the text from a record

2) Extract the keywords for a particular record

3) Associate the text with the presence or absence of a keyword
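
The three steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `toy_records`, `text_tokens`, `vocabulary` and `rows` are hypothetical names, the records are made up, and the labels are a plain 0/1 list per record rather than any particular library's format.

```python
import re

def text_tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text).lower().split()

# Hypothetical toy records: id -> (free text, cleaned keywords)
toy_records = {
    "REC1": ("Therapeutic vaccine given with IL-12", ["vaccines", "il_12"]),
    "REC2": ("Electroporation improves vaccine delivery", ["vaccines", "electroporation"]),
}

# Keyword vocabulary in a fixed order, so every record gets the same columns
vocabulary = sorted({k for _, kws in toy_records.values() for k in kws})

rows = {}
for rec_id, (text, kws) in toy_records.items():
    tokens = text_tokens(text)                                # step 1: text
    present = set(kws)                                        # step 2: keywords
    labels = [1 if k in present else 0 for k in vocabulary]   # step 3: presence/absence
    rows[rec_id] = (tokens, labels)

print(vocabulary)
for rec_id, (tokens, labels) in rows.items():
    print(rec_id, tokens, labels)
```

Each record ends up as a pair: the tokens of its text and a binary vector saying which keywords from the vocabulary it carries.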

In [27]:
keys = data['NCT01778439'].keys()

In [11]:
# get the unique keys for a single record
#print(keys)

In [28]:
# get all the data fields
record = data['NCT01266616']

for d in record:
    print(d, record[d])

completion_date 2013-04-30
conditions [{'clusters': ['Infectious Diseases'], 'name': 'HIV/AIDS', 'raw': ['HIV Infections'], 'slug': 'hiv-aids'}]
description_html <p>Although highly active antiretroviral therapy (HAART) has greatly reduced HIV infection-related morbidity and mortality, individual response to therapy can be variable. Therapeutic vaccination works by augmenting virus-specific immunity and can be given with or without immunomodulatory agents or adjuvants. In conjunction with HAART, therapeutic vaccination may be a more effective treatment for the suppression of HIV-1 replication. This study will examine the safety and efficacy of giving an investigational vaccine with or without IL-12 in HIV-1 infected adults receiving HAART. This study will also test whether delivering the vaccine using EP is safe and increases the efficacy of the vaccine.</p>

<p>Participation in this study will last approximately 36 weeks. Participants will be randomly assigned to one of five cohorts. C

### Extract text 

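Skip list-valued and empty fields, strip HTML tags, and replace all non-alphanumeric characters with spaces. Every remaining field is converted to a string, so values such as the completion date also end up in the text.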

In [30]:
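data_text = ""
for d in record:
    # Skip list-valued fields (e.g. conditions, keywords) and missing values
    if isinstance(record[d], list) or record[d] is None:
        continue
    data_text += str(record[d]) + ' '
# Remove HTML tags such as <p> and </p>
data_text = re.sub(r'<[^<]+?>', '', data_text)
# Replace everything that is not a letter or digit with a space
data_text = re.sub(r"[^a-zA-Z0-9]", " ", data_text)
# Collapse runs of whitespace into single spaces
data_text = ' '.join(data_text.split())

print(data_text)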

2013 04 30 Although highly active antiretroviral therapy HAART has greatly reduced HIV infection related morbidity and mortality individual response to therapy can be variable Therapeutic vaccination works by augmenting virus specific immunity and can be given with or without immunomodulatory agents or adjuvants In conjunction with HAART therapeutic vaccination may be a more effective treatment for the suppression of HIV 1 replication This study will examine the safety and efficacy of giving an investigational vaccine with or without IL 12 in HIV 1 infected adults receiving HAART This study will also test whether delivering the vaccine using EP is safe and increases the efficacy of the vaccine Participation in this study will last approximately 36 weeks Participants will be randomly assigned to one of five cohorts Cohort 1 will receive the HIV multi antigen plasmid DNA HIV MAG pDNA vaccine or placebo intramuscularly IM in the upper arm followed by EP Cohorts 2 through 4 will receive th
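
To apply the same extraction to every record, the steps can be wrapped in a function. The sketch below is an assumption about how that might look, not code from the notebook: `record_to_text` is an illustrative name, and `toy_record` (including its `sponsor` field) is made up to mimic the kinds of values seen above.

```python
import re

def record_to_text(record):
    """Join the non-list, non-None fields of a record into cleaned text."""
    parts = [str(v) for v in record.values()
             if not isinstance(v, list) and v is not None]
    text = " ".join(parts)
    text = re.sub(r"<[^<]+?>", "", text)        # drop HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)   # keep letters and digits only
    return " ".join(text.split())               # collapse whitespace

# Hypothetical record with the same kinds of fields as the real data
toy_record = {
    "completion_date": "2013-04-30",
    "keywords": ["IL-12"],
    "description_html": "<p>HIV-1 infected adults.</p>",
    "sponsor": None,
}

print(record_to_text(toy_record))
```

On the real data this could be called as `record_to_text(data[d])` for each record id `d`.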

### Extract keywords

Replace non-alphanumeric characters in each keyword with spaces, lowercase it, and replace the remaining spaces with underscores.
This will help reduce the number of duplicates
(though it will also create a potential mapping problem if we want to get back to the original names!)

In [29]:
record_keywords = []
for k in record['keywords']:
    word = re.sub("[^a-zA-Z0-9/s]"," ", k).strip().lower()
    word = word.replace(" ", "_")
    record_keywords.append(word)
print(record_keywords)

['hiv_infections', 'interleukin_12', 'electroporation', 'hiv_therapeutic_vaccine', 'vaccines', 'profectus_hiv_mag_pdna_vaccine', 'il_12']
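
A quick check of what the cleaning pattern does may be useful here. Inside a character class, `/s` is a literal `/` followed by a literal `s`, whereas `\s` denotes whitespace, so the pattern as written keeps slashes. Whether `\s` was intended is an assumption; the sketch below only shows how the two patterns differ on a made-up keyword (`clean` and `sample` are illustrative names).

```python
import re

def clean(k, pattern):
    """Apply a cleaning pattern, then lowercase and join with underscores."""
    word = re.sub(pattern, " ", k).strip().lower()
    return word.replace(" ", "_")

sample = "HIV/AIDS Vaccine"   # hypothetical keyword containing a slash

as_written = clean(sample, "[^a-zA-Z0-9/s]")    # pattern as written in the notebook
whitespace = clean(sample, r"[^a-zA-Z0-9\s]")   # variant with a whitespace class

print(as_written)
print(whitespace)
```

The first result keeps the slash (`hiv/aids_vaccine`), while the second turns it into an underscore (`hiv_aids_vaccine`). For the keywords of the record shown above, which contain no slashes, both patterns give the same output.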
