### Raw Extraction

In [None]:
# Q1. What are the most relevant and important details one would want to extract
# from a clinical trial?

# Patient Enrollment
# Frequency of procedures
# A classification of how burdensome the procedures are
# Is treatment involved?
# Inclusion and Exclusion Criteria

In [None]:
# Main problem to solve
# 1. Get all procedures
# 2. Get frequency of procedures

In [None]:
# Problem structure
# I will solve this with GPT-3 for now,
# and then write an article about it.

# 0. Fetch clinical trials from PubMed
# 1. LangChain for localization of relevant text [Key]
# 2. Arm extraction, procedure extraction, treatment extraction (GPT-3) [Key]
# 3. Frequency of treatments and procedure extraction (GPT-3) [Key]
# 4. Calculation of patient burden score
  # 4.1 Very simple calculation: frequency per time period * (total time or number of cycles) * procedure severity score
  # 4.2 We sum this over all procedures in a trial arm
  # 4.3 Now we have the patient burden per trial arm
  # 4.4 We take the greatest arm burden as the patient procedure burden of the trial

# 5. Repeat process across 50-100 trials
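
The burden calculation in step 4 can be sketched as below. This is a minimal illustration, not a settled implementation: the `Procedure` fields, the severity scale, and every number in the example are assumptions made up for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    name: str
    freq_per_period: float  # how often the procedure occurs per period (e.g. per cycle)
    n_periods: float        # total number of periods (e.g. number of cycles)
    severity: float         # hypothetical severity score; the scale is an assumption

def arm_burden(procedures):
    """Steps 4.1 and 4.2: sum frequency * periods * severity over an arm's procedures."""
    return sum(p.freq_per_period * p.n_periods * p.severity for p in procedures)

def trial_burden(arms):
    """Steps 4.3 and 4.4: compute each arm's burden and take the largest."""
    return max(arm_burden(procs) for procs in arms.values())

# Illustrative numbers only, not taken from any trial
arms = {
    "drug + chemotherapy": [
        Procedure("infusion", freq_per_period=1, n_periods=3, severity=3),
        Procedure("oral tablet", freq_per_period=21, n_periods=3, severity=1),
    ],
    "drug alone": [
        Procedure("oral tablet", freq_per_period=21, n_periods=3, severity=1),
    ],
}
print(trial_burden(arms))
```

Taking the maximum over arms follows step 4.4; a mean or an enrollment-weighted average would be an equally simple alternative if a typical rather than worst-case burden is wanted.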

In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [None]:
context = """Patients will be randomized, via an Interactive Voice/Web Response System, 1:1:1 to either: placebo once daily (qd) plus investigators' choice of platinum-based chemotherapy (three 21-day cycles of carboplatin area under the curve 5 + pemetrexed 500 mg/m2 or cisplatin 75 mg/m2 plus pemetrexed 500 mg/m2); osimertinib 80 mg qd plus investigators' choice of chemotherapy (as above); or osimertinib 80 mg qd monotherapy for ≥9 weeks."""

Here is a possible text using the keywords:

The following text extracts all the procedures and their frequency in the context of a randomized controlled trial (RCT). An RCT is a research design that compares the effectiveness of different interventions by randomly assigning participants to different groups. The basic steps in conducting an RCT are: drawing up a protocol, selecting reference and experimental populations, randomization, manipulation or intervention, follow-up, and assessment of outcome.

Arm: Placebo + chemotherapy
Procedure: Receive placebo once daily plus investigators' choice of platinum-based chemotherapy (three 21-day cycles of carboplatin area under the curve 5 + pemetrexed 500 mg/m2 or cisplatin 75 mg/m2 plus pemetrexed 500 mg/m2)
Frequency: Once daily for placebo; every 21 days for chemotherapy

Arm: Osimertinib + chemotherapy
Procedure: Receive osimertinib 80 mg once daily plus investigators' choice of chemotherapy (as above)
Frequency: Once daily for osimertinib; every 21 days for chemotherapy

Arm: Osimertinib monotherapy
Procedure: Receive osimertinib 80 mg once daily
Frequency: Once daily for ≥9 weeks

In [None]:
from transformers import pipeline
model = pipeline(model="google/flan-t5-base")


In [None]:
prompt = "Extract all the clinical treatment arms on patients from the given context. Context: {} Answer:".format(context)
output = model(prompt, max_length=200, do_sample=False)

In [None]:
output[0]['generated_text']

"placebo once daily (qd) plus investigators' choice of platinum-based chemotherapy (three 21-day cycles of carboplatin area under the curve 5 + pemetrexed 500 mg/m2 or cisplatin 75 mg/m2 plus pemetrexed 500 mg/m2); osimertinib 80 mg qd plus investigators' choice of chemotherapy (as above); or osimertinib 80 mg qd monotherapy for 9 weeks"

In [None]:
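# Take the first arm (the model separated the arms with ';') and ask for its procedures.
# Named arm_text rather than input to avoid shadowing the built-in input().
arm_text = output[0]['generated_text'].split(';')[0]
prompt = "Return all the procedures in the given clinical trial arm in the context, comma-separated. Context: {} Answer:".format(arm_text)
out = model(prompt, max_length=128, do_sample=False)
out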

[{'generated_text': 'cisplatin 75 mg/m2 plus pemetrexed 500 mg/m2'}]

In [None]:
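# Same first arm as before, this time asking for the chemotherapy dosage.
# Named arm_text rather than input to avoid shadowing the built-in input().
arm_text = output[0]['generated_text'].split(';')[0]
prompt = "What is the dosage of the chemotherapy in the clinical trial arm in the context? Context: {} Answer:".format(arm_text)
out = model(prompt, max_length=128, do_sample=False)
out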

[{'generated_text': "once daily (qd) plus investigators' choice of platinum-based chemotherapy"}]
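
The two cells above only query the first arm. A small sketch of how every arm could be turned into its own prompt is given below. The helper names `split_arms` and `build_prompts` are hypothetical, the sketch assumes the model keeps separating arms with `;`, and the model call itself is left out so the block runs without downloading anything.

```python
# Prompt wording copied from the procedure-extraction cell above
PROCEDURE_TEMPLATE = (
    "Return all the procedures in the given clinical trial arm in the context, "
    "comma-separated. Context: {} Answer:"
)

def split_arms(generated_text):
    """Split the arm list on ';', dropping empty pieces and a leading 'or'."""
    arms = []
    for piece in generated_text.split(';'):
        piece = piece.strip()
        if piece.lower().startswith('or '):
            piece = piece[3:].strip()
        if piece:
            arms.append(piece)
    return arms

def build_prompts(generated_text, template=PROCEDURE_TEMPLATE):
    """Return one prompt per arm."""
    return [template.format(arm) for arm in split_arms(generated_text)]

# Shortened, made-up stand-in for output[0]['generated_text']
example = (
    "placebo once daily (qd) plus chemotherapy; "
    "osimertinib 80 mg qd plus chemotherapy; "
    "or osimertinib 80 mg qd monotherapy for 9 weeks"
)
prompts = build_prompts(example)
for p in prompts:
    print(p)
```

Each prompt could then be passed to `model(...)` in a loop, in the same way as the single-arm cells.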

### Part 0.0: Fetching relevant text using PubTator API - Experimentation

In [None]:
!pip install biopython

In [None]:
from Bio import Entrez

Entrez.email = "ayushlall27@gmail.com"

handle = Entrez.esearch(db="pubmed", retmax=10, term="(clinical[Title/Abstract] AND trial[Title/Abstract]) OR clinical trials as topic[MeSH Terms] OR clinical trial[Publication Type] OR random*[Title/Abstract] OR random allocation[MeSH Terms] OR therapeutic use[MeSH Subheading]")
record = Entrez.read(handle)

handle.close()

In [None]:
record



In [None]:
handle = Entrez.efetch(db="pubmed", id='28199710', retmode="xml")

In [None]:
data = Entrez.read(handle)

In [1]:
import json
print(json.dumps(data, indent=3))

NameError: name 'data' is not defined

In [None]:
import requests
id = '26903255'
url = 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_{0}/{1}/{2}'.format('json', id,'unicode')
print(url)
response = requests.get(url)

https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/26903255/unicode


In [None]:
data = response.json()

In [None]:
out = []
for text in data['documents'][0]['passages']:
    if 'text' in text.keys():
      out.append(text['text'])

In [None]:
out

### Part 0.1: Fetching relevant text using BioC API - code

In [2]:
import requests
import json

idx = '30088834'
idx1 = '3194'

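def fetch_data(idx, path):
    # Download the BioC JSON for one article, write its passage text to `path`,
    # and return the text so the caller can use it directly.
    url = 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_{0}/{1}/{2}'.format('json', idx,'unicode')
    resp = requests.get(url)
    resp.raise_for_status()  # fail early on HTTP errors

    data = resp.json()
    out = []
    for passage in data['documents'][0]['passages']:
        if 'text' in passage:
            out.append(passage['text'])

    outstr = '\n'.join(out)
    with open(path, 'wt') as f:
        f.write(outstr)
    return outstr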

In [3]:
resp = fetch_data(idx, idx + '.txt')
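
`fetch_data` keeps every passage, including references and acknowledgements. A possible refinement is to keep only selected sections. The sketch below assumes each passage carries an `infons['section_type']` label such as `METHODS`; that field name and its values are an assumption about the BioC output and should be checked against a real response. The function name `passages_by_section` and the mock document are made up for illustration.

```python
def passages_by_section(bioc_doc, wanted=("METHODS",)):
    """Return the text of passages whose section_type is in `wanted`."""
    texts = []
    for passage in bioc_doc.get("passages", []):
        # section_type is assumed to live under 'infons'; missing values are skipped
        section = passage.get("infons", {}).get("section_type", "")
        if section.upper() in wanted and passage.get("text"):
            texts.append(passage["text"])
    return texts

# Hand-written mock with the same shape as data['documents'][0]
mock_doc = {
    "passages": [
        {"infons": {"section_type": "TITLE"}, "text": "A randomized trial"},
        {"infons": {"section_type": "METHODS"}, "text": "Patients were randomized 1:1."},
        {"infons": {"section_type": "REF"}, "text": "Smith et al."},
        {"infons": {}},
    ]
}
methods = passages_by_section(mock_doc)
print(methods)
```

Filtering before embedding would shrink the vector store and might make the later similarity search less likely to return discussion text.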

### Part 1.0: Localization - Experimentation

In [4]:
!pip install langchain
!pip install transformers
!pip install sentence-transformers
!pip install faiss-cpu
import nltk
nltk.download('punkt')

Collecting langchain
  Downloading langchain-0.0.161-py3-none-any.whl (758 kB)
Collecting async-timeout<5.0.0,>=4.0.0
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)
Collecting pydantic<2,>=1
  Downloading pydantic-1.10.7-cp38-cp38-macosx_10_9_x86_64.whl (2.9 MB)
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.8.4-cp38-cp38-macosx_10_9_x86_64.whl (359 kB)

[nltk_data] Downloading package punkt to /Users/ayushlall/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

loader = TextLoader(idx +'.txt')
documents = loader.load()



In [None]:
# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=0)

# Split your docs into texts
texts = text_splitter.split_documents(documents)
print (f"You have {len(texts)} documents")

# Get embedding engine ready
embeddings = HuggingFaceEmbeddings(model_name='pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
# Embed your texts
db = FAISS.from_documents(texts, embeddings)

You have 14 documents



In [None]:
print ("Preview:")
print (texts[0].page_content, "\n")
print (texts[1].page_content)

Preview:
Mindfulness‐based cognitive therapy for patients with chronic, treatment‐resistant depression: A pragmatic randomized controlled trial
Background
Chronic and treatment‐resistant depressions pose serious problems in mental health care. Mindfulness‐based cognitive therapy (MBCT) is an effective treatment for remitted and currently depressed patients. It is, however, unknown whether MBCT is effective for chronic, treatment‐resistant depressed patients.
Method
A pragmatic, multicenter, randomized‐controlled trial was conducted comparing treatment‐as‐usual (TAU) with MBCT + TAU in 106 chronically depressed outpatients who previously received pharmacotherapy (≥4 weeks) and psychological treatment (≥10 sessions).
Results
Based on the intention‐to‐treat (ITT) analysis, participants in the MBCT + TAU condition did not have significantly fewer depressive symptoms than those in the TAU condition (–3.23 [–6.99 to 0.54], d = 0.35, P = 0.09) at posttreatment. However, compared to TAU, the M

In [None]:
docs = db.similarity_search_with_score('interventions')
for d in docs:
  print(d[0].page_content[:100].replace('\n','|') + '...', d[1])

This study is the first to investigate the effectiveness of MBCT + TAU for chronic, treatment‐resist... 169.50354
Mindfulness‐based cognitive therapy for patients with chronic, treatment‐resistant depression: A pra... 175.59206
Mindfulness‐based cognitive therapy (MBCT) is an 8‐week group training that combines mindfulness med... 176.60593
Mindfulness‐based cognitive therapy for treatment‐resistant depression: A pilot study|A randomized c... 180.71136
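
The scores printed above grow as the match gets worse, which suggests they are distances rather than similarities. As far as I know, LangChain's FAISS wrapper in its default setup reports a squared Euclidean (L2) distance, so the smallest score is the best match; this should be verified against the installed version. The toy sketch below uses made-up 2-D vectors and hypothetical helper names to show the ordering that follows from that assumption.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_by_distance(query_vec, doc_vecs):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = [(name, squared_l2(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Made-up embeddings, only to illustrate the ordering
toy_docs = {
    "title": [3.0, 4.0],
    "methods": [1.0, 0.0],
    "discussion": [0.0, 1.0],
}
ranking = rank_by_distance([1.0, 0.0], toy_docs)
for name, dist in ranking:
    print(name, dist)
```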


In [None]:
retriever = db.as_retriever()

In [None]:
docs = retriever.get_relevant_documents("interventions")

In [None]:
split_str = '\n' + "="*100 + '\n'
print(split_str.join([x.page_content for x in docs[:6]]))

This study is the first to investigate the effectiveness of MBCT + TAU for chronic, treatment‐resistant depressed patients who did not benefit from pharmacotherapy and psychological treatment. The current study has high ecological validity because of its pragmatic design. Participants were moderately to severely depressed outpatients and were enrolled in MBCT trainings provided regularly at their local mental health care institution. Thereby this study provides much‐needed insight into the effectiveness rather than efficacy of MBCT, which was formulated as an important research goal in a recent review paper of MBCT (Dimidjian & Segal, 2015). However, the effect sizes found in this study (ITT: d = 0.35; PP: d = 0.45) are smaller than in previous preliminary studies of chronic or treatment‐resistant depressed patients (Barnhofer et al., 2009; Eisendrath et al., 2008; Kenny & Williams, 2007), as well as smaller than the effect size found in previous research on currently depressed patient

### Part 1.1: Abstractive Summarization to improve responses

In [None]:
# Abstractive summarization (the model generates new text rather than selecting sentences)
from transformers import pipeline

summarizer = pipeline("summarization", model="Alred/t5-small-finetuned-summarization-cnn")


In [None]:
summarizer(documents[0].page_content)

Token indices sequence length is longer than the specified maximum sequence length for this model (12034 > 512). Running this sequence through the model will result in indexing errors
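
The warning above says the full article is 12034 tokens against a 512-token limit. One workaround is to split the text into smaller pieces and summarize each piece separately. The sketch below is a rough illustration only: it counts whitespace-separated words as a stand-in for model tokens (an assumption, since real tokenizers usually produce more tokens than words), and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Synthetic input standing in for documents[0].page_content
sample = "word " * 700
chunks = chunk_words(sample, max_words=300)
print([len(c.split()) for c in chunks])
```

Each chunk could then be passed to the summarizer on its own. The notebook takes a related route in the following cells by summarizing the retrieved chunks rather than the whole article.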


In [None]:
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("hyesunyun/update-summarization-bart-large-longformer")
model = LEDForConditionalGeneration.from_pretrained("hyesunyun/update-summarization-bart-large-longformer")
model.to('cuda:0')

#input = "<EV> <t> Hypoglycemic effect of bitter melon compared with metformin in newly diagnosed type 2 diabetes patients. <abs> ETHNOPHARMACOLOGICAL RELEVANCE: Bitter melon (Momordica charantia L.) has been widely used as an traditional medicine treatment for diabetic patients in Asia. In vitro and animal studies suggested its hypoglycemic activity, but limited human studies are available to support its use. AIM OF STUDY: This study was conducted to assess the efficacy and safety of three doses of bitter melon compared with metformin. MATERIALS AND METHODS: This is a 4-week, multicenter, randomized, double-blind, active-control trial. Patients were randomized into 4 groups to receive bitter melon 500 mg/day, 1,000 mg/day, and 2,000 mg/day or metformin 1,000 mg/day. All patients were followed for 4 weeks. RESULTS: There was a significant decline in fructosamine at week 4 of the metformin group (-16.8; 95% CI, -31.2, -2.4 mumol/L) and the bitter melon 2,000 mg/day group (-10.2; 95% CI, -19.1, -1.3 mumol/L). Bitter melon 500 and 1,000 mg/day did not significantly decrease fructosamine levels (-3.5; 95% CI -11.7, 4.6 and -10.3; 95% CI -22.7, 2.2 mumol/L, respectively). CONCLUSIONS: Bitter melon had a modest hypoglycemic effect and significantly reduced fructosamine levels from baseline among patients with type 2 diabetes who received 2,000 mg/day. However, the hypoglycemic effect of bitter melon was less than metformin 1,000 mg/day. <EV> <t> Momordica charantia for type 2 diabetes mellitus. <abs> There is insufficient evidence to recommend momordica charantia for type 2 diabetes mellitus. Further studies are therefore required to address the issues of standardization and the quality control of preparations. For medical nutritional therapy, further observational trials evaluating the effects of momordica charantia are needed before RCTs are established to guide any recommendations in clinical practice."
input = docs[0].page_content
inputs_dict = tokenizer(input, padding="max_length", max_length=10240, return_tensors="pt", truncation=True)
input_ids = inputs_dict.input_ids
attention_mask = inputs_dict.attention_mask
global_attention_mask = torch.zeros_like(attention_mask)
# put global attention on <s> token
global_attention_mask[:, 0] = 1

predicted_summary_ids = model.generate(input_ids.to('cuda:0'), attention_mask=attention_mask.to('cuda:0'), global_attention_mask=global_attention_mask.to('cuda:0'))
print(tokenizer.batch_decode(predicted_summary_ids, skip_special_tokens=True))



['MBCT appears to be an effective treatment for people with depression, particularly in the treatment of persistent depressive symptoms. However, the effect sizes found in this study (ITT: d\xa0=\xa00.35; PP: 0.45) are smaller than in previous preliminary studies of chronic or treatment‐resistant depressed patients (Barnhofer et al., 2009; Eisendrath et\xa0al., 2008; Kenny & Williams, 2007), as well as smaller than the effect size found in previous research on currently depressed patients. Both the relatively small effect size and the high drop‐out rate from MBCT can likely be explained by the higher severity of symptoms in the current study sample compared to previous research (van Aalderen et\xa0Al., 2011). As only completers showed a significant decrease in depressive symptoms, obstacles to completing treatment should be investigated in future research, for example, by conducting qualitative interviews. The current study has high ecological validity because of its pragmatic design. 

In [None]:
ans = "MBCT appears to be an effective treatment for people with depression, particularly in the treatment of persistent depressive symptoms. However, the effect sizes found in this study (ITT: d\xa0=\xa00.35; PP: 0.45) are smaller than in previous preliminary studies of chronic or treatment‐resistant depressed patients (Barnhofer et al., 2009; Eisendrath et\xa0al., 2008; Kenny & Williams, 2007), as well as smaller than the effect size found in previous research on currently depressed patients. Both the relatively small effect size and the high drop‐out rate from MBCT can likely be explained by the higher severity of symptoms in the current study sample compared to previous research (van Aalderen et\xa0Al., 2011). As only completers showed a significant decrease in depressive symptoms, obstacles to completing treatment should be investigated in future research, for example, by conducting qualitative interviews. The current study has high ecological validity because of its pragmatic design. Participants were moderately to severely depressed outpatients and were not blinded to treatment allocation, which might have been a potential source of bias. Thereby this study provides much‐needed insight into the effectiveness rather than efficacy of MBCt, which was formulated as an important research goal in a recent Cochrane review (Dimidjian & Segal, 2015)."
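# Insert a newline after every character whose index is a multiple of 120.
# Index 0 counts too, so the first output line holds a single character.
ans = [a + '\n' if i % 120 == 0 else a for i, a in enumerate(ans)]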
print(''.join(ans))

M
BCT appears to be an effective treatment for people with depression, particularly in the treatment of persistent depress
ive symptoms. However, the effect sizes found in this study (ITT: d = 0.35; PP: 0.45) are smaller than in previous preli
minary studies of chronic or treatment‐resistant depressed patients (Barnhofer et al., 2009; Eisendrath et al., 2008; Ke
nny & Williams, 2007), as well as smaller than the effect size found in previous research on currently depressed patient
s. Both the relatively small effect size and the high drop‐out rate from MBCT can likely be explained by the higher seve
rity of symptoms in the current study sample compared to previous research (van Aalderen et Al., 2011). As only complete
rs showed a significant decrease in depressive symptoms, obstacles to completing treatment should be investigated in fut
ure research, for example, by conducting qualitative interviews. The current study has high ecological validity because 
of its pragmatic design. Parti
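
The character-based wrapping above breaks lines in the middle of words. A word-aware alternative using the standard library's `textwrap` is sketched below. The helper name `wrap_for_display` and the sample sentence are placeholders; replacing the non-breaking spaces (`\xa0`) first is my own addition so that they behave like ordinary break points.

```python
import textwrap

def wrap_for_display(text, width=120):
    """Wrap text at word boundaries so no line exceeds `width` characters."""
    # Treat non-breaking spaces as ordinary spaces before wrapping
    return textwrap.fill(text.replace("\xa0", " "), width=width)

sample = "MBCT appears to be an effective treatment for people with depression. " * 5
wrapped = wrap_for_display(sample, width=60)
print(wrapped)
```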

In [None]:
del model, input_ids, attention_mask, global_attention_mask
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
model_name = 'tuner007/pegasus_summarizer'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)


In [None]:
def get_response(input_text, input_max_length=1024, output_max_length=1024):
  # Tokenize, truncating the input to input_max_length tokens
  batch = tokenizer([input_text],truncation=True,padding='longest',max_length=input_max_length, return_tensors="pt").to(torch_device)
  # Beam search is deterministic; temperature only applies when sampling, so it is omitted
  gen_out = model.generate(**batch,max_length=output_max_length,num_beams=5, num_return_sequences=1)
  output_text = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
  return output_text

In [None]:
resp = get_response(docs[0].page_content)
print(resp[0], len(resp[0]))

This study is the first to investigate the effectiveness of MBCT + TAU for chronic, treatment-resistant depressed patients who did not benefit from pharmacotherapy and psychological treatment. The effect sizes found in this study (ITT: d = 0.35; PP: d = 0.45) are smaller than in previous studies of chronic or treatment-resistant depressed patients (Barnhofer et al., 2009; Eisendrath et al., Kenny & Williams, 2007), as well as smaller than the effect size found in previous research on currently depressed patients (d = 0.53; van Aalderen et al., 2011). 556


### Part 2: Pipelines

In [1]:
# Class structure
# Class 1:
  # DocumentGenerator - reads the PMC article and creates langchain documents
  # also finds the top most relevant documents to clinical trials
  # LangChain parameters controlled through a config file
# Class 2:
  # DocumentSummarizer - Summarizes the top key documents using the Pegasus model
# Class 3 (Future):
  # GPTExtractor - Uses GPT and a pre-created prompt to extract procedures from
  # the article and grade them by severity

depression_idx = '30088834'
covid19_idx = '33301246'

In [2]:
config = {
  'pmc_url':'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{0}/unicode',
  'pmc_id': covid19_idx,
  'raw_text_path':'',
  'db_path':'',
  'use_existing_db': 'False',
  'chunk_size':'1000',
  'chunk_overlap':'0',
  'embedding_model_name': 'pritamdeka/S-PubMedBert-MS-MARCO',
  'query':'interventions, methods, procedures, clinical trial arms',
  'summarizer_model_name': 'tuner007/pegasus_summarizer',
  'summarized_text_path': 'summary.txt',
  'max_summary_length': '128',
  'top_k':'3'
}
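
Every value in `config` is a string, so each class has to convert the fields it needs. A possible alternative is to parse the config once into typed settings, sketched below. `PipelineSettings` and `parse_settings` are hypothetical names, and the sketch covers only the numeric and boolean fields.

```python
from dataclasses import dataclass

@dataclass
class PipelineSettings:
    chunk_size: int
    chunk_overlap: int
    top_k: int
    max_summary_length: int
    use_existing_db: bool

def parse_settings(config):
    """Convert the string-valued config entries into typed settings."""
    return PipelineSettings(
        chunk_size=int(config['chunk_size']),
        chunk_overlap=int(config['chunk_overlap']),
        top_k=int(config['top_k']),
        max_summary_length=int(config['max_summary_length']),
        # Accept 'True', 'true', ' TRUE ' and so on; anything else is False
        use_existing_db=config['use_existing_db'].strip().lower() == 'true',
    )

# Same values as the relevant entries of the config cell above
example_config = {
    'use_existing_db': 'False',
    'chunk_size': '1000',
    'chunk_overlap': '0',
    'max_summary_length': '128',
    'top_k': '3',
}
settings = parse_settings(example_config)
print(settings)
```

Parsing up front means a malformed value such as `'chunk_size': 'abc'` fails immediately rather than midway through building the vector store.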

#### Part 2.1: Document Generator Class

In [7]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
import os
import requests

class DocumentGenerator():
    """Fetches raw text and creates vector embeddings that can be queried
    for a single PubMed article (full text or abstract).

    Note: an existing FAISS DB store can be used here, but the embedding
    model must be the same one that was used to create the DB.
    The path to the existing DB must be given as db_path, and
    'use_existing_db' must be set to 'true'.

    If a new DB is to be created, it will be created at db_path.
    Warning: this will OVERWRITE any existing DB at that path.
    """
    def __init__(self,config):
        self.url = config['pmc_url']
        self.output_path = config['raw_text_path']
        self.chunk_size = int(config['chunk_size'])
        self.chunk_overlap = int(config['chunk_overlap'])
        self.model_name = config['embedding_model_name']
        self.db_path = config['db_path']
        self.use_existing_db = config['use_existing_db']
        self.pmc_id = config['pmc_id']
        self.top_k = int(config['top_k'])

        self.load_embedding_engine()

        if self.use_existing_db.lower() == 'true':
            self.load_db_from_existing_path()
        else:
            print('Creating new vector database')
            self.fetch_from_pmc()
            self.load_text_into_langchain()
            self.split_text_into_chunks()
            self.create_vector_embeddings()

    def fetch_from_pmc(self):
        print('Fetching raw text from PMC API...', end='')
        url = self.url.format(self.pmc_id)
        resp = requests.get(url)
        # status_code is an int, so compare against 404 rather than '404'
        if resp.status_code == 404:
            raise ValueError('404 response from PMC API. Is the PMC id given correct?')

        data = resp.json()
        out = []
        for text in data['documents'][0]['passages']:
            if 'text' in text.keys():
                out.append(text['text'])

        outstr = '\n'.join(out)
        # Name the file after this instance's article id, not the notebook-level idx
        self.path = self.output_path + str(self.pmc_id) + '.txt'
        with open(self.path, 'wt') as f:
            f.write(outstr)
        print('Done!')

    def load_text_into_langchain(self):
        print('Reading the raw document into Langchain...', end='')
        # Load the file written by fetch_from_pmc
        loader = TextLoader(self.path)
        self.documents = loader.load()
        print('Done!')

    def split_text_into_chunks(self):
        # Get your splitter ready
        print('Splitting raw document into chunks...', end='')
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Split your docs into texts
        self.texts = text_splitter.split_documents(self.documents)
        print (f"Done! You have {len(self.texts)} chunks")

    def load_embedding_engine(self):
    # Get embedding engine ready
        print('Loading Embedding Engine...', end='')
        self.model = HuggingFaceEmbeddings(model_name=self.model_name)
        print('Done!')

    def create_vector_embeddings(self):
    # Embed your texts
        print('Creating vector DB and embeddings for the texts...', end='')
        self.db = FAISS.from_documents(self.texts, self.model)
        self.db.save_local(self.db_path + self.pmc_id + '.db')
        print('Done!')

    def load_db_from_existing_path(self):
        print('Loading DB from existing path...', end='')
        # FAISS.save_local writes a folder containing the index files,
        # so check for a directory rather than a single file
        if not os.path.isdir(self.db_path):
            raise ValueError('DB does not exist: there is no folder at the path: ' + self.db_path)

        self.db = FAISS.load_local(self.db_path, self.model)
        print('Done!')

    def similarity_search(self, query):
        self.docs = self.db.similarity_search(query, k=self.top_k)
        self.docs = [d.page_content for d in self.docs]
        with open(self.output_path + self.pmc_id + '_' + query.replace(' ', '_') + '.txt', 'wt') as f:
            f.write('|'.join(self.docs))
        return self.docs


#dg = DocumentGenerator(config)

In [67]:
#dep_config = config
#cancer_config = config
#cancer_config['pmc_id'] = '34278827'

In [8]:
dg = DocumentGenerator(config)
#can_dg = DocumentGenerator(cancer_config)

Loading Embedding Engine...Done!
Creating new vector database
Fetching raw text from PMC API...Done!
Reading the raw document into Langchain...Done!
Splitting raw document into chunks...Done! You have 53 chunks
Creating vector DB and embeddings for the texts...Done!


In [14]:
data = dg.db.similarity_search('patient enrollment', 3)
[str(i) + ':' + d.page_content for i, d in enumerate(data)]

['0:Between July 27, 2020, and November 14, 2020, a total of 44,820 persons were screened, and 43,548 persons 16 years of age or older underwent randomization at 152 sites worldwide (United States, 130 sites; Argentina, 1; Brazil, 2; South Africa, 4; Germany, 6; and Turkey, 9) in the phase 2/3 portion of the trial. A total of 43,448 participants received injections: 21,720 received BNT162b2 and 21,728 received placebo (Figure 1). At the data cut-off date of October 9, a total of 37,706 participants had a median of at least 2 months of safety data available after the second dose and contributed to the main safety data set. Among these 37,706 participants, 49% were female, 83% were White, 9% were Black or African American, 28% were Hispanic or Latinx, 35% were obese (body mass index [the weight in kilograms divided by the square of the height in meters] of at least 30.0), and 21% had at least one coexisting condition. The median age was 52 years, and 42% of participants were older than 5

#### Part 2.2: Document Summarizer class

In [13]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

def load_tokenizer(config):
    print('Loading Tokenizer...', end='')
    tokenizer = PegasusTokenizer.from_pretrained(config['summarizer_model_name'])
    print('Done!')
    return tokenizer

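def load_model(config):
    print('Loading Summarizer Model...', end='')
    # Prefer Apple's MPS backend when it is available, otherwise fall back to CPU
    torch_device = 'mps' if torch.backends.mps.is_available() else 'cpu'
    model = PegasusForConditionalGeneration.from_pretrained(config['summarizer_model_name']).to(torch_device)
    print('Done!')
    return model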

tokenizer = load_tokenizer(config)
model = load_model(config)

Loading Tokenizer...Done!
Loading Summarizer Model...Done!


In [17]:
from tqdm import tqdm
class DocumentSummarizer():
    def __init__(self, model, tokenizer, config):
        # Keep inputs on the same device the model was loaded on
        self.torch_device = model.device
        self.model = model
        self.tokenizer = tokenizer
        self.path = config['summarized_text_path']
        self.chunk_size = int(config['chunk_size'])
        self.max_out_length = int(config['max_summary_length'])

    def get_response(self, input_text):
        batch = self.tokenizer([input_text],truncation=True,padding='longest',max_length=self.chunk_size, return_tensors="pt").to(self.torch_device)
        # Beam search is deterministic; temperature only applies when sampling, so it is omitted
        gen_out = self.model.generate(**batch,max_length=self.max_out_length,num_beams=5, num_return_sequences=1)
        output_text = self.tokenizer.batch_decode(gen_out, skip_special_tokens=True)
        return output_text

    def run_on_documents(self, documents):
        # Accepts either a path to a '|'-separated text file
        # or a list of LangChain documents
        if isinstance(documents, str):
            if not os.path.isfile(documents):
                raise ValueError('The path specified does not exist.')

            with open(documents, 'r') as f:
                documents = f.read()

            documents = documents.split('|')
        else:
            documents = [d.page_content for d in documents]

        self.summarized = []
        for doc in tqdm(documents):
            resp = self.get_response(doc)
            self.summarized.extend(resp)

        return self.summarized

    def save(self):
        with open(self.path, 'wt') as f:
            f.write('|'.join(self.summarized))

ds = DocumentSummarizer(model, tokenizer, config)
ds.run_on_documents(data)
#ds.save()

100%|█████████████████████████████████████████████| 3/3 [00:42<00:00, 14.04s/it]


['As many as 37,706 participants had a median of at least 2 months of safety data available after the second dose and contributed to the main safety data set in the Phase 2/3 trial of BNT162b2 for the treatment of COVID-19, according to data from 152 sites worldwide. Among these 37,706 participants, 49% were female, 83% were White, 9% were Black or African American, 28% were Hispanic or Latinx, 35% were obese.',
 'The safety population includes persons 16 years of age or older; a total of 43,448 participants constituted the population of enrolled persons injected with the vaccine or placebo. The modified intention-to-treat (mITT) efficacy population includes all age groups 12 years of age or older (43,355 persons; 100 participants who were 12 to 15 years of age contributed to person-time years but included no cases).',
 'Efficacy of BNT162b2 against COVID-19 after the first dose (modified intention-to-treat population) was shown in a study. The cumulative incidence of COVID-19 after th
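
A note on persistence: `save` joins the summaries with `'|'`, and `run_on_documents` splits a file on the same character, so a summary that itself contains `'|'` would be split into extra pieces on reload. The sketch below shows one possible alternative, storing the list as JSON. The helper names `save_summaries` and `load_summaries` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import json
import os
import tempfile


def save_summaries(summaries, path):
    """Write the list of summaries as a JSON array (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summaries, f, ensure_ascii=False)


def load_summaries(path):
    """Read the list of summaries back from a JSON file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up summaries; the first one contains the '|' delimiter on purpose.
summaries = ["Arm A | Arm B were compared.", "Second summary."]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summaries.json")
    save_summaries(summaries, path)
    restored = load_summaries(path)

# The '|' round trip yields more pieces than there were summaries.
pipe_round_trip = "|".join(summaries).split("|")
print(len(summaries), len(restored), len(pipe_round_trip))
```

JSON keeps each summary as its own array element, so the delimiter problem does not arise; the trade-off is that files already written by `save` would need converting.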

### Part 3: GPT Response

In [3]:
import sys
sys.path.append('./')

In [3]:
from document_generator import DocumentGenerator
from document_summarizer import DocumentSummarizer
dg = DocumentGenerator(config)

data = dg.db.similarity_search('patient enrollment', 3)
[str(i) + ':' + d.page_content for i, d in enumerate(data)]





Loading Embedding Engine...Done!
Creating new vector database
Fetching raw text from PMC API...Done!
Reading the raw document into Langchain...Done!
Splitting raw document into chunks...Done! You have 53 chunks
Creating vector DB and embeddings for the texts...Done!


['0:Between July 27, 2020, and November 14, 2020, a total of 44,820 persons were screened, and 43,548 persons 16 years of age or older underwent randomization at 152 sites worldwide (United States, 130 sites; Argentina, 1; Brazil, 2; South Africa, 4; Germany, 6; and Turkey, 9) in the phase 2/3 portion of the trial. A total of 43,448 participants received injections: 21,720 received BNT162b2 and 21,728 received placebo (Figure 1). At the data cut-off date of October 9, a total of 37,706 participants had a median of at least 2 months of safety data available after the second dose and contributed to the main safety data set. Among these 37,706 participants, 49% were female, 83% were White, 9% were Black or African American, 28% were Hispanic or Latinx, 35% were obese (body mass index [the weight in kilograms divided by the square of the height in meters] of at least 30.0), and 21% had at least one coexisting condition. The median age was 52 years, and 42% of participants were older than 5

In [None]:
ds = DocumentSummarizer(config)
ds.run_on_documents(data)

In [None]:
# Document(s)
ds.summarized

#### For the mindfulness based cognitive therapy (MBCT) clinical trial on depression

{
arm: MBCT + TAU
procedure: mindfulness-based cognitive therapy (MBCT) and treatment as usual (TAU)
frequency of procedure per week: 1 (for MBCT) + variable (for TAU)
}

{
arm: TAU
procedure: treatment as usual (TAU)
frequency of procedure per week: variable
}

[{
'arm': 'MBCT + TAU', <br>
'procedure': 'Mindfulness-based cognitive therapy (MBCT) is an eight-week group training that combines mindfulness meditation techniques with elements of cognitive therapy. MBCT teaches participants to recognise and disengage from maladaptive automatic cognitive patterns, and to develop a nonjudgmental and compassionate attitude toward their own cognitions and feelings.', <br>
'frequency of procedure per week (in numbers)': '1'<br>
}, <br>
{
'arm': 'TAU',<br>
'procedure': 'Treatment as usual (TAU) is the standard care that patients receive for their condition, which may include pharmacotherapy and psychological treatment.', <br>
'frequency of procedure per week (in numbers)': 'Varies'<br>
}]
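
To feed records like these into the burden calculation outlined at the start of the notebook (frequency per period × duration × severity score), the frequency field first has to become a number, and non-numeric answers such as `'Varies'` need explicit handling. The following is a minimal sketch under stated assumptions: the helper names `parse_weekly_frequency` and `procedure_burden` are hypothetical, the severity score of 2 is a made-up placeholder, and the 8-week duration is taken from the MBCT description above.

```python
from typing import Optional

FREQ_KEY = "frequency of procedure per week (in numbers)"

# Records shaped like the extracted output above (procedure text shortened).
records = [
    {"arm": "MBCT + TAU", "procedure": "MBCT", FREQ_KEY: "1"},
    {"arm": "TAU", "procedure": "TAU", FREQ_KEY: "Varies"},
]


def parse_weekly_frequency(value: str) -> Optional[float]:
    """Return the weekly frequency as a float, or None if it is not numeric."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def procedure_burden(freq_per_week: Optional[float], weeks: float,
                     severity: float) -> Optional[float]:
    """Frequency per week * number of weeks * severity score.

    Returns None when the frequency is unknown, rather than guessing a value.
    """
    if freq_per_week is None:
        return None
    return freq_per_week * weeks * severity


# Assumed inputs: 8 weeks (from the MBCT description), severity 2 (placeholder).
WEEKS = 8
SEVERITY = 2

burdens = {}
for rec in records:
    freq = parse_weekly_frequency(rec[FREQ_KEY])
    burdens[rec["arm"]] = procedure_burden(freq, WEEKS, SEVERITY)

print(burdens)
```

Returning `None` for `'Varies'` keeps unknown frequencies visible downstream, so they are not silently counted as zero when burdens are summed per arm.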

#### For 34278827 - Neoadjuvant osimertinib with/without chemotherapy versus chemotherapy alone for EGFR-mutated resectable non-small-cell lung cancer: NeoADAURA 

