# Data Exploration

An initial investigation into the connection between informed consent forms (ICFs) and protocols.

## Setup

Handling imports

In [1]:
import nltk
nltk.download('punkt') # get punkt to use sentence tokenizers


[nltk_data] Downloading package punkt to /home/btor/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from pathlib import Path
from pypdf import PdfReader
import torch

from nltk.tokenize import sent_tokenize


In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2") # we use this model for semantic similarity later



## Ingest Data

The `retrieve_trials.py` script downloads copies of informed consent forms (ICFs) and clinical protocols to the `icf/` and `prot/` directories, respectively. We first need to load this data for analysis.

The number of protocols may exceed the number of available ICFs. We key both sets of files by their associated ID so that each ICF can be matched to its protocol later on.

In [5]:
icf_files = {i.name.split("_")[0]: i for i in Path("icf").glob("**/*")}
prot_files = {i.name.split("_")[0]: i for i in Path("prot").glob("**/*")}
len(icf_files), len(prot_files)

(28, 42)
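
Since the two directories can hold different sets of IDs, it may help to list which IDs appear in both before reading any PDFs. The following is a minimal sketch, assuming file names of the form `<ID>_<suffix>.pdf`. The file names below are made up for illustration, and `shared_ids` is a hypothetical helper that is not part of the notebook.

```python
from pathlib import PurePath


def shared_ids(icf_files: dict, prot_files: dict) -> list:
    """Return the sorted IDs that have both an ICF and a protocol."""
    return sorted(icf_files.keys() & prot_files.keys())


# Hypothetical file names; the real ones come from retrieve_trials.py.
icf_example = {p.name.split("_")[0]: p
               for p in map(PurePath, ["icf/A1_icf.pdf", "icf/B2_icf.pdf"])}
prot_example = {p.name.split("_")[0]: p
                for p in map(PurePath, ["prot/B2_prot.pdf", "prot/C3_prot.pdf"])}

print(shared_ids(icf_example, prot_example))  # ['B2']
```

Only IDs returned by such a helper would have both documents available for comparison.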

Read the text from the PDF files.

In [6]:
icf_text = {}
prot_text = {}
for k in icf_files.keys():
    if k not in prot_files:
        continue
    reader = PdfReader(prot_files[k])
    text = "\n\n".join([page.extract_text() for page in reader.pages])
    prot_text[k] = text
    
    reader = PdfReader(icf_files[k])
    text = "\n\n".join([page.extract_text() for page in reader.pages])
    icf_text[k] = text

## Calculate Similarity

We can take a few approaches to calculating the overlap between the text:

* Semantic Similarity - generate a text embedding for both documents, then calculate their similarity. This is likely ineffective, since the protocols are significantly larger than the ICFs and may contain details that are intentionally left out of the ICFs.
* Ngram overlap - generate ngrams from the protocol and ICF files, then score the pair by the ngrams they share. This is better, but still not ideal, since the ngram calculation I have does not respect sentences. Furthermore, minor paraphrasing between the texts will not be accounted for by this approach. (You could pair it with semantic similarity between ngrams, but at that point it may be more effort than it's worth.)
* Fuzzy matching - calculate the Levenshtein distance between the two texts. This is usually useful for detecting similarities between texts where one is a modified version of the other. However, it isn't as effective here, since the protocol documents are so much larger than the ICF documents.
* Sentence Semantic Similarity - generate a text embedding for every sentence in both documents, then calculate a similarity matrix between the sentences. From the similarity matrix, find the highest similarity score for each sentence in the ICF, and aggregate those scores across the document.

The Sentence Semantic Similarity approach is the one I primarily pursued. We take the maximum similarity for each ICF sentence, as it best reflects the hypothesis that the ICFs are informed by the protocols. Minimum and average similarity are unlikely to be useful, since these scores will be artificially low. The protocols and ICFs cover a range of semantically distinct topics (procedures, risks, boilerplate text), so we cannot assume that every pair of sentences is relevant to each other.
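
To make the max-then-aggregate idea concrete, here is a small pure-Python sketch. It does not use the embedding model; the 2-d vectors are made up for illustration, and `cosine` and `best_match_scores` are hypothetical helper names.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_match_scores(icf_vecs, prot_vecs):
    """For each ICF vector, keep its highest similarity to any protocol vector."""
    return [max(cosine(i, p) for p in prot_vecs) for i in icf_vecs]


# Toy "sentence embeddings" (assumed values, not real model output)
icf_vecs = [[1.0, 0.0], [0.0, 1.0]]
prot_vecs = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

best = best_match_scores(icf_vecs, prot_vecs)
all_pairs = [cosine(i, p) for i in icf_vecs for p in prot_vecs]

print("mean of per-sentence maxima:", mean(best))
print("mean over all pairs:", mean(all_pairs))
```

In this toy case the mean over all pairs is dragged down by unrelated pairs, while the per-sentence maximum only asks whether each ICF sentence has some close counterpart in the protocol.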

For the sake of completeness, I also include code for exploring the other methods.

In [8]:
def generate_ngrams(text: str, N: int = 2) -> set:
    """Given a text, generate its set of character-level ngrams."""
    return {text[i: i + N] for i in range(len(text) - N + 1)}

def ngram_overlap(icf_text: str, prot_text: str, N: int = 2) -> float:
    """Given icf and prot, return the fraction of ICF ngrams that also appear in the protocol."""
    icf_ngrams: set = generate_ngrams(icf_text, N)
    prot_ngrams: set = generate_ngrams(prot_text, N)
    if not icf_ngrams:  # ICF text is shorter than N characters
        return 0.0

    num_shared_ngrams = len(icf_ngrams & prot_ngrams)
    # Normalise by the ICF ngrams only (not the union), since we just care about overlap from the ICF
    return num_shared_ngrams / len(icf_ngrams)
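
The ngram approach above works on characters and ignores sentence boundaries. One possible variant, sketched below, builds word-level ngrams inside each sentence. It uses a naive regex sentence splitter as an assumption to keep the example self-contained (`sent_tokenize` could be swapped in), and the function names are hypothetical.

```python
import re


def sentence_word_ngrams(text: str, n: int = 2) -> set:
    """Word-level n-grams that never cross a sentence boundary."""
    ngrams = set()
    # Naive split on '.', '!' or '?' followed by whitespace (an assumption)
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"\w+", sentence.lower())
        ngrams.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return ngrams


def word_ngram_overlap(icf_text: str, prot_text: str, n: int = 2) -> float:
    """Fraction of the ICF's word n-grams that also occur in the protocol."""
    icf_ngrams = sentence_word_ngrams(icf_text, n)
    if not icf_ngrams:
        return 0.0
    return len(icf_ngrams & sentence_word_ngrams(prot_text, n)) / len(icf_ngrams)


print(sentence_word_ngrams("The drug is safe. Take it daily."))
```

This still misses paraphrases, but it avoids counting ngrams that straddle two unrelated sentences.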

In [10]:
from thefuzz import fuzz

def fuzzy_matching(icf_text: str, prot_text: str):
    """Calculate a Levenshtein-based similarity ratio (0-100) between two texts"""
    return fuzz.ratio(icf_text, prot_text)
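
To illustrate why a whole-document ratio suffers when one text is much longer than the other, the sketch below uses the standard library's `difflib.SequenceMatcher` rather than `thefuzz`. It is a comparable length-normalised ratio, not the same implementation, and the strings are made up.

```python
from difflib import SequenceMatcher

# Made-up strings: the "protocol" contains the whole "ICF" plus unrelated filler.
icf_like = "may cause nausea"
prot_like = icf_like + " " + "x" * 83

# autojunk=False disables difflib's heuristic for long sequences
ratio = SequenceMatcher(None, icf_like, prot_like, autojunk=False).ratio()

# difflib defines ratio as 2 * matched / (len(a) + len(b)),
# so the longer text caps the score even when the short text matches fully.
upper_bound = 2 * len(icf_like) / (len(icf_like) + len(prot_like))
print(round(ratio, 3), round(upper_bound, 3))
```

Even though every character of the short text appears in the long one, the score stays low because it is normalised by the combined length.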

In [7]:
def calculate_icf_similarity_matrix(icf_text: str, prot_text: str, model):
    """Given two blocks of text, calculate their sentence similarity matrix"""
    icf_sentences = sent_tokenize(icf_text)
    prot_sentences = sent_tokenize(prot_text)
    icf_embeddings = model.encode(icf_sentences)
    prot_embeddings = model.encode(prot_sentences)

    return model.similarity(icf_embeddings, prot_embeddings)

In [16]:
from tqdm import tqdm

scores = {}
for key in tqdm(icf_text.keys()):

    # account for places where we have the id for icf
    #   but not for protocols
    if key not in prot_text:
        continue
    
    i = icf_text[key]
    p = prot_text[key]

    similarities = calculate_icf_similarity_matrix(i, p, model)
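    # best protocol match for each ICF sentence (rows = ICF sentences)
    best_matches = similarities.max(dim=1).values
    max_median = best_matches.median()
    max_mean = best_matches.mean()  # computed for comparison; not stored

    scores[key] = max_median  # per-document score is the median best match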

100%|██████████| 20/20 [01:56<00:00,  5.83s/it]


### Results

Averaging the per-document scores gives roughly 0.785, where each document's score is the median, over its ICF sentences, of the highest similarity to any protocol sentence. This is high enough for me to infer that the text in the ICFs is highly similar to the text in the protocols.

In [14]:
sum(scores.values())/len(scores) # calculate the average

tensor(0.7851)