In [1]:
!pip install sentence-transformers








In [None]:
from docx import Document
from storage import upload_file

# Load the summary template
summary_doc = Document("Summary.docx")

# Dynamically pull every field name from the first column of each table
report_fields = []
for table in summary_doc.tables:
    for row in table.rows:
        fld = row.cells[0].text.strip()
        # skip empty rows, duplicates, and template placeholder text
        if fld and fld not in report_fields and fld.lower() not in ("choose an item.", "click here to enter text"):
            report_fields.append(fld)

print(f"📝 Found {len(report_fields)} fields to summarize:")
for f in report_fields:
    print("  •", f)



📝 Found 74 fields to summarize:
  • Protocol Summary - Master Disclosure Document for Interventional Studies
  • The Information on this page will not be posted
  • Before using this template for authoring, refer to the supplemental instructions on Find-IT. This template is used for ALL Interventional studies that evaluate the safety, efficacy or effectiveness of a GSK product.
Master Disclosure Document (MDD) serves as the source document to disclose protocol related information across different clinical trial registers (e.g. ClinicalTrials.gov, EU Clinical Trials Information System (EU CTIS) and/or GSK/ViiV Clinical Study Register) as required by external regulations and/or GSK policy. Check in TMF for the latest version of the template before initiating a new MDD. As information from the approved MDD will be disclosed on publicly available clinical trial register(s) as required by applicable regulations and GSK policy, minimize inclusion of information that may be considered commerc
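
The extracted list mixes real field labels with long instruction blocks, and some labels carry embedded line breaks. Below is a minimal sketch of a clean-up pass that could run before matching. The helper name `clean_field_names`, the `max_len` cutoff, and the case-insensitive de-duplication are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): tidy raw first-column cell text.
PLACEHOLDERS = {"choose an item.", "click here to enter text"}

def clean_field_names(raw_cells, max_len=120):
    """Return de-duplicated field names in their original order.

    max_len is an assumed cutoff for dropping instruction blocks.
    """
    seen = set()
    fields = []
    for text in raw_cells:
        # Collapse newlines, tabs, and repeated spaces into single spaces
        name = re.sub(r"\s+", " ", text).strip()
        key = name.lower()
        if not name or key in PLACEHOLDERS or key in seen:
            continue
        if len(name) > max_len:
            continue
        seen.add(key)
        fields.append(name)
    return fields

sample = ["Brief title", "Masking type \n(EU only)", "Choose an item.", "brief title", "", "x" * 200]
print(clean_field_names(sample))
```

With the real template, the input would be the `row.cells[0].text` values collected in the loop above. The length cutoff would need checking against the template so that no genuine field label is dropped.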

In [3]:
# Read the dynamically saved filename from app.py
with open("latest_protocol.txt", "r") as f:
    input_filename = f.read().strip()

# Load the DOCX file
source_doc = Document(input_filename)
paragraphs = [p.text.strip() for p in source_doc.paragraphs if p.text.strip()]

print(f"✅ Loaded file: {input_filename}")
print(f"📄 Total source paragraphs: {len(paragraphs)}")


✅ Loaded file: Protocol_20250625-214012.docx
📄 Total source paragraphs: 835
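
Protocol documents usually contain table-of-contents entries, field-code residue such as "PAGE", and bare headings among their paragraphs, and these can outrank body text in a similarity search. Below is a minimal sketch of a heuristic filter. The regular expression (a line ending in a tab plus a page number) and the `min_words` threshold are assumptions chosen for illustration and are not tuned on this protocol.

```python
import re

# Assumed pattern: a table-of-contents entry ends with a tab followed by a page number.
TOC_LINE = re.compile(r"\t\d+$")

def is_body_paragraph(text, min_words=4):
    """Heuristic filter; min_words is an illustrative threshold."""
    if TOC_LINE.search(text):
        return False
    return len(text.split()) >= min_words

sample = [
    "6.\tSTUDY INTERVENTION(S) AND CONCOMITANT THERAPY\t46",
    "PAGE",
    "STUDY DESIGN",
    "Male or female 18 to 65 years, at the time of signing the informed consent form.",
]
body = [p for p in sample if is_body_paragraph(p)]
print(body)
```

Dropping short lines also removes headings such as "Brief Title:" that can be useful anchors, so it may be preferable to keep headings in a separate list rather than discard them.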


In [4]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model that turns text into embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')




In [5]:
# Encode all source paragraphs into embeddings
paragraph_embeddings = model.encode(paragraphs, convert_to_tensor=True)


In [6]:
# Dictionary to store best-matched paragraph for each field
matched_paragraphs = {}

for field in report_fields:
    # Encode the field name
    field_embedding = model.encode(field, convert_to_tensor=True)

    # Calculate cosine similarity with all source paragraphs
    similarities = util.cos_sim(field_embedding, paragraph_embeddings)[0]

    # Find the index of the paragraph with the highest similarity score
    # (argmax returns a 0-dim tensor, so convert it to a plain int)
    top_idx = int(similarities.argmax())
    best_para = paragraphs[top_idx]

    print(f"\n🟩 FIELD: {field}")
    print(f"🔍 Best Matching Paragraph:\n{best_para}\n")

    # Save the match
    matched_paragraphs[field] = best_para



🟩 FIELD: Protocol Summary - Master Disclosure Document for Interventional Studies
🔍 Best Matching Paragraph:
6.	STUDY INTERVENTION(S) AND CONCOMITANT THERAPY	46


🟩 FIELD: The Information on this page will not be posted
🔍 Best Matching Paragraph:
PAGE




🟩 FIELD: Before using this template for authoring, refer to the supplemental instructions on Find-IT. This template is used for ALL Interventional studies that evaluate the safety, efficacy or effectiveness of a GSK product.
Master Disclosure Document (MDD) serves as the source document to disclose protocol related information across different clinical trial registers (e.g. ClinicalTrials.gov, EU Clinical Trials Information System (EU CTIS) and/or GSK/ViiV Clinical Study Register) as required by external regulations and/or GSK policy. Check in TMF for the latest version of the template before initiating a new MDD. As information from the approved MDD will be disclosed on publicly available clinical trial register(s) as required by applicable regulations and GSK policy, minimize inclusion of information that may be considered commercially confidential.
🔍 Best Matching Paragraph:
The key design elements of this protocol and results summaries will be posted on www.ClinicalTrials.gov and/


🟩 FIELD: Study identifier /CTMS number
🔍 Best Matching Paragraph:
STUDY DESIGN


🟩 FIELD: European Union (EU) Clinical Trial Regulation (EU CTR) number (if applicable)
🔍 Best Matching Paragraph:
Providing oversight of the conduct of the study at the site and adherence to requirements of 21 Code of Federal Regulation (CFR), ICH guidelines, the IRB/IEC, European regulation 536/2014 for clinical studies (if applicable), and all other applicable local regulations


🟩 FIELD: Is this an Applicable Clinical Trial (ACT)?
🔍 Best Matching Paragraph:
The participant has participated in a clinical trial and has received an investigational product within the following time period prior to the first dosing day in the current study: 3 months, 5 half-lives or twice the duration of the biological effect of the investigational product (whichever is longer).


🟩 FIELD: Trial registers where the study will be disclosed
🔍 Best Matching Paragraph:
STUDY ASSESSMENTS AND PROCEDURES




🟩 FIELD: MDD version date
🔍 Best Matching Paragraph:
Dates of administration including start and end dates


🟩 FIELD: Approver:
🔍 Best Matching Paragraph:
10.1.3.	Informed consent process	69


🟩 FIELD: Clinical Lead/equivalent
🔍 Best Matching Paragraph:
Participant safety will be continuously monitored by the Sponsor’s Medical Monitor, and designated Safety Lead (or delegate) throughout the study. Pertinent findings and conclusions are shared with the product’s safety review team for review of the overall benefit-risk profile of the product.


🟩 FIELD: Section # and Name of the field
{Add section number and name}
🔍 Best Matching Paragraph:
Medical Monitor Name and Contact Information:


🟩 FIELD: Unique Protocol ID
🔍 Best Matching Paragraph:
Protocol Title:




🟩 FIELD: Brief title
🔍 Best Matching Paragraph:
Brief Title:


🟩 FIELD: Acronym
🔍 Best Matching Paragraph:
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS


🟩 FIELD: CTMS abbreviated title [EU only]
🔍 Best Matching Paragraph:
LIST OF ABBREVIATIONS AND DEFINITIONS OF TERMS




🟩 FIELD: Official title of the trial
🔍 Best Matching Paragraph:
Brief Title:


🟩 FIELD: Secondary IDs
🔍 Best Matching Paragraph:
Secondary estimands


🟩 FIELD: Secondary ID type
🔍 Best Matching Paragraph:
Secondary estimands


🟩 FIELD: Sponsor
🔍 Best Matching Paragraph:
Sponsor Signatory:


🟩 FIELD: Collaborators
🔍 Best Matching Paragraph:
Activity


🟩 FIELD: Brief summary
🔍 Best Matching Paragraph:
Brief Title:


🟩 FIELD: Detailed description
🔍 Best Matching Paragraph:
Synopsis


🟩 FIELD: Main objective(s) [EU only]
🔍 Best Matching Paragraph:
Objectives, Endpoints, and Estimands:


🟩 FIELD: Secondary objective(s) [EU only]
🔍 Best Matching Paragraph:
OBJECTIVES, ENDPOINTS AND ESTIMANDS


🟩 FIELD: Data monitoring committee
🔍 Best Matching Paragraph:
Data Monitoring/Other Committee:


🟩 FIELD: Keywords
🔍 Best Matching Paragraph:
REFERENCES


🟩 FIELD: Medical condition(s) investigated [EU only]
🔍 Best Matching Paragraph:
Abnormal laboratory findings associated with the underlying disease


🟩 FIELD: Therapeutic area [EU only]
🔍 Best Matching Paragraph:
Providing oversight of the conduct of the study at the site and adherence to requirements of 21 Code of Federal Regulation (CFR), ICH guidelines, the IRB/IEC, European regulation 536/2014 for clinical studies (if applicable), and all other applicable local regulations


🟩 FIELD: Rare disease [EU only]
🔍 Best Matching Paragraph:
Abnormal laboratory findings associated with the underlying disease are not considered clinically significant unless judged by the Investigator to be more severe than expected for the participant’s condition.


🟩 FIELD: Primary purpose
🔍 Best Matching Paragraph:
Reason for use


🟩 FIELD: Study Phase
🔍 Best Matching Paragraph:
STUDY DESIGN


🟩 FIELD: Type of human pharmacology (Phase 1) study (EU only)
🔍 Best Matching Paragraph:
Pharmacodynamics


🟩 FIELD: Interventional study model
🔍 Best Matching Paragraph:
Study intervention(s) administered


🟩 FIELD: Allocation
🔍 Best Matching Paragraph:
Committe


🟩 FIELD: Number of arms
🔍 Best Matching Paragraph:
Number of Participants:


🟩 FIELD: Masking type 
(EU only)
🔍 Best Matching Paragraph:
Blinding/Masking


🟩 FIELD: Masking description / Blinding implementation details
🔍 Best Matching Paragraph:
Blinding/Masking


🟩 FIELD: Arms
Repeat below rows depending on number of arms
🔍 Best Matching Paragraph:
Exclusion criteria


🟩 FIELD: Arm / Group Label
🔍 Best Matching Paragraph:
Not applicable as this is an open-label study.


🟩 FIELD: Arm / Group description
🔍 Best Matching Paragraph:
The estimand is described by the following attributes:


🟩 FIELD: Interventions
Repeat below rows depending on number of interventions
🔍 Best Matching Paragraph:
Table 4	Study intervention administered.


🟩 FIELD: Intervention name
🔍 Best Matching Paragraph:
Study intervention(s) administered


🟩 FIELD: Intervention description
🔍 Best Matching Paragraph:
Assignment to study intervention




🟩 FIELD: Other names
🔍 Best Matching Paragraph:
REFERENCES


🟩 FIELD: Relationship between Arms & Interventions
Repeat below rows depending on number of arms)
🔍 Best Matching Paragraph:
Table 4	Study intervention administered.	48


🟩 FIELD: Arm
🔍 Best Matching Paragraph:
APPENDIX


🟩 FIELD: Outcome measure type
🔍 Best Matching Paragraph:
Assessment of outcomes


🟩 FIELD: Outcome measure title
🔍 Best Matching Paragraph:
Assessment of outcomes


🟩 FIELD: Outcome measure description
🔍 Best Matching Paragraph:
Assessment of outcomes


🟩 FIELD: Time frame
🔍 Best Matching Paragraph:
* Timeframe allowed after receipt or awareness of the information by the Investigator/site staff.


🟩 FIELD: Sex
🔍 Best Matching Paragraph:
Activity


🟩 FIELD: Gender based
🔍 Best Matching Paragraph:
Women in the following categories are considered WONCBP:


🟩 FIELD: Age Limits
🔍 Best Matching Paragraph:
Male or female 18 to 65 years, at the time of signing the informed consent form.


🟩 FIELD: Age ranges (EU on


🟩 FIELD: Why study stopped
🔍 Best Matching Paragraph:
Discontinuation of study intervention


🟩 FIELD: Investigational New Drug Application (IND)/Investigational Device Exemption (IDE) Information
🔍 Best Matching Paragraph:
The Investigator is obligated to perform or arrange for the conduct of supplemental measurements and/or evaluations as medically indicated or as requested by Sponsor to elucidate the nature and/or causality of the AE or SAE as fully as possible. This may include additional laboratory tests or investigations, histopathological examinations, or consultation with other health care professionals.


🟩 FIELD: U.S. FDA IND/IDE study
🔍 Best Matching Paragraph:
Appendix 1: Regulatory, ethical, and study oversight considerations


🟩 FIELD: FDA center (formerly IND/IDE grantor)
🔍 Best Matching Paragraph:
That I am aware of and will comply with Good Clinical Practise (GCP) and all applicable regulatory requirements.


🟩 FIELD: IND/IDE number
🔍 Best Matching Paragraph:
An Inves


🟩 FIELD: Product exported from US
🔍 Best Matching Paragraph:
Seroxat [SmPC; Summary of Product Characteristics]. Brentford, UK: GSK; 2022.


🟩 FIELD: Availability of expanded access
🔍 Best Matching Paragraph:
Inclusion criteria


🟩 FIELD: Expanded access record National Clinical Trial (NCT) number
🔍 Best Matching Paragraph:
10.1.7.	Dissemination of Clinical Study Data	71


🟩 FIELD: Plan to share IPD data
🔍 Best Matching Paragraph:
The participant must be informed that their personal study-related data will be used by the Sponsor in accordance with local data protection law. The level of disclosure must also be explained to the participant, that their data will be used as described in the informed consent.


🟩 FIELD: url
🔍 Best Matching Paragraph:
PAGE
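
Taking only the single highest-scoring paragraph always returns something, even when nothing in the source is relevant (the "url" field matching "PAGE" is one example). Below is a minimal sketch of top-k retrieval with a minimum score, written in pure Python on toy vectors so that it runs without a model. The function names and the 0.5 threshold are assumptions. In the notebook itself, `util.semantic_search` from sentence-transformers offers a comparable top-k lookup over embedding tensors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_matches(query_vec, corpus_vecs, k=3, min_score=0.5):
    """Return up to k (index, score) pairs whose score is at least min_score."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(i, round(score, 3)) for score, i in scored[:k] if score >= min_score]

# Toy 2-D "embeddings" standing in for paragraph vectors
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k_matches([1.0, 0.0], corpus, k=2))   # two candidates above the threshold
print(top_k_matches([-1.0, 0.0], corpus, k=2))  # nothing relevant, so an empty list
```

An empty result can then be stored as "no match" for that field instead of an unrelated paragraph.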



In [None]:
import json

# Save the field-to-paragraph matches to a JSON file
with open("matched_paragraphs.json", "w", encoding="utf-8") as f:
    json.dump(matched_paragraphs, f, indent=2)

upload_file("matched_paragraphs.json", "json", "preprocessed-file")
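
Before uploading, it can help to confirm that the saved file reloads to the same dictionary. Below is a minimal sketch of that round trip, using a toy dictionary and a temporary directory as stand-ins for the real data and path. Passing `ensure_ascii=False` is an optional choice that keeps typographic characters such as curly apostrophes readable in the file.

```python
import json
import os
import tempfile

# Toy stand-in for matched_paragraphs; the real dict comes from the matching step.
matches = {
    "Brief title": "Brief Title:",
    "Clinical Lead/equivalent": "the Sponsor’s Medical Monitor",
}

# Temporary location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "matched_paragraphs.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(matches, fh, indent=2, ensure_ascii=False)

# Reload and compare with the in-memory dictionary
with open(path, "r", encoding="utf-8") as fh:
    reloaded = json.load(fh)

print(reloaded == matches)
```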
