<a href="https://colab.research.google.com/github/christophergaughan/NLP_Public/blob/main/Copy_of_NLP_FDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Introduction

In this notebook, we use a pretrained **question-answering language model** to query and analyze data sourced from the **FDA (Food and Drug Administration)**. By structuring information in a context–question–answer format, we can apply extractive question answering to pull insights from clinical and regulatory documents.

## Key Goals

1. **Data Preparation**: Organize FDA dataset content into a question-answer-friendly format.
2. **Model Integration**: Use a language model capable of understanding domain-specific terminology and context.
3. **Query Execution**: Demonstrate how to pose queries about FDA data (e.g., clinical trials, drug efficacy, labeling information) and retrieve precise answers.
4. **Validation**: Ensure that generated responses align with the source information to maintain accuracy and reliability.

Through this project, we seek to streamline how we interact with complex FDA data, making it more accessible and actionable using the power of natural language processing.
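
As a rough illustration of the validation goal, predicted answers can be scored against reference answers. The sketch below assumes SQuAD-style scoring (exact match and token-level F1); the helper names `normalize`, `exact_match`, and `token_f1` are hypothetical and do not appear elsewhere in this notebook.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings, for illustration only.
print(exact_match("The stomach bleeding.", "stomach bleeding"))
print(token_f1("stomach bleeding warning", "stomach bleeding"))
```

Exact match is strict, while token F1 gives partial credit when a prediction overlaps the reference, which can be useful for long label passages.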




## Weights & Biases (W&B)

**Weights & Biases** (often abbreviated as **W&B**) is a powerful experiment tracking and model management tool commonly used in machine learning and deep learning projects. It helps you monitor and compare different runs or experiments by logging various metrics, parameters, and artifacts.

### Why Use `wandb` (W&B)?

1. **Experiment Tracking**  
   W&B keeps a detailed record of your training metrics (e.g., loss, accuracy) over time. It allows you to visualize these metrics in real time and compare them across different runs.

2. **Hyperparameter Management**  
   You can log hyperparameters such as learning rate, batch size, or model architecture details. This makes it straightforward to see which parameter settings lead to better performance.

3. **Collaboration and Reproducibility**  
   Using W&B, you can easily share your results and experiments with your team. This shared environment helps maintain reproducibility since everything you need to recreate a run (code, hyperparameters, metrics, etc.) is logged.

4. **Artifact Storage**  
   W&B can store model checkpoints, plots, and other generated files. You can later retrieve them for analysis, comparison, or deployment.

### How We’re Using W&B

In this project, we leverage W&B to:
- **Track** the performance of our model(s) over multiple training runs.  
- **Compare** the effects of different hyperparameter settings or data pre-processing strategies.  
- **Log** important artifacts like final model weights, figures, and tables for easy reference.

By using W&B, we aim to streamline the experimentation workflow, making it more transparent and easier to improve and reproduce.


In [None]:
!pip install wandb
import wandb
import random

## 1. Set Up Colab Environment
Open a new Colab notebook and enable GPU acceleration. We'll use an A100, a powerful GPU that should be able to handle the rigors of NLP training.

In [None]:
# 1. Install & Imports
!pip install wandb
import wandb
import random
import torch
# ... other imports

# 2. Initialize W&B
wandb.init(
    project="nlp_fda",
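    # Placeholder hyperparameters for this simulated run; they do not describe
    # the DistilBERT QA model fine-tuned later in the notebook.
    config={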
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    }
)

# 3. Simulated training loop (metrics are random numbers; no model is trained here)
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset

    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# 4. Finish the run
wandb.finish()


In [None]:
# In Colab: Runtime -> Change runtime type -> Hardware accelerator -> GPU

!nvidia-smi  # Optional: Check that GPU is enabled
!pip install transformers datasets  # Hugging Face libraries


## 2. (Optional) Fetching FDA Data
We can fetch some sample data from the FDA drug label endpoint. For demonstration, we’ll just pull one or two records and store relevant text. You’ll want more data for a real-world model.

In [None]:
import requests
import json

base_url = "https://api.fda.gov/drug/label.json"
params = {
    "search": "brand_name:advil",  # Example query
    "limit": 1
}

# The FDA drug label endpoint does NOT require an API key.
# We can simply make a GET request with the desired query parameters.
response = requests.get(base_url, params=params, timeout=30)
data = response.json()

# Extract the relevant text fields from the response as needed.
# Guard against both a missing "results" key and an empty results list.
results = data.get("results", [])
if results:
    record = results[0]
    warnings_text = record.get("warnings", [""])[0]
else:
    warnings_text = "No data found."

print(warnings_text)
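
The same extraction pattern recurs for other label sections, so it may be convenient to wrap it in a small helper. The sketch below is an assumption-laden example: `extract_section` is a hypothetical name, and the two payloads are hand-written to mimic the response shape used above rather than taken from the FDA API.

```python
from typing import Any


def extract_section(payload: dict[str, Any], field: str,
                    default: str = "No data found.") -> str:
    """Return the first text entry of `field` from the first record, or `default`."""
    results = payload.get("results") or []
    if not results:
        return default
    values = results[0].get(field) or []
    return values[0] if values else default


# Hand-written payloads (not real FDA data); the error shape is an assumption.
sample_ok = {"results": [{"warnings": ["Example warning text."]}]}
sample_error = {"error": {"code": "NOT_FOUND"}}

print(extract_section(sample_ok, "warnings"))
print(extract_section(sample_error, "warnings"))
```

With such a helper, the warnings cell above would reduce to `warnings_text = extract_section(data, "warnings")`.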



## 3. Create a Small QA Dataset
We’ll simulate a Question-Answer dataset. Each sample will have:

* `context`: The text from the FDA label (e.g., warnings/adverse reactions).
* `question`: What we want to extract (e.g., “What are the warnings?”).
* `answers`: The labeled correct answer (including its start and end character indices in `context`).

For demonstration, we’ll do minimal manual labeling here. In a real project, I’d systematically label many examples (or build them with a script).


In [None]:
from datasets import Dataset

# Example: We assume the label says something like:
context_text = warnings_text

# We'll pretend the user manually identified the answer to the question below:
question_text = "What are the warnings for this medication?"
answer_text = warnings_text  # For demonstration, we use the entire text as answer

start_idx = context_text.find(answer_text)
if start_idx == -1:
    # If the text doesn't contain the answer due to formatting, just set it to 0
    start_idx = 0
end_idx = start_idx + len(answer_text)

train_data = {
    "context": [context_text],
    "question": [question_text],
    "answers": [{"text": [answer_text], "answer_start": [start_idx]}]
}

# Create a Hugging Face Dataset from this dictionary
train_dataset = Dataset.from_dict(train_data)

# For a real scenario, also create a validation set (val_dataset)
val_dataset = Dataset.from_dict(train_data)  # Just reusing the same for demo
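
For labeling more than one example, a small builder can locate each answer span and fail loudly when the answer is not present verbatim. This is a minimal sketch: `make_qa_example` and `to_columns` are hypothetical helpers, and the label text is invented for illustration.

```python
def make_qa_example(context: str, question: str, answer: str) -> dict:
    """Build one SQuAD-style record, locating `answer` inside `context`."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer text does not occur verbatim in the context")
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }


def to_columns(examples: list[dict]) -> dict:
    """Convert a list of records into the column-oriented dict used above."""
    keys = ("context", "question", "answers")
    return {key: [ex[key] for ex in examples] for key in keys}


# Invented label text, for illustration only.
context = "Stomach bleeding warning: this product contains an NSAID."
example = make_qa_example(context, "What does the product contain?", "an NSAID")
columns = to_columns([example])

start = example["answers"]["answer_start"][0]
print(context[start:start + len("an NSAID")])
```

Raising an error differs from the cell above, which falls back to index 0. The stricter behavior is a design choice that avoids silently training on a wrong span.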


## 4. Load a Pretrained QA Model and Tokenizer
Let’s use the DistilBERT model fine-tuned on SQuAD (a common QA dataset). Then we’ll further fine-tune it on our custom data.

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)


## 5. Preprocess the Data (Tokenization)
We need a function that takes each data row (context + question + answers) and tokenizes it into the correct format for QA tasks.

In [None]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,  # typical max sequence length for QA
        truncation="only_second",  # truncate the context, never the question
        padding="max_length"
    )

    # To train QA, we also need labels (start/end positions) in tokenized form
    start_positions = []
    end_positions = []

    for i, text in enumerate(examples["context"]):
        answer_starts = examples["answers"][i]["answer_start"]
        answer_texts = examples["answers"][i]["text"]

        # We assume a single answer. If multiple, you'd handle each accordingly.
        start_char = answer_starts[0]
        end_char = start_char + len(answer_texts[0])

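        # Convert character positions to token positions.
        # sequence_index=1 targets the context (second sequence); the default
        # of 0 would look the characters up in the question instead.
        token_start = inputs.char_to_token(i, start_char, sequence_index=1)
        token_end = inputs.char_to_token(i, end_char - 1, sequence_index=1)

        # char_to_token returns None when the character was truncated away
        if token_start is None:
            token_start = 0
        if token_end is None:
            # Fall back to the last context token rather than a padding token
            context_tokens = [k for k, s in enumerate(inputs.sequence_ids(i)) if s == 1]
            token_end = context_tokens[-1] if context_tokens else 0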

        start_positions.append(token_start)
        end_positions.append(token_end)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs
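
To make the character-to-token conversion concrete, here is a toy version that runs without any model download. It assumes a whitespace tokenizer as a stand-in for the real subword tokenizer; `tokenize_with_offsets` and `toy_char_to_token` are hypothetical names, and the sentence is invented.

```python
import re


def tokenize_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Toy tokenizer: split on whitespace, keeping each token's (start, end) span."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def toy_char_to_token(offsets: list[tuple[str, int, int]], char_index: int):
    """Return the index of the token covering `char_index`, or None if none does."""
    for token_index, (_, start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


context = "Do not use if you have ever had an allergic reaction."
answer = "allergic reaction"
start_char = context.find(answer)
end_char = start_char + len(answer)  # exclusive end

offsets = tokenize_with_offsets(context)
token_start = toy_char_to_token(offsets, start_char)
token_end = toy_char_to_token(offsets, end_char - 1)  # last character of the answer
print(token_start, token_end)  # 9 10
```

The `None` result for characters that fall outside every token is the same situation the edge-case handling in `preprocess_function` guards against.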


**Now apply this to our datasets:**

In [None]:
train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)

# For training, we typically remove columns we don't need
train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "start_positions", "end_positions"]
)
val_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "start_positions", "end_positions"]
)


## 6. Train the Model (Using Hugging Face Trainer)
We’ll use the Trainer API, which handles a lot of the boilerplate for us.

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",      # Use eval_strategy instead of evaluation_strategy
    logging_strategy="steps",   # Log every N steps
    logging_steps=1,            # Choose your logging interval
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train!
trainer.train()
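
After training, a question-answering model returns one start logit and one end logit per token, and the answer is the span with the best combined score. The following is a simplified sketch of that selection step using made-up logits: `best_span` is a hypothetical helper, and a fuller version would also exclude question and padding tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Made-up tokens and logits for a six-token input, for illustration only.
tokens = ["[CLS]", "what", "[SEP]", "stomach", "bleeding", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 4.0, 1.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.5, 5.0, 0.0]

start, end, score = best_span(start_logits, end_logits)
print(tokens[start:end + 1], score)
```

Restricting the search to `start <= end` and a maximum length prevents the inverted or overly long spans that taking the two argmax values independently can produce.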


In [None]:
!pip install wandb -qU

In [None]:
import random
import math

# Launch 5 simulated experiments
total_runs = 5
for run in range(total_runs):
  # 1️. Start a new run to track this script
  wandb.init(
      # Set the project where this run will be logged
      project="basic-intro",
      # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
      name=f"experiment_{run}",
      # Track hyperparameters and run metadata
      config={
      "learning_rate": 0.02,
      "architecture": "CNN",
      "dataset": "CIFAR-100",
      "epochs": 10,
      })

  # This simple block simulates a training loop logging metrics
  epochs = 10
  offset = random.random() / 5
  for epoch in range(2, epochs):
      acc = 1 - 2 ** -epoch - random.random() / epoch - offset
      loss = 2 ** -epoch + random.random() / epoch + offset

      # 2️. Log metrics from your script to W&B
      wandb.log({"acc": acc, "loss": loss})

  # Mark the run as finished
  wandb.finish()

# Debug Process (To Be Continued)

In [None]:
!pip install --upgrade datasets


In [None]:
# qa_dataset is built in a later cell (Dataset.from_dict(data_dict)); run that cell first.
print(type(qa_dataset))
print(qa_dataset)
print(qa_dataset.column_names)
print(qa_dataset.num_rows)


In [None]:
qa_dataset = qa_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["context", "question", "answers"]
)


In [None]:
print(type(qa_dataset))


In [None]:
# data_dict is defined in a later cell; run that cell first.
qa_dataset = Dataset.from_dict(data_dict)
qa_dataset = qa_dataset.train_test_split(test_size=0.2)


In [None]:
base_url = "https://api.fda.gov/drug/label.json"
params = {"search": "brand_name:Keytruda", "limit": 3}
response = requests.get(base_url, params=params)
data = response.json()
results = data.get("results", [])
clinical_texts = []

for record in results:
    section_14 = record.get("clinical_studies", [""])[0]
    clinical_texts.append(section_14 if section_14 else "No data found")


In [None]:
print(len(results))
if not results:
    print("No data found from FDA API!")
else:
    print("Sample record keys:", results[0].keys())


In [None]:
print("Number of clinical_texts:", len(clinical_texts))
if len(clinical_texts) > 0:
    print("First item in clinical_texts:", clinical_texts[0][:300])


In [None]:
contexts, questions, answers = [], [], []

for text in clinical_texts:
    contexts.append(text)
    questions.append("What key clinical trials ...?")
    answers.append({
        "text": [text],
        "answer_start": [0]
    })


In [None]:
print("contexts:", contexts)
print("questions:", questions)
print("answers:", answers)


In [None]:
if len(contexts) == 0:
    print("No data to create a dataset!")
else:
    data_dict = {
        "context": contexts,
        "question": questions,
        "answers": answers
    }
    qa_dataset = Dataset.from_dict(data_dict)
    print(qa_dataset)


In [None]:
data_dict = {
    "context": [
        "Section 14: Some clinical data about Keytruda in KEYNOTE-006..."
    ],
    "question": [
        "What key clinical trials are described in Section 14?"
    ],
    "answers": [
        {
            "text": ["Some clinical data about Keytruda in KEYNOTE-006..."],
            "answer_start": [12]  # "Section 14: " is 12 characters long
        }
    ]
}
qa_dataset = Dataset.from_dict(data_dict)
print("Columns:", qa_dataset.column_names)
print("Num rows:", qa_dataset.num_rows)
print(qa_dataset[0])


In [None]:
params = {
  "search": "brand_name:KEYTRUDA",
  "limit": 3
}


In [None]:
print("API Raw JSON:", data)


In [None]:
params = {
    "search": "brand_name:KEYTRUDA",
    "limit": 3
}
url = "https://api.fda.gov/drug/label.json"
response = requests.get(url, params=params)
print(response.json())


In [None]:
data_dict = {
    "context": ["Section 14: The efficacy of KEYTRUDA in KEYNOTE-006..."],
    "question": ["What key clinical trials are described in Section 14?"],
    "answers": [
        {
            "text": ["The efficacy of KEYTRUDA in KEYNOTE-006..."],
            "answer_start": [12]  # "Section 14: " is 12 characters long
        }
    ]
}
qa_dataset = Dataset.from_dict(data_dict)
print(qa_dataset)


In [None]:
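# Each assignment below overwrites `params`, so only the last one takes effect.
# Keep just the option you want to test.
# Option A: openfda.brand_name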
params = {
    "search": "openfda.brand_name:KEYTRUDA",
    "limit": 5
}

# Option B: generic_name or substance_name
params = {
    "search": "openfda.generic_name:pembrolizumab",
    "limit": 5
}

# Option C: broad search using brand name or partial
params = {
    "search": "KEYTRUDA",  # or brand_name:KEY* etc.
    "limit": 5
}
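
Rather than editing `params` by hand between attempts, the candidate queries could be tried in order until one returns records. This is a sketch under stated assumptions: `first_query_with_results` and `fake_fetch` are hypothetical, and `fake_fetch` is an offline stand-in that only pretends the generic-name query matches.

```python
from typing import Callable, Optional


def first_query_with_results(
    candidates: list[dict],
    fetch: Callable[[dict], dict],
) -> Optional[tuple[dict, list]]:
    """Try each params dict in order; return (params, results) for the first hit."""
    for params in candidates:
        payload = fetch(params)
        results = payload.get("results") or []
        if results:
            return params, results
    return None


candidates = [
    {"search": "openfda.brand_name:KEYTRUDA", "limit": 5},
    {"search": "openfda.generic_name:pembrolizumab", "limit": 5},
    {"search": "KEYTRUDA", "limit": 5},
]


def fake_fetch(params: dict) -> dict:
    """Offline stand-in for the API call, returning placeholder data."""
    if params["search"].startswith("openfda.generic_name"):
        return {"results": [{"clinical_studies": ["placeholder text"]}]}
    return {"error": {"code": "NOT_FOUND"}}


hit = first_query_with_results(candidates, fake_fetch)
print(hit[0]["search"] if hit else "no query returned results")
```

In the notebook itself, `fetch` could be a function that wraps `requests.get(base_url, params=params).json()`.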


In [None]:
response = requests.get(base_url, params=params)
data = response.json()

print(data)  # Inspect full output
if "results" in data:
    print("Found", len(data["results"]), "results.")
else:
    print("No results:", data)


In [None]:
if "results" in data:
    print("Sample keys in first record:", data["results"][0].keys())
    # Check if there's a 'clinical_studies' or 'description', etc.


In [None]:
data_dict = {
    "context": [
        "Section 14: The efficacy of KEYTRUDA (pembrolizumab) was investigated in KEYNOTE-006..."
    ],
    "question": [
        "What key clinical trials are described in Section 14?"
    ],
    "answers": [
        {
            "text": ["The efficacy of KEYTRUDA (pembrolizumab) was investigated in KEYNOTE-006..."],
            "answer_start": [12]  # "Section 14: " is 12 characters long
        }
    ]
}
qa_dataset = Dataset.from_dict(data_dict)
print(qa_dataset)
# Then map -> set_format -> train as you intended.
