<a href="https://colab.research.google.com/github/christophergaughan/Bioinformatics-Code/blob/main/Copy_of_VascuLogic_Model_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Suggestions for Improvement

### 1. Enhanced Training Dataset
* Training a neural network effectively requires a high-quality, labeled dataset. I'll try to combine data from the following open-source biomedical databases:
    * PubMed Central (PMC): Open-access full-text articles from PubMed.
    * CORD-19: COVID-19 research dataset with abstracts and full-text papers.
    * BioASQ: Biomedical QA datasets, including PubMed citations.
    * MIMIC-III: Clinical database with de-identified electronic health records (if applicable; access requires PhysioNet credentialing).
### Custom Labeling:
* Use the current workflow to label query-aspect pairs manually. For example:
    * Positive examples: Relevant abstracts with high similarity scores.
    * Negative examples: Irrelevant abstracts.
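
The positive/negative labeling described above can be bootstrapped before any manual review. The sketch below is one possible approach, not the workflow's actual method: it assumes that abstracts retrieved for a different query can serve as provisional negatives, which will not always hold, so the output should still be spot-checked by hand. The function name `build_labeled_pairs` and the toy data are hypothetical.

```python
import random

def build_labeled_pairs(abstracts_by_query, negatives_per_positive=1, seed=0):
    """Return (query, abstract, label) triples.

    Positives pair a query with its own abstracts (label 1); provisional
    negatives pair it with abstracts retrieved for a different query (label 0).
    """
    rng = random.Random(seed)
    pairs = []
    for query, positives in abstracts_by_query.items():
        # Candidate negatives: abstracts that belong to any other query
        others = [
            abstract
            for other_query, abstracts in abstracts_by_query.items()
            if other_query != query
            for abstract in abstracts
        ]
        for abstract in positives:
            pairs.append((query, abstract, 1))
            k = min(negatives_per_positive, len(others))
            for negative in rng.sample(others, k):
                pairs.append((query, negative, 0))
    return pairs

# Toy data for illustration only
toy = {
    "Cancer therapy": ["Abstract about tumor immunotherapy."],
    "Diabetes treatment": ["Abstract about insulin dosing."],
}
pairs = build_labeled_pairs(toy)
for row in pairs:
    print(row)
```

Sampling negatives at a fixed ratio keeps the classes roughly balanced, which matters because a dataset made only of positives gives a classifier nothing to discriminate against.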

### 2. Target Task for Neural Network
* Focus on a specific task such as:
    * Abstract Classification: Classify PubMed abstracts into predefined categories like "Therapeutic Targets" or "Clinical Trials."
    * Query-Result Relevance Ranking: Predict relevance scores between a query and abstract text.
    * Abstract Summarization: Improve over existing summarization with fine-tuned models.

### 3. Neural Network Architecture
* Transformer-Based Models: Utilize pretrained models like BioBERT or PubMedBERT, fine-tuned for:
    * Classification (e.g., abstract categories).
    * Sequence similarity (e.g., query-abstract relevance).
* Custom Layers: Add dense layers for specific tasks like ranking or classification.
* Summarization Model Enhancement: Fine-tune models like T5 or Pegasus for domain-specific summarization.

### 4. Integration with Current Workflow
* Modify workflow to:
    * Extract relevant training data from open sources.
    * Preprocess the data (e.g., tokenization, query-abstract pairs).
    * Train and evaluate the neural network.
    * Replace or supplement the existing query expansion, summarization, or ranking modules with the trained model.

# Steps for Version 2 Notebook

1. ### Data Collection:

    * Write code to fetch, clean, and preprocess data from open-source databases.
    * Augment with your existing queries and PubMed results for a richer dataset.

2. ### Preprocessing:

    * Implement tokenization, synonym expansion, and labeling pipelines.
    * Use tools like `nltk`, `spacy`, or transformers for efficient preprocessing.

3. ### Neural Network Design:

    * Fine-tune BioBERT or similar models on your dataset.
    * Evaluate with metrics like accuracy (for classification) or NDCG (for ranking).

4. ### Validation and Deployment:

    * Validate the trained model on a holdout dataset.
    * Replace or supplement the components in the existing workflow with the neural network outputs.

5. ### Feedback Loop:

    * Use the neural network predictions to update and refine the dataset iteratively.
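
Step 3 above names NDCG as the ranking metric. As a rough illustration of what that number measures, here is a minimal sketch of NDCG@k in plain Python. It assumes the linear-gain variant (relevance divided by the log of the rank); other variants use an exponential gain, so the choice here is an assumption rather than something the notebook specifies.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the abstracts in the order the model ranked them
ranked_relevances = [1, 0, 1]
print(f"NDCG@3: {ndcg_at_k(ranked_relevances, 3):.4f}")
```

A perfect ranking scores 1.0, and the score drops as relevant abstracts are pushed further down the list.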

## Implementation Strategy
### I propose we:

1. Design the data collection and preprocessing pipeline for training data.
2. Set up a BioBERT-based neural network for a chosen task.
3. Integrate evaluation metrics and a feedback mechanism.
4. Optimize for scalability (e.g., batch processing, GPU utilization).

# PyTorch-Specific Approach for Version 2
## 1. Data Collection & Preprocessing
1. Data Sources:
    * Pull data from PubMed, PMC, CORD-19, and other open datasets.
    * Format as pairs of queries and abstracts, with labels for training.
2. Preprocessing with PyTorch:
    * Tokenize data using `transformers` (Hugging Face library) with models like BioBERT or PubMedBERT.
    * Use PyTorch `DataLoader` for batching and efficient data pipeline setup.
## 2. Neural Network Setup
### Model Architecture:
* Base Model: Fine-tune pretrained transformer models (e.g., BioBERT) for tasks such as:
    * Query-to-abstract relevance classification.
    * Abstract summarization or classification into predefined categories.
* Custom Layers: Add dense layers for specific tasks with activation functions (e.g., ReLU, Softmax).
### Loss Functions:
* Binary Cross-Entropy Loss for relevance classification.
* Cross-Entropy Loss for multi-class categorization.
* MSE or custom loss for ranking tasks.
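
To make the binary cross-entropy option concrete, the sketch below computes it for a single example in plain Python, starting from the raw logit. This is an illustrative derivation, not the training code: in PyTorch, `torch.nn.BCEWithLogitsLoss` takes logits directly, while `torch.nn.BCELoss` expects probabilities that have already passed through a sigmoid.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bce_with_logits(logit, target):
    """Binary cross-entropy for one example, computed from the raw logit.

    Algebraically equal to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    but rearranged so that no log of a value near zero is ever taken.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# A confident correct prediction costs little; a confident wrong one costs a lot
for logit, target in [(4.0, 1), (4.0, 0), (0.0, 1)]:
    print(f"logit={logit:+.1f} target={target} loss={bce_with_logits(logit, target):.4f}")
```

Working from logits is generally the more numerically robust choice, which is worth keeping in mind if the classifier's `forward` method returns sigmoid outputs.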
## 3. Training and Validation
* Use a custom PyTorch training loop over the `torch.nn.Module`, or a Trainer abstraction (e.g., the Hugging Face `Trainer`), for efficient training.
### GPU Acceleration:
* Utilize CUDA for efficient computation (`model.to(device)` or `.cuda()`).
### Metrics:
* Accuracy, F1 Score for classification.
* NDCG or MAP for ranking.
### Techniques:
* Early stopping to prevent overfitting.
* Learning rate scheduling with `torch.optim.lr_scheduler`.
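
Early stopping is simple enough to sketch without any framework. The helper below is a hypothetical illustration (the class name, `patience`, and `min_delta` are assumptions, not part of the existing workflow): it tracks the best validation loss and signals a stop once the loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training run
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("Stopped at epoch:", stopped_at, "best loss:", stopper.best)
```

In a real loop, the model weights would typically be checkpointed whenever `best` improves so that the best epoch can be restored after stopping.
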
## 4. Integration and Workflow Update
* Replace parts of the existing workflow with PyTorch models:
    * Query expansion with a neural network.
    * Abstract classification or ranking with a trained model.
    * Summarization using fine-tuned T5 or Pegasus on PyTorch.
    * <u>Use PyTorch outputs to populate the Excel file grid.</u>

**Load the Excel sheet into a pandas DataFrame**

In [None]:
import pandas as pd

# Define the file path
file_path = '/content/drive/MyDrive/vasculogic/Updated_indications_and_assets.xlsx'

# Load the Excel file into a pandas DataFrame
vasc_df = pd.read_excel(file_path)

# Display the first few rows of the dataframe to ensure it is loaded correctly
vasc_df.head()


#### Start with our rules for how to handle the excel sheet

In [None]:
import re
import pandas as pd
from torch.utils.data import Dataset, DataLoader

def group_queries(data):
    """
    Groups subqueries under their respective major topics.
    Rows starting with a number (e.g., '1.', '2.') are considered major topics.
    Subqueries are grouped under the last identified major topic.
    """
    grouped_data = []
    current_topic = None

    for item in data:
        if pd.isna(item):  # Skip empty cells, which pandas reads as NaN
            continue
        if re.match(r'^\d+\.', str(item)):  # Major topics start with a number followed by '.'
            current_topic = item
        elif current_topic:
            grouped_data.append((item, current_topic))  # Append subquery with its topic

    return grouped_data

# Example usage
# `vasc_df` is the DataFrame loaded above; its zeroth column contains the queries
top_indication_col_0 = vasc_df.iloc[:, 0].tolist()  # Extract the zeroth column as a list
grouped_queries = group_queries(top_indication_col_0)

# Convert to a PyTorch-compatible Dataset
class QueryDataset(Dataset):
    def __init__(self, grouped_queries):
        self.data = grouped_queries

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        query, topic = self.data[idx]
        return {"query": query, "topic": topic}

# Create a DataLoader
query_dataset = QueryDataset(grouped_queries)
query_loader = DataLoader(query_dataset, batch_size=16, shuffle=True)

# Example iteration
for batch in query_loader:
    print(batch)


## Immediate Next Steps
### Set Up Data Collection Pipelines:

* Start with scripts to pull data from open-source biomedical databases like PubMed, PMC, and CORD-19.
* Focus on creating a clean and structured dataset in a format suitable for PyTorch training (e.g., CSV or JSON with fields for queries, abstracts, and labels).
### Preprocess the Collected Data:

* Tokenize the text using Hugging Face's transformers library and ensure compatibility with models like BioBERT or PubMedBERT.
* Split data into training, validation, and testing sets.
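
A minimal sketch of the train/validation/test split, using only the standard library, might look like the following. The fractions, the seed, and the function name `split_dataset` are assumptions chosen for illustration. Note that this splits at the level of individual query-abstract pairs; if the same abstract appears under several queries, splitting by query or by abstract may be needed to avoid leakage between splits.

```python
import random

def split_dataset(records, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle records reproducibly and split them into train, validation, and test lists."""
    shuffled = list(records)  # Copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Placeholder records in the query/abstract/label format described above
records = [{"query": f"q{i}", "abstract": f"a{i}", "label": i % 2} for i in range(20)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split stable across notebook restarts, so validation results stay comparable between runs.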

### Integrate Preprocessing into Workflow:

* Create PyTorch-compatible Dataset and DataLoader classes for batching and efficient loading.

### Define the Neural Network Task:

* Decide on a specific task for initial implementation (e.g., query-to-abstract relevance classification).
* Prepare the data labels accordingly.

**Step 1: Data Collection**

Write a function to pull data from open-source databases:

In [None]:
import requests

def fetch_pubmed_abstracts(query, max_results=100):
    """
    Searches PubMed and returns the PubMed IDs (PMIDs) that match the query.
    Only the IDs are returned here, not the abstract text.
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors
    data = response.json()
    return data.get("esearchresult", {}).get("idlist", [])

# Example usage
abstract_ids = fetch_pubmed_abstracts("Cancer therapy", max_results=10)
print(abstract_ids)


**Step 2: Data Preprocessing**

Use Hugging Face transformers for tokenization:

In [None]:
from transformers import AutoTokenizer

# Load BioBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

def preprocess_text(text):
    """
    Tokenizes text for input to BioBERT.
    """
    tokens = tokenizer(
        text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    return tokens


**Step 3: PyTorch Dataset and DataLoader**

Structure the data for training:

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

# Load BioBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

def preprocess_text(text):
    """
    Tokenizes text for input to BioBERT.
    """
    tokens = tokenizer(
        text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )
    return tokens

class PubMedDataset(Dataset):
    def __init__(self, queries, abstracts, labels):
        self.queries = queries
        self.abstracts = abstracts
        self.labels = labels

    def __len__(self):
        return len(self.queries)

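    def __getitem__(self, idx):
        # preprocess_text returns tensors of shape [1, 512]; drop the leading dimension
        # so the DataLoader yields [batch_size, 512] instead of [batch_size, 1, 512]
        query = {k: v.squeeze(0) for k, v in preprocess_text(self.queries[idx]).items()}
        abstract = {k: v.squeeze(0) for k, v in preprocess_text(self.abstracts[idx]).items()}
        return {
            "query": query,
            "abstract": abstract,
            "label": torch.tensor(self.labels[idx], dtype=torch.float),
        }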

# Example usage
queries = ["Cancer treatment", "Diabetes therapy"]
abstracts = ["Abstract text 1", "Abstract text 2"]
labels = [1, 0]

dataset = PubMedDataset(queries, abstracts, labels)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in data_loader:
    print(batch)


### Next Steps
#### 1. Expand Data Collection
* Automate fetching data from open-source biomedical datasets like PubMed or PMC.
* Write functions to download and parse abstracts based on queries and save them in a structured format (e.g., CSV or JSON).

#### 2. Prepare a Unified Dataset
* Organize the collected data into three key fields:
* Query: The search term or question.
* Abstract: The fetched text from the data source.
* Label: Relevance or category for supervised learning.

#### 3. Integrate Preprocessing
* Preprocess the text data (queries and abstracts) using tokenization and other techniques to prepare it for PyTorch.

Fetching Abstracts

Extend the search function so that it also retrieves the abstract text for each PubMed ID:

In [None]:
import requests
import time

def fetch_pubmed_abstracts(query, max_results=10, api_delay=1.0):
    """
    Fetch abstracts from PubMed based on the query.
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    response = requests.get(base_url, params=params)
    id_list = response.json().get("esearchresult", {}).get("idlist", [])

    if not id_list:
        return []

    # Fetch abstracts for the IDs
    abstracts = []
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    for pubmed_id in id_list:
        fetch_params = {
            "db": "pubmed",
            "id": pubmed_id,
            "retmode": "xml",
        }
        fetch_response = requests.get(fetch_url, params=fetch_params)
        # Parse the abstract
        if "<AbstractText>" in fetch_response.text:
            start = fetch_response.text.find("<AbstractText>") + len("<AbstractText>")
            end = fetch_response.text.find("</AbstractText>")
            abstract = fetch_response.text[start:end]
            abstracts.append(abstract)
        time.sleep(api_delay)  # Respect PubMed API usage limits

    return abstracts

# Example usage
query = "Cancer therapy"
abstracts = fetch_pubmed_abstracts(query, max_results=5)
print(abstracts)
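
The string search above only matches a bare `<AbstractText>` tag, so structured abstracts whose sections carry attributes (for example `<AbstractText Label="BACKGROUND">`) would be skipped, and only the first section would be captured in any case. A more tolerant alternative is to parse the response with the standard library's XML parser. The sketch below is one way to do that; the helper name `extract_abstracts` is hypothetical, and the sample XML is hand-written to resemble an efetch response rather than taken from PubMed. Since efetch accepts a comma-separated list of IDs, a single response can contain several articles, which is why the function returns a dictionary keyed by PMID.

```python
import xml.etree.ElementTree as ET

def extract_abstracts(efetch_xml):
    """Return {pmid: abstract_text} for every article in an efetch XML response."""
    root = ET.fromstring(efetch_xml)
    results = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        parts = []
        for node in article.iter("AbstractText"):
            # itertext() also collects text inside inline markup such as <i> or <sub>
            text = "".join(node.itertext()).strip()
            label = node.get("Label")
            parts.append(f"{label}: {text}" if label else text)
        if parts:  # Articles without an abstract are skipped
            results[pmid] = " ".join(parts)
    return results

# Hand-written sample shaped like a PubMed efetch response (not real data)
sample_xml = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First section.</AbstractText>
          <AbstractText Label="RESULTS">Second <i>section</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

print(extract_abstracts(sample_xml))
```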


Dataset Preparation

Combine queries, abstracts, and labels into a structured format:

In [None]:
import pandas as pd

# Example data
queries = ["Cancer therapy", "Diabetes treatment"]
abstracts = [
    "Abstract about cancer therapy.",
    "Abstract about diabetes treatment.",
]
labels = [1, 0]  # Example labels: 1 = relevant, 0 = not relevant

# Create a DataFrame
data = {"query": queries, "abstract": abstracts, "label": labels}
dataset_df = pd.DataFrame(data)

# Save to CSV for reuse
dataset_df.to_csv("training_dataset.csv", index=False)
print(dataset_df.head())


Integration with Preprocessing

Use the `PubMedDataset` class and `DataLoader` to prepare batches:

In [None]:
from torch.utils.data import DataLoader

# Load dataset
dataset = PubMedDataset(dataset_df["query"].tolist(), dataset_df["abstract"].tolist(), dataset_df["label"].tolist())

# Create DataLoader
data_loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in data_loader:
    print(batch)


## Automating Data Collection

Write a script to fetch abstracts from PubMed or other open databases using their APIs. For simplicity, we focus on PubMed and its E-utilities API.

In [None]:
import requests
import pandas as pd
import time

def fetch_pubmed_abstracts(query, max_results=10, api_delay=1.0):
    """
    Fetch abstracts from PubMed based on a search query.
    """
    # Step 1: Search for article IDs
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    search_params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
    }
    search_response = requests.get(base_url, params=search_params)
    id_list = search_response.json().get("esearchresult", {}).get("idlist", [])

    # Step 2: Fetch abstracts for the IDs
    abstracts = []
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    for pubmed_id in id_list:
        fetch_params = {
            "db": "pubmed",
            "id": pubmed_id,
            "retmode": "xml",
        }
        fetch_response = requests.get(fetch_url, params=fetch_params)
        if "<AbstractText>" in fetch_response.text:
            start = fetch_response.text.find("<AbstractText>") + len("<AbstractText>")
            end = fetch_response.text.find("</AbstractText>")
            abstract = fetch_response.text[start:end]
            abstracts.append(abstract)
        time.sleep(api_delay)  # Avoid overloading the API

    return abstracts

# Example queries
queries = ["Cancer therapy", "Diabetes treatment"]

# Collect abstracts for all queries
all_data = []
for query in queries:
    abstracts = fetch_pubmed_abstracts(query, max_results=5)
    for abstract in abstracts:
        # Every fetched abstract is labeled 1 (relevant); negative examples must be added before training
        all_data.append({"query": query, "abstract": abstract, "label": 1})

# Convert to DataFrame
dataset_df = pd.DataFrame(all_data)
dataset_df.to_csv("expanded_training_dataset.csv", index=False)
print(dataset_df.head())


Preprocessing the Data

Once the data is collected, <u>*preprocess*</u> it so that it is tokenized and ready for training.

In [None]:
from transformers import AutoTokenizer

# Load tokenizer (e.g., BioBERT tokenizer)
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

def preprocess_text_column(df, column_name):
    """
    Tokenize a column of text in a DataFrame.
    """
    tokenized = df[column_name].apply(
        lambda x: tokenizer(
            x, max_length=512, truncation=True, padding="max_length", return_tensors="pt"
        )
    )
    return tokenized

# Preprocess 'query' and 'abstract' columns
dataset_df["query_tokens"] = preprocess_text_column(dataset_df, "query")
dataset_df["abstract_tokens"] = preprocess_text_column(dataset_df, "abstract")
print(dataset_df.head())


Integration with PyTorch DataLoader

Use the `PubMedDataset` class developed earlier to load and batch the dataset.



In [None]:
from torch.utils.data import DataLoader

# Define Dataset and DataLoader
dataset = PubMedDataset(
    queries=dataset_df["query"].tolist(),
    abstracts=dataset_df["abstract"].tolist(),
    labels=dataset_df["label"].tolist()
)

data_loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Example iteration
for batch in data_loader:
    print(batch)


In [None]:
from transformers import AutoTokenizer

# Load BioBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

# Example queries and abstracts
queries = ["Cancer therapy", "Diabetes treatment"]
abstracts = [
    "This abstract discusses innovative cancer therapy approaches.",
    "Recent advances in diabetes treatment are highlighted in this abstract."
]

# Tokenize queries
query_tokens = tokenizer(
    queries,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt"
)

# Tokenize abstracts
abstract_tokens = tokenizer(
    abstracts,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt"
)

# Print tensor shapes
print("Query input_ids shape:", query_tokens["input_ids"].shape)
print("Query attention_mask shape:", query_tokens["attention_mask"].shape)
print("Abstract input_ids shape:", abstract_tokens["input_ids"].shape)
print("Abstract attention_mask shape:", abstract_tokens["attention_mask"].shape)


## Neural Network Architecture
### Base Model:

* Use BioBERT as the base model for feature extraction.
* Extract the [CLS] token representation from the final hidden state.

### Custom Layers:

* Add a fully connected (dense) layer to project the [CLS] token embedding to the desired output dimension.
* Use a dropout layer to mitigate overfitting.

### Output:

* Apply a sigmoid activation function for binary classification.


In [None]:
import torch
from torch import nn
from transformers import AutoModel

class BioBERTClassifier(nn.Module):
    def __init__(self, model_name="dmis-lab/biobert-v1.1", dropout_rate=0.3):
        super(BioBERTClassifier, self).__init__()
        # Load BioBERT model
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout_rate)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)  # Binary classification

    def forward(self, input_ids, attention_mask):
        # Forward pass through BioBERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # Extract [CLS] token
        # Apply dropout and classification layer
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)
        return torch.sigmoid(logits).squeeze(-1)  # Sigmoid activation for binary output


# Debugging: getting problems with CUDA

The error `RuntimeError: CUDA error: device-side assert triggered` is usually related to data issues when using CUDA. Here it is likely triggered within the BioBERT model during the forward pass due to one of the following:

1. Invalid Input: The input data (`dummy_input_ids` or `dummy_attention_mask`) might contain values outside the expected range, such as negative indices or token IDs at or above the vocabulary size, leading to an assertion failure in the CUDA kernel.
2. Shape Mismatch: There might be a mismatch between the shapes of the dummy input tensors and what the model expects, triggering an assertion.
3. Data Type: The input tensors might not be in the expected data type (e.g., `torch.int64` for `input_ids`).
4. Out of Memory: GPU memory exhaustion normally raises its own "CUDA out of memory" error rather than a device-side assert, but it is worth ruling out. Make sure the input sizes are manageable and no unnecessary tensors are lingering in device memory.
5. CUDA context issue: CUDA errors are reported asynchronously, so a failure in an earlier operation can surface as a seemingly unrelated error in a later call. After a device-side assert, the CUDA context stays unusable until the runtime is restarted.
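
The first cause in the list, out-of-range token IDs, can be checked on the CPU before anything touches the GPU. The sketch below is a framework-free illustration: the function name `find_invalid_token_ids`, the example IDs, and the vocabulary size are all hypothetical. With real inputs, the nested list would come from `tokenized_inputs["input_ids"].tolist()` and the bound from the model's embedding table size (`model.bert.config.vocab_size` for the classifier above).

```python
def find_invalid_token_ids(input_ids, vocab_size):
    """Return (row, column, token_id) for every ID outside [0, vocab_size)."""
    problems = []
    for row, sequence in enumerate(input_ids):
        for col, token_id in enumerate(sequence):
            if not 0 <= token_id < vocab_size:
                problems.append((row, col, token_id))
    return problems

# Hypothetical values for illustration only
example_ids = [[101, 2023, 102, 0], [101, 30000, -1, 102]]
problems = find_invalid_token_ids(example_ids, vocab_size=28996)
print(problems)
```

An empty result rules out the embedding lookup as the source of the assert, which narrows the search to shapes, dtypes, or labels.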


In [None]:
device = torch.device("cpu")  # Force use of CPU
model = BioBERTClassifier().to(device)


In [None]:
# Initialize the model without moving to the device
model = BioBERTClassifier()
print("Model initialized successfully.")


In [None]:
if torch.cuda.is_available():
    print("CUDA is available.")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available.")


In [None]:
for name, param in model.named_parameters():
    print(f"Parameter {name}: {param.shape}")


In [None]:
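# Tokenize a small test input so the checks below have something to inspect
dummy_texts = ["This is a test sentence."]
tokenized_inputs = tokenizer(dummy_texts, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
vocab_size = tokenizer.vocab_size  # Get vocabulary size
print("Vocabulary Size:", vocab_size)
print("Max Token ID in input_ids:", torch.max(tokenized_inputs["input_ids"]))
print("Min Token ID in input_ids:", torch.min(tokenized_inputs["input_ids"]))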



In [None]:
device = torch.device("cpu")  # Use CPU for debugging
dummy_input_ids = tokenized_inputs["input_ids"].to(device)
dummy_attention_mask = tokenized_inputs["attention_mask"].to(device)
model = BioBERTClassifier().to(device)
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Outputs:", outputs)


In [None]:
print("Data type of input_ids:", tokenized_inputs["input_ids"].dtype)
print("Data type of attention_mask:", tokenized_inputs["attention_mask"].dtype)


In [None]:
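# Set before the first CUDA call (restart the runtime) so errors are reported at the failing operation
%env CUDA_LAUNCH_BLOCKING=1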


In [None]:
print("Data type of input_ids:", tokenized_inputs["input_ids"].dtype)
print("Shape of input_ids:", tokenized_inputs["input_ids"].shape)
print("Input IDs:", tokenized_inputs["input_ids"])


In [None]:
model = BioBERTClassifier()
print("Model parameters:", list(model.named_parameters()))


In [None]:
dummy_input_ids = tokenized_inputs["input_ids"].to(device)
dummy_attention_mask = tokenized_inputs["attention_mask"].to(device).float()


In [None]:
print("Allocated memory:", torch.cuda.memory_allocated())
print("Reserved memory:", torch.cuda.memory_reserved())


In [None]:
torch.cuda.empty_cache()


In [None]:
device = torch.device("cpu")
dummy_input_ids = tokenized_inputs["input_ids"].to(device)
dummy_attention_mask = tokenized_inputs["attention_mask"].to(device).float()

# Move model to CPU
model = BioBERTClassifier().to(device)

# Forward pass
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Outputs:", outputs)


In [None]:
torch.cuda.empty_cache()


In [None]:
device = torch.device("cuda:0")  # Or "cuda:1" if you have a second GPU


In [None]:
tokenized_inputs = tokenizer(
    dummy_texts,
    max_length=128,  # Reduced from 512
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)


In [None]:
print(f"Allocated memory: {torch.cuda.memory_allocated()} bytes")
print(f"Reserved memory: {torch.cuda.memory_reserved()} bytes")


In [None]:
device = torch.device("cpu")


In [None]:
device = torch.device("cpu")  # Force use of CPU

# Move tensors to CPU
dummy_input_ids = tokenized_inputs["input_ids"].to(device)
dummy_attention_mask = tokenized_inputs["attention_mask"].to(device).float()

# Move model to CPU
model = BioBERTClassifier().to(device)

# Forward pass
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Outputs:", outputs)


In [None]:
tokenized_inputs = tokenizer(
    dummy_texts,
    max_length=64,  # Further reduce max_length
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)


In [None]:
# Use a single input example for debugging
dummy_texts = ["This is a test sentence."]
tokenized_inputs = tokenizer(
    dummy_texts,
    max_length=64,  # Use smaller max_length
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)


In [None]:
torch.cuda.empty_cache()
torch.cuda.init()


In [None]:
if torch.cuda.is_available():
    print(f"CUDA is available. Device: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available.")


In [None]:
dummy_texts = ["Hello world."]
tokenized_inputs = tokenizer(
    dummy_texts,
    max_length=32,  # Minimal size
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# Move tensors and model to GPU
dummy_input_ids = tokenized_inputs["input_ids"].to(device)
dummy_attention_mask = tokenized_inputs["attention_mask"].to(device).float()
model = BioBERTClassifier().to(device)

outputs = model(dummy_input_ids, dummy_attention_mask)
print("Outputs:", outputs)


In [None]:
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):  # Match the sequence length
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids, attention_mask):
        # Summing over dim=1 collapses [batch_size, input_dim] to [batch_size],
        # which no longer matches Linear(input_dim, 1) and raises a shape error
        aggregated_input = input_ids.sum(dim=1)
        return torch.sigmoid(self.linear(aggregated_input))

# Dummy tensors for simplified testing
dummy_input_ids = torch.rand((1, 32)).to(device)  # Simulating embeddings with shape [batch_size, sequence_length]
dummy_attention_mask = torch.ones((1, 32)).to(device).float()  # Shape [batch_size, sequence_length]

# Test with SimpleClassifier
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Simple model outputs:", outputs)


In [None]:
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):  # Match sequence length
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)  # Input size matches sequence length

    def forward(self, input_ids, attention_mask):
        # Pass the input directly to the linear layer
        return torch.sigmoid(self.linear(input_ids))  # Shape: [batch_size, 1]

# Dummy tensors for simplified testing
dummy_input_ids = torch.rand((1, 32)).to(device)  # Simulated embeddings with shape [batch_size, sequence_length]
dummy_attention_mask = torch.ones((1, 32)).to(device).float()  # Shape [batch_size, sequence_length]

# Test with SimpleClassifier
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Simple model outputs:", outputs)


## Try another class

In [None]:
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):  # Match the sequence length
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids, attention_mask):
        return torch.sigmoid(self.linear(input_ids))

dummy_input_ids = torch.rand((1, 32)).to(device)  # Simulated embeddings
dummy_attention_mask = torch.ones((1, 32)).to(device).float()
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(dummy_input_ids, dummy_attention_mask)
print("Simple model outputs:", outputs)


In [None]:
# Transpose the input
transposed_input = dummy_input_ids.transpose(1, 0)  # Switch dimensions
print("Transposed input shape:", transposed_input.shape)


In [None]:
dummy_input_ids = torch.rand((1, 32)).to(device)  # Shape [batch_size, sequence_length]


In [None]:
tokenized_inputs = tokenizer(
    ["This is a test sentence."],
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
dummy_input_ids = tokenized_inputs["input_ids"].to(device)  # Shape [batch_size, sequence_length]


In [None]:
print("dummy_input_ids:", dummy_input_ids)
print("Shape:", dummy_input_ids.shape)


In [None]:
# Dummy input
dummy_input_ids = torch.rand((1, 32)).to(device)  # Shape [batch_size, sequence_length]

# Transpose the input
transposed_input = dummy_input_ids.transpose(1, 0)  # Switch dimensions
print("Transposed input shape:", transposed_input.shape)

# Example usage with a simple classifier
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids):
        return torch.sigmoid(self.linear(input_ids))

# Create model
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(transposed_input)  # Raises a shape error: [32, 1] input does not match Linear(32, 1)
print("Model outputs:", outputs)


# Debugging concluded: we had a mismatch in matrix multiplication

In [None]:
# Dummy input with correct shape
dummy_input_ids = torch.rand((1, 32)).to(device)  # Shape [batch_size, input_dim]

# Example usage with a simple classifier
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids):
        return torch.sigmoid(self.linear(input_ids))

# Create model
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(dummy_input_ids)  # Directly use dummy_input_ids
print("Model outputs:", outputs)


In [None]:
# Dummy input with transpose
dummy_input_ids = torch.rand((1, 32)).to(device)  # Shape [batch_size, sequence_length]
transposed_input = dummy_input_ids.transpose(0, 1)  # Shape [sequence_length, batch_size]

# Example usage with a simple classifier
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(1, input_dim)  # Adapt input dimension for transposed input

    def forward(self, input_ids):
        return torch.sigmoid(self.linear(input_ids))

# Create model
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(transposed_input)  # Ensure shapes align with the model
print("Model outputs:", outputs)


## Final Adjustments
We want binary classification, so ensure the input and model are configured as follows:

In [None]:
# Dummy input with shape [batch_size, input_dim]
dummy_input_ids = torch.rand((1, 32)).to(device)  # Shape [1, 32]

# SimpleClassifier for binary classification
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim=32):
        super(SimpleClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids):
        return torch.sigmoid(self.linear(input_ids))

# Create model
model = SimpleClassifier(input_dim=32).to(device)
outputs = model(dummy_input_ids)
print("Model outputs:", outputs)


In [None]:
# Dummy input with shape [batch_size, sequence_length, embedding_dim]
dummy_input_ids = torch.rand((1, 32, 64)).to(device)  # Shape [1, 32, 64]

# Sequence-level SimpleClassifier
class SequenceClassifier(nn.Module):
    def __init__(self, input_dim=64):
        super(SequenceClassifier, self).__init__()
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, input_ids):
        # Apply linear layer to each token
        return torch.sigmoid(self.linear(input_ids))  # Shape: [batch_size, sequence_length, 1]

# Create model
model = SequenceClassifier(input_dim=64).to(device)
outputs = model(dummy_input_ids)
print("Model outputs:", outputs)


### A. For Binary Classification
If the task involves classifying an entire input (e.g., determining whether a query and an abstract are relevant), proceed with the binary classifier:

* Use the single output (e.g., `[0.4945]`) as the predicted score.
* Apply a threshold (e.g., `0.5`) to convert probabilities into class labels (`0` or `1`).

### B. For Sequence-Level Predictions
If the task involves token-level classification (e.g., tagging words in a sequence), continue with the sequence classifier:

* Interpret the 3D tensor output `[batch_size, sequence_length, 1]`.
* Drop the trailing dimension if necessary, e.g., `outputs.squeeze(-1)` for shape `[batch_size, sequence_length]`.

In [None]:
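# Apply a threshold to classify
# Note: `outputs` must come from the binary SimpleClassifier (shape [batch_size, 1]);
# re-run that cell first if the sequence classifier was the last model executed.
predicted_class = (outputs > 0.5).long()  # Convert probability to class label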
print("Predicted class:", predicted_class)


In [None]:
# Squeeze to remove the last dimension
sequence_scores = outputs.squeeze(-1)  # Shape: [batch_size, sequence_length]

# Apply thresholding for token-level classification
token_predictions = (sequence_scores > 0.5).long()  # Binary class labels
print("Token predictions:", token_predictions)


## A. Binary Classification
* If the task is document- or sentence-level classification, focus on the overall predicted class (e.g., `outputs > 0.5`).
* For training or evaluation:
    * Use Binary Cross-Entropy Loss for optimizing binary classification.
    * Evaluate using metrics such as accuracy, precision, recall, and F1 score.
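
The training step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual training code: the features, labels, learning rate, and number of steps are stand-in values chosen for the example. It uses `nn.BCEWithLogitsLoss`, which applies the sigmoid and binary cross-entropy in one numerically stable call, so the model returns raw logits during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data: 8 examples with 32 features each and random 0/1 labels (illustrative only)
features = torch.rand(8, 32)
labels = torch.randint(0, 2, (8, 1)).float()

# Same shape as the SimpleClassifier above, but returning raw logits
model = nn.Linear(32, 1)
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(features), labels).item()
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
final_loss = criterion(model(features), labels).item()

# Probabilities and class labels are recovered at inference time
with torch.no_grad():
    probs = torch.sigmoid(model(features))
    preds = (probs > 0.5).long()

print(f"Loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

If the model keeps its `torch.sigmoid` in `forward`, as in the classifiers above, `nn.BCELoss` is the matching loss instead.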
## B. Token-Level Classification
If the task is sequence tagging, use the token predictions output to:
* Evaluate individual tokens for their predicted classes.
* Align predictions with the input tokens for interpretability.
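
To make the alignment step concrete, here is a small sketch that pairs each token with its predicted label. The token strings and scores are hypothetical values invented for illustration; with a real Hugging Face tokenizer, the token strings could be recovered with `tokenizer.convert_ids_to_tokens(input_ids)`.

```python
import torch

# Hypothetical tokens and per-token probabilities, for illustration only
tokens = ["[CLS]", "vegf", "inhibit", "##ors", "reduce", "angiogenesis", "[SEP]"]
sequence_scores = torch.tensor([[0.10, 0.92, 0.81, 0.77, 0.30, 0.88, 0.05]])

token_predictions = (sequence_scores > 0.5).long()  # Shape: [batch_size, sequence_length]

# Pair each token with its predicted label for the first (only) example in the batch
aligned = list(zip(tokens, token_predictions[0].tolist()))
for token, label in aligned:
    print(f"{token:>14}  {label}")

# Keep only the tokens tagged as positive
tagged = [token for token, label in aligned if label == 1]
print("Tagged tokens:", tagged)
```

When inputs are padded, positions where the attention mask is 0 should be excluded before computing token-level metrics.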


Example for Metrics Evaluation

Binary Classification:

In [None]:
# Ground truth (example)
true_class = torch.tensor([1])

# Accuracy
accuracy = (predicted_class.squeeze(-1) == true_class).float().mean().item()
print(f"Accuracy: {accuracy:.2f}")


In [None]:
# Ground truth (example for sequence-level labels)
true_tokens = torch.tensor([[1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
                             1, 1, 1, 0, 1, 1, 0, 1]])

# Token-level accuracy
token_accuracy = (token_predictions == true_tokens).float().mean().item()
print(f"Token Accuracy: {token_accuracy:.2f}")


## 1. Analyze Data Distribution
* Check the class distribution for both token-level and binary classification tasks.
* Address any imbalances by oversampling the minority class or undersampling the majority class.
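
A minimal sketch of both steps, using only the standard library. The `examples` list is hypothetical stand-in data, and random oversampling with replacement is just one of several possible balancing strategies.

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical labelled examples: (text, label), with class 1 in the minority
examples = [("abstract a", 0), ("abstract b", 0), ("abstract c", 0),
            ("abstract d", 0), ("abstract e", 1), ("abstract f", 0)]

# 1. Inspect the class distribution
counts = Counter(label for _, label in examples)
print("Before:", dict(counts))

# 2. Randomly oversample every class up to the size of the largest one
target = max(counts.values())
balanced = list(examples)
for cls, n in counts.items():
    pool = [ex for ex in examples if ex[1] == cls]
    balanced.extend(random.choices(pool, k=target - n))  # sample with replacement

balanced_counts = Counter(label for _, label in balanced)
print("After:", dict(balanced_counts))
```

Oversampling is best applied to the training split only, after the train/validation split, so that duplicated examples do not leak into validation.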

## 2. Add More Data
* Expand your training dataset with real examples from PubMed, BioASQ, or other biomedical datasets to improve generalizability.

## 3. Fine-Tune Hyperparameters
### For Binary Classification:

* Adjust the learning rate, batch size, and dropout rate.
* Experiment with different optimizers, such as AdamW, and with learning rate schedules.

### For Token-Level Classification:

* Experiment with the sequence length (`max_length`) to capture more contextual information.
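
As a rough sketch of where these hyperparameters plug in, the example below wires a dropout rate, the AdamW optimizer, and a step-based learning rate schedule into a toy model. All values (`p=0.1`, `lr=2e-5`, `weight_decay=0.01`, `gamma=0.5`) are illustrative starting points rather than tuned settings, and a single dummy step stands in for each epoch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model; the dropout rate is one of the tunable hyperparameters
model = nn.Sequential(nn.Dropout(p=0.1), nn.Linear(32, 1))

# Illustrative values, not tuned for this dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)  # halve the LR each epoch

learning_rates = []
for epoch in range(3):
    # One dummy optimisation step stands in for a full epoch of training
    optimizer.zero_grad()
    loss = model(torch.rand(16, 32)).mean()
    loss.backward()
    optimizer.step()
    learning_rates.append(scheduler.get_last_lr()[0])
    scheduler.step()  # update the learning rate after each epoch

print(learning_rates)
```

The batch size is set separately, on the `DataLoader`, and `max_length` is set where the tokenizer is called.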

## 4. Incorporate Advanced Metrics
* Evaluate with precision, recall, and F1 score for a deeper understanding of model performance:

In [None]:
from sklearn.metrics import classification_report

# Example for binary classification
y_true = [1, 0, 1, 1, 0]  # Replace with ground truth labels
y_pred = [1, 0, 1, 0, 0]  # Replace with model predictions
print(classification_report(y_true, y_pred, target_names=["Class 0", "Class 1"]))


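## Code for Fine-Tuning
### Data Preparation and Tokenization
The template assumes a table with two columns: `text` (queries/abstracts) and `label` (0 or 1). The cells below adapt it to the `vasc_df` DataFrame, whose rank labels are mapped to the values 1 to 3.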

In [None]:
print(vasc_df.head())


In [None]:
# Define the actual column names
text_column = "Background References (Tile, Year)"  # Replace with the correct column
label_column = "Preveleance Rank (H=3, M=2, L=1)"  # Replace with the correct column

# Extract texts and labels
texts = vasc_df[text_column].astype(str).tolist()
label_mapping = {"H=3": 3, "M=2": 2, "L=1": 1}
labels = vasc_df[label_column].map(label_mapping).astype(float).tolist()

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

# Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoded = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        item = {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.float),
        }
        return item

# Create Dataset and DataLoader
dataset = TextDataset(texts, labels, tokenizer)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)


In [None]:
print(vasc_df.columns)


In [None]:
text_column = "Biomarkers"  # Update based on your column


In [None]:
# Define the mapping
label_column = "Preveleance Rank (H=3, M=2, L=1)"
label_mapping = {"H=3": 3, "M=2": 2, "L=1": 1}  # Map categorical ranks to numbers

# Apply mapping to the column
labels = vasc_df[label_column].map(label_mapping)

# Check for unmapped or missing values
if labels.isnull().any():
    print("Warning: Some labels were not mapped. Check the data for unexpected values.")
    print("Unmapped values:", vasc_df[label_column][labels.isnull()].unique())

# Convert labels to a list of floats (ready for PyTorch)
labels = labels.astype(float).tolist()


In [None]:
vasc_df = vasc_df[vasc_df[label_column].isin(label_mapping.keys())]


In [None]:
print("Sample labels:", labels[:10])


In [None]:
# Replace with the correct column name
print(vasc_df[["Development Drug Count", "Clinical Burden Rank (Trial Duration, N) *PI"]].head())


In [None]:
label_column = "Development Drug Count"
texts = vasc_df["Biomarkers"].astype(str).tolist()
labels = vasc_df[label_column].astype(float).tolist()


In [None]:
print(vasc_df[["Development Drug Count", "Clinical Burden Rank (Trial Duration, N) *PI"]].isnull().sum())


In [None]:
# Filter rows with non-missing values
filtered_df = vasc_df.dropna(subset=["Development Drug Count", "Clinical Burden Rank (Trial Duration, N) *PI"])

# Check filtered data
print(filtered_df[["Development Drug Count", "Clinical Burden Rank (Trial Duration, N) *PI"]].head())


In [None]:
# Define label column
label_column = "Development Drug Count"

# Extract texts and labels
texts = filtered_df["Biomarkers"].astype(str).tolist()
labels = filtered_df[label_column].astype(float).tolist()

# Verify
print("Sample texts:", texts[:5])
print("Sample labels:", labels[:5])


In [None]:
texts = vasc_df["Biomarkers"].astype(str).tolist()


In [None]:
from transformers import AutoModel, AutoTokenizer
import torch

# Load BioBERT
model_name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Generate embeddings for each row
embeddings = []
with torch.no_grad():  # Inference only, so skip gradient tracking
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128, padding="max_length")
        outputs = model(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token, shape [1, hidden_size]
        embeddings.append(cls_embedding.numpy())

# Use embeddings for classification, ranking, etc.


In [None]:
# Placeholder: use the first component of each [CLS] embedding as a stand-in score (replace with model-based logic)
vasc_df["Relevance Score"] = [float(e[0, 0]) for e in embeddings]  # Replace with predictions

# Save to Excel
vasc_df.to_excel("updated_vasc_df.xlsx", index=False)


In [None]:
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split

# Select input and target columns
text_column = "Biomarkers"
label_column = "Preveleance Rank (H=3, M=2, L=1)"

# Extract texts and labels
texts = vasc_df[text_column].astype(str).tolist()
label_mapping = {"H=3": 3, "M=2": 2, "L=1": 1}
labels = vasc_df[label_column].map(label_mapping).astype(float).tolist()

# Split into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Tokenize the data
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
def tokenize_texts(texts, tokenizer, max_length=128):
    return tokenizer(
        texts, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
    )

train_encodings = tokenize_texts(train_texts, tokenizer)
val_encodings = tokenize_texts(val_texts, tokenizer)


In [None]:
print("Texts:", texts[:5])
print("Labels:", labels[:5])


In [None]:
print(vasc_df[["Biomarkers", "Preveleance Rank (H=3, M=2, L=1)"]].head())
