<a href="https://colab.research.google.com/github/christophergaughan/Bioinformatics-Code/blob/main/GNN_Antibiotics_expanded.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install chembl_webresource_client



# Overview of the GNN Regression Approach for Antibiotic Prediction

In this study, we aim to use **Graph Neural Networks (GNNs)** to predict the **bioactivity** of chemical compounds as potential antibiotics. Instead of classifying compounds as "active" or "inactive," we will treat this as a **regression problem** where the GNN model will predict the **IC50** value directly. The **IC50** value is the concentration at which a compound inhibits a given biological process or target by 50%. It is a widely used measure of compound potency, with lower values indicating higher potency.

#### Why Use Regression Instead of Classification?
- **Granularity of Data**: Predicting the IC50 value directly allows us to capture more nuanced information about the compound's activity. This provides greater granularity than a simple binary label and can help identify promising candidates that might be highly effective.
- **Flexible Analysis**: By working with continuous targets, we retain all available bioactivity data, which can be useful for downstream analyses such as ranking compounds by potency or defining custom thresholds for activity.
- **Model Performance**: With more information retained in the dataset, the GNN model can better learn relationships between compound structure and potency, potentially improving the model's accuracy and utility.

#### Data Preparation Steps
1. **Data Retrieval from ChEMBL**: We will use the **ChEMBL Python client** to retrieve compound information, including **molecular structure (SMILES)**, **molecular properties** (e.g., molecular weight, LogP, HBA), and **bioactivity data (IC50)**. We will use compounds that have an **IC50 value** available to provide a continuous target for the regression model.

2. **Dataset Creation**: We will store the retrieved data in a **pandas DataFrame** and ensure each compound includes the necessary features (e.g., SMILES, molecular properties, IC50 value). The IC50 value will be used as the target for training the regression model.

3. **GNN Model Setup**: We will define a **Graph Neural Network (GNN)** that takes molecular graph representations as input and predicts a single continuous output (IC50 value). The GNN will consist of **graph convolutional layers** (using **GCNConv**) followed by a **fully connected layer** to produce the IC50 value.

4. **Training the Model**: We will train the model using a **regression loss function**, such as **Mean Squared Error (MSE)** or **Mean Absolute Error (MAE)**, to minimize the difference between the predicted and true IC50 values.

5. **Data Normalization**: Since IC50 values can vary significantly in magnitude, we may apply a **log transformation** (e.g., log(IC50)) to improve the model's learning process and numerical stability.
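
To make the log transformation in step 5 concrete, here is a minimal sketch using only the standard library. The helper names are illustrative, and the pIC50 conversion is a common medicinal-chemistry convention that this notebook does not itself use; it is shown only because it is a simple shift of the same log scale.

```python
import math

def ic50_nm_to_log10(ic50_nm: float) -> float:
    """Return log10 of an IC50 given in nM (the 'Log IC50 (nM)' scale)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive to take a logarithm")
    return math.log10(ic50_nm)

def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Return pIC50 = -log10(IC50 in mol/L), using 1 nM = 1e-9 mol/L."""
    return 9.0 - ic50_nm_to_log10(ic50_nm)

# Illustrative values spanning four orders of magnitude
for value in (10.0, 1000.0, 100000.0):
    print(value, ic50_nm_to_log10(value), ic50_nm_to_pic50(value))
```

On the log scale, a 10 nM and a 100,000 nM compound differ by 4 units instead of 99,990, which keeps a squared-error loss from being dominated by the weakest compounds.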

#### Expected Outcome
The GNN model will predict IC50 values for new compounds, allowing us to assess their potential as antibiotics based on potency. By predicting continuous values rather than simple classifications, we aim to identify promising compounds with greater precision, helping guide future experimental validation.


### Data Retrieval Criteria for the Study

To ensure the quality and relevance of the dataset used in this study, we applied specific filtering criteria when retrieving compound information from ChEMBL:

- **Small Molecule Compounds**: We focused on retrieving **small molecule compounds** by filtering out biotherapeutics (`biotherapeutic__isnull=True`). This ensures that we only include compounds suitable for small molecule drug discovery.
- **Phase 3 Clinical Trials or Higher**: We selected compounds that have reached at least **Phase 3 clinical trials** (`max_phase__gte=3`). Reaching Phase 3 indicates that the compound has demonstrated sufficient **safety and efficacy** in earlier stages of testing. Compounds at this stage have undergone rigorous evaluation in human trials and are more likely to be effective and safe for use.

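These criteria help us ensure that the dataset consists of well-characterized compounds. Note that neither filter restricts the query to antibacterial compounds or targets, so the retrieved IC50 values may come from unrelated assays; an antibiotic-specific dataset would need an additional target or indication filter.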



## Binning Compounds Based on Regression Analysis: A key assumption of our model and what it means

Once we have used the GNN model to predict **IC50 values** for the compounds, we can further categorize these compounds into **'active'** and **'inactive'** groups based on the predicted IC50 values. This binning process helps in defining a clear threshold for activity and allows us to identify compounds that are most promising for further development.

- **Threshold Definition**: We can define a specific IC50 threshold to determine whether a compound is considered **'active'** or **'inactive'**. For example, compounds with **IC50 <= 1000 nM (1 µM)** could be labeled as **'active'**, while those with **IC50 > 1000 nM** could be labeled as **'inactive'**. The implicit assumption is that molecules with **IC50 <= 1 µM** are likely to be effective, although those with higher IC50 values are not necessarily ineffective. We are aiming to identify antibiotic small molecules with **high potency**.
- **Post-Processing Step**: This binning can be performed as a post-processing step after regression, allowing us to create binary labels for downstream analysis or decision-making.
- **Advantages**: By first predicting IC50 values and then binning them, we retain the flexibility to experiment with different thresholds and refine the criteria for classifying compounds. This allows us to capture more nuanced insights about compound efficacy and adjust the analysis as needed.

This approach ensures that we are able to both quantify the potency of the compounds and categorize them effectively, providing valuable insights for future experimental validation and prioritization.
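
The post-processing step described above can be sketched in a few lines. The compound names and predicted values below are made up for illustration, and `label_activity` is a hypothetical helper rather than part of the notebook.

```python
def label_activity(ic50_nm: float, threshold_nm: float = 1000.0) -> str:
    """Label a compound 'active' when its IC50 is at or below the threshold."""
    return "active" if ic50_nm <= threshold_nm else "inactive"

# Hypothetical predicted IC50 values in nM
predicted_ic50_nm = {"cmpd_A": 35.0, "cmpd_B": 1000.0, "cmpd_C": 4200.0}

# Because the regression output is kept, the threshold can be changed freely
for threshold in (100.0, 1000.0, 10000.0):
    labels = {name: label_activity(value, threshold)
              for name, value in predicted_ic50_nm.items()}
    print(threshold, labels)
```

Keeping the threshold as a parameter is the point of binning after regression: the same predictions can be relabeled without retraining.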

This is not a perfect method of binning our results into active and inactive groups; however, it is *a* beginning. We have to start from somewhere.



In [None]:
# Import required libraries
from chembl_webresource_client.new_client import new_client
import pandas as pd
import requests
from time import sleep
import numpy as np

# Initialize the ChEMBL client
molecule = new_client.molecule
activity = new_client.activity

# Define a search for small molecules with activity data
# Limiting the query to the first 100 compounds for testing purposes
molecules = molecule.filter(biotherapeutic__isnull=True, max_phase__gte=3)[:100]

# Create an empty list to store results
data = []

# Fetch details for each molecule and bioactivity data
for idx, compound in enumerate(molecules):
    try:
        # Extract compound properties
        molecule_chembl_id = compound.get('molecule_chembl_id')
        # Use the preferred compound name (full_molformula is the molecular formula, not a name)
        molecule_name = compound.get('pref_name')

        smiles = None
        if compound.get('molecule_structures'):
            smiles = compound['molecule_structures'].get('canonical_smiles')

        molecular_weight = None
        logp = None
        hba = None
        hbd = None
        aromatic_rings = None
        if compound.get('molecule_properties'):
            molecular_weight = compound['molecule_properties'].get('mw_freebase')
            logp = compound['molecule_properties'].get('alogp')
            hba = compound['molecule_properties'].get('hba')
            hbd = compound['molecule_properties'].get('hbd')
            aromatic_rings = compound['molecule_properties'].get('aromatic_rings')

        # Fetch IC50 records for the compound (filtering by type on the server side)
        activities = activity.filter(molecule_chembl_id=molecule_chembl_id,
                                     standard_type='IC50')

        # Use the first IC50 reported in nM with a positive value,
        # so that the column unit holds and log10 is defined
        ic50_value = None
        for act in activities:
            if act.get('standard_units') == 'nM' and act.get('standard_value') is not None:
                value = float(act['standard_value'])
                if value > 0:
                    ic50_value = value
                    break

        # Append the data only if SMILES and IC50 value are available
        if smiles and ic50_value is not None:
            data.append({
                'Molecule ChEMBL ID': molecule_chembl_id,
                'Molecule Name': molecule_name,
                'SMILES': smiles,
                'Molecular Weight': molecular_weight,
                'LogP': logp,
                'HBA': hba,
                'HBD': hbd,
                'Aromatic Rings': aromatic_rings,
                'IC50 (nM)': ic50_value
            })

        # Print progress every 10 compounds
        if idx % 10 == 0:
            print(f"Processed {idx + 1} compounds so far...")

    except requests.exceptions.RequestException as e:
        print(f"API request failed for compound {molecule_chembl_id}: {e}; skipping after a 5-second pause...")
        sleep(5)  # Pause before moving on to the next compound (no retry is attempted)

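# Convert to a DataFrame
df = pd.DataFrame(data)

# ChEMBL may return numeric properties as strings; coerce them to numbers
numeric_cols = ['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Add Log IC50 (nM) column to the DataFrame (base-10 logarithm)
df['Log IC50 (nM)'] = np.log10(df['IC50 (nM)'])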

# Display the first few rows to verify
df.head()
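

The retrieval loop above pauses after a failed request but does not repeat it. If an actual retry is wanted, one possible pattern is a small wrapper like the sketch below. `with_retries` and `flaky_fetch` are hypothetical names for illustration; in the notebook the wrapped call would be the ChEMBL request for one compound.

```python
import time

def with_retries(fetch, max_attempts=3, delay_seconds=5.0, exceptions=(Exception,)):
    """Call fetch() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except exceptions as error:
            if attempt == max_attempts:
                raise  # Give up and surface the last error
            print(f"Attempt {attempt} failed ({error}); retrying in {delay_seconds} s...")
            time.sleep(delay_seconds)

# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"molecule_chembl_id": "EXAMPLE"}

result = with_retries(flaky_fetch, delay_seconds=0.0, exceptions=(ConnectionError,))
print(result, calls["count"])
```

Capping the number of attempts keeps a persistent outage from stalling the whole download.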

## Data Exploration:

Visualize the data to better understand distributions and relationships. For example, plot histograms of Molecular Weight, LogP, and Log IC50 (nM) to understand the variation in these features:

In [None]:
import matplotlib.pyplot as plt

# Plot histogram of Log IC50 values
plt.hist(df['Log IC50 (nM)'], bins=20, color='blue', edgecolor='black')
plt.xlabel('Log IC50 (nM)')
plt.ylabel('Frequency')
plt.title('Distribution of Log IC50 Values')
plt.show()




## Observations:
### Log IC50 Values:

Most of the Log IC50 values fall between 1 and 5, indicating that most compounds have IC50 values between 10 nM and 100,000 nM.
There are a few outliers with very high Log IC50 values (e.g., greater than 10). A Log IC50 above 10 corresponds to more than 10^10 nM (10 M), which is not a physically plausible concentration, so these entries most likely reflect unit or data-entry problems rather than real measurements.

### Data Skewness:

The distribution is right-skewed, with a concentration of compounds in the range of lower Log IC50 values, and a few outliers with very high IC50 values. This indicates that while many compounds have reasonable activity, a few may be quite ineffective.

## Next Steps:
### Handle Outliers:

Outliers with very high IC50 values may affect model training negatively. You could consider removing or capping these outliers to improve model robustness:

In [None]:
# Remove compounds with Log IC50 values above a threshold (e.g., 10)
df_filtered = df[df['Log IC50 (nM)'] <= 10]
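

The cell above removes the outliers. For comparison, here is a small sketch of the capping alternative mentioned in the text, using made-up Log IC50 values rather than the notebook's data.

```python
import numpy as np

# Illustrative Log IC50 values; the last one is an implausible outlier
log_ic50 = np.array([1.2, 2.5, 3.1, 4.8, 11.3])
threshold = 10.0

# Option 1: remove rows above the threshold (what the notebook does)
removed = log_ic50[log_ic50 <= threshold]

# Option 2: keep every row but cap the value at the threshold
capped = np.clip(log_ic50, a_min=None, a_max=threshold)

print(removed)
print(capped)
```

Removal discards the compound entirely, while capping keeps its structure in the training set with a clipped target; which is preferable depends on whether the extreme value is believed to be an error.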



## Balance the Dataset:

If the binary classification for active/inactive is highly imbalanced, consider balancing the dataset by undersampling inactive compounds or oversampling active ones:

In [None]:
# Create binary labels based on IC50 threshold
df['Active'] = df['IC50 (nM)'].apply(lambda x: 1 if x <= 1000 else 0)

# Check class distribution
print(df['Active'].value_counts())
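

As a sketch of the undersampling option mentioned above, the following uses a toy DataFrame (values are illustrative) and draws the same number of rows from each class. Unlike synthetic oversampling, every retained row is a real compound.

```python
import pandas as pd

# Toy data: three actives (IC50 <= 1000 nM) and four inactives
toy = pd.DataFrame({
    "IC50 (nM)": [5.0, 80.0, 900.0, 1500.0, 3000.0, 20000.0, 50000.0],
})
toy["Active"] = (toy["IC50 (nM)"] <= 1000).astype(int)

# Size of the smaller class
n_minority = toy["Active"].value_counts().min()

# Randomly keep n_minority rows from each class
balanced = toy.groupby("Active", group_keys=False).sample(n=n_minority, random_state=42)

print(balanced["Active"].value_counts())
```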


## Model Training Preparation:

Now that we have insights into the distribution, we can move forward with splitting the dataset into training and testing sets for model training.
We can also use SMILES to create molecular graphs for GNN input. Tools like RDKit can be used for this purpose.

## Visualize Relationships:

To further understand how features like LogP, Molecular Weight, etc., relate to IC50, you could use scatter plots:

In [None]:
import seaborn as sns

# Scatter plot of Molecular Weight vs Log IC50
sns.scatterplot(data=df, x='Molecular Weight', y='Log IC50 (nM)')
plt.title('Molecular Weight vs Log IC50 (nM)')
plt.show()


### Handling Outliers and Dataset Balancing

To proceed effectively with our dataset, we need to address outliers and any class imbalance issues.

1. **Handle Outliers**:
   - We can observe that there are some extreme values in **Log IC50** (e.g., values greater than **10**). These values can negatively affect the model's ability to learn relationships.
   - Therefore, we will filter out compounds with **Log IC50 values greater than 10** to remove these outliers and focus on the more relevant range.
   - Filtering code:
     ```python
     df_filtered = df[df['Log IC50 (nM)'] <= 10]  # Filtering out extreme values
     ```

2. **Balance the Dataset**:
   - From the class distribution, we observed that the dataset is slightly imbalanced, with **52 inactive compounds** and **43 active compounds**.
   - To balance the dataset, we will use the **SMOTE** (Synthetic Minority Over-sampling Technique) method to generate synthetic examples for the minority class.
   - SMOTE interpolates the numeric descriptor columns only, so the synthetic rows have no SMILES string and cannot be converted into molecular graphs later; the balancing therefore benefits models trained on the tabular features rather than the GNN itself. Because **Active** is derived from IC50, keeping `Log IC50 (nM)` among the features also makes the two classes trivially separable.
   - Balancing code:
     ```python
     !pip install imbalanced-learn
     from imblearn.over_sampling import SMOTE

     # Features and target extraction
     X = df_filtered[['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings', 'Log IC50 (nM)']]
     y = df_filtered['Active']

     # Oversample the minority class using SMOTE
     smote = SMOTE(sampling_strategy='auto')
     X_resampled, y_resampled = smote.fit_resample(X, y)

     # Create a balanced DataFrame
     df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='Active')], axis=1)
     ```

3. **Visualize Balanced Data**:
   - After balancing, we will visualize the new class distribution to confirm that we have an equal number of **active** and **inactive** compounds.
   - Visualization code:
     ```python
     print(df_balanced['Active'].value_counts())
     ```
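
To show what a SMOTE-generated row looks like, here is a simplified sketch of the core interpolation step only (the real method also performs a nearest-neighbour search, which is omitted). The descriptor values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class compounds described only by numeric descriptors
# (illustrative values for Molecular Weight, LogP, HBA)
x_i = np.array([300.0, 2.0, 4.0])
x_neighbor = np.array([340.0, 3.0, 6.0])

# SMOTE-style synthetic sample: a random point on the segment between them
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(x_synthetic)
```

The result is a vector of descriptor values with no corresponding molecule, which is why a synthetic row carries no SMILES string.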


In [None]:
df_filtered = df[df['Log IC50 (nM)'] <= 10]  # Filtering out extreme values


In [None]:
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Features and target extraction
X = df_filtered[['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings', 'Log IC50 (nM)']]
y = df_filtered['Active']

# Oversample the minority class using SMOTE
smote = SMOTE(sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X, y)

# Create a balanced DataFrame
df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='Active')], axis=1)


In [None]:
print(df_balanced['Active'].value_counts())


### Preparing for Graph Neural Network (GNN) Model Training

1. **Convert SMILES to Molecular Graphs**:
   - To train a GNN, we need to convert the **SMILES** representation of each compound into a molecular graph.
   - We can use **RDKit** to convert SMILES into molecular graphs that are compatible with frameworks like **PyTorch Geometric**.
   - Conversion code:
     ```python
     !pip install rdkit

     from rdkit import Chem

     # Convert SMILES to RDKit molecule objects
     df_balanced['Mol'] = df_balanced['SMILES'].apply(lambda x: Chem.MolFromSmiles(x))
     ```

2. **Model Training Preparation**:
   - With the dataset filtered, balanced, and converted into molecular graphs, we can proceed to split the dataset into **training** and **testing** sets and prepare the data for **GNN model training**.


In [None]:
!pip install rdkit


In [None]:
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Ensure SMILES is present in the DataFrame for later concatenation
if 'SMILES' in df_filtered.columns:
    X = df_filtered[['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings', 'Log IC50 (nM)']]
    y = df_filtered['Active']
    smiles = df_filtered['SMILES']  # Extract the SMILES column separately
else:
    raise KeyError("SMILES column not found in df_filtered DataFrame.")

# Oversample the minority class using SMOTE
smote = SMOTE(sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X, y)

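# Create a balanced DataFrame and add the SMILES back.
# SMOTE returns the original rows first and appends the synthetic ones, so this
# positional concat leaves SMILES as NaN for every synthetic row.
df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=X.columns),
                         pd.Series(y_resampled, name='Active'),
                         smiles.reset_index(drop=True)], axis=1)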


In [None]:
!pip install rdkit

from rdkit import Chem

# Convert SMILES to RDKit molecule objects
if 'SMILES' in df_balanced.columns:
    df_balanced['Mol'] = df_balanced['SMILES'].apply(lambda x: Chem.MolFromSmiles(x) if pd.notna(x) else None)
else:
    raise KeyError("SMILES column not found in df_balanced DataFrame.")


## Next Steps for GNN Model Training
Now that we have handled outliers, balanced the dataset, and parsed the SMILES strings into RDKit molecule objects, we are ready to proceed with Graph Neural Network (GNN) model training. Here’s what we need to do next:

1. Split the Dataset into Training and Testing Sets

Before training our GNN, we need to split the data into training and testing sets. This allows us to train the model and evaluate its performance on unseen data.
Use Scikit-learn to split the dataset:

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% training, 20% testing)
train_data, test_data = train_test_split(df_balanced, test_size=0.2, random_state=42)

print(f"Training set size: {len(train_data)}")
print(f"Testing set size: {len(test_data)}")
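

Once a trained model has produced predictions for the held-out test set, its performance can be summarized with the regression metrics mentioned in the overview. The sketch below uses made-up Log IC50 values; `regression_metrics` is a hypothetical helper.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = float(np.mean(errors ** 2))
    mae = float(np.mean(np.abs(errors)))
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "MAE": mae}

# Illustrative true and predicted Log IC50 values
metrics = regression_metrics([2.0, 3.0, 4.0], [2.5, 3.0, 3.0])
print(metrics)
```

Because the targets are on a log10 scale, an RMSE of 1.0 corresponds to predictions that are off by roughly a factor of ten in IC50.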


2. Create Graph Representations for GNN

We need to convert our molecular data into graph representations that are compatible with frameworks like PyTorch Geometric.
This involves extracting graph information such as nodes (atoms) and edges (bonds) from each molecule.

In [None]:
!pip install torch_geometric

import torch
from torch_geometric.data import Data
from rdkit.Chem import rdmolops

# Function to convert RDKit Mol object to PyTorch Geometric Data object
def mol_to_graph(mol):
    if mol is None:
        return None
    atom_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    edge_index = []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index.append([start, end])
        edge_index.append([end, start])  # Add both directions

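    # Molecules without bonds (e.g. single ions) need an explicit empty [2, 0] tensor
    if edge_index:
        edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    else:
        edge_index = torch.empty((2, 0), dtype=torch.long)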
    x = torch.tensor(atom_features, dtype=torch.float).view(-1, 1)

    return Data(x=x, edge_index=edge_index)

# Convert training and testing data
train_data['graph'] = train_data['Mol'].apply(mol_to_graph)
test_data['graph'] = test_data['Mol'].apply(mol_to_graph)
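

To illustrate the layout that the conversion function produces, here is the same construction written in plain Python for ethanol (SMILES `CCO`). The atom indices and bond list are written out by hand instead of being read from RDKit.

```python
# Ethanol (SMILES "CCO"): heavy atoms indexed 0 (C), 1 (C), 2 (O)
atomic_numbers = [6, 6, 8]
bonds = [(0, 1), (1, 2)]  # undirected bonds as (begin_atom, end_atom)

# Store each bond in both directions, as the conversion function does
pairs = []
for start, end in bonds:
    pairs.append((start, end))
    pairs.append((end, start))

# Transpose into the [2, num_edges] layout: row 0 = sources, row 1 = targets
edge_index = [[s for s, _ in pairs], [t for _, t in pairs]]

# One feature per node (the atomic number), shape [num_nodes, 1]
node_features = [[float(z)] for z in atomic_numbers]

print(edge_index)     # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(node_features)  # [[6.0], [6.0], [8.0]]
```

With the atomic number as the only node feature, the model cannot distinguish bond orders, charges, or aromaticity; richer atom and bond features are a natural extension.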



3. Prepare Data for GNN Training
* The training and testing data can now be prepared for use with PyTorch Geometric.
* We will create a list of graph objects to feed into our GNN model.

In [None]:
# Filter out None graphs (in case of failed conversions)
train_graphs = [graph for graph in train_data['graph'] if graph is not None]
test_graphs = [graph for graph in test_data['graph'] if graph is not None]

# Print the number of valid graphs
print(f"Number of training graphs: {len(train_graphs)}")
print(f"Number of testing graphs: {len(test_graphs)}")


4. Define the GNN Model

* Define the architecture for the GNN model using PyTorch and PyTorch Geometric. The model will take in molecular graphs and output a predicted IC50 value.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 16)
        self.conv2 = GCNConv(16, 32)
        self.fc1 = nn.Linear(32, 1)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = torch.mean(x, dim=0)  # Global mean pooling
        x = self.fc1(x)
        return x
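

For intuition about what each graph convolutional layer computes, here is a NumPy sketch of the standard GCN propagation rule (symmetric normalization with self-loops), without bias or activation. The weights are arbitrary placeholders, not trained values.

```python
import numpy as np

def gcn_layer(adjacency, features, weight):
    """One GCN propagation step: D^-1/2 (A + I) D^-1/2 X W."""
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # diagonal of D^-1/2
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return a_norm @ features @ weight

# Three-atom chain 0-1-2 with the atomic number as the single node feature
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
features = np.array([[6.0], [6.0], [8.0]])
weight = np.ones((1, 2))  # placeholder weights mapping 1 -> 2 channels

hidden = gcn_layer(adjacency, features, weight)
print(hidden.shape)  # (3, 2)
```

Each node's new representation is a degree-weighted average of its own and its neighbours' features, so stacking two layers lets information travel two bonds away.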


5. Train the GNN Model
* Before training, each graph needs its regression target, so `mol_to_graph` is redefined below to attach the Log IC50 value as `y`.
* After that, we set up the optimizer, loss function, and the training loop.


In [None]:
import torch
from torch_geometric.data import Data

# Function to convert RDKit Mol object to PyTorch Geometric Data object, with target y value
def mol_to_graph(mol, target):
    if mol is None:
        return None
    atom_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    edge_index = []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index.append([start, end])
        edge_index.append([end, start])  # Add both directions

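    # Molecules without bonds (e.g. single ions) need an explicit empty [2, 0] tensor
    if edge_index:
        edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    else:
        edge_index = torch.empty((2, 0), dtype=torch.long)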
    x = torch.tensor(atom_features, dtype=torch.float).view(-1, 1)

    # Add target y value to the Data object
    y = torch.tensor([target], dtype=torch.float)

    return Data(x=x, edge_index=edge_index, y=y)

# Convert training and testing data, adding target y value
train_data['graph'] = train_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)
test_data['graph'] = test_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)


## Updated Training Loop:


In [None]:
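from torch.optim import Adam

# Set up the device, model, loss, and optimizer once, before the loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GNNModel().to(device)
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.0001)  # Reduced learning rate

# Training loop (one graph per step)
model.train()
for epoch in range(100):
    total_loss = 0
    for data in train_data['graph']:
        if data is None:  # Skip any None values
            continue
        data = data.to(device)
        optimizer.zero_grad()  # Clear gradients left over from the previous step
        output = model(data)
        loss = criterion(output, data.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")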


It looks like the model is improving, but the loss reduction is still slow and might be plateauing again. The following suggestions can help improve convergence and model performance:

## Suggestions for Further Improvement:
### Increase Model Capacity:

Adding more layers or increasing the size of hidden units can help capture complex relationships in the data better.

In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)  # Increased number of channels
        self.conv2 = GCNConv(64, 128)
        self.conv3 = GCNConv(128, 128)  # Adding an additional GCN layer
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = F.relu(self.conv3(x, edge_index))
        x = torch.mean(x, dim=0)  # Global mean pooling
        x = self.fc1(x)
        return x


## Use a Learning Rate Scheduler:

A learning rate scheduler can help dynamically adjust the learning rate during training to avoid getting stuck in plateaus.

In [None]:
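from torch.optim.lr_scheduler import StepLR

# Multiply the learning rate by 0.7 every 10 epochs;
# call scheduler.step() once at the end of each epoch in the training loop
scheduler = StepLR(optimizer, step_size=10, gamma=0.7)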
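

To see what this schedule does to the learning rate over a 100-epoch run, the step decay can be written as a closed-form function. This sketch assumes the scheduler is stepped once per epoch and an initial rate of 1e-4, as in the training loop.

```python
def step_lr(initial_lr, epoch, step_size=10, gamma=0.7):
    """Learning rate after `epoch` scheduler steps under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

# Learning rate at a few points in a 100-epoch run
for epoch in (0, 9, 10, 25, 99):
    print(epoch, step_lr(1e-4, epoch))
```

By the last epoch the rate has been multiplied by 0.7 nine times, leaving about 4% of its starting value, so a larger `gamma` or `step_size` may be worth trying if progress stalls early.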


## Batch Normalization:

Adding batch normalization layers can help stabilize training by normalizing the outputs of the convolution layers.

In [None]:
from torch_geometric.nn import BatchNorm

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.batch_norm1 = BatchNorm(64)
        self.conv2 = GCNConv(64, 128)
        self.batch_norm2 = BatchNorm(128)
        self.conv3 = GCNConv(128, 128)
        self.batch_norm3 = BatchNorm(128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.batch_norm1(self.conv1(x, edge_index)))
        x = F.relu(self.batch_norm2(self.conv2(x, edge_index)))
        x = F.relu(self.batch_norm3(self.conv3(x, edge_index)))
        x = torch.mean(x, dim=0)
        x = self.fc1(x)
        return x


## Experiment with Different Pooling Strategies:

* Global mean pooling is one way to aggregate node features, but other pooling strategies like max pooling or sum pooling could work better for the data.

In [None]:
# Fragment only: inside forward(), this line replaces the mean-pooling step
x = torch.sum(x, dim=0)  # Global sum pooling


## Updated GNN Model with Sum Pooling:
* Make sure the pooling operation is done within the `forward()` method of the model, where `x` and `edge_index` are properly defined.

Here is an updated version of the GNN model that uses global sum pooling:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_add_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)  # Increased number of channels
        self.conv2 = GCNConv(64, 128)
        self.conv3 = GCNConv(128, 128)  # Adding an additional GCN layer
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = F.relu(self.conv3(x, edge_index))
        x = global_add_pool(x, data.batch)  # Global sum pooling
        x = self.fc1(x)
        return x


## Explanation:
* Global Pooling: The `global_add_pool` function from PyTorch Geometric takes care of the pooling operation by summing all the node features for each graph, so the pooling no longer has to be written by hand.
* Batch Handling: Using `data.batch` allows for batch processing, which helps in dealing with multiple graphs at once during training.
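
To make the role of the batch vector concrete, here is a NumPy sketch of per-graph pooling. `global_pool` is a hypothetical helper that mimics the idea behind the PyTorch Geometric pooling functions; it is not their implementation.

```python
import numpy as np

def global_pool(node_features, batch, how="sum"):
    """Aggregate node features per graph; batch[i] is the graph id of node i."""
    node_features = np.asarray(node_features, dtype=float)
    batch = np.asarray(batch)
    pooled = []
    for graph_id in range(int(batch.max()) + 1):
        rows = node_features[batch == graph_id]
        if how == "sum":
            pooled.append(rows.sum(axis=0))
        elif how == "mean":
            pooled.append(rows.mean(axis=0))
        elif how == "max":
            pooled.append(rows.max(axis=0))
        else:
            raise ValueError(f"unknown pooling: {how}")
    return np.stack(pooled)

# Three nodes: the first two belong to graph 0, the last to graph 1
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batch = np.array([0, 0, 1])

print(global_pool(x, batch, "sum"))   # [[4. 6.] [5. 6.]]
print(global_pool(x, batch, "mean"))  # [[2. 3.] [5. 6.]]
```

Sum pooling grows with molecule size while mean pooling does not, which is one reason the choice can matter for a size-dependent property.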
## Training the Updated Model:
Make sure to use a DataLoader that can handle batch processing:

In [None]:
from torch.optim import Adam
from torch_geometric.loader import DataLoader

train_loader = DataLoader(train_graphs, batch_size=16, shuffle=True)

# Updated training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GNNModel().to(device)
optimizer = Adam(model.parameters(), lr=0.0001)
criterion = nn.MSELoss()

model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch).view(-1)  # Flatten [batch_size, 1] to match batch.y
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
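

One detail worth checking in batched training is the shape of the prediction versus the target. A model ending in a linear layer returns shape `[batch_size, 1]`, while the stacked targets have shape `[batch_size]`. The NumPy sketch below shows how broadcasting turns that mismatch into a wrong loss value; PyTorch applies the same broadcasting rules to `MSELoss` and only emits a warning.

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like the model output
target = np.array([1.0, 2.0, 3.0])      # shape (3,), like batch.y

# Broadcasting silently expands (3, 1) - (3,) into a (3, 3) matrix
mse_broadcast = np.mean((pred - target) ** 2)

# Flattening the prediction first compares each value with its own target
mse_aligned = np.mean((pred.reshape(-1) - target) ** 2)

print(mse_broadcast, mse_aligned)
```

Here the predictions are perfect, yet the broadcast version reports a non-zero loss because every prediction is compared with every target.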


The `AttributeError: 'NoneType' object has no attribute 'size'` error indicates that `batch.y` is `None`. This happens when the target value was not assigned when creating the graph objects. Here, `train_graphs` was built from the first version of `mol_to_graph`, which did not attach a target, so the list (and the `DataLoader` that wraps it) has to be rebuilt after the targets are added.

## How to Fix This:
1. Check Target Assignment in Graph Objects:

   Ensure that each PyTorch Geometric `Data` object has a valid target value (`y`) when converting molecules into graph representations.

2. Modify the Graph Conversion Function:

   Update the `mol_to_graph` function to include the target value only if it is not missing and the molecule is valid.

## Updated Graph Conversion Code:

In [None]:
import torch
from torch_geometric.data import Data

# Function to convert RDKit Mol object to PyTorch Geometric Data object, with target y value
def mol_to_graph(mol, target):
    if mol is None or target is None or pd.isna(target):
        return None  # Return None if mol or target is missing (None or NaN)
    atom_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    edge_index = []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index.append([start, end])
        edge_index.append([end, start])  # Add both directions

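    # Molecules without bonds (e.g. single ions) need an explicit empty [2, 0] tensor
    if edge_index:
        edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    else:
        edge_index = torch.empty((2, 0), dtype=torch.long)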
    x = torch.tensor(atom_features, dtype=torch.float).view(-1, 1)

    # Add target y value to the Data object
    y = torch.tensor([target], dtype=torch.float)

    return Data(x=x, edge_index=edge_index, y=y)

# Convert training and testing data, adding target y value
train_data['graph'] = train_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)
test_data['graph'] = test_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)

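# Filter out None graphs (in case of failed conversions)
train_graphs = [graph for graph in train_data['graph'] if graph is not None]
test_graphs = [graph for graph in test_data['graph'] if graph is not None]

# Rebuild the loader so that it serves the graphs that now carry targets
train_loader = DataLoader(train_graphs, batch_size=16, shuffle=True)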


## Training Loop Update:
Make sure the training loop **skips any invalid data** to avoid issues with missing target values.



In [None]:
# Training loop
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        # Skip batches that might have invalid data
        if batch.y is None:
            continue

        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch).view(-1)  # Flatten [batch_size, 1] to match batch.y
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")


### Handling Outliers and Dataset Balancing

To proceed effectively with our dataset, we need to address outliers and any class imbalance issues.

1. **Handle Outliers**:
   - We can observe that there are some extreme values in **Log IC50** (e.g., values greater than **10**). These values can negatively affect the model's ability to learn relationships.
   - Therefore, we will filter out compounds with **Log IC50 values greater than 10** to remove these outliers and focus on the more relevant range.
   - Filtering code:
     ```python
     df_filtered = df[df['Log IC50 (nM)'] <= 10]  # Filtering out extreme values
     ```

2. **Balance the Dataset**:
   - From the class distribution, we observed that the dataset is slightly imbalanced, with **52 inactive compounds** and **43 active compounds**.
   - To balance the dataset, we will use the **SMOTE** (Synthetic Minority Over-sampling Technique) method to generate synthetic examples for the minority class.
   - SMOTE interpolates the numeric descriptor columns only, so the synthetic rows have no SMILES string and cannot be converted into molecular graphs later; the balancing therefore benefits models trained on the tabular features rather than the GNN itself. Because **Active** is derived from IC50, keeping `Log IC50 (nM)` among the features also makes the two classes trivially separable.
   - Balancing code:
     ```python
     !pip install imbalanced-learn
     from imblearn.over_sampling import SMOTE

     # Features and target extraction
     if 'SMILES' in df_filtered.columns:
         X = df_filtered[['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings', 'Log IC50 (nM)']]
         y = df_filtered['Active']
     else:
         raise KeyError("SMILES column not found in df_filtered DataFrame.")

     # Oversample the minority class using SMOTE
     smote = SMOTE(sampling_strategy='auto')
     X_resampled, y_resampled = smote.fit_resample(X, y)

     # Create a balanced DataFrame
     df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='Active'), df_filtered['SMILES'].reset_index(drop=True)], axis=1)
     ```

3. **Visualize Balanced Data**:
   - After balancing, we will visualize the new class distribution to confirm that we have an equal number of **active** and **inactive** compounds.
   - Visualization code:
     ```python
     print(df_balanced['Active'].value_counts())
     ```


In [None]:
df_filtered = df[df['Log IC50 (nM)'] <= 10]  # Filtering out extreme values


In [None]:
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Features and target extraction
if 'SMILES' in df_filtered.columns:
    X = df_filtered[['Molecular Weight', 'LogP', 'HBA', 'HBD', 'Aromatic Rings', 'Log IC50 (nM)']]
    y = df_filtered['Active']
else:
    raise KeyError("SMILES column not found in df_filtered DataFrame.")

# Oversample the minority class using SMOTE
smote = SMOTE(sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X, y)

# Create a balanced DataFrame
df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.Series(y_resampled, name='Active'), df_filtered['SMILES'].reset_index(drop=True)], axis=1)

In [None]:
print(df_balanced['Active'].value_counts())



### Preparing for Graph Neural Network (GNN) Model Training

1. **Convert SMILES to Molecular Graphs**:
   - To train a GNN, we need to convert the **SMILES** representation of each compound into a molecular graph.
   - We can use **RDKit** to convert SMILES into molecular graphs that are compatible with frameworks like **PyTorch Geometric**.
   - Conversion code:
     ```python
     !pip install rdkit

     from rdkit import Chem

     # Convert SMILES to RDKit molecule objects
     if 'SMILES' in df_balanced.columns:
         df_balanced['Mol'] = df_balanced['SMILES'].apply(lambda x: Chem.MolFromSmiles(x) if pd.notna(x) else None)
     else:
         raise KeyError("SMILES column not found in df_balanced DataFrame.")
     ```

2. **Model Training Preparation**:
   - With the dataset filtered, balanced, and converted into molecular graphs, we can proceed to split the dataset into **training** and **testing** sets and prepare the data for **GNN model training**.
   - Splitting code:
     ```python
     from sklearn.model_selection import train_test_split

     # Split data into training and testing sets (80% training, 20% testing)
     train_data, test_data = train_test_split(df_balanced, test_size=0.2, random_state=42)
     ```

3. **Create Graph Representations for GNN**:
   - Convert the molecules in the training and testing sets into graph representations using **PyTorch Geometric**.
   - Graph conversion code:
     ```python
     import torch
     from torch_geometric.data import Data
     from rdkit.Chem import rdmolops

     # Function to convert RDKit Mol object to PyTorch Geometric Data object
     def mol_to_graph(mol, target):
         if mol is None or target is None:
             return None  # Return None if mol or target is invalid
         atom_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
         edge_index = []
         for bond in mol.GetBonds():
             start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
             edge_index.append([start, end])
             edge_index.append([end, start])  # Add both directions

         edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
         x = torch.tensor(atom_features, dtype=torch.float).view(-1, 1)

         # Add target y value to the Data object
         y = torch.tensor([target], dtype=torch.float)

         return Data(x=x, edge_index=edge_index, y=y)

     # Convert training and testing data, adding target y value
     train_data['graph'] = train_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)
     test_data['graph'] = test_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)

     # Filter out None graphs (in case of failed conversions)
     train_graphs = [graph for graph in train_data['graph'] if graph is not None]
     test_graphs = [graph for graph in test_data['graph'] if graph is not None]
     ```


In [None]:
!pip install rdkit

from rdkit import Chem

# Convert SMILES to RDKit molecule objects
if 'SMILES' in df_balanced.columns:
    df_balanced['Mol'] = df_balanced['SMILES'].apply(lambda x: Chem.MolFromSmiles(x) if pd.notna(x) else None)
else:
    raise KeyError("SMILES column not found in df_balanced DataFrame.")

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% training, 20% testing)
train_data, test_data = train_test_split(df_balanced, test_size=0.2, random_state=42)

In [None]:
import torch
from torch_geometric.data import Data
from rdkit.Chem import rdmolops

# Function to convert RDKit Mol object to PyTorch Geometric Data object
def mol_to_graph(mol, target):
    if mol is None or target is None:
        return None  # Return None if mol or target is invalid
    atom_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
    edge_index = []
    for bond in mol.GetBonds():
        start, end = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index.append([start, end])
        edge_index.append([end, start])  # Add both directions

    edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    x = torch.tensor(atom_features, dtype=torch.float).view(-1, 1)

    # Add target y value to the Data object
    y = torch.tensor([target], dtype=torch.float)

    return Data(x=x, edge_index=edge_index, y=y)

# Convert training and testing data, adding target y value
train_data['graph'] = train_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)
test_data['graph'] = test_data.apply(lambda row: mol_to_graph(row['Mol'], row['Log IC50 (nM)']), axis=1)

# Filter out None graphs (in case of failed conversions)
train_graphs = [graph for graph in train_data['graph'] if graph is not None]
test_graphs = [graph for graph in test_data['graph'] if graph is not None]
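
To make the edge-list layout concrete, here is a dependency-free sketch of what `mol_to_graph` builds for ethanol (SMILES `CCO`). The bond list is written out by hand instead of being read from RDKit, plain lists stand in for tensors, and the helper name `bonds_to_edge_index` is hypothetical.

```python
def bonds_to_edge_index(bonds):
    """Turn undirected bonds [(i, j), ...] into a 2 x (2 * n_bonds) COO edge list."""
    sources, targets = [], []
    for start, end in bonds:
        # Each bond is stored twice so messages can flow in both directions.
        sources += [start, end]
        targets += [end, start]
    return [sources, targets]


# Ethanol heavy atoms: C(0)-C(1)-O(2); the atomic number is the only node feature.
atomic_numbers = [6, 6, 8]
bonds = [(0, 1), (1, 2)]

x = [[float(z)] for z in atomic_numbers]   # shape [num_atoms, 1]
edge_index = bonds_to_edge_index(bonds)    # shape [2, 2 * num_bonds]

print(x)           # [[6.0], [6.0], [8.0]]
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

A molecule with no bonds (a single atom or ion) gives an empty edge list. In the tensor version that case is worth handling explicitly, because `torch.tensor([]).t()` has shape `[0]` rather than the `[2, 0]` that PyTorch Geometric expects.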


# Create a DataLoader for Training
We will use a DataLoader to handle the batching for training the GNN.

In [None]:
from torch_geometric.loader import DataLoader

# Create DataLoader objects for training and testing datasets
train_loader = DataLoader(train_graphs, batch_size=16, shuffle=True)
test_loader = DataLoader(test_graphs, batch_size=16, shuffle=False)
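
For intuition about what the loader hands to the model, the sketch below mimics how a PyTorch Geometric mini-batch is assembled: the graphs are concatenated into one large disconnected graph, edge indices are shifted by the running node count, and a `batch` vector records which graph each node belongs to. It uses plain dictionaries and lists; `collate_graphs` is a hypothetical name and the target values are made up for illustration.

```python
def collate_graphs(graphs):
    """Merge graphs into one disconnected graph, as a PyG-style mini-batch does.

    Each graph is a dict with 'x' (node features), 'edge_index'
    ([sources, targets]) and 'y' (a single target value).
    """
    x, sources, targets, batch, y = [], [], [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        num_nodes = len(g["x"])
        x.extend(g["x"])
        # Shift node indices so they point into the concatenated node list.
        sources.extend(s + offset for s in g["edge_index"][0])
        targets.extend(t + offset for t in g["edge_index"][1])
        # batch[i] records which graph node i came from.
        batch.extend([graph_id] * num_nodes)
        y.append(g["y"])
        offset += num_nodes
    return {"x": x, "edge_index": [sources, targets], "batch": batch, "y": y}


# Illustrative graphs only; the y values are invented.
ethanol = {"x": [[6.0], [6.0], [8.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "y": 2.5}
methanol = {"x": [[6.0], [8.0]], "edge_index": [[0, 1], [1, 0]], "y": 3.1}

merged = collate_graphs([ethanol, methanol])
print(merged["batch"])       # [0, 0, 0, 1, 1]
print(merged["edge_index"])  # [[0, 1, 1, 2, 3, 4], [1, 0, 2, 1, 4, 3]]
print(merged["y"])           # [2.5, 3.1]
```

The length of `y` equals the number of graphs, not the number of nodes, which is why the model has to reduce node features to one value per graph before computing the loss.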


## Define and Train the GNN Model
Now, let's define a simple Graph Neural Network using PyTorch Geometric and train it.

In [None]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch import nn, optim

# Define the GNN model
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)  # Increased number of channels
        self.conv2 = GCNConv(64, 128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = torch.mean(x, dim=0)  # Mean over every node in the batch (one vector for all graphs, not one per graph)
        x = self.fc1(x)
        return x

# Initialize model, loss function, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GNNModel().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

# Training loop
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")


The warning raised during training comes from a shape mismatch between the model output and the target tensor (`y`): the model returns a tensor of size `[1]`, while the target has size `[batch_size]`. The cause is `torch.mean(x, dim=0)`, which averages over every node in the batch and so collapses all graphs into one prediction.

To fix this, the model output must match the size of the target tensor (`y`), with one prediction per graph:

## Update the GNN Model to Output the Correct Dimension
Global Pooling Per Graph: Use a batch-aware pooling function such as `global_mean_pool`, which averages node features separately for each graph, and flatten the result so the output has shape `[batch_size]`.
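
The difference between the two pooling strategies can be seen without any tensors. The sketch below is a plain-Python stand-in, assuming a batch of two graphs with three and two nodes; `mean_over_all_nodes` mirrors `torch.mean(x, dim=0)` and `mean_pool_per_graph` mirrors what `global_mean_pool(x, batch)` computes. Both function names are hypothetical.

```python
def mean_over_all_nodes(x):
    """Like torch.mean(x, dim=0): a single vector for the whole batch."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]


def mean_pool_per_graph(x, batch):
    """Like global_mean_pool(x, batch): one vector per graph."""
    num_graphs = max(batch) + 1
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for row, g in zip(x, batch):
        counts[g] += 1
        for k, value in enumerate(row):
            sums[g][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


# Five nodes with two features each; the first three belong to graph 0.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [10.0, 20.0], [30.0, 40.0]]
batch = [0, 0, 0, 1, 1]

whole_batch = mean_over_all_nodes(x)
per_graph = mean_pool_per_graph(x, batch)
print(len(per_graph))  # 2 rows, one per graph
print(per_graph)       # [[3.0, 4.0], [20.0, 30.0]]
```

Only the per-graph version yields as many rows as there are targets in `batch.y`.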


In [None]:
from torch_geometric.nn import global_mean_pool

# Updated GNN model
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.conv2 = GCNConv(64, 128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)  # Global mean pooling per graph in the batch
        x = self.fc1(x)
        return x.view(-1)  # Output size should match batch size


# Updated Training Loop
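The training loop itself is unchanged. Note that redefining the class does not modify the existing `model` object: the new `forward` only takes effect once `model` is re-created with `GNNModel().to(device)` and the optimizer is rebuilt on the new parameters.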

In [None]:
# Training loop
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")


The training loss is reducing somewhat, but the values are fluctuating quite a bit, which may indicate that the model is not converging well. Here are a few suggestions for further improving your model training:

## 1. Learning Rate Tuning

The learning rate might be too high, which can lead to oscillations. Consider reducing the learning rate to stabilize the training process. You can try values like 0.0001 or use a learning rate scheduler.

In [None]:
# Use a lower learning rate
optimizer = optim.Adam(model.parameters(), lr=0.0001)


## 2. Learning Rate Scheduler
Implement a learning rate scheduler to reduce the learning rate gradually if the loss plateaus.

In [None]:
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Add a learning rate scheduler (the `verbose` argument is deprecated in recent PyTorch releases, so it is omitted)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

# Update the training loop to include the scheduler
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
    scheduler.step(total_loss)
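
To see when the scheduler actually lowers the learning rate, here is a simplified, dependency-free model of the plateau rule with the same `factor=0.5` and `patience=5`. It is a sketch under assumptions: the real `ReduceLROnPlateau` also applies a relative improvement threshold, an optional cooldown and a minimum learning rate, all of which are ignored here. The function name and the loss values are illustrative.

```python
def simulate_plateau_schedule(losses, lr=1e-3, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch's scheduler step."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # The rate drops only once the number of bad epochs exceeds patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up losses: improvement for two epochs, then a plateau.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
history = simulate_plateau_schedule(losses)
print(history[-2], history[-1])
```

With `patience=5`, the reduction happens on the sixth consecutive epoch without improvement, so a short run of noisy epochs does not change the learning rate.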


## 3. Batch Normalization
Add batch normalization layers after each graph convolutional layer to help stabilize training and potentially improve convergence.

In [None]:
from torch_geometric.nn import BatchNorm

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = global_mean_pool(x, batch)  # Global mean pooling per graph in the batch
        x = self.fc1(x)
        return x.view(-1)  # Output size should match batch size


## 4. Gradient Clipping
Use gradient clipping to prevent exploding gradients, which can help stabilize training if there are sudden spikes in the loss.

In [None]:
# Training loop with gradient clipping
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch.y)
        loss.backward()

        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")
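
The arithmetic behind norm-based clipping is short enough to sketch in plain Python. The example treats all gradients as one flat list and rescales them when their combined L2 norm exceeds `max_norm`, which is the idea `clip_grad_norm_` applies across all parameters. The function name `clip_by_global_norm` and the small `eps` guard are assumptions made for this illustration.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: gradients already inside the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0
```

The direction of the gradient is preserved; only its length changes, which is why clipping tames loss spikes without redirecting the update.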


The loss values seem to be fluctuating quite a bit, and the model does not appear to be consistently reducing the loss or converging to a satisfactory level. Here are some further suggestions to improve training stability and reduce fluctuations:

## 1. Try a Lower Learning Rate
While you've already adjusted the learning rate, consider lowering it even further. A smaller learning rate (e.g., 0.00005) might stabilize the training further.



In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.00005)


## 2. Use Weight Decay (L2 Regularization)
Adding weight decay to the optimizer can help prevent overfitting by penalizing large weights, which may lead to smoother convergence.

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.00005, weight_decay=1e-5)


## 3. Increase the Batch Size
A larger batch size can help reduce the variance in the gradient estimates, leading to more stable training. Try increasing the batch size to 32 or 64 if the Colab GPU (an A100 in this case) has enough memory.

In [None]:
train_loader = DataLoader(train_graphs, batch_size=32, shuffle=True)
test_loader = DataLoader(test_graphs, batch_size=32, shuffle=False)


## 4. Use Dropout Regularization
To avoid overfitting, add dropout layers after the graph convolutional layers. Dropout helps in regularizing the network and makes it less likely to memorize the training data.



In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)
        x = global_mean_pool(x, batch)  # Global mean pooling per graph in the batch
        x = self.fc1(x)
        return x.view(-1)  # Output size should match batch size


## 5. Check Gradient Clipping
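If your gradients are occasionally exploding, gradient clipping helps by limiting the total gradient norm during training, so keep using it. The call only has an effect inside the training loop, between `loss.backward()` and `optimizer.step()`.

In [None]:
# Call this inside the training loop, after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)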


## 6. Switch to a Different Optimizer
If Adam is not working well, you can try RMSprop or SGD with a momentum term. These optimizers might help achieve more stable training in certain scenarios.

In [None]:
# Switch to RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.00005)


## 7. Implement Early Stopping
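If the validation loss rises for several consecutive epochs, the model is likely overfitting. Early stopping halts training once no improvement has been seen for a set number of epochs. The loop below tracks the summed training loss because no validation split has been created yet; monitoring a held-out validation loss is the usual choice.

In [None]:
# Early stopping on the training loss (no validation split exists yet)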
best_loss = float('inf')
patience = 10
trigger_times = 0

for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        output = model(batch)
        loss = criterion(output, batch.y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

    # Early stopping check
    if total_loss < best_loss:
        best_loss = total_loss
        trigger_times = 0
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print("Early stopping triggered")
            break
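
A version that watches a validation loss could look like the sketch below. It is framework-free and rests on assumptions: the `EarlyStopper` class is hypothetical, and the validation losses are made-up numbers standing in for values you would compute each epoch on a held-out split.

```python
class EarlyStopper:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses: improvement, then a steady rise.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)  # 6 0.7
```

In practice the model weights from the best epoch are usually saved as well, so that training can be rolled back to them when the stop triggers.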


The loss is now decreasing more consistently, which suggests that the adjustments are working and the model is converging more effectively. Early stopping was also triggered, which saves training time; because it monitors the training loss here, it does not by itself guard against overfitting.



## Evaluate the Model:

* Now that the model is trained, it's time to evaluate its performance on the test dataset.
* Calculate metrics such as Mean Squared Error **(MSE)**, Mean Absolute Error **(MAE)**, and **R²** score to understand how well the model generalizes to unseen data.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)
        true_values.extend(batch.y.cpu().numpy())
        predicted_values.extend(output.cpu().numpy())

# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)
r2 = r2_score(true_values, predicted_values)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² Score: {r2:.4f}")
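
For reference, the three metrics reduce to a few lines of arithmetic. The sketch below computes them in plain Python on a toy example; the function name `regression_metrics` is hypothetical and the numbers are not results from this model. It also raises an error when the two lists differ in length, which is the condition the scikit-learn functions check as well.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for two equally long lists of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError(f"length mismatch: {len(y_true)} true vs {len(y_pred)} predicted")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Toy values for illustration only.
mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(round(mse, 4), round(mae, 4), r2)  # 0.3333 0.3333 0.5
```

Because the targets are log-transformed IC50 values, MSE and MAE are expressed in log units, and an R² of 0 corresponds to always predicting the mean of the test targets.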


The error indicates that the true_values and predicted_values lists have different lengths, which prevents you from calculating the evaluation metrics.

This issue might be because the model's output size does not match the batch size during evaluation, which can happen if there is only one value being returned instead of a full batch of predictions.

Here's how you can fix it:

## Debugging the Mismatch Issue
1. Check Batch Size in the Model Output: The model might be outputting only one value for the entire batch instead of separate predictions for each sample. Update the model to ensure that it outputs the correct number of values.

Make sure that the model outputs a prediction for each graph in the batch:



In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)
        x = global_mean_pool(x, batch)  # Global mean pooling per graph in the batch
        x = self.fc1(x)
        return x  # Output shape should be [batch_size, 1]
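
One caveat with returning `[batch_size, 1]`: if that tensor is passed to `nn.MSELoss` together with a `[batch_size]` target, PyTorch broadcasts the pair to `[batch_size, batch_size]` (and emits a warning), so every prediction is compared with every target. The plain-Python sketch below illustrates the effect on the loss value; both helper names are hypothetical, and this is an aside rather than something the notebook reports observing.

```python
def elementwise_mse(pred, target):
    """MSE when prediction i is compared only with target i."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def broadcast_mse(pred_column, target):
    """MSE after broadcasting an [N, 1] prediction against an [N] target."""
    n = len(target)
    # Broadcasting pairs every prediction with every target (N * N terms).
    return sum((p[0] - t) ** 2 for p in pred_column for t in target) / (n * n)


target = [1.0, 2.0, 3.0]
perfect = [1.0, 2.0, 3.0]                # predictions equal to the targets
perfect_column = [[p] for p in perfect]  # same values with shape [N, 1]

matched_loss = elementwise_mse(perfect, target)
broadcast_loss = broadcast_mse(perfect_column, target)
print(matched_loss)              # 0.0
print(round(broadcast_loss, 4))  # 1.3333
```

Even perfect predictions produce a non-zero loss under broadcasting, so flattening the output with `.view(-1)` before the loss keeps the comparison element-wise.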


Update Evaluation Code to Flatten Predictions: During evaluation, ensure that the output and true values are reshaped properly to match the expected dimensions.

In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)
        true_values.extend(batch.y.cpu().numpy().flatten())
        predicted_values.extend(output.cpu().numpy().flatten())

# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)
r2 = r2_score(true_values, predicted_values)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² Score: {r2:.4f}")


The error persists because there is still a mismatch between the number of true values and the predicted values. It seems like the model is outputting one value per batch instead of individual values for each graph in the batch. Here are some updates to ensure everything matches correctly:

## Debugging Steps
1. Ensure Correct Global Pooling: Double-check that the pooling operation pools each graph within the batch separately, rather than reducing the entire batch to a single output.

2. Modify Output to Handle Batched Graphs Properly: `global_mean_pool(x, batch)` already returns one row per graph when it receives the `batch` vector from the loader, so confirm that the model's output contains one prediction for each graph in the batch.

## Model and Evaluation Adjustments:
Adjusted Model Code

In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)
        x = global_mean_pool(x, batch)  # Pool each graph in the batch individually
        x = self.fc1(x)
        return x.view(-1)  # Ensure output has shape [batch_size]


## Updated Evaluation Code
We need to ensure that both true_values and predicted_values are aligned properly. Make sure the model predicts an output for each graph, and that you collect the true and predicted values in a way that matches.

In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)  # Output should have shape [batch_size]
        true_values.extend(batch.y.cpu().numpy().flatten())  # True values for all graphs in batch
        predicted_values.extend(output.cpu().numpy().flatten())  # Predicted values for all graphs in batch

# Ensure consistent lengths of true and predicted values
assert len(true_values) == len(predicted_values), "Mismatch between true and predicted values length."

# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)
r2 = r2_score(true_values, predicted_values)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R² Score: {r2:.4f}")


The `AssertionError` indicates that there is still a mismatch in the number of true values and predicted values. This might be due to:

* Empty or Missing Predictions: Some batches might not have valid predictions, which could be caused by failed conversions or filtering during data preparation.
* Batches with Uneven Sizes: There may be an issue where some batches are smaller or dropped during training, leading to inconsistencies.

## Steps to Debug and Fix
1. Add Debugging to Check Batch Sizes

Add a print statement in the evaluation loop to verify the size of `batch.y` and `output`.

In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)

        # Debugging: Print sizes of batch.y and output
        print(f"Batch size (true values): {batch.y.shape[0]}, Output size: {output.shape[0]}")

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match.")


All of the batches are being skipped due to a size mismatch, which leaves no valid data for evaluation (`true_values` and `predicted_values` are both empty).

## Root Cause:
The model is outputting a single value instead of a value for each graph in the batch. A `[1]` output is what the original `forward` produces, because `torch.mean(x, dim=0)` reduces the entire batch to one vector. `global_mean_pool(x, batch)` returns one row per graph when `batch` comes from the PyG `DataLoader`, so a likely explanation is that `model` is still the instance created from the first class definition: redefining `GNNModel` in a later cell does not change an object that already exists.
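
One notebook-specific cause worth ruling out is a stale model object. The sketch below reproduces the effect in plain Python with a toy `Model` class (hypothetical, no PyTorch involved): an instance keeps the methods of the class it was created from, even after a later cell defines a class with the same name.

```python
class Model:
    """First definition: one output for the whole batch."""

    def forward(self, num_graphs):
        return [0.0]


model = Model()  # instance created from the first definition


class Model:  # a later "cell" redefines the class under the same name
    """Second definition: one output per graph."""

    def forward(self, num_graphs):
        return [0.0] * num_graphs


# The existing instance still runs the old forward().
stale_len = len(model.forward(19))
is_current = type(model) is Model

# Re-creating the instance picks up the new definition.
model = Model()
fresh_len = len(model.forward(19))

print(stale_len, is_current, fresh_len)  # 1 False 19
```

A quick check in the notebook is `type(model) is GNNModel`; if it is `False`, `model` predates the latest class definition and needs to be re-created and retrained.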

## Solution:
1. Ensure Correct Batch-wise Output:
* The pooling operation (`global_mean_pool`) should be applied to each graph in the batch individually, and the final linear layer should output a prediction for each graph.
2. Revisit the Model's Forward Pass:
* Ensure that the model outputs a tensor of shape `[batch_size, 1]` rather than `[1]`. The issue is likely due to pooling over the entire batch rather than over each graph.

## Model Update:
Here's an updated version of the model to ensure that each graph gets an individual output:

In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)
        x = global_mean_pool(x, batch)  # Apply global mean pooling for each graph in the batch
        x = self.fc1(x)
        return x.view(-1)  # Make sure the output has the correct shape [batch_size]


## Key Changes:
* Global Mean Pooling: Make sure `global_mean_pool` is used correctly to create a feature vector for each graph in the batch, not the entire batch.
* Return Output with Correct Shape: Use `.view(-1)` to ensure that the output tensor has a shape of `[batch_size]`.

## Re-run the Evaluation:
After updating the model, try the evaluation code again:

In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


If you are still getting "Skipping batch due to size mismatch," every batch is producing a size mismatch between `batch.y` and `output`. This suggests that the model's output size does not match what is expected, potentially due to improper handling of the batch during forward propagation or a misunderstanding of the output shape after pooling.

To address this, let’s take a step-by-step approach to debug and fix it:

## Step 1: Check Data Batching and Output Sizes
* Verify the Size of Batch Components: Print the sizes of the input features and labels (`data.x`, `data.y`, and `data.batch`) at different stages of the forward pass.
* Ensure Proper Pooling: The pooling function (`global_mean_pool`) should produce a feature vector for each graph in the batch.

## Step 2: Print Intermediate Tensor Sizes in the Model
Add print statements inside the model to understand the shapes of the intermediate tensors:



In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        print(f"Input x shape: {x.shape}")  # Print input feature size
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        print(f"Shape after conv1 and dropout: {x.shape}")
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        print(f"Shape after conv2 and dropout: {x.shape}")
        x = global_mean_pool(x, batch)  # Global mean pooling per graph in the batch

        print(f"Shape after global mean pool: {x.shape}")
        x = self.fc1(x)

        print(f"Output shape before reshape: {x.shape}")
        return x.view(-1)  # Make sure the output has the correct shape [batch_size]


## Step 3: Debug the Data Loader and Model Output
When running the evaluation loop, add similar print statements to see if the output matches the expected shape:

In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)

        # Print the shapes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")
        print(f"Predicted values shape (output): {output.shape}")

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


The print statements show that the model is outputting a single value (`torch.Size([1])`) for the entire batch, rather than a value for each graph in the batch (`torch.Size([19])`). With a `batch` vector covering 19 graphs, `global_mean_pool` would return 19 rows, so the object being evaluated is not running the updated `forward`. In the cells above, `model` was instantiated once, before the class was redefined, and it still carries the original `torch.mean(x, dim=0)` pooling.

## Fixing the Pooling and Output Issue
We need to ensure that each graph in the batch gets its own output, which means the model under evaluation must actually use `global_mean_pool` and must be re-created (and retrained) after the class definition changes.

Here’s how we can fix the model and make it output predictions for each graph in the batch:

## Corrected Model
1. **Apply Proper Pooling for Batched Graphs**: The pooling function should create a separate feature vector for each graph in the batch.
2. **Ensure Output Matches Batch Size**: Modify the model so that the output dimension matches the number of graphs in each batch.

In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        # Pool the graph-level embeddings
        x = global_mean_pool(x, batch)  # Pool each graph in the batch

        # Final fully connected layer to output the predicted value for each graph
        x = self.fc1(x)

        return x.view(-1)  # Ensure output shape is [batch_size]


In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)

        # Debugging: Print sizes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")
        print(f"Predicted values shape (output): {output.shape}")

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


The issue persists, meaning that the model is still outputting a single value instead of multiple values for each graph in the batch. This suggests that the pooling or linear layer is incorrectly aggregating the entire batch. Let’s address this step-by-step:

## Step-by-Step Debugging and Fixing:
1. **Understand the Data Flow**:
   - The input to the model (`data`) represents multiple graphs.
   - After applying the convolution layers, the pooling operation (`global_mean_pool`) should aggregate node features per graph in the batch to create a feature vector for each graph.

2. **Batch Processing and Proper Pooling**:
   - `global_mean_pool(x, batch)` should aggregate nodes that belong to the same graph into a single feature vector for each graph, resulting in a tensor of size `[batch_size, hidden_dim]`.
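
One thing worth ruling out when a redefined class still appears to behave like the old one: in Python, re-running a cell that redefines a class does not change objects that were already created from the earlier definition. The cells above redefine `GNNModel` but do not show `model` being re-created, so the evaluation loop may still be calling the previous `forward`. Whether that is what happened here is an assumption; the sketch below uses a made-up `TinyModel` class purely to show the effect.

```python
class TinyModel:
    def forward(self, values):
        # Old behaviour: collapse the whole batch into a single number.
        return [sum(values) / len(values)]

model = TinyModel()  # instance created from the first definition

class TinyModel:  # the class is redefined, as when a notebook cell is re-run
    def forward(self, values):
        # New behaviour: one output per input.
        return [v * 2 for v in values]

batch_values = [1.0, 2.0, 3.0]

stale_output = model.forward(batch_values)  # still bound to the old class
model = TinyModel()                         # re-instantiate to pick up the new code
fresh_output = model.forward(batch_values)

print(len(stale_output), len(fresh_output))
```

In the notebook this would correspond to re-running whichever cell creates the model and its optimizer after each redefinition. A re-created model starts from fresh weights, so it needs to be trained again before evaluation.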


In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        print(f"Shape after first GCN layer: {x.shape}")  # Expected: [num_nodes, 64]

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        print(f"Shape after second GCN layer: {x.shape}")  # Expected: [num_nodes, 128]

        # Apply global mean pooling
        x = global_mean_pool(x, batch)  # Pool node features per graph

        print(f"Shape after global mean pooling: {x.shape}")  # Expected: [batch_size, 128]

        # Apply fully connected layer to produce final output for each graph
        x = self.fc1(x)

        print(f"Output shape after fc1: {x.shape}")  # Expected: [batch_size, 1]

        return x.view(-1)  # Ensure output shape is [batch_size]



In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)

        # Debugging: Print sizes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")
        print(f"Predicted values shape (output): {output.shape}")

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")
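
For reference, the three metrics printed by the evaluation cell can be written out in a few lines. This is a plain-Python sketch of the standard definitions rather than scikit-learn's implementation; the function name `regression_metrics` and the toy values are made up for illustration.

```python
def regression_metrics(true_values, predicted_values):
    """Return (mse, mae, r2) using the standard textbook definitions."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_values)
    errors = [t - p for t, p in zip(true_values, predicted_values)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(true_values) / n
    ss_res = sum(e ** 2 for e in errors)                      # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # undefined when all true values are identical
    return mse, mae, r2

mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(f"MSE={mse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
```

R² compares the model against always predicting the mean of the true values, so it can be negative when the predictions are worse than that baseline.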


In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        print(f"Shape after first GCN layer: {x.shape}")  # Expected: [num_nodes, 64]

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        print(f"Shape after second GCN layer: {x.shape}")  # Expected: [num_nodes, 128]

        # Apply global mean pooling
        x = global_mean_pool(x, batch)  # Pool node features per graph

        print(f"Shape after global mean pooling: {x.shape}")  # Expected: [batch_size, 128]

        # Apply fully connected layer to produce final output for each graph
        x = self.fc1(x)

        print(f"Output shape after fc1: {x.shape}")  # Expected: [batch_size, 1]

        return x  # Ensure output shape is [batch_size, 1]
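
Returning `[batch_size, 1]` is harmless in the evaluation cell because both arrays are flattened before the metrics are computed, but it matters if the same output is compared elementwise against a `[batch_size]` target. Under broadcasting, a column of shape `[batch_size, 1]` against a vector of shape `[batch_size]` expands to a `[batch_size, batch_size]` grid of all pairs, and PyTorch's MSE loss warns about this kind of size mismatch. The pure-Python sketch below (helper names are hypothetical) mimics the two computations on toy numbers.

```python
def mse_aligned(pred, target):
    """Mean squared error with predictions and targets paired one-to-one."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mse_broadcast(pred_column, target):
    """Mean squared error over every (prediction, target) pair.

    Mimics comparing an [n, 1] column against an [n] vector under broadcasting,
    which yields an [n, n] grid of differences.
    """
    diffs = [(p - t) ** 2 for p in pred_column for t in target]
    return sum(diffs) / len(diffs)

pred = [1.0, 2.0, 3.0]
target = [1.0, 2.0, 3.0]  # perfect predictions

aligned = mse_aligned(pred, target)
broadcast = mse_broadcast(pred, target)
print(aligned, broadcast)
```

Even perfect predictions receive a non-zero loss in the broadcast case, which is why reducing the output to `[batch_size]` before computing a loss is the safer choice.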



In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)  # Output shape should be [batch_size, 1]

        # Debugging: Print sizes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")  # Expected: [batch_size]
        print(f"Predicted values shape (output): {output.shape}")  # Expected: [batch_size, 1]

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        print(f"Shape after first GCN layer: {x.shape}")  # Expected: [num_nodes, 64]

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        print(f"Shape after second GCN layer: {x.shape}")  # Expected: [num_nodes, 128]

        # Apply global mean pooling
        x = global_mean_pool(x, batch)  # Pool node features per graph

        print(f"Shape after global mean pooling: {x.shape}")  # Expected: [batch_size, 128]

        # Apply fully connected layer to produce final output for each graph
        x = self.fc1(x)

        print(f"Output shape after fc1: {x.shape}")  # Expected: [batch_size, 1]

        return x.squeeze(-1)  # Ensure output shape is [batch_size]



In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)  # Output shape should be [batch_size]

        # Debugging: Print sizes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")  # Expected: [batch_size]
        print(f"Predicted values shape (output): {output.shape}")  # Expected: [batch_size]

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


In [None]:
from torch_geometric.nn import global_mean_pool

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Debug: Print batch information
        print(f"Batch tensor content: {batch}")  # Expect values from 0 to batch_size - 1

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        print(f"Shape after first GCN layer: {x.shape}")  # Expected: [num_nodes, 64]

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        print(f"Shape after second GCN layer: {x.shape}")  # Expected: [num_nodes, 128]

        # Apply global mean pooling
        x = global_mean_pool(x, batch)  # Pool node features per graph

        print(f"Shape after global mean pooling: {x.shape}")  # Expected: [batch_size, 128]

        # Apply fully connected layer to produce final output for each graph
        x = self.fc1(x)

        print(f"Output shape after fc1: {x.shape}")  # Expected: [batch_size, 1]

        return x.squeeze(-1)  # Ensure output shape is [batch_size]



In [None]:
model.eval()
true_values = []
predicted_values = []

with torch.no_grad():
    for batch in test_loader:
        batch = batch.to(device)
        output = model(batch)  # Output shape should be [batch_size]

        # Debugging: Print sizes of batch.y and output
        print(f"True values shape (batch.y): {batch.y.shape}")  # Expected: [batch_size]
        print(f"Predicted values shape (output): {output.shape}")  # Expected: [batch_size]

        # Ensure both batch.y and output are valid before appending
        if batch.y.shape[0] == output.shape[0]:
            true_values.extend(batch.y.cpu().numpy().flatten())
            predicted_values.extend(output.cpu().numpy().flatten())
        else:
            print("Skipping batch due to size mismatch.")

# Calculate metrics if true_values and predicted_values have consistent lengths
if len(true_values) > 0 and len(true_values) == len(predicted_values):
    mse = mean_squared_error(true_values, predicted_values)
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)

    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Final true_values and predicted_values lengths do not match or are empty.")


In [None]:
print(f"Batch tensor unique values: {batch.unique()}")  # Should contain multiple unique graph IDs



In [None]:
print(f"Batch tensor unique values: {batch.unique().tolist()}")  # Should contain multiple unique graph IDs


In [None]:
# `batch` is the DataBatch left over from the last loop iteration, not the
# node-to-graph index tensor, so batch.unique() is not available on it.
print(f"Batch object: {batch}")

# Inspect the type and the number of graphs it holds
print(f"Batch type: {type(batch)}")  # A torch_geometric DataBatch, not a torch.Tensor
print(f"Number of graphs: {batch.num_graphs}")  # One target value is expected per graph


In [None]:
# Print batch details
print(f"Batch tensor: {batch}")  # Print general batch details

# Access and print specific attributes of the batch
print(f"Batch attribute values: {batch.batch}")  # This gives node-level mapping to graphs
print(f"Batch type: {type(batch.batch)}")  # Type of the batch attribute (should be torch.Tensor)
print(f"Batch shape: {batch.batch.shape}")  # Should give the number of nodes

# Extract unique graph IDs in the batch
unique_values = torch.unique(batch.batch)
print(f"Unique graph IDs in batch tensor: {unique_values.tolist()}")  # List of unique graph IDs
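
The node-to-graph vector inspected above has one entry per node, and the number of distinct values in it is the number of graphs the pooling step will produce. As a pure-Python illustration (the helper `build_batch_vector` and the node counts are made up, and this is not how `torch_geometric` builds it internally), such a vector can be reconstructed from the number of nodes in each graph:

```python
from collections import Counter

def build_batch_vector(nodes_per_graph):
    """Return a list assigning each node the index of the graph it belongs to."""
    vector = []
    for graph_id, num_nodes in enumerate(nodes_per_graph):
        vector.extend([graph_id] * num_nodes)
    return vector

# Three toy graphs with 3, 2, and 4 nodes respectively.
batch_vector = build_batch_vector([3, 2, 4])
unique_ids = sorted(set(batch_vector))   # distinct graph IDs
counts = Counter(batch_vector)           # number of nodes per graph ID

print(batch_vector)
print(unique_ids, dict(counts))
```

If the unique IDs printed in the notebook contain only `0` for a batch that should hold several graphs, the pooled output will have a single row regardless of how the model is written.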


In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Outputs a scalar value for each graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        # Apply global mean pooling
        x = global_mean_pool(x, batch)  # Shape after pooling: [batch_size, 128]

        # Apply fully connected layer to produce final output for each graph
        x = self.fc1(x)  # Shape after fc1: [batch_size, 1]

        # Return output with shape [batch_size] by squeezing the last dimension
        return x.squeeze(-1)  # Ensure output shape is [batch_size]



In [None]:
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)

        optimizer.zero_grad()
        output = model(batch)  # Output shape: [batch_size]

        # Debugging: Print shapes of true values and output
        print(f"True values shape (batch.y): {batch.y.shape}")  # Expected: [batch_size]
        print(f"Predicted values shape (output): {output.shape}")  # Expected: [batch_size]

        # Ensure shapes match before computing loss
        if output.shape == batch.y.shape:
            loss = criterion(output, batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        else:
            print("Skipping batch due to size mismatch.")

    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")



In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Corrected output to ensure one value per graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply first GCN layer
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)

        # Apply second GCN layer
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        # Apply global mean pooling to get a single representation for each graph
        x = global_mean_pool(x, batch)  # Shape should be [batch_size, 128]

        # Apply fully connected layer to get a scalar value for each graph
        x = self.fc1(x)  # Shape should be [batch_size, 1]

        # Squeeze the last dimension to match target shape
        return x.squeeze(-1)  # Final shape should be [batch_size]


In [None]:
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)

        optimizer.zero_grad()
        output = model(batch)  # Output shape should be [batch_size]

        # Check if the output matches the target shape
        if output.shape == batch.y.shape:
            loss = criterion(output, batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        else:
            print(f"Skipping batch due to size mismatch. Output shape: {output.shape}, True values shape: {batch.y.shape}")

    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")
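
The loop above prints the summed loss and drops any batch whose shapes disagree, so an epoch in which every batch is skipped reports a loss of `0.0000`, just as a perfect fit would. A small sketch of a per-epoch summary that keeps the two cases apart is shown below; `summarize_epoch` is a hypothetical helper and the loss values are made up.

```python
import math

def summarize_epoch(batch_losses, num_skipped):
    """Summarise one epoch from the losses of the batches that were actually used."""
    used = len(batch_losses)
    # Report NaN instead of 0.0 when no batch contributed to the loss.
    mean_loss = sum(batch_losses) / used if used else math.nan
    return {"mean_loss": mean_loss, "used": used, "skipped": num_skipped}

summary = summarize_epoch([0.5, 1.5, 1.0], num_skipped=1)
all_skipped = summarize_epoch([], num_skipped=4)

print(summary)
print(all_skipped)
```

Reporting the mean over used batches also keeps the number comparable across epochs when the count of skipped batches changes.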


In [None]:
class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = BatchNorm(64)
        self.dropout1 = nn.Dropout(p=0.3)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = BatchNorm(128)
        self.dropout2 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(128, 1)  # Corrected output size to ensure one value per graph

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply GCN layers
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = self.dropout2(x)

        # Apply global mean pooling to get a single representation for each graph
        x = global_mean_pool(x, batch)  # Shape should be [batch_size, 128]

        # Apply the fully connected layer to get a scalar value for each graph
        x = self.fc1(x)  # Shape should be [batch_size, 1]

        # Squeeze the last dimension to match target shape
        return x.squeeze(-1)  # Final shape should be [batch_size]


In [None]:
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)

        optimizer.zero_grad()
        output = model(batch)  # Output shape should be [batch_size]

        # Check if the output matches the target shape
        if output.shape == batch.y.shape:
            loss = criterion(output, batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        else:
            print(f"Skipping batch due to size mismatch. Output shape: {output.shape}, True values shape: {batch.y.shape}")

    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")


In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool, GCNConv

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply GCN layers
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = F.relu(self.bn2(self.conv2(x, edge_index)))

        # Global mean pooling to get graph-level representation
        x = global_mean_pool(x, batch)  # Shape: [num_graphs, 128]

        # Fully connected layer to predict value for each graph
        x = self.fc1(x)  # Shape: [num_graphs, 1]

        return x.squeeze()  # Final shape: [num_graphs]
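
A note on the bare `squeeze()` used in this version: without an argument it removes every dimension of size 1, whereas `squeeze(-1)` only removes the last one. The difference shows up when a batch contains a single graph. The sketch below reproduces the shape arithmetic on plain tuples; the helper names are hypothetical and no tensors are involved.

```python
def squeeze_all(shape):
    """Drop every dimension of size 1, as a bare squeeze() does to a shape."""
    return tuple(d for d in shape if d != 1)

def squeeze_last(shape):
    """Drop only the last dimension, and only if it has size 1, like squeeze(-1)."""
    return shape[:-1] if shape and shape[-1] == 1 else shape

results = {}
for num_graphs in (19, 1):
    fc_shape = (num_graphs, 1)  # shape after the final linear layer
    results[num_graphs] = (squeeze_all(fc_shape), squeeze_last(fc_shape))
    print(fc_shape, "->", results[num_graphs])
```

With a final batch of one graph, the bare `squeeze()` gives a 0-dimensional output with shape `()`. If the target for that batch has shape `(1,)`, the shape check in the training loop would then skip it, which `squeeze(-1)` or `view(-1)` avoids.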


In [None]:
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)

        optimizer.zero_grad()
        output = model(batch)  # Output shape should be [batch_size]

        # Check if the output matches the target shape
        if output.shape == batch.y.shape:
            loss = criterion(output, batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        else:
            print(f"Skipping batch due to size mismatch. Output shape: {output.shape}, True values shape: {batch.y.shape}")

    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")


In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import global_mean_pool, GCNConv

class GNNModel(nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        self.conv1 = GCNConv(1, 64)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = GCNConv(64, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.fc1 = nn.Linear(128, 1)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Apply GCN layers
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = F.relu(self.bn2(self.conv2(x, edge_index)))

        # Global mean pooling to get graph-level representation
        x = global_mean_pool(x, batch)  # Shape: [num_graphs, 128]

        # Fully connected layer to predict value for each graph
        x = self.fc1(x)  # Shape: [num_graphs, 1]

        return x.view(-1)  # Shape: [num_graphs]


In [None]:
model.train()
for epoch in range(100):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)

        optimizer.zero_grad()
        output = model(batch)  # Output shape should be [batch_size]

        # Make sure the output matches the target shape
        if output.shape == batch.y.shape:
            loss = criterion(output, batch.y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        else:
            print(f"Skipping batch due to size mismatch. Output shape: {output.shape}, True values shape: {batch.y.shape}")

    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")


In [None]:
# Example to ensure the output matches the batch size:
def forward(self, x):
    # Assuming x is of shape [batch_size, input_features]
    # The output should be of shape [batch_size, output_features]
    x = self.layer(x)  # Ensure this layer produces output for each sample
    return x


In [None]:
# Inside the training loop
for data in train_loader:
    # Assuming data is a tuple of (inputs, labels)
    inputs, labels = data
    print("Input shape:", inputs.shape)
    print("Label shape:", labels.shape)
    outputs = model(inputs)
    print("Output shape:", outputs.shape)

    if outputs.shape == labels.shape:
        loss = criterion(outputs, labels)
        # Proceed with backpropagation
    else:
        print(f"Skipping batch due to size mismatch: output shape {outputs.shape}, label shape {labels.shape}")


In [None]:
for data in train_loader:
    print("Data:", data)
    break  # Just to see one example


In [None]:
inputs, labels, additional_info = data


In [None]:
inputs = data['inputs']
labels = data['labels']


In [None]:
inputs = data.x  # Node features
labels = data.y  # Labels

print("Input shape:", inputs.shape)
print("Label shape:", labels.shape)



In [None]:
for data in train_loader:
    # Extract inputs and labels from the DataBatch object
    inputs = data.x           # Node features (inputs)
    labels = data.y           # Labels
    edge_index = data.edge_index  # Graph connectivity, if needed

    # Print the shapes to verify
    print("Input shape:", inputs.shape)
    print("Label shape:", labels.shape)
    print("Edge index shape:", edge_index.shape)

    # Your model training code here
    outputs = model(inputs, edge_index)  # You may need to pass edge_index if it's required by your model

    # Assuming you have a loss function
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()


In [None]:
import torch
import torch.nn.functional as F
import torch_geometric.nn as pyg_nn

class GNNModel(torch.nn.Module):
    def __init__(self):
        super(GNNModel, self).__init__()
        # Example layers
        self.conv1 = pyg_nn.GCNConv(1, 16)  # Change to your desired configuration
        self.conv2 = pyg_nn.GCNConv(16, 32)
        self.fc = torch.nn.Linear(32, 1)    # Output layer

    def forward(self, x, edge_index, batch=None):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        # Pass the node-to-graph vector (data.batch) so each graph is pooled separately;
        # with batch=None all nodes are treated as one graph and the output has a single row.
        x = pyg_nn.global_mean_pool(x, batch)
        x = self.fc(x)  # Shape: [num_graphs, 1]
        return x
