This lab demonstrates how to use a neural network in PyTorch to predict protein solubility. It is framed around the "Protein Solubility Dataset" from the UCI Machine Learning Repository, described as providing protein sequences labeled as soluble or insoluble. The sequences are encoded into numerical features, and a simple neural network is trained to classify them as soluble or insoluble. This basic framework can be expanded to more complex models and datasets for various biotech applications.

## Lab 1: Basic Biotech Application of Neural Networks using PyTorch

In this Jupyter notebook, we explore a fundamental application of neural networks in biotechnology using PyTorch. This lab demonstrates how neural networks can be used to predict the solubility of proteins based on their amino acid composition.

### Part 1: Understanding the Dataset

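The lab is framed around protein sequences represented by their amino acid composition: each sequence is encoded as the frequency of each of the 20 standard amino acids, with a label of 1 for soluble and 0 for insoluble. The file loaded below does not have that layout. As the cell output shows, it has ten numerical columns (RMSD, F1 to F9); the code uses the first nine as features and the last column, F9, as the label.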
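
The amino acid composition encoding described above can be sketched in a few lines. This is a minimal illustration, not the lab's actual preprocessing: the function name `amino_acid_composition`, the residue ordering, and the toy sequence are all assumptions made for the example.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return a 20-element list of amino acid frequencies for one sequence."""
    counts = Counter(sequence.upper())
    # Count only standard residues so ambiguous symbols (e.g. 'X') are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical toy sequence, used only for illustration.
features_example = amino_acid_composition("MKVLAAGX")
print(len(features_example), round(sum(features_example), 6))
```

Each sequence becomes a fixed-length vector regardless of its length, which is what allows it to be fed to a fully connected network.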

**Dataset Loading and Encoding:**


In [7]:
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

# Load the dataset (replace 'dataset_url' with the actual URL or file path)
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv"
data = pd.read_csv(dataset_url)

# Display the first few rows to understand the structure
print(data.head())

# Assuming the last column contains the labels and all preceding columns are features
features = data.iloc[:, :-1].values  # All columns except the last one
labels = data.iloc[:, -1].values     # The last column

# Encode the labels with LabelEncoder.
# Note: LabelEncoder assigns one integer class per distinct value, so it is only
# appropriate when the label column holds a small set of categories.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Convert features and labels to PyTorch tensors
features = torch.tensor(features, dtype=torch.float32)
labels = torch.tensor(encoded_labels, dtype=torch.float32).view(-1, 1)

# Display the first few feature vectors and labels for verification
print(features[:5])
print(labels[:5])




     RMSD        F1       F2       F3        F4            F5        F6  \
0  17.284  13558.30  4305.35  0.31754  162.1730  1.872791e+06  215.3590   
1   6.021   6191.96  1623.16  0.26213   53.3894  8.034467e+05   87.2024   
2   9.275   7725.98  1726.28  0.22343   67.2887  1.075648e+06   81.7913   
3  15.851   8424.58  2368.25  0.28111   67.8325  1.210472e+06  109.4390   
4   7.962   7460.84  1736.94  0.23280   52.4123  1.021020e+06   94.5234   

        F7   F8       F9  
0  4287.87  102  27.0302  
1  3328.91   39  38.5468  
2  2981.04   29  38.8119  
3  3248.22   70  39.0651  
4  2814.42   41  39.9147  
tensor([[1.7284e+01, 1.3558e+04, 4.3054e+03, 3.1754e-01, 1.6217e+02, 1.8728e+06,
         2.1536e+02, 4.2879e+03, 1.0200e+02],
        [6.0210e+00, 6.1920e+03, 1.6232e+03, 2.6213e-01, 5.3389e+01, 8.0345e+05,
         8.7202e+01, 3.3289e+03, 3.9000e+01],
        [9.2750e+00, 7.7260e+03, 1.7263e+03, 2.2343e-01, 6.7289e+01, 1.0756e+06,
         8.1791e+01, 2.9810e+03, 2.9000e+01],
      

    Loading and Inspecting the Data:
        The dataset is loaded using pd.read_csv(). The print(data.head()) statement helps us inspect the first few rows to verify the structure.

    Extracting Features and Labels:
        features = data.iloc[:, :-1].values: This selects all columns except the last one, assuming these are the features.
        labels = data.iloc[:, -1].values: This selects the last column as the labels.

    Encoding Labels (if necessary):
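        If the labels are categorical strings, LabelEncoder converts them into integer class indices. In this cell it is applied unconditionally. On a numerical column with many distinct values it produces one class per distinct value, so check the number of resulting classes before training.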

    Conversion to PyTorch Tensors:
        The features and labels are converted to PyTorch tensors using torch.tensor(), which is necessary for training with PyTorch.

    Verification:
        The first few entries of the features and labels are printed to ensure they have been loaded and processed correctly.
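
Selecting the label by position assumes the last column really is the label. One way to make that assumption explicit is to select the label by name and fail early if it is missing. The sketch below uses a hypothetical helper, `split_features_and_label`, and a toy table; the column name `soluble` is an assumption for illustration and is not a column of the file loaded above.

```python
import pandas as pd

def split_features_and_label(df, label_column):
    """Split a DataFrame into a feature matrix and a label vector by column name."""
    if label_column not in df.columns:
        raise KeyError(f"Column {label_column!r} not found; available: {list(df.columns)}")
    features = df.drop(columns=[label_column]).to_numpy(dtype="float32")
    labels = df[label_column].to_numpy()
    return features, labels

# Hypothetical toy table; 'soluble' is an assumed label name.
toy = pd.DataFrame({"f1": [0.1, 0.4, 0.3], "f2": [1.0, 2.0, 3.0], "soluble": [1, 0, 1]})
X, y = split_features_and_label(toy, "soluble")
print(X.shape, sorted(set(y.tolist())))
```

Printing the set of distinct label values right after extraction is a cheap check that the column holds classes rather than continuous measurements.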


### Part 2: Defining the Neural Network

In [9]:
import torch.nn as nn

class SolubilityPredictor(nn.Module):
    def __init__(self, input_size):
        super(SolubilityPredictor, self).__init__()
        # Define a hidden layer with 50 units
        self.hidden = nn.Linear(input_size, 50)
        # Define an output layer with 1 unit
        self.output = nn.Linear(50, 1)
        # Sigmoid activation function
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Apply ReLU activation function after hidden layer
        x = torch.relu(self.hidden(x))
        # Apply the output layer and then sigmoid activation function
        x = self.output(x)
        return self.sigmoid(x)

# Define the input size based on the number of features
input_size = features.shape[1]
model = SolubilityPredictor(input_size)


Imports:

    torch: Main PyTorch library.
    torch.nn: Module containing neural network layers and loss functions.

SolubilityPredictor Class:

    This class inherits from torch.nn.Module, making it a PyTorch neural network model.
    __init__(self, input_size):
        This method initializes the model, setting up the layers.
        self.hidden = nn.Linear(input_size, 50): Creates a fully connected layer with input_size inputs and 50 outputs.
        self.output = nn.Linear(50, 1): Creates a fully connected output layer with 50 inputs and 1 output.
        self.sigmoid = nn.Sigmoid(): A sigmoid activation function to ensure the output is between 0 and 1.

Forward Method:

    forward(self, x): This method defines the forward pass of the network. It determines how data flows through the network.
    x = torch.relu(self.hidden(x)): Applies a ReLU activation function after the hidden layer.
    x = self.output(x): Passes the data through the output layer.
    return self.sigmoid(x): Applies a sigmoid activation function to the output, useful for binary classification tasks where the output needs to be between 0 and 1.

Model Instantiation:

    input_size = features.shape[1]: The number of input features is determined from the dataset.
    model = SolubilityPredictor(input_size): An instance of the SolubilityPredictor model is created with the specified input size.

This model is designed for binary classification, here predicting protein solubility. The ReLU activation in the hidden layer introduces non-linearity, and the sigmoid activation in the output layer maps the output to a value between 0 and 1 that can be read as the probability of the positive class. The inputs are the dataset's numerical features, used directly without further encoding.
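
Before training, it can help to push a dummy batch through a network of this shape and confirm the output dimensions and range. The sketch below rebuilds the same layout under a different name (`TinyClassifier`, an assumption for this example) so that it is self-contained; the batch of 4 samples with 9 features is made up.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Same layout as the lab's model: Linear -> ReLU -> Linear -> Sigmoid."""
    def __init__(self, input_size, hidden_size=50):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, 1)

    def forward(self, x):
        return torch.sigmoid(self.output(torch.relu(self.hidden(x))))

torch.manual_seed(0)
dummy_batch = torch.randn(4, 9)   # 4 hypothetical samples with 9 features each
check_model = TinyClassifier(input_size=9)
with torch.no_grad():
    probs = check_model(dummy_batch)

# One probability per sample, and the total number of trainable parameters.
n_params = sum(p.numel() for p in check_model.parameters())
print(probs.shape, n_params)
```

With 9 inputs and 50 hidden units, the parameter count is 9*50 + 50 for the hidden layer plus 50 + 1 for the output layer.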


### Part 3: Training the Model

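We train the model on the full dataset. The loss function is binary cross-entropy (nn.BCELoss), and the optimizer is torch.optim.SGD with a learning rate of 0.01. Because every step passes all samples through the model at once, the loop below performs full-batch gradient descent rather than mini-batch stochastic updates.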
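
As a small check of what binary cross-entropy computes, the loss can be written out by hand and compared with `nn.BCELoss`. The probabilities and targets below are made-up values for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical predicted probabilities and true labels for four samples.
probs = torch.tensor([[0.9], [0.2], [0.7], [0.4]])
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Binary cross-entropy written out: mean of -[y*log(p) + (1-y)*log(1-p)].
manual_bce = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()
library_bce = nn.BCELoss()(probs, targets)

print(f"manual: {manual_bce.item():.4f}, nn.BCELoss: {library_bce.item():.4f}")
```

The loss is small only when confident predictions are also correct, which is why a near-zero loss on heavily skewed labels deserves a second look.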

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim

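# BCELoss expects targets in [0, 1]. If the encoded labels contain more than two
# classes, collapse them to binary: 1 for the highest class index, 0 for all others.
# Note that this is not a rescaling; every class except the largest becomes 0.
if labels.max() > 1:
    labels = (labels == labels.max()).float()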

# Check the label range
print(f"Label range: {labels.min()} to {labels.max()}")

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(features)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


Label range: 0.0 to 1.0
Epoch [10/100], Loss: 0.0022
Epoch [20/100], Loss: 0.0022
Epoch [30/100], Loss: 0.0022
Epoch [40/100], Loss: 0.0022
Epoch [50/100], Loss: 0.0022
Epoch [60/100], Loss: 0.0022
Epoch [70/100], Loss: 0.0022
Epoch [80/100], Loss: 0.0022
Epoch [90/100], Loss: 0.0022
Epoch [100/100], Loss: 0.0022
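
The loop above passes the whole dataset through the model in every step. A mini-batch variant, closer to what "stochastic" gradient descent usually means, can be sketched with `DataLoader`. This is an illustration on synthetic data, not a rerun of the lab: the data, batch size, learning rate, and the switch to `nn.BCEWithLogitsLoss` (which takes raw logits instead of sigmoid outputs) are all assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 256 samples, 9 features, label depends on the first feature.
X = torch.randn(256, 9)
y = (X[:, :1] > 0).float()

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# The network returns raw logits; BCEWithLogitsLoss applies the sigmoid internally.
net = nn.Sequential(nn.Linear(9, 50), nn.ReLU(), nn.Linear(50, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_X, batch_y in loader:      # one parameter update per mini-batch
        optimizer.zero_grad()
        loss = criterion(net(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        running += loss.item() * batch_X.size(0)
    epoch_losses.append(running / len(X))   # sample-weighted mean loss for the epoch

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
```

A loss that falls across epochs on data with a learnable signal contrasts with the flat loss printed above, where the targets carry almost no information.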


### Part 4: Evaluating the Model

In [15]:
# Evaluate the model
with torch.no_grad():
    predictions = model(features)
    predicted_classes = (predictions > 0.5).float()
    accuracy = (predicted_classes == labels).sum() / len(labels)
    print(f'Accuracy: {accuracy.item():.4f}')


Accuracy: 1.0000


In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset (replace 'dataset_url' with the actual URL or file path)
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv"
data = pd.read_csv(dataset_url)

# Display the first few rows to understand the structure
print(data.head())

# Assuming the last column contains the labels and all preceding columns are features
features = data.iloc[:, :-1].values  # All columns except the last one
labels = data.iloc[:, -1].values     # The last column

# Encode the labels if they are not binary (if necessary)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Check if labels are binary, convert to float if necessary
if encoded_labels.max() > 1:
    encoded_labels = (encoded_labels == encoded_labels.max()).astype(float)

# Convert features and labels to PyTorch tensors
features = torch.tensor(features, dtype=torch.float32)
labels = torch.tensor(encoded_labels, dtype=torch.float32).view(-1, 1)

# Display the first few feature vectors and labels for verification
print(features[:5])
print(labels[:5])

class SolubilityPredictor(nn.Module):
    def __init__(self, input_size):
        super(SolubilityPredictor, self).__init__()
        self.hidden1 = nn.Linear(input_size, 100)
        self.hidden2 = nn.Linear(100, 50)
        self.output = nn.Linear(50, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.hidden1(x))
        x = torch.relu(self.hidden2(x))
        x = self.output(x)
        return self.sigmoid(x)

# Define the input size based on the number of features
input_size = features.shape[1]
model = SolubilityPredictor(input_size)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(features)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
with torch.no_grad():
    predictions = model(features)
    predicted_classes = (predictions > 0.5).float()
    accuracy = (predicted_classes == labels).sum() / len(labels)
    print(f'Accuracy: {accuracy.item():.4f}')



The outputs show an accuracy that prints as 100% and a very low loss that stays constant across epochs. This is not evidence of a good model. The labels are almost all zeros, so a model that always predicts 0 is right nearly every time, and the loss has almost nothing left to reduce. The result points to degenerate labels rather than to overfitting in the usual sense.


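Key Observations and Considerations:

    Nearly All-Zero Labels:
        Almost every sample has a label of 0. The model can reach near-perfect accuracy by always predicting the majority class, so the accuracy figure is artificially high and says little about what was learned.

    Overfitting Cannot Be Assessed Yet:
        Overfitting occurs when a model fits the training data well but fails to generalize to new, unseen data. The model above is evaluated on the same data it was trained on, and the labels carry almost no signal, so these numbers cannot show whether it generalizes.

    Data Imbalance or Labeling Issue:
        The cause lies in how the labels were produced. The label column (F9) holds continuous values. LabelEncoder assigns a separate integer class to each distinct value, and the binarization step then keeps only the largest class as 1, so every other sample becomes 0. The file's columns (RMSD, F1 to F9) do not include a solubility label, so taking the last column as the label is an assumption that does not hold here. In a real-world scenario, it's important to ensure that labels correctly represent the classes you're trying to predict.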
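
To see how label encoding followed by a "largest class" binarization can leave almost no positives, the effect can be reproduced on a made-up continuous column. In this sketch `np.unique(..., return_inverse=True)` stands in for `LabelEncoder.fit_transform`, since both map each distinct value to an integer code in sorted order; the random column is an assumption for illustration.

```python
import numpy as np

# Hypothetical continuous column standing in for a label column with many distinct values.
rng = np.random.default_rng(0)
continuous_column = rng.normal(loc=35.0, scale=5.0, size=1000)

# Sorted distinct values plus one integer code per row.
classes, codes = np.unique(continuous_column, return_inverse=True)
print(f"distinct classes: {len(classes)}")

# The binarization used in the lab: 1 only where the code equals the largest code.
binary = (codes == codes.max()).astype(float)
print(f"positives after binarization: {int(binary.sum())} of {len(binary)}")
```

Only the rows holding the single largest value end up labeled 1, which matches the pattern of one positive among tens of thousands of samples.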

Next Steps:

    Verify Label Distribution:
        Check the label distribution to ensure diversity. If there's an imbalance, consider using techniques like oversampling, undersampling, or adjusting class weights during training.

    Use a Validation Set:
        Implement a validation set to monitor the model's performance on unseen data. This helps identify overfitting and ensures the model generalizes well.

    Introduce Regularization:
        If overfitting is suspected, introduce regularization techniques such as dropout, weight decay, or early stopping.

    Check Data Processing:
        Ensure that the data processing pipeline, especially the labeling process, is correct and that the labels accurately reflect the desired output classes.
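
As one possible illustration of the regularization step, dropout can be added between layers and weight decay passed to the optimizer. This is a sketch under assumptions: the class name `RegularizedPredictor`, the dropout probability of 0.3, and the weight decay of 1e-4 are illustrative choices, not tuned values.

```python
import torch
import torch.nn as nn

class RegularizedPredictor(nn.Module):
    """Two hidden layers with dropout after each (hypothetical variant of the lab model)."""
    def __init__(self, input_size, dropout_p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 100), nn.ReLU(), nn.Dropout(dropout_p),
            nn.Linear(100, 50), nn.ReLU(), nn.Dropout(dropout_p),
            nn.Linear(50, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

reg_model = RegularizedPredictor(input_size=9)
# weight_decay adds an L2 penalty on the parameters to each SGD update.
reg_optimizer = torch.optim.SGD(reg_model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(8, 9)   # hypothetical batch
reg_model.eval()        # dropout disabled: forward passes are deterministic
with torch.no_grad():
    eval_out_a, eval_out_b = reg_model(x), reg_model(x)
print(torch.equal(eval_out_a, eval_out_b))
```

Calling `reg_model.train()` before the training loop and `reg_model.eval()` before evaluation matters here, because dropout behaves differently in the two modes.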

Here’s a brief example to check label distribution:

In [16]:
# Check label distribution
unique_labels, counts = torch.unique(labels, return_counts=True)
print(f"Label Distribution: {dict(zip(unique_labels.tolist(), counts.tolist()))}")


Label Distribution: {0.0: 45729, 1.0: 1}
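
With this distribution, a useful reference point is the accuracy of a trivial classifier that always predicts the majority class. The short calculation below uses the counts printed above; the variable names are illustrative.

```python
# Counts taken from the label distribution printed above.
label_counts = {0.0: 45729, 1.0: 1}

total = sum(label_counts.values())
majority_class = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.6f}")
print(f"Rounded to 4 decimals: {baseline_accuracy:.4f}")
```

The baseline rounds to 1.0000 at four decimals, so the reported accuracy is no better than always predicting 0.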


And here's an example of how to split the data into training and validation sets:

In [21]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset (replace 'dataset_url' with the actual URL or file path)
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv"
data = pd.read_csv(dataset_url)

# Display the first few rows to understand the structure
print(data.head())

# Assuming the last column contains the labels and all preceding columns are features
features = data.iloc[:, :-1].values  # All columns except the last one
labels = data.iloc[:, -1].values     # The last column

# Encode the labels if they are not binary (if necessary)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Check if labels are binary, convert to float if necessary
if encoded_labels.max() > 1:
    encoded_labels = (encoded_labels == encoded_labels.max()).astype(float)

# Convert features and labels to PyTorch tensors
features = torch.tensor(features, dtype=torch.float32)
labels = torch.tensor(encoded_labels, dtype=torch.float32).view(-1, 1)

# Display the first few feature vectors and labels for verification
print(features[:5])
print(labels[:5])

# Split the data into training and validation sets
features_train, features_val, labels_train, labels_val = train_test_split(features, labels, test_size=0.2, random_state=42)

# The split outputs are already tensors; make independent copies and keep labels as column vectors
features_train = features_train.clone().detach()
labels_train = labels_train.clone().detach().view(-1, 1)
features_val = features_val.clone().detach()
labels_val = labels_val.clone().detach().view(-1, 1)

class SolubilityPredictor(nn.Module):
    def __init__(self, input_size):
        super(SolubilityPredictor, self).__init__()
        self.hidden1 = nn.Linear(input_size, 100)
        self.hidden2 = nn.Linear(100, 50)
        self.output = nn.Linear(50, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.hidden1(x))
        x = torch.relu(self.hidden2(x))
        x = self.output(x)
        return self.sigmoid(x)

# Define the input size based on the number of features
input_size = features_train.shape[1]
model = SolubilityPredictor(input_size)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(features_train)
    loss = criterion(outputs, labels_train)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
with torch.no_grad():
    predictions = model(features_val)
    predicted_classes = (predictions > 0.5).float()
    accuracy = (predicted_classes == labels_val).sum() / len(labels_val)
    print(f'Validation Accuracy: {accuracy.item():.4f}')





The model again reports a validation accuracy that prints as 100%, and the loss remains constant at a very low value throughout training. The first rows of the labels tensor are all zeros, consistent with the label distribution shown earlier, in which only a single sample is labeled 1.

    Imbalanced Dataset:
        The labels are extremely imbalanced: nearly every sample belongs to class 0. With a single positive sample, an 80/20 split can leave the validation set with no positives at all, so a model that predicts only the majority class still scores a misleadingly high accuracy.

    Label Distribution Check:
        Ensure that the dataset actually contains more than one class. If all labels are zero, the dataset is not suitable for training a binary classification model, as it lacks diversity in the labels.

    Feature and Label Verification:
        Double-check the source and processing of the labels to ensure that they are being loaded and processed correctly. There may be an issue with how the labels were encoded or extracted from the dataset.

    Potential Data Quality Issue:
        The issue could stem from a problem in the data acquisition or preprocessing pipeline. Verify the data source and ensure the data accurately reflects the intended features and labels.
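
When a dataset is imbalanced but does contain a meaningful number of positives, one common adjustment is to weight the positive class in the loss. The sketch below uses `nn.BCEWithLogitsLoss` with its `pos_weight` argument on made-up labels. The 95/5 split and the negative-to-positive ratio heuristic are assumptions, and the model would need to output raw logits rather than sigmoid probabilities.

```python
import torch
import torch.nn as nn

# Hypothetical imbalanced labels: 95 negatives and 5 positives.
labels = torch.cat([torch.zeros(95, 1), torch.ones(5, 1)])

n_pos = labels.sum()
n_neg = labels.numel() - n_pos
# A common heuristic: weight positives by the negative-to-positive ratio.
pos_weight = (n_neg / n_pos).reshape(1)

weighted_loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
plain_loss = nn.BCEWithLogitsLoss()

logits = torch.zeros(100, 1)  # an uninformative model: probability 0.5 everywhere
plain_value = plain_loss(logits, labels).item()
weighted_value = weighted_loss(logits, labels).item()
print(f"pos_weight: {pos_weight.item():.1f}")
print(f"plain loss: {plain_value:.4f}, weighted loss: {weighted_value:.4f}")
```

Weighting does not help when only one positive sample exists, as in the distribution seen in this lab; in that case the labels themselves need to be fixed first.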

Corrective Steps:

    Inspect the Original Dataset:
        Before any processing, ensure the original dataset has the correct labels. It may be necessary to manually inspect a subset of the data to confirm its accuracy.

    Re-verify Data Processing Steps:
        Go through each step of the data processing pipeline to confirm that features and labels are extracted correctly and that no unintended transformations are applied.

    Consider Collecting More Data:
        If the dataset truly lacks diversity, consider collecting more data that includes a broader range of classes. This will help in building a model that can generalize well across different classes.

    Use a Different Metric:
        In the case of imbalanced datasets, accuracy may not be a reliable metric. Consider using other metrics like precision, recall, F1-score, or AUC-ROC to better understand the model's performance.
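
To illustrate why accuracy alone can mislead, precision, recall, and F1 for the positive class can be computed directly from the predictions. This is a minimal sketch with a hypothetical helper, `binary_classification_metrics`, and made-up labels. Returning 0.0 when a metric is undefined (no predicted or no actual positives) is a convention chosen for the example.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 9 negatives, 1 positive; the "model" always predicts 0.
y_true = [0] * 9 + [1]
always_zero = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)

print(f"accuracy: {accuracy:.2f}")
print(binary_classification_metrics(y_true, always_zero))
```

The always-zero predictor scores 90% accuracy on this toy data while its precision, recall, and F1 for the positive class are all zero.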