Introduction:

The project aims to develop a deep learning algorithm for identifying metastatic cancer in small image patches from digital pathology scans. Using a modified version of the PatchCamelyon (PCam) dataset, the task involves binary image classification to predict tumor tissue presence. The PCam dataset's size and simplicity make it ideal for educational purposes and fundamental machine learning research. The model's performance will be evaluated using the area under the ROC curve (AUC), measuring its ability to distinguish between cancerous and non-cancerous samples.

Exploratory Data Analysis (EDA):

The exploratory data analysis phase examines the dataset structure, image properties, label distribution, and candidate data augmentation techniques. The training data consists of images in a "train" directory with corresponding labels in a CSV file. Each patch is resized to 32x32 pixels before being passed to the model, and visual examination reveals diverse features. Analyzing the label distribution helps identify class imbalance, while data augmentation can improve model generalization.
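
As a minimal sketch of the label-distribution check described above, the helper below counts each class and its share of the data. The function name `label_distribution` and the toy labels are illustrative assumptions; in the notebook the labels would come from the second column of the labels CSV.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Hypothetical toy labels standing in for the label column of the CSV
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0]
for label, (count, fraction) in label_distribution(toy_labels).items():
    print(f"label {label}: {count} samples ({fraction:.0%})")
```

A strongly skewed split would be a reason to look at AUC rather than accuracy alone, since accuracy can look good when the model simply predicts the majority class.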

In [2]:
# Import the necessary libraries
import os
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F  # provides F.relu, used in SimpleCNN.forward
import torch.optim as optim
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from torchvision import models

class HistopathologyDataset(Dataset):
    def __init__(self, images_dir, labels_file, transform=None):
        self.images_dir = images_dir
        self.labels_df = pd.read_csv(labels_file)
        self.transform = transform
    def __len__(self):
        return len(self.labels_df)
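    def __getitem__(self, idx):
        # Column 0 holds the image id, column 1 the binary label
        img_id = str(self.labels_df.iloc[idx, 0])
        # Assumption: as in the Kaggle release, ids omit the ".tif" extension
        if not img_id.endswith(".tif"):
            img_id += ".tif"
        img_name = os.path.join(self.images_dir, img_id)
        image = Image.open(img_name).convert("RGB")  # ensure 3 channels for conv1
        label = int(self.labels_df.iloc[idx, 1])

        if self.transform:
            image = self.transform(image)
        return image, label


transform = transforms.Compose([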
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
# Create the datasets and data loaders
train_dataset = HistopathologyDataset(images_dir="train", labels_file="train_labels.csv", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataset = HistopathologyDataset(images_dir="test", labels_file="sample_submission.csv", transform=transform)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)


FileNotFoundError: [Errno 2] No such file or directory: 'train_labels.csv'
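
The error above means `train_labels.csv` was not found relative to the notebook's working directory. A minimal sketch of a pre-flight check is shown below; the helper name `missing_paths` is an assumption for illustration, and the path list simply repeats the arguments used in the cell above.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of the given paths that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Paths used by the dataset cell, resolved against the current working directory
required = ["train", "train_labels.csv", "test", "sample_submission.csv"]
missing = missing_paths(required)
if missing:
    print(f"Working directory: {Path.cwd()}")
    print(f"Missing inputs: {missing}")
```

Running a check like this before constructing the datasets shows at a glance whether the data was downloaded to a different folder or the notebook was started from another directory.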

Model Architecture:

The model is a simple convolutional neural network (CNN) that processes the input image patches. It has two convolutional layers, each followed by a ReLU activation and max pooling, and a single fully connected layer that outputs one logit per class. No explicit softmax layer is needed during training, because the cross-entropy loss applies it internally; applying softmax to the logits yields class probabilities at inference time. This structure lets the model learn hierarchical features from the images.
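
The input size of the fully connected layer follows from how each layer changes the spatial size of a 32x32 patch. The sketch below walks through that arithmetic using the standard output-size formula for convolution and pooling layers (dilation of 1 assumed); the helper name `conv_output_size` is illustrative.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (square inputs)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                                 # resized input patch
size = conv_output_size(size, kernel_size=3, padding=1)   # first conv   -> 32
size = conv_output_size(size, kernel_size=2, stride=2)    # max pooling  -> 16
size = conv_output_size(size, kernel_size=3, padding=1)   # second conv  -> 16
size = conv_output_size(size, kernel_size=2, stride=2)    # max pooling  -> 8
flattened = 32 * size * size                              # 32 channels  -> 2048
print(size, flattened)
```

If the `Resize` transform were changed, for example to keep larger patches, the same calculation gives the new input size for the fully connected layer.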


In [None]:
# CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two 3x3 convolutions; padding=1 preserves the spatial size
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves height and width
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # A 32x32 input is pooled twice (32 -> 16 -> 8), leaving 32 channels of 8x8
        self.fc1 = nn.Linear(32 * 8 * 8, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten to (batch, 32 * 8 * 8)
        x = self.fc1(x)  # raw logits; CrossEntropyLoss applies softmax internally
        return x

# Initialize the model, loss function, and optimizer
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")

# Evaluate on the test loader
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # index of the larger logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')
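
The introduction names AUC as the evaluation metric, while the loop above reports accuracy. A minimal sketch of how AUC could be computed from the model's logits follows. The helper names `positive_class_probability` and `roc_auc` and the toy logits are illustrative assumptions, not part of the notebook. The sketch also assumes true labels are available, for example from a held-out portion of the training set, since the labels in a sample submission file may only be placeholders.

```python
import math


def positive_class_probability(logits):
    """Softmax probability of class 1 for a pair of logits [z0, z1]."""
    z0, z1 = logits
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical logits and labels for four validation patches
toy_logits = [[2.0, -1.0], [0.2, 0.0], [0.3, 0.0], [-1.0, 1.5]]
toy_labels = [0, 0, 1, 1]
scores = [positive_class_probability(z) for z in toy_logits]
print(f"AUC: {roc_auc(toy_labels, scores):.2f}")
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show what the metric measures. For a full validation set, `sklearn.metrics.roc_auc_score(y_true, y_score)` is the more practical choice.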

Optimization/Troubleshooting:

Optimization and troubleshooting center on a handful of hyperparameters and strategies. The model is trained with cross-entropy loss and the Adam optimizer at a learning rate of 0.001, using a batch size of 4 for 5 epochs. Batch size and number of epochs influence training dynamics, while monitoring training loss alongside validation accuracy helps detect overfitting early. Hyperparameter tuning, such as adjusting the learning rate or adding dropout layers, may be necessary depending on validation performance.
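
One common way to act on validation monitoring is early stopping. The sketch below is a minimal, framework-independent version. The class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the listed validation losses are all assumptions for illustration; the notebook itself does not hold out a validation set, so one would first have to be split off from the training data.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical per-epoch validation losses
stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([0.60, 0.52, 0.55, 0.54, 0.50], start=1):
    if stopper.step(val_loss):
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best_loss:.2f}")
        break
```

In a real training loop, `step` would be called once per epoch with the loss measured on the validation loader, and the model weights from the best epoch would typically be saved for later use.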


Results Analysis and Conclusion:

The resulting test accuracy of 78.93% suggests that even a small convolutional neural network (CNN) can be effective for binary image classification in medical image analysis. The model's performance is evaluated on the test dataset, with key results including training loss trends and test accuracy. This project highlights the potential of deep learning in aiding medical diagnosis and emphasizes the importance of ongoing research in this domain. Future work could involve experimenting with more complex architectures or transfer learning to further improve performance.

GitHub Repository: https://github.com/cherylblackmer/deep-learning-histopathalogic-cancer-detection

References:

https://www.kaggle.com/c/histopathologic-cancer-detection/overview

http://arxiv.org/abs/1806.03962


