<a href="https://colab.research.google.com/github/WiHi1131/Histopathologic-Cancer-Detection-Report/blob/main/Histopathic_Cancer_Detection_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the Histopathologic Cancer Detection Challenge
## Background
The Histopathologic Cancer Detection competition, hosted on Kaggle, presented a significant challenge in the field of medical image analysis. The competition began on November 16, 2018, and concluded on March 30, 2019. Its primary objective was to develop algorithms capable of identifying metastatic cancer from small image patches extracted from larger digital pathology scans. This report will detail my own attempt at identifying metastatic cancer from the same images as an exercise designed to showcase my knowledge and use of machine learning techniques.

## Dataset Overview and Key Characteristics
The dataset for this competition is a slightly modified version of the PatchCamelyon (PCam) benchmark dataset. The original PCam data contains duplicate images as a side effect of its probabilistic sampling; the Kaggle version has these duplicates removed. The dataset is notable for its size, its simple task definition, and its usefulness for a range of machine learning research questions.

### Dataset Details:

- Size: Roughly 277,500 images in total, split as follows:
  - Test Set: Contains approximately 57,500 .tif images, each approximately 27.94 kB in size.
  - Training Set: Comprises approximately 220,000 .tif images, each the same size as those in the test set.
- Data Structure:
  - Image Files: All images are in .tif format, which is common for high-quality pathology images.
  - Labels:
    - Training Labels: Provided in a train_labels.csv file. A positive label indicates the presence of tumor tissue in the central 32x32 pixel region of an image patch.
    - Sample Submission: A sample submission file in .csv format is included to guide the format of competition submissions.
- Clinical Relevance: The challenge focuses on detecting metastasis in pathology images, a critical task in cancer diagnosis.
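- Task Design: The problem is structured as a binary image classification task. It is comparable to popular benchmarks such as CIFAR-10 and MNIST in its small image size and simple setup, although those benchmarks have ten classes rather than two.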
- Feasibility for Training: Despite its large size, the dataset is structured to allow efficient training on standard hardware, including single GPU setups.
- Research Potential: The dataset's structure and challenge make it a valuable resource for exploring key areas in machine learning, including model uncertainty, active learning, and explainability.
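
To make the label definition above concrete, here is a minimal sketch of how the central 32x32 region of a patch could be isolated. It assumes the 96x96 RGB patch size used by PCam, which is not stated in this report, and it uses a synthetic array rather than a real image. The helper name `center_region` is hypothetical.

```python
import numpy as np

def center_region(patch, size=32):
    """Return the central size x size window of an image array (H, W[, C])."""
    patch = np.asarray(patch)
    height, width = patch.shape[:2]
    if size > height or size > width:
        raise ValueError("requested window is larger than the patch")
    top = (height - size) // 2
    left = (width - size) // 2
    return patch[top:top + size, left:left + size]

# Synthetic stand-in for one patch (assumed 96x96 RGB): black border, white center.
patch = np.zeros((96, 96, 3), dtype=np.uint8)
patch[32:64, 32:64] = 255

center = center_region(patch)
print(center.shape)  # (32, 32, 3)
```

Only this central window determines the label; the surrounding border gives a model extra context but does not affect the ground truth.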

## Acknowledgements
The dataset was provided by Bas Veeling, with additional contributions from Babak Ehteshami Bejnordi, Geert Litjens, and Jeroen van der Laak. Scientific publications that use this dataset should cite the papers requested by the dataset authors.

## Evaluation Metric
The evaluation metric for this competition is the area under the Receiver Operating Characteristic (ROC) curve (AUC), computed between the predicted probabilities and the true labels. AUC measures how well a model ranks tumor-positive patches above tumor-negative ones: a score of 1.0 means perfect separation, and 0.5 is equivalent to random guessing.
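
As an illustration of what this metric measures, the sketch below computes ROC AUC with the rank-based (Mann-Whitney U) formulation, using NumPy only. It is not the competition's scoring code. The function name `roc_auc` and the toy labels and scores are made up for the example; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC via the Mann-Whitney U statistic; tied scores share an average rank."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Sort scores, then give each group of tied scores its mean 1-based rank.
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    _, first_idx, counts = np.unique(
        sorted_scores, return_index=True, return_counts=True
    )
    avg_ranks = first_idx + (counts + 1) / 2.0
    ranks = np.empty(y_score.size, dtype=float)
    ranks[order] = np.repeat(avg_ranks, counts)

    # U counts the (positive, negative) pairs where the positive is ranked higher.
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

# Toy example: one of the four positive/negative pairs is ordered incorrectly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because AUC depends only on the ordering of the scores, any monotonic rescaling of the predicted probabilities leaves it unchanged.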

## Submission Format
Participants are required to predict the probability that the central 32x32 pixel region of each image patch contains at least one pixel of tumor tissue. The submission file should follow the format: id, label, where each id corresponds to an image in the test set, and label is the predicted probability.
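
A minimal sketch of writing a file in this format with the standard library is shown below. The helper name `write_submission`, the example ids, and the six-decimal formatting are illustrative assumptions; only the `id,label` header comes from the description above.

```python
import csv
import io

def write_submission(ids, probabilities, fileobj):
    """Write an id,label CSV with one predicted probability per test image."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "label"])
    for image_id, prob in zip(ids, probabilities):
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for {image_id}: {prob}")
        writer.writerow([image_id, f"{prob:.6f}"])

# Hypothetical ids and predictions, written to an in-memory buffer for inspection.
buffer = io.StringIO()
write_submission(["img_a", "img_b"], [0.1, 0.9], buffer)
print(buffer.getvalue())
```

To produce an actual file, the same function can be given a handle opened with `open("submission.csv", "w", newline="")`.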

This competition offers a unique opportunity to apply and enhance machine learning skills in a clinically significant domain, paving the way for advancements in medical diagnostics through AI.

In [10]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

In [6]:
from google.colab import files

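# Upload kaggle.json (the Kaggle API token) from the local machine,
# then move it to ~/.kaggle with owner-only permissions as the CLI expects.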
uploaded = files.upload()
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [7]:
!kaggle competitions download -c histopathologic-cancer-detection
import zipfile


Downloading histopathologic-cancer-detection.zip to /content
100% 6.31G/6.31G [01:15<00:00, 89.9MB/s]


In [8]:
with zipfile.ZipFile('/content/histopathologic-cancer-detection.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/histopathologic_dataset')
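
After extraction, a quick sanity check on the folder contents can catch an incomplete download. The sketch below assumes the archive unpacks into `train/` and `test/` folders next to `train_labels.csv`; that layout is an assumption and is not shown in the output above. The helper name `summarize_dataset` is hypothetical.

```python
from pathlib import Path

def summarize_dataset(root):
    """Count .tif files per split and check whether the label file exists."""
    root = Path(root)
    summary = {}
    for split in ("train", "test"):  # assumed folder names
        split_dir = root / split
        if split_dir.is_dir():
            summary[split] = sum(1 for _ in split_dir.glob("*.tif"))
        else:
            summary[split] = 0
    summary["has_labels"] = (root / "train_labels.csv").is_file()
    return summary

# Path taken from the extraction cell above; missing folders are reported as 0.
print(summarize_dataset("/content/histopathologic_dataset"))
```

If the counts come out far below the sizes listed in the dataset overview, the download or extraction most likely did not finish.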