<a href="https://colab.research.google.com/github/finlaycm/tensorflow_tumor_detection/blob/master/part2_given_patches_df_extract_patch_images_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
import os
import pathlib

colab_root_dir = '/content'
drive_dir = '/content/drive'

# Mount Drive first: the directory listings below read from it.
drive.mount(drive_dir)

project_root_dir = os.path.join(drive_dir, 'My Drive', 'deeplearning', 'cancer_classification')

slide_dir = os.path.join(project_root_dir, 'myslides')
slide_files = [os.path.join(slide_dir, s) for s in os.listdir(slide_dir)]
mask_dir = os.path.join(project_root_dir, 'mymasks')
mask_files = [os.path.join(mask_dir, m) for m in os.listdir(mask_dir)]
xml_dir = os.path.join(project_root_dir, 'myannotations')
xml_files = [os.path.join(xml_dir, x) for x in os.listdir(xml_dir)]

# Patches are written to local Colab storage rather than to Drive.
patch_dir = pathlib.Path(os.path.join(colab_root_dir, 'patches'))
os.makedirs(patch_dir, exist_ok=True)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## About

This starter code shows how to read slides and tumor masks from the [CAMELYON16](https://camelyon17.grand-challenge.org/Data/) dataset. It will install [OpenSlide](https://openslide.org/) in Colab (the only non-Python dependency). Note that OpenSlide also includes a [DeepZoom viewer](https://github.com/openslide/openslide-python/tree/master/examples/deepzoom), shown in class. To use that, you'll need to install and run OpenSlide locally on your computer.

### Training data

The original slides and annotations are in an unusual format. I converted a bunch of them for you, so you can read them with OpenSlide as shown in this notebook. This [folder](https://drive.google.com/drive/folders/1rwWL8zU9v0M27BtQKI52bF6bVLW82RL5?usp=sharing) contains all the slides and tumor masks I converted (and these should be *plenty* for your project). If you'd like more beyond this, you'll need to use ASAP, as described on the competition website, to convert them into an appropriate format.

Note that even with the starter code, it will take some effort to understand how to work with this data (the various zoom levels and the coordinate system). Happy to help in office hours if you're stuck.
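
On the coordinate system: OpenSlide's `read_region(location, level, size)` expects `location` in level-0 pixel coordinates, while `size` is given in pixels at the requested level. Below is a minimal sketch of converting a coordinate from a zoomed-out level back to level 0. The helper name `level_to_base` is hypothetical (it is not part of OpenSlide), and the downsample factor of 32 is only an example; on a real slide, read the factor from `slide.level_downsamples[level]`.

```python
def level_to_base(x, y, downsample):
    """Map pixel (x, y) at a zoomed-out level to level-0 pixel coordinates."""
    return int(round(x * downsample)), int(round(y * downsample))

# Example: pixel (100, 50) at a level downsampled 32x relative to level 0.
base_xy = level_to_base(100, 50, 32.0)
print(base_xy)  # (3200, 1600)

# With a real slide, this would be used roughly as follows:
#   ds = slide.level_downsamples[level]
#   region = slide.read_region(level_to_base(x, y, ds), level, (w, h))
```

At level 0 the downsample factor is 1, so coordinates pass through unchanged.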

### Reminder

The goal for your project is to build a thoughtful, end-to-end prototype, not to match the accuracy from the [paper](https://arxiv.org/abs/1703.02442) or to use all the available data.


In [None]:
# Install the OpenSlide C library and Python bindings
!apt-get install openslide-tools
!pip install openslide-python

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  cuda-cufft-10-1 cuda-cufft-dev-10-1 cuda-curand-10-1 cuda-curand-dev-10-1
  cuda-cusolver-10-1 cuda-cusolver-dev-10-1 cuda-cusparse-10-1
  cuda-cusparse-dev-10-1 cuda-license-10-2 cuda-npp-10-1 cuda-npp-dev-10-1
  cuda-nsight-10-1 cuda-nsight-compute-10-1 cuda-nsight-systems-10-1
  cuda-nvgraph-10-1 cuda-nvgraph-dev-10-1 cuda-nvjpeg-10-1
  cuda-nvjpeg-dev-10-1 cuda-nvrtc-10-1 cuda-nvrtc-dev-10-1 cuda-nvvp-10-1
  libcublas10 libnvidia-common-430 nsight-compute-2019.5.0
  nsight-systems-2019.5.2
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  libopenslide0
Suggested packages:
  libtiff-tools
The following NEW packages will be installed:
  libopenslide0 openslide-tools
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
Need to get 92.5 kB of archives.
After t

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from openslide import open_slide, __library_version__ as openslide_version
import os
from PIL import Image
from skimage.color import rgb2gray
import cv2 as cv
import json
import pathlib
import random
import pandas as pd
import time
import shutil
from time import gmtime, strftime


In [None]:
for split in ['train', 'val', 'test']:
  for label in ['positive', 'negative']:
    os.makedirs(os.path.join(patch_dir, split, label), exist_ok=True)
patches_df = pd.read_pickle(os.path.join(project_root_dir, 'patches_df.pkl'))


In [None]:
level = 0  # read_region takes level-0 coordinates, so x and y are used as-is
c = 0  # running count of extracted patches
total_patches = patches_df.shape[0]
for s in slide_files:
  slide_name = pathlib.Path(s).stem
  print(slide_name)
  print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))
  # Open the slide file
  slide = open_slide(s)
  print('Slide dimensions : {}'.format(slide.level_dimensions[level]))
  # Select this slide's rows once instead of filtering the DataFrame twice
  slide_patches = patches_df[patches_df['slide_name'] == slide_name]
  patches_in_slide = slide_patches.shape[0]
  print('Processing slide: {} with total number of patches = {}'.format(slide_name, patches_in_slide))
  for index, row in slide_patches.iterrows():
    # Columns are addressed by position: 4 = patch name, 5-8 = x, y, w, h.
    # Judging by the folder layout, 10 is the label and 11 is the split.
    x, y, w, h = (int(v) for v in row.iloc[5:9])
    patch_path = os.path.join(patch_dir, row.iloc[11], row.iloc[10], row.iloc[4] + '.jpg')
    im = slide.read_region((x, y), level, (w, h))
    im = im.convert('RGB')  # drop the alpha channel
    assert im.size == (w, h)
    im.save(patch_path)
  slide.close()
  c += patches_in_slide
  todo = total_patches - c
  print('Total patches: {}\nExtracted patches so far: {}\nLeft to extract: {}'.format(total_patches, c, todo))
  print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))
  print('...'*30)


tumor_031
2019-12-08 04:24:18
Slide dimensions : (97792, 221184)
Processing slide: tumor_031 with total number of patches = 6545
Total patches: 93850
Extracted patches so far: 6545
Left to extract: 87305
2019-12-08 04:31:29
..........................................................................................
tumor_012
2019-12-08 04:31:29
Slide dimensions : (97792, 215552)
Processing slide: tumor_012 with total number of patches = 40
Total patches: 93850
Extracted patches so far: 6585
Left to extract: 87265
2019-12-08 04:31:38
..........................................................................................
tumor_019
2019-12-08 04:31:38
Slide dimensions : (97792, 219648)
Processing slide: tumor_019 with total number of patches = 4
Total patches: 93850
Extracted patches so far: 6589
Left to extract: 87261
2019-12-08 04:31:40
..........................................................................................
tumor_002
2019-12-08 04:31:40
Slide dimensions : (97792, 219
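
Once extraction finishes, it can be worth confirming that the number of files on disk matches the totals printed above. The following is a rough sketch, not part of the original notebook: `count_patches` is a hypothetical helper, and it assumes the `split/label` folder layout and `.jpg` extension used by the extraction cell. The demo runs on a throwaway directory; in the notebook you would pass `patch_dir` instead.

```python
import pathlib
import tempfile

def count_patches(root):
    """Return {(split, label): number of .jpg files} under root/split/label."""
    root = pathlib.Path(root)
    counts = {}
    for split in ['train', 'val', 'test']:
        for label in ['positive', 'negative']:
            folder = root / split / label
            # A missing folder simply counts as zero patches.
            counts[(split, label)] = len(list(folder.glob('*.jpg'))) if folder.is_dir() else 0
    return counts

# Demo on a temporary directory containing a single empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp) / 'train' / 'positive'
    demo.mkdir(parents=True)
    (demo / 'example_patch.jpg').touch()
    demo_counts = count_patches(tmp)

print(demo_counts[('train', 'positive')])  # 1
```

Summing the values of the returned dictionary should give the same number as `patches_df.shape[0]` if every patch was written and no two patches share a name.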

In [None]:
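day = strftime("%Y-%m-%d", gmtime())
data_dir = os.path.join(colab_root_dir, 'data_{}'.format(day))
# Archive the patch folder by its full path so the result does not depend on the working directory
shutil.make_archive(data_dir, 'zip', str(patch_dir))
# Copy the archive to Drive so it outlives the Colab session
shutil.copy(data_dir + '.zip', project_root_dir)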

'/content/drive/My Drive/deeplearning/cancer_classification/data_2019-12-08.zip'
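
To reuse the patches in a later session, the archive can be unpacked with `shutil.unpack_archive`. Because `make_archive` stores paths relative to the folder it archives, the zip contains `train/...`, `val/...`, and `test/...` at its top level, not a `patches/` folder. The sketch below demonstrates this round trip on a temporary directory; the names `data_demo` and `restored` are placeholders, not paths from this project.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the patches folder.
    src_root = os.path.join(tmp, 'patches')
    src = os.path.join(src_root, 'train', 'positive')
    os.makedirs(src)
    open(os.path.join(src, 'example_patch.jpg'), 'wb').close()

    # Archive it the same way as the cell above, then unpack elsewhere.
    archive = shutil.make_archive(os.path.join(tmp, 'data_demo'), 'zip', src_root)
    restored = os.path.join(tmp, 'restored')
    shutil.unpack_archive(archive, restored)

    # Collect restored file paths relative to the unpack directory.
    restored_files = sorted(
        os.path.relpath(os.path.join(d, f), restored)
        for d, _, files in os.walk(restored)
        for f in files
    )

print(restored_files)
```

In Colab, the equivalent would be to unpack the zip saved on Drive into a local folder such as `/content/patches`, so the training code reads from local disk.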