# Data Science Bowl 2017: How to Set Up for the Competition

Link to the competition: https://www.kaggle.com/c/data-science-bowl-2017


## Agenda

1. The Data Science Bowl 2017
2. System Setup
3. Get the data and initial analysis
4. Preprocessing steps and results
5. Outlook



## 1. The Data Science Bowl 2017
### What is the Data Science Bowl 2017

Lung cancer is one of the most common types of cancer, with nearly 225,000 new cases of the disease expected in the U.S. in 2016.

#### The Challenge
- Data set of high-resolution scans of lungs
- Detect lesions in the lungs that are cancerous
- Develop an artificial intelligence algorithm to reduce the false positive rate

#### The Goal
- With a lower false positive rate, low-dose CT scans could be widely used for lung cancer detection
- Results have the potential to advance our understanding of how cancer develops 


### Scope of this year's competition

This year's Data Science Bowl is sponsored by the Laura and John Arnold Foundation, the National Cancer Institute, the American College of Radiology, Amazon Web Services, NVIDIA, and several others.

- Global, web-based competition
- Open 90 days, from January 12 to April 12, 2017
- $1 million in prize money provided by the Laura and John Arnold Foundation

### Why Lung Cancer?
![alt text](pictures/lungCancerInfo.png)


- Lung cancer is one of the most common types of cancer, with nearly 225,000 new cases of the disease expected in the U.S. in 2016.
- It also accounts for $12 billion in health care costs in the U.S. every year.
- Early detection is critical to surviving lung cancer, as it opens a range of treatment options not available when cancer is detected at later, more advanced stages.

## 2. System Setup
### The Data
- Data size
 - 66GB zipped
 - ~140GB unzipped
- Data Format
 - CT scan data
 - .dcm files (DICOM, a medical image file format; readable with https://github.com/darcymason/pydicom)
- Structure
 - Each folder represents one patient
 - 130 - 280 slices per patient
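
The folder layout described above can be checked with a few lines of Python. The following is a minimal sketch, assuming the data was unpacked so that each patient folder sits directly under one root folder. The helper name `count_slices_per_patient` is hypothetical and not part of any library.

```python
import os


def count_slices_per_patient(root):
    """Return {patient_id: number of .dcm files} for every sub-folder of root."""
    counts = {}
    for patient in sorted(os.listdir(root)):
        patient_dir = os.path.join(root, patient)
        if not os.path.isdir(patient_dir):
            continue  # skip stray files such as archives or label files
        counts[patient] = sum(
            1 for name in os.listdir(patient_dir) if name.lower().endswith(".dcm")
        )
    return counts
```

Called as `count_slices_per_patient("data/sample_2_patients/")`, it should return one entry per patient folder, which makes it easy to spot folders with unusually few or many slices.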

In [1]:
ls "data/sample_2_patients/"

 Volume in drive C is OSDisk
 Volume Serial Number is 3465-B012

 Directory of C:\Users\jonas.leininger\DataScienceBowl2017Meetup\data\sample_2_patients

08.03.2017  10:08    <DIR>          .
08.03.2017  10:08    <DIR>          ..
08.03.2017  10:08    <DIR>          00cba091fa4ad62cc3200a657aeb957e
08.03.2017  10:08    <DIR>          0a0c32c9e08cc2ea76a71649de56be6d
               0 File(s),              0 bytes
               4 Dir(s), 72.347.672.576 bytes free


In [9]:
ls "data/sample_2_patients/00cba091fa4ad62cc3200a657aeb957e/" | head

 Volume in drive C is OSDisk
 Volume Serial Number is 3465-B012

 Directory of C:\Users\jonas.leininger\DataScienceBowl2017Meetup\data\sample_2_patients\00cba091fa4ad62cc3200a657aeb957e

08.03.2017  10:08    <DIR>          .
08.03.2017  10:08    <DIR>          ..
08.03.2017  08:28           525.448 034673134cbef5ea15ff9e0c8090500a.dcm
08.03.2017  08:28           525.438 0484f5a7f55eb7b6743cadaffcce586d.dcm
08.03.2017  08:28           525.446 053a0460fb45227bd8e0e7b514a71e8e.dcm


The .dcm files can be read with pydicom (documentation: https://pydicom.readthedocs.io/en/stable/).

With conda, install it using the following command:

```bash
conda install -c conda-forge pydicom
```

- Information stored in each .dcm file
 - Image as an array of gray-scale values
 - Z-position of the slice
 - Resolution of the image

In [3]:
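import os

import pydicom  # the package was named "dicom" before pydicom 1.0

# Each sub-folder of the input folder holds the slices of one patient
inputFolder = 'data/sample_2_patients/'
patients = sorted(os.listdir(inputFolder))

# Read every .dcm slice of the first patient (file order is arbitrary)
pathPatientZero = os.path.join(inputFolder, patients[0])
patientZero_Images = [pydicom.dcmread(os.path.join(pathPatientZero, file))
                      for file in os.listdir(pathPatientZero)]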

In [4]:
patientZero_Images[0]

(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0016) SOP Class UID                       UI: CT Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.2.840.113654.2.55.249139163741242644304020243485943527041
(0008, 0060) Modality                            CS: 'CT'
(0008, 103e) Series Description                  LO: 'Axial'
(0010, 0010) Patient's Name                      PN: '00cba091fa4ad62cc3200a657aeb957e'
(0010, 0020) Patient ID                          LO: '00cba091fa4ad62cc3200a657aeb957e'
(0010, 0030) Patient's Birth Date                DA: '19000101'
(0018, 0060) KVP                                 DS: ''
(0020, 000d) Study Instance UID                  UI: 2.25.86208730140539712382771890501772734277950692397709007305473
(0020, 000e) Series Instance UID                 UI: 2.25.11575877329635228925808596800269974740893519451784626046614
(0020, 0011) Series Number                       IS: '3'
(0020, 0012) Acquisition Number            

In [5]:
print("Pixel array: ", patientZero_Images[0].pixel_array)
print("Pixel array shape: ",patientZero_Images[0].pixel_array.shape)
print("Scan position: ",patientZero_Images[0].ImagePositionPatient)
print("Scan location: ",patientZero_Images[0].SliceLocation)
print("Number of scans: ",len(patientZero_Images))

Pixel array:  [[-2000 -2000 -2000 ..., -2000 -2000 -2000]
 [-2000 -2000 -2000 ..., -2000 -2000 -2000]
 [-2000 -2000 -2000 ..., -2000 -2000 -2000]
 ..., 
 [-2000 -2000 -2000 ..., -2000 -2000 -2000]
 [-2000 -2000 -2000 ..., -2000 -2000 -2000]
 [-2000 -2000 -2000 ..., -2000 -2000 -2000]]
Pixel array shape:  (512, 512)
Scan position:  ['-145.500000', '-158.199997', '-316.200012']
Scan location:  -316.200012
Number of scans:  134
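
Because `os.listdir` returns files in arbitrary order, the slices of a patient are usually not in anatomical order after loading. The sketch below shows one way to restore that order. It assumes every slice object exposes `ImagePositionPatient` as in the output above; the helper names `sort_slices_by_z` and `slice_spacing` are illustrative and not part of pydicom.

```python
def sort_slices_by_z(slices):
    """Return the slices ordered by the z component of ImagePositionPatient."""
    # The position values may be decimal strings, so convert them to float.
    return sorted(slices, key=lambda s: float(s.ImagePositionPatient[2]))


def slice_spacing(sorted_slices):
    """Distance in mm between the first two slices (assumes uniform spacing)."""
    if len(sorted_slices) < 2:
        raise ValueError("need at least two slices to compute a spacing")
    z0 = float(sorted_slices[0].ImagePositionPatient[2])
    z1 = float(sorted_slices[1].ImagePositionPatient[2])
    return abs(z1 - z0)
```

With the list loaded earlier, this could be used as `ordered = sort_slices_by_z(patientZero_Images)` followed by `slice_spacing(ordered)`.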


### Which System to use?

- Unzipped data: 140GB
 - Preprocessing adds at least the same amount again
 - At least ~300GB of free disk space is needed
- Training the artificial neural network requires a GPU
 - Without a GPU, training and prediction take too long
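
The disk space estimate above is simple arithmetic: 140GB of raw data plus at least 140GB of preprocessed data gives 280GB, rounded up to roughly 300GB. A minimal sketch of checking a machine against that estimate follows. The 20GB safety margin and the helper names are assumptions made for illustration.

```python
import shutil

UNZIPPED_GB = 140      # size of the unzipped data set, from the list above
PREPROCESSED_GB = 140  # "at least the same amount" after preprocessing


def required_gb(margin_gb=20):
    """Raw data plus preprocessed copy plus a safety margin (an assumption)."""
    return UNZIPPED_GB + PREPROCESSED_GB + margin_gb


def has_enough_space(path=".", needed_gb=None):
    """True if the file system holding `path` has at least needed_gb GB free."""
    if needed_gb is None:
        needed_gb = required_gb()
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb
```

Running `has_enough_space("/path/to/data/volume")` on a candidate machine or cloud volume gives a quick yes or no before starting the download.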

#### Laptop
- No GPU
- Not enough free disk space

#### Cloud computing instances
- Number of GPUs and amount of disk space are scalable

### Which cloud computing service to use?

- Amazon Web Services
 - Our company wants to gain experience with AWS
 - A quick search for tutorials turns up a lot of information on AWS
 - AWS is one of the official partners of the competition

## 3. Get the Data and Initial Analysis
### Where and how to load the data

- Download via web browser (~0.5 to 1 day)
 - No additional setup
 - Browser needs to stay open the whole time
- Download via torrent (~4 to 8 hours)
 - Much faster than downloading from kaggle.com
 - Easier to pause the download
- Download via kaggle-cli (~14 hours)
 - Command-line tool to download data and upload submissions from the console
 - Similar to wget in the console, but without needing `--load-cookies`
 - https://github.com/floydwch/kaggle-cli


### Loading data to an AWS S3 bucket
#### 1. Start with the cheapest EC2 instance with additional storage space (250GB+)
#### 2. Download data via kaggle-cli
Install kaggle-cli

```bash
pip install kaggle-cli
```

Configure the kaggle-cli
```bash
# USERNAME, PASSWORD and COMPETITION are placeholders for your own values
kg config -u USERNAME -p PASSWORD -c COMPETITION
```

Download data with kg
```bash
kg download -f train.zip
```

#### 3. Load the data to an AWS S3 bucket
Install aws-cli
```bash
sudo pip install --upgrade awscli
```

Configure aws-cli
```bash
aws configure
```
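`aws configure` asks for your AWS access key ID and secret access key.
It also asks for a default region, which should be the region of your S3 bucket, for example 'eu-central-1'. EC2 availability zones such as 'eu-central-1a' carry an extra trailing letter; the region name of an S3 bucket does not.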
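
The relation between an availability zone and its region can be expressed as a one-line rule. The sketch below is illustrative only: the helper name is hypothetical, and it assumes the standard zone naming where a single letter follows the region name.

```python
def region_from_availability_zone(zone):
    """Strip the trailing zone letter, e.g. 'eu-central-1a' -> 'eu-central-1'."""
    # Assumes the standard pattern "<region><letter>"; anything else is rejected.
    if len(zone) < 2 or not zone[-1].isalpha() or not zone[-2].isdigit():
        raise ValueError("not a standard availability zone name: %r" % zone)
    return zone[:-1]
```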

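Create the bucket (bucket names are globally unique, so choose your own):

`aws s3 mb s3://bdckaggledatasciencebowl`

Check in which region the bucket was created:

`aws s3api get-bucket-location --bucket bdckaggledatasciencebowl`

Install 7-Zip and unpack the archive. Use `x` rather than `e`, because `e` extracts all files into one directory and drops the patient folders:

`sudo apt-get install p7zip-full`

`7za x stage1.7z`

Upload the extracted folder to the bucket:

`aws s3 cp --recursive stage1/ s3://bdckaggledatasciencebowl/DataKaggle/RawData/`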
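
The same upload can also be scripted from Python instead of using `aws s3 cp --recursive`. The following is a rough sketch, assuming boto3 is installed and credentials were set up with `aws configure`. The helper names `iter_upload_targets` and `upload_folder` are hypothetical; only `boto3.client("s3").upload_file` is a library call.

```python
import os


def iter_upload_targets(local_root, key_prefix):
    """Yield (local_path, s3_key) pairs for every file below local_root."""
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            rel = os.path.relpath(local_path, local_root)
            # S3 keys always use forward slashes, regardless of the local OS.
            yield local_path, key_prefix.rstrip("/") + "/" + rel.replace(os.sep, "/")


def upload_folder(local_root, bucket, key_prefix):
    """Upload a folder tree file by file; needs boto3 and configured credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    for local_path, key in iter_upload_targets(local_root, key_prefix):
        s3.upload_file(local_path, bucket, key)


# Example call, using the bucket and prefix from the commands above:
# upload_folder("stage1/", "bdckaggledatasciencebowl", "DataKaggle/RawData")
```

Keeping the key mapping in its own generator makes it possible to print and review the planned uploads before any data is transferred.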