Since this project is based on [this Kaggle competition](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge), gathering the data wasn't difficult. Downloading the files from Kaggle is pretty easy: a 3.3GB zip file that unzips into ~3.6GB of mostly pictures.

## Data Description Directly from Kaggle

### What files do I need?

This is a two-stage challenge. You will need the images for the current stage - provided as stage_1_train_images.zip and stage_1_test_images.zip. You will also need the training data - stage_1_train_labels.csv - and the sample submission stage_1_sample_submission.csv, which provides the IDs for the test set, as well as a sample of what your submission should look like. The file stage_1_detailed_class_info.csv contains detailed information about the positive and negative classes in the training set, and may be used to build more nuanced models.
<br>
### What should I expect the data format to be?

The training data is provided as a set of patientIds and bounding boxes. Bounding boxes are defined as follows: x-min y-min width height

There is also a binary target column, Target, indicating pneumonia or non-pneumonia.

There may be multiple rows per patientId.
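
Because a patient can appear on several rows, it can help to collapse the labels into one list of boxes per `patientId`. The following is a minimal sketch on a tiny made-up frame with the same columns as the training labels; the helper name `boxes_per_patient` and the sample rows are my own assumptions, not part of the Kaggle description.

```python
import pandas as pd


def boxes_per_patient(labels: pd.DataFrame) -> dict:
    """Map each patientId to a list of (x, y, width, height) tuples.

    Patients whose rows carry no box (NaN coordinates) map to an empty list.
    """
    result = {pid: [] for pid in labels["patientId"].unique()}
    with_boxes = labels.dropna(subset=["x", "y", "width", "height"])
    for row in with_boxes.itertuples(index=False):
        result[row.patientId].append((row.x, row.y, row.width, row.height))
    return result


# Hypothetical rows laid out like the training labels (not real data)
toy = pd.DataFrame(
    {
        "patientId": ["a", "b", "b"],
        "x": [None, 264.0, 562.0],
        "y": [None, 152.0, 152.0],
        "width": [None, 213.0, 256.0],
        "height": [None, 379.0, 453.0],
        "Target": [0, 1, 1],
    }
)
print(boxes_per_patient(toy))
```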

#### DICOM Images
All provided images are in DICOM format.

<span style='color:green'>(My thoughts) The pictures are all `.dcm` files, which don't open in normal picture-viewing applications. Each file is a scan of a person's lungs, but rather than being a single image it also carries a lot of metadata (or at least can hold a lot of metadata, since it is a medically formatted patient file). This adds a level of complexity I was not expecting. I've found an application that opens the files, but you have to pay for the full version. The trial version only lets you use 10k files and I have over 20k, so that isn't enough. I'm looking into other applications that will open the files. I'm also not entirely sure how to open the image itself in Python. This will require some more thought.</span>
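
One possible route for reading the images in Python is the third-party `pydicom` package. The sketch below assumes `pydicom` (and NumPy, which it uses for pixel data) is installed, and that each scan is stored as `<patientId>.dcm` inside the unzipped image folder. The folder name and both helper names are assumptions for illustration, and I have not run this against the competition files.

```python
from pathlib import Path


def dicom_path(image_dir: str, patient_id: str) -> Path:
    """Build the path to one patient's scan (assumed naming: <patientId>.dcm)."""
    return Path(image_dir) / f"{patient_id}.dcm"


def read_dicom(path):
    """Return (pixel array, small metadata dict) for one DICOM file."""
    import pydicom  # third-party; imported lazily so dicom_path works without it

    ds = pydicom.dcmread(str(path))
    meta = {
        "rows": getattr(ds, "Rows", None),
        "columns": getattr(ds, "Columns", None),
        "modality": getattr(ds, "Modality", None),
    }
    return ds.pixel_array, meta


# Hypothetical folder name; adjust to wherever the zip was extracted
example = dicom_path("./data/stage_1_train_images", "0004cfab-14fd-4e49-80ba-63a80b6bddd6")
print(example)
```

If this works, the returned array could be shown with an ordinary plotting library, for example `matplotlib.pyplot.imshow(pixels, cmap='gray')`.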
<br>
### What am I predicting?
In this challenge competitors are predicting whether pneumonia exists in a given image. They do so by predicting bounding boxes around areas of the lung. Samples without bounding boxes are negative and contain no definitive evidence of pneumonia. Samples with bounding boxes indicate evidence of pneumonia.

When making predictions, competitors should predict as many bounding boxes as they feel are necessary, in the format: confidence x-min y-min width height

There should be only ONE predicted row per image. This row may include multiple bounding boxes.

A properly formatted row may look like any of the following.

For patientIds with no predicted pneumonia / bounding boxes: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,

For patientIds with a single predicted bounding box: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100

For patientIds with multiple predicted bounding boxes: 0004cfab-14fd-4e49-80ba-63a80b6bddd6,0.5 0 0 100 100 0.5 0 0 100 100, etc.
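
The three row shapes above can be produced by a small formatter. This is a minimal sketch, not an official Kaggle utility; the function names are hypothetical and the boxes are the placeholder values from the examples.

```python
def format_prediction_string(boxes):
    """Join (confidence, x_min, y_min, width, height) tuples into one string.

    An empty list yields an empty string, meaning no pneumonia is predicted.
    """
    parts = []
    for confidence, x_min, y_min, width, height in boxes:
        parts.append(f"{confidence} {x_min} {y_min} {width} {height}")
    return " ".join(parts)


def submission_row(patient_id, boxes):
    """Build one CSV line: the patientId, a comma, then the prediction string."""
    return f"{patient_id},{format_prediction_string(boxes)}"


pid = "0004cfab-14fd-4e49-80ba-63a80b6bddd6"
print(submission_row(pid, []))
print(submission_row(pid, [(0.5, 0, 0, 100, 100)]))
print(submission_row(pid, [(0.5, 0, 0, 100, 100), (0.5, 0, 0, 100, 100)]))
```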

### File descriptions
 - stage_1_train_labels.csv - the training set. Contains patientIds and bounding box / target information.
 - stage_1_sample_submission.csv - a sample submission file in the correct format. Contains patientIds for the test set. Note that the sample submission contains one box per image, but there is no limit to the number of bounding boxes that can be assigned to a given image.
 - stage_1_detailed_class_info.csv - provides detailed information about the type of positive or negative class for each image.

### Data fields
 - `patientId` - a patientId. Each patientId corresponds to a unique image.
 - `x` - the upper-left x coordinate of the bounding box.
 - `y` - the upper-left y coordinate of the bounding box.
 - `width` - the width of the bounding box.
 - `height` - the height of the bounding box.
 - `Target` - the binary Target, indicating whether this sample has evidence of pneumonia.

In [4]:
import pandas as pd
import numpy as np
import pandas_profiling as pp

#### stage_1_detailed_class_info

In [2]:
stage_1_detailed_class_info = pd.read_csv('./data/stage_1_detailed_class_info.csv')

In [6]:
stage_1_detailed_class_info.shape

(28989, 2)

In [8]:
stage_1_detailed_class_info.describe()

| | patientId | class |
|---|---|---|
| count | 28989 | 28989 |
| unique | 25684 | 3 |
| top | 3239951b-6211-4290-b237-3d9ad17176db | No Lung Opacity / Not Normal |
| freq | 4 | 11500 |
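
The `describe()` output shows 28989 rows but only 25684 unique patientIds, with the most frequent ID appearing 4 times, so some patients occupy several rows. Below is a small sketch for tallying how many patients have 1, 2, 3 or 4 rows, using only the standard library. The function name is hypothetical and the IDs are made up.

```python
from collections import Counter


def rows_per_patient_distribution(patient_ids):
    """Return {rows per patient: number of patients with that many rows}."""
    per_patient = Counter(patient_ids)          # patientId -> number of rows
    distribution = Counter(per_patient.values())  # row count -> number of patients
    return dict(sorted(distribution.items()))


# Made-up IDs: "a" has 1 row, "b" has 2 rows, "c" has 4 rows
toy_ids = ["a", "b", "b", "c", "c", "c", "c"]
print(rows_per_patient_distribution(toy_ids))
```

On the real data this could be called with the `patientId` column, since iterating over a pandas Series yields its values.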


In [9]:
stage_1_detailed_class_info.head(5)

| | patientId | class |
|---|---|---|
| 0 | 0004cfab-14fd-4e49-80ba-63a80b6bddd6 | No Lung Opacity / Not Normal |
| 1 | 00313ee0-9eaa-42f4-b0ab-c148ed3241cd | No Lung Opacity / Not Normal |
| 2 | 00322d4d-1c29-4943-afc9-b6754be640eb | No Lung Opacity / Not Normal |
| 3 | 003d8fa0-6bf1-40ed-b54c-ac657f8495c5 | Normal |
| 4 | 00436515-870c-4b36-a041-de91049b9ab4 | Lung Opacity |


In [10]:
stage_1_detailed_class_info.isnull().sum()

patientId    0
class        0
dtype: int64

In [17]:
stage_1_detailed_class_info['class'].value_counts()

No Lung Opacity / Not Normal    11500
Lung Opacity                     8964
Normal                           8525
Name: class, dtype: int64
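
The 8964 `Lung Opacity` rows equal the number of rows that carry a bounding box in the training labels, which hints that this class lines up with `Target == 1`. One way to check, sketched here on made-up rows: deduplicate both tables by `patientId`, merge them, and cross-tabulate. The function and variable names are hypothetical, and the sketch assumes each patient has a single class and a single Target value.

```python
import pandas as pd


def class_vs_target(class_info: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate detailed class against Target, one row per patient."""
    # Assumption: class and Target are constant within a patient, so the first row suffices
    classes = class_info.drop_duplicates("patientId")
    targets = labels[["patientId", "Target"]].drop_duplicates("patientId")
    merged = classes.merge(targets, on="patientId", how="inner")
    return pd.crosstab(merged["class"], merged["Target"])


# Made-up rows in the same layout as the two CSV files
toy_classes = pd.DataFrame(
    {
        "patientId": ["a", "b", "b", "c"],
        "class": ["Normal", "Lung Opacity", "Lung Opacity", "No Lung Opacity / Not Normal"],
    }
)
toy_labels = pd.DataFrame({"patientId": ["a", "b", "b", "c"], "Target": [0, 1, 1, 0]})

table = class_vs_target(toy_classes, toy_labels)
print(table)
```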

In [11]:
# Strange that there is a problem with this ..
pp.ProfileReport(stage_1_detailed_class_info)

in singular transformations; automatically expanding.
left=-0.5, right=-0.5
  'left=%s, right=%s') % (left, right))
in singular transformations; automatically expanding.
bottom=-0.5, top=-0.5
  'bottom=%s, top=%s') % (bottom, top))


ZeroDivisionError: float division by zero

#### stage_1_train_labels


In [12]:
stage_1_train_labels = pd.read_csv('./data/stage_1_train_labels.csv')

In [13]:
stage_1_train_labels.shape

(28989, 6)

In [14]:
stage_1_train_labels.describe()

| | x | y | width | height | Target |
|---|---|---|---|---|---|
| count | 8964.0 | 8964.0 | 8964.0 | 8964.0 | 28989.0 |
| mean | 391.456158 | 363.135877 | 220.845382 | 334.174364 | 0.309221 |
| std | 203.945378 | 148.607149 | 59.041384 | 158.097239 | 0.46218 |
| min | 2.0 | 2.0 | 40.0 | 45.0 | 0.0 |
| 25% | 205.0 | 246.0 | 180.0 | 207.0 | 0.0 |
| 50% | 320.0 | 360.0 | 219.0 | 304.0 | 0.0 |
| 75% | 591.0 | 475.0 | 261.0 | 445.0 | 1.0 |
| max | 817.0 | 881.0 | 528.0 | 942.0 | 1.0 |


In [15]:
stage_1_train_labels.isnull().sum()
#There are a number of nulls here. I'm guessing that they are the ones without any pneumonia


patientId        0
x            20025
y            20025
width        20025
height       20025
Target           0
dtype: int64
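
The guess in the comment above can be checked directly: the 20025 missing values in each box column match the 20025 rows with `Target` equal to 0 reported in this notebook. Here is a minimal sketch of that check, run on a small made-up frame rather than the real file; the helper name `nulls_match_negatives` is hypothetical.

```python
import pandas as pd

BOX_COLUMNS = ["x", "y", "width", "height"]


def nulls_match_negatives(labels: pd.DataFrame) -> bool:
    """True when box columns are missing exactly on the rows where Target == 0."""
    all_missing = labels[BOX_COLUMNS].isnull().all(axis=1)
    any_missing = labels[BOX_COLUMNS].isnull().any(axis=1)
    negative = labels["Target"] == 0
    # Every negative row lacks all four values, and no positive row lacks any
    return bool((all_missing == negative).all() and (any_missing == negative).all())


# Made-up rows laid out like the training labels
toy = pd.DataFrame(
    {
        "patientId": ["a", "b", "c"],
        "x": [None, 264.0, None],
        "y": [None, 152.0, None],
        "width": [None, 213.0, None],
        "height": [None, 379.0, None],
        "Target": [0, 1, 0],
    }
)
print(nulls_match_negatives(toy))
```

On the real data this would be called as `nulls_match_negatives(stage_1_train_labels)`.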

In [18]:
pp.ProfileReport(stage_1_train_labels)

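Dataset info

| Statistic | Value |
|---|---|
| Number of variables | 6 |
| Number of observations | 28989 |
| Total Missing (%) | 46.1% |
| Total size in memory | 1.3 MiB |
| Average record size in memory | 48.0 B |

Variable types

| Type | Count |
|---|---|
| Numeric | 4 |
| Categorical | 1 |
| Boolean | 1 |
| Date | 0 |
| Text (Unique) | 0 |
| Rejected | 0 |
| Unsupported | 0 |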

Variable: Target (Boolean)

| Statistic | Value |
|---|---|
| Distinct count | 2 |
| Unique (%) | 0.0% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |
| Mean | 0.30922 |

| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 20025 | 69.1% |
| 1 | 8964 | 30.9% |

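Variable: height (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 723 |
| Unique (%) | 2.5% |
| Missing (%) | 69.1% |
| Missing (n) | 20025 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 334.17 |
| Minimum | 45 |
| Maximum | 942 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 45.0 |
| 5-th percentile | 123.0 |
| Q1 | 207.0 |
| Median | 304.0 |
| Q3 | 445.0 |
| 95-th percentile | 627.85 |
| Maximum | 942.0 |
| Range | 897.0 |
| Interquartile range | 238.0 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 158.1 |
| Coef of variation | 0.4731 |
| Kurtosis | -0.41638 |
| Mean | 334.17 |
| MAD | 131.53 |
| Skewness | 0.57866 |
| Sum | 2995500 |
| Variance | 24995 |
| Memory size | 226.6 KiB |

Most frequent values

| Value | Count | Frequency (%) |
|---|---|---|
| 181.0 | 35 | 0.1% |
| 201.0 | 34 | 0.1% |
| 158.0 | 34 | 0.1% |
| 282.0 | 34 | 0.1% |
| 233.0 | 33 | 0.1% |
| 279.0 | 32 | 0.1% |
| 199.0 | 32 | 0.1% |
| 202.0 | 31 | 0.1% |
| 194.0 | 31 | 0.1% |
| 257.0 | 31 | 0.1% |

Smallest values

| Value | Count | Frequency (%) |
|---|---|---|
| 45.0 | 1 | 0.0% |
| 46.0 | 1 | 0.0% |
| 47.0 | 1 | 0.0% |
| 52.0 | 1 | 0.0% |
| 54.0 | 1 | 0.0% |

Largest values

| Value | Count | Frequency (%) |
|---|---|---|
| 847.0 | 1 | 0.0% |
| 851.0 | 1 | 0.0% |
| 852.0 | 1 | 0.0% |
| 867.0 | 2 | 0.0% |
| 942.0 | 2 | 0.0% |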

Variable: patientId (Categorical)

| Statistic | Value |
|---|---|
| Distinct count | 25684 |
| Unique (%) | 88.6% |
| Missing (%) | 0.0% |
| Missing (n) | 0 |

| Value | Count |
|---|---|
| 3239951b-6211-4290-b237-3d9ad17176db | 4 |
| 31764d54-ea3b-434f-bae2-8c579ed13799 | 4 |
| 0ab261f9-4eb5-42ab-a9a5-e918904d6356 | 4 |
| Other values (25681) | 28977 |

Most frequent values

| Value | Count | Frequency (%) |
|---|---|---|
| 3239951b-6211-4290-b237-3d9ad17176db | 4 | 0.0% |
| 31764d54-ea3b-434f-bae2-8c579ed13799 | 4 | 0.0% |
| 0ab261f9-4eb5-42ab-a9a5-e918904d6356 | 4 | 0.0% |
| 7d674c82-5501-4730-92c5-d241fd6911e7 | 4 | 0.0% |
| ee820aa5-4804-4984-97b3-f0a71d69702f | 4 | 0.0% |
| 8dc8e54b-5b05-4dac-80b9-fa48878621e2 | 4 | 0.0% |
| 1bf08f3b-a273-4f51-bafa-b55ada2c23b5 | 4 | 0.0% |
| 0d5bc737-03de-4bb8-98a1-45b7180c3e0f | 4 | 0.0% |
| 32408669-c137-4e8d-bd62-fe8345b40e73 | 4 | 0.0% |
| 349f10b4-dc3e-4f3f-b2e4-a5b81448ce87 | 4 | 0.0% |

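Variable: width (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 347 |
| Unique (%) | 1.2% |
| Missing (%) | 69.1% |
| Missing (n) | 20025 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 220.85 |
| Minimum | 40 |
| Maximum | 528 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 40 |
| 5-th percentile | 125 |
| Q1 | 180 |
| Median | 219 |
| Q3 | 261 |
| 95-th percentile | 319 |
| Maximum | 528 |
| Range | 488 |
| Interquartile range | 81 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 59.041 |
| Coef of variation | 0.26734 |
| Kurtosis | -0.038345 |
| Mean | 220.85 |
| MAD | 47.292 |
| Skewness | 0.17041 |
| Sum | 1979700 |
| Variance | 3485.9 |
| Memory size | 226.6 KiB |

Most frequent values

| Value | Count | Frequency (%) |
|---|---|---|
| 219.0 | 81 | 0.3% |
| 217.0 | 79 | 0.3% |
| 222.0 | 79 | 0.3% |
| 214.0 | 71 | 0.2% |
| 240.0 | 71 | 0.2% |
| 250.0 | 71 | 0.2% |
| 256.0 | 70 | 0.2% |
| 215.0 | 70 | 0.2% |
| 247.0 | 69 | 0.2% |
| 226.0 | 68 | 0.2% |

Smallest values

| Value | Count | Frequency (%) |
|---|---|---|
| 40.0 | 1 | 0.0% |
| 50.0 | 1 | 0.0% |
| 54.0 | 1 | 0.0% |
| 57.0 | 2 | 0.0% |
| 58.0 | 1 | 0.0% |

Largest values

| Value | Count | Frequency (%) |
|---|---|---|
| 432.0 | 1 | 0.0% |
| 445.0 | 2 | 0.0% |
| 454.0 | 1 | 0.0% |
| 467.0 | 1 | 0.0% |
| 528.0 | 1 | 0.0% |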

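Variable: x (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 739 |
| Unique (%) | 2.5% |
| Missing (%) | 69.1% |
| Missing (n) | 20025 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 391.46 |
| Minimum | 2 |
| Maximum | 817 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 2 |
| 5-th percentile | 122 |
| Q1 | 205 |
| Median | 320 |
| Q3 | 591 |
| 95-th percentile | 679 |
| Maximum | 817 |
| Range | 815 |
| Interquartile range | 386 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 203.95 |
| Coef of variation | 0.52099 |
| Kurtosis | -1.569 |
| Mean | 391.46 |
| MAD | 191.71 |
| Skewness | 0.099894 |
| Sum | 3509000 |
| Variance | 41594 |
| Memory size | 226.6 KiB |

Most frequent values

| Value | Count | Frequency (%) |
|---|---|---|
| 599.0 | 45 | 0.2% |
| 583.0 | 42 | 0.1% |
| 193.0 | 41 | 0.1% |
| 225.0 | 41 | 0.1% |
| 615.0 | 40 | 0.1% |
| 576.0 | 36 | 0.1% |
| 199.0 | 36 | 0.1% |
| 604.0 | 36 | 0.1% |
| 184.0 | 36 | 0.1% |
| 212.0 | 35 | 0.1% |

Smallest values

| Value | Count | Frequency (%) |
|---|---|---|
| 2.0 | 3 | 0.0% |
| 3.0 | 1 | 0.0% |
| 6.0 | 3 | 0.0% |
| 10.0 | 1 | 0.0% |
| 12.0 | 1 | 0.0% |

Largest values

| Value | Count | Frequency (%) |
|---|---|---|
| 800.0 | 1 | 0.0% |
| 801.0 | 1 | 0.0% |
| 804.0 | 1 | 0.0% |
| 816.0 | 1 | 0.0% |
| 817.0 | 1 | 0.0% |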

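Variable: y (Numeric)

| Statistic | Value |
|---|---|
| Distinct count | 723 |
| Unique (%) | 2.5% |
| Missing (%) | 69.1% |
| Missing (n) | 20025 |
| Infinite (%) | 0.0% |
| Infinite (n) | 0 |
| Mean | 363.14 |
| Minimum | 2 |
| Maximum | 881 |
| Zeros (%) | 0.0% |

Quantile statistics

| Statistic | Value |
|---|---|
| Minimum | 2 |
| 5-th percentile | 134 |
| Q1 | 246 |
| Median | 360 |
| Q3 | 475 |
| 95-th percentile | 610 |
| Maximum | 881 |
| Range | 879 |
| Interquartile range | 229 |

Descriptive statistics

| Statistic | Value |
|---|---|
| Standard deviation | 148.61 |
| Coef of variation | 0.40923 |
| Kurtosis | -0.62554 |
| Mean | 363.14 |
| MAD | 124.19 |
| Skewness | 0.16134 |
| Sum | 3255200 |
| Variance | 22084 |
| Memory size | 226.6 KiB |

Most frequent values

| Value | Count | Frequency (%) |
|---|---|---|
| 257.0 | 37 | 0.1% |
| 420.0 | 34 | 0.1% |
| 436.0 | 34 | 0.1% |
| 498.0 | 33 | 0.1% |
| 378.0 | 31 | 0.1% |
| 246.0 | 30 | 0.1% |
| 398.0 | 30 | 0.1% |
| 292.0 | 30 | 0.1% |
| 445.0 | 29 | 0.1% |
| 356.0 | 29 | 0.1% |

Smallest values

| Value | Count | Frequency (%) |
|---|---|---|
| 2.0 | 2 | 0.0% |
| 3.0 | 1 | 0.0% |
| 6.0 | 1 | 0.0% |
| 8.0 | 1 | 0.0% |
| 10.0 | 2 | 0.0% |

Largest values

| Value | Count | Frequency (%) |
|---|---|---|
| 836.0 | 1 | 0.0% |
| 845.0 | 1 | 0.0% |
| 856.0 | 1 | 0.0% |
| 859.0 | 1 | 0.0% |
| 881.0 | 1 | 0.0% |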

Sample rows

| | patientId | x | y | width | height | Target |
|---|---|---|---|---|---|---|
| 0 | 0004cfab-14fd-4e49-80ba-63a80b6bddd6 | | | | | 0 |
| 1 | 00313ee0-9eaa-42f4-b0ab-c148ed3241cd | | | | | 0 |
| 2 | 00322d4d-1c29-4943-afc9-b6754be640eb | | | | | 0 |
| 3 | 003d8fa0-6bf1-40ed-b54c-ac657f8495c5 | | | | | 0 |
| 4 | 00436515-870c-4b36-a041-de91049b9ab4 | 264.0 | 152.0 | 213.0 | 379.0 | 1 |


#### Sample Submission


In [19]:
sample_submission = pd.read_csv('./data/stage_1_sample_submission.csv')

In [20]:
sample_submission.head()

| | patientId | PredictionString |
|---|---|---|
| 0 | 000924cf-0f8d-42bd-9158-1af53881a557 | 0.5 0 0 100 100 |
| 1 | 000db696-cf54-4385-b10b-6b16fbb3f985 | 0.5 0 0 100 100 |
| 2 | 000fe35a-2649-43d4-b027-e67796d412e0 | 0.5 0 0 100 100 |
| 3 | 001031d9-f904-4a23-b3e5-2c088acd19c6 | 0.5 0 0 100 100 |
| 4 | 0010f549-b242-4e94-87a8-57d79de215fc | 0.5 0 0 100 100 |


In [21]:
sample_submission.shape

(1000, 2)

In [37]:
sample_submission.isnull().sum()

patientId           0
PredictionString    0
dtype: int64

In [43]:
sample_submission['PredictionString'].value_counts() #-_-

0.5 0 0 100 100    1000
Name: PredictionString, dtype: int64
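
Every sample row holds the same placeholder box, but the string still follows the `confidence x-min y-min width height` layout, so it can be split back into boxes. This is a minimal parsing sketch using only the standard library; the function name is hypothetical.

```python
def parse_prediction_string(prediction):
    """Split a PredictionString into (confidence, x_min, y_min, width, height) tuples."""
    values = [float(token) for token in prediction.split()]
    if len(values) % 5 != 0:
        raise ValueError("expected groups of 5 numbers per bounding box")
    # Walk the flat list five values at a time, one group per box
    return [tuple(values[i:i + 5]) for i in range(0, len(values), 5)]


print(parse_prediction_string("0.5 0 0 100 100"))
```

An empty string parses to an empty list, which corresponds to a row with no predicted boxes.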

In [22]:
pp.ProfileReport(sample_submission)

in singular transformations; automatically expanding.
left=-0.5, right=-0.5
  'left=%s, right=%s') % (left, right))
in singular transformations; automatically expanding.
bottom=-0.5, top=-0.5
  'bottom=%s, top=%s') % (bottom, top))


ZeroDivisionError: float division by zero

#### Credits

In [35]:
# Read the credits link inside a context manager so the file handle gets closed
with open('./data/GCP Credits Request Link - RSNA.txt', 'r') as credits_file:
    print(credits_file.read())

https://www.kaggle.com/GCP-Credits-Form-RSNA-Pneumonia

