In this notebook I build several DataFrames of patient data, possibly alongside image pixel arrays of different sizes.


In [23]:
import glob, pylab, pandas as pd
import pydicom, numpy as np
import pandas_profiling as pp

In [24]:
# Load all of the target data (x, y, width, height), even though for now I only care whether the person has pneumonia or not.
df = pd.read_csv('../data/stage_1_train_labels.csv')
# Example of a "Normal" person
print(df.iloc[0])

patientId    0004cfab-14fd-4e49-80ba-63a80b6bddd6
x                                             NaN
y                                             NaN
width                                         NaN
height                                        NaN
Target                                          0
Name: 0, dtype: object


In [25]:
# Example of a person with pneumonia
print(df.iloc[4])

patientId    00436515-870c-4b36-a041-de91049b9ab4
x                                             264
y                                             152
width                                         213
height                                        379
Target                                          1
Name: 4, dtype: object


In [27]:
# Retrieving the data out of the .dcm file
patientId = df['patientId'][0]
dcm_file = '../data/stage_1_train_images/%s.dcm' % patientId
dcm_data = pydicom.dcmread(dcm_file)

print(dcm_data)

(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0016) SOP Class UID                       UI: Secondary Capture Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.2.276.0.7230010.3.1.4.8323329.28530.1517874485.775526
(0008, 0020) Study Date                          DA: '19010101'
(0008, 0030) Study Time                          TM: '000000.00'
(0008, 0050) Accession Number                    SH: ''
(0008, 0060) Modality                            CS: 'CR'
(0008, 0064) Conversion Type                     CS: 'WSD'
(0008, 0090) Referring Physician's Name          PN: ''
(0008, 103e) Series Description                  LO: 'view: PA'
(0010, 0010) Patient's Name                      PN: '0004cfab-14fd-4e49-80ba-63a80b6bddd6'
(0010, 0020) Patient ID                          LO: '0004cfab-14fd-4e49-80ba-63a80b6bddd6'
(0010, 0030) Patient's Birth Date                DA: ''
(0010, 0040) Patient's Sex                       CS: 'F'
(0010, 1010) Patient'

The data retrieved from the `.dcm` file is important. It holds the pixel array that we want to model, and it also carries metadata such as age, sex, and view position that we might want to filter on.

In [28]:
# Creating a DataFrame of all of the information minus the pictures in the .dcm files
all_info_cols = ['patientId', 'specificCharacterSet', 'modality', 'conversionType', 'sex', 'age', 'viewPosition', 'photometricInterpretation']

In [30]:
all_info = []
for i in df['patientId']:
    dcm_file = '../data/stage_1_train_images/%s.dcm' % i
    dcm_data = pydicom.dcmread(dcm_file)
    all_info.append([i, dcm_data.SpecificCharacterSet, dcm_data.Modality, dcm_data.ConversionType, dcm_data.PatientSex, dcm_data.PatientAge, dcm_data.ViewPosition, dcm_data.PhotometricInterpretation])


# Looking at the different unique values for the potentially interesting datapoints

#print(dcm_data.SpecificCharacterSet.unique())

KeyboardInterrupt: 
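
The loop above reads each file in full, pixel data included, and opens one file per label row, so a `patientId` that appears on several rows is read several times. Below is a minimal sketch of a lighter pass. It assumes pydicom 1.0 or later, where `dcmread` accepts `stop_before_pixels=True`, and it assumes one row per distinct patient is what is wanted. The helper name `collect_metadata` and its `reader` parameter are hypothetical, added only so the function can be tried without real DICOM files.

```python
from types import SimpleNamespace


def collect_metadata(patient_ids, image_dir, reader=None):
    """Return one metadata row per distinct patient ID (hypothetical helper)."""
    if reader is None:
        # Imported lazily so the function can be exercised with a stand-in reader.
        import pydicom

        def reader(path):
            # stop_before_pixels=True stops parsing before the Pixel Data element,
            # so only the header is read.
            return pydicom.dcmread(path, stop_before_pixels=True)

    rows = []
    # dict.fromkeys drops repeated IDs while keeping first-seen order.
    for pid in dict.fromkeys(patient_ids):
        ds = reader(f"{image_dir}/{pid}.dcm")
        rows.append([pid, ds.PatientSex, ds.PatientAge, ds.ViewPosition])
    return rows


# Stand-in reader for a dry run; real use would omit the `reader` argument.
def fake_reader(path):
    return SimpleNamespace(PatientSex="F", PatientAge="51", ViewPosition="PA")


rows = collect_metadata(["a", "b", "a"], "some_dir", reader=fake_reader)
print(len(rows))  # 2
```

With the real data the call would look like `collect_metadata(df['patientId'], '../data/stage_1_train_images')`, and the remaining columns from `all_info_cols` could be appended in the same way.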

In [None]:
patient_data = pd.DataFrame(data=all_info, columns=all_info_cols)
patient_data.head()

In [None]:
all_info_cols

In [None]:
# Build a new list so that all_info_cols itself is not modified
columns = [col for col in all_info_cols if col not in ('patientId', 'age')]
columns.append('age')

In [None]:
for col in columns:
    print(f"{col}: {patient_data[col].value_counts()}")
    print('\n\n\n')
    

Potential things to look into: viewPosition, sex, age


In [None]:
# I want to see the difference between a PA and an AP X-ray
def show_image(patientId):
    dcm_file = '../data/stage_1_train_images/%s.dcm' % patientId
    dcm_data = pydicom.dcmread(dcm_file)
    im = dcm_data.pixel_array
    pylab.imshow(im, cmap=pylab.cm.gist_gray)
    pylab.axis('off')

In [None]:
pa_patient = patient_data[patient_data['viewPosition']=='PA'].iloc[1]['patientId']
show_image(pa_patient)

In [None]:
ap_patient = patient_data[patient_data['viewPosition']=='AP'].iloc[6]['patientId']
show_image(ap_patient)

In [None]:
set(patient_data['age'].unique()) - set([str(i) for i in range(1,101)]) 
#That can't be right...

The oldest person on record lived to 122 and died in France in 1997. I suspect the X-rays that list a patient's age as over 100 are bad records and could be misdiagnosed. Since that data isn't trustworthy, I'm deleting everything over 100.
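
The age check can be made explicit before anything is dropped. This is a minimal sketch on toy data, assuming ages arrive as strings as they do above. The 100-year cutoff is the one chosen in this notebook, and the IDs and ages in `toy` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for patient_data; these values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "patientId": ["p1", "p2", "p3", "p4"],
    "age": ["24", "155", "67", "100"],
})

# errors="coerce" turns anything non-numeric into NaN instead of raising.
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

MAX_AGE = 100  # cutoff used in this notebook

# Rows to inspect by hand before deleting them.
implausible = toy[toy["age"] > MAX_AGE]
# Rows that pass the check; NaN ages fail the comparison and are left out too.
kept = toy[toy["age"] <= MAX_AGE]

print(implausible["patientId"].tolist())
print(len(kept))
```

Keeping `implausible` as its own frame makes it easy to look at a few of those images before deciding they are unusable.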

In [None]:
test = patient_data[patient_data['age'] =='155'].iloc[0]['patientId']
show_image(test)

In [None]:
child = patient_data[patient_data['age']=='24'].iloc[0]['patientId']
show_image(child)

- Change age into an int in the df.
- Take out the very old people.
- Look into taking out the young ones as well.
- Create a model that uses only male (M), AP-view patients and train/test it. See if there's a difference.
- More EDA? The ages turned out to be wrong in places; is there any other potential for incorrect data? I'm not looking through 29k images...
- Stochastic gradient descent? Look into it and see if it is useful for a NN.
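
Before training a model on a single subgroup, it may help to check whether the label rate differs between subgroups at all. This is a minimal sketch on toy data, assuming the labels can be collapsed to one `Target` per patient and joined to the metadata on `patientId`. Every value in `labels` and `meta` is made up for illustration.

```python
import pandas as pd

# Toy labels: a patientId can repeat when a patient has more than one row.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3", "p4"],
    "Target": [0, 1, 1, 0, 1],
})
# Toy metadata: one row per patient.
meta = pd.DataFrame({
    "patientId": ["p1", "p2", "p3", "p4"],
    "sex": ["M", "M", "F", "M"],
    "viewPosition": ["PA", "AP", "PA", "AP"],
})

# Collapse labels to one Target per patient before joining.
per_patient = labels.groupby("patientId", as_index=False)["Target"].max()
merged = meta.merge(per_patient, on="patientId", how="left")

# Share of positive patients in each sex / view position group.
rates = merged.groupby(["sex", "viewPosition"])["Target"].mean()
print(rates)
```

A large gap between groups in `rates` would suggest that a model could lean on view position or sex instead of the image itself, which is worth knowing before comparing subgroup models.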


In [None]:
patient_data['age'] = patient_data['age'].astype(int)

In [None]:
patient_data.dtypes


In [None]:
pp.ProfileReport(patient_data)

In [None]:
feats_i_dont_need = ['conversionType', 'modality', 'photometricInterpretation', 'specificCharacterSet']

In [None]:
patient_data = patient_data.drop(columns=feats_i_dont_need)

In [None]:
# Getting rid of the overly ambitiously aged patient data
patient_data = patient_data[patient_data['age']<=100]

In [None]:
adult_patient_data = patient_data[patient_data['age']>=18]

In [None]:
adult_patient_data.columns

In [None]:
male_adult_patient_data = adult_patient_data[adult_patient_data['sex']=='M']
male_adult_patient_data.head(1)

In [None]:
#Going to save this to see if the age / sex had any major effects
male_adult_patient_data.to_csv('./male_adult_patient_data.csv', index=False)

In [33]:
testing_save = pd.read_csv('../male_adult_patient_data.csv')
testing_save.head(1)

                              patientId  sex  age  viewPosition
0  00322d4d-1c29-4943-afc9-b6754be640eb    M   19            AP


In [34]:
testing_save.dtypes

patientId       object
sex             object
age              int64
viewPosition    object
dtype: object

In [37]:
testing_save.duplicated().sum()

1763

In [44]:
testing_save[testing_save.duplicated()].sort_values('patientId')

                               patientId  sex  age  viewPosition
5   00704310-78a8-4b38-8475-49f4573b2dbb    M   75            PA
10  00f08de1-517e-4652-a04f-d1dc9ee48593    M   58            AP
14  010ccb9f-6d46-4380-af11-84f87397a1b8    M   21            AP
17  012a5620-d082-4bb8-9b3b-e72d8938000c    M   60            AP
21  0174c4bb-28f5-41e3-a13f-a396badc18bd    M   23            AP
28  01b9e362-4950-40f5-88fa-7557ac2a45bb    M   47            AP
34  01cad8d0-45cd-4603-b099-94055d322310    M   42            AP
47  02002619-3dea-4038-8d4d-458db30ed8de    M   60            AP
49  020a16e3-baf9-4cf0-859c-c79b5253d717    M   61            AP
58  0294b4aa-5614-4b33-80a4-2c31b639e5b3    M   38            AP
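
The duplicates are probably a side effect of how the metadata was collected: the loop went over every row of the labels file, and that file appears to hold more than one row for some patients, so those patients were read more than once. This is a minimal sketch of counting and removing the repeats on toy data, assuming one row per patient is what is wanted. The values in `toy` are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the saved file; values are illustrative only.
toy = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3", "p3", "p3"],
    "sex": ["M"] * 6,
    "age": [19, 75, 75, 58, 58, 58],
    "viewPosition": ["AP", "PA", "PA", "AP", "AP", "AP"],
})

# duplicated() marks every repeat after the first occurrence of a row.
n_dupes = int(toy.duplicated().sum())

# Keep the first row for each patient and renumber the index.
deduped = toy.drop_duplicates(subset="patientId").reset_index(drop=True)

print(n_dupes, len(deduped))  # 3 3
```

Deduplicating on `patientId` alone is safe here only because the metadata is identical across a patient's rows; if that ever stops holding, `drop_duplicates()` without `subset` would be the more cautious choice.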
