# Map X-ray Barcode to OAI Site

The DICOM metadata for each X-ray lists a clinical trial site ID. There are about 16 distinct values. These are presumably different X-ray machine locations across the 5 research sites that make up OAI. This notebook creates a map from each X-ray barcode to the actual hospital site the patient attended (one of 5 sites).

# Imports

In [1]:
import os
import pandas as pd
import pickle

import OAI_Utilities as utils # ln -s ../../OAI/notebooks/OAI_Utilities.py

# Constants

In [12]:
OAI_PKL_PATH = '/Users/brandong.hill/code/OAI/notebooks/pkl/'
idxSlc = pd.IndexSlice

VARIABLES_OF_INTEREST = [
    "(0008, 0070) Manufacturer",
    "(0008, 1090) Manufacturer's Model Name",
    "(0012, 0030) Clinical Trial Site ID",
    "(0018, 1000) Device Serial Number",
]

# Import data

In [3]:
# Read in the dataframes for analysis
# (pd.read_pickle accepts a path directly, so no file handle is left open)
metadata_df = pd.read_pickle('../../OAI_DICOM/pkl/dicom_metadata_df.pkl')
allclinical_df = pd.read_pickle(os.path.join(OAI_PKL_PATH, 'allclinical_values.pkl'))

xray_df = pd.read_pickle(os.path.join(OAI_PKL_PATH, 'xray_values.pkl'))
xray_df = xray_df[xray_df['XRBARCD'] != '']
xray_df = xray_df[xray_df['EXAMTP'] == 'Bilateral PA Fixed Flexion Knee']  # Let's only consider x-rays that we might use in the deep learning

enrollees_df = pd.read_pickle(os.path.join(OAI_PKL_PATH, 'enrollees_values.pkl'))

# Create unified dataframe

In [161]:
# Map of patient IDs to barcodes, also drop 4 extraneous digits in barcode
# Result= ID: XRBARCD
barcode_site_id_df = pd.DataFrame(xray_df['XRBARCD'].str[4:].reset_index('Visit'))
print('{:,}'.format(len(barcode_site_id_df)))

# Result= XRBARCD: ID, SITE, RACE, Visit, INCOME, INCOME2, 'COMORB', '(0008, 1090) Manufacturer's Model Name', '(0012, 0030) Clinical Trial Site ID', ....
barcode_site_id_df = barcode_site_id_df.join(enrollees_df[['SITE', 'RACE']], how='left')  # Add hospital site and patient race
barcode_site_id_df = barcode_site_id_df.join(allclinical_df.loc[idxSlc[:, 'V00'], :][['INCOME', 'INCOME2', 'COMORB']].reset_index('Visit', drop=True), on='ID', how='left') # Add starting income and comorbidities
barcode_site_id_df = barcode_site_id_df.reset_index('ID').set_index('XRBARCD')  # Switch to index by barcode
barcode_site_id_df = barcode_site_id_df.join(metadata_df[VARIABLES_OF_INTEREST]) # Add Mfg model, and Clinical Site ID (xray machine location)
print('{:,}'.format(len(barcode_site_id_df)))  # Sanity check, the joins shouldn't increase the number of entries

26,522
26,522
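
The length comparison above only detects duplicated rows after the fact. As a complementary sketch on toy data (the frames, column values, and variable names below are invented for illustration and are not the OAI tables), `DataFrame.merge` can enforce the many-to-one assumption directly through its `validate` argument:

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are made up)
xrays = pd.DataFrame({'ID': [1, 1, 2], 'XRBARCD': ['0001', '0002', '0003']})
enrollees = pd.DataFrame({'ID': [1, 2], 'SITE': ['A', 'B']})

# validate='m:1' raises pandas.errors.MergeError if an ID is duplicated in
# `enrollees`, which is the case that would silently add rows to a left join
merged = xrays.merge(enrollees, on='ID', how='left', validate='m:1')

# A duplicated key on the right side is caught instead of inflating the row count
dup_enrollees = pd.concat([enrollees, enrollees.iloc[[0]]], ignore_index=True)
try:
    xrays.merge(dup_enrollees, on='ID', how='left', validate='m:1')
    caught = False
except pd.errors.MergeError:
    caught = True
```

This is only one way to guard the joins; comparing `len()` before and after, as the cell above does, remains a useful final check.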


# Dump dict of barcode to hospital site

In [7]:
with open('pkl/barcode_to_site_id.dict.pkl', 'wb') as f:  # Context manager closes the file after writing
    pickle.dump(barcode_site_id_df['SITE'].to_dict(), f)
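
For reference, a downstream script could read the dumped dict back roughly as follows. This is a minimal sketch: the barcodes and sites are hypothetical placeholders, and a temporary directory stands in for the notebook's `pkl/` folder.

```python
import os
import pickle
import tempfile

# Hypothetical barcode -> site entries, standing in for
# barcode_site_id_df['SITE'].to_dict()
barcode_to_site = {'00000001': 'A', '00000002': 'C'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'barcode_to_site_id.dict.pkl')
    with open(path, 'wb') as f:
        pickle.dump(barcode_to_site, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# Look up a barcode; .get() avoids a KeyError for barcodes not in the map
site = loaded.get('00000002')
missing = loaded.get('99999999', 'unknown')
```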

# Look at patterns in missing values

In [14]:
barcode_site_id_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26522 entries, 00839603 to 03983201
Data columns (total 11 columns):
 #   Column                                  Non-Null Count  Dtype   
---  ------                                  --------------  -----   
 0   ID                                      26522 non-null  uint64  
 1   SITE                                    26522 non-null  category
 2   RACE                                    26522 non-null  category
 3   Visit                                   26522 non-null  category
 4   INCOME                                  26522 non-null  category
 5   INCOME2                                 26522 non-null  category
 6   COMORB                                  26236 non-null  UInt8   
 7   (0008, 0070) Manufacturer               24785 non-null  category
 8   (0008, 1090) Manufacturer's Model Name  20612 non-null  category
 9   (0012, 0030) Clinical Trial Site ID     26345 non-null  object  
 10  (0018, 1000) Device Serial Number       1

In [65]:
def get_site_distribution(col, df):
    """Return a table of counts of each value of `col` per SITE, with row and column totals."""
    mfr_cnts = df[col].value_counts(dropna=False)

    mfr_by_site_df = df.groupby('SITE')[col].value_counts(dropna=False).unstack()  # Get site counts
    mfr_by_site_df = mfr_by_site_df[mfr_cnts.index]  # Reorder by total frequency
    mfr_by_site_df = pd.concat([mfr_by_site_df, pd.DataFrame(mfr_cnts).T.rename({'count':'Total'}, axis=0)], axis=0) # Add mfr totals
    mfr_by_site_df = pd.concat([mfr_by_site_df, mfr_by_site_df.sum(axis=1)], axis=1).rename(columns={0: 'Total'}) # Add site totals
    mfr_by_site_df = mfr_by_site_df.rename(columns={'': "''"}) # Make the empty string obvious
    mfr_by_site_df = mfr_by_site_df.replace({0: ''}) # Remove clutter of 0's
    return mfr_by_site_df

In [132]:
# How many for each brand of machine?
var = "(0008, 0070) Manufacturer"
print('# of mfrs.: ', len(barcode_site_id_df[var].unique()))
get_site_distribution(var, barcode_site_id_df)

# of mfrs.:  12


SITE,AGFA,Swissray,Agfa-Gevaert AG,"""GE Healthcare""",FUJIFILM Corporation,LS100,NaN,"FUJI PHOTO FILM Co., ltd.",GE Healthcare,SIEMENS,Philips Medical Systems,'',Total
A,4.0,3959.0,2.0,3.0,,1.0,5.0,1.0,,,65.0,,4040
B,,,,,2208.0,1887.0,1635.0,,,,,1.0,5731
C,,,3936.0,3883.0,,,,,,,,,7819
D,4727.0,,,,1495.0,,,,406.0,,,,6628
E,,402.0,,,,249.0,97.0,1201.0,,325.0,30.0,,2304
Total,4731.0,4361.0,3938.0,3886.0,3703.0,2137.0,1737.0,1202.0,406.0,325.0,95.0,1.0,26522


It was reasonable to expect a manufacturer to appear at more than one site. What is odd is cases like AGFA at Site A, where only 4 X-rays carry that manufacturer name. That count is so small that it looks more like an anomaly than a site that uses more than one brand. Several such small counts exist for Site A. Patients could have moved between cities during the study, but would everyone move to Site A?

Except for FUJIFILM Corporation, each manufacturer is strongly favored by a single site. Even a missing name is largely indicative of the site (1,635 of the 1,737 missing values come from Site B).
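
One way to quantify "favored by a site" is the share of each manufacturer's X-rays that come from its most common site. The sketch below uses invented toy rows and a shortened column name (`mfr`), so the numbers are illustrative only:

```python
import pandas as pd

# Toy data; SITE and mfr values are invented for illustration
toy = pd.DataFrame({
    'SITE': ['A', 'A', 'A', 'B', 'B', 'D', 'D', 'D'],
    'mfr':  ['Swissray', 'Swissray', 'AGFA', 'FUJIFILM',
             'FUJIFILM', 'AGFA', 'AGFA', 'FUJIFILM'],
})

counts = pd.crosstab(toy['mfr'], toy['SITE'])     # rows: mfr, columns: site
shares = counts.div(counts.sum(axis=1), axis=0)   # each row sums to 1

summary = pd.DataFrame({
    'top_site': shares.idxmax(axis=1),   # site contributing the most x-rays
    'top_share': shares.max(axis=1),     # fraction of x-rays from that site
    'n': counts.sum(axis=1),             # total x-rays for the manufacturer
})
print(summary)
```

Note that `pd.crosstab` drops missing values by default, so a missing manufacturer would need something like `fillna('NaN')` first to be included as its own row.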

In [138]:
# 1,737 x-rays don't list a mfr. When it is missing, do they have other info?
tmp = barcode_site_id_df[barcode_site_id_df[var].isna()]["(0008, 1090) Manufacturer's Model Name"].value_counts()  # all NaN
print(tmp[tmp > 0])
tmp = barcode_site_id_df[barcode_site_id_df[var].isna()]["(0018, 1000) Device Serial Number"].value_counts()  # all NaN
print(tmp[tmp > 0])
tmp = barcode_site_id_df[barcode_site_id_df[var].isna()]["(0012, 0030) Clinical Trial Site ID"].value_counts() # not all NaN
print(tmp[tmp > 0])

Series([], Name: count, dtype: int64)
Series([], Name: count, dtype: int64)
(0012, 0030) Clinical Trial Site ID
11    1248
10     385
23      95
22       2
35       2
34       1
 1       1
I1       1
Name: count, dtype: int64


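It is reasonable that when the manufacturer is missing, the more detailed machine info (model name and serial number) is missing too. The clinical trial site ID, however, is still available for these X-rays.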
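
The same co-missingness check can be written more compactly by counting non-null values per column among the rows that lack a manufacturer. This is a sketch on toy data with shortened, hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy data; column names are shortened stand-ins for the DICOM tag columns
toy = pd.DataFrame({
    'mfr':     ['AGFA', np.nan, np.nan, 'Swissray'],
    'model':   ['ADC_51xx', np.nan, np.nan, 'ddR Formula System'],
    'serial':  ['1134', np.nan, np.nan, np.nan],
    'site_id': ['58', '11', '10', '35'],
})

# Restrict to rows with no manufacturer, then count what else is present
no_mfr = toy[toy['mfr'].isna()]
present = no_mfr[['model', 'serial', 'site_id']].notna().sum()
print(present)
```

One caveat: `notna()` treats an empty string as present, so columns that use `''` for missing values would need those replaced with NaN first.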

In [139]:
# How many for each model name?
var = "(0008, 1090) Manufacturer's Model Name"
print('# of models.: ', len(barcode_site_id_df[var].unique()))
get_site_distribution(var, barcode_site_id_df)

# of models.:  16


SITE,NaN,ADC_51xx,ADC_5146,"""Definium 5000""",ddR Modulaire System,Lumisys,ddR Formula System,'',"""Thunder Platform""",Discovery XR656,SIEMENS FD-X,DigitalDiagnost,ddR Multi System,ddR Combi System,digital DIAGNOST,FLUOROSPOT_COMPACT,Total
A,6.0,4.0,2.0,2.0,2693.0,1.0,1233.0,,1.0,,,64.0,29.0,4.0,1.0,,4040
B,3844.0,,,,,1785.0,,102.0,,,,,,,,,5731
C,,,3936.0,3468.0,,,,,415.0,,,,,,,,7819
D,1495.0,4727.0,,,,,,,,406.0,,,,,,,6628
E,565.0,,,,,180.0,402.0,802.0,,,324.0,29.0,,,1.0,1.0,2304
Total,5910.0,4731.0,3938.0,3470.0,2693.0,1966.0,1635.0,904.0,416.0,406.0,324.0,93.0,29.0,4.0,2.0,1.0,26522


In [145]:
# Map brand names to model
for val in barcode_site_id_df['(0008, 0070) Manufacturer'].value_counts().index:
    print(val, list(barcode_site_id_df[barcode_site_id_df['(0008, 0070) Manufacturer'] == val][var].unique()))

AGFA ['ADC_51xx']
Swissray ['ddR Formula System', 'ddR Modulaire System', 'ddR Multi System', 'ddR Combi System']
Agfa-Gevaert AG ['ADC_5146']
"GE Healthcare" ['"Definium 5000"', '"Thunder Platform"']
FUJIFILM Corporation [nan]
LS100 ['Lumisys', '']
FUJI PHOTO FILM Co., ltd. [nan, '']
GE Healthcare ['Discovery XR656']
SIEMENS ['SIEMENS FD-X', 'FLUOROSPOT_COMPACT']
Philips Medical Systems ['DigitalDiagnost', 'digital DIAGNOST']
 [nan]


In [146]:
var = "(0018, 1000) Device Serial Number"
print('# of ser num.: ', len(barcode_site_id_df[var].unique()))
get_site_distribution(var, barcode_site_id_df)

# of ser num.:  19


SITE,NaN,2205,'',1134,S402607,1018,5094,1522,1845,963334016841,3677,1844,S401504/,5434,5052,1003,08.02.366,S401504,08.02.399,Total
A,36,2.0,2930.0,2.0,985.0,,,,,64.0,,,19.0,,,,,1.0,1.0,4040
B,5731,,,,,,,,,,,,,,,,,,,5731
C,3883,3899.0,,,,,,,,,37.0,,,,,,,,,7819
D,1895,,6.0,2580.0,,1334.0,476.0,322.0,,,,,,14.0,1.0,,,,,6628
E,1549,,,,400.0,,,,304.0,29.0,,20.0,,,,1.0,1.0,,,2304
Total,13094,3901.0,2936.0,2582.0,1385.0,1334.0,476.0,322.0,304.0,93.0,37.0,20.0,19.0,14.0,1.0,1.0,1.0,1.0,1.0,26522


The same serial number showing up at more than one site is curious. When E shares a serial number, it is always with A. I suspect that is because A and E are two different locations at Johns Hopkins. Did two X-rays from C's machine (serial 2205) and two from D's machine (serial 1134) move to A?
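
To list the shared serial numbers directly instead of reading them off the wide table, one option is to cross-tabulate serial number against site and keep the rows with nonzero counts at more than one site. The data and the `serial` column name below are invented for illustration:

```python
import pandas as pd

# Toy data; serial numbers and sites are made up
toy = pd.DataFrame({
    'SITE':   ['A', 'E', 'E', 'C', 'C', 'A', 'D'],
    'serial': ['S1', 'S1', 'S2', 'S3', 'S3', 'S3', 'S4'],
})

by_serial = pd.crosstab(toy['serial'], toy['SITE'])  # rows: serial, cols: site

# Keep only serial numbers with x-rays recorded at more than one site
shared = by_serial[(by_serial > 0).sum(axis=1) > 1]
print(shared)
```

The per-site counts in `shared` make it easy to tell a genuinely shared machine from a handful of stray records.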

In [147]:
# Map brand name to serial numbers
for val in barcode_site_id_df['(0008, 0070) Manufacturer'].value_counts().index:
    print(val, list(barcode_site_id_df[barcode_site_id_df['(0008, 0070) Manufacturer'] == val][var].unique()))

AGFA ['1134', '5094', '1018', '1522', '5434', '', '5052']
Swissray ['S402607', '', 'S401504/', nan, 'S401504']
Agfa-Gevaert AG ['2205', '3677']
"GE Healthcare" [nan]
FUJIFILM Corporation [nan]
LS100 [nan]
FUJI PHOTO FILM Co., ltd. [nan]
GE Healthcare [nan, '']
SIEMENS ['1845', '1844', '1003']
Philips Medical Systems ['963334016841', '08.02.366', '08.02.399']
 [nan]


In [149]:
var = "(0012, 0030) Clinical Trial Site ID"
print('# of Xray sites: ', len(barcode_site_id_df[var].unique()))
get_site_distribution(var, barcode_site_id_df)

# of Xray sites:  17


SITE,46,58,11,35,23,10,34,NaN,22,I1,40,38,33,1,5,41,55,Total
A,5.0,2.0,,3626.0,2.0,,281.0,120.0,,,,1.0,1.0,,1.0,,1.0,4040
B,,,4479.0,,,1247.0,,,,4.0,,,,1.0,,,,5731
C,7814.0,,,,,,,1.0,,,3.0,,,,,1.0,,7819
D,,6572.0,,,,,,56.0,,,,,,,,,,6628
E,,,,1.0,2211.0,,,,92.0,,,,,,,,,2304
Total,7819.0,6574.0,4479.0,3627.0,2213.0,1247.0,281.0,177.0,92.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,26522


In [151]:
# Map clinical trial sites to serial numbers
for val in barcode_site_id_df[var].value_counts().index:
    print(val, list(barcode_site_id_df[barcode_site_id_df[var] == val]["(0018, 1000) Device Serial Number"].unique()))

46 ['2205', nan, '3677']
58 ['1134', '5094', '1018', nan, '1522', '5434', '', '5052']
11 [nan]
35 ['', 'S402607', 'S401504/', '963334016841', nan, 'S401504', '08.02.399']
23 [nan, 'S402607', '1845', '1844', '963334016841', '08.02.366', '1003']
10 [nan]
34 ['', 'S402607', '963334016841', nan, 'S401504/']
22 [nan, 'S402607']
I1 [nan]
40 ['2205', '3677']
 1 [nan]
41 ['2205']
 5 ['']
33 ['']
38 ['']
55 ['']


In [168]:
# Map mfr to clinical trial site IDs
for val in barcode_site_id_df['(0008, 0070) Manufacturer'].value_counts().index:
    print(val, list(barcode_site_id_df[barcode_site_id_df['(0008, 0070) Manufacturer'] == val][var].unique()))

AGFA ['58', '35', nan]
Swissray ['23', '35', nan, '34', ' 5', '22', '33', '38', '55']
Agfa-Gevaert AG ['46', '41', nan, '40']
"GE Healthcare" ['46']
FUJIFILM Corporation ['10', '58', '11', 'I1']
LS100 ['10', '23', '11', '22']
FUJI PHOTO FILM Co., ltd. ['23', '22']
GE Healthcare ['58']
SIEMENS ['23']
Philips Medical Systems ['35', '34', '23']
 ['11']


In [163]:
barcode_site_id_df[(barcode_site_id_df['SITE'] == 'A') & (barcode_site_id_df["(0008, 0070) Manufacturer"] == 'AGFA')]

XRBARCD,ID,Visit,SITE,RACE,INCOME,INCOME2,COMORB,"(0008, 0070) Manufacturer","(0008, 1090) Manufacturer's Model Name","(0012, 0030) Clinical Trial Site ID","(0018, 1000) Device Serial Number"
232004,9091787,V00,A,1: White or Caucasian,5: $100K or greater,2: > $50K,0,AGFA,ADC_51xx,58,1134.0
1206404,9091787,V01,A,1: White or Caucasian,5: $100K or greater,2: > $50K,0,AGFA,ADC_51xx,58,1134.0
1645304,9141244,V01,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,AGFA,ADC_51xx,35,
1645204,9645082,V01,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,2,AGFA,ADC_51xx,35,


In [164]:
barcode_site_id_df[barcode_site_id_df['ID'] == 9091787]

XRBARCD,ID,Visit,SITE,RACE,INCOME,INCOME2,COMORB,"(0008, 0070) Manufacturer","(0008, 1090) Manufacturer's Model Name","(0012, 0030) Clinical Trial Site ID","(0018, 1000) Device Serial Number"
232004,9091787,V00,A,1: White or Caucasian,5: $100K or greater,2: > $50K,0,AGFA,ADC_51xx,58,1134.0
1206404,9091787,V01,A,1: White or Caucasian,5: $100K or greater,2: > $50K,0,AGFA,ADC_51xx,58,1134.0
2443702,9091787,V05,A,1: White or Caucasian,5: $100K or greater,2: > $50K,0,Swissray,ddR Modulaire System,35,


In [165]:
barcode_site_id_df[barcode_site_id_df['ID'] == 9141244]

XRBARCD,ID,Visit,SITE,RACE,INCOME,INCOME2,COMORB,"(0008, 0070) Manufacturer","(0008, 1090) Manufacturer's Model Name","(0012, 0030) Clinical Trial Site ID","(0018, 1000) Device Serial Number"
744304,9141244,V00,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Modulaire System,35,
1645304,9141244,V01,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,AGFA,ADC_51xx,35,
2259301,9141244,V03,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Modulaire System,35,
2662701,9141244,V05,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Modulaire System,35,
3443003,9141244,V06,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Modulaire System,35,
3755001,9141244,V08,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Formula System,35,S402607
4054602,9141244,V10,A,1: White or Caucasian,4: $50K to < $100K,2: > $50K,0,Swissray,ddR Formula System,35,S402607


In [166]:
# Map patient sites to clinical trial site IDs
for val in barcode_site_id_df['SITE'].value_counts().index:
    print(val, list(barcode_site_id_df[barcode_site_id_df['SITE'] == val][var].unique()))

C ['46', '41', nan, '40']
D ['58', nan]
B ['10', '11', ' 1', 'I1']
A ['35', nan, '34', '23', '58', ' 5', '46', '33', '38', '55']
E ['23', '22', '35']


This is the same information as the site-by-clinical-trial-site-ID table above, just in a different format.
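
Since most clinical trial site IDs are dominated by a single patient site, one possible way to summarize the relationship is a majority-vote lookup from site ID to patient site that ignores IDs seen only a handful of times. This is a sketch and not something the notebook itself does: the toy rows, the `site_id` column name, and the `MIN_COUNT` cutoff are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy data; site IDs and sites are invented for illustration
toy = pd.DataFrame({
    'site_id': ['46', '46', '46', '35', '35', '23', '23', '35', '41'],
    'SITE':    ['C',  'C',  'A',  'A',  'A',  'E',  'E',  'E',  'C'],
})

MIN_COUNT = 2  # hypothetical cutoff: skip IDs with fewer x-rays than this

# Strip whitespace so values like ' 1' and '1' are treated as the same ID
counts = pd.crosstab(toy['site_id'].str.strip(), toy['SITE'])
totals = counts.sum(axis=1)
majority = counts.idxmax(axis=1)  # most frequent patient site per ID

site_id_to_site = majority[totals >= MIN_COUNT].to_dict()
print(site_id_to_site)
```

Any cutoff like this discards the rare IDs entirely, so whether those records are typos or real machine locations would still need to be checked by hand.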

In [78]:
# Let's look at only those x-rays that have a mfr name
var = "(0012, 0030) Clinical Trial Site ID"
get_site_distribution(var, barcode_site_id_df[~barcode_site_id_df["(0008, 0070) Manufacturer"].isna()])

SITE,46,58,35,11,23,10,34,NaN,22,I1,40,41,5,33,38,55,Total
A,5.0,2.0,3624.0,,2.0,,280.0,118.0,,,,,1.0,1.0,1.0,1.0,4035
B,,,,3231.0,,862.0,,,,3.0,,,,,,,4096
C,7814.0,,,,,,,1.0,,,3.0,1.0,,,,,7819
D,,6572.0,,,,,,56.0,,,,,,,,,6628
E,,,1.0,,2116.0,,,,90.0,,,,,,,,2207
Total,7819.0,6574.0,3625.0,3231.0,2118.0,862.0,280.0,175.0,90.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,24785


In [79]:
# Let's look at only those x-rays that LACK a mfr name
var = "(0012, 0030) Clinical Trial Site ID"
get_site_distribution(var, barcode_site_id_df[barcode_site_id_df["(0008, 0070) Manufacturer"].isna()])

SITE,11,10,23,NaN,22,35,34,1,I1,Total
A,,,,2.0,,2.0,1.0,,,5.0
B,1248.0,385.0,,,,,,1.0,1.0,1635.0
C,,,,,,,,,,
D,,,,,,,,,,
E,,,95.0,,2.0,,,,,97.0
Total,1248.0,385.0,95.0,2.0,2.0,2.0,1.0,1.0,1.0,1737.0
