# Feature Extraction

The dataset contains CT scans of lung patients. To extract features from them, we used two Python libraries, pylidc and pyradiomics. Before merging the two resulting feature sets, we also turned our target variable, which took values from 1 to 5, into a binary variable (benign vs. malignant).
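As a minimal sketch of the binarization step mentioned above, the snippet below maps a 1-to-5 rating to a binary label on toy data. The cut-off is an assumption made for illustration (1-2 as benign, 4-5 as malignant, 3 dropped as indeterminate) and is not necessarily the rule used in this project; the column name `malignancy_binary` is also hypothetical.

```python
import pandas as pd

# Toy ratings on the 1-5 malignancy scale
toy = pd.DataFrame({"malignancy": [1, 2, 3, 4, 5]})

# ASSUMPTION: 1-2 -> benign (0), 4-5 -> malignant (1), 3 -> indeterminate (dropped)
mapping = {1: 0, 2: 0, 4: 1, 5: 1}
toy["malignancy_binary"] = toy["malignancy"].map(mapping)  # unmapped ratings become NaN

# Drop the indeterminate rows and restore an integer dtype
binary = toy.dropna(subset=["malignancy_binary"]).astype({"malignancy_binary": int})

print(binary)
```

Other choices, such as assigning rating 3 to one of the two classes, would only require changing the `mapping` dictionary.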

We start by importing relevant libraries.

In [4]:
import pylidc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

### PyLIDC

We create a dataframe to store all the information we aim to extract.

In [2]:
df_pylidc = pd.DataFrame(columns= [
                            'patient_id', 
                            'annotation_id',
                            'scan_id',
                            
                            'slice_thickness',
                            'pixel_spacing',

                            'subtlety', 
                            'internalStructure', 
                            'calcification', 
                            'sphericity', 
                            'margin', 
                            'lobulation', 
                            'spiculation', 
                            'texture', 
                        
                            'diameter',
                            'surface_area',
                            'volume',

                            'malignancy',
                            ])

We then extract the data and save it to a CSV file.

In [3]:
# collect one dict per annotation and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = []

for ann in pylidc.query(pylidc.Annotation).all():
    att = dict((col, "") for col in df_pylidc.columns)

    # patient and annotation identification
    att['patient_id'] = ann.scan.patient_id
    att['annotation_id'] = ann.id
    att['scan_id'] = ann.scan.id

    # scan-level features, read directly from the annotation's Scan object
    att['slice_thickness'] = ann.scan.slice_thickness
    att['pixel_spacing'] = ann.scan.pixel_spacing

    # radiologist ratings
    att['subtlety'] = ann.subtlety
    att['internalStructure'] = ann.internalStructure
    att['calcification'] = ann.calcification
    att['sphericity'] = ann.sphericity
    att['margin'] = ann.margin
    att['lobulation'] = ann.lobulation
    att['spiculation'] = ann.spiculation
    att['texture'] = ann.texture

    # geometric features computed from the annotation contours
    att['diameter'] = ann.diameter
    att['surface_area'] = ann.surface_area
    att['volume'] = ann.volume

    # target
    att['malignancy'] = ann.malignancy

    rows.append(att)

df_pylidc = pd.DataFrame(rows, columns=df_pylidc.columns)
df_pylidc.to_csv('pylidc_features.csv', sep=',', index=False)

#### Exploring the data attained

In [None]:
df_pylidc = pd.read_csv('pylidc_features.csv', index_col=False)
df_pylidc.head()

| | patient_id | annotation_id | scan_id | slice_thickness | pixel_spacing | subtlety | internalStructure | calcification | sphericity | margin | lobulation | spiculation | texture | diameter | surface_area | volume | malignancy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LIDC-IDRI-0078 | 1 | 1 | 3.0 | 0.65 | 5 | 1 | 6 | 3 | 4 | 1 | 1 | 5 | 20.840585 | 1124.125177 | 2439.30375 | 3 |
| 1 | LIDC-IDRI-0078 | 2 | 1 | 3.0 | 0.65 | 4 | 1 | 6 | 4 | 4 | 1 | 2 | 5 | 19.5 | 1135.239277 | 2621.82375 | 3 |
| 2 | LIDC-IDRI-0078 | 3 | 1 | 3.0 | 0.65 | 5 | 1 | 4 | 3 | 5 | 2 | 3 | 5 | 23.300483 | 1650.898027 | 4332.315 | 4 |
| 3 | LIDC-IDRI-0078 | 4 | 1 | 3.0 | 0.65 | 5 | 1 | 6 | 4 | 2 | 4 | 1 | 5 | 32.810517 | 1994.684094 | 5230.33875 | 5 |
| 4 | LIDC-IDRI-0078 | 5 | 1 | 3.0 | 0.65 | 4 | 1 | 6 | 4 | 2 | 3 | 1 | 4 | 20.891206 | 1130.172711 | 2443.74 | 4 |


In [None]:
# number of individual patients assessed in the data
patients = np.unique(df_pylidc.patient_id)
print(f"Total number of patients: {len(patients)} \n")

df_pylidc.info()

Total number of patients: 875 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6859 entries, 0 to 6858
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patient_id         6859 non-null   object 
 1   annotation_id      6859 non-null   int64  
 2   scan_id            6859 non-null   int64  
 3   slice_thickness    6859 non-null   float64
 4   pixel_spacing      6859 non-null   float64
 5   subtlety           6859 non-null   int64  
 6   internalStructure  6859 non-null   int64  
 7   calcification      6859 non-null   int64  
 8   sphericity         6859 non-null   int64  
 9   margin             6859 non-null   int64  
 10  lobulation         6859 non-null   int64  
 11  spiculation        6859 non-null   int64  
 12  texture            6859 non-null   int64  
 13  diameter           6859 non-null   float64
 14  surface_area       6859 non-null   float64
 15  volume             6859 non-null   float

#### Standardizing Indexes

We start by sorting the dataframe by 'patient_id' and 'annotation_id', then number each patient's annotations starting from 1 in a new column, 'Ann_id', which we move to the front.

In [None]:
df_pylidc.sort_values(by=['patient_id', 'annotation_id'], inplace=True)

df_pylidc['Ann_id'] = df_pylidc.groupby('patient_id').cumcount() + 1
df_pylidc = df_pylidc[['Ann_id'] + [col for col in df_pylidc.columns if col != 'Ann_id']]
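To illustrate what the sort followed by `groupby(...).cumcount() + 1` produces, here is a small sketch on toy data. The patient and annotation IDs are made up; only the column names come from the dataframe above.

```python
import pandas as pd

# Toy data: two patients with two annotations each, in arbitrary order
toy = pd.DataFrame({
    "patient_id": ["LIDC-IDRI-0002", "LIDC-IDRI-0001", "LIDC-IDRI-0001", "LIDC-IDRI-0002"],
    "annotation_id": [40, 12, 7, 33],
})

# Sort so that each patient's annotations are contiguous and ordered
toy = toy.sort_values(by=["patient_id", "annotation_id"])

# cumcount() numbers the rows within each group from 0, so add 1 to start at 1
toy["Ann_id"] = toy.groupby("patient_id").cumcount() + 1

print(toy)
```

The global `annotation_id` values (7, 12, 33, 40) are replaced by a per-patient counter (1, 2, 1, 2), which is what allows the ID to be written as patient name plus a running number.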

We then build a new ID in the same format as the IDs we will get from PyRadiomics, drop the now-redundant columns, and save the dataframe to a new CSV file.

In [None]:
# build the ID as Patient_Name-Annotation_Number for all rows at once
df_pylidc['Id'] = df_pylidc['patient_id'] + '-' + df_pylidc['Ann_id'].astype(str)

df_pylidc = df_pylidc[['Id'] + [col for col in df_pylidc.columns if col != 'Id']]
df_pylidc = df_pylidc.drop(columns=['Ann_id', 'annotation_id', 'scan_id'])

df_pylidc.to_csv('pylidc_features_fixed.csv', sep=',', index=False)
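The point of the shared ID format is that the two feature sets can later be joined on it. The following is a minimal sketch of such a join on toy data, assuming an inner join; the notebook does not show how the merge is actually performed, and `some_feature` is a hypothetical column standing in for the PyRadiomics features.

```python
import pandas as pd

# Toy stand-in for the pylidc features (keyed by 'Id')
pylidc_toy = pd.DataFrame({
    "Id": ["LIDC-IDRI-0001-1", "LIDC-IDRI-0001-2", "LIDC-IDRI-0002-1"],
    "malignancy": [3, 5, 1],
})

# Toy stand-in for the PyRadiomics features (keyed by 'id'); one segmentation is missing
radiomics_toy = pd.DataFrame({
    "id": ["LIDC-IDRI-0001-1", "LIDC-IDRI-0002-1"],
    "some_feature": [0.42, 0.17],  # hypothetical feature column
})

# ASSUMPTION: inner join, so only nodules present in both tables are kept
merged = (
    pylidc_toy
    .merge(radiomics_toy, left_on="Id", right_on="id", how="inner")
    .drop(columns="id")
)

print(merged)
```

With an inner join, rows whose segmentation could not be processed by PyRadiomics simply drop out of the merged table.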

### PyRadiomics

PyRadiomics does not handle DICOM files directly. Its documentation, however, includes an experimental script (pyradiomics-dcm.py) that supports DICOM data by using plastimatch and dcmqi.

We modified this script so that it appends each new row to an existing CSV file that initially contains only the header row, and used it for our feature extraction.
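The append-to-CSV idea can be sketched as follows. This is not the modified pyradiomics-dcm.py itself, only an illustration using Python's standard `csv` module; the file name, the column names, and the helper `append_row` are hypothetical.

```python
import csv
import os
import tempfile

header = ["id", "feature_a", "feature_b"]  # hypothetical column names

path = os.path.join(tempfile.mkdtemp(), "features.csv")

# Create the file once, containing only the header row
with open(path, "w", newline="") as f:
    csv.DictWriter(f, fieldnames=header).writeheader()

def append_row(csv_path, row):
    """Append one result row, aligned by name with the existing header."""
    with open(csv_path, newline="") as f:
        fieldnames = next(csv.reader(f))  # read the header back from the file
    with open(csv_path, "a", newline="") as f:
        # restval fills features missing from `row` with an empty cell
        csv.DictWriter(f, fieldnames=fieldnames, restval="").writerow(row)

append_row(path, {"id": "LIDC-IDRI-0001-1", "feature_a": 1.5, "feature_b": 2.0})
append_row(path, {"id": "LIDC-IDRI-0001-2", "feature_a": 0.3})  # feature_b missing

with open(path, newline="") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Writing by column name rather than by position is a design choice in this sketch: a feature that could not be computed leaves an empty cell in its own column instead of moving later values into the wrong columns.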

The script takes as input:
- a directory with the input DICOM series
- the file name pointing to a DICOM Segmentation Image (DICOM SEG) object
- the file name pointing to a yaml with the desired extraction parameters
- the file name pointing to a tsv dictionary that maps pyradiomics feature names to the IBSI defined features
- the ID for that segmentation

With this in place, we only needed to iterate over all patients and their segmentations to run the extraction.

In [None]:
import subprocess

main_directory = os.listdir()

for patient in main_directory:
    # enter each patient's folder, skipping the output/temp folders and any stray files in the main directory
    if (not os.path.isdir(patient)) or patient in ("OutputSR", "TempDir"):
        continue

    folders = os.listdir(patient)  # folders inside the patient folder (CT scan and X-ray)
    path = ""
    content = 0

    # find the CT-scan folder: it has the most entries (one directory per segmentation plus the input DICOM series)
    for folder in folders:
        segmentations_or_series = os.listdir(os.path.join(patient, folder))
        if len(segmentations_or_series) > content:
            content = len(segmentations_or_series)
            path = folder

    folders = os.listdir(os.path.join(patient, path))
    main = ""

    # find the series folder (the first one that is not an annotation)
    for folder in folders:
        if "Annotation" not in folder:
            main = folder
            break

    seg_index = 1

    # extract features for each segmentation, naming each row Patient_Name-Segmentation_Number
    for folder in folders:
        if "Segmentation" in folder:
            print(f"\n\nPATIENT {patient} - SEGMENTATION {seg_index}")
            # pass the arguments as a list so that paths need no manual quoting or escaping
            subprocess.run([
                "python", "pyradiomics-dcm.py",
                "--input-image-dir", os.path.join(patient, path, main),
                "--input-seg-file", os.path.join(patient, path, folder, "1-1.dcm"),
                "--output-dir", "OutputSR",
                "--temp-dir", "TempDir",
                "--parameters", "Pyradiomics_Params.yaml",
                "--features-dict", "featuresDict.tsv",
                "--name", f"{patient}-{seg_index}",
            ])
            seg_index += 1


The script above takes days to run for all patients, so we chose not to run it again inside the notebook.

Some features are missing from the resulting CSV file because some images were too small for the Log filter to be applied, and some segmentations were not processed at all because their masks had too few dimensions or voxels.

Regardless, our final PyRadiomics CSV looks as follows.

#### Exploring the data attained

In [5]:
df_pyradiomics = pd.read_csv('pyradiomics_features.csv', index_col=False)
df_pyradiomics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4696 entries, 0 to 4695
Columns: 1599 entries, id to logarithm_ngtdm_Strength
dtypes: float64(1565), int64(4), object(30)
memory usage: 57.3+ MB




In [6]:
# number of unique segmentation IDs in the data (each id has the form Patient_Name-Segmentation_Number)
ids = np.unique(df_pyradiomics.id)
print(f"Total number of unique IDs: {len(ids)} \n")

df_pyradiomics.info()

Total number of unique IDs: 4688 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4696 entries, 0 to 4695
Columns: 1599 entries, id to logarithm_ngtdm_Strength
dtypes: float64(1565), int64(4), object(30)
memory usage: 57.3+ MB


Looking at the whole dataframe, we could see that in the rows where the Log filter had not been applied correctly, the values that followed had been written into the wrong columns.

#### Fixing misplaced cells

In [7]:
df_pyradiomics.iloc[:, 143:] = df_pyradiomics.iloc[:, 143:].apply(lambda x: x.shift(273) if x.isna().any() else x, axis=1)
df_pyradiomics.to_csv('pyradiomics_extraction_fixed.csv', index=False)
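To make the row-wise shift easier to follow, here is the same idea on a toy dataframe. The offset of 2 and the column names are illustrative only (the cell above uses an offset of 273 starting at column 143), and `shift_if_incomplete` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Row 0 is complete; row 1 has its values written two columns too far to the left
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, np.nan]],
    columns=["a", "b", "c", "d"],
)

def shift_if_incomplete(row, offset):
    """Shift a row's values to the right by `offset` columns if it contains any NaN."""
    return row.shift(offset) if row.isna().any() else row

# axis=1 applies the function to each row, passed as a Series indexed by column name
fixed = toy.apply(lambda row: shift_if_incomplete(row, 2), axis=1)

print(fixed)
```

Note that this approach treats any NaN in a row as a sign of misplacement, so it assumes that correctly placed rows contain no missing values in the affected columns.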

### Bibliographic References

#### PyLIDC

Documentation: https://pylidc.github.io 

#### PyRadiomics

Documentation: https://pyradiomics.readthedocs.io/en/latest/

Publication: Van Griethuysen, J. J. M., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., Beets-Tan, R. G. H., Fillion-Robin, J. C., Pieper, S., & Aerts, H. J. W. L. (2017). Computational Radiomics System to Decode the Radiographic Phenotype. Cancer Research, 77(21), e104–e107. https://doi.org/10.1158/0008-5472.CAN-17-0339
