
Loading the GIST data from GitHub

In [131]:
# Run this to use from colab environment
!pip install -q --upgrade git+https://github.com/jveenland/tm10007_ml.git

  Preparing metadata (setup.py) ... done


In [132]:
# Run this to use from colab environment
!git clone https://github.com/jveenland/tm10007_ml.git

fatal: destination path 'tm10007_ml' already exists and is not an empty directory.


In [133]:
%cd /content/tm10007_ml/worcgist

/content/tm10007_ml/worcgist


In [134]:
# Import necessary libraries
import pandas as pd
import numpy as np

from pathlib import Path
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [157]:
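# Path to the radiomic feature table (named data_path to avoid shadowing the built-in dir)
data_path = Path('.') / 'GIST_radiomicFeatures.csv'
data = pd.read_csv(data_path, index_col=0)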

print(f'The number of samples: {len(data.index)}')
print(f'The number of columns: {len(data.columns)}')
data.info()

The number of samples: 246
The number of columns: 494
<class 'pandas.core.frame.DataFrame'>
Index: 246 entries, GIST-001_0 to GIST-246_0
Columns: 494 entries, label to PREDICT_original_phasef_phasesym_entropy_WL3_N5
dtypes: float64(468), int64(25), object(1)
memory usage: 951.3+ KB


Splitting the data

In [158]:
# ----- SPLITTING THE DATA -----

# Replace label values from string to binary
data['label'] = data['label'].replace({'GIST': 1, 'non-GIST': 0})

# Separate the features and labels
X = data.drop(['label'], axis=1)
y = data['label']

# Split the data into random train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

(196, 493)
(50, 493)
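
The split above is purely random. If the two classes are not equally frequent, passing `stratify` keeps the class proportions the same in both subsets. The sketch below illustrates this on hypothetical toy labels (`X_demo` and `y_demo` are made up for the example and are not part of the GIST data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 70/30 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive fraction:", y_tr.mean())
print("test positive fraction:", y_te.mean())
```

Whether stratification is needed here depends on the class balance of the GIST labels, which the cells above do not report.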


Preprocessing

The following code blocks preprocess the training data: every feature is min-max scaled to the range [0, 1], and constant (zero-variance) features are removed.

In [159]:
# ----- PREPROCESSING -----

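# Data scaling: fit the scaler on the training set only and keep the fitted
# object so the same transformation can later be applied to the test set
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)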

In [160]:
# Remove all constant (zero-variance) features
X_train = pd.DataFrame(X_train)

zero_var_filter = VarianceThreshold(threshold=0)
zero_var_filter.fit(X_train)
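# Positional indices of the columns rejected by the filter
zero_var_columns = X_train.columns[~zero_var_filter.get_support()].tolist()
X_train = zero_var_filter.transform(X_train)

# Look up the names in X rather than data: data still holds 'label' at
# position 0, which would shift every looked-up name by one column
removed_features = [X.columns[index] for index in zero_var_columns]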
print('The following constant features were removed:')
for feature in removed_features:
    print(f'- {feature}')

The following constant features were removed:
- PREDICT_original_tf_LBP_min_R3_P12
- PREDICT_original_tf_LBP_kurtosis_R3_P12
- PREDICT_original_tf_LBP_peak_R3_P12
- PREDICT_original_tf_LBP_min_R8_P24
- PREDICT_original_tf_LBP_peak_R8_P24
- PREDICT_original_tf_LBP_min_R15_P36
- PREDICT_original_tf_LBP_peak_R15_P36
- PREDICT_original_phasef_monogenic_entropy_WL3_N5
- PREDICT_original_phasef_phasecong_kurtosis_WL3_N5
- PREDICT_original_phasef_phasecong_peak_WL3_N5
- PREDICT_original_phasef_phasecong_entropy_WL3_N5
- PREDICT_original_phasef_phasesym_kurtosis_WL3_N5
- PREDICT_original_phasef_phasesym_peak_WL3_N5
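
Both preprocessing steps are fitted on the training set, so the test set has to pass through the same fitted objects rather than being scaled or filtered on its own. A minimal sketch of that pattern, using hypothetical toy arrays (`X_train_demo`, `X_test_demo`) instead of the GIST features:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 6 training and 3 test samples with 3 features each;
# the middle feature is constant in the training set
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(6, 3))
X_train_demo[:, 1] = 5.0
X_test_demo = rng.normal(size=(3, 3))

# Fit both transformers on the training data only
scaler = MinMaxScaler().fit(X_train_demo)
var_filter = VarianceThreshold(threshold=0).fit(scaler.transform(X_train_demo))

# Reuse the fitted objects for the training and the test data alike
X_train_prep = var_filter.transform(scaler.transform(X_train_demo))
X_test_prep = var_filter.transform(scaler.transform(X_test_demo))

print(X_train_prep.shape, X_test_prep.shape)
print("kept columns:", var_filter.get_support())
```

The constant column is dropped from both arrays, so the train and test matrices keep matching columns.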


Feature selection

The following code block fits a Lasso model and keeps every feature whose absolute coefficient is at or above the median, which retains roughly half of the features.

In [147]:
# ----- Feature selection -----

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Keep the features whose absolute Lasso coefficient is at least the median
lasso_selector = SelectFromModel(estimator=Lasso(alpha=1e-10, max_iter=1000), threshold='median')
lasso_selector.fit(X_train, y_train)
lasso_list = list(pd.DataFrame(X_train).columns[lasso_selector.get_support()])
n_original = X_train.shape[1]

X_train = lasso_selector.transform(X_train)
n_selected = X_train.shape[1]
print(f"Selected {n_selected} from {n_original} features.")

Selected 240 from 480 features.
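
With `threshold='median'`, `SelectFromModel` compares each feature's absolute coefficient with the median of all absolute coefficients and keeps those at or above it. That is consistent with 240 of 480 features being kept above. The sketch below shows the mechanism on hypothetical synthetic data; `X_demo`, `y_demo`, and `alpha=0.01` are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
# Hypothetical binary target driven by the first two features only
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

selector = SelectFromModel(Lasso(alpha=0.01), threshold="median").fit(X_demo, y_demo)

# Importance of each feature is the absolute value of its Lasso coefficient
importances = np.abs(selector.estimator_.coef_)
mask = selector.get_support()

print("median |coef|:", np.median(importances))
print("threshold used:", selector.threshold_)
print("kept features:", np.flatnonzero(mask))
```

Because the cut-off is relative, this selector keeps about half of the features regardless of how informative they are; a fixed numeric threshold or a larger `alpha` would make the selection depend on the coefficients themselves.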


In [161]:
X_train.shape

(196, 480)

Feature extraction

The following code block applies principal component analysis (PCA) for dimensionality reduction, keeping as many principal components as are needed to explain 95% of the variance in the training data.

In [162]:
# ----- Feature extraction -----

from sklearn.decomposition import PCA

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)

print(f"Selected {X_train.shape[1]} features to be used for classification.")

Selected 52 features to be used for classification.
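
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` chooses the number of components from the cumulative explained variance ratio. The outputs are linear combinations of the input features rather than a subset of them. A minimal sketch on hypothetical correlated data (`X_demo` is generated from three made-up latent factors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 20 correlated features built from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(200, 20))

pca_demo = PCA(n_components=0.95).fit(X_demo)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Inspecting `explained_variance_ratio_` in the same way on the GIST data would show how the 95% cut-off translates into the number of retained components.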


Feature selection method: SelectKBest

This method scores each feature separately with the ANOVA F-test (f_classif) and keeps the 10 features with the highest scores.

In [141]:
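# Select the k features with the highest ANOVA F-scores
# NOTE: X_new is not defined in the cells shown here; it is assumed to be the
# scaled training matrix after zero-variance filtering (196 x 480)
kbest_filter = SelectKBest(f_classif, k=10)
kbest_filter.fit(X_new, y_train)

# Names of the features that survived the zero-variance filter, in column order
kept_feature_names = X.columns[zero_var_filter.get_support()]
kbest_list = kbest_filter.get_feature_names_out(input_features=kept_feature_names)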

print('The selected 10 best features are:')
for kbest in kbest_list:
    print(f'- {kbest}')

X_kbest = kbest_filter.transform(X_new)

The selected 10 best features are:
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A0.0mean
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A0.0std
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A0.79mean
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A0.79std
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A1.57mean
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A1.57std
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A2.36mean
- PREDICT_original_tf_GLCMMS_homogeneityd3.0A2.36std
- PREDICT_original_tf_Gabor_mean_F0.2_A0.79
- PREDICT_original_tf_Gabor_mean_F0.5_A0.79


In [142]:
X_kbest.shape

(196, 10)
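
`SelectKBest` with `f_classif` ranks every feature on its own F-statistic, so strongly correlated features can all end up in the top k. The sketch below shows how the scores drive the selection, using hypothetical synthetic data in which only one feature separates the two classes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 2] += 3.0 * y_demo  # hypothetical: only feature 2 differs between the classes

kbest_demo = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# List the features from highest to lowest F-score
for idx in np.argsort(kbest_demo.scores_)[::-1]:
    print(f"feature {idx}: F = {kbest_demo.scores_[idx]:.1f}, p = {kbest_demo.pvalues_[idx]:.3g}")

print("selected:", np.flatnonzero(kbest_demo.get_support()))
```

Printing `scores_` and `pvalues_` for the GIST features in the same way would show how far apart the selected features are from the rest.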