# Feature extraction

For this project I will use the VGG16 network pre-trained on ImageNet to compute meaningful numeric descriptions of the contents of each image. I will use these descriptions later to group the images by class with K-means clustering.

## Module imports

I will use Keras (through TensorFlow) to access the VGG16 network.

In [1]:
import numpy as np
import os
from pathlib import Path
import pickle
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

import sys
sys.path.append('..')
from helper.classification_tools import CustomLabelEncoder



## Loading files
First, I need to get the file paths of the pre-processed images I saved in 01_preprocess.ipynb. 

In [2]:
img_root = Path('..','data','images_preprocessed','images_histeq_resize')  # directory where images are stored.
assert img_root.is_dir() # make sure directory exists and is properly found
files = sorted(img_root.glob("*.bmp"))  # returns a list of all of the images in the directory, sorted by filename.

# Shuffle the filenames so they appear in random order in the dataset.
rs = np.random.RandomState(seed=749976)
rs.shuffle(files)

assert len(files) == 1500  # make sure all files are found.
print('first 10 filenames: {}'.format([x.name for x in files[:10]]))

first 10 filenames: ['Sc_68.bmp', 'Cr_152.bmp', 'PS_298.bmp', 'Pa_216.bmp', 'PS_255.bmp', 'Sc_185.bmp', 'RS_95.bmp', 'Pa_125.bmp', 'RS_263.bmp', 'PS_169.bmp']
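
The fixed seed is what makes the shuffled order repeatable between runs. As a minimal sketch, using made-up filenames rather than the project's images and a hypothetical helper called seeded_shuffle, two generators created with the same seed produce the same permutation:

```python
import numpy as np

# Hypothetical filenames, used only for illustration.
names = ['Cr_1.bmp', 'Pa_1.bmp', 'PS_1.bmp', 'RS_1.bmp', 'Sc_1.bmp']

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; the same seed always gives the same order."""
    shuffled = list(items)  # copy so the input list is left untouched
    np.random.RandomState(seed=seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(names, seed=749976)
second = seeded_shuffle(names, seed=749976)
print(first == second)
```

Because the labels are later derived from the shuffled filenames, files and labels stay aligned as long as the shuffle happens before the labels are extracted, as it does here.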


## Extracting the labels from filenames
The labels are determined from the characters in the filename before the first "_".

I could also just take the first two characters of the filename, but this does not generalize to cases where the labels have different numbers of characters.

In [3]:
def extract_labels(paths):
    """Return the label (the text before the first '_') for each file path."""
    return [p.stem.split('_')[0] for p in paths]
labels = extract_labels(files)
print('first 10 labels: {}'.format(labels[:10]))

first 10 labels: ['Sc', 'Cr', 'PS', 'Pa', 'PS', 'Sc', 'RS', 'Pa', 'RS', 'PS']
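
To illustrate why splitting on "_" generalizes better than taking the first two characters, here is a small sketch. The filename 'Crazing_12.bmp' is invented for this example and does not exist in the dataset:

```python
from pathlib import Path

# Hypothetical filenames; 'Crazing_12.bmp' is made up to show a longer label.
paths = [Path('Sc_68.bmp'), Path('Crazing_12.bmp')]

by_slice = [p.stem[:2] for p in paths]            # fixed-width prefix
by_split = [p.stem.split('_')[0] for p in paths]  # everything before the first '_'

print(by_slice)
print(by_split)
```

Both approaches agree on 'Sc', but only the split keeps the longer label whole.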


## Label encoding
One step that will make life easier throughout the analysis is standardizing the
encoding of labels. The labels are stored as strings in the filenames, but converting
them to integers makes it more convenient to calculate statistics like accuracy, precision, and recall.
I can create one label encoder and save it for reuse throughout the project.

sklearn has a LabelEncoder object, but it always orders the labels with a case-sensitive sort (so 'PS' comes before 'Pa') and does not accept a custom sort key. Therefore I use a simple helper class, CustomLabelEncoder, that sorts the labels alphabetically regardless of case.

In [4]:
le = CustomLabelEncoder()
le.fit(labels, sorter=lambda x: x.upper())

labels_int = le.transform(labels[:10])
labels_str = le.inverse_transform(labels_int)

# save the label encoder so it can be used throughout the rest of this project
with open(Path('..','models','label_encoder.pickle'), 'wb') as f:
    pickle.dump(le, f)

print('label encodings: {}'.format(le.mapper))
print('first 10 integer labels: {}'.format(labels_int))
print('first 10 string labels: {}'.format(labels_str))

label encodings: {np.str_('Cr'): 0, np.str_('Pa'): 1, np.str_('PS'): 2, np.str_('RS'): 3, np.str_('Sc'): 4}
first 10 integer labels: [4 0 2 1 2 4 3 1 3 2]
first 10 string labels: ['Sc' 'Cr' 'PS' 'Pa' 'PS' 'Sc' 'RS' 'Pa' 'RS' 'PS']
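
The source of CustomLabelEncoder is not shown in this notebook, so the following is only a sketch of the idea, not the project's helper. The class name MinimalLabelEncoder and its attributes are hypothetical. It shows how a case-insensitive sort key changes which integer each label receives:

```python
labels = ['Sc', 'Cr', 'PS', 'Pa', 'PS', 'Sc', 'RS', 'Pa', 'RS', 'PS']

class MinimalLabelEncoder:
    """Hypothetical stand-in, not the project's CustomLabelEncoder."""

    def fit(self, labels, sorter=None):
        # Unique labels, ordered by the optional key function.
        classes = sorted(set(labels), key=sorter)
        self.mapper = {label: i for i, label in enumerate(classes)}
        self.inverse_mapper = {i: label for label, i in self.mapper.items()}
        return self

    def transform(self, labels):
        return [self.mapper[label] for label in labels]

    def inverse_transform(self, codes):
        return [self.inverse_mapper[code] for code in codes]

default_order = MinimalLabelEncoder().fit(labels)
upper_order = MinimalLabelEncoder().fit(labels, sorter=lambda x: x.upper())

print(default_order.mapper)  # case-sensitive: 'PS' is placed before 'Pa'
print(upper_order.mapper)    # case-insensitive: 'Pa' is placed before 'PS'
```

With the case-insensitive key, the sketch reproduces the mapping printed above (Cr, Pa, PS, RS, Sc as 0 to 4).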


## Loading Images
For feature extraction to work correctly, the images have to be in the format that the network weights expect.
Keras gives us functions for loading and formatting these images. Note that the formatting function is called preprocess_input,
but it is not preprocessing in the sense of what was done in 01_preprocess.ipynb. For VGG16 it reorders the color channels from RGB to BGR
and subtracts the ImageNet mean from each channel, so that the images are represented the way the pre-trained weights expect.

In [5]:
def load_images(paths):
    """
    Loads images in the correct format for use with the Keras VGG16 model
    
    Images are loaded as PIL image objects, converted to numpy array, and then formatted
    with the appropriate VGG16.preprocess_input() function. Note that this only changes
    how the images are represented, it does not change the actual visual properties of the
    images like preprocessing did before.
    
    Parameters
    ----------
    paths: list(Path)
        list of Paths to each file where the image is stored. Note that the images should 
        have the same height, width in pixels so they can be stored in one array.
    
    Returns
    ----------
    images: ndarray
        n_images x r x c x 3 array of pixel values that is compatible with the Keras model.
    
    """
    
    images = [image.load_img(file) for file in paths] # load images
    # convert images to an array with shape consistent for the vgg16 input
    images = np.asarray([image.img_to_array(img) for img in images]) 
    # normalizes the pixel values to match the imagenet format (and therefore the pre-trained weights)
    images = preprocess_input(images) 
    
    return images

    

In [6]:
images = load_images(files)
assert len(images) == 1500
print(images.shape)

(1500, 224, 224, 3)
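
To make the formatting step concrete, here is a NumPy sketch of the arithmetic that the VGG16 preprocess_input applies: channel reordering followed by mean subtraction, with no rescaling. The function name vgg16_style_preprocess is hypothetical and exists only for illustration; the notebook itself keeps using the Keras function.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by the Keras "caffe" preprocessing mode.
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(rgb_batch):
    """Hypothetical NumPy re-implementation, for illustration only."""
    batch = np.asarray(rgb_batch, dtype=np.float32)
    bgr = batch[..., ::-1]           # reorder channels: RGB -> BGR
    return bgr - IMAGENET_BGR_MEANS  # zero-center each channel; no rescaling

# One 2x2 pure-red image with pixel values in [0, 255].
red = np.zeros((1, 2, 2, 3), dtype=np.float32)
red[..., 0] = 255.0

out = vgg16_style_preprocess(red)
print(out.shape)
print(out[0, 0, 0])
```

The array shape is unchanged, which is why the loaded batch above still has shape (1500, 224, 224, 3) after formatting.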


# Feature extraction
We will use the VGG16 network as a signal processor, generating a feature descriptor for each image that we can use later for classification.

First, get the VGG16 model with its ImageNet weights, downloading it only if a saved copy does not already exist.

In [7]:
vgg16_path = Path('..','models','VGG16.h5')
if not vgg16_path.is_file():
    vgg16 = keras.applications.VGG16(include_top=True,  # include fully connected layers
                                     weights='imagenet') # use pre-trained model
    vgg16.save(vgg16_path) # save model so we don't have to download it every time
    
else:   
    vgg16 = keras.models.load_model(vgg16_path) # use saved model

I0000 00:00:1738166866.517970   21995 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5728 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9


If Keras warns that the loaded model has not been compiled with an optimizer or loss function, the warning can be ignored. Since I am
not training the model, and am just using it as a feature extractor, this is not a problem.

We can see the structure of the VGG16 model here.

In [8]:
vgg16.summary()

The pre-trained model runs data through the entire network and returns the output of the classification layer.

However, I only want the output of an intermediate layer so that I can use it as a feature descriptor.

By default, I will use fc1 (the first fully-connected layer).

In [9]:
def layer_extractor(model=vgg16, layer='fc1'):
    """
    returns a model that will extract the outputs of *layer* from *model*.
    
    Parameters
    -------------
    model: keras model
        full model from which intermediate layer will be extracted
    layer: string
        name of layer from which to extract outputs
    
    Returns
    -------------
    new_model: keras model
        feature extractor model which takes the same inputs as *model* and returns the outputs
        of the intermediate layer specified by *layer* by calling new_model.predict(inputs)
    """
    assert layer in [x.name for x in model.layers]  # make sure the layer exists

    # build the extractor from the *model* argument rather than the global vgg16
    new_model = keras.Model(inputs=model.input, outputs=[model.get_layer(layer).output])
    
    return new_model




## FC1 features

This extracts the features, and then saves them to the target directory, along with the filenames and extracted labels.

In [10]:
fc1_extractor = layer_extractor()
fc1 = fc1_extractor.predict(images, verbose=True)

# save results
results = {'filename' : files,
           'features': fc1,
           'labels': labels,
           'layer_name': 'fc1'
          }

feature_dir = Path('..','data','features')
os.makedirs(feature_dir, exist_ok=True)
with open(feature_dir / 'VGG16_fc1_features_std.pickle', 'wb') as f:
    pickle.dump(results, f)

print(fc1.shape)

I0000 00:00:1738166868.949312   22081 service.cc:148] XLA service 0x709488004ba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738166868.949601   22081 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 4060 Laptop GPU, Compute Capability 8.9
2025-01-29 19:07:48.961720: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1738166869.009882   22081 cuda_dnn.cc:529] Loaded cuDNN version 90700
2025-01-29 19:07:51.463217: W external/local_xla/xla/tsl/framework/bfc_allocator.cc:306] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.


[1m 1/47[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m9:41[0m 13s/step

I0000 00:00:1738166881.357506   22081 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


[1m47/47[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 390ms/step
(1500, 4096)


In [11]:
# import pandas as pd

# obj = pd.read_pickle('../data/features/VGG16_fc1_features_std.pickle')
# print (obj)
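
The commented-out cell above hints at reading the saved file back. As a hedged sketch of how later notebooks might do that, the hypothetical helper load_features below loads a results dictionary with pickle and checks that filenames, labels, and features line up. So that the example runs without the project's data, it writes a small stand-in dictionary (made-up values, same keys as the results dict above) to a temporary directory first.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

def load_features(path):
    """Hypothetical helper: load a results dict and check that its parts line up."""
    with open(path, 'rb') as f:
        results = pickle.load(f)
    n = len(results['filename'])
    assert len(results['labels']) == n
    assert results['features'].shape[0] == n
    return results

# Stand-in data with the same keys as the saved results (values are made up).
dummy = {'filename': [Path('Sc_68.bmp'), Path('Cr_152.bmp')],
         'features': np.zeros((2, 4096), dtype=np.float32),
         'labels': ['Sc', 'Cr'],
         'layer_name': 'fc1'}

with tempfile.TemporaryDirectory() as tmp:
    dummy_path = Path(tmp) / 'dummy_features.pickle'
    with open(dummy_path, 'wb') as f:
        pickle.dump(dummy, f)
    loaded = load_features(dummy_path)

print(loaded['features'].shape, loaded['layer_name'])
```

For the real file, the path would be feature_dir / 'VGG16_fc1_features_std.pickle'. Pickle files should only be loaded when they come from a trusted source, such as this notebook's own output.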
