# Phytolith classification

This experimental notebook performs feature extraction and classification of phytoliths. The raw images come from the repository of a previous work: https://github.com/alvarag/AutomaticPhytolithClassification

The notebook is designed to run on Google Colab and saves its data to Google Drive. It can also be run locally.

The notebook saves two kinds of output:

- The datasets that result from extracting features from the images. Saved datasets are Pandas DataFrames, which contain:
    - Image file name.
    - Phytolith class name.
    - An array with 10 "test fold ids". This experiment uses 10 x 10 cross validation, so for each image the dataset stores the test fold it falls into in each of the 10 repetitions. This ensures that all the methods are evaluated in the same way (always trained and tested with the same sets of images).

- A dictionary of results. Its keys are (IdAtributes, IdClassifier) pairs, and its values are dictionaries that contain all the predictions made in each fold of each repetition.
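
As a rough illustration of how the stored test fold ids can be used, the sketch below builds a toy dataset and derives the train/test split for one fold of one repetition. The `Test_Fold0`, `Test_Fold1` column names follow the naming used later in the notebook; the toy values, the helper `split_for`, the key `("LBP", "SVM")` and the inner layout of the results dictionary are assumptions made only for this example.

```python
import pandas as pd

# Toy dataset: 6 images, 2 repetitions of a 3-fold cross validation.
toy = pd.DataFrame({
    "Name": [f"img{i}.jpg" for i in range(6)],
    "Class": ["A", "A", "A", "B", "B", "B"],
    "Test_Fold0": [0, 1, 2, 0, 1, 2],
    "Test_Fold1": [2, 0, 1, 1, 2, 0],
})

def split_for(df, repetition, fold):
    """Return (train, test) rows for one fold of one repetition (hypothetical helper)."""
    is_test = df[f"Test_Fold{repetition}"] == fold
    return df[~is_test], df[is_test]

train, test = split_for(toy, repetition=1, fold=0)
print(list(test.Name))  # ['img1.jpg', 'img5.jpg']

# Results keyed by (feature set id, classifier id); the inner layout is illustrative.
results = {("LBP", "SVM"): {(1, 0): ["A", "B"]}}
```

Because the fold ids travel with the dataset, any classifier evaluated later sees exactly the same partitions.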

# Installations

In [None]:
'''
Experimental. 
Installing pandarallel to parallelize feature extraction
https://towardsdatascience.com/pandaral-lel-a-simple-and-efficient-tool-to-parallelize-your-pandas-operations-on-all-your-cpus-bb5ff2a409ae
Not used in this version
'''
!pip install pandarallel
from pandarallel import pandarallel
from IPython.display import clear_output
# Initialization
pandarallel.initialize()
clear_output()
print("Installed")

In [None]:
'''
Installing pyefd, for computing elliptic Fourier descriptors
'''
!pip install pyefd
from pyefd import elliptic_fourier_descriptors
clear_output()
print("Installed") 


In [None]:
'''
Installing mahotas, for computing Zernike moments
Not used in this version
'''
!pip install mahotas
clear_output()
print("Installed")

In [None]:
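# Pickle protocol 5 is built into Python 3.8+;
# the pickle5 backport is only needed on older versions
import sys
if sys.version_info < (3, 8):
  !pip3 install pickle5
  import pickle5 as pickle
else:
  import pickle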

# Imports and variables

Import the main libraries and variables


In [None]:
'''
Import libraries
'''

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt   

import json
import math
import glob
import os
import pickle

from skimage.draw import polygon
from skimage.measure import regionprops, find_contours, label
from skimage.transform import resize, rotate
from skimage.util import montage, img_as_ubyte
from skimage.morphology import convex_hull_image
from skimage import io
from skimage.io import imread, imshow
from skimage.color import rgb2gray, gray2rgb
from skimage.feature import local_binary_pattern

from sklearn.model_selection import StratifiedKFold

from IPython.display import clear_output

In [None]:
'''
Configuration data

csvs_path = Directory with multiple subfolders, one for each morphotype, 
each of these subfolders with a .csv file per image.

imgs_path = Directory with multiple subfolders, one for each morphotype, 
each of these subfolders contains multiple images.

Watch out: the path separator depends on the operating system.
'''
# Set to False to run the notebook locally
colab = True


path = "AutomaticPhytolithClassification/phytoliths"

# Path to csv files
csvs_path = path + os.sep + "csvs"

# Path to image files
imgs_path = path + os.sep + "imgs"

# Output_folder
output_folder = "phyto_output"

complete_output_folder_path = ""
if colab:
  complete_output_folder_path = f"/content/drive/MyDrive/{output_folder}/"
else:
  complete_output_folder_path = "."+os.sep

# Datasets serialized file
datasets_file = "datasets.obj"

# Results serialized file
results_file = "results.obj"

# Size normalization
target_size = (300,300)

# Google Drive Setup

The datasets and results are saved to the Google Drive account chosen by the user. Only drives owned by that user can be accessed.

In [None]:
import os
import shutil

# Google Drive is only mounted when running on Colab
if colab:
  from google.colab import drive

  def mount():
    drive.mount('/content/drive')

  mount()

# Import datasets


Download the images from the phytolith repository.

(Modify the following cells to use the code with your own images)
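
For your own images, the later cells expect one annotation CSV per image with a `filename` column and a `region_shape_attributes` column that holds the polygon as `all_points_x` / `all_points_y` lists (this is what `process_csv` reads). The sketch below parses such a row using only the standard library. The file name and coordinates are made-up values, and `parse_annotation` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
import json

# One annotation file, as text (made-up example values).
csv_text = (
    "filename,region_shape_attributes\n"
    'example.jpg,"{""all_points_x"":[10,50,30],""all_points_y"":[20,25,60]}"\n'
)

def parse_annotation(text):
    """Return (bounding box, contour points, image name) from the first row."""
    row = next(csv.DictReader(io.StringIO(text)))
    shape = json.loads(row["region_shape_attributes"])
    xs, ys = shape["all_points_x"], shape["all_points_y"]
    rectangle = ((min(ys), min(xs)), (max(ys), max(xs)))  # ((y1, x1), (y2, x2))
    coords = list(zip(ys, xs))                             # (row, column) pairs
    return rectangle, coords, row["filename"]

rectangle, coords, name = parse_annotation(csv_text)
print(rectangle)  # ((20, 10), (60, 50))
```

Note the (row, column) ordering: coordinates are stored as (y, x) throughout the notebook, which is the order the scikit-image drawing functions expect.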

In [None]:
# Download AutomaticPhytolithClassification repository
!git clone https://github.com/alvarag/AutomaticPhytolithClassification.git
clear_output()

# Configurations

In [None]:
'''
Features to be recalculated

Set a flag to True to regenerate that dataset (for example, to test new parameters or modifications).
If a flag is set to False, the dataset is not recalculated but retrieved from Drive.

'''
repeat_dict = {
    "LBP": False, # LBP 
    "Morpho": False, # Morphological
    "EFD": False, # Elliptic fourier desc
    "Hu": False, # Hu moments
    "pftas":False,
    "Haralick":False,
    
    "InceptionV3":False, "Xception":False, 
    "VGG16":False, "VGG19":False, 
    "ResNet50":False, "ResNet101":False, "ResNet152":False, 
    "InceptionResNet":False, "MobileNet":False,
    "NASMobile":False, "NASLarge":False, 
    "Dense121":False, "Dense169":False, "Dense201":False
}
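
The recompute-or-reuse behaviour described in the cell above can be sketched with a plain dictionary standing in for the datasets stored on Drive. `get_dataset`, `fake_compute` and the toy values are hypothetical names introduced for this example, and recomputing when the dataset is missing from the cache is an assumption rather than something the notebook states.

```python
def get_dataset(name, repeat_dict, cache, compute):
    """Recompute the dataset if requested or missing; otherwise reuse the cached copy."""
    if repeat_dict.get(name, False) or name not in cache:
        cache[name] = compute(name)
    return cache[name]

calls = []

def fake_compute(name):
    """Stand-in for an expensive feature extraction step."""
    calls.append(name)
    return f"features for {name}"

cache = {"LBP": "cached LBP features"}
flags = {"LBP": False, "Hu": True}

lbp = get_dataset("LBP", flags, cache, fake_compute)  # reuses the cached value
hu = get_dataset("Hu", flags, cache, fake_compute)    # recomputed
print(lbp, "|", hu, "|", calls)
```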



# Utility functions (main)

Functions to generate data sets in DataFrame format.

These functions take a Series (a row of a DataFrame) that holds information about the image, such as its name, and return another Series with the desired attributes.

Using these functions together with "apply", you can get a features DataFrame from another DataFrame with the image metadata.
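
A minimal sketch of this row-to-row pattern is shown below. The function `register_to_toy_features`, the class names and the single `toy0` feature are placeholders invented for the example; the real functions in the following cells compute actual image descriptors.

```python
import pandas as pd

info = pd.DataFrame({
    "Image": ["a.jpg", "b.jpg"],
    "Class": ["Bilobate", "Rondel"],
    "Test_Fold0": [0, 1],
})

def register_to_toy_features(register):
    """Map one metadata row (a Series) to one feature row (also a Series)."""
    basic = pd.Series([register.Class + "_" + register.Image, register.Class],
                      ["Name", "Class"])
    folds = register.filter(regex="Test_Fold")           # keep the fold ids
    feats = pd.Series([len(register.Image)], ["toy0"])   # stand-in for real features
    return pd.concat((basic, folds, feats))

features = info.apply(register_to_toy_features, axis=1)
print(features.columns.tolist())  # ['Name', 'Class', 'Test_Fold0', 'toy0']
```

Since every function returns the name, the class and the fold ids first, all feature datasets share the same leading columns.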

In [None]:
def crop_square(img, rectangle):
  """
  Crops a square image that circumscribes the rectangle passed as an argument
        
  """
  (y1,x1),(y2,x2) = rectangle
  height = y2-y1
  width = x2-x1
  size = max(height,width)
  center_y, center_x = y1+int(height/2),x1+int(width/2)
  new_y1, new_y2 = center_y-int(size/2), center_y+int(size/2)
  new_x1, new_x2 = center_x-int(size/2), center_x+int(size/2)


  img_detail_centered = img[new_y1:new_y2,new_x1:new_x2].copy()
  return img_detail_centered


def center_and_crop_square(img, rectangle):
  """
  Crops a square image that circumscribes the rectangle passed as an argument.
  The area to be cropped is centered first by "rolling" the image
        
  """
  (y1,x1),(y2,x2) = rectangle
  height = y2-y1
  width = x2-x1
  size = max(height,width)
  center_y, center_x = y1+int(height/2),x1+int(width/2)

  img_center_y = int(img.shape[0]/2)
  img_center_x = int(img.shape[1]/2)

  diff_y, diff_x = int(img_center_y- center_y), int(img_center_x - center_x)

  rolled = np.roll(img,diff_y,axis=0)
  rolled = np.roll(rolled,diff_x,axis=1)

  new_y1, new_y2 = img_center_y-int(size/2), img_center_y+int(size/2)
  new_x1, new_x2 = img_center_x-int(size/2), img_center_x+int(size/2)

  img_detail_centered = rolled[new_y1:new_y2,new_x1:new_x2].copy()

  return img_detail_centered


def crop_image(img,rectangle):
  """
  Crops the square circumscribed about the rectangle passed as an argument.
  If the square does not fit inside the image, the image is "rolled" first
  """
  crop = crop_square(img, rectangle)
  (y1,x1),(y2,x2) = rectangle
  h,w = 0,0
  if len(crop.shape)==3:
    h,w,_ = crop.shape
  else:
    h,w = crop.shape
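  size = max(y2-y1,x2-x1)
  # A crop noticeably smaller than the requested square means that the square
  # fell partly outside the image, so the image is rolled to center the region
  if h<size*0.95 or w<size*0.95:
    crop = center_and_crop_square(img, rectangle)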

  return crop
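
The "rolling" step can be illustrated on a tiny array. This is a simplified sketch of the idea behind `center_and_crop_square`, with made-up sizes: the object sits in a corner, and `np.roll` shifts it to the middle of the image so that a centered square crop fits. Keep in mind that `np.roll` wraps pixels around the border, so the background that ends up near the object comes from the opposite side of the image.

```python
import numpy as np

img = np.zeros((8, 8), dtype=int)
img[0:2, 6:8] = 1                    # object touching the top-right corner

(y1, x1), (y2, x2) = (0, 6), (2, 8)  # bounding box of the object
center_y, center_x = (y1 + y2) // 2, (x1 + x2) // 2

# Shift the image so the centre of the object lands on the centre of the image.
shift_y = img.shape[0] // 2 - center_y
shift_x = img.shape[1] // 2 - center_x
rolled = np.roll(np.roll(img, shift_y, axis=0), shift_x, axis=1)

print(np.argwhere(rolled == 1))      # the object now occupies rows 3-4, columns 3-4
```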

In [None]:
def string_to_dict(dict_string):
    """
    Convert a string to a dictionary
    
    Parameters
    ----------
    dict_string : string
        string containing a dictionary encoded as text
    
    Returns
    -------
    dictionary : dict
        A dictionary containing the same information
    """
    dict_string = dict_string.replace("'", '"').replace('u"', '"')
    return json.loads(dict_string)

def process_csv(file):
    """
    Process the csv files obtained by the image labeler    
    
    Parameters
    ----------
    file : string
        string containing the path to the csv file
    
    Returns
    -------
    data : tuple
        A tuple containing the bounding box coordinates, 
        the full list of contour points and the image name
    """
    data = pd.read_csv(file) 
    img_name = data.filename
    points = data.region_shape_attributes[0]
    points_dict = string_to_dict(points)
    
    xs = points_dict["all_points_x"]
    ys = points_dict["all_points_y"]
    
    coords = list(zip(ys,xs)) 
    
    return ((min(ys),min(xs)),(max(ys),max(xs))) ,coords, img_name[0]

In [None]:
def create_image_info_df(csvs_path,num_folds=10,
                         num_repetitions = 10,
                         ignore_classes = []):
  """
  Process all csv files obtained by the image labeler.
  Obtains a dataset with "Image","Rectangle","Coords","Class", and
  cross validation partitions for future experiments
    
    
    Parameters
    ----------
    csvs_path : string
        path to the directory that contains the csv files
        (one subfolder per class)
    num_folds: integer
        number of cross validation folds
    num_repetitions: integer
        number of repetitions of the cross validation procedure
    ignore_classes: list
        class names to be excluded from the dataset
    
    Returns
    -------
    df : Dataframe
        A dataframe containing the image name, the bounding box coordinates, 
        the full list of contour points, the class and, for each repetition,
        the test fold to which the image belongs
        
    """
  url = csvs_path+os.sep
  clases = []
  rects = []
  coords = []
  image_files = []

  # Listing all .csv files
  files = [f for f in glob.glob(url+"**"+os.sep+"*.csv", recursive=True)]

  for file in files:    
    
      clases.append(file.split(os.sep)[-2])
          
      rect, coord, image_file = process_csv(file)
      rects.append(rect)
      coords.append(coord)
      image_files.append(image_file)
      
  # Initialise the data as a dictionary of lists. 
  data = {'Image':image_files, 
          'Rectangle':rects,
          'Coords': coords,
          'Class':clases} 
    
  # Create DataFrame 
  df = pd.DataFrame(data)[["Image","Rectangle","Coords","Class"]] 

  # Remove the ignored classes (for example, classes with too few images)
  for class_name in ignore_classes:
    df = df[~(df.Class==class_name)]

  
  y = df.Class

  # Add test folds
  for i in range(num_repetitions):

    test_fold = np.full(len(y),-1)
    skf = StratifiedKFold(n_splits=num_folds, 
                          shuffle=True,random_state=i)
    
    # y is passed twice: split() only uses the labels to stratify,
    # the first argument just needs to have the same length
    for fold_idx, (train_index, test_index) in enumerate(skf.split(y, y)):
        test_fold[test_index] = fold_idx

    df[f"Test_Fold{i}"] = test_fold

  return df




In [None]:
def get_mask(img,r,c):
    """
    Obtains the image mask    
    
    Parameters
    ----------
    img : ndarray
        The image (rows, columns, channels)
    r : ndarray
        Row coordinates of vertices of polygon.
    c : ndarray
        Column coordinates of vertices of polygon.

    
    Returns
    -------
    mask : ndarray of type uint8
        The mask that corresponds to the input polygon
        (1 inside the polygon, 0 outside).

    """
    image_shape = img.shape[:-1]    
    mask = np.zeros(image_shape, dtype=np.uint8)
    rr, cc = polygon(r, c)
    mask[rr, cc] = 1
    
    return mask

In [None]:
def register_to_info(register):
  img_name = register.Image
  #print(img_name)
  Class = register.Class    
    
  img_dir = imgs_path+os.sep+Class+os.sep    
  Name_str = Class+"_"+img_name    
    
  img_path = img_dir+img_name 
    
    
  # the polygon function needs the rows on one var and the columns on other
  r,c = zip(*register.Coords)
  (y1,x1),(y2,x2) = register.Rectangle
    
  img = imread(img_path)

  return img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img


In [None]:
def get_min_max_feret(mask):
    """
    Compute min_feret and max_feret    
    
    Parameters
    ----------
    mask : ndarray
        Binary image 
   
    Returns
    -------
    feret_max : integer 
    feret_min : integer

    """
    feret_max = 0
    feret_min = 999999

    # Rotate the mask in 1-degree steps and measure its bounding box at each angle
    for i in range(360):
        mask_r = rotate(mask,i,preserve_range=True,resize=True)

        label_image = label(mask_r)

        region = regionprops(label_image)[0]
        minr, minc, maxr, maxc = region.bbox
        lengths = (maxc-minc, maxr - minr)
        max_l = max(lengths)
        min_l = min(lengths)

        if max_l > feret_max:
            feret_max = max_l
        if min_l < feret_min:
            feret_min = min_l
        
    return feret_max, feret_min 
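
The same rotate-and-measure idea can be sketched on contour points instead of a rotated image. Note that this is a different formulation from the cell above: it projects the points onto a rotating direction rather than rotating the mask and reading its bounding box. `feret_from_points` and the rectangle are hypothetical and only serve as an illustration.

```python
import numpy as np

def feret_from_points(points, step_deg=1):
    """Approximate (max, min) caliper widths of a point set over sampled directions."""
    pts = np.asarray(points, dtype=float)
    widths = []
    for angle in np.deg2rad(np.arange(0, 180, step_deg)):
        # Extent of the shape measured along the direction given by `angle`.
        direction = np.array([np.cos(angle), np.sin(angle)])
        proj = pts @ direction
        widths.append(proj.max() - proj.min())
    return max(widths), min(widths)

# Corners of a 4 x 2 rectangle: the largest width is close to the diagonal,
# the smallest is the short side.
rect = [(0, 0), (4, 0), (4, 2), (0, 2)]
feret_max, feret_min = feret_from_points(rect)
print(round(feret_max, 3), round(feret_min, 3))
```

With 1-degree steps the maximum is only approximate, because the true diagonal direction usually falls between two sampled angles.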

In [None]:
def get_efd(mask):
    """
    Compute elliptic fourier descriptors    
    
    Parameters
    ----------
    mask : ndarray
        Binary image 
   
    Returns
    -------
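    efds : ndarray 
        Array of elliptic Fourier descriptors

    """
    contours = find_contours(mask, 0.5)
    
    coeffs = elliptic_fourier_descriptors(contours[0],order=10,normalize=True)
    # After normalization the first three coefficients are always 1, 0 and 0,
    # so they carry no information and are dropped
    return coeffs.flatten()[3:]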

In [None]:
def register_to_morpho_features(register):
    """
    
    Compute basic morphological features
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    features : Pandas Series 
        Series with the name, class, test folds and morphological features

    """
    
    img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)

    test_folds = register.filter(regex='Test_Fold')

    img_gray = rgb2gray(img)
    
    
    mask = get_mask(img,r,c)    
        
    # max Feret (Length) and min Feret (Width)
    Length,Width = get_min_max_feret(mask[y1:y2,x1:x2])
    
    ## Properties of the mask
    region = regionprops(mask,intensity_image=img_gray)[0]
    Perimeter = region.perimeter
    Area = region.area
    ConvexArea = region.convex_area
    MajorAxisLength = region.major_axis_length
    MinorAxisLength = region.minor_axis_length
    EquivDiam = region.equivalent_diameter
    
    ## convex hull of the mask.
    chull = convex_hull_image(mask)
    regionPerimConvexHull = regionprops(chull.astype(int))[0]
    perimeterHull = regionPerimConvexHull.perimeter
    
    Convexity = perimeterHull/Perimeter    
    Solidity = Area/ConvexArea
    AspectRatio = Length/Width
    Roundness = (4*Area*(math.pi))/((Length)**2)
    Compactness = EquivDiam/Length
    
    FormFactor = (4*Area*(math.pi))/((Perimeter)**2)
    
    basic_values = pd.Series([Name_str,Class],["Name","Class"]) 
    morfo_values = pd.Series([Perimeter,perimeterHull,Area,ConvexArea,
                              MajorAxisLength,MinorAxisLength,
                               EquivDiam,FormFactor,Length,Width,
                               Convexity,Solidity, AspectRatio,Roundness,Compactness],
                              ["Perimeter","PerimeterHull","Area","Convex Area",
                               "Major axis length","Minor axis length",
                               "Equivalent diameter","Form factor","Length","Width",
                               "Convexity","Solidity", "AspectRatio","Roundness","Compactness"])
        
    return pd.concat((basic_values,test_folds,morfo_values))
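
As a sanity check on two of the ratios used above, the sketch below evaluates them on ideal shapes. The helper names `form_factor` and `solidity` are introduced only for this example; the formulas are the ones in the cell above. The values are exact geometric ones, whereas perimeters measured on pixel masks are only approximations.

```python
import math

def form_factor(area, perimeter):
    """4*pi*Area / Perimeter^2: 1 for a perfect circle, smaller for other outlines."""
    return 4 * math.pi * area / perimeter ** 2

def solidity(area, convex_area):
    """Area / ConvexArea: 1 for convex shapes, smaller when the outline has concavities."""
    return area / convex_area

r = 10.0
circle = form_factor(math.pi * r ** 2, 2 * math.pi * r)   # ideal circle
side = 10.0
square = form_factor(side ** 2, 4 * side)                 # ideal square, pi / 4

print(circle, square, solidity(3.0, 4.0))
```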

In [None]:
def register_to_efd_features(register):
  """
    Compute Elliptic Fourier Descriptors    
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    efds : Pandas Series 
        Series with the Elliptic Fourier Descriptors

  """

  img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
  test_folds = register.filter(regex='Test_Fold')
  img_gray = rgb2gray(img)
    
    
  mask = get_mask(img,r,c)    
  efds = get_efd(mask)  
    
    
    
  basic_values = pd.Series([Name_str,Class],["Name","Class"])
  efd_values = pd.Series(efds,["edf"+str(i) for i in range(len(efds))])
    
  return pd.concat((basic_values,test_folds,efd_values))

In [None]:
def register_to_lbp_features(register,radious=3,use_mask = True):
    """
    Compute Local Binary Patterns    
    
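    Parameters
    ----------
    register : Pandas Series
        Image Info 
    radious : int
        Radius of the circular neighbourhood
    use_mask : bool
        If True, only the pixels inside the phytolith mask are counted
   
    Returns
    -------
    lbps : Pandas Series 
        Series with Local Binary Patterns descriptors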

    """
    img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
    test_folds = register.filter(regex='Test_Fold')
    img_gray = img_as_ubyte(rgb2gray(img))  
    img_gray=img_gray[y1:y2,x1:x2] 
    
    mask = get_mask(img,r,c)  
    mask=mask[y1:y2,x1:x2] 
    
    lbp = None
    P = 8

    dim = 2**P # Number of possible LBP codes

    #h_bins = np.arange(dim+1)

    h_bins = np.arange(0,dim+1,16) # Quantization levels
    h_range = (0, dim)

    
    codes = local_binary_pattern(img_gray, P, radious, method="default")
    #print("codes",codes[0])
    
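    h_img, _ = np.histogram(codes.ravel(), bins=h_bins, range=h_range)
    # Boolean indexing: a uint8 mask would otherwise be read as row indices
    h_masked, _ = np.histogram(codes[mask.astype(bool)], bins=h_bins, range=h_range)
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    h_img = h_img/h_img.sum(dtype=float)
    h_masked = h_masked/h_masked.sum(dtype=float)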

    if use_mask:
      lbp = h_masked.copy()
    else:
      lbp = h_img.copy()
    
    
    basic_values = pd.Series([Name_str,Class],["Name","Class"])
    lbp_values = pd.Series(lbp,["LBP"+str(i) for i in range(len(lbp))])
    
    return pd.concat((basic_values,test_folds,lbp_values))
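
One detail worth checking in this kind of masked histogram is the dtype of the mask. The sketch below, with made-up code values, shows that a boolean mask selects pixels, while an integer mask of 0s and 1s is interpreted by NumPy as row indices. It also reproduces the 16-level quantization of the histogram on a toy array.

```python
import numpy as np

codes = np.array([[0, 16, 32],
                  [48, 64, 80]])
mask = np.array([[1, 0, 1],
                 [0, 1, 0]], dtype=np.uint8)

# Boolean indexing keeps only the pixels where the mask is set.
inside = codes[mask.astype(bool)]

# Indexing with the integer mask instead picks whole rows 0 and 1 repeatedly.
rows = codes[mask]

bins = np.arange(0, 257, 16)               # 16 quantization levels over [0, 256]
hist, _ = np.histogram(inside, bins=bins)
hist = hist / hist.sum()                    # normalise to a distribution

print(inside, rows.shape)                   # [ 0 32 64] (2, 3, 3)
```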

In [None]:
def get_hu(img):
  label_image = label(img)
  regions = regionprops(label_image)
  
  region = regions[0]
  hu = region.moments_hu
  #(min_row, min_col, max_row, max_col) = region.bbox
  return hu

In [None]:
import mahotas

'''
Not used
'''
def get_zernike(img,radious):
  label_image = label(img)
  regions = regionprops(label_image)
  
  # take the first region (only works well if there is a single one)
  region = regions[0]
  (min_row, min_col, max_row, max_col) = region.bbox
  
  img_crop = img[min_row:max_row,min_col:max_col]
  return mahotas.features.zernike_moments(img_crop, radious)

In [None]:
import mahotas


def register_to_haralick_features(register,distance=1):
  """
    Compute haralick features
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    haralick : Pandas Series 
        Series with Haralick descriptors

  """

  img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
  test_folds = register.filter(regex='Test_Fold')
    
  
  img_crop = crop_image(img,((y1,x1),(y2,x2)))

  haralick_features =  mahotas.features.haralick(img_crop,distance=distance,
                                                 return_mean=True,
                                                 ignore_zeros = False)

  basic_values = pd.Series([Name_str,Class],["Name","Class"])
  haralick_values = pd.Series(haralick_features,["haralick"+str(i) for i in range(len(haralick_features))])
    
  return pd.concat((basic_values,test_folds,haralick_values))



def register_to_pftas_features(register):
  """
    Compute PFTAS features
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    pftas : Pandas Series 
        Series with PFTAS descriptors

  """
  img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
  test_folds = register.filter(regex='Test_Fold')
  
  img_crop = crop_image(img,((y1,x1),(y2,x2)))

  pftas_features =  mahotas.features.pftas(img_crop)

  basic_values = pd.Series([Name_str,Class],["Name","Class"])
  pftas_values = pd.Series(pftas_features,["pftas"+str(i) for i in range(len(pftas_features))])
    
  return pd.concat((basic_values,test_folds,pftas_values))




In [None]:
def register_to_hu_features(register):
  """
    Compute Hu Moments
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    hu : Pandas Series 
        Series with Hu Moments descriptors

  """

  img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
  test_folds = register.filter(regex='Test_Fold')
    
  mask = get_mask(img,r,c)    
  hu_features = get_hu(mask)  
    
    
    
  basic_values = pd.Series([Name_str,Class],["Name","Class"])
  hu_values = pd.Series(hu_features,["hu"+str(i) for i in range(len(hu_features))])
    
  return pd.concat((basic_values,test_folds,hu_values))


def register_to_zernike_features(register,radious):
  """
    Compute Zernike Moments (Not Used)
    
    Parameters
    ----------
    register : Pandas Series
        Image Info 
   
    Returns
    -------
    zernike : Pandas Series 
        Series with Zernike Moments descriptors

  """

  img_name, Class, Name_str, (r,c), ((y1,x1),(y2,x2)), img = register_to_info(register)
  test_folds = register.filter(regex='Test_Fold')
    
  mask = get_mask(img,r,c)    
  zernike_features = get_zernike(mask,radious)  
    
    
    
  basic_values = pd.Series([Name_str,Class],["Name","Class"])
  zernike_values = pd.Series(zernike_features,["zernike"+str(i) for i in range(len(zernike_features))])
    
  return pd.concat((basic_values,test_folds,zernike_values))

# Pretrained models

In [None]:
'''
Import, load and configure the pretrained models
'''

from tensorflow.keras.models import clone_model
from tensorflow.keras.applications.inception_v3 import preprocess_input as inception_v3_preprocessor
from tensorflow.keras.applications.xception import preprocess_input as xception_preprocessor
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg16_preprocessor
from tensorflow.keras.applications.vgg19 import preprocess_input as vgg19_preprocessor
from tensorflow.keras.applications.resnet_v2 import preprocess_input as resnet_v2_preprocessor
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input as incept_res_v2_preprocessor
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input as mobilenet_preprocessor
from tensorflow.keras.applications.densenet import preprocess_input as densenet_preprocessor
from tensorflow.keras.applications.nasnet import preprocess_input as nasnet_preprocessor



from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image


from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras.applications import ResNet101V2
from tensorflow.keras.applications import ResNet152V2
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.applications import DenseNet169
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications import NASNetMobile
from tensorflow.keras.applications import NASNetLarge

from IPython.display import clear_output



# One entry per pretrained network:
# (key in repeat_dict, constructor, preprocessing function)
model_specs = [
  ("InceptionV3", InceptionV3, inception_v3_preprocessor),
  ("Xception", Xception, xception_preprocessor),
  ("VGG16", VGG16, vgg16_preprocessor),
  ("VGG19", VGG19, vgg19_preprocessor),
  ("ResNet50", ResNet50V2, resnet_v2_preprocessor),
  ("ResNet101", ResNet101V2, resnet_v2_preprocessor),
  ("ResNet152", ResNet152V2, resnet_v2_preprocessor),
  ("InceptionResNet", InceptionResNetV2, incept_res_v2_preprocessor),
  ("MobileNet", MobileNetV2, mobilenet_preprocessor),
  ("NASMobile", NASNetMobile, nasnet_preprocessor),
  ("NASLarge", NASNetLarge, nasnet_preprocessor),
  ("Dense121", DenseNet121, densenet_preprocessor),
  ("Dense169", DenseNet169, densenet_preprocessor),
  ("Dense201", DenseNet201, densenet_preprocessor),
]

models_dict = {}
for name, constructor, preprocessor in model_specs:
  model_dict = {}
  # Only load the networks whose features are going to be recalculated
  if repeat_dict[name]:
    print(f"Loading {name}")
    model = constructor(weights='imagenet')
    # Drop the classification layer: the output of the penultimate layer
    # is used as the feature vector
    model = Model(model.input, model.layers[-2].output)
    # Keep the model itself: clone_model() would rebuild the layers with
    # freshly initialized weights and lose the ImageNet weights
    model_dict["model"] = model
    model_dict["preprocesor"] = preprocessor
    model_dict["target_size"] = model.input_shape[1],model.input_shape[2]
  models_dict[name] = model_dict



clear_output()

In [None]:
import os
from skimage.io import imsave
import uuid

import matplotlib.pyplot as plt

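def extract_features(np_image,model_name):
  """
  Compute the deep features of an image with one of the pretrained models.
  The image is written to a temporary file so that it can be loaded
  and resized with the Keras image utilities.
  """
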
  model_dict = models_dict[model_name]
  model = model_dict["model"]
  preprocessor = model_dict["preprocesor"]
  target_size = model_dict["target_size"]

  #Create temp file
  tmp_path = str(uuid.uuid4())+".png"
  imsave(tmp_path,np_image)

  img = image.load_img(tmp_path, target_size=target_size)
  # Delete temp file
  os.remove(tmp_path)
  # 
  img_data = image.img_to_array(img)

  # for testing
  #plt.imshow(img_data/255.) #Remove or comment
  
  img_data = np.expand_dims(img_data, axis=0)
  img_data = preprocessor(img_data)

  

  features = model.predict(img_data)
  return features[0]

In [None]:
def register_to_deepfeatures(register,model_name):
    """
    Compute all CNN features    
    
    Parameters
    ----------
    register : Series
        Series containing metadata of the image
    model_name: str
        Name of the pretrained model
   
    Returns
    -------
    results : Series 
        A Pandas Series containing all of the features

    """
    img_name = register.Image
    Class = register.Class 
    test_folds = register.filter(regex='Test_Fold')   
    
    img_dir = imgs_path+os.sep+Class+os.sep    
    Name_str = Class+"_"+img_name    
    
    img_path = img_dir+img_name

        
    img = imread(img_path)

    crop = crop_image(img,register.Rectangle)

    
    basic_values = pd.Series([Name_str,Class],["Name","Class"])
    
    features = extract_features(crop,model_name)
    names = [f"{model_name}_{i}" for i in range(len(features))]

    

    results = pd.Series(features,names)

    return pd.concat((basic_values,test_folds,results))

# Pretrained models (Vision Transformers)



In [None]:
# Installing PyTorch Image Models
!pip install timm
clear_output()

In [None]:
import torch
import timm

# vit_large_patch16_224, image_size 224, features 1024
model_vt_large = timm.create_model('vit_large_patch16_224', pretrained=True)
model_vt_base = timm.create_model('vit_base_patch16_224', pretrained=True)

In [None]:
from skimage.transform import resize


def register_image_resized(register):
    """
    Resizes and reformats data to be compatible with TIMM

    """

    img_name = register.Image
    Class = register.Class 
    test_folds = register.filter(regex='Test_Fold')   
    
    img_dir = imgs_path+os.sep+Class+os.sep    
    Name_str = Class+"_"+img_name    
    
    img_path = img_dir+img_name

        
    img = imread(img_path)

    img = crop_image(img,register.Rectangle)
    img = resize(img,(224,224,3))

    ## reformat from (224,224,3) to (3,224,224) 
    img = np.concatenate((img[:,:,0],img[:,:,1],img[:,:,2])).reshape((3,224,224))
    #Tensor images with a float dtype are expected to have values in [0, 1)
    img = img.astype(np.float32)
    

    return img
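
The conversion from a channels-last image (H, W, C) to a channels-first array (C, H, W) can be written in more than one way. The sketch below uses a tiny made-up 2 x 2 image to check that stacking the three channel planes and reshaping gives the same array as a plain axis permutation.

```python
import numpy as np

img = np.arange(2 * 2 * 3, dtype=np.float32).reshape((2, 2, 3))  # tiny (H, W, C) image

# Option 1: stack the three channel planes vertically and reshape.
stacked = np.concatenate((img[:, :, 0], img[:, :, 1], img[:, :, 2])).reshape((3, 2, 2))

# Option 2: permute the axes directly.
transposed = np.transpose(img, (2, 0, 1))

print(np.array_equal(stacked, transposed))  # True
```

Note that a bare `reshape((3, 2, 2))` on the original array would not be equivalent, because it would mix values from different channels.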

# Check and generate new datasets
  

In [None]:
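try:
  import pickle5 as pickle  # backport, only needed on Python < 3.8
except ImportError:
  import pickle  # protocol 5 is built in from Python 3.8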

def check_files(datasets_file,results_file):
  """
    Check if the datasets and the results files have been generated.
    The files are looked up in the global complete_output_folder_path.
    
    Parameters
    ----------
    datasets_file: str
        Name of the datasets file
    results_file: str
        Name of the results file
   
    Returns
    -------
    None

  """
  datasets_file_path = complete_output_folder_path+datasets_file
  results_file_path = complete_output_folder_path+results_file

  # if the output folder does not exist, it is created
  # empty datasets and results files are also created
  if not os.path.isdir(complete_output_folder_path):
    os.makedirs(complete_output_folder_path)
    print(f"Created {complete_output_folder_path}")
    datasets_dict = dict()
    results_dict = dict()
    with open(datasets_file_path, 'wb') as handle:
      pickle.dump(datasets_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(results_file_path, 'wb') as handle:
      pickle.dump(results_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
  # if the output_folder exists, the datasets and results files are checked
  else:
    check_status(datasets_file_path, results_file_path)

  
def check_status(datasets_file_path, results_file_path):
  """
    Check the status of the datasets and results files.

    Parameters
    ----------
    datasets_file_path: str
        Full path of the datasets file
    results_file_path: str
        Full path of the results file

    Returns
    -------
    None

  """

  print("Checking")
  with open(datasets_file_path, 'rb') as handle:
      datasets_dict = pickle.load(handle)

      for key in datasets_dict:
        print(key,datasets_dict[key].shape)

  with open(results_file_path, 'rb') as handle:
      results_dict = pickle.load(handle)



check_files(datasets_file,results_file)
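
Note that check_files only creates the empty files when the output folder itself is missing; if the folder exists but one of the files does not, opening it for reading raises FileNotFoundError. A more defensive variant checks each file separately. This is a minimal sketch using only the standard library; `ensure_pickle` is a hypothetical helper, not part of the notebook, and a temporary folder stands in for the Drive folder:

```python
import os
import pickle
import tempfile


def ensure_pickle(path, default):
    """Create `path` holding `default` if it is missing, then load and return its content."""
    if not os.path.isfile(path):
        folder = os.path.dirname(path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        with open(path, "wb") as handle:
            pickle.dump(default, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output", "datasets.obj")
    first = ensure_pickle(demo_path, {})        # file is created with an empty dict
    with open(demo_path, "wb") as handle:
        pickle.dump({"info": [1, 2, 3]}, handle)
    second = ensure_pickle(demo_path, {})       # existing content is kept

print(first, second)
```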


In [None]:
def generate_datasets(type_dataset,df_info):
  """
    Generate datasets  
    
    Parameters
    ----------
    type_dataset: str
        Type of features to be computed
    df_info: DataFrame
        Dataframe with image metadata
   
    Returns
    -------
    Dataset : DataFrame

  """

  deep_features = ["InceptionV3", "Xception", "VGG16","VGG19","ResNet50",
               "ResNet101","ResNet152","InceptionResNet","MobileNet",
               "NASMobile","NASLarge","Dense121","Dense169","Dense201"]

  if type_dataset == "LBP":
    return df_info.apply(register_to_lbp_features,axis=1)
  elif type_dataset == "Morpho":
    return df_info.apply(register_to_morpho_features,axis=1)
  elif type_dataset == "EFD":
    return df_info.apply(register_to_efd_features,axis=1)
  elif type_dataset == "Hu":
    return df_info.apply(register_to_hu_features,axis=1)
  elif type_dataset == "Zernike5":
    return df_info.apply(lambda x: register_to_zernike_features(x, radious=5),axis=1)
  elif type_dataset == "Zernike10":
    return df_info.apply(lambda x: register_to_zernike_features(x, radious=10),axis=1)
  elif type_dataset == "pftas":
    return df_info.apply(register_to_pftas_features,axis=1)
  elif type_dataset == "Haralick":
    return df_info.apply(register_to_haralick_features,axis=1)

  elif type_dataset in deep_features:
      return df_info.apply(lambda x: register_to_deepfeatures(x, type_dataset),
                                        axis=1)



In [None]:

# TODO: improve comments
def generate_save_datasets(datasets_file,repeat_dict,test=False):
  """
    Generate datasets and serialize them to disk.

    The output folder is taken from the global variable
    complete_output_folder_path.

    Parameters
    ----------
    datasets_file: str
        Name of the datasets file
    repeat_dict: Dict
        Dictionary that defines which features are to be recalculated
    test: bool
        If True, only the first 10 images are processed and nothing is saved

    Returns
    -------
    None

  """

  dataset_types = list(repeat_dict.keys())

  datasets_file_path = complete_output_folder_path+datasets_file
  
  datasets_dict = None
  
  with open(datasets_file_path, 'rb') as handle:
      datasets_dict = pickle.load(handle)
    
  # Removes classes with too few examples
  classes_with_few_samples = ["Trilobate"]
  df_info = datasets_dict.get("info") 
  #df_info = None
  if df_info is None:
      df_info = create_image_info_df(csvs_path,
                                     ignore_classes=classes_with_few_samples)
      datasets_dict["info"] = df_info

      if not test:
        with open(datasets_file_path, 'wb') as handle:
            pickle.dump(datasets_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
            print(f"Saving info")
  
  if test:
    df_info = df_info.head(10)

  
  # Compute a feature type if it has not been computed yet or if it must be recomputed
  for data_type in dataset_types:
    if data_type not in datasets_dict or repeat_dict[data_type]:
      print(f"Computing {data_type}")
      datasets_dict[data_type] = generate_datasets(data_type,df_info)
      if not test:
        with open(datasets_file_path, 'wb') as handle:
          pickle.dump(datasets_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
          print(f"Saving {data_type}")
    elif data_type in datasets_dict and not repeat_dict[data_type]:
      print(f"{data_type} already computed")
     
    


In [None]:
generate_save_datasets(datasets_file,repeat_dict)
clear_output()
print("Done!")
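
The caching rule used by generate_save_datasets is: a feature type is computed when it is absent from the stored dictionary or when its flag in repeat_dict is True. A minimal sketch of that rule in isolation follows; `features_to_compute` and the toy dictionaries are illustrative assumptions, not notebook code:

```python
def features_to_compute(cached, repeat_dict):
    """Return the feature types missing from `cached` or flagged for recomputation."""
    return [name for name, repeat in repeat_dict.items()
            if name not in cached or repeat]


# Toy stand-ins: the values would be DataFrames in the notebook
cached = {"info": "df", "LBP": "df", "Hu": "df"}
repeat_dict = {"LBP": False, "Hu": True, "EFD": False}

pending = features_to_compute(cached, repeat_dict)
print(pending)
```

Here "LBP" is reused, "Hu" is recomputed because its flag is set, and "EFD" is computed because it was never stored.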

Adding Vision Transformer Features

This modification, added in July 2022, extracts the features computed by the last layer of a pre-trained Vision Transformer.

Two new datasets (df_large, df_base) are generated, which can be added to the dataset dictionary and used in the rest of the experiments.

In [None]:
df_info = None
with open(complete_output_folder_path+datasets_file, 'rb') as handle:
    datasets_dict = pickle.load(handle)
    df_info = datasets_dict.get("info")

In [None]:
'''
Extract Features using pretrained vision transformers
'''
larges = []
bases = []

for i in range(df_info.shape[0]):
  if i % 50==0 and not i ==0:
    print(i)
  imgs = []
  imgs.append(register_image_resized(df_info.iloc[i]))
  imgs = np.array(imgs)

  # Inference only: disable gradient tracking to save memory
  with torch.no_grad():
    print("(L)",end="")
    large_feats = model_vt_large.forward_features(torch.from_numpy(imgs))
    print("(B)",end=" ")
    base_feats = model_vt_base.forward_features(torch.from_numpy(imgs))

  large_feats = large_feats.cpu().detach().numpy()
  base_feats = base_feats.cpu().detach().numpy()

  larges.append(large_feats)
  bases.append(base_feats)

In [None]:
# convert into dataframe
df_base = pd.DataFrame(np.array(list(map(lambda x: x[0], bases))))
df_large = pd.DataFrame(np.array(list(map(lambda x: x[0], larges))))

In [None]:
# Adding metadata info
print("Creating dataframes (base)")
df_base.columns = [f"VT_base{i}" for i in range(df_base.shape[1])]

df_info0 = df_info.copy()
df_info0["Name"] = df_info0["Class"]+"_"+df_info0["Image"]
df_info1 = df_info0[["Name","Class"]]
df_info2 = df_info0.filter(regex='Test_Fold')

df_base = pd.concat((df_info1,df_info2,df_base),axis=1)

print("Creating dataframes (large)")
df_large.columns = [f"VT_large{i}" for i in range(df_large.shape[1])]
df_large = pd.concat((df_info1,df_info2,df_large),axis=1)
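
One point to keep in mind with pd.concat(..., axis=1) is that rows are matched by index label, not by position. The feature DataFrames built from arrays get a fresh 0..n-1 index, so the metadata must have the same index for the rows to line up. Whether df_info has such an index depends on create_image_info_df, which is not shown in this part of the notebook. The sketch below, with made-up data, shows the effect and a possible safeguard:

```python
import numpy as np
import pandas as pd

# Metadata whose index has gaps, as can happen after rows are filtered out
info = pd.DataFrame({"Name": ["a", "b", "c"], "Class": ["X", "Y", "X"]},
                    index=[0, 2, 5])
# Feature matrix built from arrays: receives a fresh 0..n-1 index
feats = pd.DataFrame(np.arange(6).reshape(3, 2),
                     columns=["VT_base0", "VT_base1"])

# Index labels differ, so rows are not paired and NaN values appear
misaligned = pd.concat((info, feats), axis=1)

# Resetting the metadata index pairs the rows by position
aligned = pd.concat((info.reset_index(drop=True), feats), axis=1)

print(misaligned.shape, aligned.shape)
```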

In [None]:
# Save to Drive.
# Once the datasets are saved, the previous cells do not need to be executed again
with open("df_base.obj", 'wb') as handle:
  pickle.dump(df_base, handle)
with open("df_large.obj", 'wb') as handle:
  pickle.dump(df_large, handle)

# Experiments

In [None]:
import pickle5 as pickle


# Load computed datasets
def load_datasets():
  datasets_file_path = complete_output_folder_path+datasets_file
  datasets_dict = None
  with open(datasets_file_path, 'rb') as handle:
      datasets_dict = pickle.load(handle)

  features = list(datasets_dict.keys())
  if "info" not in features:
    print("Metadata not calculated")
  else:
    features.remove("info")
    if len(features) == 0:
      print("No features have been computed")
    else:
      for clave in datasets_dict:
        print(clave,datasets_dict[clave].shape)
  return datasets_dict
    
  

data_dict = load_datasets()     

Adding ViT features to the data_dict

The rest of the code works with datasets stored in the dataset dictionary.

To work with the features extracted by the Vision Transformers, these datasets must first be added to that dictionary.

In [None]:
with open("df_base.obj", 'rb') as handle:
  df_base = pickle.load(handle)
with open("df_large.obj", 'rb') as handle:
  df_large = pickle.load(handle)

In [None]:
data_dict['vt_large']=df_large 
data_dict['vt_base']=df_base

# Combine datasets

In [None]:
'''
Create dataset concatenations
'''

def combine_datasets(df1,df2):
  columns1 = df1.columns
  columns2 = df2.columns
  att_columns1 = [c for c in columns1]
  att_columns2 = [c for c in columns2 if not ("Test_Fold" in c or c in ["Name","Class"])]

  df_comb = pd.concat((df1[att_columns1],df2[att_columns2]),axis=1)

  return df_comb


df_morpho_hu = combine_datasets(data_dict["Morpho"],data_dict["Hu"])
df_morpho_hu_EFD = combine_datasets(df_morpho_hu,data_dict["EFD"])
df_morpho_EFD = combine_datasets(data_dict["Morpho"],data_dict["EFD"])

df_nas_inc = combine_datasets(data_dict["NASLarge"],data_dict["InceptionResNet"])
df_nas_inc_dense = combine_datasets(df_nas_inc,data_dict["Dense169"])

df_morpho_nas = combine_datasets(data_dict["Morpho"],data_dict["NASLarge"])
df_morpho_pftas = combine_datasets(data_dict["Morpho"],data_dict["pftas"])

data_dict["Morpho/Hu"] = df_morpho_hu
data_dict["Morpho/EFD"] = df_morpho_EFD
data_dict["Morpho/Hu/EFD"] = df_morpho_hu_EFD

data_dict["NASLarge/InceptionResNet"] = df_nas_inc
data_dict["NASLarge/InceptionResNet/Dense169"] = df_nas_inc_dense

data_dict["Morpho/NASLarge"] = df_morpho_nas
data_dict["Morpho/pftas"] = df_morpho_pftas


df_morpho_hu_EFD_NAS = combine_datasets(data_dict["Morpho/Hu/EFD"],
                                        data_dict["NASLarge"])
df_morpho_hu_EFD_pftas = combine_datasets(data_dict["Morpho/Hu/EFD"],
                                        data_dict["pftas"])
data_dict["Morpho/Hu/EFD/NASLarge"] = df_morpho_hu_EFD_NAS
data_dict["Morpho/Hu/EFD/pftas"] = df_morpho_hu_EFD_pftas

# vtb/nas 
df_vtb_nas = combine_datasets(data_dict["vt_base"],data_dict["NASLarge"])
data_dict["vt_base/NASLarge"] = df_vtb_nas

# vtb/nas/morpho 
df_vtb_nas_morpho = combine_datasets(data_dict["vt_base/NASLarge"],data_dict["Morpho"])
data_dict["vt_base/NASLarge/Morpho"] = df_vtb_nas_morpho


# vtb/nas/inc
df_vtb_nas_inc = combine_datasets(data_dict["vt_base/NASLarge"],data_dict["InceptionResNet"])
data_dict["vt_base/NASLarge/InceptionResNet"] = df_vtb_nas_inc
# vtb/nas/inc/morpho/hu
df_vtb_nas_inc_morpho_hu = combine_datasets(data_dict["vt_base/NASLarge/InceptionResNet"],data_dict["Morpho/Hu"])
data_dict["vt_base/NASLarge/InceptionResNet/Morpho/Hu"] = df_vtb_nas_inc_morpho_hu


df_vtb_morpho = combine_datasets(data_dict["vt_base"],data_dict["Morpho"])
data_dict["vt_base/Morpho"] = df_vtb_morpho



df_vtb_morpho_hu = combine_datasets(data_dict["vt_base/Morpho"],data_dict["Hu"])
data_dict["vt_base/Morpho/Hu"] = df_vtb_morpho_hu
df_vtb_morpho_hu_efd = combine_datasets(data_dict["vt_base/Morpho/Hu"],data_dict["EFD"])
data_dict["vt_base/Morpho/Hu/EFD"] = df_vtb_morpho_hu_efd




datasets_file_path = complete_output_folder_path+datasets_file
  
with open(datasets_file_path, 'wb') as handle:
  pickle.dump(data_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print(f"Saving info")

In [None]:
# Additional combinations (July 2022)

df_vtbs = combine_datasets(data_dict["vt_large"],data_dict["vt_base"])
data_dict["vt_large/vt_base"] = df_vtbs

df_vts_morpho = combine_datasets(data_dict["vt_large/vt_base"],data_dict["Morpho"])
data_dict["vt_large/vt_base/Morpho"] = df_vts_morpho





df_vts_morpho_hu = combine_datasets(data_dict["vt_large/vt_base/Morpho"],data_dict["Hu"])
data_dict["vt_large/vt_base/Morpho/Hu"] = df_vts_morpho_hu
df_vts_morpho_hu_efd = combine_datasets(data_dict["vt_large/vt_base/Morpho/Hu"],data_dict["EFD"])
data_dict["vt_large/vt_base/Morpho/Hu/EFD"] = df_vts_morpho_hu_efd


datasets_file_path = complete_output_folder_path+datasets_file
  
with open(datasets_file_path, 'wb') as handle:
  pickle.dump(data_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print(f"Saving info")

In [None]:
'''
Load computed results 

All the predictions made are stored in a results dictionary.
'''
import pickle5 as pickle
import os
def load_results():
  results_file_path = complete_output_folder_path+results_file
  results_dict = None
  if os.path.isfile(results_file_path):
    with open(results_file_path, 'rb') as handle:
        results_dict = pickle.load(handle)

    results = list(results_dict.keys())
    for clave in results_dict:
        print(clave)
    return results_dict
  else:
    return {} 

res_dict = load_results()
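
Each entry of the results dictionary is keyed by a (classifier name, dataset name) pair and holds, for every repetition, a dictionary mapping each fold to a (y_true, y_pred) pair. As a minimal sketch of how such an entry could be summarized, the hypothetical helper below computes one accuracy per repetition; the toy labels are made up and are not experimental results:

```python
def accuracy_per_repetition(model_results):
    """Map each repetition to its accuracy, pooling the predictions of all its folds."""
    accuracies = {}
    for repetition, folds in model_results.items():
        hits = total = 0
        for y_true, y_pred in folds.values():
            hits += sum(t == p for t, p in zip(y_true, y_pred))
            total += len(y_true)
        accuracies[repetition] = hits / total
    return accuracies


# Toy results: 2 repetitions x 2 folds, 2 test samples per fold
toy_results = {
    ("SVM", "Morpho"): {
        0: {0: (["A", "B"], ["A", "B"]), 1: (["A", "B"], ["A", "A"])},
        1: {0: (["B", "B"], ["A", "B"]), 1: (["A", "A"], ["B", "B"])},
    }
}

acc = accuracy_per_repetition(toy_results[("SVM", "Morpho")])
print(acc)
```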

Classifiers: creation and parameterization

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
'''
Definition of the SVM parameter search
'''
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid_svm = dict(gamma=gamma_range, C=C_range)
nested_cv = 5

grid_svm = GridSearchCV(SVC(), param_grid=param_grid_svm, cv=nested_cv)

In [None]:
'''
Definition of the MLP parameter search
'''
alpha_range = np.logspace(-5, -1, 5)
hidden_layer_sizes_range=[(50,),(100,),(200,),(500,),(1000,)]

param_grid_mlp = dict(alpha=alpha_range, hidden_layer_sizes=hidden_layer_sizes_range)


grid_mlp = GridSearchCV(MLPClassifier(max_iter=1000,
                                      early_stopping=True), param_grid=param_grid_mlp, cv=nested_cv)
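
The size of these grids determines the cost of the experiments, because the grid search is repeated inside every outer fold. The following back-of-the-envelope count assumes the 10 x 10 outer cross validation described at the top of the notebook and scikit-learn's default behaviour of refitting the best candidate once on the full training set:

```python
# Candidate values, mirroring np.logspace(-2, 10, 13) and np.logspace(-9, 3, 13)
C_values = [10.0 ** e for e in range(-2, 11)]
gamma_values = [10.0 ** e for e in range(-9, 4)]
inner_folds = 5

svm_candidates = len(C_values) * len(gamma_values)   # 13 * 13 combinations
svm_fits = svm_candidates * inner_folds + 1          # plus the final refit

mlp_candidates = 5 * 5                               # 5 alphas x 5 layer sizes
mlp_fits = mlp_candidates * inner_folds + 1

outer_fits = 10 * 10                                 # 10 repetitions x 10 folds
total_svm_fits = svm_fits * outer_fits

print(svm_candidates, svm_fits, mlp_fits, total_svm_fits)
```

This count is per dataset, which is why the experiment loop saves results after every fold.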

In [None]:
def compute_predictions_per_fold_repetition(df, model,
                                        dict_all_results,dict_model_results,
                                        model_name,data_name):
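  '''
  Uses a dataset previously divided into several train and test partitions.
  For each repetition and fold, trains the model on the training part
  and obtains the predictions for the test part. Folds already present in
  dict_model_results are skipped, and results are saved after every fold.

  Parameters
  ----------
  df: DataFrame
      DataFrame containing features, classes and partitions
  model: scikit-learn estimator
      Model to be trained
  dict_all_results: dict
      Dictionary with all results, keyed by (model_name, data_name)
  dict_model_results: dict or None
      Partial results for this model and dataset, as
      {repetition: {fold: (y_true, y_pred)}}
  model_name: str
      Name of the classifier
  data_name: str
      Name of the dataset

  Returns
  -------
  None
      Results are stored in dict_all_results and serialized to disk.

  '''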
    
  results_file_path = complete_output_folder_path+results_file
  
  
  if dict_model_results is None:
    dict_model_results = {}
  
  columns = df.columns 
  att_columns = [c for c in columns if not ("Test_Fold" in c or c in ["Name","Class"])]
  repetitions = [int(c.replace('Test_Fold','')) for c in columns if "Test_Fold" in c]

  # TODO: refactor the code below
  for repetition in repetitions:

    partitions = np.sort(df[f"Test_Fold{repetition}"].unique())
    for partition in partitions:
      # NOTE: double-check this condition

      if not repetition in dict_model_results or not partition in dict_model_results[repetition]:
        ######################################################
        # Data split
        df_train = df[df[f"Test_Fold{repetition}"]!=partition]
        df_test = df[df[f"Test_Fold{repetition}"]==partition]

        # Keep only the feature columns
        X_train = df_train[att_columns].values
        y_train = df_train.Class.values

        X_test = df_test[att_columns].values
        y_test = df_test.Class.values


        model.fit(X_train,y_train)
        y_preds = model.predict(X_test)

        if dict_model_results.get(repetition) is None:
          dict_model_results[repetition] = {}

        dict_model_results[repetition][partition]=(y_test,y_preds)
        dict_all_results[model_name,data_name]=dict_model_results
        with open(results_file_path, 'wb') as handle:
              pickle.dump(dict_all_results, handle, protocol=pickle.HIGHEST_PROTOCOL)
              print(f"Sav R{repetition}F{partition}", end=' - ')
        ##########################################################
      else:
        print(f"Rec R{repetition}F{partition}", end=' - ')
    print(".")
  
  
  print('OK')
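
The resume logic of this function can be illustrated without scikit-learn. In the sketch below, a majority-class predictor stands in for the real model, and the rows, fold assignments and the name `evaluate_resumable` are illustrative assumptions; folds already present in the partial results are skipped, as in the notebook:

```python
from collections import Counter


def evaluate_resumable(rows, fold_ids, done):
    """rows: list of (feature, label); fold_ids: {repetition: test fold of each row}."""
    for repetition, folds in fold_ids.items():
        for fold in sorted(set(folds)):
            if fold in done.get(repetition, {}):
                continue  # already computed in a previous run
            train = [r for r, f in zip(rows, folds) if f != fold]
            test = [r for r, f in zip(rows, folds) if f == fold]
            # Stand-in "model": always predict the majority training class
            majority = Counter(label for _, label in train).most_common(1)[0][0]
            y_true = [label for _, label in test]
            y_pred = [majority] * len(test)
            done.setdefault(repetition, {})[fold] = (y_true, y_pred)
    return done


rows = [(0.1, "A"), (0.2, "A"), (0.3, "B"), (0.4, "A")]
fold_ids = {0: [0, 0, 1, 1], 1: [1, 0, 1, 0]}
partial = {0: {0: (["A", "A"], ["A", "A"])}}   # repetition 0, fold 0 already done
results = evaluate_resumable(rows, fold_ids, partial)
print(sorted(results[0]), sorted(results[1]))
```

Because the fold assignments are stored with the data, every classifier is trained and tested on exactly the same partitions.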



List of classifiers that will be evaluated with each and every one of the previously stored datasets.

In [None]:
cls_names = [
             "Nearest Neighbors",
             "SVM", 
             "MLP",
             "LogisticRegression",
             "Decision Tree", 
             "Random Forest",
             "Gradient Boosting Trees"             
             ]

classifiers = [
    make_pipeline(StandardScaler(), KNeighborsClassifier(3)),
    make_pipeline(StandardScaler(), grid_svm),
    make_pipeline(StandardScaler(), grid_mlp),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),    
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0, n_estimators=100),
    GradientBoostingClassifier(random_state=0, n_estimators=100)    
]


# res_dict = {} # To force the repetition of the experiments delete the results

Get the predictions by cross validation.

Execution of the experiments.

In [None]:
datasets = []

features = list(data_dict.keys())
features.remove("info")

for feature in features:
  datasets.append((feature,data_dict[feature]))


for dataset_name,dataset in datasets:
  print(dataset_name)
  for cls_name,cls in zip(cls_names,classifiers):
    print(cls_name)
    # get partial results from previous runs, if any
    results = res_dict.get((cls_name, dataset_name))
    # update the results and serialize them
    compute_predictions_per_fold_repetition(dataset, cls,res_dict,results,cls_name,dataset_name)
    
res_dict = load_results()

# Stacking.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split

'''
Helper functions to implement multi-view stacking
'''

from sklearn.base import TransformerMixin
class ColumnExtractor(TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

df1 = data_dict["Morpho/Hu"]

def get_num_atts(df):
  columns = df.columns 
  att_columns = [c for c in columns if not ("Test_Fold" in c or c in ["Name","Class"])]
  return len(att_columns)

def get_X(df):
  columns = df.columns 
  att_columns = [c for c in columns if not ("Test_Fold" in c or c in ["Name","Class"])]
  return df[att_columns].values

def get_y(df):
  return df["Class"].values


n_atts_morpho = get_num_atts(data_dict["Morpho"])
n_atts_hu = get_num_atts(data_dict["Hu"])
n_atts_efd = get_num_atts(data_dict["EFD"])
n_atts_nas = get_num_atts(data_dict["NASLarge"])
n_atts_inc = get_num_atts(data_dict["InceptionResNet"])
n_atts_dense = get_num_atts(data_dict["Dense169"])
n_atts_pftas = get_num_atts(data_dict["pftas"])
n_atts_vtbase = get_num_atts(data_dict["vt_base"])
n_atts_vtlarge = get_num_atts(data_dict["vt_large"])
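
In multi-view stacking each base classifier sees only the columns of one view, so the column ranges must follow the order in which the datasets were concatenated. Rather than writing the running sums by hand, the ranges can be derived from the list of view widths. This is a minimal sketch; `view_ranges` is a hypothetical helper and the widths are made-up numbers:

```python
from itertools import accumulate


def view_ranges(widths):
    """Return one range of column indices per view, given each view's number of columns."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [range(start, end) for start, end in zip(starts, ends)]


# Three views with 4, 7 and 3 feature columns, concatenated in that order
ranges = view_ranges([4, 7, 3])
print(ranges)
```

Each resulting range could be passed to ColumnExtractor in place of the manually summed bounds.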


In [None]:
def get_selected_cls():
  return GridSearchCV(SVC(), param_grid=param_grid_svm, cv=nested_cv)
  

def get_meta_cls():
  return LogisticRegression(max_iter=5000)
  

In [None]:
######################Morpho/Hu###################
estimators_morpho_hu = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_selected_cls())),
                         ('solo_hu', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_hu)),get_selected_cls()))
                         ]


###################Morpho/Hu/EFD######################
estimators_morpho_hu_efd = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_selected_cls())),
                         ('solo_hu', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_hu)),get_selected_cls())),
                        ('solo_efd', make_pipeline(ColumnExtractor(range(n_atts_morpho+n_atts_hu,
                                                                         n_atts_morpho+n_atts_hu+n_atts_efd)),get_selected_cls()))
                         ]


######################Morpho/pftas###################
estimators_morpho_pftas = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_selected_cls())),
                         ('solo_pftas', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_pftas)),get_selected_cls()))
                         ]


######################Morpho/NasLarge###################
estimators_morpho_nas = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_selected_cls())),
                         ('solo_nas', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_nas)),get_selected_cls()))
                         ]


######################NasLarge/InceptionResNet###################
estimators_nas_inc = [('solo_nas', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_nas)), get_selected_cls())),
                         ('solo_inc', make_pipeline(ColumnExtractor(range(n_atts_nas,
                                                                         n_atts_nas+n_atts_inc)),get_selected_cls()))
                         ]


######################NasLarge/InceptionResNet/Dense###################
estimators_nas_inc_dense = [('solo_nas', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_nas)), get_selected_cls())),
                         ('solo_inc', make_pipeline(ColumnExtractor(range(n_atts_nas,
                                                                         n_atts_nas+n_atts_inc)),get_selected_cls())),
                      ('solo_dense', make_pipeline(ColumnExtractor(range(n_atts_nas+n_atts_inc,
                                                                         n_atts_nas+n_atts_inc+n_atts_dense)),get_selected_cls()))
                         ]



######################Morpho/Hu###################
clf_morpho_hu_stack = StackingClassifier(estimators=estimators_morpho_hu, 
                                         final_estimator=get_meta_cls())

#####################Morpho/Hu/EFD####################
clf_morpho_hu_efd_stack = StackingClassifier(estimators=estimators_morpho_hu_efd, 
                                         final_estimator=get_meta_cls())

######################Morpho/pftas###################
clf_morpho_pftas_stack = StackingClassifier(estimators=estimators_morpho_pftas, 
                                         final_estimator=get_meta_cls())

######################Morpho/NasLarge###################
clf_morpho_nas_stack = StackingClassifier(estimators=estimators_morpho_nas, 
                                         final_estimator=get_meta_cls())

######################NasLarge/InceptionResNet###################
clf_nas_inc_stack = StackingClassifier(estimators=estimators_nas_inc, 
                                         final_estimator=get_meta_cls())

######################NasLarge/InceptionResNet/Dense###################
clf_nas_inc_dense_stack = StackingClassifier(estimators=estimators_nas_inc_dense, 
                                         final_estimator=get_meta_cls())




In [None]:
res_dict = load_results()

######################Morpho/Hu###################
# gets partial results from previous runs
results = res_dict.get(("(Morpho/Hu)Stack", "Morpho/Hu"))
# update and save results
print("(Morpho/Hu)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/Hu"], 
                                        clf_morpho_hu_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/Hu)Stack","Morpho/Hu")

In [None]:
######################Morpho/Hu/EFD###################
# gets partial results from previous runs
results = res_dict.get(("(Morpho/Hu/EFD)Stack", "Morpho/Hu/EFD"))
# update and save results
print("(Morpho/Hu/EFD)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/Hu/EFD"], 
                                        clf_morpho_hu_efd_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/Hu/EFD)Stack","Morpho/Hu/EFD")

######################Morpho/pftas###################
# gets partial results from previous runs
results = res_dict.get(("(Morpho/pftas)Stack", "Morpho/pftas"))
# update and save results
print("(Morpho/pftas)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/pftas"], 
                                        clf_morpho_pftas_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/pftas)Stack","Morpho/pftas")

######################Morpho/NasLarge###################
results = res_dict.get(("(Morpho/NASLarge)Stack", "Morpho/NASLarge"))
print("(Morpho/NASLarge)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/NASLarge"], 
                                        clf_morpho_nas_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/NASLarge)Stack","Morpho/NASLarge")

######################NasLarge/InceptionResNet###################
results = res_dict.get(("(NASLarge/InceptionResNet)Stack", "NASLarge/InceptionResNet"))
print("(NASLarge/InceptionResNet)Stack")
compute_predictions_per_fold_repetition(data_dict["NASLarge/InceptionResNet"], 
                                        clf_nas_inc_stack,
                                        res_dict,
                                        results,
                                        "(NASLarge/InceptionResNet)Stack","NASLarge/InceptionResNet")

######################NasLarge/InceptionResNet/Dense###################
results = res_dict.get(("(NASLarge/InceptionResNet/Dense169)Stack", "NASLarge/InceptionResNet/Dense169"))
print("(NASLarge/InceptionResNet/Dense169)Stack")
compute_predictions_per_fold_repetition(data_dict["NASLarge/InceptionResNet/Dense169"], 
                                        clf_nas_inc_dense_stack,
                                        res_dict,
                                        results,
                                        "(NASLarge/InceptionResNet/Dense169)Stack","NASLarge/InceptionResNet/Dense169")


In [None]:
###################Morpho/Hu/EFD/NAS######################
estimators_morpho_hu_efd_nas = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_SVM_cls())),
                         ('solo_hu', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_hu)),get_SVM_cls())),
                        ('solo_efd', make_pipeline(ColumnExtractor(range(n_atts_morpho+n_atts_hu,
                                                                         n_atts_morpho+n_atts_hu+n_atts_efd)),get_RF_cls())),
                        ('solo_nas', make_pipeline(ColumnExtractor(range(n_atts_morpho+n_atts_hu+n_atts_efd,
                                                                         n_atts_morpho+n_atts_hu+n_atts_efd+n_atts_nas)),get_Log_cls()))
                         ]




###################Morpho/Hu/EFD/pftas ######################
estimators_morpho_hu_efd_pftas = [('solo_morpho', make_pipeline(ColumnExtractor(range(0,
                                                                            n_atts_morpho)), get_SVM_cls())),
                         ('solo_hu', make_pipeline(ColumnExtractor(range(n_atts_morpho,
                                                                         n_atts_morpho+n_atts_hu)),get_SVM_cls())),
                        ('solo_efd', make_pipeline(ColumnExtractor(range(n_atts_morpho+n_atts_hu,
                                                                         n_atts_morpho+n_atts_hu+n_atts_efd)),get_RF_cls())),
                        ('solo_pftas', make_pipeline(ColumnExtractor(range(n_atts_morpho+n_atts_hu+n_atts_efd,
                                                                         n_atts_morpho+n_atts_hu+n_atts_efd+n_atts_pftas)),get_SVM_cls()))
                         ]



clf_morpho_hu_efd_nas_stack = StackingClassifier(estimators=estimators_morpho_hu_efd_nas, 
                                         final_estimator=get_Log_cls())

clf_morpho_hu_efd_pftas_stack = StackingClassifier(estimators=estimators_morpho_hu_efd_pftas, 
                                         final_estimator=get_Log_cls())


In [None]:
######################Morpho/Hu/EFD/NASLarge###################
results = res_dict.get(("(Morpho/Hu/EFD/NASLarge)Stack", "Morpho/Hu/EFD/NASLarge"))
print("(Morpho/Hu/EFD/NASLarge)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/Hu/EFD/NASLarge"], 
                                        clf_morpho_hu_efd_nas_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/Hu/EFD/NASLarge)Stack","Morpho/Hu/EFD/NASLarge")




results = res_dict.get(("(Morpho/Hu/EFD/pftas)Stack", "Morpho/Hu/EFD/pftas"))
print("(Morpho/Hu/EFD/pftas)Stack")
compute_predictions_per_fold_repetition(data_dict["Morpho/Hu/EFD/pftas"], 
                                        clf_morpho_hu_efd_pftas_stack,
                                        res_dict,
                                        results,
                                        "(Morpho/Hu/EFD/pftas)Stack","Morpho/Hu/EFD/pftas")


In [None]:
# July 2022 revision #######################################################
###########################################################################
#1) vt_base/Morpho           Log/SVM                     
#2) vt_base/Morpho/Hu        Log/SVM/SVM                     
#3) vt_base/Morpho/Hu/EFD    Log/SVM/SVM/RF                     
#4) vt_base/NASLarge         Log/Log
#5) vt_base/NASLarge/Morpho  Log/Log/SVM                     
estimators_vt_morpho = [('solo_vt', make_pipeline(ColumnExtractor(range(0,
                                                                        n_atts_vtbase)), 
                                                  get_Log_cls())),
                        ('solo_morpho', make_pipeline(ColumnExtractor(range(n_atts_vtbase,
                                                                            n_atts_vtbase+n_atts_morpho)), 
                                                      get_SVM_cls()))]

estimators_vt_morpho_hu = [('solo_vt', make_pipeline(ColumnExtractor(range(0,
                                                                        n_atts_vtbase)), 
                                                     get_Log_cls())),
                           ('solo_morpho', make_pipeline(ColumnExtractor(range(n_atts_vtbase,
                                                                            n_atts_vtbase+n_atts_morpho)), 
                                                      get_SVM_cls())),
                           ('solo_hu', make_pipeline(ColumnExtractor(range(n_atts_vtbase+n_atts_morpho,
                                                                            n_atts_vtbase+n_atts_morpho+n_atts_hu)), 
                                                      get_SVM_cls()))]
estimators_vt_morpho_hu_efd = [
    ('solo_vt', make_pipeline(
        ColumnExtractor(range(0, n_atts_vtbase)),
        get_Log_cls())),
    ('solo_morpho', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase,
                              n_atts_vtbase + n_atts_morpho)),
        get_SVM_cls())),
    ('solo_hu', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase + n_atts_morpho,
                              n_atts_vtbase + n_atts_morpho + n_atts_hu)),
        get_SVM_cls())),
    ('solo_efd', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase + n_atts_morpho + n_atts_hu,
                              n_atts_vtbase + n_atts_morpho + n_atts_hu + n_atts_efd)),
        get_RF_cls())),
]

estimators_vt_nas = [
    ('solo_vt', make_pipeline(
        ColumnExtractor(range(0, n_atts_vtbase)),
        get_Log_cls())),
    ('solo_nas', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase,
                              n_atts_vtbase + n_atts_nas)),
        get_Log_cls())),
]
estimators_vt_nas_morpho = [
    ('solo_vt', make_pipeline(
        ColumnExtractor(range(0, n_atts_vtbase)),
        get_Log_cls())),
    ('solo_nas', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase,
                              n_atts_vtbase + n_atts_nas)),
        get_Log_cls())),
    ('solo_morpho', make_pipeline(
        ColumnExtractor(range(n_atts_vtbase + n_atts_nas,
                              n_atts_vtbase + n_atts_nas + n_atts_morpho)),
        get_SVM_cls())),
]

# One stacking classifier per configuration; the final estimator combines the base predictions.
clf_vt_morpho_stack = StackingClassifier(estimators=estimators_vt_morpho,
                                         final_estimator=get_Log_cls())
clf_vt_morpho_hu_stack = StackingClassifier(estimators=estimators_vt_morpho_hu,
                                            final_estimator=get_Log_cls())
clf_vt_morpho_hu_efd_stack = StackingClassifier(estimators=estimators_vt_morpho_hu_efd,
                                                final_estimator=get_Log_cls())
clf_vt_nas_stack = StackingClassifier(estimators=estimators_vt_nas,
                                      final_estimator=get_Log_cls())
clf_vt_nas_morpho_stack = StackingClassifier(estimators=estimators_vt_nas_morpho,
                                             final_estimator=get_Log_cls())

res_dict = load_results()


######################vt_base/Morpho###################
results = res_dict.get(("(vt_base/Morpho)Stack", "vt_base/Morpho"))
print("(vt_base/Morpho)Stack")
compute_predictions_per_fold_repetition(data_dict["vt_base/Morpho"], 
                                        clf_vt_morpho_stack,
                                        res_dict,
                                        results,
                                        "(vt_base/Morpho)Stack","vt_base/Morpho")

######################vt_base/Morpho/Hu###################
results = res_dict.get(("(vt_base/Morpho/Hu)Stack", "vt_base/Morpho/Hu"))
print("(vt_base/Morpho/Hu)Stack")
compute_predictions_per_fold_repetition(data_dict["vt_base/Morpho/Hu"], 
                                        clf_vt_morpho_hu_stack,
                                        res_dict,
                                        results,
                                        "(vt_base/Morpho/Hu)Stack","vt_base/Morpho/Hu")

######################vt_base/Morpho/Hu/EFD###################
results = res_dict.get(("(vt_base/Morpho/Hu/EFD)Stack", "vt_base/Morpho/Hu/EFD"))
print("(vt_base/Morpho/Hu/EFD)Stack")
compute_predictions_per_fold_repetition(data_dict["vt_base/Morpho/Hu/EFD"], 
                                        clf_vt_morpho_hu_efd_stack,
                                        res_dict,
                                        results,
                                        "(vt_base/Morpho/Hu/EFD)Stack","vt_base/Morpho/Hu/EFD")

######################vt_base/NASLarge###################
results = res_dict.get(("(vt_base/NASLarge)Stack", "vt_base/NASLarge"))
print("(vt_base/NASLarge)Stack")
compute_predictions_per_fold_repetition(data_dict["vt_base/NASLarge"], 
                                        clf_vt_nas_stack,
                                        res_dict,
                                        results,
                                        "(vt_base/NASLarge)Stack","vt_base/NASLarge")

######################vt_base/NASLarge/Morpho###################
results = res_dict.get(("(vt_base/NASLarge/Morpho)Stack", "vt_base/NASLarge/Morpho"))
print("(vt_base/NASLarge/Morpho)Stack")
compute_predictions_per_fold_repetition(data_dict["vt_base/NASLarge/Morpho"], 
                                        clf_vt_nas_morpho_stack,
                                        res_dict,
                                        results,
                                        "(vt_base/NASLarge/Morpho)Stack","vt_base/NASLarge/Morpho")
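
# Sketch only, not part of the original experiment.
# The estimator lists in this cell spell out each column range by hand
# (0..n_atts_vtbase, n_atts_vtbase..n_atts_vtbase+n_atts_morpho, ...).
# Assuming the feature blocks are concatenated left to right in the order
# given in the dataset name, those ranges could be derived from the block
# sizes. `block_ranges` is a hypothetical helper and the sizes in the
# example below are made-up numbers, not the real attribute counts.
def block_ranges(block_sizes):
    """Return {block name: range of column indices} for consecutive blocks.

    block_sizes: ordered list of (name, n_columns) pairs, in the same order
    in which the blocks appear in the feature matrix.
    """
    ranges = {}
    start = 0
    for name, n_columns in block_sizes:
        ranges[name] = range(start, start + n_columns)
        start += n_columns  # the next block begins where this one ends
    return ranges


# Usage with illustrative sizes (in this notebook one would pass e.g.
# [("vt", n_atts_vtbase), ("morpho", n_atts_morpho), ("hu", n_atts_hu)]).
example_ranges = block_ranges([("vt", 4), ("morpho", 3), ("hu", 7)])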